CAL_HDL

7/30/2019 CAL_HDL

1/15

Low Power ASIC Implementation froma CAL Dataflow Description

Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues

Department of Electrical and Information Technology, Lund University

Box 118, SE-221 00 Lund, Sweden

Thomas Olsson and Anders Carlsson

Ericsson Research, Lund, Sweden

Abstract

This paper presents a flow for low power hardware generation, based on a CAL actor language.

CAL is a dataflow language which provides a higher level of abstraction and generate both hard-

ware and software description. A dataflow language is appropriate for signal processing systems

since algorithms are typically specified in dataflow graphs, using the same method for specifi-

cation and high level implementation offers rapid prototyping. Also the block partitioning ca-

pability of the CAL language makes it ideal for hardware-software co-design and programming

reconfigurable processor arrays.

The original CAL flow, is targeted for hardware-software co-design of complex systems on

FPGA, this is modified to facilitate low power ASIC implementations. In case of ASIC the

partitioning capability allows for implementing different clock domains, and by introducing a

token based clock gating to each processing block further reduces power consumption. As a

case study to evaluate the methodology and optimizations incorporated in the flow, an Orthogo-

nal Frequency-Division Multiplexing (OFDM) multi-standard channel estimator is implemented.

Hardware-Software co-design and Globally Asynchronous Locally Synchronous (GALS) design

at a higher level of abstraction provides more freedom for design-space exploration and reduced

design time.

Keywords: CAL Dataflow Language, High-Level Synthesis, Hw-Sw Co-Design, Design

Partition, GALS, Token based Clock Gating, Low Power ASIC.

1. Introduction

There is an increase in complexity of signal processing systems, driven by ever increasing

demand for faster devices with more features. The implementation of complex designs require

6

Email addresses: [email protected], [email protected],

[email protected] (Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues ),

[email protected],[email protected] (Thomas Olsson and Anders Carlsson )

Preprint submitted to Embedded Hardware Design (Microprocessors and Microsystems) March 26, 2012

7/30/2019 CAL_HDL

2/15

hardware platforms with multiple processors, accelerators, peripherals and reconfigurable arrays.This kind of hardware platforms require very detailed cycle accurate description code. Typically

used are Hardware Description Languages(HDL) like Verilog and VHDL. Register Transfer

Level (RTL) implementation of complex algorithms and their reference design tend to be time

consuming tasks.

Implementation of hardware at a higher abstraction level requires new design flow and tools.

A dataflow language is appropriate for signal processing systems since algorithms are typically

specified in dataflow graphs. Using the same method for specification and implementation offers

relative easiness due to rapid prototyping. CAL is a dataflow oriented language that was specified

and developed as part of the Ptolemy project at the University of California, Berkeley [1]. The

CAL dataflow language is extensively described in CAL language reference manual [2]. The

CAL language gives a high level of abstraction and is able to generate a synthesizable hardware

and software description. However, the current version of CAL to RTL generator (OpenDF)

introduces redundant logic, which increases area and power cost. Therefore, in this study the

RTL mapping efficiency from a CAL dataflow description was increased, and evaluated by a

case study.

The block partitioning capability of the CAL language may be used to efficiently implement

Globally Asynchronous Locally Synchronous(GALS) designs, which has the major advantage

that a traditional synchronous design flow may be applied. Furthermore, power consumption of

the design needs to be addressed to increase the battery life. As part of the study of hardware im-

plementation in CAL, a clock gating scheme based on the activity of a network is implemented.

Several modifications on the CAL to RTL tool were performed to support these features.

As a case study, an Orthogonal Frequency-Division Multiplexing (OFDM) multi-standard

MMSE (minimum mean squared error) channel estimator is implemented in CAL to evaluate

the methodology and optimizations incorporated in the flow. The channel estimator was synthe-

sized in 65 nm CMOS technology. A GALS architecture was realized by dividing the design intodifferent clock domains. A low power clock gating scheme was included in the implementation

and an analysis on hardware parameters were performed.

The remaining part of this paper is organized into the following sections. In Sec. 2 a brief

introduction of the CAL dataflow language is presented, and Sec. 3 addresses the optimizations

on the CAL flow. Sec. 4 presents GALS design and clock gating technique incorporated into the

tool. In Sec. 5 hardware implementation of the channel estimator in CAL is described, and the

various results obtained are discussed in Sec. 5. Finally conclusions are drawn in Sec. 6.

2. Background of CAL Dataflow Language

A dataflow model of an algorithm consists of nodes and communication arcs. The nodes rep-

resent combinational logic, and the communication arcs are used to transfer data tokens betweennodes. A variety of such models exist, which have different trade-offs between expressiveness

and ability for analysis. Of particular interest are the synchronous dataflow networks, which are

applied in several academic modelling tools to represent streaming applications [3]. Synchronous

dataflow networks are constrained, which leads to an efficient synthesized code [4]. The advan-

tage of a dataflow model is that it is possible to have a one-to-one mapping between nodes or

computational units in hardware. The nodes act asynchronous to each other, and communication

arcs are used to transfer data tokens with insignificant control mechanism costs. In [5] it is shown

that dataflow models offer a representation that may effectively support the parallelization of the

design for higher performance, which is required for a lot of applications in wireless systems.

2

7/30/2019 CAL_HDL

3/15

N1

N2

N3

A

B

C

Figure 1: Dataflow Graph.

2.1. CAL Programming Model

In Fig. 1, a dataflow graph with three nodes (N1, N2, N3) and three communication arcs (A, B,

C) is shown. In a CAL implementation the nodes are the actors which represents computation-

al/logical tasks. Communication arcs are the buffers or FIFOs through which data tokens are

transferred between actors.

The CAL actors are isolated computational units which consist of input/output(IO) ports,

actions, state variables, and parameters. The state of an actor is not shareable with other actors,

and interaction between actors is accomplished through IO ports based on data tokens. An action

defines computational/logical operations performed on the data tokens based on the actor states.

When an action is fired, it may consume and/or produce data tokens. Afterwards, state of the

actor is modified and an output data token is produced.

2.2. RTL Generation from CAL

The CAL flow provides a high level of abstraction. The design cycle offers a wide range of

design space exploration and optimization techniques. A CAL program may be compiled to both

hardware and software. A software implementation is realized by translating CAL to C program-

ming language, and hardware is realized by HDL. This ability to perform both hardware-software

CAL Model

OpenDF simulations

A

B

C

Action

state

Example merger actor

actor merge () in1,in2 ==> out

A : action in1 : [a] ==> [a]

B : action in2 : [a] ==> [a]

selector (AB) *

end

end

Software Generation

C Code

Generated by

CAL2C

main() {

int i = 0;

HDL Code

Generated by

CAL2HDL

process()

begin

Hardware Generation

Figure 2: CAL Frame Work.

3

7/30/2019 CAL_HDL

4/15

n ut

fb_infb_out

A

B Add

FB

Figure 3: Simple Feedback Network.

1

2 A c t o r d e s c r i p t i o n

3

4 / / A c t o r f o r s i m p le F ee db a ck ( d e l a y )

5 a c t o r FB ( i n i t ) f b i n ==> f b o u t

6 a c t i o n [ f b i n ] ==> [ f b o u t ]

7 e nd

8

9 / / A c to r f o r A d d it i on , t o k en s a t p o r t A a nd B i s a dd ed .

10 a c t o r a dd ( ) A , B ==> Sum

11 a ct io n [ a ] , [ b ] ==> [ a+b ] en d

12 e n d

13

14 Netwo rk d e s c r i p t i o n

15

16 / / T op L e v e l N e tw o r k

17 / / I n t e g r a t i n g A c to r s

18 n et wo r k f b a d d ( ) I n ==> Ou t

19 e n t i t i e s

20 f b 0 = FB( i n i t = [ 0 ] ) ;21 add = a d d ( ) ;

22 s t r u c t u r e

23 I n > add .A;

24 f b 0 . f b o u t > a d d . B ;

25 add . sum > f b 0 . f b i n ;

26 add . sum > Out ;

27 e n d

Listing 1: Example Feedback Network using CAL

synthesis enables the development of a unified platform for hardware-software co-design of com-

plex systems, like embedded systems consisting of processors and hardware accelerators.

A complete framework called Open Dataflow, supports CAL network simulation and genera-

tion of hardware-software code, see Fig. 2. This capability of CAL to support hardware software

co-design enables a common tool, architectural definition and specification for both platforms

[6], simplifies the design of complex systems. Details of the translation of CAL to HDL or C

are described in [7]. These tools are open source and in this paper original version refers to the

version available at [8].

2.3. CAL Network Implementation

The topology of actors connected to each other is referred as a network of actors, a simple

network of actors implementation in CAL is shown in Fig. 3, and CAL description is shown in

4

7/30/2019 CAL_HDL

5/15

Communication

Arcs

Actors

Scheduler

Action States

Separate local

scheduler for actions

encapsulated states

FIFO FIFO

FIFOFIFO FIFO

Action

Figure 4: Conceptional illustration of an Actor Network.

List. 1. In hardware, each actor is an independent entity and the communication between theactors is based on handshake protocols (4-phase bundled-data protocol). Each communication

arcs in the CAL network is implemented as a FIFO with a handshakeprotocol wrapper, see Fig. 4.

Furthermore, an actor also facilitates handshake protocol to consume and produce tokens. If two

connected actors belong to different clock domains, an asynchronous FIFO implementation is

selected.

2.4. Modifications to existing Framework

A brief description of CAL dataflow language was presented in previous subsections. In Fig-

ure 5, the overall existing CAL framework/tools is shown. The support for ASIC Implementation

was added to this framework by modifications/

optimization of the tool, described in next section.

CAL Description

*.cal

CAL Front End Toolsopendf, orcc

Open Source Tools

Intermediate

Representation

Top Level Files(network

description *.nl)

XLIM Backend Files

( input to other possible

translator tools )

Software

Implementation

C Files

*.c , *.h

C++ Files

*.cpp

Java Files

*.class, *.jar

Hardware

Implementation

FPGA Implementation

using xilinx libraries

*.vhd , *.v

ASIC Implementation

*.vhd , *.v

Support For ASIC Implementation :

* Remove all xilinx li brary dependecies

* Infer Block memories (pick from library)

* Optimizations to reduce Area Cost.

* Partitioning into clock domains (GALS support)

* Automatically Generate clock gating logic (Low Power )

Figure 5: Included ASIC features to existing CAL framework.

5

7/30/2019 CAL_HDL

6/15

Memory Unit

Block

Memory

Clocked

Registers

Scheduler

Reset Sync

Logic

Kicker Circuit

Handshake for

input token

Handshake for

output token

reset clk

Actions

Mathematical &

logical operations

Actions

Actions

Figure 6: Generated Actor Implementation.

3. ASIC Implementation from CAL

The modification performed on the tools to support ASIC implementation is divided into two

parts, first the reduction of the hardware area cost was taken into account. Various modifications

are performed on the tool/flow to reduce area of the hardware implementation from CAL. The

second part involved incorporating existing low power methods (GALS, clock gating) into the

flow to enable low power ASIC implementation from CAL.

3.1. CAL generated hardware - Area Optimizations

A CAL actor consists of various units based on the computational and logical tasks. Fig. 6

shows a generalized actor implementation in hardware. The various sub-modules in the actor are

explained as below.

The Reset Synchronizer Logic is used to synchronize the reset with the actor clock.

The Kicker Circuit generates a pulse based on the synchronized reset signal. The pulse

generated is used by the scheduler logic to begin the protocol handshake mechanism and

action scheduling.

The Memory Unit contains state, global variables and constants. The global variables and

constants are used by the action units for computational purpose. The state is used by the

scheduler for firing sequences of actions in an actor.

The Action Units are computational or logical units of an actor.

The Scheduler Unithandles the token handshake protocol and firing of action units.

6

7/30/2019 CAL_HDL

7/15

3.1.1. Removal of Redundant LogicThe actor hardware generated by the original version of the tool assumes that each and every

actor is a separate asynchronous block. Hence a synchronous reset logic and a corresponding

kicker (pulse gen) circuit is implemented for every actor. This logic is redundant, since for a

single clock domain the resets may be synchronized once, and routed to all the actors of that

clock domain consequently. The CAL tool was optimized to generate only one reset logic and

kicker circuit for a clock domain.

3.1.2. Infer ASIC memory

The memory unit in the actor hardware implementation consists of registers which hold the

state variables of the controller. If the actor contains an array of variables (list) of length greater

than 128, a RAM behavioral model is inferred. Some actors may require a list of constants whichare strictly read-only elements. CAL language has a provision to declare a list as read-only.

However, in the original version of the tool, a ROM is implemented as a RAM with initialized

constant values. In an FPGA, RAM and ROM are automatically inferred by the synthesis tool.

However, in an ASIC flow, memories needed to be inferred by manual integration. Consequently,

modifications to generate appropriate behavioral models of memories based on the FPGA or

ASIC flow were incorporated in the tool.

3.1.3. FIFO Optimization

In a CAL network, actors communicate by transfer of data tokens, this communication is

implemented using FIFO and handshake protocol resulting in a communication cost between

actors. The depth of each FIFO is constrained in the CAL network description. Optimizationsare applied to the FIFO implementation to minimize the communication cost. Based on the FIFO

depth, the implementation is either a memory or register array. For a FIFO depth of 2 or less, the

controller and handshake protocol are designed as glue logic.

To further reduce the communication overhead modification are done in the flow to support

merging of actors, this is performed by removing the FIFO (registers) between actors and glue

logic handles the handshake protocol between actors. This is done by specifying fifosize as null,

as shown in Fig. 7 and the corresponding pseudo network file List. 2. An implementation by

merging smaller actors makes a design more compact, however the merging of larger actors may

increase the critical paths of the design.

A1

A2

B1 C1 D1

10

10

5HM

actors merged

only handshake mechanism

(no registers)

Figure 7: FIFO optimization by merging Actors.

7

7/30/2019 CAL_HDL

8/15

12 Netwo rk d e s c r i p t i o n

3

4 / / T op L e v e l N e tw o r k

5 n et wo r k t o p ( ) I n ==> Ou t

6 e n t i t i e s

7 / / a c t o r d e c l a r a t i o n

8 A1 = A ( ) ;

9 A2 = A ( ) ;

10 B1 = B ( )

11 C1 = C ( ) ;

12 D1 = D ( )


14 / / a c t o r c o n n e c t i v i t y a lo n g w i th FIFO s i z e s .

15 . . . .

16 A1 . o u t p u t > B1 . i n p u t 1 { f i f o s i z e = 10 } ;

17 A2 . o u t p u t > B1 . i n p u t 2 { f i f o s i z e = 10 } ;

18 B1 . s u m o u t > C 1 . i n { f i f o s i z e = 5 } ;

19 / / s p e c i f y i n g n u l l a c t s l i k e m er gi ng C1 a nd D1 a c t o rs

20 C1 . o u t > D1 . i n { f i f o s i z e = N u l l } ;

21 . . . .

22 e n d

Listing 2: Example Feedback Network using CAL

3.2. Low Power ASIC Support

GALS designs are very suitable for low power hardware implementations. Typically, in GALS

based designs, a large system is divided into smaller synchronous blocks (or clock domains). The

inherent independent nature of these smaller blocks offer the possibilities to implement variousstandard low power techniques like clock gating, power gating, dynamic voltage and frequency

scaling [9].

3.2.1. Clock Domain Partitioning

The number of transistors that fit on a single die increases and the feature size decreases with

improvements in silicon fabrication technology. The clock generation and distribution becomes

increasingly difficult with large designs. The clock load increases with higher level of integra-

tion and larger dies. This increase requires more clock buffers and hence increases the clock

distribution latency. This in turn makes it more difficult to design a global-clock network that

may control all the blocks in the design. Furthermore, as the clock frequency increases, there is

more cross coupling in long wires which increases the clock jitter. The clock network occupies

significant portion of the design area and the power consumption may lead to 35% of the totalpower consumption [10].

GALS design provides a promising solution which eliminates the need of synchronous low

skew global clock network. The main advantage of GALS design is that the design may be di-

vided into smaller clock domains and there may be arbitrary clock skew between clock domains.

The clock domains are independent synchronous blocks and use synchronization circuits for inter

domain communication.

The signal processing systems have a dataflow realization and may be easily mapped into such

hardware structures. The capability of the CAL language to partition the design into smaller

blocks may be used to efficiently implement GALS design. There are various implementation of

8

7/30/2019 CAL_HDL

9/15

GALS design [11], the CAL flow used implements GALS design using a FIFO based handshakemechanism. It interesting to note that the hardware implementation of handshake mechanism

for data tokens and FIFOs are not part of the CAL language, since CAL only gives a high level

abstraction of the dataflow algorithm. Hence there is more flexibility for the end user to tailor

these mechanism based on application.

A CAL network divided into clock domains is shown in Fig. 8 along with a pseudo code of the

network description in List. 3. The hardware partitioning is done using the keyword clkdomain,

similarly in case of software implementation the partition is done by specifying processorId

along with the actor declaration.

The partitioning into clock domains is straight forward (by using clkdomain keyword), in the-

ory the maximum number of clock domains is equal to the number of actors in a network. This

however would be an unrealistic implementation since asynchronous FIFOs used for communi-

cation between clock domains are expensive compared to the synchronous FIFOs.

3.2.2. Token-Based Clock Gating

Power consumption is becoming an increasingly important metric in large hardware platforms.

Clock gating is a well known method to decrease the dynamic power by reducing the number of

transitions in registers. GALS divides the design into smaller blocks and clock gating schemes

are applied to these blocks. The inherent advantages of clock gating in GALS design are dis-

cussed abundantly in literature [12]. For ASIC implementation using CAL language new key-

words like powerdomain, clkgating are introduced for low power implementation.

Sync FIFO

Clock

Manager &

Reset Logic

Clk2

Global

Reset

Kicker

Pulse

Domain

resetClk2

Kicker

Pulse

Clock

Manager &

Reset Logic

Clk1

Global

Reset Domain

resetClk1

Async FIFO

Async FIFO

Clk3

Clk3

Sync FIFO Sync FIFO

Sync FIFO Sync FIFO

Sync FIFO

Sync FIFO

Sync FIFO

Kicker

Pulse

Domain

reset

A1 B1

A2 B2

A3

B3

C4

C3 D3

Figure 8: Clock Gating Scheme.

9

7/30/2019 CAL_HDL

10/15

12 P s e u d o Networ k D es c ri p ti on

3

4 \\ Top l e v e l d e s c r i p t i o n

5 n e tw or k t o p g a l s ( ) ==>

6

7 e n t i t i e s

8 \\ e n t i t y d e c l a r a t i o n c lo ck do main c l k 1

9 \\ s i m i l a r t o v h d l e n t i t y d e c l ar a t i o n s

10 A1 = a c t o r A ( ) ; { c l k d o m a i n = c l k 1 } ;

11 B1 = a c t o r B ( ) ; { c l k d o m a i n = c l k 1 } ;

12 . . . . . . .

13 . . . . . . .


15 . . . . . . .


17 D3 = a c t o r D ( ) ; { c l k d o m a i n = c l k 3 } ;

18 . . . . . . .

19


21 \\ c o n n ec t i ng d i f f e r e n t a c t o r s

22 \\ s i m i l a r t o p o rt map c o nn e ct i on i s v hd l

23 A1 . o u t > B 1 . i n { f i f o s i z e = 4 } ;

24 B1 . o u t > A3 . i n { f i f o s i z e = 1 0 } ;

25 . . . . .

26 . . . . . .

27 B2 . o u t > B 3 . i n { f i f o s i z e = 5 } ;

28 B3 . o u t > C 3 . i n { f i f o s i z e = 1 } ;

29 e n d

Listing 3: Psuedo Code example for partitioning design

A clock gating scheme based on the data token activity is depicted in Fig. 8. The clocks to the

actors are not required when no data token needs to be processed. The availability of data tokens

is detected at the synchronous FIFO, and the arrival of a data token to the domain is detected at

the asynchronous FIFO. Based on the availability/arrival of a data token, the clock signal to the

domain is gated.

If a data token exists or arrives in a domain, the clock domain is set in active state. When the

domain is inactive the clock to the domain is gated. This is managed by the clock manager, see

Fig. 8. The arrival/availability of data token in case of a FIFO based implementation is detectedby the fifo empty signal. It takes 3 clock cycles for a token to be consumed by the next actor from

a FIFO, hence there is no latency between token detection in a FIFO and clock activation.

Consequently, token based clock gating does not effect the behaviour (functionality) of the

design. The token based clock gating scheme has been incorporated as part of the CAL hardware

generation. There are features included to disable/enable clock gating to domains. Based on

this a clock manager with appropriate state machines and clock gating logic is generated. This

automization which divides the design into different clock domains with inbuilt clock gating

feature, makes hardware implementation with CAL dataflow language even more interesting.

10

7/30/2019 CAL_HDL

11/15

Controller

ROM

Pilot

Extraction

MMSE

moduleLS

ModuleLS

RAM

Pilot

ROMStart MMSE

Extracted

Pilots

WiFi data

out (11:0)

Input data

out (11:0) LTE/DVB-H

data out (11:0)

Expected

Pilots

done valid busy

clk

rststart

LS

Estimates

Pilot_loc(2:0)

Op_mode(1:0)

Figure 9: MMSE Channel Estimator.

4. Case Study: Channel Estimator

An OFDM based multi-standard channel estimator is implemented as a case study for a low

power ASIC hardware implementation with the CAL dataflow language. The channel estimator

was chosen as case study since the algorithm is of moderate complexity and requires significant

hardware.

The implemented channel estimator is reconfigurable to concurrently support various stan-

dards like 3GPP LTE, IEEE 802.11n and DVB-H. A Robust MMSE algorithm is employed forthe channel estimator. Details about the multi-standard environment for channel estimation is

described extensively in [13]. The algorithm approximations and hardware mapping (data width,

MMSE matrix coefficients) chosen for CAL hardware implementation is same as the reference

design [14].

The hardware architecture of the channel estimator, see Fig. 9, is divided into several blocks

as described below.

LS module - Least square estimation module consists of a complex multiplier. The inputs

to the multiplier are pilot data and the inverse of the expected pilot values stored in pilot

ROM. The output from the complex multiplier is stored in LS RAM for use with MMSE

module.

Controller module - This module is the main controller of channel estimator which takes

care of pilot separation, Least Square Estimation, and the memory operations.

MMSE module - This module consists of a matrix multiplier that is implemented with

12 Multiply-accumulate (MAC) units. The appropriate matrix inputs are sent serially from

the LS RAM and MMSE ROM.

Memories - The channel estimator consists of 3 memory units. Pilot ROM is implemented

with 2 ROMs (1300x12). MMSE ROM stores the coefficients for the MMSE algorithm

which is implemented as 2 ROMs (200x120). LS RAM stores the output from LS module

for further processing by MMSE module. It is implemented as 2 RAMs (334x12).

11

7/30/2019 CAL_HDL

12/15

LS Memory

Unit

Pilot

Extraction

Least

Estimator

Pilot

ROM

12 MAC

Units

MMSE

SchedulerPISO

Clock Domain (clk3)

Clock Domain (clk2)Clock Domain (clk1)

Figure 10: Dataflow description of MMSE Estimator.

The inputs to the channel estimator module is of real and imaginary data of 12-bits, a 3-bit

input which shows the location of the pilot, a 2-bit input which shows the type of data and a start

signal. The outputs from the channel estimator are 12-bit real and imaginary data, a validsignal

and a busy signal.The CAL dataflow implementation is a straight forward mapping of the algorithm, see Fig. 10.

The Pilot Extraction actor will process the OFDM symbols and send the pilot data tokens to

the Least Estimator actor. Afterwards, the data tokens are multiplied by the expected inverse

pilot values and stored in the Memory actor. After completing Least Estimation on all the pilots

for a particular OFDM standard, the Memory actor sends data tokens to MMSE network. The

MMSE network contains a matrix multiplier implemented withMACactors. Inputs to these MAC

actors are from the Memory actor andMMSE coefficients from ROM. TheMMSE controlleractor

handles MAC actors. A Parallel Input Serial Output(PISO) actor receives data from the MAC

actor and sends the final results serially.

Moreover, in order to reduce power consumption the design is divided into three clock do-

mains. This division was performed based on functionality. The algorithm implemented in hard-

ware works in sequence and there is no need for all the domains to be active all the time. The

three clock domains are clk1, clk2 and clk3. These domains may theoretically run at arbitraryfrequencies. For area efficient implementation the depth of asynchronous FIFOs for communi-

cation between clock domains is kept minimal, hence only certain ratios of clock frequencies are

supported in the implementation.

5. Results and Analysis of ASIC implementation from CAL

The RTL description generated by CAL implementation was synthesized in 65 nm CMOS

technology. Synthesis was performed on the original CAL flow and the optimized CAL flow for

12

7/30/2019 CAL_HDL

13/15

Table 1: Area details at 250 MHz.

Actors Memory Communication

(FIFO)

Total Area

CAL Original 0.080 0.1 0.051 0.238

CAL Optimized 0.073 0.1 0.040 0.22

Table 2: Hardware comparison.

Area

[mm2]

Clock

Domains

Max Freq

[MHz]

Throughput

[Samples/s]

CAL max freq 0.25 1 414 169 M

CAL at 250 MHz 0.22 1 250 102 M

[14] at 250 MHz 0.19 1 250 78 M

comparison. The design is synthesized with the same clock constraints, which are bound by the

reference design implementation.

5.1. Hardware Results

Table 1 shows details of reduction in area due to different optimizations in the CAL to RTL

tool. There is a 8% reduction in area compared to the original version of the CAL to RTL tool.

The reduction is mainly from the removal of redundant logic in actors and FIFO optimizations.

The design implemented by the optimized CAL flow is further compared with a reference

RTL design, see Table 2. The critical path of the CAL based design is 2.4 ns. At maximum

frequency of 414 MHz, the reported area is 0.25 mm2. The channel estimator based on CALis also synthesized with a clock constraint of 250 MHz and the total reported area is 0.22 mm2.

The area of the CAL implementation is 15% larger than reference RTL design, but still very

encouraging considering the design time.

The maximum clock for the CAL implementation is higher since the critical path is bounded in

an actor, and actors in a CAL implementation are connected to each other by FIFOs (registers).

CAL implementation throughput is higher than the reference design due to the inherent paral-

lelism involved in the dataflow language. At 250 MHz the throughput in CAL implementation is

102 M samples/sec and for the reference design implementation is 78 M samples/sec.

It is possible to manually implement parallelism in RTL, but may need more time to analyse

data dependencies and control mechanism. The bottleneck for the RTL implementation is in the

matrix multiplication unit. The throughput would have been same as CAL, if a set of register

are used to store the final results of all the MAC units, and during the streaming out of the finalresults the MAC units can continue to operate on next set of data.

These results are in line with the research published on CAL hardware implementation on

FPGA of the MPEG-4 standard [15], inner receiver [16]. In [15] the throughput of CAL im-

plementation is much higher and now used in standardization of MPEG RVC. However in case

of an FFT (designed by co-authors in [16]) the CAL implementation had a higher area cost for

same throughput, since the RTL implementations of FFT are highly optimized. Hence it should

be noted that the throughput gain will depend both on the complexity of design and reference

implementation, but the main advantage is in the high level of abstraction, which leads to ease in

design space exploration lower design time.

13

7/30/2019 CAL_HDL

14/15

Table 3: Functionality based clock domain partition.

Clock Domain Area [mm2] Active

period (%)

clk1 0.04 100

clk2 0.06 45

clk3 0.14 40

Table 4: Power comparison at 250 MHz.

Area

[mm2]

Num of

clock

domains

Power

[mW]

Clock

Freq

[MHz]

Throughput

[Sam-

ples/sec]

Normalized Power

[mW/(MSamples/Sec)]

CAL 0.22 1 18 250 102 M 0.21

CAL Low

Power

0.24 3 10 250 102 M 0.12

5.2. Low Power Implementation

A low power implementation is realized by partitioning the channel estimator into three clock

domains based on the functionality. The total reported area of the low power implementation is

0.24mm2. There is an increase of 10% in area due to the overhead in communication, mainly

from the asynchronous queue. Table 3 shows area occupied by each clock domain and it can be

seen that clk2, clk3 can be turned offfor around 50% of the time.

The power simulation were performed on the gate level netlist with back annotated timing

and toggle information. The power consumption was estimated at a clock rate of 250 MHz toall clock domains, for comparison power is normalized to throughput as presented in Table 4.

The normalized power consumption is reduced by 45% for the low power CAL implementation

compared to the hardware generated by the original CAL tool. Further reduction in power con-

sumption is possible by varying the clock rate (dynamic voltage frequency scaling techniques)

for different clock domains.

6. Conclusions

This paper presents a method to generate an efficient hardware design with CAL dataflow

language. Since the currently available CAL generation tool was designed for FPGA hardware

implementation, there were changes performed to facilitate ASIC implementation. Further mod-

ifications were done in the tool to optimize hardware generation. An OFDM channel estimatorwas implemented in 65 nm CMOS technology with the modified CAL generation tool. The hard-

ware implemented by CAL has a higher throughput performance compared with the reference

design. Due to the higher abstraction and handshake based interface of an actor, the design is not

based on clock cycles like in RTL. Hence changes done to one or more actors does not affect the

rest, which makes it more easy for design space exploration.

A study on GALS design implementation with CAL was done by dividing the design into

smaller clock domains. This division into clock domains is easily done in the CAL network.

The tool generates the asynchronous handshakesbetween clock domains which makes the GALS

implementation a very simple task. A clock gating scheme was integrated into the tool to support

14

7/30/2019 CAL_HDL

15/15

low power ASIC implementation. The data token based clock gating gave remarkable reductionin dynamic power consumption.

The reduced design time for comparable area and low power consumption in CAL based

design is very encouraging, considering that the CAL implementation is at a higher level of

abstraction.

7. Acknowledgments

We thank Ericsson research and Lund University for providing the opportunity to work on

this project. Also would like to thank MULTI-BASE and ACTORS project, both funded by 7th

Framework Programme (FP7) of the European Commission and Swedish VINNOVA Industrial

Excellence Center (SOS).

References

[1] Ptolemy Project, UC Berkeley EECS Dept., http://ptolemy.eecs.berkeley.edu/ptolemyII/index.htm.

[2] J. Eker, J. W. Janneck, CAL Language Report Specification of The CAL Actor Language, Tech. Rep. UCB/ERL

M03/48, EECS Department, University of California, Berkeley (2003).

[3] G. Kahn, The Semantics of a Simple Language For Parallel Programming, in: IFIP (Information processing)

Congress, 1974, pp. 471475.

[4] M. Chen, E. Lee, Design and implementation of a multidimensional synchronous dataflow environment, in: Sig-

nals, Systems and Computers, 1994. 1994 Conference Record of the Twenty-Eighth Asilomar Conference on,

Vol. 1, 1994, pp. 519 524 vol.1. doi:10.1109/ACSSC.1994.471507.

[5] S. Ritz, M. Pankert, V. Zivojinovic, H. Meyr, Optimum vectorization of scalable synchronous dataflow graphs,

in: Application-Specific Array Processors, 1993. Proceedings., International Conference on, 1993, pp. 285 296.

doi:10.1109/ASAP.1993.397152.

[6] N. Siret, I. Sabry, J. Nezan, M. Raulet, A codesign synthesis from an mpeg-4 decoder dataflow description, in:

Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 1995 1998.doi:10.1109/ISCAS.2010.5537107.

[7] Open RVC-CAL Compiler, http://orcc.sourceforge.net/, http://opendf.sourceforge.net/ (Open Dataflow Source

Forge Project).

[8] CAL Tool Version Used For This Project, Open Dataflow Version : 1131, Open Forge Version : 16.

[9] A. Chattopadhyay, Z. Zilic, Galds: a complete framework for designing multiclock asics and socs, Very Large Scale

Integration (VLSI) Systems, IEEE Transactions on 13 (6) (2005) 641 654. doi:10.1109/TVLSI.2005.848825.

[10] S. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch, E. Schmidt, System level clock tree synthesis for power

optimization, in: Design, Automation Test in Europe Conference Exhibition, 2007. DATE 07, 2007, pp. 1 6.

doi:10.1109/DATE.2007.364543.

[11] P. Teehan, M. Greenstreet, G. L emieux, A survey and taxonomy of gals design styles, Design Test of Computers,

IEEE 24 (5) (2007) 418 428. doi:10.1109/MDT.2007.151.

[12] E. Amini, M. Najibi, H. Pedram, Globally asynchronous locally synchronous wrapper circuit based on clock gating,

in: Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, Vol. 00,

2006, p. 6 pp. doi:10.1109/ISVLSI.2006.48.

[13] F. Foroughi, J. Lofgren, O. Edfors, Channel estimation for a mobile terminal in a multi-standard environment (lte

and dvb-h), in: Signal Processing and Communication Systems, 2009. ICSPCS 2009. 3rd International Conferenceon, 2009, pp. 1 9. doi:10.1109/ICSPCS.2009.5306380.

[14] I. Diaz, B. Sathyanarayanan, A. Malek, F. Foroughi, J. Rodrigues, Highly scalable implementation of a robust

mmse channel estimator for ofdm multi-standard environment, in: Signal Processing Systems (SiPS), 2011 IEEE

Workshop on, 2011, pp. 311 315. doi:10.1109/SiPS.2011.6088995.

[15] J. Janneck, I. Miller, D. Parlour, G. Roquier, M. Wipliez, M. Raulet, Synthesizing hardware from dataflow pro-

grams: An mpeg-4 simple profile decoder case study, in: Signal Processing Systems, 2008. SiPS 2008. IEEE

Workshop on, 2008, pp. 287 292. doi:10.1109/SIPS.2008.4671777.

[16] T. Olsson, A. Carlsson, L. Wilhelmsson, J. Eker, C. von Platen, I. Diaz, A reconfigurable ofdm inner receiver im-

plemented in the cal dataflow language, in: Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International

Symposium on, 2010, pp. 2904 2907. doi:10.1109/ISCAS.2010.5538042.

15

Documents

CAL_HDL