7/30/2019 CAL_HDL
1/15
Low Power ASIC Implementation from a CAL Dataflow Description
Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues
Department of Electrical and Information Technology, Lund University
Box 118, SE-221 00 Lund, Sweden
Thomas Olsson and Anders Carlsson
Ericsson Research, Lund, Sweden
Abstract
This paper presents a flow for low power hardware generation based on the CAL actor language.
CAL is a dataflow language that provides a high level of abstraction and can generate both
hardware and software descriptions. A dataflow language is appropriate for signal processing
systems since algorithms are typically specified as dataflow graphs; using the same method for
specification and high-level implementation enables rapid prototyping. In addition, the block
partitioning capability of the CAL language makes it well suited for hardware-software co-design
and for programming reconfigurable processor arrays.
The original CAL flow targets hardware-software co-design of complex systems on FPGAs. In this
work it is modified to facilitate low power ASIC implementations. For an ASIC, the partitioning
capability allows different clock domains to be implemented, and introducing token-based clock
gating in each processing block further reduces power consumption. As a case study to evaluate
the methodology and the optimizations incorporated in the flow, an Orthogonal Frequency-Division
Multiplexing (OFDM) multi-standard channel estimator is implemented. Hardware-software co-design
and Globally Asynchronous Locally Synchronous (GALS) design at a higher level of abstraction
provide more freedom for design-space exploration and reduce design time.
Keywords: CAL Dataflow Language, High-Level Synthesis, Hw-Sw Co-Design, Design
Partition, GALS, Token based Clock Gating, Low Power ASIC.
Preprint submitted to Embedded Hardware Design (Microprocessors and Microsystems), March 26, 2012
1. Introduction
The complexity of signal processing systems is increasing, driven by an ever increasing demand
for faster devices with more features. The implementation of complex designs requires hardware
platforms with multiple processors, accelerators, peripherals and reconfigurable arrays. Such
hardware platforms require very detailed, cycle-accurate description code, typically written in
a Hardware Description Language (HDL) such as Verilog or VHDL. Register Transfer Level (RTL)
implementation of complex algorithms and of their reference designs tends to be time consuming.
Implementing hardware at a higher abstraction level requires a new design flow and new tools.
A dataflow language is appropriate for signal processing systems since algorithms are typically
specified as dataflow graphs. Using the same method for specification and implementation
simplifies the design and enables rapid prototyping. CAL is a dataflow-oriented language that was
specified and developed as part of the Ptolemy project at the University of California,
Berkeley [1], and it is described extensively in the CAL language reference manual [2]. The CAL
language gives a high level of abstraction and can be used to generate synthesizable hardware and
software descriptions. However, the current version of the CAL-to-RTL generator (OpenDF)
introduces redundant logic, which increases area and power cost. Therefore, in this study the
efficiency of the RTL mapping from a CAL dataflow description was improved and evaluated in a
case study.
The block partitioning capability of the CAL language may be used to efficiently implement
Globally Asynchronous Locally Synchronous (GALS) designs, which have the major advantage that a
traditional synchronous design flow may be applied. Furthermore, the power consumption of the
design needs to be addressed to increase battery life. As part of the study of hardware
implementation in CAL, a clock gating scheme based on the activity of a network is implemented.
Several modifications to the CAL-to-RTL tool were made to support these features.
As a case study, an Orthogonal Frequency-Division Multiplexing (OFDM) multi-standard minimum
mean squared error (MMSE) channel estimator is implemented in CAL to evaluate the methodology
and the optimizations incorporated in the flow. The channel estimator was synthesized in 65 nm
CMOS technology. A GALS architecture was realized by dividing the design into different clock
domains. A low power clock gating scheme was included in the implementation, and an analysis of
the hardware parameters was performed.
The remainder of this paper is organized as follows. Sec. 2 gives a brief introduction to the
CAL dataflow language. Sec. 3 addresses the optimizations of the CAL flow and presents the GALS
design and clock gating techniques incorporated into the tool. Sec. 4 describes the hardware
implementation of the channel estimator in CAL, and the results obtained are discussed in Sec. 5.
Finally, conclusions are drawn in Sec. 6.
2. Background of CAL Dataflow Language
A dataflow model of an algorithm consists of nodes and communication arcs. The nodes represent
combinational logic, and the communication arcs are used to transfer data tokens between nodes.
A variety of such models exist, with different trade-offs between expressiveness and
analyzability. Of particular interest are synchronous dataflow networks, which are used in
several academic modelling tools to represent streaming applications [3]. Synchronous dataflow
networks are constrained, which leads to efficient synthesized code [4]. The advantage of a
dataflow model is that a one-to-one mapping between nodes and computational units in hardware
is possible. The nodes act asynchronously to each other, and the communication arcs transfer
data tokens with insignificant control overhead. In [5] it is shown that dataflow models offer a
representation that effectively supports parallelization of the design for higher performance,
which is required by many applications in wireless systems.
Figure 1: Dataflow Graph (nodes N1, N2, N3; communication arcs A, B, C).
2.1. CAL Programming Model
In Fig. 1, a dataflow graph with three nodes (N1, N2, N3) and three communication arcs (A, B, C)
is shown. In a CAL implementation the nodes are the actors, which represent computational or
logical tasks. The communication arcs are the buffers or FIFOs through which data tokens are
transferred between actors.
The CAL actors are isolated computational units which consist of input/output (IO) ports,
actions, state variables, and parameters. The state of an actor cannot be shared with other
actors, and interaction between actors takes place through the IO ports by means of data tokens.
An action defines the computational or logical operations performed on the data tokens based on
the actor state. When an action is fired, it may consume input data tokens, modify the state of
the actor, and produce output data tokens.
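The actor model described above can be mimicked in a few lines of software. The following Python sketch is only an illustration of the firing rule, not code produced by the CAL tools; the class name Accumulator, its ports and the helper run are hypothetical.

```python
from collections import deque


class Accumulator:
    """Hypothetical actor: one input port, one output port, one state variable."""

    def __init__(self):
        self.inp = deque()   # input port (FIFO of data tokens)
        self.out = deque()   # output port (FIFO of data tokens)
        self.total = 0       # state variable, private to this actor

    def can_fire(self):
        # The single action needs one token on the input port.
        return len(self.inp) >= 1

    def fire(self):
        # Action: consume one token, update the state, produce one token.
        token = self.inp.popleft()
        self.total += token
        self.out.append(self.total)


def run(actor, tokens):
    """Feed tokens to the actor and fire its action until no input remains."""
    actor.inp.extend(tokens)
    while actor.can_fire():
        actor.fire()
    return list(actor.out)


result = run(Accumulator(), [1, 2, 3])
print(result)  # [1, 3, 6]
```

The state (total) is reachable only through the action, which mirrors the rule that actor state is not shared and that all interaction happens through tokens on the ports.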
2.2. RTL Generation from CAL
The CAL flow provides a high level of abstraction. The design cycle offers a wide range of
design space exploration and optimization techniques. A CAL program may be compiled to both
hardware and software. A software implementation is realized by translating CAL to the C
programming language, and hardware is realized in HDL. This ability to perform both hardware and
software synthesis enables the development of a unified platform for hardware-software co-design
of complex systems, such as embedded systems consisting of processors and hardware accelerators.
Figure 2: CAL Frame Work. A CAL model (for example a merger actor), simulated with OpenDF, is
translated to C code by CAL2C (software generation) and to HDL code by CAL2HDL (hardware generation).
Figure 3: Simple Feedback Network (input In, output Out, adder Add with ports A and B, feedback
actor FB with ports fb_in and fb_out).

    // Actor description

    // Actor for simple feedback (delay)
    actor FB (init) fb_in ==> fb_out
      action [fb_in] ==> [fb_out] end
    end

    // Actor for addition; tokens at ports A and B are added.
    actor add () A, B ==> Sum
      action [a], [b] ==> [a+b] end
    end

    // Network description

    // Top-level network
    // Integrating actors
    network fb_add () In ==> Out
    entities
      fb0 = FB(init = [0]);
      add = add();
    structure
      In --> add.A;
      fb0.fb_out --> add.B;
      add.Sum --> fb0.fb_in;
      add.Sum --> Out;
    end

Listing 1: Example Feedback Network using CAL
A complete framework called Open Dataflow supports CAL network simulation and the generation of
hardware and software code, see Fig. 2. This capability of CAL to support hardware-software
co-design provides a common tool, architectural definition and specification for both
platforms [6], which simplifies the design of complex systems. Details of the translation of CAL
to HDL or C are described in [7]. These tools are open source, and in this paper the original
version refers to the version available at [8].
2.3. CAL Network Implementation
The topology of actors connected to each other is referred to as a network of actors. A simple
network of actors is shown in Fig. 3, and its CAL description is given in List. 1. In hardware,
each actor is an independent entity, and the communication between the actors is based on a
handshake protocol (4-phase bundled-data protocol). Each communication arc in the CAL network is
implemented as a FIFO with a handshake protocol wrapper, see Fig. 4. Furthermore, each actor
implements the handshake protocol to consume and produce tokens. If two connected actors belong
to different clock domains, an asynchronous FIFO implementation is selected.
Figure 4: Conceptual illustration of an Actor Network. Actors are connected by communication
arcs implemented as FIFOs; each actor has a separate local scheduler for its actions and
encapsulated states.
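To make the token flow of the feedback network in List. 1 concrete, the following Python sketch simulates it with software FIFOs. It is a behavioural illustration only: the function run_fb_add is hypothetical, handshaking is reduced to checking whether a FIFO holds a token, and it is assumed that the FB actor places its init value on fb_out before any other token is processed.

```python
from collections import deque


def run_fb_add(inputs, init=0):
    """Simulate fb_add: Out[n] = In[n] + Out[n-1], with Out[-1] = init."""
    fifo_a = deque(inputs)   # In --> add.A
    fifo_b = deque([init])   # fb0.fb_out --> add.B (assumed preloaded with init)
    fifo_fb = deque()        # add.Sum --> fb0.fb_in
    out = []                 # add.Sum --> Out

    progress = True
    while progress:
        progress = False
        # Actor add: fires when a token is available on both A and B.
        if fifo_a and fifo_b:
            s = fifo_a.popleft() + fifo_b.popleft()
            fifo_fb.append(s)  # the sum is sent to both consumers
            out.append(s)
            progress = True
        # Actor FB: forwards each token unchanged (a one-token delay).
        if fifo_fb:
            fifo_b.append(fifo_fb.popleft())
            progress = True
    return out


outputs = run_fb_add([1, 2, 3, 4])
print(outputs)  # [1, 3, 6, 10]
```

Each actor fires only when its input tokens are present, so the result does not depend on the order in which the two actors are visited in the loop.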
2.4. Modifications to existing Framework
A brief description of the CAL dataflow language was given in the previous subsections. Fig. 5
shows the overall existing CAL framework and tools. Support for ASIC implementation was added to
this framework by modifying and optimizing the tool, as described in the next section.
Figure 5: Included ASIC features to existing CAL framework. A CAL description (*.cal) is
processed by the open source CAL front end tools (opendf, orcc) into an intermediate
representation: top level files (network description, *.nl) and XLIM backend files (input to
other possible translator tools). From these, a software implementation (C files *.c, *.h;
C++ files *.cpp; Java files *.class, *.jar) or a hardware implementation (FPGA implementation
using Xilinx libraries, or ASIC implementation; *.vhd, *.v) is generated.
Support for ASIC implementation:
* Remove all Xilinx library dependencies
* Infer block memories (pick from library)
* Optimizations to reduce area cost
* Partitioning into clock domains (GALS support)
* Automatically generate clock gating logic (low power)
Figure 6: Generated Actor Implementation. The actor contains a memory unit (block memory,
clocked registers), a scheduler, reset synchronization logic, a kicker circuit, handshake logic
for input and output tokens, and action units (mathematical and logical operations), with reset
and clk inputs.
3. ASIC Implementation from CAL
The modifications made to the tools to support ASIC implementation are divided into two parts.
The first part reduces the hardware area cost: various modifications are made to the tool/flow
to reduce the area of the hardware generated from CAL. The second part incorporates existing low
power methods (GALS, clock gating) into the flow to enable low power ASIC implementation from CAL.
3.1. CAL generated hardware - Area Optimizations
A CAL actor consists of various units, depending on its computational and logical tasks. Fig. 6
shows a generalized actor implementation in hardware. The sub-modules of the actor are explained
below.
The Reset Synchronizer Logic synchronizes the reset with the actor clock.
The Kicker Circuit generates a pulse based on the synchronized reset signal. The pulse is used
by the scheduler logic to start the handshake protocol and action scheduling.
The Memory Unit contains the state, global variables and constants. The global variables and
constants are used by the action units for computation. The state is used by the scheduler to
determine the firing sequence of the actions in an actor.
The Action Units are the computational or logical units of an actor.
The Scheduler Unit handles the token handshake protocol and the firing of the action units.
3.1.1. Removal of Redundant Logic
The actor hardware generated by the original version of the tool assumes that every actor is a
separate asynchronous block. Hence a reset synchronization logic and a corresponding kicker
(pulse generator) circuit are implemented for every actor. This logic is redundant, since within
a single clock domain the reset may be synchronized once and then routed to all the actors of
that clock domain. The CAL tool was optimized to generate only one reset logic and one kicker
circuit per clock domain.
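The saving can be illustrated by counting instances. The Python sketch below is a minimal illustration under the assumption that one reset synchronizer and one kicker circuit form a pair; the function reset_logic_instances and the example actor-to-domain mapping are hypothetical.

```python
def reset_logic_instances(actor_domains, shared=True):
    """Count reset synchronizer + kicker pairs for a mapping actor -> clock domain.

    shared=False models the original tool (one pair per actor);
    shared=True models the optimized tool (one pair per clock domain).
    """
    if shared:
        return len(set(actor_domains.values()))
    return len(actor_domains)


# Hypothetical design: five actors spread over three clock domains.
design = {"A1": "clk1", "B1": "clk1", "A2": "clk2", "A3": "clk3", "D3": "clk3"}
original = reset_logic_instances(design, shared=False)
optimized = reset_logic_instances(design, shared=True)
print(original, optimized)  # 5 3
```

The benefit grows with the number of actors per domain; in a single-domain design only one pair remains regardless of the number of actors.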
3.1.2. Infer ASIC memory
The memory unit in the actor hardware implementation consists of registers which hold the state
variables of the controller. If the actor contains an array of variables (list) of length
greater than 128, a RAM behavioral model is inferred. Some actors may require a list of
constants, which are strictly read-only elements. The CAL language has a provision to declare a
list as read-only. However, in the original version of the tool, a ROM is implemented as a RAM
with initialized constant values. In an FPGA flow, RAM and ROM are automatically inferred by the
synthesis tool, whereas in an ASIC flow the memories have to be integrated manually.
Consequently, the tool was modified to generate appropriate behavioral models of memories for
either the FPGA or the ASIC flow.
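The selection logic described in this subsection can be summarized as a small decision function. The Python sketch below is an interpretation, not the tool's implementation: the function select_memory and its return strings are hypothetical, and it is assumed that the length threshold of 128 applies to read-only lists as well.

```python
RAM_THRESHOLD = 128  # list length above which a memory model is inferred


def select_memory(length, read_only, target):
    """Choose a storage implementation for a CAL list (illustrative only).

    length    -- number of elements in the list
    read_only -- True if the list is declared read-only in CAL
    target    -- "fpga" or "asic"
    """
    if target not in ("fpga", "asic"):
        raise ValueError("target must be 'fpga' or 'asic'")
    if length <= RAM_THRESHOLD:
        return "registers"
    kind = "ROM" if read_only else "RAM"
    if target == "fpga":
        return kind + " behavioral model, inferred by the synthesis tool"
    return kind + " behavioral model, memory picked from the library"


print(select_memory(64, False, "asic"))
print(select_memory(1300, True, "asic"))
print(select_memory(334, False, "fpga"))
```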
3.1.3. FIFO Optimization
In a CAL network, actors communicate by transferring data tokens. This communication is
implemented using FIFOs and a handshake protocol, which results in a communication cost between
actors. The depth of each FIFO is specified in the CAL network description. Optimizations are
applied to the FIFO implementation to minimize the communication cost. Depending on the FIFO
depth, the implementation is either a memory or a register array. For a FIFO depth of 2 or less,
the controller and handshake protocol are designed as glue logic.
To further reduce the communication overhead, the flow was modified to support merging of actors.
Merging removes the FIFO (registers) between two actors, and glue logic handles the handshake
protocol between them. It is requested by specifying fifosize as Null, as shown in Fig. 7 and
the corresponding pseudo network file in List. 2. Merging smaller actors makes a design more
compact; however, merging larger actors may lengthen the critical paths of the design.
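The depth-dependent choice of FIFO implementation can be sketched as follows. The text does not state the depth at which the tool switches from a register array to a memory, so the parameter memory_threshold (default 8) is purely an assumption for illustration, as is the function name fifo_implementation; None stands for fifosize = Null.

```python
def fifo_implementation(depth, memory_threshold=8):
    """Pick a FIFO implementation from its depth (illustrative only).

    depth            -- FIFO depth, or None for fifosize = Null (merged actors)
    memory_threshold -- assumed depth from which a memory replaces registers
    """
    if depth is None:
        return "merged actors: handshake glue logic only, no registers"
    if depth <= 2:
        return "registers with glue-logic controller and handshake"
    if depth < memory_threshold:
        return "register array"
    return "memory"


# FIFO sizes as they appear in the pseudo network file: 10, 10, 5 and Null.
for size in (10, 10, 5, None):
    print(size, "->", fifo_implementation(size))
```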
Figure 7: FIFO optimization by merging Actors. Actors A1 and A2 feed B1 through FIFOs of depth
10, B1 feeds C1 through a FIFO of depth 5, and C1 and D1 are merged (HM: only handshake
mechanism, no registers).

    // Network description

    // Top-level network
    network top () In ==> Out
    entities
      // actor declaration
      A1 = A();
      A2 = A();
      B1 = B();
      C1 = C();
      D1 = D();
    structure
      // actor connectivity along with FIFO sizes
      ....
      A1.output --> B1.input1 {fifosize = 10};
      A2.output --> B1.input2 {fifosize = 10};
      B1.sum_out --> C1.in {fifosize = 5};
      // specifying Null acts like merging the C1 and D1 actors
      C1.out --> D1.in {fifosize = Null};
      ....
    end

Listing 2: Example network with FIFO sizes and merged actors using CAL
3.2. Low Power ASIC Support
GALS designs are very suitable for low power hardware implementations. Typically, in GALS-based
designs, a large system is divided into smaller synchronous blocks (or clock domains). The
inherently independent nature of these smaller blocks makes it possible to apply various
standard low power techniques such as clock gating, power gating, and dynamic voltage and
frequency scaling [9].
3.2.1. Clock Domain Partitioning
With improvements in silicon fabrication technology, the number of transistors that fit on a
single die increases and the feature size decreases. Clock generation and distribution become
increasingly difficult for large designs. The clock load increases with a higher level of
integration and larger dies. This requires more clock buffers and hence increases the clock
distribution latency, which in turn makes it more difficult to design a global clock network that
can control all the blocks in the design. Furthermore, as the clock frequency increases, there is
more cross coupling in long wires, which increases the clock jitter. The clock network occupies
a significant portion of the design area, and its power consumption may amount to 35% of the
total power consumption [10].
GALS design provides a promising solution which eliminates the need for a synchronous low-skew
global clock network. The main advantage of GALS design is that the design may be divided into
smaller clock domains, and there may be arbitrary clock skew between clock domains. The clock
domains are independent synchronous blocks and use synchronization circuits for inter-domain
communication.
Signal processing systems have a dataflow realization and may be easily mapped onto such
hardware structures. The capability of the CAL language to partition the design into smaller
blocks may be used to efficiently implement a GALS design. There are various implementations of
GALS design [11]; the CAL flow used here implements GALS using a FIFO-based handshake mechanism.
It is interesting to note that the hardware implementation of the handshake mechanism for data
tokens and FIFOs is not part of the CAL language, since CAL only gives a high-level abstraction
of the dataflow algorithm. Hence the end user has more flexibility to tailor these mechanisms to
the application.
A CAL network divided into clock domains is shown in Fig. 8, along with pseudo code of the
network description in List. 3. The hardware partitioning is done using the keyword clkdomain;
similarly, for a software implementation the partitioning is done by specifying processorId
along with the actor declaration.
Partitioning into clock domains is straightforward (using the clkdomain keyword). In theory, the
maximum number of clock domains is equal to the number of actors in a network. This, however,
would be an unrealistic implementation, since the asynchronous FIFOs used for communication
between clock domains are expensive compared to synchronous FIFOs.
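The trade-off between the number of clock domains and the number of asynchronous FIFOs can be explored with a short script. The Python sketch below is illustrative only; the functions classify_fifos and count_async are hypothetical, and the connection A3 to D3 is an assumed addition to the connections visible in List. 3.

```python
def classify_fifos(clkdomain, connections):
    """Map each (source, destination) connection to 'sync' or 'async'."""
    return {
        (src, dst): "sync" if clkdomain[src] == clkdomain[dst] else "async"
        for src, dst in connections
    }


def count_async(clkdomain, connections):
    """Number of connections that cross a clock domain boundary."""
    kinds = classify_fifos(clkdomain, connections)
    return sum(1 for kind in kinds.values() if kind == "async")


# Domain assignment for some of the actors named in the pseudo network description.
clkdomain = {"A1": "clk1", "B1": "clk1", "A3": "clk3", "D3": "clk3"}
# A1 -> B1 and B1 -> A3 appear in the listing; A3 -> D3 is a hypothetical addition.
connections = [("A1", "B1"), ("B1", "A3"), ("A3", "D3")]

partitioned = count_async(clkdomain, connections)
# Extreme case: every actor in its own clock domain.
one_domain_per_actor = count_async({a: a for a in clkdomain}, connections)
print(partitioned, one_domain_per_actor)  # 1 3
```

With one domain per actor every connection requires an asynchronous FIFO, which is the unrealistic extreme mentioned above.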
3.2.2. Token-Based Clock Gating
Power consumption is becoming an increasingly important metric in large hardware platforms.
Clock gating is a well known method to decrease dynamic power by reducing the number of
transitions in registers. GALS divides the design into smaller blocks, and clock gating schemes
are applied to these blocks. The inherent advantages of clock gating in GALS design are discussed
extensively in the literature [12]. For ASIC implementation using the CAL language, new keywords
such as powerdomain and clkgating are introduced for low power implementation.
Figure 8: Clock Gating Scheme. The actors are partitioned into clock domains (Clk1, Clk2, Clk3).
Each domain has a clock manager and reset logic that derive the domain reset and the kicker
pulse from the global reset; actors within a domain are connected by synchronous FIFOs, and
domains are connected by asynchronous FIFOs.

    // Pseudo network description

    // Top-level description
    network top_gals () ==>

    entities
      // entity declaration, clock domain clk1
      // similar to VHDL entity declarations
      A1 = actor A(); {clkdomain = clk1};
      B1 = actor B(); {clkdomain = clk1};
      .......
      .......
      A2 = actor A(); {clkdomain = clk2};
      .......
      A3 = actor A(); {clkdomain = clk3};
      D3 = actor D(); {clkdomain = clk3};
      .......

    structure
      // connecting different actors
      // similar to port map connections in VHDL
      A1.out --> B1.in {fifosize = 4};
      B1.out --> A3.in {fifosize = 10};
      .....
      ......
      B2.out --> B3.in {fifosize = 5};
      B3.out --> C3.in {fifosize = 1};
    end

Listing 3: Pseudo code example for partitioning design
A clock gating scheme based on the data token activity is depicted in Fig. 8. The clocks to the
actors are not required when no data token needs to be processed. The availability of data tokens
is detected at the synchronous FIFO, and the arrival of a data token to the domain is detected at
the asynchronous FIFO. Based on the availability/arrival of a data token, the clock signal to the
domain is gated.
If a data token exists in or arrives at a domain, the clock domain is set to the active state.
When the domain is inactive, the clock to the domain is gated. This is managed by the clock
manager, see Fig. 8. For a FIFO-based implementation, the arrival/availability of a data token
is detected from the fifo empty signal. It takes 3 clock cycles for a token to be consumed from a
FIFO by the next actor; hence clock activation upon token detection adds no latency.
Consequently, token-based clock gating does not affect the behaviour (functionality) of the
design. The token-based clock gating scheme has been incorporated into the CAL hardware
generation, and it can be enabled or disabled per domain. Based on these settings, a clock
manager with the appropriate state machines and clock gating logic is generated. This automation,
which divides the design into different clock domains with built-in clock gating, makes hardware
implementation with the CAL dataflow language even more attractive.
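The gating rule can be expressed as a function of the fifo empty signals. The Python sketch below is a simplified behavioural model under stated assumptions: the generated clock manager contains state machines that are not modelled here, and the functions clock_enable and active_fraction as well as the occupancy trace are hypothetical.

```python
def clock_enable(fifo_empty_flags):
    """Enable the domain clock if any FIFO feeding the domain holds a token."""
    return not all(fifo_empty_flags)


def active_fraction(occupancy_trace):
    """Fraction of cycles in which the domain clock is enabled.

    occupancy_trace -- one entry per cycle, each a list with the number of
                       tokens in every FIFO that feeds the domain.
    """
    enabled = [
        clock_enable([count == 0 for count in counts])
        for counts in occupancy_trace
    ]
    return sum(enabled) / len(enabled)


# Hypothetical 10-cycle trace for a domain fed by two FIFOs.
trace = [[0, 0], [1, 0], [2, 0], [1, 1], [0, 1],
         [0, 0], [0, 0], [0, 0], [3, 0], [0, 0]]
fraction = active_fraction(trace)
print(fraction)  # 0.5
```

In this trace the clock is enabled in 5 of 10 cycles; the remaining cycles contribute no clock transitions in the domain.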
Figure 9: MMSE Channel Estimator. Blocks: Controller, Pilot Extraction, LS module, LS RAM,
Pilot ROM, MMSE module and MMSE ROM. Inputs: clk, rst, start, input data (11:0), Pilot_loc(2:0),
Op_mode(1:0). Outputs: WiFi data out (11:0), LTE/DVB-H data out (11:0), done, valid, busy.
Internal signals: extracted pilots, expected pilots, LS estimates, start MMSE.
4. Case Study: Channel Estimator
An OFDM-based multi-standard channel estimator is implemented as a case study for low power
ASIC hardware implementation with the CAL dataflow language. The channel estimator was chosen
as a case study since the algorithm is of moderate complexity and requires significant hardware.
The implemented channel estimator is reconfigurable to concurrently support various standards
such as 3GPP LTE, IEEE 802.11n and DVB-H. A robust MMSE algorithm is employed for the channel
estimator. Details of the multi-standard environment for channel estimation are described
extensively in [13]. The algorithm approximations and hardware mapping (data width, MMSE matrix
coefficients) chosen for the CAL hardware implementation are the same as in the reference
design [14].
The hardware architecture of the channel estimator, see Fig. 9, is divided into several blocks
as described below.
LS module - Least square estimation module consists of a complex multiplier. The inputs
to the multiplier are pilot data and the inverse of the expected pilot values stored in pilot
ROM. The output from the complex multiplier is stored in LS RAM for use with MMSE
module.
Controller module - This module is the main controller of channel estimator which takes
care of pilot separation, Least Square Estimation, and the memory operations.
MMSE module - This module consists of a matrix multiplier that is implemented with
12 Multiply-accumulate (MAC) units. The appropriate matrix inputs are sent serially from
the LS RAM and MMSE ROM.
Memories - The channel estimator consists of 3 memory units. Pilot ROM is implemented
with 2 ROMs (1300x12). MMSE ROM stores the coefficients for the MMSE algorithm
which is implemented as 2 ROMs (200x120). LS RAM stores the output from LS module
for further processing by MMSE module. It is implemented as 2 RAMs (334x12).
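The memory sizes listed above translate into a total storage requirement. The Python sketch below is a worked calculation that assumes the notation (1300x12) denotes 1300 words of 12 bits each; the dictionary memories is a hypothetical helper.

```python
# (number of instances, words, bits per word), assuming "words x bits" notation
memories = {
    "Pilot ROM": (2, 1300, 12),
    "MMSE ROM": (2, 200, 120),
    "LS RAM": (2, 334, 12),
}

bits = {name: n * words * width for name, (n, words, width) in memories.items()}
total_bits = sum(bits.values())

for name, value in bits.items():
    print(name, value, "bits")
print("total", total_bits, "bits")  # total 87216 bits
```

Under this assumption the MMSE coefficient ROMs (48000 bits) account for more than half of the total storage.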
Figure 10: Dataflow description of MMSE Estimator. Actors: Pilot Extraction, Least Estimator,
Pilot ROM, LS Memory Unit, MMSE Scheduler, 12 MAC Units and PISO, partitioned into three clock
domains (clk1, clk2, clk3).
The inputs to the channel estimator module are 12-bit real and imaginary data, a 3-bit input
which indicates the location of the pilot, a 2-bit input which indicates the type of data, and a
start signal. The outputs of the channel estimator are 12-bit real and imaginary data, a valid
signal and a busy signal.
The CAL dataflow implementation is a straightforward mapping of the algorithm, see Fig. 10.
The Pilot Extraction actor processes the OFDM symbols and sends the pilot data tokens to the
Least Estimator actor. There, the data tokens are multiplied by the inverse of the expected
pilot values and stored in the Memory actor. After Least Estimation has been completed on all
the pilots for a particular OFDM standard, the Memory actor sends data tokens to the MMSE
network. The MMSE network contains a matrix multiplier implemented with MAC actors. The inputs
to these MAC actors are the data from the Memory actor and the MMSE coefficients from the ROM.
The MMSE controller actor handles the MAC actors. A Parallel Input Serial Output (PISO) actor
receives data from the MAC actors and sends the final results serially.
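The two arithmetic stages of this dataflow, least square estimation followed by a matrix multiplication built from multiply-accumulate operations, can be sketched in floating-point Python. This is not the 12-bit fixed-point hardware: the function names, the two-pilot example and the averaging coefficient matrix are assumptions chosen only to show the structure, with one MAC loop per output value.

```python
def ls_estimate(received_pilots, inverse_expected_pilots):
    """LS stage: multiply each received pilot by the inverse expected pilot."""
    return [y * inv for y, inv in zip(received_pilots, inverse_expected_pilots)]


def mac_row(coeff_row, ls_values):
    """One MAC unit: accumulate coefficient * LS estimate over one matrix row."""
    acc = 0
    for w, h in zip(coeff_row, ls_values):
        acc += w * h
    return acc


def mmse_stage(coeff_matrix, ls_values):
    """MMSE stage: matrix-vector product, one MAC accumulation per output."""
    return [mac_row(row, ls_values) for row in coeff_matrix]


# Hypothetical example: expected pilots 1 and -1, so their inverses are 1 and -1.
received = [2 + 1j, -4 - 3j]
inverse_expected = [1, -1]
ls = ls_estimate(received, inverse_expected)   # [2+1j, 4+3j]

# Assumed coefficient matrix that simply averages the two LS estimates.
coefficients = [[0.5, 0.5], [0.5, 0.5]]
estimate = mmse_stage(coefficients, ls)
print(estimate)
```

In the hardware the per-row accumulations run in parallel on the MAC actors and the PISO actor serializes the results; the list returned by mmse_stage corresponds to that serialized output.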
Moreover, in order to reduce power consumption, the design is divided into three clock domains:
clk1, clk2 and clk3. This division was performed based on functionality. The algorithm
implemented in hardware works in sequence, so there is no need for all the domains to be active
all the time. These domains may theoretically run at arbitrary frequencies. For an area-efficient
implementation, the depth of the asynchronous FIFOs used for communication between clock domains
is kept minimal; hence only certain ratios of clock frequencies are supported in the
implementation.
5. Results and Analysis of ASIC implementation from CAL
The RTL description generated from the CAL implementation was synthesized in 65 nm CMOS
technology. Synthesis was performed on the output of both the original and the optimized CAL
flow for comparison. The designs are synthesized with the same clock constraints, which are
bound by the reference design implementation.

Table 1: Area details at 250 MHz (area in mm2).
                   Actors   Memory   Communication (FIFO)   Total Area
CAL Original       0.080    0.1      0.051                  0.238
CAL Optimized      0.073    0.1      0.040                  0.22

Table 2: Hardware comparison.
                   Area [mm2]   Clock Domains   Max Freq [MHz]   Throughput [Samples/s]
CAL max freq       0.25         1               414              169 M
CAL at 250 MHz     0.22         1               250              102 M
[14] at 250 MHz    0.19         1               250              78 M
5.1. Hardware Results
Table 1 shows the reduction in area due to the different optimizations in the CAL-to-RTL tool.
There is an 8% reduction in area compared to the original version of the CAL-to-RTL tool.
The reduction comes mainly from the removal of redundant logic in actors and from the FIFO
optimizations.
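The quoted reduction can be reproduced from the values in Table 1. The Python sketch below is a worked calculation; the dictionary layout and the function reduction_percent are hypothetical helpers.

```python
# Area values from Table 1 (mm2 assumed as the unit, as in Table 2).
area = {
    "original": {"actors": 0.080, "memory": 0.1, "fifo": 0.051, "total": 0.238},
    "optimized": {"actors": 0.073, "memory": 0.1, "fifo": 0.040, "total": 0.22},
}


def reduction_percent(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


reductions = {
    key: reduction_percent(area["original"][key], area["optimized"][key])
    for key in area["original"]
}
for key, value in reductions.items():
    print(key, round(value, 1), "%")
```

The total reduction evaluates to about 7.6%, which rounds to the quoted 8%. The actor area shrinks by about 8.8% and the FIFO area by about 21.6%, while the memory area is unchanged.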
The design implemented with the optimized CAL flow is further compared with a reference RTL
design, see Table 2. The critical path of the CAL-based design is 2.4 ns. At the maximum
frequency of 414 MHz, the reported area is 0.25 mm2. The CAL-based channel estimator is also
synthesized with a clock constraint of 250 MHz, for which the total reported area is 0.22 mm2.
The area of the CAL implementation is 15% larger than that of the reference RTL design, which is
still very encouraging considering the design time.
The maximum clock for the CAL implementation is higher since the critical path is bounded in
an actor, and actors in a CAL implementation are connected to each other by FIFOs (registers).
CAL implementation throughput is higher than the reference design due to the inherent paral-
lelism involved in the dataflow language. At 250 MHz the throughput in CAL implementation is
102 M samples/sec and for the reference design implementation is 78 M samples/sec.
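The throughput figures in Table 2 are consistent with each other if throughput is assumed to scale linearly with clock frequency. The Python sketch below is a worked check under that assumption; the function samples_per_cycle and the variable names are hypothetical.

```python
def samples_per_cycle(throughput_msps, clock_mhz):
    """Average number of samples produced per clock cycle."""
    return throughput_msps / clock_mhz


cal_spc = samples_per_cycle(102, 250)   # CAL implementation at 250 MHz
ref_spc = samples_per_cycle(78, 250)    # reference design at 250 MHz

# Scaling the CAL result to the maximum frequency of 414 MHz.
cal_at_fmax = cal_spc * 414

# Throughput gain of CAL over the reference at the same clock.
speedup = 102 / 78

# Clock frequency implied by a 2.4 ns critical path, in MHz.
fmax_from_path = 1e3 / 2.4

print(round(cal_spc, 3), round(ref_spc, 3))   # 0.408 0.312
print(round(cal_at_fmax))                     # 169
print(round(speedup, 2))                      # 1.31
print(round(fmax_from_path))                  # 417
```

The scaled value of about 169 M samples/s matches the first row of Table 2, and the frequency implied by the 2.4 ns critical path (about 417 MHz) is close to the reported 414 MHz.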
It is possible to implement parallelism manually in RTL, but this may need more time to analyse
data dependencies and control mechanisms. The bottleneck of the RTL implementation is the matrix
multiplication unit. The throughput would have been the same as for CAL if a set of registers
were used to store the final results of all the MAC units, so that the MAC units could continue
to operate on the next set of data while the final results are streamed out.
These results are in line with published research on CAL hardware implementations on FPGA of
the MPEG-4 standard [15] and of an inner receiver [16]. In [15] the throughput of the CAL
implementation is much higher, and CAL is now used in the standardization of MPEG RVC. However,
in the case of an FFT (designed by co-authors in [16]) the CAL implementation had a higher area
cost for the same throughput, since RTL implementations of FFTs are highly optimized. Hence it
should be noted that the throughput gain depends both on the complexity of the design and on the
reference implementation, but the main advantage is the high level of abstraction, which eases
design space exploration and lowers design time.
Table 3: Functionality based clock domain partition.
Clock Domain   Area [mm2]   Active period (%)
clk1           0.04         100
clk2           0.06         45
clk3           0.14         40

Table 4: Power comparison at 250 MHz.
                Area    Num of clock   Power   Clock Freq   Throughput      Normalized Power
                [mm2]   domains        [mW]    [MHz]        [Samples/sec]   [mW/(MSamples/sec)]
CAL             0.22    1              18      250          102 M           0.21
CAL Low Power   0.24    3              10      250          102 M           0.12
5.2. Low Power Implementation
A low power implementation is realized by partitioning the channel estimator into three clock
domains based on functionality. The total reported area of the low power implementation is
0.24 mm2. There is an increase of 10% in area due to the communication overhead, mainly from the
asynchronous queues. Table 3 shows the area occupied by each clock domain, and it can be seen
that clk2 and clk3 are active for only 45% and 40% of the time, so their clocks can be turned
off for more than half of the time.
The power simulations were performed on the gate-level netlist with back-annotated timing and
toggle information. The power consumption was estimated at a clock rate of 250 MHz for all clock
domains; for comparison, power is normalized to throughput as presented in Table 4. The
normalized power consumption is reduced by 45% for the low power CAL implementation compared to
the hardware generated by the original CAL tool. Further reduction in power consumption is
possible by varying the clock rate (dynamic voltage and frequency scaling techniques) for the
different clock domains.
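The relative savings can be checked from the power and area columns of Table 4. Since both implementations deliver the same throughput at the same clock rate, the relative reduction in normalized power equals the relative reduction in absolute power. The Python sketch below is a worked calculation; the function percent_change is a hypothetical helper.

```python
def percent_change(before, after):
    """Relative change in percent (positive means an increase)."""
    return 100.0 * (after - before) / before


# Power [mW] and area [mm2] of the two implementations in Table 4.
power_reduction = -percent_change(18, 10)
area_overhead = percent_change(0.22, 0.24)

print(round(power_reduction, 1))  # 44.4
print(round(area_overhead, 1))    # 9.1
```

The power reduction of about 44% is in line with the roughly 45% quoted above, and the area overhead of about 9% is close to the quoted 10%.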
6. Conclusions
This paper presents a method to generate an efficient hardware design from the CAL dataflow
language. Since the currently available CAL generation tool was designed for FPGA hardware
implementation, changes were made to facilitate ASIC implementation. Further modifications were
made to the tool to optimize hardware generation. An OFDM channel estimator was implemented in
65 nm CMOS technology with the modified CAL generation tool. The hardware generated from CAL has
a higher throughput than the reference design. Due to the higher abstraction and the
handshake-based interface of an actor, the design is not based on clock cycles as in RTL. Hence
changes made to one or more actors do not affect the rest, which makes design space exploration
easier.
A study on GALS design implementation with CAL was done by dividing the design into
smaller clock domains. This division into clock domains is easily done in the CAL network.
The tool generates the asynchronous handshakes between clock domains, which makes the GALS
implementation a very simple task. A clock gating scheme was integrated into the tool to support
low power ASIC implementation. The data token based clock gating gave a remarkable reduction
in dynamic power consumption.
The reduced design time for comparable area and low power consumption in CAL based
design is very encouraging, considering that the CAL implementation is at a higher level of
abstraction.
7. Acknowledgments
We thank Ericsson Research and Lund University for providing the opportunity to work on
this project. We would also like to thank the MULTI-BASE and ACTORS projects, both funded by
the 7th Framework Programme (FP7) of the European Commission, and the Swedish VINNOVA
Industrial Excellence Center (SOS).
References
[1] Ptolemy Project, UC Berkeley EECS Dept., http://ptolemy.eecs.berkeley.edu/ptolemyII/index.htm.
[2] J. Eker, J. W. Janneck, CAL Language Report Specification of The CAL Actor Language, Tech. Rep. UCB/ERL
M03/48, EECS Department, University of California, Berkeley (2003).
[3] G. Kahn, The Semantics of a Simple Language For Parallel Programming, in: IFIP (Information processing)
Congress, 1974, pp. 471475.
[4] M. Chen, E. Lee, Design and implementation of a multidimensional synchronous dataflow environment, in: Sig-
nals, Systems and Computers, 1994. 1994 Conference Record of the Twenty-Eighth Asilomar Conference on,
Vol. 1, 1994, pp. 519 524 vol.1. doi:10.1109/ACSSC.1994.471507.
[5] S. Ritz, M. Pankert, V. Zivojinovic, H. Meyr, Optimum vectorization of scalable synchronous dataflow graphs,
in: Application-Specific Array Processors, 1993. Proceedings., International Conference on, 1993, pp. 285 296.
doi:10.1109/ASAP.1993.397152.
[6] N. Siret, I. Sabry, J. Nezan, M. Raulet, A codesign synthesis from an mpeg-4 decoder dataflow description, in:
Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 1995 1998.doi:10.1109/ISCAS.2010.5537107.
[7] Open RVC-CAL Compiler, http://orcc.sourceforge.net/, http://opendf.sourceforge.net/ (Open Dataflow Source
Forge Project).
[8] CAL Tool Version Used For This Project, Open Dataflow Version : 1131, Open Forge Version : 16.
[9] A. Chattopadhyay, Z. Zilic, Galds: a complete framework for designing multiclock asics and socs, Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on 13 (6) (2005) 641 654. doi:10.1109/TVLSI.2005.848825.
[10] S. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch, E. Schmidt, System level clock tree synthesis for power
optimization, in: Design, Automation Test in Europe Conference Exhibition, 2007. DATE 07, 2007, pp. 1 6.
doi:10.1109/DATE.2007.364543.
[11] P. Teehan, M. Greenstreet, G. L emieux, A survey and taxonomy of gals design styles, Design Test of Computers,
IEEE 24 (5) (2007) 418 428. doi:10.1109/MDT.2007.151.
[12] E. Amini, M. Najibi, H. Pedram, Globally asynchronous locally synchronous wrapper circuit based on clock gating,
in: Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, Vol. 00,
2006, p. 6 pp. doi:10.1109/ISVLSI.2006.48.
[13] F. Foroughi, J. Lofgren, O. Edfors, Channel estimation for a mobile terminal in a multi-standard environment (lte
and dvb-h), in: Signal Processing and Communication Systems, 2009. ICSPCS 2009. 3rd International Conferenceon, 2009, pp. 1 9. doi:10.1109/ICSPCS.2009.5306380.
[14] I. Diaz, B. Sathyanarayanan, A. Malek, F. Foroughi, J. Rodrigues, Highly scalable implementation of a robust
mmse channel estimator for ofdm multi-standard environment, in: Signal Processing Systems (SiPS), 2011 IEEE
Workshop on, 2011, pp. 311 315. doi:10.1109/SiPS.2011.6088995.
[15] J. Janneck, I. Miller, D. Parlour, G. Roquier, M. Wipliez, M. Raulet, Synthesizing hardware from dataflow pro-
grams: An mpeg-4 simple profile decoder case study, in: Signal Processing Systems, 2008. SiPS 2008. IEEE
Workshop on, 2008, pp. 287 292. doi:10.1109/SIPS.2008.4671777.
[16] T. Olsson, A. Carlsson, L. Wilhelmsson, J. Eker, C. von Platen, I. Diaz, A reconfigurable ofdm inner receiver im-
plemented in the cal dataflow language, in: Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International
Symposium on, 2010, pp. 2904 2907. doi:10.1109/ISCAS.2010.5538042.
15