CAL_HDL

Embed Size (px)

Citation preview

  • 7/30/2019 CAL_HDL

    1/15

    Low Power ASIC Implementation froma CAL Dataflow Description

    Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues

    Department of Electrical and Information Technology, Lund University

    Box 118, SE-221 00 Lund, Sweden

    Thomas Olsson and Anders Carlsson

    Ericsson Research, Lund, Sweden

    Abstract

    This paper presents a flow for low power hardware generation, based on a CAL actor language.

    CAL is a dataflow language which provides a higher level of abstraction and generate both hard-

    ware and software description. A dataflow language is appropriate for signal processing systems

    since algorithms are typically specified in dataflow graphs, using the same method for specifi-

    cation and high level implementation offers rapid prototyping. Also the block partitioning ca-

    pability of the CAL language makes it ideal for hardware-software co-design and programming

    reconfigurable processor arrays.

    The original CAL flow, is targeted for hardware-software co-design of complex systems on

    FPGA, this is modified to facilitate low power ASIC implementations. In case of ASIC the

    partitioning capability allows for implementing different clock domains, and by introducing a

    token based clock gating to each processing block further reduces power consumption. As a

    case study to evaluate the methodology and optimizations incorporated in the flow, an Orthogo-

    nal Frequency-Division Multiplexing (OFDM) multi-standard channel estimator is implemented.

    Hardware-Software co-design and Globally Asynchronous Locally Synchronous (GALS) design

    at a higher level of abstraction provides more freedom for design-space exploration and reduced

    design time.

    Keywords: CAL Dataflow Language, High-Level Synthesis, Hw-Sw Co-Design, Design

    Partition, GALS, Token based Clock Gating, Low Power ASIC.

    1. Introduction

    There is an increase in complexity of signal processing systems, driven by ever increasing

    demand for faster devices with more features. The implementation of complex designs require

    6

    Email addresses: [email protected], [email protected],

    [email protected] (Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues ),

    [email protected],[email protected] (Thomas Olsson and Anders Carlsson )

    Preprint submitted to Embedded Hardware Design (Microprocessors and Microsystems) March 26, 2012

  • 7/30/2019 CAL_HDL

    2/15

    hardware platforms with multiple processors, accelerators, peripherals and reconfigurable arrays.This kind of hardware platforms require very detailed cycle accurate description code. Typically

    used are Hardware Description Languages(HDL) like Verilog and VHDL. Register Transfer

    Level (RTL) implementation of complex algorithms and their reference design tend to be time

    consuming tasks.

    Implementation of hardware at a higher abstraction level requires new design flow and tools.

    A dataflow language is appropriate for signal processing systems since algorithms are typically

    specified in dataflow graphs. Using the same method for specification and implementation offers

    relative easiness due to rapid prototyping. CAL is a dataflow oriented language that was specified

    and developed as part of the Ptolemy project at the University of California, Berkeley [1]. The

    CAL dataflow language is extensively described in CAL language reference manual [2]. The

    CAL language gives a high level of abstraction and is able to generate a synthesizable hardware

    and software description. However, the current version of CAL to RTL generator (OpenDF)

    introduces redundant logic, which increases area and power cost. Therefore, in this study the

    RTL mapping efficiency from a CAL dataflow description was increased, and evaluated by a

    case study.

    The block partitioning capability of the CAL language may be used to efficiently implement

    Globally Asynchronous Locally Synchronous(GALS) designs, which has the major advantage

    that a traditional synchronous design flow may be applied. Furthermore, power consumption of

    the design needs to be addressed to increase the battery life. As part of the study of hardware im-

    plementation in CAL, a clock gating scheme based on the activity of a network is implemented.

    Several modifications on the CAL to RTL tool were performed to support these features.

    As a case study, an Orthogonal Frequency-Division Multiplexing (OFDM) multi-standard

    MMSE (minimum mean squared error) channel estimator is implemented in CAL to evaluate

    the methodology and optimizations incorporated in the flow. The channel estimator was synthe-

    sized in 65 nm CMOS technology. A GALS architecture was realized by dividing the design intodifferent clock domains. A low power clock gating scheme was included in the implementation

    and an analysis on hardware parameters were performed.

    The remaining part of this paper is organized into the following sections. In Sec. 2 a brief

    introduction of the CAL dataflow language is presented, and Sec. 3 addresses the optimizations

    on the CAL flow. Sec. 4 presents GALS design and clock gating technique incorporated into the

    tool. In Sec. 5 hardware implementation of the channel estimator in CAL is described, and the

    various results obtained are discussed in Sec. 5. Finally conclusions are drawn in Sec. 6.

    2. Background of CAL Dataflow Language

    A dataflow model of an algorithm consists of nodes and communication arcs. The nodes rep-

    resent combinational logic, and the communication arcs are used to transfer data tokens betweennodes. A variety of such models exist, which have different trade-offs between expressiveness

    and ability for analysis. Of particular interest are the synchronous dataflow networks, which are

    applied in several academic modelling tools to represent streaming applications [3]. Synchronous

    dataflow networks are constrained, which leads to an efficient synthesized code [4]. The advan-

    tage of a dataflow model is that it is possible to have a one-to-one mapping between nodes or

    computational units in hardware. The nodes act asynchronous to each other, and communication

    arcs are used to transfer data tokens with insignificant control mechanism costs. In [5] it is shown

    that dataflow models offer a representation that may effectively support the parallelization of the

    design for higher performance, which is required for a lot of applications in wireless systems.

    2

  • 7/30/2019 CAL_HDL

    3/15

    N1

    N2

    N3

    A

    B

    C

    Figure 1: Dataflow Graph.

    2.1. CAL Programming Model

    In Fig. 1, a dataflow graph with three nodes (N1, N2, N3) and three communication arcs (A, B,

    C) is shown. In a CAL implementation the nodes are the actors which represents computation-

    al/logical tasks. Communication arcs are the buffers or FIFOs through which data tokens are

    transferred between actors.

    The CAL actors are isolated computational units which consist of input/output(IO) ports,

    actions, state variables, and parameters. The state of an actor is not shareable with other actors,

    and interaction between actors is accomplished through IO ports based on data tokens. An action

    defines computational/logical operations performed on the data tokens based on the actor states.

    When an action is fired, it may consume and/or produce data tokens. Afterwards, state of the

    actor is modified and an output data token is produced.

    2.2. RTL Generation from CAL

    The CAL flow provides a high level of abstraction. The design cycle offers a wide range of

    design space exploration and optimization techniques. A CAL program may be compiled to both

    hardware and software. A software implementation is realized by translating CAL to C program-

    ming language, and hardware is realized by HDL. This ability to perform both hardware-software

    CAL Model

    OpenDF simulations

    A

    B

    C

    Action

    state

    Example merger actor

    actor merge () in1,in2 ==> out

    A : action in1 : [a] ==> [a]

    B : action in2 : [a] ==> [a]

    selector (AB) *

    end

    end

    Software Generation

    C Code

    Generated by

    CAL2C

    main() {

    int i = 0;

    HDL Code

    Generated by

    CAL2HDL

    process()

    begin

    Hardware Generation

    Figure 2: CAL Frame Work.

    3

  • 7/30/2019 CAL_HDL

    4/15

    n ut

    fb_infb_out

    A

    B Add

    FB

    Figure 3: Simple Feedback Network.

    1

    2 A c t o r d e s c r i p t i o n

    3

    4 / / A c t o r f o r s i m p le F ee db a ck ( d e l a y )

    5 a c t o r FB ( i n i t ) f b i n ==> f b o u t

    6 a c t i o n [ f b i n ] ==> [ f b o u t ]

    7 e nd

    8

    9 / / A c to r f o r A d d it i on , t o k en s a t p o r t A a nd B i s a dd ed .

    10 a c t o r a dd ( ) A , B ==> Sum

    11 a ct io n [ a ] , [ b ] ==> [ a+b ] en d

    12 e n d

    13

    14 Netwo rk d e s c r i p t i o n

    15

    16 / / T op L e v e l N e tw o r k

    17 / / I n t e g r a t i n g A c to r s

    18 n et wo r k f b a d d ( ) I n ==> Ou t

    19 e n t i t i e s

    20 f b 0 = FB( i n i t = [ 0 ] ) ;21 add = a d d ( ) ;

    22 s t r u c t u r e

    23 I n > add .A;

    24 f b 0 . f b o u t > a d d . B ;

    25 add . sum > f b 0 . f b i n ;

    26 add . sum > Out ;

    27 e n d

    Listing 1: Example Feedback Network using CAL

    synthesis enables the development of a unified platform for hardware-software co-design of com-

    plex systems, like embedded systems consisting of processors and hardware accelerators.

    A complete framework called Open Dataflow, supports CAL network simulation and genera-

    tion of hardware-software code, see Fig. 2. This capability of CAL to support hardware software

    co-design enables a common tool, architectural definition and specification for both platforms

    [6], simplifies the design of complex systems. Details of the translation of CAL to HDL or C

    are described in [7]. These tools are open source and in this paper original version refers to the

    version available at [8].

    2.3. CAL Network Implementation

    The topology of actors connected to each other is referred as a network of actors, a simple

    network of actors implementation in CAL is shown in Fig. 3, and CAL description is shown in

    4

  • 7/30/2019 CAL_HDL

    5/15

    Communication

    Arcs

    Actors

    Scheduler

    Action States

    Separate local

    scheduler for actions

    encapsulated states

    FIFO FIFO

    FIFOFIFO FIFO

    Action

    Figure 4: Conceptional illustration of an Actor Network.

    List. 1. In hardware, each actor is an independent entity and the communication between theactors is based on handshake protocols (4-phase bundled-data protocol). Each communication

    arcs in the CAL network is implemented as a FIFO with a handshakeprotocol wrapper, see Fig. 4.

    Furthermore, an actor also facilitates handshake protocol to consume and produce tokens. If two

    connected actors belong to different clock domains, an asynchronous FIFO implementation is

    selected.

    2.4. Modifications to existing Framework

    A brief description of CAL dataflow language was presented in previous subsections. In Fig-

    ure 5, the overall existing CAL framework/tools is shown. The support for ASIC Implementation

    was added to this framework by modifications/

    optimization of the tool, described in next section.

    CAL Description

    *.cal

    CAL Front End Toolsopendf, orcc

    Open Source Tools

    Intermediate

    Representation

    Top Level Files(network

    description *.nl)

    XLIM Backend Files

    ( input to other possible

    translator tools )

    Software

    Implementation

    C Files

    *.c , *.h

    C++ Files

    *.cpp

    Java Files

    *.class, *.jar

    Hardware

    Implementation

    FPGA Implementation

    using xilinx libraries

    *.vhd , *.v

    ASIC Implementation

    *.vhd , *.v

    Support For ASIC Implementation :

    * Remove all xilinx li brary dependecies

    * Infer Block memories (pick from library)

    * Optimizations to reduce Area Cost.

    * Partitioning into clock domains (GALS support)

    * Automatically Generate clock gating logic (Low Power )

    Figure 5: Included ASIC features to existing CAL framework.

    5

  • 7/30/2019 CAL_HDL

    6/15

    Memory Unit

    Block

    Memory

    Clocked

    Registers

    Scheduler

    Reset Sync

    Logic

    Kicker Circuit

    Handshake for

    input token

    Handshake for

    output token

    reset clk

    Actions

    Mathematical &

    logical operations

    Actions

    Actions

    Figure 6: Generated Actor Implementation.

    3. ASIC Implementation from CAL

    The modification performed on the tools to support ASIC implementation is divided into two

    parts, first the reduction of the hardware area cost was taken into account. Various modifications

    are performed on the tool/flow to reduce area of the hardware implementation from CAL. The

    second part involved incorporating existing low power methods (GALS, clock gating) into the

    flow to enable low power ASIC implementation from CAL.

    3.1. CAL generated hardware - Area Optimizations

    A CAL actor consists of various units based on the computational and logical tasks. Fig. 6

    shows a generalized actor implementation in hardware. The various sub-modules in the actor are

    explained as below.

    The Reset Synchronizer Logic is used to synchronize the reset with the actor clock.

    The Kicker Circuit generates a pulse based on the synchronized reset signal. The pulse

    generated is used by the scheduler logic to begin the protocol handshake mechanism and

    action scheduling.

    The Memory Unit contains state, global variables and constants. The global variables and

    constants are used by the action units for computational purpose. The state is used by the

    scheduler for firing sequences of actions in an actor.

    The Action Units are computational or logical units of an actor.

    The Scheduler Unithandles the token handshake protocol and firing of action units.

    6

  • 7/30/2019 CAL_HDL

    7/15

    3.1.1. Removal of Redundant LogicThe actor hardware generated by the original version of the tool assumes that each and every

    actor is a separate asynchronous block. Hence a synchronous reset logic and a corresponding

    kicker (pulse gen) circuit is implemented for every actor. This logic is redundant, since for a

    single clock domain the resets may be synchronized once, and routed to all the actors of that

    clock domain consequently. The CAL tool was optimized to generate only one reset logic and

    kicker circuit for a clock domain.

    3.1.2. Infer ASIC memory

    The memory unit in the actor hardware implementation consists of registers which hold the

    state variables of the controller. If the actor contains an array of variables (list) of length greater

    than 128, a RAM behavioral model is inferred. Some actors may require a list of constants whichare strictly read-only elements. CAL language has a provision to declare a list as read-only.

    However, in the original version of the tool, a ROM is implemented as a RAM with initialized

    constant values. In an FPGA, RAM and ROM are automatically inferred by the synthesis tool.

    However, in an ASIC flow, memories needed to be inferred by manual integration. Consequently,

    modifications to generate appropriate behavioral models of memories based on the FPGA or

    ASIC flow were incorporated in the tool.

    3.1.3. FIFO Optimization

    In a CAL network, actors communicate by transfer of data tokens, this communication is

    implemented using FIFO and handshake protocol resulting in a communication cost between

    actors. The depth of each FIFO is constrained in the CAL network description. Optimizationsare applied to the FIFO implementation to minimize the communication cost. Based on the FIFO

    depth, the implementation is either a memory or register array. For a FIFO depth of 2 or less, the

    controller and handshake protocol are designed as glue logic.

    To further reduce the communication overhead modification are done in the flow to support

    merging of actors, this is performed by removing the FIFO (registers) between actors and glue

    logic handles the handshake protocol between actors. This is done by specifying fifosize as null,

    as shown in Fig. 7 and the corresponding pseudo network file List. 2. An implementation by

    merging smaller actors makes a design more compact, however the merging of larger actors may

    increase the critical paths of the design.

    A1

    A2

    B1 C1 D1

    10

    10

    5HM

    actors merged

    only handshake mechanism

    (no registers)

    Figure 7: FIFO optimization by merging Actors.

    7

  • 7/30/2019 CAL_HDL

    8/15

    12 Netwo rk d e s c r i p t i o n

    3

    4 / / T op L e v e l N e tw o r k

    5 n et wo r k t o p ( ) I n ==> Ou t

    6 e n t i t i e s

    7 / / a c t o r d e c l a r a t i o n

    8 A1 = A ( ) ;

    9 A2 = A ( ) ;

    10 B1 = B ( )

    11 C1 = C ( ) ;

    12 D1 = D ( )

    13 s t r u c t u r e

    14 / / a c t o r c o n n e c t i v i t y a lo n g w i th FIFO s i z e s .

    15 . . . .

    16 A1 . o u t p u t > B1 . i n p u t 1 { f i f o s i z e = 10 } ;

    17 A2 . o u t p u t > B1 . i n p u t 2 { f i f o s i z e = 10 } ;

    18 B1 . s u m o u t > C 1 . i n { f i f o s i z e = 5 } ;

    19 / / s p e c i f y i n g n u l l a c t s l i k e m er gi ng C1 a nd D1 a c t o rs

    20 C1 . o u t > D1 . i n { f i f o s i z e = N u l l } ;

    21 . . . .

    22 e n d

    Listing 2: Example Feedback Network using CAL

    3.2. Low Power ASIC Support

    GALS designs are very suitable for low power hardware implementations. Typically, in GALS

    based designs, a large system is divided into smaller synchronous blocks (or clock domains). The

    inherent independent nature of these smaller blocks offer the possibilities to implement variousstandard low power techniques like clock gating, power gating, dynamic voltage and frequency

    scaling [9].

    3.2.1. Clock Domain Partitioning

    The number of transistors that fit on a single die increases and the feature size decreases with

    improvements in silicon fabrication technology. The clock generation and distribution becomes

    increasingly difficult with large designs. The clock load increases with higher level of integra-

    tion and larger dies. This increase requires more clock buffers and hence increases the clock

    distribution latency. This in turn makes it more difficult to design a global-clock network that

    may control all the blocks in the design. Furthermore, as the clock frequency increases, there is

    more cross coupling in long wires which increases the clock jitter. The clock network occupies

    significant portion of the design area and the power consumption may lead to 35% of the totalpower consumption [10].

    GALS design provides a promising solution which eliminates the need of synchronous low

    skew global clock network. The main advantage of GALS design is that the design may be di-

    vided into smaller clock domains and there may be arbitrary clock skew between clock domains.

    The clock domains are independent synchronous blocks and use synchronization circuits for inter

    domain communication.

    The signal processing systems have a dataflow realization and may be easily mapped into such

    hardware structures. The capability of the CAL language to partition the design into smaller

    blocks may be used to efficiently implement GALS design. There are various implementation of

    8

  • 7/30/2019 CAL_HDL

    9/15

    GALS design [11], the CAL flow used implements GALS design using a FIFO based handshakemechanism. It interesting to note that the hardware implementation of handshake mechanism

    for data tokens and FIFOs are not part of the CAL language, since CAL only gives a high level

    abstraction of the dataflow algorithm. Hence there is more flexibility for the end user to tailor

    these mechanism based on application.

    A CAL network divided into clock domains is shown in Fig. 8 along with a pseudo code of the

    network description in List. 3. The hardware partitioning is done using the keyword clkdomain,

    similarly in case of software implementation the partition is done by specifying processorId

    along with the actor declaration.

    The partitioning into clock domains is straight forward (by using clkdomain keyword), in the-

    ory the maximum number of clock domains is equal to the number of actors in a network. This

    however would be an unrealistic implementation since asynchronous FIFOs used for communi-

    cation between clock domains are expensive compared to the synchronous FIFOs.

    3.2.2. Token-Based Clock Gating

    Power consumption is becoming an increasingly important metric in large hardware platforms.

    Clock gating is a well known method to decrease the dynamic power by reducing the number of

    transitions in registers. GALS divides the design into smaller blocks and clock gating schemes

    are applied to these blocks. The inherent advantages of clock gating in GALS design are dis-

    cussed abundantly in literature [12]. For ASIC implementation using CAL language new key-

    words like powerdomain, clkgating are introduced for low power implementation.

    Sync FIFO

    Clock

    Manager &

    Reset Logic

    Clk2

    Global

    Reset

    Kicker

    Pulse

    Domain

    resetClk2

    Kicker

    Pulse

    Clock

    Manager &

    Reset Logic

    Clk1

    Global

    Reset Domain

    resetClk1

    Async FIFO

    Async FIFO

    Clk3

    Clk3

    Sync FIFO Sync FIFO

    Sync FIFO Sync FIFO

    Sync FIFO

    Sync FIFO

    Sync FIFO

    Kicker

    Pulse

    Domain

    reset

    A1 B1

    A2 B2

    A3

    B3

    C4

    C3 D3

    Figure 8: Clock Gating Scheme.

    9

  • 7/30/2019 CAL_HDL

    10/15

    12 P s e u d o Networ k D es c ri p ti on

    3

    4 \\ Top l e v e l d e s c r i p t i o n

    5 n e tw or k t o p g a l s ( ) ==>

    6

    7 e n t i t i e s

    8 \\ e n t i t y d e c l a r a t i o n c lo ck do main c l k 1

    9 \\ s i m i l a r t o v h d l e n t i t y d e c l ar a t i o n s

    10 A1 = a c t o r A ( ) ; { c l k d o m a i n = c l k 1 } ;

    11 B1 = a c t o r B ( ) ; { c l k d o m a i n = c l k 1 } ;

    12 . . . . . . .

    13 . . . . . . .

    14 A2 = a c t o r A ( ) ; { c l k d o m a i n = c l k 2 } ;

    15 . . . . . . .

    16 A3 = a c t o r A ( ) ; { c l k d o m a i n = c l k 3 } ;

    17 D3 = a c t o r D ( ) ; { c l k d o m a i n = c l k 3 } ;

    18 . . . . . . .

    19

    20 s t r u c t u r e

    21 \\ c o n n ec t i ng d i f f e r e n t a c t o r s

    22 \\ s i m i l a r t o p o rt map c o nn e ct i on i s v hd l

    23 A1 . o u t > B 1 . i n { f i f o s i z e = 4 } ;

    24 B1 . o u t > A3 . i n { f i f o s i z e = 1 0 } ;

    25 . . . . .

    26 . . . . . .

    27 B2 . o u t > B 3 . i n { f i f o s i z e = 5 } ;

    28 B3 . o u t > C 3 . i n { f i f o s i z e = 1 } ;

    29 e n d

    Listing 3: Psuedo Code example for partitioning design

    A clock gating scheme based on the data token activity is depicted in Fig. 8. The clocks to the

    actors are not required when no data token needs to be processed. The availability of data tokens

    is detected at the synchronous FIFO, and the arrival of a data token to the domain is detected at

    the asynchronous FIFO. Based on the availability/arrival of a data token, the clock signal to the

    domain is gated.

    If a data token exists or arrives in a domain, the clock domain is set in active state. When the

    domain is inactive the clock to the domain is gated. This is managed by the clock manager, see

    Fig. 8. The arrival/availability of data token in case of a FIFO based implementation is detectedby the fifo empty signal. It takes 3 clock cycles for a token to be consumed by the next actor from

    a FIFO, hence there is no latency between token detection in a FIFO and clock activation.

    Consequently, token based clock gating does not effect the behaviour (functionality) of the

    design. The token based clock gating scheme has been incorporated as part of the CAL hardware

    generation. There are features included to disable/enable clock gating to domains. Based on

    this a clock manager with appropriate state machines and clock gating logic is generated. This

    automization which divides the design into different clock domains with inbuilt clock gating

    feature, makes hardware implementation with CAL dataflow language even more interesting.

    10

  • 7/30/2019 CAL_HDL

    11/15

    Controller

    ROM

    Pilot

    Extraction

    MMSE

    moduleLS

    ModuleLS

    RAM

    Pilot

    ROMStart MMSE

    Extracted

    Pilots

    WiFi data

    out (11:0)

    Input data

    out (11:0) LTE/DVB-H

    data out (11:0)

    Expected

    Pilots

    done valid busy

    clk

    rststart

    LS

    Estimates

    Pilot_loc(2:0)

    Op_mode(1:0)

    Figure 9: MMSE Channel Estimator.

    4. Case Study: Channel Estimator

    An OFDM based multi-standard channel estimator is implemented as a case study for a low

    power ASIC hardware implementation with the CAL dataflow language. The channel estimator

    was chosen as case study since the algorithm is of moderate complexity and requires significant

    hardware.

    The implemented channel estimator is reconfigurable to concurrently support various stan-

    dards like 3GPP LTE, IEEE 802.11n and DVB-H. A Robust MMSE algorithm is employed forthe channel estimator. Details about the multi-standard environment for channel estimation is

    described extensively in [13]. The algorithm approximations and hardware mapping (data width,

    MMSE matrix coefficients) chosen for CAL hardware implementation is same as the reference

    design [14].

    The hardware architecture of the channel estimator, see Fig. 9, is divided into several blocks

    as described below.

    LS module - Least square estimation module consists of a complex multiplier. The inputs

    to the multiplier are pilot data and the inverse of the expected pilot values stored in pilot

    ROM. The output from the complex multiplier is stored in LS RAM for use with MMSE

    module.

    Controller module - This module is the main controller of channel estimator which takes

    care of pilot separation, Least Square Estimation, and the memory operations.

    MMSE module - This module consists of a matrix multiplier that is implemented with

    12 Multiply-accumulate (MAC) units. The appropriate matrix inputs are sent serially from

    the LS RAM and MMSE ROM.

    Memories - The channel estimator consists of 3 memory units. Pilot ROM is implemented

    with 2 ROMs (1300x12). MMSE ROM stores the coefficients for the MMSE algorithm

    which is implemented as 2 ROMs (200x120). LS RAM stores the output from LS module

    for further processing by MMSE module. It is implemented as 2 RAMs (334x12).

    11

  • 7/30/2019 CAL_HDL

    12/15

    LS Memory

    Unit

    Pilot

    Extraction

    Least

    Estimator

    Pilot

    ROM

    12 MAC

    Units

    MMSE

    SchedulerPISO

    Clock Domain (clk3)

    Clock Domain (clk2)Clock Domain (clk1)

    Figure 10: Dataflow description of MMSE Estimator.

    The inputs to the channel estimator module is of real and imaginary data of 12-bits, a 3-bit

    input which shows the location of the pilot, a 2-bit input which shows the type of data and a start

    signal. The outputs from the channel estimator are 12-bit real and imaginary data, a validsignal

    and a busy signal.The CAL dataflow implementation is a straight forward mapping of the algorithm, see Fig. 10.

    The Pilot Extraction actor will process the OFDM symbols and send the pilot data tokens to

    the Least Estimator actor. Afterwards, the data tokens are multiplied by the expected inverse

    pilot values and stored in the Memory actor. After completing Least Estimation on all the pilots

    for a particular OFDM standard, the Memory actor sends data tokens to MMSE network. The

    MMSE network contains a matrix multiplier implemented withMACactors. Inputs to these MAC

    actors are from the Memory actor andMMSE coefficients from ROM. TheMMSE controlleractor

    handles MAC actors. A Parallel Input Serial Output(PISO) actor receives data from the MAC

    actor and sends the final results serially.

    Moreover, in order to reduce power consumption the design is divided into three clock do-

    mains. This division was performed based on functionality. The algorithm implemented in hard-

    ware works in sequence and there is no need for all the domains to be active all the time. The

    three clock domains are clk1, clk2 and clk3. These domains may theoretically run at arbitraryfrequencies. For area efficient implementation the depth of asynchronous FIFOs for communi-

    cation between clock domains is kept minimal, hence only certain ratios of clock frequencies are

    supported in the implementation.

    5. Results and Analysis of ASIC implementation from CAL

    The RTL description generated by CAL implementation was synthesized in 65 nm CMOS

    technology. Synthesis was performed on the original CAL flow and the optimized CAL flow for

    12

  • 7/30/2019 CAL_HDL

    13/15

    Table 1: Area details at 250 MHz.

    Actors Memory Communication

    (FIFO)

    Total Area

    CAL Original 0.080 0.1 0.051 0.238

    CAL Optimized 0.073 0.1 0.040 0.22

    Table 2: Hardware comparison.

    Area

    [mm2]

    Clock

    Domains

    Max Freq

    [MHz]

    Throughput

    [Samples/s]

    CAL max freq 0.25 1 414 169 M

    CAL at 250 MHz 0.22 1 250 102 M

    [14] at 250 MHz 0.19 1 250 78 M

    comparison. The design is synthesized with the same clock constraints, which are bound by the

    reference design implementation.

    5.1. Hardware Results

    Table 1 shows details of reduction in area due to different optimizations in the CAL to RTL

    tool. There is a 8% reduction in area compared to the original version of the CAL to RTL tool.

    The reduction is mainly from the removal of redundant logic in actors and FIFO optimizations.

    The design implemented by the optimized CAL flow is further compared with a reference

    RTL design, see Table 2. The critical path of the CAL based design is 2.4 ns. At maximum

    frequency of 414 MHz, the reported area is 0.25 mm2. The channel estimator based on CALis also synthesized with a clock constraint of 250 MHz and the total reported area is 0.22 mm2.

    The area of the CAL implementation is 15% larger than reference RTL design, but still very

    encouraging considering the design time.

    The maximum clock for the CAL implementation is higher since the critical path is bounded in

    an actor, and actors in a CAL implementation are connected to each other by FIFOs (registers).

    CAL implementation throughput is higher than the reference design due to the inherent paral-

    lelism involved in the dataflow language. At 250 MHz the throughput in CAL implementation is

    102 M samples/sec and for the reference design implementation is 78 M samples/sec.

    It is possible to manually implement parallelism in RTL, but may need more time to analyse

    data dependencies and control mechanism. The bottleneck for the RTL implementation is in the

    matrix multiplication unit. The throughput would have been same as CAL, if a set of register

    are used to store the final results of all the MAC units, and during the streaming out of the finalresults the MAC units can continue to operate on next set of data.

    These results are in line with the research published on CAL hardware implementation on

    FPGA of the MPEG-4 standard [15], inner receiver [16]. In [15] the throughput of CAL im-

    plementation is much higher and now used in standardization of MPEG RVC. However in case

    of an FFT (designed by co-authors in [16]) the CAL implementation had a higher area cost for

    same throughput, since the RTL implementations of FFT are highly optimized. Hence it should

    be noted that the throughput gain will depend both on the complexity of design and reference

    implementation, but the main advantage is in the high level of abstraction, which leads to ease in

    design space exploration lower design time.

    13

  • 7/30/2019 CAL_HDL

    14/15

    Table 3: Functionality based clock domain partition.

    Clock Domain Area [mm2] Active

    period (%)

    clk1 0.04 100

    clk2 0.06 45

    clk3 0.14 40

    Table 4: Power comparison at 250 MHz.

    Area

    [mm2]

    Num of

    clock

    domains

    Power

    [mW]

    Clock

    Freq

    [MHz]

    Throughput

    [Sam-

    ples/sec]

    Normalized Power

    [mW/(MSamples/Sec)]

    CAL 0.22 1 18 250 102 M 0.21

    CAL Low

    Power

    0.24 3 10 250 102 M 0.12

    5.2. Low Power Implementation

    A low power implementation is realized by partitioning the channel estimator into three clock

    domains based on the functionality. The total reported area of the low power implementation is

    0.24mm2. There is an increase of 10% in area due to the overhead in communication, mainly

    from the asynchronous queue. Table 3 shows area occupied by each clock domain and it can be

    seen that clk2, clk3 can be turned offfor around 50% of the time.

    The power simulation were performed on the gate level netlist with back annotated timing

    and toggle information. The power consumption was estimated at a clock rate of 250 MHz toall clock domains, for comparison power is normalized to throughput as presented in Table 4.

    The normalized power consumption is reduced by 45% for the low power CAL implementation

    compared to the hardware generated by the original CAL tool. Further reduction in power con-

    sumption is possible by varying the clock rate (dynamic voltage frequency scaling techniques)

    for different clock domains.

    6. Conclusions

    This paper presents a method to generate an efficient hardware design with CAL dataflow

    language. Since the currently available CAL generation tool was designed for FPGA hardware

    implementation, there were changes performed to facilitate ASIC implementation. Further mod-

    ifications were done in the tool to optimize hardware generation. An OFDM channel estimatorwas implemented in 65 nm CMOS technology with the modified CAL generation tool. The hard-

    ware implemented by CAL has a higher throughput performance compared with the reference

    design. Due to the higher abstraction and handshake based interface of an actor, the design is not

    based on clock cycles like in RTL. Hence changes done to one or more actors does not affect the

    rest, which makes it more easy for design space exploration.

    A study on GALS design implementation with CAL was done by dividing the design into

    smaller clock domains. This division into clock domains is easily done in the CAL network.

    The tool generates the asynchronous handshakesbetween clock domains which makes the GALS

    implementation a very simple task. A clock gating scheme was integrated into the tool to support

    14

  • 7/30/2019 CAL_HDL

    15/15

    low power ASIC implementation. The data token based clock gating gave remarkable reductionin dynamic power consumption.

    The reduced design time for comparable area and low power consumption in CAL based

    design is very encouraging, considering that the CAL implementation is at a higher level of

    abstraction.

    7. Acknowledgments

    We thank Ericsson research and Lund University for providing the opportunity to work on

    this project. Also would like to thank MULTI-BASE and ACTORS project, both funded by 7th

    Framework Programme (FP7) of the European Commission and Swedish VINNOVA Industrial

    Excellence Center (SOS).

    References

    [1] Ptolemy Project, UC Berkeley EECS Dept., http://ptolemy.eecs.berkeley.edu/ptolemyII/index.htm.

    [2] J. Eker, J. W. Janneck, CAL Language Report Specification of The CAL Actor Language, Tech. Rep. UCB/ERL

    M03/48, EECS Department, University of California, Berkeley (2003).

    [3] G. Kahn, The Semantics of a Simple Language For Parallel Programming, in: IFIP (Information processing)

    Congress, 1974, pp. 471475.

    [4] M. Chen, E. Lee, Design and implementation of a multidimensional synchronous dataflow environment, in: Sig-

    nals, Systems and Computers, 1994. 1994 Conference Record of the Twenty-Eighth Asilomar Conference on,

    Vol. 1, 1994, pp. 519 524 vol.1. doi:10.1109/ACSSC.1994.471507.

    [5] S. Ritz, M. Pankert, V. Zivojinovic, H. Meyr, Optimum vectorization of scalable synchronous dataflow graphs,

    in: Application-Specific Array Processors, 1993. Proceedings., International Conference on, 1993, pp. 285 296.

    doi:10.1109/ASAP.1993.397152.

    [6] N. Siret, I. Sabry, J. Nezan, M. Raulet, A codesign synthesis from an mpeg-4 decoder dataflow description, in:

    Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 1995 1998.doi:10.1109/ISCAS.2010.5537107.

    [7] Open RVC-CAL Compiler, http://orcc.sourceforge.net/, http://opendf.sourceforge.net/ (Open Dataflow Source

    Forge Project).

    [8] CAL Tool Version Used For This Project, Open Dataflow Version : 1131, Open Forge Version : 16.

    [9] A. Chattopadhyay, Z. Zilic, Galds: a complete framework for designing multiclock asics and socs, Very Large Scale

    Integration (VLSI) Systems, IEEE Transactions on 13 (6) (2005) 641 654. doi:10.1109/TVLSI.2005.848825.

    [10] S. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch, E. Schmidt, System level clock tree synthesis for power

    optimization, in: Design, Automation Test in Europe Conference Exhibition, 2007. DATE 07, 2007, pp. 1 6.

    doi:10.1109/DATE.2007.364543.

    [11] P. Teehan, M. Greenstreet, G. L emieux, A survey and taxonomy of gals design styles, Design Test of Computers,

    IEEE 24 (5) (2007) 418 428. doi:10.1109/MDT.2007.151.

    [12] E. Amini, M. Najibi, H. Pedram, Globally asynchronous locally synchronous wrapper circuit based on clock gating,

    in: Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, Vol. 00,

    2006, p. 6 pp. doi:10.1109/ISVLSI.2006.48.

    [13] F. Foroughi, J. Lofgren, O. Edfors, Channel estimation for a mobile terminal in a multi-standard environment (lte

    and dvb-h), in: Signal Processing and Communication Systems, 2009. ICSPCS 2009. 3rd International Conferenceon, 2009, pp. 1 9. doi:10.1109/ICSPCS.2009.5306380.

    [14] I. Diaz, B. Sathyanarayanan, A. Malek, F. Foroughi, J. Rodrigues, Highly scalable implementation of a robust

    mmse channel estimator for ofdm multi-standard environment, in: Signal Processing Systems (SiPS), 2011 IEEE

    Workshop on, 2011, pp. 311 315. doi:10.1109/SiPS.2011.6088995.

    [15] J. Janneck, I. Miller, D. Parlour, G. Roquier, M. Wipliez, M. Raulet, Synthesizing hardware from dataflow pro-

    grams: An mpeg-4 simple profile decoder case study, in: Signal Processing Systems, 2008. SiPS 2008. IEEE

    Workshop on, 2008, pp. 287 292. doi:10.1109/SIPS.2008.4671777.

    [16] T. Olsson, A. Carlsson, L. Wilhelmsson, J. Eker, C. von Platen, I. Diaz, A reconfigurable ofdm inner receiver im-

    plemented in the cal dataflow language, in: Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International

    Symposium on, 2010, pp. 2904 2907. doi:10.1109/ISCAS.2010.5538042.

    15