A Scalable and High-Performance FFT Processor Optimised for UWB-OFDM
Ramesh Chidambaram
Advisors:
Dr. Rene van Leuken (Delft University of Technology)
Ir. Jos Huisken (Silicon Hive, Philips Technology Incubator)
Microelectronics
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
“Thaivathal agatheninum muyarchi than mey varuthak kooli tharum”
The Thirukural, verse 619 (A Tamil epic written by Valluvar, dated 500-550 AD)
Interpretations:
Even a task rendered impossible by divine fate can be conquered by sheer human effort
(And)
Even the effort invested into a task made impossible by fate, has its own rewards
“Float like a butterfly, sting like a bee!”
- Muhammad Ali (born Cassius Clay)
Preface
Retargetable application-specific instruction-set processor (ASIP) based designs aim to match
the programmability of a general-purpose solution while delivering the performance – in terms
of data computation rate and area utilization – of an application-specific integrated circuit
(ASIC). The present work examines the design of such a retargetable core and the accompanying
embedded software for a very high performance Fast Fourier Transform (FFT). The underlying
aim is to raise the performance calibre of ANSI-C programmable ASIP designs to match
applications that until now belonged exclusively to the domain of ASIC solutions.
One of the primary project objectives is that the application should be easily scalable in the
order of the FFT. As a primary target for scalability, the application should cater to any order of
FFT with minimal changes to the hardware. Secondly, from a performance perspective, the
design should be capable of meeting the requirements of a highly time-critical FFT used for
orthogonal frequency division multiplexed ultra-wideband (UWB-OFDM), i.e. computing a
128-point FFT within 312.5 ns.
At present, the standardization of UWB-OFDM is yet to be finalized by the IEEE
802.15.3a workgroup. Such a design therefore amounts to minimizing risk while producing a
core for an application even before it is comprehensively standardized. This approach brings
the significant business advantage of being first to market, and also allows useful research on
implementation issues to be gathered.
Overview
The processor design, with the generation of the corresponding embedded software, is
performed using the coarse-grained retargetable computing technology proprietary to Silicon
Hive. Silicon Hive belongs to the portfolio of the Philips Technology Incubator and is a fully
owned business of Philips Electronics N.V. The technology has been developed and patented by
Philips. Chapter 1 describes the philosophy behind ASIP design and how the design template of
Silicon Hive is utilized for the same.
There is a plethora of published literature on algorithms for implementing the Discrete
Fourier Transform (DFT). However, algorithmic and hardware implementation perspectives
differ greatly. Chapter 2 addresses this issue and compares various algorithmic techniques –
including Radix-2, higher radix, the constant geometry approach, multiple radix, etc. – from the
standpoint of a retargetable ASIP.
Aspects of the UWB-OFDM proposal that are pertinent to the FFT block are examined
in Chapter 3. This chapter also deals with formulating the FFT block’s requirements and
shedding light on the design challenge involved.
Chapter 4 describes the approach to algorithm selection and the design choices made to
realize a scalable FFT processor, which is optimised for UWB-OFDM performance
specifications. Finally, Chapter 5 presents a compilation of obtained results and the conclusions
reached. Scope for further improvement is also discussed.
Contents
Preface
Overview
4.6 Signal flow graph (SFG) based analysis
4.7 Processor architecture requirements
4.8 Bit reversed addressing in the last stage
4.9 Inter-stage scaling for SQNR
4.10 Conclusion
4.11 List of references
List of Abbreviations
Acknowledgements
Chapter 1
Overview of ASIP Design
1.2 ASIP as an accelerator
An ASIP design involves matching of the architecture to a dynamically configurable set of
applications. In order to utilize the functionality of an ASIP in realizing an embedded system
solution, a set-up as shown in Figure 1.2 is commonly used. Communication between the CPU
and ASIP is performed via the system bus. This is further aided by a direct memory access
(DMA) controller, used in conjunction with a FIFO unit to load or store bulk data. The
processing data handled by the DMA controller behaves independent of the CPU, as it will
allow the CPU to continue performing other tasks while the DMA is engaging the ASIP.
The CPU instructs the DMA to fetch a block of instructions from the Program and data memory
and store it on the internal-program memory of the ASIP. The FIFO is used to control data flow
between DMA and ASIP by passing tokens back and forth. The list of operations in an
application is partitioned into those executed on the CPU itself and those dispatched to the
ASIP. This is the function of partitioning in a user-dominated hardware/software codesign
approach. Owing to adopting a VLIW type ASIP, a spatial compiler is also used to bind
instructions to a set of hardware resources. On occasions, an operation requires usage of two or
more non-adjacent hardware units. The spatial compiler needs to deduce the hardware resources
required and the on-chip data routing to realise such an operation. In digital circuits the
predominant expenditure of energy comes from switching. Given that the VLIW instruction
word length is going to be the cause for activating and deactivating bus lines and functional
clusters, an optimal operation refinement scheme is imperative, especially in the case of low-
power implementations of highly time-critical systems. In a VLIW processor, the instruction is
straightway mapped as a control vector for the hardware. Hence the op-code is generated in a
binary format from the instructions specified in the software application kernel. Thus,
optimisation of the instruction word length, the extent of encoding, efficient operation grouping,
etc. are part of the tool's functions in generating a controller that drives the data-path. The VLIW
instructions have to be tailored in a few ways for effective hardware utilisation and software
efficiency. Firstly, when the ASIP is made up of multiple IP cores, the tool complexity of
formulating VLIW instructions increases. The encoding scheme for instruction words is optimised
to have a suitable Hamming distance between them and also to exploit parallelism. VLIW
instruction word formulation for exploiting ILP can be performed statically (offline) or
dynamically (at run-time, as in super-scalar architectures). Another level of
instruction tailoring targets efficient hardware resource utilization through techniques such as
pipelining and software loop folding. Given the effort required to construct optimal VLIW
instructions from an application kernel specified in a high-level language (like C), the
compiler's capability is of substantial importance. For ease of design, most system design flows
restrict their focus to data-path construction. This translates into a number of functional
units (ALUs, MACs, SHUs, etc.) and storage-related units such as registers, load/store units, etc.
The control section, comprising the program instruction memory, address generation,
addressing flags, etc., is automatically generated to satisfy the requirements of the data-path
section.
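By way of illustration, the control-vector view of a VLIW instruction can be sketched as a bit-slicing exercise. The field layout below is entirely hypothetical (two issue slots with made-up opcode and register field widths); it is not Silicon Hive's actual encoding, only a sketch of the instruction-word-as-control-vector idea:

```python
# Hypothetical VLIW word layout: two issue slots, each with a 4-bit opcode,
# a 3-bit source register field and a 3-bit destination register field.
# Illustrative only -- not the TIM-generated encoding of the actual core.
FIELDS = [("op0", 4), ("src0", 3), ("dst0", 3), ("op1", 4), ("src1", 3), ("dst1", 3)]

def decode(word):
    """Slice a packed instruction word (int) into named control fields, MSB first."""
    total = sum(width for _, width in FIELDS)
    out = {}
    shift = total
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out

def encode(fields):
    """Inverse of decode: pack named fields back into one control word."""
    word = 0
    for name, width in FIELDS:
        word = (word << width) | (fields[name] & ((1 << width) - 1))
    return word
```

Each decoded field drives one slot of the data-path directly, which is why the instruction word length directly governs switching activity on the control lines.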
construction of a processor cell is depicted in Figure 1.4. As depicted, a few PSE units are made
to interact by means of intra-cell communication lines (CL).

Figure 1.4. Usage of processing and storage elements (PSE) in building a cell
A more holistic view of the ULIW architecture [3] that makes up the ASIP is depicted in
Figure 1.5. The architectural template shown in Figure 1.5 is configurable towards a particular
family of applications. Silicon Hive's proprietary design methodology involves automated
instantiation of such an architecture from a hardware specification, written in a highly abstract
hardware description language called TIM.
Figure 1.5. Holistic view of the architectural template
Using a modular, highly abstract language for hardware description enables rapid changes to
the hardware architecture, thus easing design iterations for co-simulation. A graphical view of
the data path for a processor (titled Moustique MPP ME, Copyright Philips Electronics NV)
developed using such a template and methodology is depicted in Figure 1.6.
Figure 1.6. Data path view generated for example processor architecture
The design flow begins with formulating a machine description and a concomitant application.
The former serves to generate the instruction set for cosimulation, and is also the input to the
hardware design flow. The latter is partitioned and compiled over a model emulating the
behaviour of the processor. This process of co-simulation comes in different flavours, each
catering to different simulation objectives. A graphical depiction of the design flow and
co-simulation strategy that leads to generating the processor architecture side-by-side with a
pertinent application is shown in Figure 1.7.
1.4 Utilizing the Silicon Hive template for designing an FFT processor
The Silicon Hive template is used in the generation of an ASIP. The present work makes use
of many of the advantages it provides in the design of a streamlined ASIP, such as the ease
of effecting rapid design iterations for user-dominated hardware/software codesign, automatic
processor generation, and easy application compilation and cosimulation. These features are
exploited in designing an FFT processor and writing the concomitant application kernel.
The characteristics that define the present target processor are:
1. The processor caters efficiently to one application kernel – it is application specific in that
sense – but is not constructed with priority towards a plurality of kernels. In the present
case, this means the processor is capable of servicing a Fast Fourier Transform (FFT)
algorithm (viz. the application kernel), but not of switching contexts to other related kernels
such as the Minimum Mean-Square Error (MMSE) algorithm, channel equalization, etc.
2. The processor description is such that it is possible to rapidly prototype a target device
given different requirements. For example, if the bit-reversal addressing of the last stage
of an in-place FFT algorithm can be omitted, then the issue slot that handles it
can be excluded from the generated architecture. This brings advantages as regards
program memory width, area, power, cost, etc., and exploits the rapid
prototyping and automatic core generation inherent in the Silicon Hive template.
3. The application should be generic in the sense of being easily adaptable, encompassing all
special cases, being efficiently scalable, etc. And yet, the application is meant to target one
particular specialized case and satisfy all specifications meant for it. In the present case,
this implies that the application and core should possess quite generic capabilities, and
yet attain all requirements of the 128-point FFT for UWB-OFDM specifications.
4. In order to be fully compatible with the compiler and the ANSI-C based programming
environment, a 16-bit data value width is chosen. In an ANSI-C based application-
programming environment, the short data type represents a 16-bit value.
5. It would be possible to utilize the arithmetic and logic units to embed a few other
simple application kernels (like a FIR filter). However, these applications are not
accompanied by performance guarantees.
The utility of an ASIP based solution is apparent, especially in the context of UWB-OFDM.
It essentially provides the ability to produce an IP core for a market even before the complete
standardization is finalized. Being first to market has significant business advantages, and also
allows research on implementation issues to be gathered ahead of ASIC competitors.
Chapter 2
Examination of FFT implementations
This chapter examines existing DFT implementations. A basic overview of each
algorithm is followed by an enumeration of its characteristics and a discussion of its
suitability for a retargetable hardware implementation.
2.1 Introduction
Digital technology has evolved to the extent that rigorous computations can be carried
out with a high level of performance at low cost. A significant example of this would be the
present day ability to implement the operation of a Discrete Fourier transform (DFT) in
hardware. The implementation takes as many forms as the possibilities for algorithmic
optimisation. The Fast Fourier Transform (FFT) algorithm is one such example and it exploits
the symmetry and periodicity of sinusoids within the DFT summation, the definition of which is
shown in Equation (2.1).
X(k) = Σ_{n=0}^{N−1} x(n)·W_N^{kn},  0 ≤ k ≤ N−1    – Equation (2.1)
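The DFT definition can be checked directly with a naive sketch, using the standard twiddle factor W_N = e^{−j2π/N}. This is O(N²) and purely illustrative, not an implementation candidate:

```python
import cmath

def dft(x):
    """Direct evaluation of Equation (2.1): X(k) = sum_n x(n) * W_N^(kn),
    with the twiddle factor W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]
```

The FFT algorithms discussed below compute exactly these values, but in O(N·log N) operations.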
W_N^{k+N} = W_N^k    – Equation (2.3)
Details of the Radix-2 FFT decimation-in-time (DIT) and decimation-in-frequency (DIF)
algorithms are described in [12]. The description includes the explanation of how the total
numbers of complex multiplications and additions are reduced to (N/2)·log₂N and N·log₂N
respectively. A basic unit called a butterfly performs the most fundamental operation of the
FFT. The entire signal flow corresponding to the FFT algorithm can be realized by repeated
instantiation of the butterfly unit with selectively ordered inputs. The most rudimentary butterfly
unit is the DIT Cooley-Tukey butterfly [12], depicted in Figure 2.1, where ⊗ refers to complex
multiplication and ⊕ refers to complex addition.
Figure 2.1. The DIT Cooley-Tukey butterfly
In the Radix-2 implementation, the inputs a and b are N/2 samples apart. The Radix-2 algorithm
is an in-place algorithm, which implies that the outputs A and B are also N/2 samples apart.
Thus, if a is addressed by an index i (equal to k) and b by index i + N/2, then A and B
can be addressed by the same pair of indices. This translates into simpler data handling in
hardware. An optimal way to implement the Cooley-Tukey butterfly, which reduces the number
of complex multiplications, is shown in Figure 2.2.
Figure 2.2. The optimised Cooley-Tukey butterfly
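A minimal sketch of the optimised DIT butterfly, assuming standard complex arithmetic: a single complex multiplication t = W·b is shared by both outputs, A = a + t and B = a − t:

```python
def butterfly_dit(a, b, w):
    """Optimised Cooley-Tukey DIT butterfly: one complex multiplication
    t = w * b, then A = a + t and B = a - t (the lower arm reuses t
    negated, which is what halves the multiplication count)."""
    t = w * b
    return a + t, a - t
```

With w = 1 this degenerates to a 2-point DFT, the base case of the recursion.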
While these butterflies are for the DIT algorithm, the twiddle factor appears on different edges
in the DIF butterfly, since in that case it is the frequency sum that is divided into even and odd
sums. The corresponding butterfly is the Gentleman-Sande butterfly [12], shown in Figure 2.3(a).
An optimised version of the same butterfly is depicted in Figure 2.3(b).
Figure 2.3. The Gentleman-Sande butterfly:
(a) Direct algorithmic version
(b) Optimised version
It is worth noting that even within a single butterfly unit, there is scope for rearranging the
primitive operations. For example, in the optimised Gentleman-Sande butterfly the twiddle
factor is addressed by the same index k as used for the input a. When implementing the butterfly
unit under a resource restriction of a single, clock-driven arithmetic logic unit (ALU), the
twiddle factor can be fetched from memory after the two input data have been fetched. This
allows the same data path to be reused for fetching the twiddle factor. By way of an example,
the signal flow graph for a DIF implementation of a Radix-2 8-point FFT is shown in Figure 2.4.
For the sake of simplicity, arrowheads and other implicit details have been excluded. As can be
observed, Stage 1 performs a bifurcation of data into Stage 2. Thus Stage 2 comprises two
FFT blocks, each of N=4. The twiddle addressing is performed with reference to N=4 in the
second stage (i.e. 0 corresponds to W_4^0). The 'shuffling' of data inherent in 'in-place'
algorithms necessitates a reordering stage. This reordering is based on bit-reversed addressing
for Radix-2. The regular structure of the signal flow graph (i.e. the same number of butterflies
working per stage, etc.) allows a high level of parallelism and pipelining to be incorporated in
such 'in-place' algorithms. The simplicity of the stages, however, is eclipsed by the overhead of
the reordering stage. The reordering done for bit-reversed addressing is probably the biggest
bottleneck experienced in designing vectorial processing cores. The reordering stage requires
that data be handled individually, which translates into the consumption of N clock cycles for
reordering. In a retargetable solution, the bit-reversed addressing becomes cumbersome, as it
does not retain any form of regularity with N. This implies that the reordering is highly
software intensive, and it tends to make designers wary of Radix-2 (or, in general, Radix-2^v).
Since even more attractive solutions exist for ASIC designs (like frozen split-radix/mixed-radix
algorithms), the Radix-2 algorithm is not preferred there either.
However, the Radix-2 algorithm lends itself to significant versatility owing to its simple
dataflow structure. Highly vectorial processing for a Radix-2 FFT suffers from overheads owing
to twiddle addressing. This is demonstrated by the example W_4^1 = W_8^2, which implies
that interleaving of twiddle factors is required between stages for regular vectorial twiddle
handling.
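The bit-reversed reordering described above can be sketched as follows. The point is that each of the N elements must be addressed individually, and the permutation has no simple regularity across different values of N:

```python
def bit_reverse(i, nbits):
    """Reverse the nbits-bit binary representation of index i."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def reorder(data):
    """Bit-reversed reordering pass for an in-place Radix-2 FFT of size N.
    Every element is touched individually, which is why this stage costs
    on the order of N cycles on a sequential machine."""
    N = len(data)
    nbits = N.bit_length() - 1  # N assumed to be a power of two
    return [data[bit_reverse(i, nbits)] for i in range(N)]
```

For N = 8 the permutation is [0, 4, 2, 6, 1, 5, 3, 7], matching the output ordering of the 8-point DIF example above.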
A feature of the DIF structure over the DIT structure is that the DIF gives a better
signal-to-noise ratio with respect to the quantisation effects of twiddle factor multiplication and
truncation [19]. Additionally, the DIF structure lends itself more easily to the twiddle-factor-
driven approach explained in Section 2.7.
2.3.1 The Radix-4 algorithm
The Radix-4 algorithm incorporates a series sum split of four every stage, as compared to two in
Radix-2. This way, the initial N-point DFT is divided to give four (N/4)-point FFT
computations. Since the Radix-4 algorithm is a mutation of Radix-2, all predominant
features that apply to Radix-2 hold true here as well, namely the highly regular structure, the
'in-place' data addressing between stages, the scope for pipelining, etc. The mathematical
formulation of the algorithm can be found in [11]. Since every butterfly works on four inputs –
x(4n), x(4n+1), x(4n+2), x(4n+3), where n = 0, 1, ..., (N/4)−1 – feeding the butterfly with
non-consecutive data ensemble values makes vectorial processing a bit tricky for generic DSP
processor cores. The same applies to the addressing and handling of twiddle factors, since each
twiddle factor needs to be referenced and fetched separately. Still, the Radix-4 algorithm
is 'faster' from an algorithmic perspective, as it completes the N-point FFT in log₄(N) stages,
compared to log₂(N) stages in Radix-2. An added detail is that the algorithm is best suited to an
FFT whose size N is a proper power of 4. While the usual techniques of zero padding and
ensemble manipulation (to make the Radix-4 butterfly behave like a Radix-2 butterfly) can be
applied, they cost optimality. The Radix-4 butterfly's nodal symbol is shown in Figure 2.5. In
essence, it is a two-stage Radix-2 implementation of a 4-point FFT, as shown in Figure 2.6.
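The Radix-4 butterfly as a 4-point DFT built from two Radix-2 stages can be sketched as follows (floating-point complex arithmetic for clarity, not the fixed-point form a core would use):

```python
def radix4_butterfly(x0, x1, x2, x3):
    """Radix-4 butterfly without external twiddles: equivalent to a 4-point
    DFT, built from two Radix-2 stages. The only internal factors are the
    trivial values 1 and -j."""
    # Stage 1: two 2-point butterflies
    a0, a1 = x0 + x2, x0 - x2
    b0, b1 = x1 + x3, x1 - x3
    # Stage 2: combine, with the single trivial -j rotation on b1
    return (a0 + b0, a1 - 1j * b1, a0 - b0, a1 + 1j * b1)
```

The outputs match the 4-point DFT of Equation (2.1) term by term, e.g. X(1) = x0 − j·x1 − x2 + j·x3.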
As an example, consider a 16-point DIF Radix-4 FFT. The twiddle factors in the internal section
are referenced to N=4, while the twiddles used in the global signal flow graph are with
respect to N, which is 16 in this example. This is indicated by the subscript 2 to
depict Radix-2, and subscript 4 to depict Radix-4. The value of k is equivalent to 4n. Thus, if 4n
points to x(1), then the set of global twiddles is {0, 1, 2, 3}. A significant feature here is the
occurrence of trivial operations within the butterfly, which can be exploited.
Since W_4^0, W_4^1 and W_2^0 are always 1, −j, and 1 respectively, these complex
multiplication operations are made trivial. If a data value c = Re{c} + j·Im{c} is multiplied by
(−j), the operation reduces to interchanging the real (Re{c}) and imaginary (Im{c}) quantities,
followed by a change of sign (a 2's complement negation) to give Im{c} − j·Re{c}. This is a
data-flow operation rather than a complex multiplication. Although the butterfly is shown
working in two stages, this need not be so from an implementation perspective. A functional unit
representing the butterfly would have four inputs and four outputs, and the data flow for each
output is made independent of the others and computed in parallel. This allows the butterfly to
operate in a single clock cycle without requiring intermediate data storage. The SFG for a
Radix-4 implementation of a 16-point FFT is shown in Figure 2.7.
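The trivial multiplication by −j reduces to a swap and a negation, as the following sketch shows:

```python
def mul_neg_j(re, im):
    """Multiply the complex value re + j*im by -j without a multiplier:
    (re + j*im) * (-j) = im - j*re, i.e. swap the real and imaginary
    parts and negate the new imaginary part (a 2's-complement negation
    in a fixed-point core)."""
    return im, -re
```

In hardware this is pure routing plus one negation, which is why such twiddles are called trivial.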
address a value, then the digit reversal is equivalent to reversing the address sequence taken two
bits at a time, as shown in Figure 2.8.
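Digit reversal for Radix-4 can be sketched as a base-4 analogue of bit reversal, reversing the address two bits at a time:

```python
def digit_reverse4(i, ndigits):
    """Reverse the base-4 digits of index i (equivalently, reverse its
    address bits taken two at a time), as used to reorder the output of
    an in-place Radix-4 FFT."""
    r = 0
    for _ in range(ndigits):
        r = (r << 2) | (i & 0b11)
        i >>= 2
    return r
```

For N = 16 (two base-4 digits), index 6 = 12₄ maps to 21₄ = 9.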
For generic FFT cores, it is a commonly adopted practice, in the case of N being of the form
2·4^v (as in N=128), to have just {log₄(N) − 1} Radix-4 stages. The last stage is made
to mimic Radix-2 by making x(4n+1) and x(4n+3) zero in the Radix-4 butterfly shown in
Figure 2.6. Although this wastes hardware capability in the last stage, it consumes fewer
stages than a Radix-2 implementation. A point to note, however, is that when such a
manipulation is performed, the solution is a mixed-radix solution. The reordering of data
after the last stage must take this into consideration, and the digit-reversal operation embedded
in the core must work conditionally. Another method to circumvent the hindrance to vectorial
processing caused by simultaneous non-sequential input requirements is to adopt a constant
geometry structure with Radix-4 butterflies. Here x(4n), x(4n+1), x(4n+2) and x(4n+3) are
stored in four separate logical memories and are read in a vectorial packed format. The constant
geometry architecture is dealt with in Section 2.4.
core. There are, however, a few noteworthy features that accompany the Radix-8 and Radix-16
butterflies.
The twiddle factors in the internal section are referenced to N=8 in Stage 1. The twiddle factors
from Stage 2 onwards allow for trivial operations. Examination of W_8^1, W_8^2 and W_8^3
gives the values (1 − j)/√2, −j and (−1 − j)/√2 respectively. In the case of a pure integer
machine, the scaling by a factor of 1/√2 (0.707...) is not very straightforward. Care must be
taken to scale it appropriately to fit the integer number system of the core. It is also
possible to approximate this by bit shifting distributed across several stages. But 'safer' ways
that guarantee accuracy can be adopted, like implementing the internal structure of the Radix-8
butterfly as a Radix-4 stage cascaded with a Radix-2 stage. While in certain specialized ASICs
this extent of overhead is excused, it is not worth the trade-off in a retargetable core. This line of
deduction holds true for Radix-16 butterflies also.
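As an illustration of the shift-and-add approach, one possible (hypothetical) approximation of the 1/√2 scaling on an integer-only core is:

```python
def scale_inv_sqrt2(x):
    """Approximate x * (1/sqrt(2)) = x * 0.70710678... using only shifts
    and adds: 1/2 + 1/8 + 1/16 + 1/64 + 1/256 = 0.70703125. Illustrative
    sketch only -- a real design would choose the terms to match the
    core's word length and accuracy budget."""
    return (x >> 1) + (x >> 3) + (x >> 4) + (x >> 6) + (x >> 8)
```

The residual error of this particular sum is about 0.01%, which must be weighed against the SQNR budget of the whole FFT.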
Therefore, higher-radix solutions are generally considered only for ASIC designs, in
which case the restriction is no longer to just Radix-16. A 64-point FFT can be carried out in two
stages of Radix-8, or a Radix-32 stage followed by a Radix-2 stage, since a single Radix-2 last
stage is entirely trivial. The choice depends on specifications such as the cost of arithmetic
capability, scheduling objectives, etc.
2.4 Split/Mixed Radix solutions
The extension of single-radix solutions is to use a combination of two or more blocks of
butterflies that operate with different radices. The algorithmic basis of the multiple-radix
solution is treated in [10]. It is apparent at first glance that using two or more types of
butterflies in parallel within a single stage does not allow efficient hardware utilisation in
generic cases. The ideal scenario for considering multiple-radix solutions is when the
specifications of the FFT are known beforehand. As an example, an ASIC that implements only
a 128-point FFT does not need to satisfy generic cases. Thus ASIC designs can
afford the luxury of implementing the same 128-point FFT using multiple-radix solutions, e.g.
using Radix-8, 4 and 2 butterflies. This does not prove hardware efficient from a viewpoint
of silicon utilisation if the same core must work for other FFT sizes or execute another
application. Thus, while it is the design methodology of 'frozen' solutions, it is not ideal for a
retargetable implementation.
(Butterfly structure: inputs x(k) and x(k + N/2), the latter multiplied by the twiddle factor
W_N^k, produce outputs y(2k) and y(2k+1).)
Figure 2.7. Butterfly for the DIT constant geometry architecture
Figure 2.8. Constant geometry architecture for a Radix-2 DIT FFT
It can be deduced from Figure 2.8 that such an architecture allows the data of the top N/2 {i.e.
x(0), x(1), x(2), x(3)} to reside in one logical memory and the data of the bottom N/2 {i.e. x(4),
x(5), x(6), x(7)} to reside in another. This allows the butterflies to be fed data every clock
cycle without a bottleneck due to memory input/output bandwidth. The principle is
extendable to Radix-4, wherein N/4 data are stored in a single memory and the architecture
operates with four logical memories for its input. The design choice of extending the same to
even higher radices depends on specifications and memory cost.
While the constant geometry architecture is advantageous owing to its streamlined
and regular structure, it has a few limitations. Data path width and memory port width
considerations come into play when allowing multiple butterflies to operate in parallel. As
an example, consider the case of two butterflies operating in tandem. Assume that the bit-
resolution of packed complex data is 32 bits (16 bits for the real component and 16 bits for the
imaginary component). If the top N/2 and bottom N/2 reside in separate memories,
reading data for two butterflies requires a data path width of 64 bits per memory for the read
cycle and 128 bits per memory for the write cycle. The fact that the data path width for the write
cycle is double that of the read cycle is all the more pronounced when specifications warrant
eight or more butterflies to operate in tandem. As such large data path widths mean large chip
area and complicated routing, there is a ceiling on the extent to which the processing can be
vectorial. Also notable is that the need for data reordering still exists – bit-reversed addressing
in the case of Radix-2. The constant geometry architecture is quite useful in retargetable cores
for medium FFT computation requirements. However, the architecture offers no advantage for
the vectorial handling of twiddles. A discussion of the utility of the constant geometry
architecture from the perspective of an ASIC implementation targeting OFDM can be found
in [13].
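The stage-invariant addressing that gives the architecture its name can be sketched as follows (an illustrative model, not a concrete memory design): butterfly k of every stage reads element k from the 'top' memory and element k from the 'bottom' memory (holding x(k) and x(k + N/2)), and writes its two outputs to consecutive positions 2k and 2k+1 of the next stage's array:

```python
def cg_stage_addresses(N):
    """Read/write address pattern of one constant-geometry Radix-2 stage.
    Returns, for each butterfly k, the pair of read addresses (one per
    logical memory) and the pair of consecutive write addresses. The
    pattern is identical for every stage, hence 'constant geometry'."""
    return [((k, k), (2 * k, 2 * k + 1)) for k in range(N // 2)]
```

Because the pattern never changes from stage to stage, the address generators are trivial, which is the architecture's main appeal for a streamlined core.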
generation of twiddle factors [17]. The CORDIC can be implemented in Radix-2 and is built
from the basic blocks of adders and shifters. The principle of the CORDIC is to calculate the
real and imaginary coordinates of a vector, given a base vector's coordinates and an angle
through which it has to be rotated, along the lines of a mixer. The implementation has a pipeline
depth that determines the accuracy of the produced result. The depth of the pipeline also
determines the latency, and can be the driving factor in scheduling an FFT meant to run
predominantly on a single-cycle loop. An advantage, however, is that the CORDIC can
replace a complex multiplier, as the complex data value arriving in the lower arm of the
butterfly can be scaled and rotated by an angle governed by the corresponding twiddle
factor.
While both schemes are valid possibilities with room for tailoring and optimisation, the
design choice is governed by whether or not the design would significantly benefit from a well-
tailored CORDIC over a well-planned addressing and indexing scheme, with the most
significant deciding factor being chip area. This distinction becomes prominent depending on
the performance requirement. In a high-performance FFT implementation, wherein a few
butterflies are to be computed in parallel to meet the cycle budget, each butterfly would require
its own dedicated CORDIC. This translates into a larger core area, which is not worth the
trade-off, especially for a small FFT order, where the twiddle factors can be stored in memory
and be handled in a vectorial manner.
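A rotation-mode CORDIC can be sketched as follows (floating-point for clarity; a hardware pipeline would use fixed-point, replace the 2^−i products by shifts, and could fold the gain correction into the twiddle scaling). The number of iterations corresponds to the pipeline depth, and sets the accuracy of the result:

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotation-mode CORDIC sketch: rotate (x, y) by `angle` (radians,
    |angle| <= pi/2) using only add/subtract and scale-by-2^-i steps,
    then correct for the accumulated CORDIC gain K."""
    # Micro-rotation angles atan(2^-i) and the total gain K
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= math.sqrt(1 + 2.0 ** (-2 * i))
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate towards the residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / K, y / K
```

Rotating (1, 0) by π/4 yields approximately (0.7071, 0.7071), i.e. the effect of multiplying by the twiddle factor W_8^{-1}.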
2.9 Summary
In this chapter, aspects of various DFT algorithms have been examined from the perspective of
ASIP implementation. The observations made are used in making design choices in Chapter 4.
In the next chapter, the requirements of the FFT block for UWB-OFDM are deduced and
analysed.
[11] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing.
Englewood Cliffs, NJ: Prentice-Hall, 1975.
[12] R. G. Lyons, Understanding Digital Signal Processing. Boston, MA: Addison-Wesley
Longman, 1996.
[13] A. Sadat and W. B. Mikhael, "Fast Fourier Transform for high speed OFDM wireless
multimedia system," in Proc. 44th IEEE Midwest Symposium on Circuits and Systems
(MWSCAS 2001), vol. 2, Aug. 2001, pp. 938-942.
[14] M. D. Macleod, "Multiplierless Winograd and prime factor FFT implementation," IEEE
Signal Processing Letters, vol. 11, no. 9, pp. 740-743, Sept. 2004.
[15] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE
Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39,
no. 5, pp. 312-316, May 1992.
[16] R. Sarmiento, V. de Armas, J. F. Lopez, J. A. Montiel-Nelson, and A. Nunez, "A CORDIC
processor for FFT computation and its implementation using gallium arsenide technology,"
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1,
pp. 18-30, March 1998.
[17] Guoping Zhang and F. Chen, "Parallel FFT with CORDIC for ultra wide band," in Proc.
15th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications
(PIMRC 2004), vol. 2, Sept. 2004, pp. 1173-1177.
[18] Yingtao Jiang, Ting Zhou, Yiyan Tang, and Yuke Wang, "Twiddle-factor-based FFT
algorithm with reduced memory access," in Proc. International Parallel and Distributed
Processing Symposium (IPDPS 2002), April 2002, pp. 70-77.
[19] A. Elterich and W. Stammler, "Error analysis and resulting structural improvements for
fixed point FFTs," in Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP-88), April 1988, pp. 1419-1422.
Chapter 3
The UWB-OFDM Standard
In this chapter, pertinent details of the UWB MB-OFDM standard are discussed. This is
followed by a derivation of the cycle-count budget and other requirements of an FFT meant for
the same.
3.1 Introduction
The IEEE 802 working group was established in 1980, and their purpose to establish standards
for Local Area Networks (LAN) and Metropolitan Area Networks (MAN) [1]. Each standard (or
recommended practice, or guide) is initiated by a group of people with an interest in developing
the standard. Presently a list of active work and study groups within the LAN/MAN standards
committee (LMSC) is provided in Table 3.1:
Table 3.1 List of active work or study groups within LMSC
The 802.15 WPAN working group’s effort focuses on the development of consensus standards
for Personal Area Networks or short distance wireless networks. These address wireless
networking of portable and mobile computing devices such as PCs, Personal Digital Assistants
(PDAs), peripherals, cell phones, pagers, and consumer electronics; allowing these devices to
communicate and interoperate with one another. The goal is to publish standards, recommended
practices, or guides that have broad market applicability and deal effectively with the issues of
coexistence and interoperability with other wired and wireless networking solutions. At the time
of this writing, the list of standards under the purview of the WPAN work group is shown in
Table 3.2. An alternative being discussed for high data-rate WPAN is the usage of Ultra-
wideband (UWB).
Table 3.2 List of standards within 802.15
Standard code Topic of specification
802.15.1 Bluetooth
802.15.2 Coexistence of WPAN (802.15) and WLAN (802.11)
802.15.3 High-rate (20 Mbps or greater) WPAN
802.15.4 Low data-rate, low power, low complexity WPAN
In order to ensure the coexistence of other standards (especially WLAN 802.11a/g) with UWB,
the FCC has put in place a transmit power spectral density (measured with respect to 1 MHz)
restriction, as shown in Figure 3.1. The standardisation for adopting UWB for WPAN is still a
topic of debate at the time of this writing. The two contending standards are labelled Direct
Sequence (DS-UWB) and Multi-band Orthogonal Frequency Division Multiplexing (MB-
OFDM). The MB-OFDM standard for UWB is being formulated by the workgroup IEEE
802.15.3a. The alliance of the industries involved in this workgroup has come to be known as
the Multiband-OFDM Alliance (MBOA) [2], co-founded by Philips. The FCC's initiative to
grant unlicensed use of the UWB channel has caused fervent interest because such a substantial
bandwidth allows for very high data transmission rates (~1 Gbps). However, in adherence to the
FCC's spectral mask, the transmitted power has to be quite low (comparable to the ambient
noise unintentionally radiated by computers). So, while this allows transmissions at very high
speeds, the transmission distance is quite limited (in the range of about 10 meters).
In MB-OFDM, the spectral range (3.1 to 10.6 GHz) is divided into 14 bands of 528 MHz
each. The lower three bands are compulsory and to be used for standard operation in the present.
The remaining bands are for optional use or for purposes of future expansion. In the draft
specification of the MBOA special interest group [3] [4], the spectral details for OFDM symbol
construction are as shown below in Figure 3.2.
1. The OFDM system employs 128 subcarriers, 122 of which carry energy.
2. Out of the 122 subcarriers, 100 are devoted to data (ci) and 12 are assigned to serve as
pilot subcarriers (Pi).
The transmit spectrum described above spans a band of 528 MHz. In the time domain, the
symbol is transmitted for 312.5 ns. A break-up of a transmitted word is shown in Figure 3.4. A
cyclic prefix of 60.6 ns provides resilience against multi-path errors. A guard interval of 9.5 ns
separates consecutively transmitted symbols and allows time for switching between bands.
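The symbol timing above can be cross-checked against the 528 MHz sampling rate. As a minimal sketch, the sample counts below (128 FFT samples, 32 prefix samples, 5 guard samples) are assumptions chosen to reproduce the durations quoted in the text; the function name is illustrative.

```c
#include <assert.h>
#include <math.h>

/* MB-OFDM symbol timing sketch. Assumed sample counts at 528 MHz:
 * 128 FFT samples, ~32 samples of cyclic prefix (60.6 ns), and
 * ~5 samples of guard interval (9.5 ns), for 165 samples = 312.5 ns. */
#define FS_HZ 528e6 /* band (and sampling) rate: 528 MHz */

/* duration in nanoseconds of a given number of samples at 528 MHz */
static double duration_ns(int n_samples) {
    return (double)n_samples / FS_HZ * 1e9;
}
```

With these assumed counts, 128 + 32 + 5 = 165 samples come out to exactly 312.5 ns, matching the symbol period quoted above.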
The typical block diagram of a transmitter and receiver section for such a MB-OFDM schema is
shown in Figure 3.5 and Figure 3.6 respectively [5].
The transmitter and receiver sections are much like those in conventional wireless OFDM.
There are, however, a few modifications and restrictions incorporated. The constellation
mapping is restricted to Quadrature Phase Shift Keying (QPSK) so that the precision of internal
digital blocks (IFFT, FFT, ADC, DAC, etc.) can be kept low. A 128-point FFT/IFFT is required
for generating or extracting ensemble information from the symbol, with a maximum latency of
312.5 ns to perform the computation before the next symbol has to be dealt with. As far as
designing an FFT is concerned, these alone form the specifications – the order of the FFT and
the time constraint. It's worth mentioning that the intricate nuances of the MB-OFDM UWB
standard are still under discussion and subject to change. However, minor changes to the
standard do not affect the requirements of the FFT and are irrelevant to the present work.
Thus, for the case of UWB-OFDM, the number of FFT stages is ns = log2(N) = log2(128) = 7.
The number of butterflies for a stage of the signal flow graph of the FFT is denoted by nstage.
Since each butterfly handles two data inputs at an instance, for N = 128, nstage = N/2 = 64.
Thus, for the entire 128-point FFT, the number of butterflies (BFtotal) to be computed is given
by,
BFtotal = ns x nstage = 7 x 64 = 448 Radix-2 butterflies.
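The radix-2 butterfly counted above takes two inputs a and b and a twiddle factor W, and for decimation in frequency produces a + b and (a − b)·W. A minimal floating-point sketch follows; the actual kernel operates on 16-bit fixed-point data through vector intrinsics, so this is illustrative only.

```c
#include <assert.h>
#include <complex.h>

/* Radix-2 decimation-in-frequency (DIF) butterfly: the sum leaves the
 * stage unrotated, while the difference is rotated by the twiddle
 * factor W. Floating-point sketch of the fixed-point hardware unit. */
static void dif_butterfly(double complex a, double complex b,
                          double complex w,
                          double complex *top, double complex *bot) {
    *top = a + b;
    *bot = (a - b) * w;
}
```

Since each butterfly consumes two inputs, one stage of an N-point FFT needs N/2 such operations, which is where nstage = 64 comes from.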
A high-end retargetable core can at best be operated at an internal clock speed (C.S.) in the
range of 300 MHz. As a measure of the computational capability required, the number of
butterflies that must operate in parallel is derived. The time period of a clock cycle is given by,
tc = 1 / C.S. = 1 / 300 MHz ≈ 3.33 ns
Thus, the number of cycles available for computing the complete FFT (Cbudget) is given by,
Cbudget = ⌈T / tc⌉ = ⌈312.5 / 3.33⌉ = 94 cycles -- Equation 3.4
nparallel = ⌈BFtotal / Cbudget⌉ = ⌈448 / 94⌉ = 5 Radix-2 butterflies in parallel. -- Equation 3.5
Thus, in the ideal case of being able to compute five Radix-2 butterflies every clock tick, the
cycle budget of 94 cycles would be just met. A summary of the values is depicted in Table 3.3.
However, the practical working case would require some overhead in between stages.
This is owing to having to recalculate parameters for vectorial processing on a domain based on
N/2, as compared to N for the previous stage. There is also the presence of a bit-reversal
addressed rearrangement in the last stage, which will require N/2 or 64 cycles if done one
butterfly output pair at a time. This borders on a nearly unacceptable number of cycles (~68%
of the budget) spent merely rearranging data after the computation is complete. Thus, an
estimate of nparallel = 8 seems acceptable given the need to handle overheads, area
considerations, and the fact that N is a multiple of 8 (unlike 5). The total number of cycles that
can then be devoted to butterfly computations (Ccomp) is,
Ccomp = BFtotal / nparallel = 448 / 8 = 56 cycles -- Equation 3.6
Thus the remaining 38 cycles (Cbudget − Ccomp) must account for all intermediate parameter recalculations and other overheads.
3.4 Summary
The FFT for UWB-OFDM has a time budget of 312.5ns for computation. Given a clock speed
in the neighbourhood of 300MHz, a radix-2 implementation would entail computing 8
butterflies in tandem. The details discussed and analysed in this section form the inputs to
making design decisions, the content of Chapter 4.
[3] Multiband OFDM Physical layer Proposal for IEEE 802.15 Task group 3a
IEEE P802.15-03/141r1, March 2004
Chapter 4
Explanation of Design Choices
The focus of this section is the reasoning behind design choices. Starting with a review of
requirements, each design choice and its implication for attaining the requirements is examined.
1. The solution should be programmable, meaning the application kernel should scale with
the order of the FFT in terms of clock cycles. Emphasis is also placed on special provisions
to still handle rudimentary cases (e.g. N = 2) even while incorporating a highly vectorial
processing approach.
In the present case, if 8 butterflies are to be fed data simultaneously, this translates into two
8 x 32 = 256-bit wide inputs to the Issue Slot (IS) that performs the FFT computation. Two
issues come into view at this stage. Firstly, from the perspective of efficient hardware
utilization, the FFT IS should be supplied with two 256-bit values every clock cycle. For
providing such a continuous flow of input data, just LSUs are sufficient. The alternatives are to
cache data considering the basic data size to be 32 bits – implying a requirement of about
sixteen cache hits every clock cycle – or to keep bulk data residing in processor registers.
Aspects such as the overhead of packing data, the accompanying cost of registers, etc. become
significant. In comparison to such alternatives, using LSUs to provide data only when required
not only meets the performance requirements but is also the cheaper, more reliable alternative.
The second important issue is that of having to fetch data from different locations, (N/2)
apart for each stage. This fetching has to take place in a very regular and streamlined manner.
The need for as high an extent of regularity as possible is to help in pipelining chunks of the
application kernel. There’s a significant advantage to be gained in terms of cycle-budget and
power by pipelining the operations within loops - also known as loop folding - to the extent of
realizing a single-cycle loop. In most standard cases, to realize this high extent of regularity, the
constant-geometry type architecture is preferred. In keeping with the current radix-2 approach,
this can be performed with an architecture that has two logical memories – one for top (N/2)
data values, and the other for the bottom (N/2) data values – as detailed in Section 2.4. However,
in the case of 8 simultaneous butterfly computations with two single-ported memories, this
implies loading a 256-bit vectorial value from each memory, but storing a 512-bit value to a
single memory. This becomes a memory bandwidth issue. A similar argument holds true for a
radix-4 constant geometry implementation with four logical memories, with the added overhead
of data handling complexity if the order of the FFT isn’t an exact multiple of four. This issue is
however circumvented by using dual ported memories, which would allow two distinct data to
be stored to two distinct locations within a memory, simultaneously. This is a point that strongly
favours usage of dual-ported memories in the processor architecture as it maintains the
possibility of adopting a constant-geometry approach to writing the application kernel.
interchanged between stages. A pictorial representation of memory ping ponging is depicted in
Figure 4.2.
highly regular manner without affecting the FFT IS’s operation, and without costing additional
clock cycles. Such an arrangement can be realised by an adaptation of the memory ping-ponging
technique used for data handling. The adaptation is to place a separate IS in the data path to
perform the twiddle selection and repacking operation, as depicted in Figure 4.3. In the case of
radix-2, the IS selects alternate twiddles and packs them together. It’s a point worth noting that,
if a constant geometry approach is adopted, making the FFT twiddle factor driven (explained in
Section 2.8) between stages is not straightforward. Thus, while the constant geometry approach
is still a possibility, in view of targeting a twiddle factor driven approach to the FFT, the in-
place approach is retained for the application kernel.
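The twiddle selection described above (the IS selecting alternate twiddles and packing them together for the next stage) can be sketched in scalar form. This is a hedged illustration: the function name and the use of plain integer indices are assumptions, whereas the hardware performs the same decimation on 256-bit vectors.

```c
#include <assert.h>

/* Inter-stage twiddle rearrangement sketch: while the FFT issue slot
 * works on stage k, a separate issue slot reads the current twiddle
 * set and packs every alternate entry into the set for stage k+1,
 * since each radix-2 DIF stage uses every other twiddle of the
 * previous stage. Returns the size of the packed set. */
static int pack_alternate(const int *src, int n_src, int *dst) {
    int n_dst = 0;
    for (int i = 0; i < n_src; i += 2)
        dst[n_dst++] = src[i]; /* keep W^0, W^2, W^4, ... */
    return n_dst;              /* next stage needs half as many */
}
```

Because the selection halves the set each time, the packed twiddles for a whole stage fit into progressively fewer vector fetches as the FFT proceeds.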
three stages are special cases in the sense that the butterflies operate in an intra-vectorial
manner within the same input, thereby necessitating the inclusion of separate instructions in the
FFT IS to handle these stages. The DIF architecture is preferred owing to its better signal-to-
quantisation noise ratio, as stated in Chapter 2. The memory ping-ponging can be visualized
using separate logical memories - Memory1, Memory2 to handle data, and Twiddle1, Twiddle2
to handle twiddle factors. Then, Stage 1 would operate on data fetched from Memory1 and
twiddle factors from Twiddle1. The output data is stored in Memory2, and while the FFT issue
slot is operating, Twiddle2 is populated with alternative twiddle factors from Twiddle1. In the
next stage, Memory2 provides input, and Memory1 serves as the target memory for that stage.
The twiddle factors are fetched from Twiddle2, and the interleaving operation allows for them to
be read in a single fetch as a 256-bit word.
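The ping-ponging between the logical memories can be sketched as a simple pointer swap per stage. This is a sketch under stated assumptions: the helper names are illustrative, and the parity rule assumes stage 1 reads Memory1 and writes Memory2, as in the description above.

```c
#include <assert.h>

/* Memory ping-ponging sketch: each stage reads from one logical
 * memory and writes to the other; the roles swap between stages.
 * The same pattern applies to the twiddle memories Twiddle1/Twiddle2. */
static void swap_ptrs(int **src, int **dst) {
    int *t = *src;
    *src = *dst;
    *dst = t;
}

/* Returns 1 if the final stage's output lands in Memory2, assuming
 * stage 1 reads Memory1 and writes Memory2: stage k writes Memory2
 * exactly when k is odd. */
static int result_in_mem2(int n_stages) {
    return n_stages % 2;
}
```

For the 7-stage 128-point FFT the parity works out so the last in-place stage would write Memory2; in the actual design the final stage is instead routed to the separate Output memory, as discussed later.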
while the other loop deals with handling and rearranging twiddle factors. Twiddle factor sets
are read as 256-bit vectors each, once per stage. For each twiddle set, the inner loop has many
iterations depending on the FFT stage, and there are three primary operations that take place
within the inner loop during the same clock instance:
i. Loading the next input to be operated on in the next cycle
ii. The FFT IS operates on the present input
iii. And the output data made available during the previous cycle is stored to memory.
The twiddles that serve as input to the inner loop are handled by the outer loop, which performs
fetching of twiddles, interleaved rearrangement using a dedicated issue slot, and storage to
memory – again pipelined to happen within a single cycle. Well co-ordinated and yet isolated
index handling for all the LSUs is crucial to achieving an optimal scheduling scheme.
The implication of this is twofold – firstly, it allows for more efficient pipelining and loop
folding of the inner loop. Secondly, it allows sufficient cycles for twiddle rearrangement
(interleaved storage), even when the inner loop is folded to a single-cycle loop.
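The three overlapping operations of the folded inner loop (load next, compute current, store previous) can be sketched in C. As a hedged sketch, a simple doubling operation stands in for the vector butterfly, and the function name is illustrative; the point is the prologue/steady-state/epilogue structure that loop folding produces.

```c
#include <assert.h>

/* Software-pipelined loop sketch: in each "cycle" the kernel
 * (i) loads the input for the next iteration, (ii) computes on the
 * current input, and (iii) stores the result produced previously.
 * A doubling operation stands in for the butterfly computation. */
static void pipelined_transform(const int *in, int *out, int n) {
    if (n <= 0) return;
    int cur = in[0];                             /* prologue: first load   */
    int result = 0;
    for (int i = 0; i < n; i++) {
        int next = (i + 1 < n) ? in[i + 1] : 0;  /* (i) load next input    */
        int r = cur * 2;                         /* (ii) compute current   */
        if (i > 0) out[i - 1] = result;          /* (iii) store previous   */
        cur = next;
        result = r;
    }
    out[n - 1] = result;                         /* epilogue: last store   */
}
```

On the ASIP these three operations are issued to different issue slots in the same instruction, which is why independent index handling for each LSU is crucial to the schedule.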
There are certain implications that accompany this processor architecture. Firstly, the minimum
data memory size required is at least twice the order of the FFT. Secondly, the need for an
Address Generation Unit (AGU) to individually address each twiddle factor is obviated as it can
be handled in a vectorial manner. If not for this, dealing with twiddles becomes a bottleneck
while computing eight butterflies in parallel. Another important implication is that two distinct
data values are to be procured from the same memory per clock cycle. Since the distance
between these two data value indexes varies between stages, a dual-ported memory is required.
It is worth recalling that the dual-ported memory type architecture is also required in the case of
adopting a constant-geometry approach. This way, two data values from a memory can be
loaded every clock tick and the FFT issue slot is effectively utilized. Indexing both data values
independently helps loop folding. Each data path is 256-bits wide. While in Figure 4.5 the data
path for twiddle rearrangement has been shown as distinct from the twiddle fetching data path
for sake of clarity, both operations share the same LSU and data path.
Besides the above requirements, in order to compute the entire FFT in a contiguous
manner over time, separate Input and Output memories are used. During the first stage, data is
read from Input memory, processed and stored to Memory2. While the memory ping ponging
takes place between Memory2 and Memory1 for the rest of the FFT stages, the next set of
ensemble values is loaded into the Input memory via a DMA controller. In the last stage of the
FFT, the stage is fed either from Memory1 or Memory2 and stored to Output memory. This
arrangement is graphically depicted in Figure 4.6.
4.8 Bit reversed addressing in the last stage
The usage of a radix-2 in-place DIF algorithm warrants bit reversed addressing of data for the
last stage. Hence provision is made for an issue slot that performs the bit-reversal operation for
the last stage. However, taking into consideration that the specific case that needs to be
optimised for is a 128-point FFT that has a budget of about 94 clock cycles, the bit reversal type
addressing takes the form of a severe bottleneck. The 128-point FFT requires an order of 64
clock cycles for just storing data in a bit reversed manner into the Output memory [3]. Thus, for
the high-performance case of UWB-OFDM, the bit reversed addressing is almost prohibitive as
it dominates the cycle budget. Considering the problem at a system level, where the FFT
computation is followed by channel estimation and equalization, circumvents this issue. These
subsequent blocks are designed to procure their input data in a bit-reversed order from Output
memory. This obviates the need for the bit-reversal issue slot and makes the last stage's
computation vectorial and highly regular. Since the processor is described in TIM, a highly
modular hardware description language, the bit-reversal block is easily excluded prior to
committing the design to silicon.
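The bit-reversed addressing that the downstream blocks use when reading the Output memory can be sketched as follows; the function name is illustrative, and for a 128-point FFT the index is reversed over 7 bits.

```c
#include <assert.h>

/* Bit-reversed index for an N = 2^bits point FFT. The radix-2
 * in-place DIF algorithm leaves its output in bit-reversed order;
 * rather than spending ~N/2 cycles reordering, the channel
 * estimation and equalization blocks read the Output memory at
 * these indices directly. */
static unsigned bit_reverse(unsigned idx, int bits) {
    unsigned r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (idx & 1u); /* shift in the low bit of idx */
        idx >>= 1;
    }
    return r;
}
```

Bit reversal is its own inverse, so the consumers need no extra bookkeeping beyond applying the same index mapping when they fetch.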
4.10 Conclusion
In keeping with the design choices enumerated in this section, the processor description of the
ASIP is carried out in TIM, and the corresponding application kernel is written. A short
summary of the design choices is listed below:
1. A DIF algorithm is chosen over a DIT algorithm owing to its suitability for the twiddle
factor driven approach and its better SQNR. Also for the sake of SQNR, the data values
are shifted right by one bit every alternate stage.
2. Given cycle budget considerations, the bit-reversed addressing to be performed during
the last stage is omitted and compensated for during channel estimation and equalization.
3. For achieving cycle-budget, eight radix-2 butterflies are computed simultaneously.
Given a 16-bit data resolution requirement, this implies a 256-bit wide data path.
4. Twiddles are fetched and stored on processor memory as opposed to generating twiddles
dynamically using a CORDIC. Also incorporated is a scheme for twiddle ping ponging
and inter-stage rearrangement for vectorial handling of twiddles.
5. Dual ported memories are used as they allow the application kernel to adopt either the
constant geometry approach, or the in-place approach. The in-place approach is adopted
in present work with the aim of using a twiddle factor driven approach.
While writing the application kernel, the principle has been to adhere to intrinsic operations of
the processor rather than adopt regular C-language constructs that are later spatially compiled on
to the processor. The subsequent section deals with results.
Chapter 5
Results and Conclusions
In this section, results obtained from the current design are enumerated. This section begins with a
discussion on the Signal-to-quantisation noise ratio, formation of test input and pertinent details
of the test bench. This is followed by a discussion on the schedule, and results obtained from
synthesis. Lastly, a section suggesting possibilities worth exploring has been included.
The quality and capability of a processor can be proved only with an application kernel that is
written to suit the processor, and the quality of the kernel in turn is assessed by a host test-bench
written in the C-language. For simulations, data is initially loaded on to the processor memories
through an Application Program Interface (API) macro. A similar API macro is used to retrieve
data from the processor Output memory after the FFT is fully computed. In this case, the test
bench must also be scalable with the order of the FFT, and it generates a report.
Figure 5.2. Comparison between results as obtained from test bench and core (processor)
The sections of the report can be used for comparison as shown in the snapshot of Figure 5.2.
The FFT computation has a scaling factor of 1/√2 per stage, performed by a right shift every
alternate stage. The test bench keeps track of the extent of intermediate scaling that takes place
so that it is accounted for while examining the result. The test bench is also useful to provide a
step-by-step data check, which is useful in tracing errors. The simulation process for checking
functional accuracy is depicted in Figure 5.3. The test bench’s FFT subroutine is corroborated
with Matlab, and thereafter used to provide reference data for comparison.
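The scaling bookkeeping that the test bench performs can be sketched as below. This is a hedged sketch: which stages apply the shift (here, every even-numbered stage) and the function names are assumptions; the text only states that a one-bit right shift occurs every alternate stage.

```c
#include <assert.h>

/* Scaling bookkeeping sketch: the kernel shifts data right by one
 * bit every alternate stage to guard against overflow, so the test
 * bench must scale its reference by 2^(number of shifts) before
 * comparing. ASSUMPTION: the shift is applied on even stages. */
static int shifts_applied(int n_stages) {
    int shifts = 0;
    for (int s = 1; s <= n_stages; s++)
        if (s % 2 == 0)
            shifts++;
    return shifts;
}

/* overall divisor the reference data must be scaled by */
static int scale_factor(int n_stages) {
    return 1 << shifts_applied(n_stages);
}
```

For the 7-stage 128-point FFT this gives three shifts, i.e. the core's output is the reference divided by 8 under this assumption.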
The statistics file in its entirety is provided in Appendix 5.a, wherein each instruction number is
listed alongside the number of cycles that particular instruction is executed. As a result of
having to use many issue slots to facilitate ping-ponging of data, twiddle factor ping-ponging
and rearrangement, separate handling of input and output ensembles, etc., the required
instruction word is 618 bits wide. This figure is provided by Apex, the Silicon Hive tool that verifies the
consistency of the processor description. A few aspects come to the forefront. Firstly, there is a
relatively high percentage of overhead outside of the single-cycle folded loops. Since pipelining
happens within loop structures, the advantage takes on significant proportions when the inner
loop executes many iterations. However, with 8 butterflies executing in parallel, the inner loop
executes at most eight times per stage for a 128-point FFT. The second point is the small
difference between the number of program instruction words and the number of clock cycles.
Typically, there should be a significant difference between the two, with the number of cycles
being much greater than the number of program instructions. This is yet again owing to having
to optimise for a relatively small FFT size.
Given that the 128-point FFT application kernel is capable of executing in 105 cycles, a
recalculation is required to reassess the clock speed requirement, as shown below:
Time period for computation of the entire 128-point FFT (T) = 312.5 ns -- Equation 5.3
From Equations 5.2 and 5.3,
tc = 312.5 ns / 105 cycles = 2.98 ns per cycle
Clock speed (C.S.) required = 1 / tc ≈ 336 MHz -- Equation 5.5
Figure 5.4. Schedule view of the issue slot usage
5.3 Processor architecture
The processor description in TIM is quite abstract in nature. A graphical view that is useful to
visualize the processor architecture and extent of connectivity is obtained from Corebrowser, a
custom tool within Silicon Hive. Such a graphical view of the processor wherein levels of
hierarchy can be traversed is useful while writing the application kernel. It not only helps check
for a routing path for data between issue slots, but also helps in easily spotting mistakes in the
processor description files, such as in the labelling used to reference specific blocks explicitly
from the application kernel. A snapshot of the entire processor diagram, as generated by
Corebrowser, is shown in Figure 5.6.1, and the enlarged view for readability’s sake is depicted
in Figures 5.6.2 to 5.6.5. The processor diagram is the complete version of the schematic
architecture diagrams discussed in Figures 4.5 and 4.6 in section 4.7, and by Figure 5.5.
Figure 5.5 Schematic of FFT Processor Architecture in terms of the Silicon Hive template
The FFT cluster contains the LSUs for interfacing data and twiddle factor memories. Each data
handling LSU is accompanied by an ALU to locally generate data indices. The initialisation of
indices, and the indices for twiddle factors, is obtained from the general-purpose cluster, which
contains primarily arithmetic and logic capabilities. The FFT cluster also contains a separate
issue slot for twiddle rearrangement. The control cluster primarily handles the branching
instruction for loops.
corresponding to the processor. Beyond that point, the standard hardware synthesis flow is
adopted in generating a netlist, and estimating parameters such as gate count, area, power
dissipation, etc. Synthesis results were generated using Ambit, a synthesis tool from the
Cadence tool suite. Ambit provides details about the area occupied by the different blocks. The
technology setting was for 120 nm CMOS, and the generated results for the various blocks are
tabulated in Table 5.2.
For the 1.2V supply 120nm CMOS fabrication process, 180k gates reside within 1 mm2
of chip area. The extent of power dissipation can be calculated by using a formula stated in
Equation 5.6.
Power Dissipation (mW) = 0.92 × Area (mm2) × C.S. (MHz) -- Equation 5.6
The value of Clock Speed (C.S.) used is 336 MHz, as derived in Section 5.2. From Table 5.2,
the area corresponding to logic blocks comes to 2.03mm2. There would be a fractional increase
of approximately 20% after routing, giving an area of 2.44mm2. As can also be noted, the
dominant area is by the memories, here chosen to allow the processor capability to scale up to
an order of a 1024-point FFT. An advantage of the twiddle factor driven approach is that, when
not required, the twiddle factor memory can be turned off. This advantage becomes more
prominent towards the later stages of the FFT, as the number of butterflies that make use of the
same twiddle factor set increases. It must also be noted that the power dissipation values cited in
Table 5.1 are estimated for the case of full activity levels for all blocks. However, an example
representative of processor activity is that, out of the eight LSUs of the FFT cluster that handle
data, only four operate at a given moment. Thus, if clock gating were incorporated and sections
of the core were turned off when not in use, the power dissipation of the core would be almost
halved. With the average core activity measured to be in the range of 30%, the average power
dissipation will be 188.4 mW.
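Equation 5.6 can be written out directly; the activity factor parameter below is an assumption added to mirror the clock-gating discussion, and is not part of the equation as stated.

```c
#include <assert.h>
#include <math.h>

/* Equation 5.6: first-order core power estimate for the 1.2 V, 120 nm
 * process. The 0.92 coefficient is the technology constant from the
 * text; the activity factor (1.0 = full activity) is an assumption
 * added here to model clock gating and idle blocks. */
static double power_mw(double area_mm2, double cs_mhz, double activity) {
    return 0.92 * area_mm2 * cs_mhz * activity;
}
```

At the routed logic area of 2.44 mm2 and 336 MHz, full activity gives roughly 754 mW, which halving (per the clock-gating argument) and further derating toward the measured average activity brings into the neighbourhood of the 188.4 mW figure above.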
The area and power dissipation figures for the memory blocks, catering to a 1024-point FFT, yet
again assuming full activity levels are listed in Table 5.2. As only two of the four dual port data
ensemble memories operate in the average case, the power dissipation for the memories would
be in the range of 1.4 Watts.
Memory block | Area (mm2) | Power (mW)
Two 128 bit x 1024 single port C12FSRAM for Twiddle factors | 0.54 | 166.92
Five 128 bit x 128 single port C12FSRAM for Program memory | 0.45 | 139.10
Since the functionality of such an ASIP is due to the cohesive operation of both the
hardware and software sections, comparison with a non-programmable ASIC solution isn’t
entirely justified. However, to obtain a qualitative understanding of the ASIP, the area of
optimal context (i.e., 128-point FFT for UWB-OFDM) is compared with that of an ASIC
solution, the details of which are provided in Table 5.3.
Table 5.3. Area comparison with an ASIC solution
Design | Process technology feature size (nm CMOS) | Area of FFT module (mm2)
[1] | 180 | 4.22
Present ASIP | 120 | 2.44
The FFT module of the UWB MB-OFDM transceiver described in [1] uses a mixed-radix
approach to the 128-point FFT. In contrast, the present ASIP design is scalable with the order of
the FFT and has a radix-2 based method of attaining UWB performance requirements. An
aspect to be considered is that, with change in the order of the FFT, only the application kernel -
hence, program length, cycle count, etc. – experiences scaling, and the hardware costs remain
the same. Scalability is examined more in detail in section 5.5.
Figure 5.6.1. Graphical view of the processor as generated from TIM description files using Corebrowser
Figure 5.6.2. Graphical view of the processor as generated from TIM description files using Corebrowser (cont'd)
Figure 5.6.3. Graphical view of the processor as generated from TIM description files using Corebrowser (cont'd)
Figure 5.6.4. Graphical view of the processor as generated from TIM description files using Corebrowser (cont'd)
Figure 5.6.5. Graphical view of the processor as generated from TIM description files using Corebrowser (cont'd)
5.5 Examining Scalability
Besides attaining the UWB MB-OFDM standard's performance requirements, the ASIP is
designed with ease of scalability as a primary objective. The application kernel was compiled and
scheduled for the FFT sizes of 64 and 256. The cycle count for the different FFT sizes are
enumerated in Table 5.4.
Table 5.4. Scalability in terms of number of cycles
Order of FFT (N) Cycle count
64 54
128 105
256 216
These values of the actual schedule, as obtained from Manifold, are plotted in Figure 5.7 and
compared against the theoretical minimum cycle count that would be obtained if there were no
overhead. As visible from the graph lines in Figure 5.7, the overhead has an additive component
and a multiplicative component (manifesting as a difference in slopes).
Figure 5.7. Plot of scalability: number of cycles versus order of FFT (N), comparing the schedule obtained from the scheduler against the ideal (no overhead) case.
The percentage overhead decreases with better loop folding. Thus, for larger N - wherein the
number of loop iterations is greater - the extent of relative overhead is smaller. This is
illustrated in Figure 5.8.
Figure 5.8. Percentage overhead versus order of FFT (N): the overhead decreases with increasing FFT size.
To lend further perspective on the scalability of the present design, a comparison is performed
with an FFT implemented on a reconfigurable platform, described in [2]. The technology
described in [2] utilizes an array of reconfigurable cells, and hence the hardware is dynamically
retargetable. The results reported in [2] show that the solution outperforms other multimedia
processors and that its processing time is comparable with that of ASIC implementations. The
solution in [2] and the present work are juxtaposed in Table 5.5 with regard to various
parameters. As can be noted, the present ASIP outperforms the FFT implementation in [2] in all
aspects.
It’s further worth noting that the application kernel is fully scalable given an order of FFT N,
and the number of stages in the FFT – at a system level, both can be dispatched to the ASIP
from the host processor. On the hardware front, the FFT and generic clusters remain unchanged.
The size of the memory has to be increased to accommodate data and twiddles for the FFT.
Present area and power figures assume a maximum of a 1024-point FFT.
section were to receive data in polar form - as in magnitude ∠ phase - the functional units
become a lot simpler. This would be well suited to twiddle factors being generated on chip in
polar coordinates.
5.8 References
[1] Liu, H.Y.; Lin, C.C.; Lin, Y.W.; Chung, C.C.; Lin, K.L.; Chang, W.C.; Chen, L.H.; Chang,
H.C.; Lee, C.Y.; "A 480Mb/s LDPC-COFDM-Based UWB Baseband Transceiver," Proceedings
of the IEEE International Solid State Circuits Conference (ISSCC), February 2005, San
Francisco
Appendix 5.a.
Statistics Report of Scheduler for N = 128
PC cycles PC cycles PC cycles
---------------- ---------------- ----------------
0 1 44 1 88 3
1 1 45 1 89 1
2 1 46 1 90 1
3 1 47 1 91 1
4 1 48 1 92 1
5 1 49 1 93 1
6 1 50 1 94 1
7 1 51 1 ----------------
8 1 52 3 Total 105
9 1 53 1
10 1 54 1
11 1 55 1
12 3 56 1 runtime = 0.071185
13 1 57 1 ilp = 6.02857
14 1 58 1 instructions = 105
15 1 59 1 instrs/sec = 1475
16 1 60 1 operations = 633
17 1 61 1 ops/sec = 8892
18 1 62 1 stages = 962,
19 1 63 1 stages/sec = 13514
20 1 64 1
21 1 65 1
22 1 66 1
23 1 67 1
24 1 68 3
25 1 69 1
26 1 70 1
27 1 71 1
28 1 72 1
29 1 73 1
30 1 74 1
31 1 75 1
32 1 76 1
33 1 77 1
34 1 78 3
35 1 79 1
36 1 80 1
37 1 81 1
38 1 82 1
39 1 83 1
40 1 84 1
41 1 85 1
42 1 86 1
43 1 87 1
PC – Program Counter
List of Abbreviations
ASIP Application Specific Instruction set Processor
AGU Address Generation Unit
ALU Arithmetic Logic Unit
ANSI American National Standards Institute
API Application Program Interface
ASIC Application Specific Integrated Circuit
CMOS Complementary Metal Oxide Semiconductor
CORDIC Coordinate Rotation Digital Computer
DFT Discrete Fourier Transform
DIF Decimation In Frequency
DIT Decimation In Time
DSP Digital Signal Processing
DS-UWB Direct Sequence - UWB
FCC Federal Communications Commission
FFT Fast Fourier Transform
FIR Finite Impulse Response
FU Functional Unit
IEEE Institute of Electrical and Electronics Engineers
ILP Instruction Level Parallelism
IN Interconnection Network
IP Intellectual Property
IS Issue Slot
LAN Local Area Network
LSU Load/Store Unit
MBOA Multiband-OFDM Alliance
MB-OFDM Multiband-OFDM
MMSE Minimum Mean Square Error
OFDM Orthogonal Frequency Division Multiplex
PC Program Counter
PSE Processing and Storage Element
QPSK Quadrature Phase Shift Keying
SIMD Single Instruction Multiple Data
SFG Signal Flow Graph
SQNR Signal-to-Quantisation Noise Ratio
TIM The Incredible Machine (abstract hardware description language)
ULIW Ultra Long Instruction Word
UWB Ultra Wide Band
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuitry
VLIW Very Long Instruction Word
WAN Wide Area Network
WPAN Wireless Personal Area Network
Acknowledgements
First and foremost, I would like to convey my gratitude to Rene van Leuken, my supervisor at
the Delft University of Technology, for creating this opportunity at Silicon Hive. I am grateful
to him for being so instrumental in my undertaking of this graduation project. It has been a
wonderful seven months, and I owe it to his dynamism.
I gratefully thank Jos Huisken - my supervisor at Silicon Hive - for giving me the
opportunity to work on this exciting project. I am very grateful for our project discussions,
during which I had the opportunity to benefit from the wealth of his insight and experience.
The culmination of such a thesis was probably more of a burden on others at Silicon
Hive than on myself. To that end, I am very indebted to Marc Quax for mentoring me through the
design-flow and tool usage. He definitely deserves a medal for the patience he demonstrated
with a greenhorn such as myself. I'm very grateful to Ingolf Held for guiding me through the
nuances of the UWB-OFDM standard, and for contributing to the thesis report. I'm very
thankful for the inputs he provided me with on our many technical and non-technical
discussions.
In general, I thank all those at Silicon Hive and Philips who contributed to the thesis. It
is this stimulating environment - inadequately indicated by names, but surely including the
above - that I am deeply indebted to.
On a more personal note, I am very thankful to my parents - Mrs. Lalitha Chidambaram
and Mr. T. S. Chidambaram - for their constant support and encouragement. I'm also thankful to
all those who engraved their footprint on my life during this period: all my wacky housemates
at Galileistraat for keeping me sane, yet driving me wild at the same time; the Tamilian troupe at
Delft - Rajakumari Kumaraswamy and L. K. Madan; Richa Kumar for an awakening; Anup
Barve for the experiments in his kitchen; Deng Wei, Xiaodong Liu and the rest of the basketball
gang at Delft; the list is endless...
I dedicate this effort to those teachers, past and present, who shaped my future
without even knowing it. Last, but by no means least, I thank the one who orchestrated it all,
knowing it fully well - the guy upstairs and the best
teacher of them all!