An analytical approach for sizing of heterogeneous multiprocessor flexible platforms for iterative demapping and channel decoding

Vianney Lapôtre, Guy Gogniat, Jean-Philippe Diguet
Université de Bretagne Sud, CNRS Lab-STICC UMR 6285, Lorient, France
Email: [email protected]

Salim Haddad, Amer Baghdadi
Institut Telecom, Telecom Bretagne, CNRS Lab-STICC UMR 6285, Brest, France
Email: [email protected]

Abstract: Flexible baseband receivers are the focus of many research efforts that aim to enable the design of future multi-mode, multi-standard terminals. A main challenge in this domain is to provide this flexibility with minimum overhead in terms of area, speed, and energy. In this regard, heterogeneous multiprocessor platforms are emerging as a promising implementation solution. However, the heterogeneity of such platforms makes it complex to find the required number of processors supporting a specific configuration (i.e. requirements level). This paper investigates, in this context, the significant optimization potential both at design-time and at run-time regarding the selection of the most appropriate hardware configuration of a multiprocessor platform for iterative demapping and channel decoding. A formal representation of the architectural solution space, which allows designers to find the minimum hardware configuration, is proposed. The proposed approach is illustrated through a flexible multi-ASIP hardware platform for iterative demapping and channel decoding.

Keywords: Multiprocessor; ASIP; Self-adaptation; Wireless multi-standards receiver; Platform sizing; Run-time; Design-time

I. INTRODUCTION

Recent years have seen considerable evolution of wireless communication standards in the domain of cellular telephone networks, local/wide wireless area networks, and Digital Video Broadcasting (DVB). Besides the increasing requirements in terms of throughput and robustness against destructive channel effects, the convergence of services in a single smart terminal has become a crucial and challenging feature. As an example, the fourth generation (4G) of cellular wireless standards aims at providing a mobile broadband solution to laptop computer wireless modems, smartphones, and other mobile devices. Diverse features such as ultra-broadband Internet access, IP telephony, gaming services, and streamed multimedia will be provided.

In order to enable such advanced services at the algorithmic level, new state-of-the-art data processing techniques have been developed and adopted in the emerging wireless communication standards. At the architecture level, many efforts are being conducted towards the design of flexible high-throughput hardware platforms which can be adapted to the required configuration. The overall flexibility of the radio platform can be achieved through the flexibility of individual components at the transmitter side (encoder, interleaver, mapper, etc.) and at the receiver side (demapper, deinterleaver, decoder, etc.). In this context, heterogeneous multiprocessor platforms [1], [2], [3] have been widely adopted. These platforms usually integrate different tiles that provide high performance and high flexibility to meet service requirements. ASIP-based tiles have been adopted to provide flexible and powerful solutions. For example, in [1], a 10.8 Mbps ASIP core is used for turbo decoding. However, the high throughput requirement of emerging services imposes the efficient exploitation of different parallelism levels. Several recent works propose multiprocessor approaches to build these tiles [4], [5], [6], [7]. In this work we investigate the sizing of a heterogeneous multiprocessor flexible platform for iterative demapping and channel decoding and propose a novel approach for efficient design-time and run-time sizing. We illustrate how, for a given level of requirement, several architecture alternatives with different numbers of processors exist. A formal representation of the architectural solution space is proposed. This formulation enables the designer to find the most efficient hardware configuration. Based on this formal representation, the architecture can be chosen both at design-time and at run-time according to an optimization objective which could be, for example, minimizing the number of processors, reducing the active area on the chip, or reducing the clock frequency.

The proposed approach is illustrated through a flexible multi-ASIP hardware platform for iterative demapping and channel decoding. This platform integrates two different types of ASIPs (Application-Specific Instruction-set Processors): one for demapping, called DemASIP, and the second for turbo decoding, called DecASIP. This paper presents the following contributions:
• A formal representation of the architectural solution space is proposed.
• A method to apply this formal representation at design-time and at run-time is defined.
• A use case that demonstrates the interest of the proposed method to reduce the chip area at design-time and the active area at run-time is presented and evaluated.

The rest of the paper is organized as follows. Section II provides an overview of relevant literature. Section III presents the system model and the configuration parameters. Section IV describes the proposed formal representation of the architectural solution space, which allows the designer to size the platform depending on the system configuration. Section V evaluates the impact of design-time sizing on the chip area and the impact of run-time sizing on the active area for different receiver configurations. Finally, Section VI provides a discussion on the proposed work and concludes the paper.

II. STATE OF THE ART

The high throughput requirement of emerging services imposes the efficient exploitation of different parallelism levels. In this context, multiprocessor architectures [4], [5], [6], [7] are a promising approach to reach high flexibility, high throughput and energy efficiency. In [4], a heterogeneous architecture for convolutional and turbo decoding consisting of a dedicated 150 Mbps IP block and a cluster of ASIPs is presented. The dedicated IP block is used when high throughput is required while the ASIPs are used for lower throughput. Although the authors briefly describe a multi-ASIP architecture in which the ASIPs are connected through a crossbar, the sizing of such an architecture is not addressed. The presented results are limited to two ASIPs that share a memory. In [5] and [6], the authors present a multi-ASIP platform for decoding in which the ASIPs are connected through a Network

978-1-4673-2921-7/12/$31.00 © 2012 IEEE


Figure 1. Usage scenario example. The considered heterogeneous multiprocessor platform integrates two different types of processors (Proc1 and Proc2) that perform the demapping and decoding algorithms, respectively. [Plot: requirement level (e.g. throughput, BER) versus SNR (dB) for Config1 (HD Media), Config2 (Web Browsing) and Config3 (Voice conversation), annotated with architecture alternatives such as "17 Proc1 + 10 Proc2" or "3 Proc1 + 2 Proc2".]

on Chip. The high flexibility of such architectures allows dynamic reconfiguration at run-time but the run-time sizing task is not addressed. In [5], different decoding tasks can be mapped at run-time on different ASIPs but no methodology to define the number of processors necessary to perform a given configuration is presented. As in [7], recent works propose to combine several functionalities, like decoding and demapping, in a multi-ASIP heterogeneous platform. Unfortunately, the sizing of such platforms is not well addressed in the literature. We assume that the designer, based on his background and simulations, has to deal with the sizing of these complex heterogeneous platforms. However, this approach could provide sub-optimal solutions and decrease the sizing flexibility at run-time since all the decisions are taken at design-time. In fact, flexible hardware multiprocessor platforms for iterative demapping and channel decoding are generally designed to support a set of communication standards which correspond to some specific application needs and usage scenarios. Each usage scenario corresponds to particular requirements, for example in terms of throughput, latency, and error rates. Fig. 1 gives an example of such a usage scenario, which corresponds to a mobile terminal supporting different services (High Definition Multimedia, Web Browsing, Voice Conversation) under different channel conditions. Hence, at design-time, the platform must be dimensioned to support the highest requirements while, at run-time, the number of processors can be chosen depending on the current level of requirements. Furthermore, for a heterogeneous multiprocessor platform, and for a specific requirement level, several architectural configurations (i.e. with a different number of each type of processor) exist (Fig. 1). The exploration of the alternatives and the selection of the most appropriate one is a complex task. However, it represents an important optimization opportunity which is not investigated in the literature. In this paper we address this point and propose a formal approach to find the optimal configuration.

III. SYSTEM MODEL AND CONFIGURATION

In order to illustrate the proposed approach for sizing of heterogeneous multiprocessor flexible platforms for iterative demapping and channel decoding, we consider in this paper the communication system model of Fig. 2. It consists of a multi-mode advanced wireless communication system integrating convolutional turbo codes, Bit Interleaved Coded Modulation (BICM), various modulation schemes, and Signal Space Diversity (SSD). A brief presentation of the system model and the considered parameters is given in this section.

Figure 2. System model with TBICM-ID-SSD.

A. System model

Fig. 2 presents a simplified structure of the transmitter, the channel, and the receiver. On the transmitter side, information bits $U$, which are called systematic bits, are regrouped into symbols $u_i$ consisting of $k$ bits, and encoded with a $q$-binary turbo encoder. It consists of a parallel concatenation of two identical convolutional codes (PCCC). The output codeword $C$ is then punctured to reach a desired coding rate $R_c$. In order to gain resilience against error bursts, the resulting sequence is interleaved using an S-random interleaver $\Pi_2$. Punctured and interleaved bits denoted by $v_i$ are then Gray mapped to complex channel symbols $s_q$ chosen from a $2^M$-ary constellation $X$, where $M$ is the number of bits per modulated symbol. Applying the SSD consists of a rotation of the constellation followed by a signal space component interleaving.

At the receiver side (which is the topic of this paper), the operations corresponding to those of the transmitter are applied in reverse order. However, in order to meet the increasing requirements in terms of reduced error rates, iterative processing is considered at two levels. The first level is at the channel decoding, by adopting a turbo decoding process. The second level is between the channel decoder and the soft demapper. In fact, besides the extrinsic information exchange inside the channel turbo decoder, additional extrinsic information is fed back as a priori information used by the demapper to improve the symbol-to-bit conversion. Thus, the receiver model, denoted as TBICM-ID-SSD, implements iterative demapping with turbo decoding.

B. Receiver configuration

A flexible software model of the whole system of Fig. 2 (transmitter, channel, and receiver) was developed. This model supports many parameters corresponding to the constellation type and modulation order, interleaving laws, turbo code type, code rate, and frame size.

Furthermore, the receiver can be configured to execute iterative or non-iterative demodulation. For the case of iterative demodulation, state-of-the-art implementations apply one turbo code iteration for each demapping iteration [8]. Thus, the number of demapping iterations ($it_{dem}$) is equal to the number of turbo decoding iterations ($it_{dec}$) in this case. This number of iterations constitutes another flexible parameter of the system model. On the other hand, for non-iterative demodulation $it_{dem}$ will be equal to 1.
The configuration of these flexible parameters is generally constrained by the available communication standard, the channel condition, and the target system requirements in terms of throughput, latency, and error rate performance. The determination of their values should also take complexity into consideration in order to select the most efficient configuration (as many solutions generally exist). This task is out of the scope of this paper. However, in order to define the suitable system configurations of the usage scenario that will be considered in Section V, communication system experts were consulted and extensive simulations were conducted.

Based on the system model, the next section proposes a generic architecture model and a formal method for an efficient sizing of such platforms.

IV. PLATFORM SIZING

Multi-standard and multi-mode platforms have to be able to self-adapt when application requirements and the environment evolve at run-time. A configuration is defined by the communication parameters, which are chosen in accordance with the application requirements and the environment in which the communication is established. In this section we propose formal expressions which allow designers to optimize the receiver architecture by computing the required number of processors depending on each configuration. This point is essential as it enables designers to formally explore potential architectures that will meet performance constraints.

A. Generic heterogeneous multiprocessor architecture model

Fig. 3 presents the generic architecture of a flexible multiprocessor hardware platform for iterative demapping and channel decoding. The aim of this platform is to provide a flexible and dynamic solution compared to existing ones [1], [2], [3] (generally based on hardware accelerators), in which the designer can tune the number of resources both at design-time and at run-time. As will be shown, such an approach allows the system to meet performance constraints without losing its flexibility. These features will be mandatory for future communication systems. In Fig. 3, DemProc and DecProc perform the demapping and decoding algorithms, respectively. These two processors are characterized by their area, their maximum frequency, and their performance, defined by the number of cycles needed to demap or decode one modulated or coded symbol, respectively. The platform integrates a communication interconnect that allows extrinsic information exchanges (between DecProcs themselves and between DecProcs and DemProcs). In this paper, we assume that the communication interconnect is designed for the worst case configuration, in which all processors exchange data at the same time, and that it is congestion and conflict free.

Figure 3. Generic architecture of the heterogeneous multiprocessor receiver. In this configuration, 2 DemProcs and 4 DecProcs are not used. [Block diagram labels: Global Receiver Controller; control of input channel data; input channel data; seven DemProcs; communication interconnect ($\Pi_2$, $\Pi_2^{-1}$); seven DecProcs; communication interconnect ($\Pi_1$, $\Pi_1^{-1}$); decoded bits.]

B. Formal representation of the architectural solution space

The generic architecture of Fig. 3 can be abstracted as two components: one demapper and one decoder. Each component uses several processors in parallel to perform the frame computation, exploiting sub-block parallelism. These two components are serially connected. The time required to process one frame ($T_{syst}$) corresponds to the sum of the time required by the demapper ($T_{dem}$) and the time required by the decoder ($T_{dec}$) to execute all their iterations on the frame. It can be expressed as:

$$T_{syst} = T_{dem} + T_{dec} = N_{dem} \cdot T_{dem/symb} + N_{dec} \cdot T_{dec/symb} \tag{1}$$

where $N_{dem}$ and $N_{dec}$ represent, respectively, the number of modulated and coded symbols per frame. $T_{dem/symb}$ and $T_{dec/symb}$ represent the time required by the demapper and the time required by the decoder to execute all their iterations on one modulated and one coded symbol, respectively. Hence, the system throughput ($D_{syst} = N_{dec}/T_{syst}$) can be expressed as below.

$$D_{syst} = \frac{D_{dem} \cdot D_{dec} \cdot N_{dec}}{N_{dem} \cdot D_{dec} + N_{dec} \cdot D_{dem}} \tag{2}$$

where $D_{dem}$ ($= 1/T_{dem/symb}$) and $D_{dec}$ ($= 1/T_{dec/symb}$) are the demapper and the decoder throughputs (in modulated and coded symbols, respectively). In fact, considering the code rate $R_c$ and the number of bits per symbol $M$, the relation between the number of coded symbols ($N_{dec}$) and the corresponding number of modulated symbols ($N_{dem}$) can be written as follows.

$$N_{dem} = \frac{q}{M \cdot R_c} \cdot N_{dec} = \alpha \cdot N_{dec} \tag{3}$$

where $q$ depends on the coding scheme ($q = 1$ for a simple binary turbo code and $q = 2$ for a double binary turbo code). Introducing this expression of $N_{dem}$ into equation (2) gives the following system throughput expression.

$$D_{syst} = \frac{D_{dem} \cdot D_{dec}}{D_{dem} + \alpha \cdot D_{dec}} \tag{4}$$

The throughput of the system $D_{syst}$ is generally imposed by the application requirement. On the other hand, the throughputs of the demapper and the decoder depend on the number of processors, the number of iterations, the number of clock cycles required to process one symbol, and the clock frequency. They can be expressed as follows.

$$D_{dem} = \frac{Nb_{demProc} \cdot F_{dem}}{it_{dem} \cdot cycles_{dem/symb}} \tag{5}$$

where $Nb_{demProc}$ is the number of demapping processors, $it_{dem}$ is the number of demapping iterations, $cycles_{dem/symb}$ is the number of cycles necessary to demap one symbol, and $F_{dem}$ is the clock frequency.

$$D_{dec} = \frac{Nb_{decProc} \cdot F_{dec}}{2 \cdot it_{dec} \cdot cycles_{dec/symb}} \tag{6}$$

where $Nb_{decProc}$ is the number of decoding processors, $it_{dec}$ is the number of decoding iterations, $cycles_{dec/symb}$ is the number of cycles necessary to decode one symbol, and $F_{dec}$ is the clock frequency.
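As a numerical check of equations (4), (5) and (6), the following minimal Python sketch evaluates the throughput of one architecture alternative. The function names are illustrative and not from the paper. Three assumptions are made here: the code is double binary (q = 2), so that 200 Mbps corresponds to 100 million coded symbols per second; the product it_dec · cycles_dec/symb is taken as seven iterations at 1.75 cycles plus a last one at 0.75 cycles; and the processor counts are those listed for n = 1 in Table I.

```python
from fractions import Fraction


def demapper_throughput(nb_dem_proc, f_dem_hz, it_dem, cycles_dem_per_symb):
    """Eq. (5): demapper throughput in modulated symbols per second."""
    return Fraction(nb_dem_proc * f_dem_hz, it_dem * cycles_dem_per_symb)


def decoder_throughput(nb_dec_proc, f_dec_hz, dec_cycles_all_iterations):
    """Eq. (6): decoder throughput in coded symbols per second.

    dec_cycles_all_iterations stands for the product it_dec * cycles_dec/symb.
    """
    return Fraction(nb_dec_proc * f_dec_hz) / (2 * dec_cycles_all_iterations)


def system_throughput(d_dem, d_dec, alpha):
    """Eq. (4): throughput of the serially connected demapper and decoder."""
    return d_dem * d_dec / (d_dem + alpha * d_dec)


# Parameters from the caption of Table I (QPSK, Rc = 1/2, 300 MHz, 8 iterations).
F_HZ = 300_000_000
ALPHA = Fraction(2) / (2 * Fraction(1, 2))  # eq. (3) with q = 2 (assumed), M = 2
# Assumption: 7 iterations at 1.75 cycles plus a last iteration at 0.75 cycles.
DEC_CYCLES = 7 * Fraction(7, 4) + Fraction(3, 4)
TARGET = 100_000_000  # 200 Mbps expressed in 2-bit symbols per second (assumed)

d_dem = demapper_throughput(64, F_HZ, it_dem=8, cycles_dem_per_symb=6)
d_syst_18 = system_throughput(d_dem, decoder_throughput(18, F_HZ, DEC_CYCLES), ALPHA)
d_syst_17 = system_throughput(d_dem, decoder_throughput(17, F_HZ, DEC_CYCLES), ALPHA)

print(d_syst_18 >= TARGET)  # does 64 DemProcs + 18 DecProcs meet the target?
print(d_syst_17 >= TARGET)  # does the same platform with 17 DecProcs meet it?
```

Exact rational arithmetic is used so that the comparison against the target is not affected by floating-point rounding.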
It is worth noting that the linear increase in throughput with the number of decoding processors is limited due to the sub-block initialization issue [9]. This limitation, which depends on the target frame size and code rate, should be considered in the platform sizing. However, this issue is not encountered in the demapping sub-block parallelism.

In order to establish a relation between the demapping time and the decoding time, we define the ratio $n$ as follows.

$$n \cdot T_{dem} = T_{dec} \tag{7}$$

From this equation we can obtain a relation between the throughputs of the demapper and the decoder:

$$n \cdot \frac{N_{dem}}{D_{dem}} = \frac{N_{dec}}{D_{dec}}$$

$$D_{dem} = D_{dec} \cdot n \cdot \frac{N_{dem}}{N_{dec}}$$

$$D_{dem} = D_{dec} \cdot n \cdot \alpha \tag{8}$$

We deduce from (8) and (4) the equations which link the throughput of the system with the throughputs of the demapper and the decoder:

$$D_{dec} = \frac{n+1}{n} \cdot D_{syst} \tag{9}$$

$$D_{dem} = \alpha \cdot (n+1) \cdot D_{syst} \tag{10}$$

Finally, from equations (5) and (6) we can express $Nb_{demProc}$ and $Nb_{decProc}$ as follows.

$$Nb_{demProc} = C_{dem} \cdot D_{dem} \tag{11}$$

$$Nb_{decProc} = C_{dec} \cdot D_{dec} \tag{12}$$

where

$$C_{dem} = \frac{it_{dem} \cdot cycles_{dem/symb}}{F_{dem}} \quad \text{and} \quad C_{dec} = \frac{2 \cdot it_{dec} \cdot cycles_{dec/symb}}{F_{dec}}$$

depend on the system configuration and the processor parameters. Replacing $D_{dem}$ and $D_{dec}$ by their expressions from equations (10) and (9) makes it possible to compute the number of processors necessary for a given configuration and a given $n$.

$$Nb_{demProc} = C_{dem} \cdot \alpha \cdot (n+1) \cdot D_{syst} \tag{13}$$

$$Nb_{decProc} = C_{dec} \cdot \frac{n+1}{n} \cdot D_{syst} \tag{14}$$

n       Nb_demProc   Nb_decProc
0.25    40           44
0.75    56           21
1       64           18
1.25    72           16
1.75    88           14

Table I. Architecture alternatives as a function of n. Example for: D_syst = 200 Mbps, QPSK, R_c = 0.5, it_dem = it_dec = 8, cycles_dem/symb = 6, cycles_dec/symb = 1.75 (0.75 for the last iteration), F_dec = F_dem = 300 MHz.

Table I illustrates, for a given configuration, how different values of $n$ lead to different architecture alternatives, all of them achieving the target throughput and supporting the target system configuration. Depending on $n$, we observe that the architecture alternatives can be quite different. For example, when $n = 0.25$ the architecture consists of 40 processors for demapping and 44 processors for decoding, while when $n = 1.25$, 72 processors for demapping and 16 processors for decoding are necessary. It is essential, both at design-time and at run-time, to determine the value of $n$ which optimizes the use of resources. The optimization goal depends on the designer's priorities and could be, for example, the number of processors used for each possible configuration, the total area of the chip, or the clock frequency for each type of processor. In this paper we extend the previous equations in order to optimize the total area of the chip at design-time. The same optimization can be applied at run-time in order to reduce the active area for the configurations performed on the platform.

C. Area optimization

Heterogeneous processors typically have different areas and performances. One main optimization objective is to determine the number of DemProcs and DecProcs that minimizes the receiver area for a given configuration. The total area of the receiver depends on $n$. It can be computed using the expression below.

$$A_n = A_{dem} \cdot Nb_{demProc} + A_{dec} \cdot Nb_{decProc} \tag{15}$$

where $A_{dem}$ and $A_{dec}$ are the areas of one DemProc and one DecProc, respectively. Therefore, by substituting equations (11), (12) and (8) into equation (15), $A_n$ can be expressed as a function of $C_{dem}$ and $C_{dec}$.

$$A_n = (C_{dec} \cdot A_{dec} + C_{dem} \cdot A_{dem} \cdot \alpha \cdot n) \cdot D_{dec} \tag{16}$$

On the other hand, using equation (4), $D_{dem}$ can be expressed as:

$$D_{dem} = \frac{\alpha \cdot D_{syst} \cdot D_{dec}}{D_{dec} - D_{syst}} \tag{17}$$

Moreover, $D_{dec}$ can be expressed as a function of $D_{syst}$ and $n$ by equating equations (8) and (17).

$$D_{dec} = \frac{D_{syst} \cdot (n+1)}{n} \tag{18}$$

Finally, $A_n$ can be expressed as a function of $n$ by substituting the expression of $D_{dec}$ above into equation (16).

$$A_n = \frac{a \cdot n^2 + b \cdot n + c}{n} \tag{19}$$

where $a = C_{dem} \cdot A_{dem} \cdot D_{syst} \cdot \alpha$, $c = C_{dec} \cdot A_{dec} \cdot D_{syst}$, and $b = a + c$.

The derivative of equation (19) with respect to $n$, $dA_n/dn = a - c/n^2$, is then computed. Only one extremum ($n_{ext}$) is found for $n > 0$.

$$n_{ext} = \sqrt{\frac{c}{a}} = \sqrt{\frac{2 \cdot it_{dec} \cdot cycles_{dec/symb} \cdot F_{dem} \cdot A_{dec}}{it_{dem} \cdot cycles_{dem/symb} \cdot F_{dec} \cdot A_{dem} \cdot \alpha}} \tag{20}$$

The second derivative is also computed at $n_{ext}$. It is positive, so $n_{ext}$ corresponds to the minimum area ($A_{n_{ext}}$) of the receiver. Finally, $A_{n_{ext}}$ can be expressed as:

$$A_{n_{ext}} = a + c + 2\sqrt{a \cdot c} \tag{21}$$

For a given configuration, $n_{ext}$ is determined with equation (20). With the obtained value of $n_{ext}$, the numbers of DemProcs and DecProcs which minimize the area can be calculated using equations (13) and (14). The number of processors is then rounded up to guarantee the throughput constraint. Note that, due to the sub-block initialization issue [9], the number of DecProcs is limited by the maximum number of frame sub-blocks that can be extracted from the entire frame. If the number of processors determined is higher than this limit, their number is saturated in accordance with the maximum level of available parallelism and the corresponding
number of DemProcs is computed with respect to the throughput requirement.

Based on the set of equations above it is now possible to analyze how the system can be tuned both at design-time and at run-time to meet performance requirements for a given configuration.

D. Design-time sizing

Platform sizing at design-time allows designers to determine the hardware configuration which minimizes the total area of the chip. This objective has been considered as it strongly impacts the cost of the chip. The design space of potential configurations is too large to allow designers to efficiently explore all possible architecture alternatives. So first, the designer needs to list the critical configurations that will be executed on the platform. These configurations will require the largest number of operations to demap and decode a frame. Then, equations (20), (13) and (14) are successively used to determine, for each critical configuration, the architecture alternative which minimizes the total area. Finally, the number of processors implemented on the chip is the maximum number of DemProcs and DecProcs among the different hardware configurations.

E. Run-time sizing

Platform sizing at run-time makes it possible to determine the hardware configuration which optimizes the resource usage. Several objectives can be addressed using the equations previously described. When a configuration has to be executed on the platform, the architecture alternative which optimizes the objective can be determined using the proposed equations. Configuration parameters are used to determine the $n_{ext}$ which optimizes the objective. Then, $n_{ext}$ is used in equations (13) and (14) to compute the required number of DemProcs and DecProcs. If the number of processors required to perform the configuration is lower than the platform capacity, which is defined at design-time, unused cores can, for example, be switched off as they will not be used during the execution of the current configuration.

In this section, we have proposed formal equations to explore the alternative architectures of a heterogeneous multiprocessor receiver. These equations have been applied to a particular optimization objective in order to optimize the total area used for a given configuration. We have also explained how to consider such a solution both at design-time and at run-time. The next section presents the results of this method on a typical heterogeneous multi-ASIP receiver.

V. CASE STUDY AND RESULTS

A. Multi-ASIP platform

In order to apply and evaluate the proposed approach we consider the heterogeneous multi-ASIP receiver platform presented in [7]. This platform integrates two types of ASIPs which perform the main functions of iterative demodulation and turbo decoding.

The first, called DemASIP [10], is dedicated to the Max-Log-MAP demapping algorithm. This ASIP can be used for multiple modulation schemes adopted at the transmitter side. The DemASIP provides support for BPSK to 256-QAM constellations for any mapping style, with or without SSD. Depending on the modulation scheme, the time to demap one symbol ranges from 6 to 258 clock cycles. The second, called DecASIP [6], performs the Max-Log-MAP decoding algorithm. It supports convolutional turbo codes up to eight-state double binary turbo codes or sixteen-state simple binary codes. It is able to decode a symbol in 1.75 clock cycles in the turbo-demodulation scheme and in 0.75 clock cycles when only turbo decoding is applied.

Based on this platform and in order to demonstrate the benefits of the proposed approach, a representative use case has been developed.

B. Use Case Study

The considered use case targets, at the output of the channel decoder, a throughput of 200 Mbps with BER performance between $10^{-5}$ and $10^{-6}$ for an SNR range from 3 dB to 13 dB. It can correspond to a future wireless HD Media service in a mobility context (e.g. during a train trip).

In order to find the suitable system parameters of the flexible communication system model of Fig. 2 with respect to this use case, extensive simulations were conducted. Results of this step (out of the scope of this paper, cf. sub-section III-B) correspond to the configurations described in Table II. The frequency of each ASIP is 300 MHz. The other parameters evolve depending on the environment conditions: Conf. 1 corresponds to severe channel conditions (i.e. lowest SNR) whereas Conf. 4 corresponds to good channel conditions (i.e. highest SNR).

Conf.   Mod.      Throughput (Mbps)   Freq. (MHz)   it_dem   it_dec   R_c
1       QPSK      200                 300           8        8        1/2
2       16-QAM    200                 300           6        6        1/2
3       64-QAM    200                 300           1        8        1/2
4       64-QAM    200                 300           1        9        2/3

Table II. Use case configurations.

C. Results

The first step of platform sizing is performed at design-time. For this purpose, the method described in sub-section IV-D is used to determine the best hardware sizing for the critical configurations which will be performed on the platform. For the scenario previously explained, the critical configurations are Conf. 1 and Conf. 2. They require a higher number of iterations than Conf. 3 and Conf. 4 (Table II). These configurations determine the maximum number of ASIPs which need to be implemented on the chip. Table III shows the comparison between the number of processors and the total area of the chip using the proposed method and an approach where $n_{ext}$ is not defined. In that last case, the designer may not be able to efficiently tune the ratio between the time to demap a frame and the time to decode a frame in order to minimize the total area of the chip. A manual exploration based on the designer's experience can still be performed in order to test several ratios, but such an approach is time consuming and there is no guarantee of finding the best one. Thus a default value of $n = 1$ is generally used; $n = 1$ means that the time to demap a frame is equal to the time to decode a frame. The last row of Table III corresponds to the number of processors that have to be implemented to support the highest requirements. It is determined by selecting the maximum number of processors needed to perform the critical configurations. For this case, results show that using our formal equations saves 9.6% of the total area by implementing 55 DemASIPs and 23 DecASIPs instead of 72 DemASIPs and 18 DecASIPs when a default value of $n$ is used.
        Proposed method                                    No exploration
Conf.   n      Nb_demASIP   Nb_decASIP   Area (mm²)        n    Nb_demASIP   Nb_decASIP   Area (mm²)   Gain (%)
1       0.63   53           23           8.75              1    64           18           9.1          3.8
2       0.51   55           19           8.35              1    72           13           9.15         8.7
Chip    -      55           23           8.95              -    72           18           9.9          9.6

Table III. Design time: application of the proposed method on the two critical configurations of the considered case study. A_dec = 0.15 mm² and A_dem = 0.1 mm² (90 nm CMOS).
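The "Chip" row of Table III follows from the per-configuration rows by taking, for each processor type, the maximum count over the critical configurations. A short Python sketch of this aggregation, using only the values printed in Table III (the helper name `chip_size` is illustrative), is given below.

```python
from fractions import Fraction

A_DEM = Fraction(1, 10)  # mm^2 per DemASIP (caption of Table III)
A_DEC = Fraction(3, 20)  # mm^2 per DecASIP (caption of Table III)


def chip_size(per_config_counts):
    """Return (Nb_demASIP, Nb_decASIP, area) covering every listed configuration."""
    nb_dem = max(dem for dem, _ in per_config_counts)
    nb_dec = max(dec for _, dec in per_config_counts)
    return nb_dem, nb_dec, nb_dem * A_DEM + nb_dec * A_DEC


# (Nb_demASIP, Nb_decASIP) for Conf. 1 and Conf. 2, copied from Table III.
proposed = chip_size([(53, 23), (55, 19)])
no_exploration = chip_size([(64, 18), (72, 13)])

gain_percent = 100 * (no_exploration[2] - proposed[2]) / no_exploration[2]

print(proposed[:2], float(proposed[2]))
print(no_exploration[:2], float(no_exploration[2]))
print(round(float(gain_percent), 1))
```

Because the maximum is taken independently for each processor type, the resulting chip (55 DemASIPs, 23 DecASIPs) is larger than either per-configuration optimum, which is why the chip-level area exceeds both 8.75 mm² and 8.35 mm².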
Conf.   n      NbdemASIP  NbdecASIP  Active area (mm²)
1       0.63   53         23         8.75
2       0.51   55         19         8.35
3       0.91   29         9          4.25
4       1.1    24         9          3.75
Table IV
RUN TIME: APPLICATION OF THE PROPOSED METHOD.
Adec = 0.15 mm² AND Adem = 0.1 mm² (90 nm CMOS).

Once platform sizing is performed, the various required configurations
(Table II) can be selected at run-time. The number of ASIPs that
a configuration mode requires can be calculated at run-time on
a GRC processor (Global Receiver Controller, Fig. 3). As the
complexity of the proposed approach is very low (constant time),
it allows a very efficient analysis of the best configuration at
run-time. This point will be mandatory to meet real-time constraints
for future adaptive communication systems. Once a new configuration
has been computed, the whole platform can be reconfigured. The
reconfiguration mechanism itself is out of the scope of this paper,
but the general schedule can be sketched. The ASIPs that will be
used to perform the new configuration can be loaded with the
appropriate parameters and program, whereas the rest of the ASIPs
will be idle. Once the right configuration is available, the
computation starts. Depending on the context, the unused ASIPs can,
for example, be powered down to reduce the total power consumption.
Table IV shows the number of ASIPs necessary to perform the
different configurations using our approach to reduce the active
area. Results demonstrate a significant reduction of the active
area when configurations corresponding to low requirements are
performed. For example, in the case of Conf. 4, only 3.75 mm² of
the chip has to be activated, while 8.75 mm² are necessary for the
highest requirements, corresponding to Conf. 1. Such an approach
makes it possible to build optimization strategies, for example to
tune power consumption or to minimize platform aging.

VI. FINAL DISCUSSION AND CONCLUSIONS
Heterogeneous multiprocessor platforms for iterative demapping
and channel decoding provide high performance and high flexibility
to perform several configurations. Moreover, they provide
promising solutions to be integrated in future flexible baseband
receivers. Unfortunately, the first degree of flexibility of a
multiprocessor system (i.e. the number of processors used for a given
configuration) is currently not taken into account. The platforms
are generally statically sized at design-time to reach a given
maximum requirement and used, at run-time, without changing
the architecture configuration. In this context, the proposed work
provides an efficient method for platform sizing which could be
used both at design-time and run-time. Depending on the actual
requirements, this method allows a dynamic sizing at run-time
which optimizes the resource management of the platform.
In this paper we propose an approach for efficient sizing of
heterogeneous flexible multiprocessor platforms for iterative
demapping and channel decoding. In fact, for a given communication
requirement many architecture alternatives exist, and selecting the
right one at design-time and at run-time is an essential issue. The
proposed approach defines the mathematical expressions which exhibit
the number of heterogeneous cores and their features. It has been
applied on a flexible multi-ASIP hardware platform for iterative
demapping and channel decoding. Results analysis demonstrates
a reduction of the chip area of 9.6% compared to an approach
in which the alternative architectures presented in this paper are
not explored. Future work targets the extension of the model with
more functionalities, like equalization, the application of the
proposed sizing approach on a dynamically reconfigurable platform,
and the construction of optimization strategies to dynamically
adapt the configuration.
