Paper - A Reconfigurable Architecture For MIMO Detection Using CORDIC Operator
Abstract— A reconfigurable architecture for MIMO V-BLAST (Vertical Bell Laboratories Layered Space-Time) detection based on the square root algorithm is presented. The decoder supports MIMO systems with various numbers of antennas, different throughputs and different signal constellations. The decoder architecture is based on a variable number of CORDIC (COordinate Rotation DIgital Computer) operators. The system prototype of the decoder reaches a data rate of 600 Mbit/s on a Xilinx Virtex-II FPGA for a 2-antenna system with a QPSK signal constellation.

Index Terms—MIMO, V-BLAST, Square Root algorithm, CORDIC.

I. INTRODUCTION

To meet the demand for higher data rates and more system capacity without increasing bandwidth, an emerging technology called Multiple-Input Multiple-Output (MIMO) has appeared. It is well known that an extraordinary spectral efficiency can be achieved in a MIMO system [1]. MIMO is one of the most promising technologies to improve the performance of a wireless link. For example, it will be adopted in the next phase of the 3GPP (3rd Generation Partnership Project) standards in order to further increase the HSDPA (High-Speed Downlink Packet Access) system capacity and enhance the quality of Internet and multimedia services. MIMO is also a candidate to deliver the high performance expected of 4G broadband wireless for future mobile services [2]. In order to be used in these wireless standards, future MIMO systems will need to support multiple air interfaces and modulation formats. These are the reasons for the recent interest in reconfigurable architectures for MIMO systems.

MIMO decoders are generally implemented using DSP (Digital Signal Processing) processors. But DSP processors cannot achieve high throughput in highly parallel 3G/4G applications. Traditional ASIC implementations are the most computationally efficient, but they are not flexible enough for the wide diversity of future systems. FPGAs are widely used in signal processing because of their reconfigurability and support of parallelism. An implementation of the square root algorithm was realized by Z. Guo in [5], but it is not adaptable to the different requirements of future systems. We propose here an FPGA implementation of the square root algorithm for V-BLAST detection which is based on a variable number of CORDIC operators. We will show that this square root detector can be reconfigured to adapt to various numbers of antennas, different signal constellations and different throughputs.

In this paper we first give an overview of MIMO detection techniques in Section II. The square root algorithm is briefly described in Section III. The reconfigurable architecture for the square root decoder is detailed in Section IV. The throughput is analysed in Section V. The experimental results and performance analysis are provided in Section VI. The conclusions are stated in Section VII.

II. OVERVIEW OF MIMO DETECTION

The multiple-antenna system with M transmit antennas and N ≥ M receive antennas is modeled in baseband by the following relation:

$$ r = Hs + v \tag{1} $$

In relation (1), $s = [s_1, s_2, \ldots, s_M]^T$ is the transmitted symbol vector, in which each component $s_i$ is independently drawn from a complex constellation. The total transmit power is normalized to unity. The
vector $r = [r_1, r_2, \ldots, r_N]^T$ is the received symbol vector and $v = [v_1, v_2, \ldots, v_N]^T$ is an independent and identically distributed (i.i.d.) complex zero-mean Gaussian noise vector with variance $\sigma^2$ per dimension. The elements $h_{ij}$ of H represent the complex channel gain between the j-th transmit antenna and the i-th receive antenna. These path gains are modeled as independent complex Gaussian random variables with zero mean and variance 0.5 per dimension. The channel characteristics do not change during the transmission of an entire frame, in accordance with the quasi-static flat fading assumption.

Among the various MIMO detection algorithms, the optimal ML (maximum likelihood) detector is too complex to be implemented for a system with a large number of antennas and a large signal constellation. The sphere detector is more complex than the V-BLAST square root detector [5]. Linear detectors such as MMSE (Minimum Mean Squared Error) and ZF (Zero-Forcing) have poor BER (bit error rate) performance. Hence the square root detector is an attractive solution to obtain high performance with reasonable complexity.

III. DECODING ALGORITHM

The V-BLAST square root algorithm was proposed in [4]; it avoids the repeated pseudo-inverse and matrix inverse computations by using unitary transformations. The computational cost is reduced from $O(M^4)$ to $O(M^3)$ without degradation in BER performance, where M is the number of transmit antennas. The whole algorithm is described in the following steps:

A) Compute $P^{1/2}$ and $Q_a$: for i = 1, 2, ..., N:

$$ \begin{bmatrix} 1 & (H)_i P_{i-1} \\ 0^{M \times 1} & P_{i-1} \\ -e_i & Q_{i-1} \end{bmatrix} \Theta_i = \begin{bmatrix} \times & 0^{1 \times M} \\ \times & P_i \\ \times & Q_i \end{bmatrix} \tag{2} $$

In this relation, $P_{i-1}$ and $P_i$ are $M \times M$, $Q_{i-1}$ and $Q_i$ are $N \times M$, $(H)_i$ is the i-th row of H, $P_0 = \beta I$, $Q_0 = 0^{N \times M}$, $e_i$ is the i-th unit vector of dimension N, $\Theta_i$ is any unitary transformation that block lower triangularizes the pre-array, and × denotes an ignored result. After N steps, we obtain $P^{1/2} = P_N$ and $Q_a = Q_N$.

B) Determine the optimal ordering and nulling vectors: for i = M, M-1, ..., 1:

B1) Find the minimum length row of $P^{1/2}$ and permute it to be the last (M-th) row. Permute s accordingly.

B2) Find a unitary $\Sigma_i$ to block upper triangularize $P_i^{1/2}$:

$$ P_i^{1/2} \Sigma_i = \begin{bmatrix} P_{i-1}^{1/2} & \times^{(i-1) \times 1} \\ 0 & p_i \end{bmatrix} \tag{3} $$

B3) Update $Q_a$ to $Q_a \Sigma_i$; the nulling vector for the i-th signal is given by

$$ w_i = p_i \, q_{a,i}^{*} \tag{4} $$

where $q_{a,i}^{*}$ is the conjugate transpose of the i-th column of $Q_a$ (that is, the i-th row of $Q_a^{*}$).

B4) Compute $y_i = w_i r$; the i-th transmitted signal in s is then detected as the closest point in the signal constellation.

B5) Cancel the interference of the detected signal from the received signal r:

$$ r = r - s_i (H)_i \tag{5} $$

(here $(H)_i$ is the column of H that corresponds to $s_i$).

B6) Go back to step B1, but now with $P_{i-1}^{1/2}$ and $Q_{a,i-1}$ (the first i-1 columns of $Q_a$).

IV. ARCHITECTURE

[Figure: modules M1 (unitary transformation $\Theta_i$), M2 and M3 (unitary transformations $\Sigma_i$), M4 (calculation of nulling vectors), M5 (calculation of $y_i$ and $s_i$) and M6 (interference cancellation), chained between input data and output data.]
Fig. 1. Block diagram of square root decoder architecture

The architecture of the MIMO square root decoder is illustrated in figure 1. It consists of 6 processing modules. The values of the channel matrix H and of r are assumed to have been pre-calculated. The first three modules (M1, M2, M3) use unitary transformations to compute $P^{1/2}$ (Step A), $Q_a$ (Step A), $p_i$ (Step B2) and $q_{a,i}^{*}$ (Step B3) by employing various numbers of CORDIC operators. The following module (M4) calculates the optimal ordering and nulling vectors $w_i$ (Step B3). Module M5 computes the transmitted symbol vector (Step B4). The last module (M6) performs interference cancellation (Step B5).

The modules (M1, M2, M3) have similar architectures [6]. Instead of the conventional QR triangular array, which employs too many processors, unitary transformations are used in these modules. Unitary transformations are performed by a sequence of numerically stable complex Givens rotations, which are suitable for implementation because
the elementary hardware operator is based on CORDIC, in which only shifters and adders are involved [3]. This reduces the computational complexity significantly.

We take as an example the calculation of $P^{1/2}$ and $Q_a$ (first iteration) in the module M1, to show how the module output becomes $P^{1/2}$ and $Q_a$.

[Figure: sequence of rotations that transforms $P_0^{1/2}$ and $Q_0$ into $P^{1/2}$ and $Q_a$.]
Fig. 2. 29 CORDIC operations required for the calculation of $P_1^{1/2}$ and $Q_a$ (first iteration) in the module M1

[…] of CORDIC. On the other hand, if the throughput requirement is not crucial, the number of CORDIC operators can be decreased, down to a single CORDIC.

[Figure: schedule of the CORDIC operations over N cycles; panel (a) 3 parallel CORDIC.]
Fig. 3. Different number of parallel CORDIC for different throughput

All iterations of the CORDIC algorithm are performed in parallel, using a pipelined structure, as shown in figure 4. The pipelined structure ensures the highest possible throughput, because a CORDIC transformation can be performed each clock cycle.

[Figure 4: pipelined CORDIC built from cascaded Cordic_pipe stages. Each stage takes $x_i$, $y_i$, $a_i$ and a stage constant, adds or subtracts according to the sign of $y_i$ or $a_i$, and registers $x_{i+1}$, $y_{i+1}$, $a_{i+1}$.]
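The stage structure described for figure 4 can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the authors' hardware description: the names `cordic_stage`, `cordic` and `N_STAGES` are hypothetical, the choice of 16 stages is arbitrary, and floating-point multiplications by powers of two stand in for the hardware shifters. The loop plays the role of the cascaded pipeline stages.

```python
import math

N_STAGES = 16  # hypothetical pipeline depth, not taken from the paper
# Stage constants: the elementary rotation angle of each stage.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_STAGES)]
# Accumulated scaling of the N_STAGES micro-rotations (about 1.6468).
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_STAGES))


def cordic_stage(x, y, a, i, vectoring):
    """One stage: add or subtract shifted operands and the stage constant."""
    if vectoring:
        # Vectoring mode: the sign of y selects the direction (drives y to 0).
        d = -1.0 if y >= 0 else 1.0
    else:
        # Rotation mode: the sign of a selects the direction (drives a to 0).
        d = 1.0 if a >= 0 else -1.0
    shift = 2.0 ** -i  # models a right shift by i bits
    return x - d * y * shift, y + d * x * shift, a - d * ANGLES[i]


def cordic(x, y, a, vectoring):
    """Run all stages in sequence and compensate the CORDIC gain."""
    for i in range(N_STAGES):
        x, y, a = cordic_stage(x, y, a, i, vectoring)
    return x / GAIN, y / GAIN, a


# Vectoring mode rotates (3, 4) onto the x axis and returns the angle.
r, residue, angle = cordic(3.0, 4.0, 0.0, vectoring=True)
# Rotation mode applies the opposite angle to a pair of values.
u, v, _ = cordic(3.0, 4.0, -angle, vectoring=False)
```

In this model a real Givens rotation is a vectoring pass on the pivot pair followed by rotation passes, with the negated angle, on the remaining pairs of the two rows. The vectoring pass assumes a positive x input, which is the usual convergence range of the basic algorithm.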
$$ \mathrm{Throughput} = (\mathrm{Freq} \times N \times b) \left( \frac{NC_{used}}{NC_{minimal}} \right) \tag{7} $$

In equation (7), Freq represents the clock frequency of the CORDIC, b is the number of bits per symbol, and $NC_{used}$ is the number of CORDIC operators used. The relation between throughput and number of CORDIC operators is illustrated in figure 5. The other factors are considered constant for a given MIMO system.

[Figure 5: throughput as a function of the number of CORDIC operators, with the levels "Max." and "Required" marked on the throughput axis.]

VII. CONCLUSIONS

A reconfigurable square root decoder for MIMO systems has been designed and implemented on an FPGA. It is attractive for future wireless applications, supporting different numbers of antennas, different modulation schemes and different throughputs. The architecture is adaptable by employing different numbers of CORDIC operators. The CORDIC operator can also be used as a common operator for SDR applications [7]. The architectures of the modules are defined and synthesized individually with the Xilinx software tool [8]. Future work will address the dynamic reconfiguration of this decoder.
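As a numerical companion to relation (7), the short sketch below evaluates the throughput for a given number of CORDIC operators. The function name `throughput_bps` and all numeric values are hypothetical and chosen only for illustration; they are not measurements from the paper. The only fixed fact used is that QPSK carries 2 bits per symbol.

```python
def throughput_bps(freq_hz, n, bits_per_symbol, nc_used, nc_minimal):
    """Relation (7): (Freq x N x b) scaled by the ratio NC_used / NC_minimal."""
    return freq_hz * n * bits_per_symbol * nc_used / nc_minimal


# Hypothetical operating point: 100 MHz clock, N = 2, QPSK (b = 2).
full = throughput_bps(100e6, 2, 2, nc_used=3, nc_minimal=3)
reduced = throughput_bps(100e6, 2, 2, nc_used=1, nc_minimal=3)
print(f"{full / 1e6:.1f} Mbit/s with 3 operators")
print(f"{reduced / 1e6:.1f} Mbit/s with 1 operator")
```

With the other factors held constant, the throughput scales linearly with the number of operators used, which is the trade-off between area and data rate that the architecture exposes.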