
2017 30th International Conference on VLSI Design and 2017 16th International Conference on Embedded Systems

High Performance Integer DCT Architectures for HEVC

Mohamed Asan Basiri M and Noor Mahammad Sk
Department of Computer Science and Engineering, IIITD&M Kancheepuram, Chennai
Email: [email protected], [email protected]

Abstract—This paper proposes an efficient VLSI architecture for the integer discrete cosine transform (integer DCT) used in real-time high efficiency video coding (HEVC) applications. The proposed N-point 1D-Integer DCT architecture consists of a signed configurable carry save adder tree based multiplier unit, so the depth of the architecture falls within the bounds of $O(\log_2 N)$. The proposed 1D architecture can perform one N-point or multiple N/2, N/4, ..., 2-point Integer DCTs in parallel, and it is used to build 2D folded and parallel designs. The performance results show that the proposed architecture gives better performance than existing architectures using a 45 nm CMOS TSMC library. The proposed 32 × 32-point parallel Integer DCT achieves a 59.1% improvement in worst path delay compared with the odd-even decomposition [3] based architecture.
Index Terms—DCT, DSP, Integer DCT, and HEVC

Fig. 1. Example for row and column process of 4 × 4-point 2D-Integer DCT

I. INTRODUCTION
Digital signal processors (DSPs) are essential for real-time processing of real-world digitized data, performing the high-speed numeric calculations used in a broad range of applications from basic consumer electronics to sophisticated industrial instrumentation. The discrete transform [1] changes the representation of a signal from one domain to another to reduce the complexity of a particular digital signal processing application. The discrete cosine transform (DCT) is a very powerful transformation used in image compression. The circuit complexity of the DCT is greater than that of the integer DCT because the DCT is floating point and the integer DCT is fixed point. In recent trends, HEVC [2] is widely used in multimedia applications, where the integer DCT is incorporated [3].

The 1D and 2D discrete transformations are represented as (1) and (2) respectively, where O is the output matrix, X is the input signal matrix, and C is the co-efficient matrix. The 4-point integer DCT co-efficient matrix is shown in (3). Fig. 1 shows the 4 × 4-point 2D-integer DCT. During the row process, each row of the 4 × 4 input matrix is 1D transformed and the results are stored in each row of the 4 × 4 buffer. During the column process, each column of the 4 × 4 buffer matrix is 1D transformed and the results are the required 2D transformed values. Fig. 2(a) shows the separable folded 2D-Integer DCT architecture, where one 1D-Integer DCT unit performs both the row and column processes. If sel = 0, the row process is performed; otherwise the column process is performed. Fig. 2(b) shows the separable parallel 2D-Integer DCT architecture, where two 1D-Integer DCT units perform the row and column processes. In all cases, the transpose buffer stores the results from the row process so that the column process values can be found.

$$
\begin{bmatrix} o_{11} \\ o_{12} \\ o_{13} \end{bmatrix} =
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
\begin{bmatrix} x_{11} \\ x_{12} \\ x_{13} \end{bmatrix}
\tag{1}
$$

$$
\begin{bmatrix} o_{11} & o_{12} & o_{13} \\ o_{21} & o_{22} & o_{23} \\ o_{31} & o_{32} & o_{33} \end{bmatrix} =
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}
\tag{2}
$$

$$
C^{4 \times 4}_{\text{Integer DCT}} =
\begin{bmatrix} 64 & 64 & 64 & 64 \\ 83 & 36 & -36 & -83 \\ 64 & -64 & -64 & 64 \\ 36 & -83 & 83 & -36 \end{bmatrix}
\tag{3}
$$

Fig. 2. Basic architecture for 2D-Integer DCT (a) Folded (b) Parallel

The odd-even decomposition based N-point Integer DCT is shown in [3], where the N/2 even-ordered input

2380-6923/16 $31.00 © 2016 IEEE. DOI 10.1109/VLSID.2017.68
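The row and column processes described for Fig. 1 can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions: the helper names (`dct_1d`, `transpose`, `dct_2d`) are hypothetical, the coefficient matrix is the 4-point matrix from (3), and no scaling or rounding stages are modeled.

```python
# 4-point integer DCT co-efficient matrix taken from (3).
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct_1d(c, x):
    """1D transform in the form of (1): a matrix-vector product o = C x."""
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in c]


def transpose(m):
    """Model of the transpose buffer access: rows become columns."""
    return [list(col) for col in zip(*m)]


def dct_2d(c, x):
    """Separable 2D transform: row process, buffer, then column process."""
    # Row process: each input row is 1D transformed into a buffer row.
    buffer = [dct_1d(c, row) for row in x]
    # Column process: each buffer column is 1D transformed.
    columns = [dct_1d(c, col) for col in transpose(buffer)]
    # Put the transformed columns back into column positions.
    return transpose(columns)


if __name__ == "__main__":
    flat_block = [[1] * 4 for _ in range(4)]
    for row in dct_2d(C4, flat_block):
        print(row)
```

A constant input block is a convenient check: only the first output value is nonzero, because every row of (3) other than the first sums to zero.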
signal sample values are sent to an N/2-point Integer DCT unit. The configurable Integer DCT is shown in [4], where the multiplier is designed to perform N, N/2, or N/4-point Integer DCTs. The 8-point Integer transform based HEVC architectures are shown in [5], [6], [7]. The accumulator based N-point Integer DCT architectures are shown in [8] and [9], where N accumulators are used to produce N outputs for a 1D-DCT in N cycles. In all the above mentioned existing architectures, add-shift network based multipliers are used. Therefore, the multiplier involves a larger number of CLAs (carry look ahead adders), which increases the worst path delay.

A. Contribution of this paper
The multiplier unit used in the latest N-point Integer DCT architectures is in the form of an add-shift network, whereas the proposed architecture uses a signed configurable carry save adder tree [11]. Therefore, the depth of the architecture falls within the bounds of $O(\log_2 N)$. The proposed 1D architecture can perform one N-point or multiple N/2, N/4, ..., 2-point Integer DCTs in parallel. The performance results show that the proposed architecture gives better performance than existing architectures using a 45 nm CMOS TSMC library. The rest of the paper is organized as follows. Section II elaborates the proposed architecture for the Integer DCT. Design modeling, implementation, and results are stated in Section III, followed by the conclusion in Section IV.

II. THE PROPOSED ARCHITECTURE FOR INTEGER DCT
Fig. 3 shows the proposed block architecture used for the 32-point 1D-Integer DCT. In the 32-point 1D-Integer DCT, the co-efficient matrix is of size 32 × 32. The input signal sample values are multiplied with the co-efficients, which forms a matrix-vector multiplier. All the existing architectures use an add-shift network based multiplier, so the delay of the multiplier depends on the number of adders used in the add-shift network. The proposed architecture uses a configurable carry save adder (CSA) tree based multiplier. Fig. 3(a) shows the series of multiplexers used for configurable carry save addition based multiplication in the proposed architecture. The maximum number of values to be added in the configurable carry save addition based 32-point Integer DCT is $\log_2 N = \log_2 32 = 5$. For example, the multiplication of the co-efficient 87 with the input signal sample value $x_i$ is equal to $87x_i = 64x_i + 16x_i + 4x_i + 2x_i + x_i$. The minimum number of values to be added in the configurable carry save addition based 32-point Integer DCT is 1. For example, the multiplication of the co-efficient 4 with the input signal sample value $x_i$ is equal to $4x_i = 4x_i + 0x_i + 0x_i + 0x_i + 0x_i$. So, the corresponding left-shifted (power of two) input signal values are sent as the inputs of the series of multiplexers in Fig. 3(a), which is named a Cell. At most 5 Cells are needed to obtain one multiplication result, so five Cells are used in Fig. 3(b). The maximum number of levels of the configurable carry save adder (CSA) tree is therefore $\lceil \log_2 5 \rceil = 3$. The Sum and Carry from the final carry save adder are added with a carry look ahead adder (CLA), which produces the multiplication result $o_i$.

The corresponding resultant sign bit ($o_i^s$) is obtained from Fig. 3(c), where a series of multiplexers stores the xor-ed sign bit values of the input signal sample values ($x_i^s$) and the co-efficient values ($c_{ij}^s$), where i and j vary from 0 to 31 for a 32-point Integer DCT. Here, $s_{32}$, $s_{16}$, $s_8$, $s_4$, and $s_2$ are incremented during each cycle (initially they are all equal to 0) using 5, 4, 3, 2, and 1-bit up counters respectively. So, one of the operands of the proposed multiplier is configured (varied) during each cycle. Fig. 3(a), (b), and (c) are together named a Block. The critical path depth of the proposed Block architecture ($T_{delay}^{mul,\,pro}$) is shown in equation (4), which is equal to the critical path depth of the proposed multiplier in the N-point Integer DCT. The total number of CSA levels used for the proposed N-point Integer DCT is $\log_2 \log_2 N$ (rounded up to the next integer, which gives 3 for N = 32). Here, T(csa) and T(cla) are the critical path depths of the carry save adder and the carry look ahead adder respectively. If se = 0, 1, 2, 3, and 4, then 32, 16, 8, 4, and 2-point Integer DCTs are performed respectively. The output from the Block is $\{o_i^s, o_i\}$. Therefore, 32 Blocks are required to obtain one output of the 1D-Integer DCT.

Fig. 4 shows the overall architecture of the proposed 32-point 1D-Integer DCT, where the inputs come from the 32 Blocks shown in Fig. 3. Therefore, $\log_2 32 = 5$ levels of signed fixed point adders are used, and the critical path depth of the signed adder tree ($T_{delay}^{add,\,pro}$) used in the proposed N-point Integer DCT architecture is $(\log_2 N)$T(add), which is shown in (5). Here, T(add) represents the critical path depth of the signed adder. The proposed 32-point 1D architecture can perform one 32-point, two 16-point, four 8-point, eight 4-point, or sixteen 2-point Integer DCTs in parallel. The 32-point Integer DCT output is $\{ou32^s, ou32\}$. The 16-point Integer DCT outputs are $\{ou16_0^s, ou16_0\}$ and $\{ou16_1^s, ou16_1\}$. The 8-point Integer DCT outputs are $\{ou8_0^s, ou8_0\}$, $\{ou8_1^s, ou8_1\}$, $\{ou8_2^s, ou8_2\}$, and $\{ou8_3^s, ou8_3\}$. The 4-point Integer DCT outputs are $\{ou4_0^s, ou4_0\}$, $\{ou4_1^s, ou4_1\}$, ..., $\{ou4_7^s, ou4_7\}$. The 2-point Integer DCT outputs are $\{ou2_0^s, ou2_0\}$, $\{ou2_1^s, ou2_1\}$, ..., $\{ou2_{15}^s, ou2_{15}\}$.

Fig. 4(b) shows the 32 × 32-Buffer architecture, where 32 1 × 32-Buffers are used. The 1 × 32-Buffer inputs are the outputs from the column of 5-to-1 multiplexers with select line se. Here, se = 0, 1, 2, 3, and 4 for 32, 16, 8, 4, and 2-point Integer DCTs respectively. Each 1 × 32-Buffer is made up of 32 registers and 2-to-1 multiplexers with a common select line. The select lines used in the 1 × 32-Buffers 0, 1, ..., 30, and 31 are $en_0$, $en_1$, ..., $en_{30}$, and $en_{31}$ respectively. The output from Fig. 4(a) can be stored in one particular 1 × 32-Buffer by setting the corresponding select line to 1. The 1 × 32-Buffer architecture is shown in Fig. 5. The outputs of the i-th 1 × 32-Buffer are $b32_i$, $b16_i$, $b8_i$, $b4_i$, and $b2_i$, which are the results of the 32, 16, 8, 4, and 2-point Integer DCTs respectively. Here, $en_i = 0$ to maintain the values (32 values) stored in the buffer and $en_i = 1$ if the new value

Fig. 3. The proposed block architecture (Block) used for 32-point 1D-Integer DCT with (a) Series of multiplexers used for configurable carry save addition
based multiplication (Cell) (b) configurable carry save adder tree based multiplication unit (c) Series of multiplexers used to find the resultant sign bits for
the multiplication.
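The Cell, CSA tree, and sign path of the Block can be modeled in software to check the arithmetic behind the 87 and 4 examples. The following is a minimal behavioral sketch, not the authors' hardware: the function names are hypothetical, magnitudes and signs are handled separately as in Fig. 3, and the grouping of operands into 3:2 compressors is an assumed arrangement that happens to give three levels for five operands.

```python
def shifted_terms(coeff, x, max_terms=5):
    """Split coeff * x into power-of-two shifted copies of x (one per set bit),
    padded with zeros up to max_terms, mirroring the five Cells."""
    terms = [x << b for b in range(coeff.bit_length()) if (coeff >> b) & 1]
    if len(terms) > max_terms:
        raise ValueError("co-efficient needs more than max_terms shifted terms")
    return terms + [0] * (max_terms - len(terms))


def carry_save_add(a, b, c):
    """3:2 compressor: a + b + c == sum_bits + carry_bits, no carry propagation."""
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (b & c) | (a & c)) << 1
    return sum_bits, carry_bits


def csa_tree_multiply(coeff, x):
    """Multiply two non-negative integers with CSA levels and one final add.
    Returns (product, number_of_csa_levels)."""
    operands = shifted_terms(coeff, x)
    levels = 0
    while len(operands) > 2:
        full = len(operands) // 3 * 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(*operands[i:i + 3]))
        # Operands that do not fill a group of three pass to the next level.
        reduced.extend(operands[full:])
        operands = reduced
        levels += 1
    # The final addition stands in for the carry look ahead adder.
    return operands[0] + operands[1], levels


def block_multiply(coeff, x):
    """Sign-magnitude model of the Block output {o_s, o}."""
    sign = int((coeff < 0) ^ (x < 0))  # xor of the two sign bits
    magnitude, _ = csa_tree_multiply(abs(coeff), abs(x))
    return sign, magnitude


if __name__ == "__main__":
    print(csa_tree_multiply(87, 13))
    print(block_multiply(-83, 5))
```

With five padded operands the loop runs three times (5, 4, 3, then 2 operands), which agrees with the three CSA levels stated for the 32-point design.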
TABLE I
THEORETICAL ANALYSIS OF VARIOUS ARCHITECTURES FOR INTEGER DCT

| Architecture | N = 32 | N = 16 | N = 8 | N = 4 | N = 2 | Critical path depth | No. of cycles |
|---|---|---|---|---|---|---|---|
| N-point 1D Odd even [3] | YES | YES | YES | YES | NO | $(1 + \log_2 \frac{N}{2})$T(add) + T(add-shift) + T(mux) | 1 |
| N-point 1D [4] | YES | YES | YES | YES | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | N |
| N-point 1D [5] | NO | NO | YES | NO | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | 1 |
| N-point 1D [6] | NO | NO | YES | NO | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | N |
| N-point 1D [7] | NO | NO | YES | NO | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | N |
| N-point 1D [8] | YES | YES | YES | YES | NO | T(add-shift) + T(mux) + T(add) | N |
| N-point 1D [10] | YES | YES | YES | YES | NO | $(\log_2 N)$T(add) + T(add-shift) | 1 |
| N-point 1D Proposed | YES | YES | YES | YES | YES | $(\log_2 N)$T(add) + T(cla) + T(mux) + $(\log_2 \log_2 N)$T(csa) | N |
| N × N-point 2D Folded/Parallel [3] | YES | YES | YES | YES | NO | $(1 + \log_2 \frac{N}{2})$T(add) + T(add-shift) + T(mux) | 2N |
| N × N-point 2D Folded/Parallel [4] | YES | YES | YES | YES | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | $2N^2$ |
| N × N-point 2D Parallel [5] | NO | NO | YES | NO | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | 2N |
| N × N-point 2D Folded/Parallel [8] | YES | YES | YES | YES | NO | T(add-shift) + T(mux) + T(add) | $2N^2$ |
| N × N-point 2D Parallel [9] | YES | YES | YES | YES | NO | T(add-shift) + T(add) + T(mux) | $2N^2$ |
| N × N-point 2D Parallel [10] | YES | YES | YES | YES | NO | $(\log_2 N)$T(add) + T(add-shift) + T(mux) | 2N |
| N × N-point 2D Folded/Parallel Proposed | YES | YES | YES | YES | YES | $(\log_2 N)$T(add) + T(cla) + T(mux) + $(\log_2 \log_2 N)$T(csa) | $2N^2$ |

T(add), T(mux), T(cla), T(csa), and T(add-shift) are the critical path depths of the signed fixed point adder, multiplexer, recursive doubling based carry look ahead adder, carry save adder, and add-shift network based multiplier respectively.
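As a worked instance of the critical path expressions in Table I, substituting N = 32 gives the following. The ceiling on $\log_2 \log_2 N$ is an assumption made here so that the count of CSA levels is an integer; it matches the three levels quoted for the 32-point design.

$$
\begin{aligned}
T_{\text{proposed}}(32) &= (\log_2 32)\,T(\text{add}) + T(\text{cla}) + T(\text{mux}) + \lceil \log_2 \log_2 32 \rceil\,T(\text{csa}) \\
&= 5\,T(\text{add}) + T(\text{cla}) + T(\text{mux}) + 3\,T(\text{csa}) \\
T_{\text{odd-even [3]}}(32) &= \left(1 + \log_2 \tfrac{32}{2}\right) T(\text{add}) + T(\text{add-shift}) + T(\text{mux}) \\
&= 5\,T(\text{add}) + T(\text{add-shift}) + T(\text{mux})
\end{aligned}
$$

Under these expressions both designs share the $5\,T(\text{add}) + T(\text{mux})$ part, so the comparison reduces to $T(\text{cla}) + 3\,T(\text{csa})$ against $T(\text{add-shift})$.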

arrives from the input. In the Buffer architecture, the shaded boxes represent the clocked registers. The critical path depth for the proposed N-point Integer DCT ($T_{delay}^{\text{Integer DCT},\,pro}$) is shown in (6). Equation (7) shows the number of $\frac{N}{2^k}$-point ($M_{N,\,pro}^{(N/2^k)}$) and the number of $(\frac{N}{2^k} \times \frac{N}{2^k})$-point ($M_{(N \times N),\,pro}^{(N/2^k \times N/2^k)}$) Integer DCTs using the proposed N-point 1D and N × N-point 2D architectures respectively. Here, T(mux) is the critical path depth of the multiplexers used in the proposed architecture. The proposed N-point 1D and N × N-point 2D Integer DCTs require N and $2N^2$ cycles to complete the operation respectively. Here, the row and column processes take $N^2$ cycles each.

$$
T_{delay}^{mul,\,pro} = T(\text{cla}) + (\log_2 \log_2 N)\,T(\text{csa}) \tag{4}
$$

$$
T_{delay}^{add,\,pro} = (\log_2 N)\,T(\text{add}) \tag{5}
$$

$$
T_{delay}^{\text{Integer DCT},\,pro} = T(\text{mux}) + T_{delay}^{mul,\,pro} + T_{delay}^{add,\,pro} \tag{6}
$$

$$
M_{N,\,pro}^{(N/2^k)} = M_{(N \times N),\,pro}^{(N/2^k \times N/2^k)} = 2^k; \quad k = 0, 1, 2, \ldots, (\log_2 N) - 1 \tag{7}
$$

III. DESIGN MODELING, IMPLEMENTATION, AND RESULTS
All the existing and proposed designs are modeled in Verilog HDL. These Verilog HDL models are simulated and verified using the Xilinx ISE simulator. The timing, area, total number of cells, and power analysis of this implementation are done with the Cadence 6.1 ASIC design tool. All the designs are implemented for 45 nanometer technology, where the library tcbn45gsbwpbc088 ccs.lib is used. Here, the operating voltage is 0.88 V. In general, the performance of a circuit depends on circuit delay, circuit area, and power dissipation. The worst path circuit delay is defined as the path from input to output with the largest (worst path) delay in the circuit. The

Fig. 4. VLSI architectures for (a) proposed 32-point 1D-Integer DCT (b) 32 X 32-Buffer

Fig. 5. 1 × 32-Buffer architecture
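Equation (7) and the cycle counts quoted for the proposed design are easy to tabulate. The sketch below is an illustration only; the function names are hypothetical, N is assumed to be a power of two, and the mapping of the index k to the select line value se follows the se = 0, 1, 2, 3, 4 ordering given in the text.

```python
def parallel_transform_counts(n):
    """Equation (7): for k = 0 .. log2(n) - 1, an n-point unit can run
    2**k transforms of length n / 2**k in parallel.
    Returns {transform_length: number_of_parallel_transforms}."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return {n >> k: 1 << k for k in range(stages)}


def select_line_for_length(n, length):
    """Select value se = k that picks the (n / 2**k)-point mode."""
    return (n // length).bit_length() - 1


def cycle_counts(n):
    """Cycles stated in the text: n for the 1D unit, n**2 for each of the
    row and column processes of the 2D design."""
    return {"1d": n, "2d_row": n * n, "2d_column": n * n, "2d_total": 2 * n * n}


if __name__ == "__main__":
    print(parallel_transform_counts(32))
    print(select_line_for_length(32, 8))
    print(cycle_counts(32))
```

For N = 32 the first function reproduces the one 32-point, two 16-point, four 8-point, eight 4-point, and sixteen 2-point modes listed for the proposed 1D architecture.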

careful optimization of these parameters will ensure the highest performance. Table I shows the theoretical analysis of various Integer DCT architectures, where add-shift network based multipliers along with adders form the critical path in existing designs, while the CSA based multipliers along with adders form the critical path in the proposed designs. Also, Table I shows the possible length (32, 16, 8, 4, or 2-point), critical path depth, and number of cycles of various N and N × N-point Integer DCTs.

Table II shows the comparison of worst path delay, total area, net power, and power delay product (PDP) or energy per operation [12] between various 1D and 2D Integer DCT architectures. The PDP stands for the average energy consumed per switching event, as is apparent from its units (W·s = Joule). The PDP is calculated by multiplying the worst path delay by the sum of the switching and leakage powers. The proposed 32 × 32-point parallel Integer DCT achieves a 59.1% improvement in worst path delay compared with the odd-even decomposition [3] based architecture because regular adders are used in [3], whereas in the proposed technique, CSA based

TABLE II
PERFORMANCE ANALYSIS OF DIFFERENT ARCHITECTURES FOR INTEGER DCT WITH INPUT SIGNAL SAMPLE VALUES AS 8-BITS WIDE WITH 45 nm CMOS TECHNOLOGY.

| 1D/2D Integer DCT architecture | Worst path delay (ps) | Frequency (MHz) | Total area (µm²) | Total no. of cells | Net power (nW) | Switching power (nW) | Leakage power (nW) | EOP (fJ) |
|---|---|---|---|---|---|---|---|---|
| 32-point 1D Odd even [3] | 3026.2 | 330.4 | 83051.3 | 64868 | 1623391.2 | 5274320.3 | 3515339.8 | 26599.2 |
| 32-point 1D [4] | 1560.9 | 640.6 | 67379.3 | 57839 | 731229.9 | 2929816.4 | 4470877.4 | 11551.7 |
| 8-point 1D [5] | 1768.4 | 565.6 | 36795.2 | 35579 | 499123.2 | 1781991.2 | 2567233.2 | 7691.1 |
| 8-point 1D [6] | 1167.1 | 856.8 | 30685.1 | 21569 | 461001.2 | 1311001.1 | 1142243.1 | 2863.1 |
| 8-point 1D [7] | 1682.2 | 594.5 | 33588.2 | 31168 | 485291.5 | 1671071.7 | 2340745.8 | 6748.6 |
| 32-point 1D [8] | 1587.4 | 630.1 | 81836.2 | 52111 | 853460.1 | 2796111.6 | 4384529.6 | 11398.5 |
| 32-point 1D [10] | 1889.4 | 529.2 | 89845.3 | 66789 | 1832311.1 | 5424219.3 | 3835311.8 | 17494.9 |
| 32-point 1D Proposed | 1399.7 | 714.4 | 42810.2 | 42578 | 517698.2 | 2218746.4 | 3333070.4 | 7770.8 |
| 32 × 32-point 2D Folded [3] | 3967.8 | 252.0 | 361980.2 | 211072 | 3140026.7 | 11773025.1 | 17121276.8 | 114646.8 |
| 32 × 32-point 2D Folded [4] | 1568.8 | 637.7 | 265778.1 | 65140 | 889125.9 | 7893432.2 | 10009573.4 | 28086.2 |
| 32 × 32-point 2D Folded [8] | 1773.9 | 564.0 | 321985.1 | 172032 | 2054512.9 | 8343453.9 | 14794677.3 | 41044.7 |
| 32 × 32-point 2D Folded Proposed | 1755.1 | 569.8 | 164754.3 | 57839 | 731227.5 | 3620249.4 | 6767937.4 | 18232.3 |
| 32 × 32-point 2D Parallel [3] | 3835.0 | 260.7 | 441948.4 | 223040 | 3194824.4 | 13092679.2 | 18148370.2 | 119809.4 |
| 32 × 32-point 2D Parallel [4] | 1568.1 | 637.7 | 367075.3 | 156717 | 5918420.1 | 9003839.2 | 10125521.3 | 29996.7 |
| 8 × 8-point 2D Parallel [5] | 1762.9 | 567.2 | 170122.1 | 93829 | 1454342.1 | 4731477.9 | 9007501.5 | 24220.4 |
| 32 × 32-point 2D Parallel [8] | 1589.0 | 629.3 | 401226.1 | 218432 | 2612410.6 | 10386085.2 | 19721696.1 | 47841.2 |
| 32 × 32-point 2D Parallel [9] | 2256.3 | 443.2 | 385511.7 | 219539 | 2706847.8 | 10496871.8 | 18285502.7 | 64941.6 |
| 32 × 32-point 2D Parallel [10] | 1899.5 | 526.4 | 467981.2 | 237872 | 3314227.7 | 14312679.2 | 20226511.2 | 65607.1 |
| 32 × 32-point 2D Parallel Proposed | 1569.2 | 637.3 | 269967.8 | 131798 | 1315835.7 | 6982468.5 | 11017179.3 | 28245.0 |
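The derived columns of Table II can be recomputed from the delay and power columns. The sketch below assumes the relations stated in the text (frequency as the reciprocal of the worst path delay, and energy per operation as delay times the sum of switching and leakage power); the function names are hypothetical and the sample values are copied from the two 32 × 32-point 2D Parallel rows for [3] and the proposed design.

```python
def frequency_mhz(delay_ps):
    """Maximum clock frequency in MHz for a worst path delay in picoseconds."""
    return 1e6 / delay_ps


def energy_per_operation_fj(delay_ps, switching_nw, leakage_nw):
    """PDP / EOP in femtojoules.
    ps * nW = 1e-12 s * 1e-9 W = 1e-21 J = 1e-6 fJ."""
    return delay_ps * (switching_nw + leakage_nw) * 1e-6


def delay_improvement_percent(reference_ps, proposed_ps):
    """Relative reduction in worst path delay with respect to a reference."""
    return 100.0 * (reference_ps - proposed_ps) / reference_ps


if __name__ == "__main__":
    # Values from Table II, 32 x 32-point 2D Parallel rows.
    odd_even_delay_ps = 3835.0
    proposed_delay_ps = 1569.2
    proposed_switching_nw = 6982468.5
    proposed_leakage_nw = 11017179.3

    print(round(frequency_mhz(proposed_delay_ps), 1))
    print(round(energy_per_operation_fj(
        proposed_delay_ps, proposed_switching_nw, proposed_leakage_nw), 1))
    print(round(delay_improvement_percent(
        odd_even_delay_ps, proposed_delay_ps), 1))
```

The last value is the 59.1% worst path delay improvement quoted for the proposed 32 × 32-point parallel design against the odd-even decomposition based architecture [3].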

adders are used. The architectures shown in [5], [6], and [7] require less area than the proposed design because these existing techniques support only 8-point Integer DCT operation. The parallel 2D architectures [4] and [8] achieve higher performance than the proposed design, but the area of those existing techniques is greater than that of the proposed design because of parallel refinement units and accumulators respectively. Since the critical path of [8] includes only one accumulator, the critical path delay of [8] is less than that of the other existing designs. Fig. 6 shows the chip layout diagram for the proposed folded 32 × 32-point 2D-Integer DCT architecture using 45 nm technology. The main difference between the proposed parallel and folded architectures is the number of clock cycles and the area. In the parallel architecture, the total area is greater than in the folded architecture. In the folded architecture, the number of clock cycles is greater than in the parallel architecture. Therefore, the parallel architecture can be used in applications where time optimization (high throughput) is the primary goal (for example, supercomputers). Similarly, the folded architecture can be used in applications where area optimization is the primary goal (for example, handheld devices).

Fig. 6. Chip layout diagram for proposed 32 × 32-point 2D-Integer DCT using folded architecture with core area as 181229.7 µm², die space around core as 60 µm, and total chip area as 235904.49 µm² using 45 nm technology.

IV. CONCLUSION
In this paper, a high performance VLSI architecture for the integer discrete cosine transform (DCT) is proposed for use in real-time high efficiency video coding (HEVC) applications. Here, the multiplier is designed with a configurable carry save adder tree, and hence the depth of the circuit is within the bounds of $O(\log_2 N)$. The proposed 1D Integer DCT can perform one N-point or multiple N/2, N/4, ..., 2-point transformations in parallel. The proposed 1D architecture is used to build 2D folded and parallel designs. The performance results show that the proposed architecture gives a good improvement over existing architectures using a 45 nm CMOS TSMC library. The proposed 32 × 32-point parallel Integer DCT achieves a 59.1% improvement in worst path delay compared with the odd-even decomposition [3] based architecture.

REFERENCES
[1] Mohamed Asan Basiri M and Noor Mahammad Sk, “Multimode Parallel and Folded VLSI Architectures for 1D-Fast Fourier Transform”, Integration, the VLSI Journal, Elsevier, vol. 55, pp. 43-56, Sept. 2016.
[2] Fei Liang, Xiulian Peng, and Jizheng Xu, “A light-weight HEVC encoder for image coding”, IEEE International Conference on Visual Communications and Image Processing (VCIP), pp. 1-5, Nov. 2013.
[3] Pramod Kumar Meher, Sang Yoon Park, Basant Kumar Mohanty, Khoon Seong Lim, and Chuohao Yeo, “Efficient Integer DCT Architectures for HEVC”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 168-178, Jan. 2014.
[4] Pai-Tse Chiang and Tian Sheuan Chang, “A Reconfigurable Inverse Transform Architecture Design for HEVC Decoder”, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1006-1009, May 2013.
[5] Honggang Qi, Qingming Huang, and Wen Gao, “A Low-Cost Very Large Scale Integration Architecture for Multi Standard Inverse Transform”, IEEE Transactions on Circuits and Systems - II, Express Briefs, vol. 57, no. 7, pp. 551-555, July 2010.
[6] Khan Wahid, Muhammad Martuza, Mousumi Das, and Carl McCrosky, “Resource Shared Architecture of Multiple Transforms for Multiple Video Codecs”, IEEE International Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 947-950, May 2011.
[7] Kanwen Wang, Jialin Chen, Wei Cao, Ying Wang, Lingli Wang, and Jiarong Tong, “A Reconfigurable Multi-Transform VLSI Architecture Supporting Video Codec Design”, IEEE Transactions on Circuits and Systems - II, Express Briefs, vol. 58, no. 7, pp. 432-436, July 2011.
[8] Yao Ziyou, He Weifeng, Hong Liang, He Guanghui, and Mao Zhigang, “Area and Throughput Efficient IDCT/IDST Architecture for HEVC Standard”, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2511-2514, June 2014.
[9] Hong Liang, He Weifeng, Zhu Hu, and Mao Zhigang, “A Cost Effective 2-D Adaptive Block Size IDCT Architecture for HEVC Standard”, IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1290-1293, Aug. 2013.
[10] Wenjun Zhao, Takao Onoye, and Tian Song, “High-Performance Multiplierless Transform Architecture for HEVC”, IEEE International Symposium on Circuits and Systems, pp. 1668-1671, May 2013.
[11] Mohamed Asan Basiri M and Noor Mahammad Sk, “An Efficient VLSI Architecture for Discrete Hadamard Transform”, IEEE International VLSI Design Conference, pp. 140-145, Jan. 2016.
[12] Ricardo Gonzalez, Benjamin M. Gordon, and Mark A. Horowitz, “Supply and Threshold Voltage Scaling for Low Power CMOS”, IEEE Journal of Solid State Circuits, vol. 32, no. 8, pp. 1210-1216, Aug. 1997.
