High Performance Integer DCT Architectures for HEVC: Mohamed Asan Basiri M, Noor Mahammad SK
I. INTRODUCTION
Digital signal processors (DSPs) are essential for real-time processing of real-world digitized data, performing the high-speed numeric calculations used in a broad range of applications, from basic consumer electronics to sophisticated industrial instrumentation. The discrete transform [1] changes the representation of a signal from one domain to another in order to reduce the complexity of a particular digital signal processing application. The discrete cosine transform (DCT) is a very powerful transformation used in image compression. The circuit complexity of the DCT is greater than that of the integer DCT because the DCT uses floating-point arithmetic, whereas the integer DCT uses fixed-point arithmetic. HEVC [2], which incorporates the integer DCT [3], is now widely used in multimedia applications.

The 1D and 2D discrete transformations are represented as (1) and (2) respectively, where O is the output matrix, X is the input signal matrix, and C is the coefficient matrix. The 4-point integer DCT coefficient matrix is shown in (3).

$$
\begin{bmatrix} o_{11} \\ o_{12} \\ o_{13} \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
\begin{bmatrix} x_{11} \\ x_{12} \\ x_{13} \end{bmatrix}
\tag{1}
$$

$$
\begin{bmatrix} o_{11} & o_{12} & o_{13} \\ o_{21} & o_{22} & o_{23} \\ o_{31} & o_{32} & o_{33} \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}
\tag{2}
$$

$$
C^{4 \times 4}_{\text{Integer DCT}}
=
\begin{bmatrix} 64 & 64 & 64 & 64 \\ 83 & 36 & -36 & -83 \\ 64 & -64 & -64 & 64 \\ 36 & -83 & 83 & -36 \end{bmatrix}
\tag{3}
$$

Fig. 1 shows the 4 × 4-point 2D-integer DCT. During the row process, each row of the 4 × 4 input matrix is 1D transformed and the results are stored in each row of the 4 × 4 buffer. During the column process, each column of the 4 × 4 buffer matrix is 1D transformed, and the results are the required 2D transformed values. Fig. 2(a) shows the separable folded 2D-Integer DCT architecture, where one 1D-Integer DCT unit performs both the row and column processes. If sel = 0, the row process is performed; otherwise, the column process is performed. Fig. 2(b) shows the separable parallel 2D-Integer DCT architecture, where two 1D-Integer DCT units are used to perform the row and column processes. In all cases, the transpose buffer stores the results of the row process so that the column process values can be found.

Fig. 2. Basic architecture for 2D-Integer DCT (a) Folded (b) Parallel

The odd-even decomposition based N-point Integer DCT is shown in [3], where the N/2 numbers of even ordered input
Fig. 3. The proposed block architecture (Block) used for 32-point 1D-Integer DCT with (a) Series of multiplexers used for configurable carry save addition
based multiplication (Cell) (b) configurable carry save adder tree based multiplication unit (c) Series of multiplexers used to find the resultant sign bits for
the multiplication.
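Fig. 3 refers to carry save addition (CSA) based multiplication. As a rough software illustration of that general idea only, and not of the authors' configurable cell, the sketch below multiplies a non-negative sample by a non-negative constant: one shifted partial product is formed per set bit of the constant, 3:2 carry-save adders reduce the partial products to two operands, and a single carry-propagate addition finishes the job. The function names `csa`, `partial_products`, and `csa_multiply` are hypothetical, and sign handling (the subject of Fig. 3(c)) is assumed to be done separately.

```python
def csa(a, b, c):
    """3:2 carry-save adder on non-negative integers.

    Returns (sum_word, carry_word) such that a + b + c == sum_word + carry_word.
    No carry ripples between bit positions inside this step.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def partial_products(x, coeff):
    """One shifted copy of x for every set bit of the constant coeff."""
    return [x << i for i in range(coeff.bit_length()) if (coeff >> i) & 1]


def csa_multiply(x, coeff):
    """Multiply magnitudes x and coeff (both assumed non-negative)."""
    operands = partial_products(x, coeff)
    # Reduce three operands to two until at most two remain.
    while len(operands) > 2:
        s, c = csa(operands[0], operands[1], operands[2])
        operands = operands[3:] + [s, c]
    # Final carry-propagate addition of the remaining operands.
    return sum(operands)


if __name__ == "__main__":
    # 83 = 64 + 16 + 2 + 1, so four partial products are reduced.
    print(csa_multiply(200, 83) == 200 * 83)
```

Only the last addition propagates carries across the word, which is the usual reason carry-save reduction shortens the critical path relative to a chain of regular adders.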
TABLE I
THEORETICAL ANALYSIS OF VARIOUS ARCHITECTURES FOR INTEGER DCT
Fig. 4. VLSI architectures for (a) proposed 32-point 1D-Integer DCT (b) 32 × 32-Buffer
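The row process, transpose buffer, and column process described in the Introduction can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware of Fig. 4: it uses the 4-point coefficient matrix from (3) instead of the 32-point one, the helper names `dct_1d`, `transpose`, and `dct_2d` are illustrative, and the intermediate scaling and rounding stages that HEVC applies are omitted.

```python
# 4-point integer DCT coefficient matrix, copied from equation (3).
C4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct_1d(vec, coeff=C4):
    """1D transform in the form of equation (1): o = C * x."""
    return [sum(c * x for c, x in zip(row, vec)) for row in coeff]


def transpose(matrix):
    """Swap rows and columns (the role played by the transpose buffer)."""
    return [list(col) for col in zip(*matrix)]


def dct_2d(block, coeff=C4):
    """Separable 2D transform: row process, then column process."""
    # Row process: transform each input row and store it in a buffer row.
    buffer = [dct_1d(row, coeff) for row in block]
    # Column process: read the buffer column by column and transform each.
    columns_out = [dct_1d(col, coeff) for col in transpose(buffer)]
    # columns_out[j] holds output column j; return the result row-major.
    return transpose(columns_out)


if __name__ == "__main__":
    flat = [[1] * 4 for _ in range(4)]
    for row in dct_2d(flat):
        print(row)
```

In this sketch the same `dct_1d` function serves both passes, which loosely mirrors the folded arrangement; a parallel arrangement would dedicate a separate 1D unit to each pass. With this ordering the two passes compose to C·X·Cᵀ, so a constant block produces a single non-zero value in the top-left output position.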
careful optimization in these parameters will ensure the highest performance. Table I shows the theoretical analysis of various Integer DCT architectures, where add-shift network based multipliers along with adders are part of the critical path in the existing designs, while CSA based multipliers along with adders are part of the critical path in the proposed designs. Table I also shows the possible lengths (32, 16, 8, 4, or 2-point), critical path depth, and number of cycles of various N and N × N-point Integer DCTs.

Table II shows the comparison of worst path delay, total area, net power, and power delay product (PDP) or energy per operation (EOP) [12] between various 1D and 2D Integer DCT architectures. The PDP is the average energy consumed per switching event, as is apparent from its units (W·s = Joule). It is calculated by multiplying the worst path delay by the sum of the switching and leakage powers. The proposed 32 × 32-point parallel Integer DCT achieves a 59.1% improvement in worst path delay compared with the odd-even decomposition [3] based architecture because regular adders are used in [3], whereas CSA based adders are used in the proposed technique. The architectures shown in [5], [6], and [7] require less area than the proposed design because these existing techniques support only 8-point Integer DCT operation. The parallel 2D architectures [4] and [8] achieve higher performance than the proposed design, but their area is greater than that of the proposed design because of the parallel refinement units and accumulators, respectively. Since the critical path of [8] includes only one accumulator, the critical path delay of [8] is less than that of the other existing designs. Fig. 6 shows the chip layout diagram for the proposed folded 32 × 32-point 2D-Integer DCT architecture using 45 nm technology.

The main difference between the proposed parallel and folded architectures is the number of clock cycles and area. The parallel architecture has a greater total area than the folded one, whereas the folded architecture needs more clock cycles than the parallel one. Therefore, the parallel architecture suits applications where time optimization (high throughput) is the primary goal (for example, supercomputers). Similarly, the folded architecture suits applications where area optimization is the primary goal (for example, handheld devices).

TABLE II
PERFORMANCE ANALYSIS OF DIFFERENT ARCHITECTURES FOR INTEGER DCT WITH INPUT SIGNAL SAMPLE VALUES AS 8-BITS WIDE WITH 45 nm CMOS TECHNOLOGY.

1D/2D Integer DCT architecture        Worst path  Frequency  Total area  Total no.   Net power    Switching      Leakage       EOP
                                      delay (ps)      (MHz)       (µm²)   of cells        (nW)   power (nW)   power (nW)      (fJ)
32-point 1D Odd even [3]                  3026.2      330.4     83051.3      64868   1623391.2    5274320.3    3515339.8   26599.2
32-point 1D [4]                           1560.9      640.6     67379.3      57839    731229.9    2929816.4    4470877.4   11551.7
8-point 1D [5]                            1768.4      565.6     36795.2      35579    499123.2    1781991.2    2567233.2    7691.1
8-point 1D [6]                            1167.1      856.8     30685.1      21569    461001.2    1311001.1    1142243.1    2863.1
8-point 1D [7]                            1682.2      594.5     33588.2      31168    485291.5    1671071.7    2340745.8    6748.6
32-point 1D [8]                           1587.4      630.1     81836.2      52111    853460.1    2796111.6    4384529.6   11398.5
32-point 1D [10]                          1889.4      529.2     89845.3      66789   1832311.1    5424219.3    3835311.8   17494.9
32-point 1D Proposed                      1399.7      714.4     42810.2      42578    517698.2    2218746.4    3333070.4    7770.8
32 × 32-point 2D Folded [3]               3967.8      252.0    361980.2     211072   3140026.7   11773025.1   17121276.8  114646.8
32 × 32-point 2D Folded [4]               1568.8      637.7    265778.1      65140    889125.9    7893432.2   10009573.4   28086.2
32 × 32-point 2D Folded [8]               1773.9      564.0    321985.1     172032   2054512.9    8343453.9   14794677.3   41044.7
32 × 32-point 2D Folded Proposed          1755.1      569.8    164754.3      57839    731227.5    3620249.4    6767937.4   18232.3
32 × 32-point 2D Parallel [3]             3835.0      260.7    441948.4     223040   3194824.4   13092679.2   18148370.2  119809.4
32 × 32-point 2D Parallel [4]             1568.1      637.7    367075.3     156717   5918420.1    9003839.2   10125521.3   29996.7
8 × 8-point 2D Parallel [5]               1762.9      567.2    170122.1      93829   1454342.1    4731477.9    9007501.5   24220.4
32 × 32-point 2D Parallel [8]             1589.0      629.3    401226.1     218432   2612410.6   10386085.2   19721696.1   47841.2
32 × 32-point 2D Parallel [9]             2256.3      443.2    385511.7     219539   2706847.8   10496871.8   18285502.7   64941.6
32 × 32-point 2D Parallel [10]            1899.5      526.4    467981.2     237872   3314227.7   14312679.2   20226511.2   65607.1
32 × 32-point 2D Parallel Proposed        1569.2      637.3    269967.8     131798   1315835.7    6982468.5   11017179.3   28245.0

Fig. 6. Chip layout diagram for proposed 32 × 32-point 2D-Integer DCT using folded architecture with core area as 181229.7 µm², die space around core as 60 µm, and total chip area as 235904.49 µm² using 45 nm technology.

TSMC library. The proposed 32 × 32-point parallel Integer DCT achieves a 59.1% improvement in worst path delay compared with the odd-even decomposition [3] based architecture.
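The derived columns of Table II can be cross-checked from the delay and power columns. The sketch below is a small calculator, assuming the units given in the table header (ps, nW, fJ) and the PDP definition stated in the text (worst path delay times the sum of switching and leakage powers). The function names are illustrative, and the input values are copied from the Table II rows for the proposed 32 × 32-point parallel design and the parallel odd-even design [3].

```python
def frequency_mhz(delay_ps):
    """Maximum clock frequency for a given worst path delay."""
    # 1 / (delay_ps * 1e-12 s) expressed in MHz.
    return 1e6 / delay_ps


def energy_per_operation_fj(delay_ps, switching_nw, leakage_nw):
    """PDP / EOP: delay times (switching + leakage) power."""
    # ps * nW = 1e-12 s * 1e-9 W = 1e-21 J = 1e-6 fJ.
    return delay_ps * (switching_nw + leakage_nw) * 1e-6


def improvement_percent(baseline, proposed):
    """Relative reduction of a metric with respect to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Proposed 32 x 32-point 2D parallel design (Table II).
    print(round(frequency_mhz(1569.2), 1))                                   # 637.3
    print(round(energy_per_operation_fj(1569.2, 6982468.5, 11017179.3), 1))  # 28245.0
    # Worst path delay versus the parallel odd-even design [3].
    print(round(improvement_percent(3835.0, 1569.2), 1))                     # 59.1
```

These three results agree with the frequency, EOP, and 59.1% figures reported for the proposed parallel architecture.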
[5] Honggang Qi, Qingming Huang, and Wen Gao, “A Low-Cost Very Large Scale Integration Architecture for Multi Standard Inverse Transform”, IEEE Transactions on Circuits and Systems - II, Express Briefs, vol. 57, no. 7, pp. 551-555, July 2010.
[6] Khan Wahid, Muhammad Martuza, Mousumi Das, and Carl McCrosky, “Resource Shared Architecture of Multiple Transforms for Multiple Video Codecs”, IEEE International Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 947-950, May 2011.
[7] Kanwen Wang, Jialin Chen, Wei Cao, Ying Wang, Lingli Wang, and Jiarong Tong, “A Reconfigurable Multi-Transform VLSI Architecture Supporting Video Codec Design”, IEEE Transactions on Circuits and Systems - II, Express Briefs, vol. 58, no. 7, pp. 432-436, July 2011.
[8] Yao Ziyou, He Weifeng, Hong Liang, He Guanghui, and Mao Zhigang, “Area and Throughput Efficient IDCT/IDST Architecture for HEVC Standard”, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2511-2514, June 2014.
[9] Hong Liang, He Weifeng, Zhu Hu, and Mao Zhigang, “A Cost Effective 2-D Adaptive Block Size IDCT Architecture for HEVC Standard”, IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1290-1293, Aug. 2013.
[10] Wenjun Zhao, Takao Onoye, and Tian Song, “High-Performance Multiplierless Transform Architecture for HEVC”, IEEE International Symposium on Circuits and Systems, pp. 1668-1671, May 2013.
[11] Mohamed Asan Basiri M and Noor Mahammad Sk, “An Efficient VLSI Architecture for Discrete Hadamard Transform”, IEEE International VLSI Design Conference, pp. 140-145, Jan. 2016.
[12] Ricardo Gonzalez, Benjamin M. Gordon, and Mark A. Horowitz, “Supply and Threshold Voltage Scaling for Low Power CMOS”, IEEE Journal of Solid State Circuits, vol. 32, no. 8, pp. 1210-1216, Aug. 1997.