
A 360Mbin/s CABAC Decoder for H.264/AVC Level 5.1 Applications

Yu Hong, Peilin Liu, Hang Zhang, Zongyuan You
Department of Electronic Engineering
Shanghai Jiao Tong University
Shanghai 200240, China

Dajiang Zhou, Satoshi Goto
Graduate School of Information, Production and Systems
Waseda University
2-7 Hibikino, 808-0135, Japan

Abstract—This paper presents a VLSI architecture of a CABAC decoder for H.264/AVC Level 5.1 applications. It adopts a symbol-prediction-based decision engine with extra-bypass decoding support and a four-stage bypass engine, along with dedicated arithmetic decoding modes, to increase the throughput rate. It also reduces the context model access time significantly by applying a Context Pre-fetch Register Set. The proposed design can decode an average of 1.08 bins per cycle and can operate at a maximum frequency of 333MHz using SMIC 0.13µm technology. Therefore, it is able to provide a throughput of 360Mbins/s, and hence can meet the requirements of Level 5.1 in the H.264/AVC standard.

Keywords-H.264; CABAC; Level 5.1

I. INTRODUCTION

Context-based Adaptive Binary Arithmetic Coding (CABAC) is an efficient entropy coding tool adopted in the H.264/AVC main and high profiles. While providing a better compression rate than the baseline entropy coding method, it also dramatically increases the computational complexity. The maximum bit rate of H.264/AVC Level 5.1 is 240Mbits/s. As a result, the CABAC decoder needs to process approximately 300 million bins per second. Even if a DSP processor could work at 3GHz, it would be difficult to accomplish the decoding of each bin within 10 cycles to meet this requirement, since the decoding involves many calculations and memory accesses. Therefore, in order to meet the requirements of Level 5.1 in the H.264/AVC standard, achieving CABAC decoding with a VLSI implementation is inevitable. Moreover, due to the syntax-element-level dependency, the implementation of CABAC decoding is more challenging than that of encoding. While state-of-the-art CABAC encoders can achieve throughput rates of more than 2 bins/cycle [1, 2], the typical CABAC decoder [7] can only process 0.86 bin in one cycle.

In recent years, several VLSI architectures have been proposed for CABAC decoding. Chen et al. [3] first proposed an architecture with optimization of the FSM. Yu et al. [4] used forty-four registers to store the context tables of one context group in context memory, so memory access in arithmetic decoding was eliminated. However, when switching context groups, there would be a long delay before the context tables of the new group were loaded into the registers. They also optimized the arithmetic decoding engine by concatenating two single-bin decoding units together. The side effect was that the critical path length was doubled, and thus the operating frequency and the performance of the whole decoder were constrained. Kim et al. [5] solved the critical path problem by adopting a most-probable-symbol-prediction-based scheme, but their decoder could process only one bin if the first bin was not the MPS, so the performance decreased.

In order to meet the requirements of H.264/AVC Level 5.1, the throughput rate should be increased, while the critical path length should be reduced to achieve a higher operating frequency. In this paper, a symbol-prediction-based scheme is proposed to solve the critical path problem in multi-bin decoding. Besides, an extra-bypass scheme for the decision engine and a four-stage bypass engine are also employed. Combined with the dedicated arithmetic decoding modes, the optimized engines can achieve a higher throughput rate compared to previous works. Moreover, a Context Pre-fetch Register Set (CPRS) is proposed to reduce the context model access time. Compared to the register set scheme used in [4], the long delay caused by context group switching is shortened.

The rest of the paper is organized as follows. Section II provides an overview of CABAC decoding. Section III presents the proposed decoding architecture and describes the optimization strategies in detail. Finally, experimental results and the conclusion are given in Section IV and Section V, respectively.

II. OVERVIEW OF CABAC DECODING

Fig. 1 shows the flow chart of CABAC decoding. At the beginning of a new slice, the context model and the arithmetic decoding engine are both initialized. After that, the binarization process is repeated to decode all the syntax elements in the slice until the syntax element end_of_slice_flag equals 1. The inputs of the binarization process are a series of bins decoded by the arithmetic engine.

There are three types of arithmetic decoding modes: bypass, terminate and decision. In this section we mainly focus on the decision mode because the others are much simpler. To decode a decision bin, ctxIdx is calculated first to acquire pStateIdx and valMPS from the context model. Then, the symbol type decision is carried out to determine whether this bin is the Most Probable Symbol (MPS) or the Least Probable Symbol (LPS), so the bin value can be calculated. After that, the new range and offset are renormalized, and the context model is updated for the decoding of the next bin.
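To make the decision-mode flow above concrete, the following C sketch follows the structure of the arithmetic decoding process defined in the H.264/AVC standard (the same flow the DB engine discussed later implements in hardware). The lookup tables rangeTabLPS, transIdxMPS and transIdxLPS are the standard's state-transition tables (contents omitted here), and read_bit() is an assumed bitstream helper; this is an illustrative software model, not the proposed hardware design.

/* Sketch of one decision-bin decode, following the flow in Section II.
 * ctx holds pStateIdx/valMPS for the selected ctxIdx; rangeTabLPS,
 * transIdxMPS and transIdxLPS are the standard's lookup tables (contents
 * omitted), and read_bit() is an assumed bitstream helper. */
typedef struct { unsigned pStateIdx; unsigned valMPS; } ContextModel;

extern const unsigned rangeTabLPS[64][4];
extern const unsigned transIdxMPS[64];
extern const unsigned transIdxLPS[64];
unsigned read_bit(void);                    /* assumed bitstream access */

static unsigned codIRange, codIOffset;      /* arithmetic engine state  */

unsigned decode_decision(ContextModel *ctx)
{
    unsigned binVal;
    unsigned rangeLPS = rangeTabLPS[ctx->pStateIdx][(codIRange >> 6) & 3];

    codIRange -= rangeLPS;                  /* tentative MPS sub-range   */
    if (codIOffset >= codIRange) {          /* symbol type decision: LPS */
        binVal      = !ctx->valMPS;
        codIOffset -= codIRange;
        codIRange   = rangeLPS;
        if (ctx->pStateIdx == 0)            /* LPS at state 0 flips MPS  */
            ctx->valMPS = !ctx->valMPS;
        ctx->pStateIdx = transIdxLPS[ctx->pStateIdx];
    } else {                                /* MPS                       */
        binVal = ctx->valMPS;
        ctx->pStateIdx = transIdxMPS[ctx->pStateIdx];
    }
    while (codIRange < 0x100) {             /* renormalization           */
        codIRange <<= 1;
        codIOffset  = (codIOffset << 1) | read_bit();
    }
    return binVal;
}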



Figure 1. Flow chart of CABAC decoding.

The bottlenecks of VLSI implementations of CABAC decoding exist in both the arithmetic decoding engines and the context model maintenance. The arithmetic decoding engines in conventional implementations can decode only one bin at a time, so the throughput rate is limited. As for a multi-bin decoding engine, the accumulated critical path length may constrain the operating frequency and hence the performance of the whole decoder. Therefore, reducing the critical path length is essential to achieve such a multi-bin decoding architecture. Besides, the context model is frequently accessed during the CABAC decoding process. In a VLSI implementation, on-chip SRAM is usually used to store the context tables. In order to decode one bin, the arithmetic engine has to wait two cycles for the loading and updating of the context tables. As a result, a lot of time is consumed on memory access. These bottlenecks are addressed in this paper with several proposed techniques. Details are discussed in Section III.

III. PROPOSED CABAC DECODING ARCHITECTURE

The proposed CABAC decoding architecture is shown in Fig. 2. Context tables in the Context Memory are initialized externally (e.g. by the CPU) before decoding each slice. Then, the Decode Controller instructs the whole decoder to process a whole slice. In this architecture, the engines in the arithmetic decoder are optimized to support the decoding of multiple bins per cycle according to the dedicated decoding modes. Meanwhile, the context model access time is significantly reduced by using the Context Pre-fetch Register Set (CPRS).

Figure 2. The proposed CABAC decoding architecture.

A. Optimizing the Arithmetic Decoding Engines

In this work, the decoding engines for decision bins (DB) and bypass bins (BP) are both optimized. Through optimizations of the decoding flow, the proposed DB engine achieves a better throughput rate, and solves the critical path problem of [4] without the performance decrease seen in [5]. In addition, a four-stage BP engine is employed to accelerate the decoding of bypass bins.

The conventional decoding flow first calculates range_mps to determine the symbol type of one bin, and then calculates the corresponding range and offset, which are used later to perform renormalization. As shown in Fig. 3(a), symbol type decision and renormalization are processed sequentially in conventional implementations. This flow results in a long logic delay, which may be acceptable for a single-bin engine, but is critical when two identical single-bin engines are concatenated together for better performance. In order to solve the long critical path problem, this paper proposes a symbol-prediction-based scheme. This scheme includes two renormalization modules for the two single-bin decoding units, for MPS and LPS respectively. Therefore, both renormalization processes can be performed in parallel with the symbol type decision. As shown in Fig. 3(b), the critical path of a basic single-bin decoding unit is shortened. With this scheme, three basic decoding units can predict all six cases for the decoding of one or two DBs. The results of the correct case are selected later according to the symbol type decision results.

The DB engine is also optimized to support extra-bypass decoding. Thus, it can decode one additional BP when parsing DBs. With this optimization, the frequently repeated coeff_sign_flag in CABAC decoding can be decoded as an extra BP if level_minus1 contains no suffix part. As a result, up to 384 cycles can be saved for each macroblock. The process of BP decoding is similar to that of renormalization. Therefore, the renormalization module is modified to support the decoding of an extra BP, as shown in Fig. 4.
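The dataflow behind the symbol-prediction-based scheme can be pictured in software as computing both candidate renormalization outcomes before the symbol type is known and then selecting one of them. The C sketch below is only an illustrative model of that idea, reusing the engine state and tables assumed in the earlier sketch (update_context() bundles the standard state transition); the actual design realizes it as parallel renormalization hardware and extends it to extra-bypass decoding.

/* Illustrative model of the scheme in Fig. 3(b): both renormalization
 * outcomes are prepared speculatively (in hardware, by two parallel
 * renormalization modules), and the symbol type decision only selects one.
 * codIRange/codIOffset, rangeTabLPS and the transIdx tables are as in the
 * previous sketch. */
typedef struct { unsigned range, offset, nbits; } RenormResult;

static RenormResult speculative_renorm(unsigned range, unsigned offset)
{
    RenormResult r = { range, offset, 0 };
    while (r.range < 0x100) {        /* count the shifts; the bit-stream   */
        r.range  <<= 1;              /* manager appends the new bits later */
        r.offset <<= 1;
        r.nbits++;
    }
    return r;
}

static void update_context(ContextModel *ctx, unsigned is_lps)
{
    if (is_lps) {
        if (ctx->pStateIdx == 0)
            ctx->valMPS = !ctx->valMPS;
        ctx->pStateIdx = transIdxLPS[ctx->pStateIdx];
    } else {
        ctx->pStateIdx = transIdxMPS[ctx->pStateIdx];
    }
}

unsigned decode_decision_predicted(ContextModel *ctx)
{
    unsigned rangeLPS  = rangeTabLPS[ctx->pStateIdx][(codIRange >> 6) & 3];
    unsigned range_mps = codIRange - rangeLPS;

    /* Both cases are prepared in parallel with the comparison below. */
    RenormResult mps_case = speculative_renorm(range_mps, codIOffset);
    RenormResult lps_case = speculative_renorm(rangeLPS, codIOffset - range_mps);

    unsigned is_lps = (codIOffset >= range_mps);    /* symbol type decision */
    RenormResult sel = is_lps ? lps_case : mps_case;
    unsigned binVal  = is_lps ? !ctx->valMPS : ctx->valMPS;

    codIRange  = sel.range;          /* sel.nbits new bit-stream bits are   */
    codIOffset = sel.offset;         /* shifted into the offset afterwards  */
    update_context(ctx, is_lps);
    return binVal;
}

In the hardware, three such units cover the six one- or two-DB cases mentioned above, and the renormalization path is further reused to emit one extra bypass bin (Fig. 4).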



Figure 3. Critical path comparison between (a) the conventional implementation and (b) the proposed symbol-prediction-based scheme.

Figure 4. Renormalization logic with extra-bypass support.

A four-stage bypass engine, which concatenates four basic bypass decoding units together, is also employed in this work. The decoding of the frequently occurring mvd and level_minus1 can be significantly accelerated with this engine.

B. Applying the Dedicated Arithmetic Decoding Modes

To make full use of the optimized engines, we have designed six dedicated decoding modes for DB decoding and three for BP decoding. The DB decoding modes are: Decode-one-DB (D1D), Decode-two-DBs (D2D), Stop-on-Zero (SoZ), Stop-on-One (SoO), Decode-one-DB-with-extra-BP (D1DBP) and Stop-on-Zero-with-extra-BP (SoZBP). When decoding syntax elements of the FL binarization type, using the D2D mode can double the decoding speed compared to using D1D only. The SoZ mode is designed to accelerate the decoding of U, TU and the prefix part of UEGk. The decoding of mb_type and sub_mb_type can also be accelerated by using combinations of D1D, D2D, SoZ and SoO. The SoZBP mode, which decodes one more BP after being stopped by the zero DB, is used to eliminate the extra cycle for decoding coeff_sign_flag after the prefix part of level_minus1. When it comes to the last bin in the prefix, D1DBP is used instead.

As for the BP decoding modes, when parsing the prefix part of UEGk, the Stop-on-Zero (BP_SoZ) mode is used first to determine the number of leading ones, and then the Fix-Length (BP_FL) mode is applied to decode the remaining BPs. In CABAC, the suffix part of level_minus1 is always followed by coeff_sign_flag. Thus, these two syntax elements can be combined and decoded together by using the Fix-Length-Plus1 (BP_FLP1) mode in this work. Similarly, as shown in Table I, other syntax elements can also be combined in order to accelerate the decoding process.

TABLE I. USING THE PROPOSED MODES ON COMBINATIONS OF SES

Syntax elements                                       | Corresponding modes               | Cycles saved for each MB
prev_intraNxN_pred_mode_flag & rem_intraNxN_pred_mode | SoO + D2D                         | up to 24
significant_coeff_flag & last_significant_coeff_flag  | SoZ                               | up to 384
level_minus1 & coeff_sign_flag                        | SoZBP (prefix) + BP_FLP1 (suffix) | up to 384

C. Reducing the Context Model Access Time

To decode one decision bin, the context tables should be loaded from and then written back to the Context Memory. Moreover, as the proposed DB engine can decode two decision bins at a time, even more memory accesses are required. In this paper, the Context Pre-fetch Register Set (CPRS) is proposed to reduce the context model access time. The context tables in the Context Memory are classified into several groups. Before decoding the syntax elements of one group, its context tables are loaded into the CPRS. Therefore, the arithmetic decoding engine no longer needs to access the Context Memory, and the memory access latency is eliminated. The CPRS also solves the long delay problem caused by context group switching in [4]. It contains two sets of registers. While one register set is used by the arithmetic engine, the other one can pre-fetch the most probable next context group. When group switching is required, if the new group has already been pre-fetched, the arithmetic engine can start working immediately. Even if the pre-fetch misses, the switching latency can still be reduced, since the writing back of the previous group can be concealed by arithmetic decoding. Fig. 5 illustrates how the CPRS accelerates context group switching. As shown in Table II, compared to the register set scheme without pre-fetch, the CPRS reduces the context group switching latency by more than 60%, and reduces the total decoding time by 10~20%.

TABLE II. COMPARISON IN CONTEXT GROUP SWITCHING LATENCY

Sequence type | Average switching latency per MB (register set without pre-fetch) | Average switching latency per MB (proposed CPRS) | Proportion reduced
I only        | 61.01                                                              | 17.49                                            | 71.33%
IPPP          | 42.35                                                              | 13.72                                            | 68%
IBBBP         | 34.35                                                              | 11.4                                             | 67%
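The double-buffered behaviour of the CPRS can be summarized with a small software model. The group size, the choice of the predicted next group and the memcpy-style transfers below are illustrative assumptions standing in for the on-chip memory interface (the ContextModel type is reused from the first sketch); they are not the register-transfer design itself.

/* Minimal model of the CPRS: two register banks, one in use by the
 * arithmetic engine while the other pre-fetches the most probable next
 * context group.  GROUP_SIZE, the context_memory array and the memcpy
 * transfers are illustrative stand-ins for the on-chip memory interface. */
#include <string.h>

#define GROUP_SIZE 32                         /* assumed contexts per group */

typedef struct {
    ContextModel regs[GROUP_SIZE];
    int          group_id;                    /* -1 when the bank is empty  */
} Bank;

static ContextModel context_memory[16 * GROUP_SIZE]; /* assumed Context Memory */
static Bank bank[2] = { [0].group_id = -1, [1].group_id = -1 };
static int  active  = 0;                      /* bank used by the engine    */

static void load_group(Bank *b, int group)
{
    memcpy(b->regs, &context_memory[group * GROUP_SIZE], sizeof(b->regs));
    b->group_id = group;
}

/* Runs in the background while the active group is being decoded. */
void prefetch_group(int predicted_group)
{
    load_group(&bank[1 - active], predicted_group);
}

/* Called when the decoder moves on to the syntax elements of a new group. */
ContextModel *switch_group(int new_group)
{
    /* Write the finished group back; in hardware this write-back is
     * concealed by the arithmetic decoding of the new group. */
    if (bank[active].group_id >= 0)
        memcpy(&context_memory[bank[active].group_id * GROUP_SIZE],
               bank[active].regs, sizeof(bank[active].regs));

    if (bank[1 - active].group_id == new_group)
        active = 1 - active;                  /* pre-fetch hit: no waiting  */
    else
        load_group(&bank[active], new_group); /* miss: reload immediately   */

    return bank[active].regs;
}

On a hit the engine continues without stalling, which is where the latency reduction reported in Table II comes from.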

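Returning to the bypass modes of Section III-B, their composition can also be sketched in software. decode_bypass() below models the standard bypass decode on the engine state from the first sketch, and the three helper routines mirror BP_SoZ, BP_FL and BP_FLP1 only loosely; in the proposed design these operations are carried out by the four-stage BP engine on up to four bins per cycle rather than one bin at a time.

/* Software-level picture of how the three BP modes of Section III-B compose
 * when reading the bypass-coded suffix of level_minus1 followed by
 * coeff_sign_flag.  decode_bypass() models the standard bypass decode on
 * the codIRange/codIOffset state from the first sketch. */
static unsigned decode_bypass(void)
{
    codIOffset = (codIOffset << 1) | read_bit();
    if (codIOffset >= codIRange) {
        codIOffset -= codIRange;
        return 1;
    }
    return 0;
}

/* BP_SoZ: consume leading ones until a zero stops the run. */
static unsigned bp_stop_on_zero(void)
{
    unsigned leading_ones = 0;
    while (decode_bypass())
        leading_ones++;
    return leading_ones;
}

/* BP_FL: read a fixed number of bypass bins as an unsigned value. */
static unsigned bp_fixed_length(unsigned nbits)
{
    unsigned v = 0;
    while (nbits--)
        v = (v << 1) | decode_bypass();
    return v;
}

/* BP_FLP1: the same fixed-length read plus one extra bin for the sign, so
 * the level suffix and coeff_sign_flag are resolved together. */
static int bp_fixed_length_plus_sign(unsigned nbits)
{
    unsigned magnitude = bp_fixed_length(nbits);
    unsigned sign      = decode_bypass();
    return sign ? -(int)magnitude : (int)magnitude;
}

Reading the sign as the trailing bin in bp_fixed_length_plus_sign() is the software analogue of the BP_FLP1 and extra-bypass optimizations that remove the separate coeff_sign_flag cycle.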


TABLE III. COMPARISON OF THE PROPOSED DESIGN AND PREVIOUS WORKS

                      | Yu [4]                  | Kim [5]         | Yi [6]           | Yang [7]                      | Proposed
Technology            | 0.18µm                  | 0.18µm          | 0.18µm           | 0.18µm                        | 0.13µm
Logic Gate Count/Area | 0.3mm²                  | NA              | 81162            | 76333                         | 47081
Context Memory        | 3360 bits (single-port) | NA              | 5296 bits        | 3528 bits                     | 3472 bits (single-port)
Cycles/MB             | 500                     | NA              | NA               | 177 (a), 396 (b)              | 172 (a), 326 (b)
Throughput Rate       | NA                      | 0.41 bins/cycle | 0.254 bins/cycle | 0.71 (a), 0.86 (b) bins/cycle | 0.73 (a), 1.08 (b) bins/cycle
Max. Frequency        | 150MHz                  | NA              | 225MHz           | 140MHz                        | 333MHz
Max. Throughput       | NA                      | NA              | 57Mbins/s        | 120Mbins/s                    | 360Mbins/s

(a) When decoding 720x480 streams with a bit rate of 4Mbits/s.
(b) When decoding 1920x1088 streams with a bit rate of 60Mbits/s.
Figure 5. Context group switching latency in different cases: (a) without register set, (b) when pre-fetch hits, (c) when pre-fetch misses.

IV. EXPERIMENTAL RESULTS

The proposed CABAC decoder is implemented in Verilog HDL and simulated with NC-Verilog. It has been fabricated as part of a multi-standard video decoder chip [8] using SMIC 0.13µm 5-metal-layer technology. The chip layout is shown in Fig. 6. The standalone CABAC decoder can work at a maximum frequency of 333MHz. It uses a single-port 124x28-bit SRAM as the Context Memory, and the total logic gate count is 47081. Table III summarizes the features of the proposed and previous works. According to the verification results, when processing video sequences with bit rates of 60Mbits/s (1920x1088@30fps) and 240Mbits/s (4000x2000@30fps), the average throughput rate of the proposed decoder is 1.08 bins/cycle. Working at 333MHz, it can achieve a high throughput of 360Mbins/s. Thus, it is suitable for H.264/AVC Level 5.1 applications, which require a throughput of approximately 300Mbins/s.

Figure 6. Chip layout.

V. CONCLUSION

This paper presents a CABAC decoder for H.264/AVC Level 5.1 applications. First of all, in order to optimize the critical path for multi-bin decoding, a symbol-prediction-based scheme is applied. Additionally, the extra-bypass scheme for the decision engine and the four-stage bypass engine are both employed to increase the throughput rate. Moreover, the context model access time is significantly reduced with the use of the CPRS. Compared to the register set scheme without pre-fetch, it can reduce 50%~70% of the delay caused by context group switching, and hence increase the performance by 7%~20%. Considered as a whole, the proposed CABAC decoder reduces the total hardware cost by 42% and 38%, while improving the throughput rate by 4.25 and 1.26 times, in comparison with the existing works [6] and [7], respectively. Working at 333MHz, the proposed decoder is fast enough for real-time decoding of H.264/AVC Level 5.1 streams.

REFERENCES

[1] R. R. Osorio and J. D. Bruguera, "High-throughput architecture for H.264/AVC CABAC compression system", IEEE Trans. CSVT, vol. 16, no. 11, pp. 1376-1384, Nov. 2006.
[2] G. Pastuszak, "A high-performance architecture of the double-mode binary coder for H.264/AVC", IEEE Trans. CSVT, vol. 18, no. 7, pp. 949-960, July 2008.
[3] J.-W. Chen, C.-R. Chang and Y.-L. Lin, "A hardware accelerator for context-based adaptive binary arithmetic decoding in H.264/AVC", IEEE Symp. Circuits and Systems, vol. 5, pp. 4525-4528, May 2005.
[4] W. Yu and Y. He, "A high performance CABAC decoding architecture", IEEE Trans. Consumer Electronics, vol. 51, no. 4, pp. 1352-1359, Nov. 2005.
[5] C.-H. Kim and I.-C. Park, "High speed decoding of context-based adaptive binary arithmetic codes using most probable symbol prediction", IEEE Symp. Circuits and Systems, pp. 1707-1710, May 2006.
[6] Y. Yi and I.-C. Park, "High-speed H.264/AVC CABAC decoding", IEEE Trans. CSVT, vol. 17, no. 4, pp. 490-494, April 2007.
[7] Y.-C. Yang and J.-I. Guo, "A high throughput H.264/AVC high profile CABAC decoder for HDTV applications", IEEE Trans. CSVT, in press.
[8] D. Zhou, et al., "A 1080p@60fps multi-standard video decoder chip designed for power and cost efficiency in a system perspective", IEEE Symp. VLSI Circuits, Kyoto, Japan, Jun. 16-18, 2009.

