Yuhong 2009

H.264/AVC Level 5.1 Applications

Yu Hong, Peilin Liu, Hang Zhang, Zongyuan You
Department of Electronic Engineering
Shanghai Jiao Tong University
Shanghai 200240, China

Dajiang Zhou, Satoshi Goto
Graduate School of Information, Production and Systems
Waseda University
2-7 Hibikino, 808-0135, Japan
Abstract—This paper presents a VLSI architecture of a CABAC decoder for H.264/AVC Level 5.1 applications. It adopts a symbol-prediction-based decision engine with extra-bypass decoding support, a four-stage bypass engine, and dedicated arithmetic decoding modes to increase the throughput rate. It also significantly reduces the context model access time by applying a Context Pre-fetch Register Set. The proposed design decodes an average of 1.08 bins per cycle and operates at a maximum frequency of 333 MHz in SMIC 0.13 µm technology. It therefore provides a throughput of 360 Mbins/s and meets the requirements of Level 5.1 of the H.264/AVC standard.

Keywords—H.264; CABAC; Level 5.1

I. INTRODUCTION

Context-based Adaptive Binary Arithmetic Coding (CABAC) is an efficient entropy coding tool adopted in the H.264/AVC main and high profiles. While providing a better compression rate than the baseline entropy coding method, it also dramatically increases computational complexity. The maximum bit rate of H.264/AVC Level 5.1 is 240 Mbit/s, so the CABAC decoder needs to process approximately 300 million bins per second. Even a DSP processor running at 3 GHz would find it difficult to decode each bin within 10 cycles, since decoding involves many calculations and memory accesses. Therefore, to meet the requirements of Level 5.1 of the H.264/AVC standard, a VLSI implementation of CABAC decoding is inevitable. Moreover, due to the syntax-element-level dependency, implementing CABAC decoding is more challenging than implementing encoding: while state-of-the-art CABAC encoders achieve throughput rates of more than 2 bins/cycle [1, 2], a typical CABAC decoder [7] can process only 0.86 bin per cycle.

In recent years, several VLSI architectures have been proposed for CABAC decoding. Chen et al. [3] first proposed an architecture with an optimized FSM. Yu et al. [4] used forty-four registers to store the context tables of one context group of the context memory, eliminating memory accesses during arithmetic decoding. However, when switching context groups, there is a long delay before the context tables of the new group are loaded into the registers. They also optimized the arithmetic decoding engine by concatenating two single-bin decoding units; the side effect was that the critical path length was doubled, which constrained the operating frequency and the performance of the whole decoder. Kim et al. [5] solved the critical path problem by adopting a most-probable-symbol-prediction-based scheme, but their decoder could process only one bin when the first bin was not the MPS, so performance decreased.

To meet the requirements of H.264/AVC Level 5.1, the throughput rate must be increased while the critical path is kept short enough for a high operating frequency. In this paper, a symbol-prediction-based scheme is proposed to solve the critical path problem in multi-bin decoding. In addition, an extra-bypass scheme for the decision engine and a four-stage bypass engine are employed. Following the dedicated arithmetic decoding modes, the optimized engines achieve a higher throughput rate than previous works. Moreover, a Context Pre-fetch Register Set (CPRS) is proposed to reduce the context model access time; compared with the register set scheme used in [4], the long delay caused by context group switching is shortened.

The rest of the paper is organized as follows. Section II provides an overview of CABAC decoding. Section III presents the proposed decoding architecture and describes the optimization strategies in detail. Finally, experimental results and the conclusion are given in Sections IV and V, respectively.

II. OVERVIEW OF CABAC DECODING

Fig. 1 shows the flow chart of CABAC decoding. At the beginning of a new slice, the context model and the arithmetic decoding engine are both initialized. After that, the binarization process is repeated to decode all the syntax elements in the slice until the syntax element end_of_slice_flag equals 1. The inputs of the binarization process are a series of bins decoded by the arithmetic engine.

There are three types of arithmetic decoding modes: bypass, terminate and decision. This section mainly focuses on the decision mode, because the other two are much simpler. To decode a decision bin, ctxIdx is first calculated in order to acquire pStateIdx and valMPS from the context model. Then, a symbol type decision determines whether the bin is the Most Probable Symbol (MPS) or the Least Probable Symbol (LPS), so that the bin value can be calculated. After that, the new range and offset are renormalized, and the context model is updated for the decoding of the next bin.
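The decision-bin flow just described can be sketched as a small software model. This is only an illustration of the control flow, not the paper's hardware: the LPS-range lookup and the state-transition updates below are simplified stand-ins for the rangeTabLPS and transIdx tables defined in the H.264/AVC standard.

```python
# Illustrative model of decision-mode decoding (not the paper's RTL).
# ctx = [pStateIdx, valMPS]; rng/offset are the arithmetic decoder
# state; read_bit yields the next bit-stream bit.

def decode_decision(ctx, rng, offset, read_bit):
    p_state, val_mps = ctx
    # Stand-in for rangeTabLPS[pStateIdx][(rng >> 6) & 3].
    range_lps = max(2, (rng >> 2) >> (p_state % 4))
    rng -= range_lps                       # range_mps
    if offset < rng:                       # symbol type decision: MPS
        bin_val = val_mps
        ctx[0] = min(p_state + 1, 62)      # stand-in for transIdxMPS
    else:                                  # symbol type decision: LPS
        offset -= rng
        rng = range_lps
        bin_val = 1 - val_mps
        if p_state == 0:
            ctx[1] = 1 - val_mps           # valMPS flips at state 0
        ctx[0] = max(p_state - 1, 0)       # stand-in for transIdxLPS
    while rng < 256:                       # renormalization
        rng <<= 1
        offset = (offset << 1) | read_bit()
    return bin_val, rng, offset
```

Note how renormalization sits strictly after the symbol type decision here; Section III-A is concerned with removing exactly this serial dependency.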
Figure 1. Flow chart of CABAC decoding.

The bottlenecks of VLSI implementations of CABAC decoding lie in both the arithmetic decoding engines and the maintenance of the context model. The arithmetic decoding engines in conventional implementations can decode only one bin at a time, so the throughput rate is limited. As for a multi-bin decoding engine, the accumulated critical path length may constrain the operating frequency and hence the performance of the whole decoder; reducing the critical path length is therefore essential for such a multi-bin decoding architecture. Besides, the context model is frequently accessed during the CABAC decoding process. In a VLSI implementation, on-chip SRAM is usually used to store the context tables, and the arithmetic engine has to wait two cycles per bin for the loading and updating of those tables, so a great deal of time is consumed on memory access. These bottlenecks are resolved in this paper with several proposed techniques, discussed in detail in Section III.

III. PROPOSED CABAC DECODING ARCHITECTURE

The proposed CABAC decoding architecture is shown in Fig. 2. The context tables in the Context Memory are initialized externally (e.g. by a CPU) before each slice is decoded. The Decode Controller then instructs the whole decoder to process the slice. In this architecture, the engines in the arithmetic decoder are optimized to decode multiple bins per cycle according to the dedicated decoding modes. Meanwhile, the context model access time is significantly reduced by using the Context Pre-fetch Register Set (CPRS).

Figure 2. The proposed CABAC decoding architecture (Register I/F, Context Memory, CPRS, arithmetic decoder with DB/BP/TB engines, Bit-stream Manager, and bit-stream storage in DRAM accessed via DMA).

A. Optimizing the Arithmetic Decoding Engines

In this work, the decoding engines for decision bins (DB) and bypass bins (BP) are both optimized. Through optimizations of the decoding flow, the proposed DB engine achieves a better throughput rate, and solves the critical path problem of [4] without the performance decrease seen in [5]. In addition, a four-stage BP engine is employed to accelerate the decoding of bypass bins.

The conventional decoding flow first calculates range_mps to determine the symbol type of a bin, and then calculates the corresponding range and offset, which are used later to perform renormalization. As shown in Fig. 3(a), symbol type decision and renormalization are thus processed sequentially in conventional implementations. This flow results in a long logic delay, which may be acceptable for a single-bin engine but becomes critical when two identical single-bin engines are concatenated for better performance. To solve the long critical path problem, this paper proposes a symbol-prediction-based scheme. The scheme includes two renormalization modules, for the MPS and LPS cases of a single-bin decoding unit respectively, so that both renormalization processes can be performed in parallel with the symbol type decision. As shown in Fig. 3(b), the critical path of a basic single-bin decoding unit is thereby shortened. With this scheme, three basic decoding units can predict all six cases for the decoding of one or two DBs, and the results of the correct case are selected afterwards according to the symbol type decision results.

The DB engine is also optimized to support extra-bypass decoding, so that it can decode one additional BP while parsing DBs. With this optimization, the frequently repeated coeff_sign_flag in CABAC decoding can be decoded as an extra BP when level_minus1 contains no suffix part; as a result, up to 384 cycles can be saved for each macroblock. Since the process of BP decoding is similar to that of renormalization, the renormalization module is modified to support the decoding of the extra BP, as shown in Fig. 4.
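As a rough software analogue of this subsection's idea (illustrative only; the function names are ours, and renormalization here just shifts, reporting how many bit-stream bits the hardware would shift in), both renormalized outcomes are precomputed and the symbol type decision merely selects one of them:

```python
def renorm(rng, offset):
    # Renormalize; report how many bit-stream bits would be shifted in
    # (the shifted-in bits themselves are omitted to keep the sketch simple).
    shift = 0
    while rng < 256:
        rng <<= 1
        shift += 1
    return rng, offset << shift, shift

def decode_decision_predicted(range_lps, rng, offset, val_mps):
    range_mps = rng - range_lps
    # Speculate both outcomes, renormalizing each "in parallel" with the
    # symbol type decision rather than after it.
    mps_case = (val_mps,) + renorm(range_mps, offset)
    lps_case = (1 - val_mps,) + renorm(range_lps, offset - range_mps)
    # The comparison now only selects between precomputed results.
    return mps_case if offset < range_mps else lps_case
```

Three such units, covering the symbol combinations of one or two decision bins, correspond to the six predicted cases mentioned above.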
Comparison with previous works (the four left columns are prior designs; the rightmost is the proposed design):

                  Prior 1                   Prior 2           Prior 3           Prior 4           Proposed
Context Memory    3360 bits (single-port)   NA                5296 bits         3528 bits         3472 bits (single-port)
Cycles/MB         500                       NA                NA                177a, 396b        172a, 326b
Throughput Rate   NA                        0.41 bins/cycle   0.254 bins/cycle  0.71a, 0.86b      0.73a, 1.08b
Max. Frequency    150 MHz                   NA                225 MHz           140 MHz           333 MHz
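The headline figures above can be cross-checked with one line of arithmetic: 1.08 bins/cycle at 333 MHz gives about 360 Mbins/s, which covers the roughly 300 Mbins/s implied by the 240 Mbit/s cap of Level 5.1.

```python
# Sanity check of the reported throughput against the Level 5.1 need.
bins_per_cycle = 1.08
freq_hz = 333e6
throughput = bins_per_cycle * freq_hz     # bins per second
assert throughput >= 300e6                # approx. Level 5.1 requirement
print(round(throughput / 1e6))            # prints 360 (Mbins/s)
```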