Decode to Encode
California, USA
This work is subject to copyright. No parts of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording,
or otherwise, without the prior written permission of the copyright owner.
This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold,
hired out, or otherwise circulated without the publisher’s prior consent in any form of binding or cover
other than that in which it is published and without a similar condition including this condition being
imposed on the subsequent purchaser. Under no circumstances may any part of this book be
photocopied for resale.
Avinash Ramachandran
[email protected]
www.decodetoencode.com
2 VIDEO COMPRESSION
2.1 LANDSCAPE
2.2 WHY IS VIDEO COMPRESSION NEEDED?
2.3 HOW IS VIDEO COMPRESSED?
2.3.1 Spatial Pixel Redundancy
2.3.2 Exploiting Temporal Pixel Correlation
2.3.3 Entropy Coding
2.3.4 Exploiting Psycho-Visual Redundancies
2.3.5 8-bit versus 10-bit encoding
2.4 SUMMARY
2.5 NOTES
3 EVOLUTION OF CODECS
5 INTRA PREDICTION
6 INTER PREDICTION
7 RESIDUAL CODING
8 ENTROPY CODING
9 FILTERING
10.1 CONSTRAINTS
10.2 DISTORTION MEASURES
10.2.1 Sum of Absolute Differences
10.2.2 SATD (Sum of Absolute Transform Differences)
10.3 FORMULATION OF THE ENCODING PROBLEM
10.4 RATE DISTORTION OPTIMIZATION
10.5 RATE CONTROL CONCEPTS
10.5.1 Bit Allocation
10.5.2 RDO in Rate Control
10.5.3 Summary of Rate Control Mechanism
10.6 ADAPTIVE QUANTIZATION (AQ)
10.7 SUMMARY
10.8 NOTES
11 ENCODING MODES
12 PERFORMANCE
13 ADVANCES IN VIDEO
Avinash Ramachandran
November 2018
Acknowledgements
I would like to acknowledge everyone whose help I have benefitted from in
many ways in bringing shape to this book. The examples presented in the
book use the raw video akiyo (CIF), in_to_tree (courtesy SVT Sweden),
Stockholm and DOTA2 sequences hosted on the Xiph.org Video Test Media
[derf's collection] website. The work and efforts of experts in standards
organizations, companies and associations like JCT-VC, IEEE, Fraunhofer
HHI, Google and universities driving state-of-the-art research in Video
Compression technologies have contributed significantly to this book. I
would like to thank authors and contributors of several excellent books on the
subject, online technical blog articles and research papers whose material I
consulted while writing this book. An extensive list of these sources is also
included in the resources section of the book. Special thanks for review,
discussions, and support are also due to Rakesh Patel, Oliver Gunasekara,
Yueshi Shen, Tarek Amara, Ismaeil Ismaeil, Akrum Elkhazin, Harry
Ramachandran and Edward Hong. Thanks to Susan Duncan for all the
editorial assistance. This book has been possible only because of the
unstinting support of my wonderful family: my mother Vasanthi, my wife
Veena and our children Agastya and Gayatri.
Organization of the Book
This book is organized into three parts. Part I introduces the reader to digital
video in general and lays the groundwork for a foray into video compression.
It provides details of basic concepts relevant to digital video as well as
insights into how video is compressed and the characteristics of video that we
take advantage of in order to compress it. It also covers how exactly these
characteristics are exploited progressively to achieve the significant level of
compression that we have today. Part I concludes by providing a brief history
of the evolution of video codecs and summarizes the important video
compression standards and their constituent coding tools.
Building on this foundation, Part II focuses on all the key compression
technologies. It starts by covering in detail the block-based architecture of a
video encoder and decoder that is employed in all video coding standards,
including H.264, H.265, VP9, and AV1. Each chapter in Part II explains one
core technique, or block, in the video encoding and decoding pipeline. These
include Intra prediction, inter prediction, motion compensation, transform
and quantization, loop filtering and rate control. I have generously illustrated
these techniques with numerical and visual examples to help provide an
intuitive understanding. Well-known, industry-recognized clips, including
in_to_tree, stockholm, akiyo, and DOTA2 have been used throughout the
book for these illustrations. I have also attempted to provide explanations of
not just the overall signal flow but also why things are done the way they are.
Equipped with all the essential technical nuts and bolts, you will then be
ready to explore, in Part III, how all these nitty-gritties together make up an
encoder, how to configure one and how to use it in different application
scenarios. Part III presents different application scenarios and shows how
encoders are tuned to achieve compression using the tools that were detailed
in Part II. Specifically, the section explains in detail the various bit rate
modes, quality metrics and availability, and performance testing of different
codecs. Part III concludes with a chapter on upcoming developments in the
video technology space, including content-specific, per-title optimized
encoding, the application of machine learning tools in video compression,
video coding tools in the next generation AV1 coding standard, and also
compression for new experiential video platforms like 360 Video and VR.
I hope that the book is able to illustratively convey the entire video
compression landscape and to inspire the reader toward further exploration,
collaborations, and pioneering research in the exciting and rapidly-advancing
field of video coding.
Part I
1 Introduction to Digital Video
In this chapter, we explore how visual scenes are represented digitally. I will
explain various specialized terms used in digital video. This is useful before
we explore the realm of digital video compression. If you have a working
knowledge of uncompressed digital video, you may briefly skim through this
section or skip it entirely and proceed to the next chapter. Once you complete
this chapter, you will better understand how terms like sampling, color spaces
and bit depths apply to digital video.
Digital video is the digital representation of a continuous visual scene. In its
simplest sense, a visual scene in motion can be represented as a series of still
pictures. When these still pictures are consecutively displayed in rapid
progression, the human eye interprets the pictures as a moving scene rather
than perceiving the individual images. This is why, during the early days of
filmmaking, it was called moving pictures. This, over time, became
condensed to movies.
MOVIES = Moving + Pictures
To capture a visual scene and represent it digitally, the cameras therefore
temporally sample the scene; that is, they derive still images from the scene at
intervals over time. The method of capturing and displaying a complete
picture at regular intervals of time results in what is referred to in the industry as
progressive video. Also, every temporal image is spatially sampled to get the
individual digital pixels.
1.2 Sampling
So, what is sampling? Sampling is the conversion of a continuous signal to a
discrete signal. This can be done in space and time. In case of video, the
visual scene sampled at a point in time produces a frame or picture of the
digital video. Normally, scenes are sampled at 30 or 25 frames per second;
however, a 24 frames-per-second sampling rate is used for movie production.
When the frames are rapidly played back at the rate at which they were
sampled (frames per second or fps), they produce the motion picture effect.
Each frame, in turn, is composed of up to three components: only one is
needed to represent monochrome pictures, while the remaining two are
included only for color images. These components are obtained by sampling
the image spatially, and together they make up the samples called pixels or
pels. Thus, every pixel in
a video has one component (for monochrome) or three components (for
color). The number of spatial pixels that make a video frame determines how
accurately the source has been captured and represented. As shown in Figure
2, this is a 2-dimensional array of horizontal and vertical sample points in the
visual scene, and the total number of these sample points is the parameter called video resolution.
Mathematically, this can be expressed as:
Resolution = H (in pixels) x V (in pixels)
Thus, if a video has a resolution of 1920x1080, this means that it has 1920
pixel samples in each row and 1080 rows of pixels. It should be noted that
the resolution of the video usually refers to the first component of the video,
namely, luminance, while the two color, or chrominance, components may be
sampled at the same or lower sampling ratios. The resolution, combined with
the frame capture rate expressed in frames per second, determines the
captured digital image's degree of fidelity to the original visual scene. In turn,
this also determines how much processing and bandwidth is needed to
efficiently encode the video for transmission and storage.
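To make this concrete, here is a small Python sketch (my own illustration, not a formula from any standard; the function name and the example parameters are assumptions) that computes the raw, uncompressed data rate from the resolution, frame rate, bit depth and chroma format. Chroma subsampling is covered in Chapter 2; for now it is enough to know that 4:2:0 video carries 1.5 samples per pixel on average.

```python
def raw_bitrate_bps(width, height, fps, bit_depth=8, samples_per_pixel=1.5):
    """Raw (uncompressed) data rate in bits per second.

    samples_per_pixel: 1.5 for 4:2:0, 2.0 for 4:2:2, 3.0 for 4:4:4
    (luma contributes 1 sample per pixel; chroma adds the remainder).
    """
    samples_per_frame = width * height * samples_per_pixel
    return samples_per_frame * bit_depth * fps

# 1080p30, 8-bit, 4:2:0: roughly 0.75 Gbit/s of raw video.
print(raw_bitrate_bps(1920, 1080, 30) / 1e9)

# 2160p60, 10-bit, 4:2:0: roughly 7.46 Gbit/s, the figure quoted in the
# Chapter 2 summary, and the reason compression is essential.
print(raw_bitrate_bps(3840, 2160, 60, bit_depth=10) / 1e9)
```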
Figure 3, below, shows how both the brightness component (called
luminance) and color components (chrominance) are sampled in a typical
picture. In this example, the chrominance is subsampled by a factor of 2, both
horizontally and vertically, compared to the luminance. Figure 4 illustrates
how the temporal sampling into pictures and the spatial sampling into pixels
comprise the digital representation of an entire video sequence.
Figure 6, below, shows a 16x16 block in the frame that, when zoomed in,
clearly shows the varying color shades in the 16 x 16 square matrix of pixels.
Every small square block in this zoomed image corresponds to a pixel that is
composed of three components with unique values.
Figure 6: A 16x16 block of pixels in a frame that illustrates the 256 square pixels [1].
1.5 HDR
Video technology continues to evolve from high definition (HD) resolutions
to ultra HD (UHD) and beyond. The new technologies offer four or more
times the resolution of HD. Given this evolution, it becomes important to
better represent this high density of pixels and the associated colors in order
to achieve the enhanced viewing experience that they make possible. Various
methods have been explored to improve the representation of a digital video
scene, including these two:
1. Improved spatial and temporal digital sampling. This includes
techniques to provide faster frame rates and higher resolutions, as
mentioned above.
2. Better representation of the values of the pixels. Several
techniques are incorporated in HDR video to improve how various
colors and shades are represented in the pixels.
While the former mostly deals with different ways of pixel sampling, the
latter focuses on every individual pixel itself. This is an enhancement over
the traditional standard dynamic range (SDR) video. High dynamic range
video provides a very significant improvement in the video viewing
experience by incorporating improvements in all aspects of pixel
representations. In this section, we shall explore how this is done.
HDR, as the name indicates, provides improved dynamic range, meaning that
it extends the complete range of luminance values, thereby providing richer
detail in terms of the tone of dark and light shades. It doesn't stop there but
also provides improvements in the representation of colors. The
overall result is a far more natural rendering of a video scene.
It’s important to note that HDR technology for video is quite different from
the technology used in digital photography that also uses HDR terminology.
In photography, different exposure values are used, and the captures are
blended to expand the dynamic range of pixels by creating several local
contrasts. However, every capture still uses the same 8-bit depth and 256
levels of brightness. HDR in video extends beyond just the dynamic range
expansion to encompass the following [2]:
- high dynamic range with higher peak brightness and lower black levels,
offering richer contrast;
- improved color space or wide color gamut (WCG), specifically, a new
color standard called Rec. 2020 that replaces the earlier Rec. 709 used in
SDR;
- improved bit depth, either 10-bit (distribution) or 12-bit (production)
used instead of traditional 8-bits in SDR;
- improved transfer function, for instance, PQ, HLG etc. used instead of
the earlier gamma function (a short sketch of the PQ mapping follows Table 2);
- improved metadata, in that HDR includes the addition of static (for the
entire video sequence) and dynamic (scene or picture-specific) metadata
which aid in enhanced rendering.
Table 2: Summary of enhancements in HDR video over earlier SDR.
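To make the transfer function item above concrete, the following Python sketch (an illustration of the published SMPTE ST 2084 formula, not code from any HDR system) evaluates the PQ inverse EOTF, which maps absolute luminance in nits to a normalized signal, and then quantizes it to a 10-bit code value. Note how SDR-level brightness occupies only about half of the available code range, leaving the rest for HDR highlights.

```python
def pq_inverse_eotf(nits):
    """Map absolute luminance in cd/m^2 (nits) to a normalized PQ signal in [0, 1].

    Uses the SMPTE ST 2084 (PQ) constants; 1.0 corresponds to 10,000 nits.
    """
    m1 = 2610 / 16384          # ~0.1593
    m2 = 2523 / 4096 * 128     # 78.84375
    c1 = 3424 / 4096           # 0.8359375
    c2 = 2413 / 4096 * 32      # 18.8515625
    c3 = 2392 / 4096 * 32      # 18.6875
    y = max(nits, 0.0) / 10000.0
    return ((c1 + c2 * y ** m1) / (1 + c3 * y ** m1)) ** m2

def to_code_value(signal, bit_depth=10):
    """Quantize the normalized signal to a full-range integer code value."""
    return round(signal * (2 ** bit_depth - 1))

# 100 nits (typical SDR peak white) lands near code 520 of 1023; the codes
# above that carry the extended highlight range that SDR cannot represent.
for nits in (0.1, 100, 1000, 10000):
    print(nits, to_code_value(pq_inverse_eotf(nits)))
```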
1.5.4 Metadata
Video signals are compressed before transmission using an encoder at the
source. The compressed signals are then decoded at the receiving end. When
the decoder receives the encoded stream and produces raw YUV pixels, these
pixels will need to be displayed on a computer screen or display. This means
the YUV values will need to be converted to the correct color space. This
includes a color model and the associated transfer function. These define how
the pixel values get converted into light photons on the display panels.
Modern codecs have provisions for signaling the color space in the bitstream
headers (for example, in the video usability information, or VUI) and in
supplemental enhancement information (SEI) messages.
However, what happens when the display device doesn't support the color
space with which the source is produced? In this case, it’s important to use
the source color space characteristics and convert color spaces at the decoder
end to the display’s supported format. This is crucial to ensure that colors
aren't displayed incorrectly when displays are incompatible with the source
color space.
The HDR enhancements support a mechanism to interpret the characteristics
of encoded pictures and use this information as part of the decoding process,
incorporating a greater variety of transfer functions. As explained earlier, if
content produced using a specific transfer function at the source goes through
various transformations in the video processing and transmission workflow
and then gets mapped using another transfer function by the display device,
the content ends up perceptibly degraded. HDR standards provide enhanced
metadata mechanisms to convey the transfer function details from the source
to the decoding and display devices.
1.6 Summary
Digital video is the digital representation of a continuous visual
scene that is obtained by sampling in time to produce frames, which
in turn is spatially sampled to obtain pixels.
Colors in the real world are converted to pixel values using color
spaces.
The number of bits used to represent a pixel, called the bit depth,
determines how accurately the visual information is captured from the
source. Video typically uses a bit depth of 8 or 10 bits.
HDR technology improves the visual experience by enhancing pixel
representation. It incorporates advanced dynamic range, higher bit
depth, advanced color space, and transfer functions.
1.7 Notes
1. in_to_tree. xiph.org. Xiph.org Video Test Media [derf's collection].
https://media.xiph.org/video/derf/. Accessed September 21, 2018.
2. High Dynamic Range Video: Implementing Support for HDR Using
Software-Based Video Solutions. AWS Elemental Technologies.
https://goo.gl/7SMNu3. Published 2017. Accessed September 21,
2018.
3. I, Sakamura. File:CIExy1931.svg, CIE 1931 color space.
Wikimedia Commons.
https://commons.wikimedia.org/wiki/File:CIExy1931.svg.
Published July 13, 2007. Updated June 9, 2011. Accessed
September 20, 2018.
2 Video Compression
As explained earlier, digital video is the representation of a visual scene using
a series of still pictures. In the previous chapter, we saw how digital video is
represented and explained concepts like sampling and color spaces. In this
chapter, we explore why video needs to be compressed and characteristics in
the video signal that are exploited to achieve this. In a nutshell, video
compression primarily focuses on how to take the contiguous still pictures,
identify and remove redundancies in them and minimize the information
needed to represent the video sequence.
2.1 Landscape
Video compression is essential to store and transmit video signals. Typically,
video originates from a variety of sources like live sports and news events,
movies, live video conferencing and calling, video games and emerging
applications like augmented reality/virtual reality (AR/VR). While some
applications like live events and video calling demand real-time video
compression and transmission, others, like movie libraries, storage, and on-
demand streaming are non-real time applications. Each of these applications
imposes different constraints on the encoding and decoding process, resulting
in differences in compression parameters. A live sports event broadcast
requires high quality, real-time encoding with very low encoding and
transmission latency, whereas encoding for a video-on-demand service like
Netflix or Hulu is non-real time and focuses on the highest quality and visual
experience. To this effect, every video compression standard provides a
variety of toolsets that can be enabled, disabled and tuned to suit specific
requirements. All modern video compression standards, including MPEG2,
H.264, H.265, and VP9, define only the toolsets and the decoding process.
This is done to ensure interoperability across a variety of devices. Every
decoder implementation must decode a compliant bitstream to provide an
output identical to the outputs of other implementations operating on the
same input. Encoder implementations, on the other hand, are free to choose
coding tools available in the standard and tune them as part of their design, as
long as they produce an output video that is standard-compliant. Encoder
implementations can design and incorporate different pre-processing
techniques, coding tool selection, and tuning algorithms as part of their
design. This may result in dramatic differences in video quality from one
encoder to another.
Figure 10: Illustration of spatial correlation in pixels in one frame of the akiyo sequence [2].
We can exploit this strong correlation and represent the video with only a
differential from a base value, as this requires far fewer bits to encode the
pixels. This is the core concept of how differential coding is applied in video
compression. All modern video coding standards remove these spatial
redundancies to transmit only the minimal bits needed to make up the
residual video.
For example, in the group of 4x4 pixels in Figure 11, 8 bits are needed to
represent values from 0 through 255. Every pixel will need 8 bits for its
representation.
Thus, the 4x4 block can be completely represented using 128 bits. We,
however, notice that the pixel values in this block vary only by very small
amounts. If we were to represent the same signal as a differential from a base
value (say 240), then the block can be represented using 8 bits
for the base value and 2 bits (values from 0 through 3) for each of the
16 differential values, resulting in a total of only 40 bits required for
representation. The base value chosen is called a predictor and the effective
differential values from this predictor are called residuals. This mechanism of
using differential coding to remove spatial pixel redundancies within the
same picture is called intra picture coding.
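The bit counting in this example can be expressed in a few lines of code. The following Python sketch (my own illustration; the pixel values are made up and the residual width is simply the smallest fixed size that covers the largest difference) compares raw 8-bit coding of a 4x4 block against coding a base predictor plus small residuals.

```python
import math

def bits_raw(block, bit_depth=8):
    """Bits needed to send every pixel at full precision."""
    return len(block) * bit_depth

def bits_differential(block, bit_depth=8):
    """Bits for one base value plus fixed-size residuals from that base.

    The base (predictor) costs bit_depth bits; each residual is coded with
    just enough bits to cover the largest difference within the block.
    """
    base = min(block)
    max_residual = max(block) - base
    residual_bits = max(1, math.ceil(math.log2(max_residual + 1)))
    return bit_depth + len(block) * residual_bits

# A flat 4x4 block whose values vary only between 240 and 243.
block = [240, 241, 242, 241,
         243, 240, 241, 242,
         241, 242, 240, 243,
         242, 241, 243, 240]

print(bits_raw(block))           # 128 bits
print(bits_differential(block))  # 8 + 16 * 2 = 40 bits
```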
Figure 12: Illustration of temporal correlation in successive frames of the akiyo sequence [2].
2.3.4 Exploiting Psycho-Visual Redundancies
Studies have shown that the human visual system (HVS) is much more
sensitive to luminance information (Y) than chrominance information (U and
V). This means that a reduction in the number of bits allocated for the chroma
component will have a significantly lower impact on the visual experience
than a corresponding reduction in luma bits. By exploiting this visual
perceptual reality, all modern video coding standards use a lower resolution
of chroma by subsampling the chroma components, while maintaining full
resolution of luminance components. The following are the most commonly
used formats:
1. 4:2:0: Chroma subsampled by ½ across H and V directions
2. 4:2:2: Both U & V subsampled by ½ across H direction only
3. 4:4:4: Full resolution for U and V without any subsampling
The 4:2:2 and 4:2:0 subsampling mechanism is illustrated in Figure 13 for a
sample 8x8 block of pixels, where full luma (Y) resolution is used but the
chroma components (Cb and Cr) are both sampled as indicated by the shaded
pixels. As 4:2:2 subsamples chroma along horizontal resolution only, every
other pixel location along the horizontal rows is used. For 4:2:0, in addition
to the above horizontal subsampling, vertical subsampling is also used.
Hence, a pixel location that is between two consecutive rows of
corresponding luma pixel locations is used.
Every video distribution system, for instance, satellite uplinks, cable or
internet video, uses 4:2:0 exclusively. In professional facilities like live event
capture and news production centers, however, 4:2:2 format is used to
capture, process, and encode video to preserve video fidelity.
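The practical effect of the subsampling formats listed above is easy to quantify. The following Python sketch (an illustration; the function name is an assumption) counts the luma and chroma samples in one 1920x1080 frame for each format and converts them to bytes for 8-bit samples. Moving from 4:4:4 to 4:2:0 halves the raw data before any actual compression tool is applied.

```python
def samples_per_frame(width, height, fmt="4:2:0"):
    """Return (luma, chroma) sample counts for one frame.

    4:4:4 keeps Cb and Cr at full resolution, 4:2:2 halves them horizontally,
    and 4:2:0 halves them both horizontally and vertically.
    """
    luma = width * height
    chroma_factor = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}[fmt]
    chroma = 2 * int(luma * chroma_factor)        # Cb plus Cr
    return luma, chroma

for fmt in ("4:4:4", "4:2:2", "4:2:0"):
    y, c = samples_per_frame(1920, 1080, fmt)
    megabytes = (y + c) / 1e6                     # one byte per 8-bit sample
    print(f"{fmt}: {y + c} samples, {megabytes:.2f} MB per frame")
```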
Figure 13: 4:2:2 and 4:2:0 subsampling of pixels.
Figure 14: HVS sensitivity to the low-frequency sky, and high-frequency trees, areas.
2.4 Summary
Storage and transmission of raw uncompressed video consume
enormous bandwidth (e.g. 3840x2160p60 takes 7.46 Gbits for one
second). This is not practical, especially as video resolutions are
increasing and consumption is growing. Hence, video needs to be
compressed.
Video pixels are highly similar, with significant redundancies that are
classifiable into two types: statistical and psycho-visual. Compression
techniques work to identify and remove these redundancies.
Statistical redundancies are classified into spatial pixel redundancies
(across pixels within a frame), temporal pixel redundancies (across
frames) and coding redundancies.
The human eye is more sensitive to luma than to chroma. Therefore,
luma is prioritized over chroma in compression techniques.
The human eye is sensitive to small changes in brightness over a
large area but not very sensitive to rapid brightness variations.
Therefore, low-frequency components are prioritized over high-
frequency components during compression.
2.5 Notes
1. Wong J I. The internet has been quietly rewired, and video is the
reason why. Quartz Obsessions. https://qz.com/742474/how-
streaming-video-changed-the-shape-of-the-internet/. Published
October 31, 2016. Accessed September 21, 2018.
2. akiyo. xiph.org. Xiph.org Video Test Media [derf's collection].
https://media.xiph.org/video/derf/. Accessed September 21, 2018.
3. in_to_tree. xiph.org. Xiph.org Video Test Media [derf's collection].
https://media.xiph.org/video/derf/. Accessed September 21, 2018.
4. Why Does 10-bit Save Bandwidth (Even When Content is 8-bit)?
ATEME. www.ateme.com. http://x264.nl/x264/10bit_02-ateme-
why_does_10bit_save_bandwidth.pdf. Published 2010. Accessed
September 21, 2018.
3 Evolution of Codecs
3.1 Key Breakthroughs in Encoding
Research
Video encoding technologies have consistently progressed over several
decades since the 1980s with a suite of successful codecs built on the hybrid
block-based architecture. The core technologies that constitute this
architecture were developed over decades, starting from research as early as
the 1940s. These underpinned the evolution of the architecture into its present
form. Today’s codecs still build on many significant research developments,
especially from the 1970s and 1980s [1]. The focus of this section is to look at
what these core technologies are and how they influenced and contributed to
the evolution of the modern video coding standards. This provides us with
valuable insights on why coding technologies are the way they are today. It
also helps us understand, at a higher level, the fundamental framework of
video coding.
3.1.2 Prediction
Given the nature of video signals, efforts to represent video data using some
form of prediction, in order to minimize the redundancies and thereby reduce
the amount of transmitted data, began as early as the 1950s. In 1972, Manfred
Schroeder of Bell Labs obtained a patent, "Transform Coding of Image
Difference Signals,"[5] that explored several of the modern video codec
concepts, including inter-frame prediction, transforms and quantization to
image signals. Schroeder's work also specifically mentions the application of
Fourier, Hadamard, and other unitary matrix transforms that help to
disperse the difference data homogeneously in the domain of the transformed
variable.
Figure 17: Netravali and Stuller’s motion compensated prediction in the transform domain.
Source: https://patents.google.com/patent/US4245248A/
Over the course of this evolution, every generation has built on top of the
previous generation by introducing new toolsets that focus primarily on
reduction of bit rates, reduction of decoder complexity, support for increased
resolutions, newer technologies like multi-view coding, HDR, and
improvements in error resilience, among other enhancements. Table 5
consolidates the details of the evolution of video coding standards.
3.3 Summary
The core technologies that constitute modern compression
architecture were developed over decades of research starting in the
1940s.
The three key breakthroughs that propelled video compression
systems to their present form are a) information theory, b) prediction,
and c) transform.
Two major standards bodies, namely, ISO/IEC and ITU-T, have, over
the years, published the majority of video coding standards. The
Motion Pictures Experts Group (MPEG) has been the working group
for ISO/IEC standards and the Video Coding Experts Group (VCEG)
has been the working group under the ITU-T.
MPEG and VCEG have collaborated to jointly produce immensely
successful video standards, including MPEG2, H.264, and H.265.
Google published VP9 as an open video coding standard in 2013. Its
success has led to the formation of an alliance called AOM that is
working on a new standard called AV1.
The Joint Video Experts Team (JVET) is working on the successor
to H.265, called VVC (Versatile Video Coding), with a goal of
decreasing the bit rate by a further 50% over H.265.
3.4 Notes
1. Richardson I, Bhat A. Video coding history Part 1. Vcodex.
https://www.vcodex.com/video-coding-history-part-1/. Accessed
September 21, 2018.
2. Shannon CE. A mathematical theory of communication. Bell Syst
Tech J. 1948;27(Jul):379-423;(Oct):623-656. https://goo.gl/dZbahv.
Accessed September 21, 2018.
3. Huffman DA. A method for the construction of minimum-
redundancy codes. Proc IRE. 1952;40(9):1098-1101.
https://goo.gl/eMYVd5. Accessed September 21, 2018.
4. Witten IH, Neal RM, Cleary JG. Arithmetic coding for data
compression. Commun ACM. 1987;30(6):520-540.
https://goo.gl/gRrXaS. Accessed September 21, 2018.
5. Schroeder MR. Transform coding of image difference signals. Patent
US3679821A. 1972.
6. Netravali AN, Stuller JA. Motion estimation and encoding of video
signals in the transform domain. Patent US4245248A. 1981.
7. Ahmed N, Natarajan T, Rao KR. Discrete cosine transform. IEEE
Trans Comput. 1974;23(1):90-93. https://dl.acm.org/citation.cfm?
id=1309385. Accessed September 21, 2018.
Part II
4 Video Codec Architecture
Video compression (or video coding) is the process of converting digital
video into a format that takes up less capacity, thereby becoming efficient to
store and transmit. As we have seen in Chapter 2, raw digital video needs a
considerable number of bits and compression is essential for applications
such as internet video streaming, digital television, video storage on Blu-ray
and DVD disks, video chats, and conferencing applications like FaceTime
and Skype.
The word ‘codec’ is derived from the two words - encode and decode.
CODEC = Encode + Decode
Quite obviously, there are numerous ways in which video data can be
compressed and it therefore becomes important to standardize this process.
Standardization ensures that encoded video from different sources using
products of different manufacturers can be decoded uniformly across
products and platforms provided by other manufacturers. For example, video
encoded and transmitted using an iPhone needs to be viewable on an iPhone and
on a Samsung tablet. Streamed video from Netflix or YouTube needs to be
viewable on a host of end devices. It needs no further emphasis that this
interoperability is critical to mass adoption of the compression technology.
All modern video coding standards, including H.264, H.265 and VP9, define
a bitstream syntax for the compressed video along with a process to decode
this syntax to get a displayable video. This is referred to as the normative
section of the video standard. The video standard encompasses all the coding
tools, and restrictions on their use, that can be used in the standard to encode
the video. The standard, however, does not specify a process to encode the
video. While this provides immense opportunity for individuals, universities
and companies to research and come up with the best possible encoding
schemes, it also ensures that every encoded bitstream adhering to the
standard can be completely decoded by a compliant decoder to produce
identical output. Figure 18, below, shows the encoding and decoding
processes and the shaded portion in the decoder section highlights the
normative part that is covered by video coding standards.
To understand why this is needed and also the benefits that this type of
prediction offers, let us examine the sequence of frames shown in Figure 19.
If the third frame in the sequence were to be encoded as a P frame that uses
only forward prediction, it would lead to poorer prediction for the ball region,
as the additional ball is not present in the preceding frames. However,
compression efficiency can be improved for such regions by using backward
prediction. This is because the ball region can be predicted from the future
frames that have the fourth moving ball in them. Thus, as shown in the figure,
the third frame could benefit from B frame compression by having the
flexibility to choose either forward or backward prediction for the region
containing the top three balls and any of the future frames for the last ball.
To be able to predict from future frames, the encoder has to, in an out-of-
order fashion, encode the future frame before it can use it as a reference to
encode the B frame. This requires an additional buffer in memory to
temporarily store the B frame, pick and encode the future frame first and then
come back to encode the stored B frame. Because the frames are sent in the
order in which they are encoded, they appear in the encoded bitstream
out of order relative to the source stream. Correspondingly, the decoder
would decode the future frame first, store it in memory and use it to decode
the B frame. However, the decoder will sequentially display the B frame first,
followed by the I frame. Consequently, as a B frame is based on a frame that
will be displayed in the future, there will be a difference between decode
order and display order of frames in the sequence when the decoder
encounters a B frame.
Figure 20: Sequence of frames in display order as they appear in the input.
Figure 21: Sequence of frames in encode/decode order as they appear in the bitstream.
We can see this clearly in Figure 20, above. It shows the display order of
frames in the video sequence. This is also the same sequence of frames as
they appear in the input. This display number, or frame number, is also
indicated in the frames in the figure. In this example, I, P, and B frame
encoding is used and there are five B frames for every P frame. Frame
number 1 is encoded as an I frame, followed by a future frame (frame number
7) as a P frame, so that the five B frames can have future frames from which
they can predict. After frame 7 is encoded, frames 2 to 6 are also encoded.
This pattern is repeated periodically, with frame 13 encoded as a P
frame before frames 8 to 12, and so on.
The encoding (and corresponding decoding) order of the frames, as they
appear in the bitstream for this specific example, is shown in Figure 21.
During the decoding process, the P frames (frames 7, 13, 19) have to be
decoded before the B frames (frames 2-6, 8-12, 14-18) that precede them in
display order can be decoded. However, the P frames will be held in a buffer
and displayed only after those B frames are displayed.
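The reordering described above is mechanical enough to sketch in code. The following Python snippet (an illustration of the GOP structure used in Figures 20 and 21, not an excerpt from any encoder) converts display-order frame numbers into decode/bitstream order.

```python
def decode_order(num_frames, b_frames=5):
    """Reorder display-order frame numbers into decode/bitstream order.

    Assumes the GOP structure of Figures 20-21: frame 1 is an I frame, every
    (b_frames + 1)-th frame after it is a P frame, and each P frame must be
    decoded before the B frames that precede it in display order.
    """
    order = [1]                                       # the I frame comes first
    anchor = 1
    while anchor + b_frames + 1 <= num_frames:
        next_anchor = anchor + b_frames + 1           # the future P frame
        order.append(next_anchor)                     # P is decoded early ...
        order.extend(range(anchor + 1, next_anchor))  # ... then its B frames
        anchor = next_anchor
    return order

# Display order 1..13 becomes 1, 7, 2, 3, 4, 5, 6, 13, 8, 9, 10, 11, 12
print(decode_order(13))
```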
Typically, B frames use far fewer coding bits than P frames and there are
many B frames encoded for every P frame in the sequence. The term B frame
thus denotes a frame that can use both forward and backward prediction and
consists of motion vectors and residual data that effectively describe the
prediction. As with P frames, the reliance on the previous and future frames
helps in further reducing the number of bits used to encode the B frames.
However, it increases the sensitivity to transmission losses. Furthermore, the
cumulative prediction and quantization (a term that will be explained in
chapter 7) processes across successive frames increase the error between the
original picture and the reconstructed picture. This is because quantization is
a lossy process and prediction from a lossy version of the picture results in
increased error between the original and reconstructed pictures. For this
reason, earlier standards did not use B frames as reference frames. However,
the enhancements in filtering and prediction tools have improved the
prediction efficiency, thereby enabling the use of B frames as reference
frames in newer standards like H.264, H.265, and VP9.
The relative frame sizes produced by each of these frame types are illustrated in Figure
22. The figure is a graph plotting the frame sizes of a sample file encoded with
I, P, and B frame types using an I frame period of 50 frames. In this graph,
the peaks are the frame sizes of I frames that occur at every interval of 50
frames. These are represented in red color. The next largest frame sizes are P
frames represented in blue color followed by B frames which are shown in
green color.
The first step in every block-based encoder is to split every frame into block-
shaped regions. These regions are known by different names in different
standards. H.264 standard uses 16x16 blocks called macroblocks, VP9 uses
64x64 blocks called superblocks and H.265 uses a variety of square block
sizes called coded tree units (CTUs) that can range from 64x64 to 16x16
pixels. With the increased need for higher resolutions over the years, the
standards have evolved to support larger block sizes for better compression
efficiency. The next generation codec, AV1, also supports block sizes of
128x128 pixels. Every block in turn is usually processed in raster order
within the frame in the encoding pipeline.
These blocks are further broken down in a recursive fashion and the resulting
sub blocks are then processed independently for the prediction. Figure 27
shows how the recursive partitioning is implemented in VP9. Each 64x64
superblock is broken down in one of 4 modes, namely, 64x32
horizontal split, 32x64 vertical split, 32x32 horizontal and vertical split mode,
or no split mode.
Recursive splitting is permitted in the 32x32 horizontal and vertical split
mode. In this mode, each of the 32x32 blocks can again be broken down into
any of the 4 modes, and this continues until the smallest partition size of 4x4 is reached.
This type of splitting is also called quadtree split. Figure 27 also illustrates
how a 64x64 superblock in VP9 is broken down in a recursive manner to
different partition sizes.
At the highest level, the superblock is broken down using split mode to four
32x32 blocks. The first 32x32 block is in no-split (none) mode and is not broken
down again. The second 32x32 block uses horizontal split mode and is split into
two partitions of 32x16 pixels each (indicated as 2 and 3). The third 32x32
block is in split mode and hence is again recursively broken into four 16x16
blocks. These are further broken down all the way to 8x8 and further down to
4x4 blocks.
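The recursive splitting walked through above can be captured compactly in code. The following Python sketch (a simplified illustration, not the actual VP9 partitioning logic; the decision function here is a toy stand-in for the encoder's real mode decision) recursively partitions a 64x64 superblock down to a minimum size.

```python
def partition(x, y, size, decide, min_size=4):
    """Recursively partition a square block, VP9-style.

    `decide(x, y, size)` returns one of "none", "horz", "vert" or "split".
    Only "split" recurses further; the recursion stops at `min_size`.
    Returns a list of (x, y, width, height) leaf partitions.
    """
    mode = "none" if size == min_size else decide(x, y, size)
    half = size // 2
    if mode == "none":
        return [(x, y, size, size)]
    if mode == "horz":                       # two size x size/2 partitions
        return [(x, y, size, half), (x, y + half, size, half)]
    if mode == "vert":                       # two size/2 x size partitions
        return [(x, y, half, size), (x + half, y, half, size)]
    # "split": recurse into the four size/2 x size/2 quadrants
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += partition(x + dx, y + dy, half, decide, min_size)
    return leaves

# Toy decision: keep splitting the top-left quadrant, leave the rest whole.
def toy_decide(x, y, size):
    return "split" if (x == 0 and y == 0 and size > 16) else "none"

print(partition(0, 0, 64, toy_decide))
```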
This mechanism of recursive partition split is also illustrated in Figure 28.
The partition scheme in HEVC is very similar with a few minor variations.
Now that we understand the recursive block sub partitioning schema, let us
explore why this kind of partition is needed and how it helps. Coding
standards like HEVC and VP9 address a variety of video resolutions from
mobile (e.g., 320x240 pixels) to UHD (3840x2160 pixels) and beyond. Video
scenes are complex and different areas in the same picture can be similar to
other neighboring areas. If the scene has a lot of detail with different objects
or texture, it’s likely that smaller blocks of pixels are similar to other, smaller,
neighboring blocks or also to other corresponding blocks in neighboring
pictures.
Figure 28: Recursive partitioning from 64x64 blocks down to 4x4 blocks.
Thus, we see in Figure 29, that the area with fine details of the leaves has
smaller partitions. The interpixel dependencies in the smaller partition areas
can be exploited better using prediction at the sub-block level to get better
compression efficiency. This benefit, however, comes with increased cost in
signaling the partition modes in the bitstream. Flat areas, on the other hand,
like the darker backgrounds with few details, will have good prediction even
with larger partitions.
Figure 29: Partition of a picture in to blocks and sub partition of blocks.
Analyzer source: https://www.twoorioles.com/vp9-analyzer/
The next question that comes to mind is, how are these partitions determined?
The challenge to any encoder is to use the partition that best enables encoding
of the pixels using the fewest bits and yields the best visual quality. This is
usually an algorithm that is unique to every encoder. The encoder evaluates
various partition combinations against set criteria and picks one that it
expects will require the fewest encoding bits. The information on what
partitioning is done for the superblock or CTU block is also signaled in the
bitstream.
Once the partitions are determined, the following steps are done in every
encoder in sequence. At first, every block undergoes a prediction process to
remove correlation among pixels in the block. The prediction can be either
within the same picture or across several pictures. This involves finding the
best matching prediction block whose pixel values are subtracted from the
current block pixels to derive the residuals. If the best prediction block is
from the same picture as the current block, then it is classified as using intra
prediction.
Otherwise, the block is classified as an inter prediction block. Inter prediction
uses motion information, which is a combination of motion vector (MV) and
its corresponding reference picture. This motion information and the selected
prediction mode data are transmitted in the bitstream. As explained earlier,
blocks are partitioned in a recursive manner for the best prediction
candidates. Prediction parameters like prediction mode used, reference
frames chosen, and motion vectors can be specified, usually for each 8x8
block within the superblock.
The difference between the original block and the resulting prediction block,
called the residual block, then undergoes transform processing using a spatial
transform. Transform is a process that takes in the block of residual values
and produces a more efficient representation. The transformed pixels are said
to be in transform domain and the coefficients are concentrated around the
top left corner of the block with reducing values as we traverse the block
horizontally rightward and vertically downward. Until this point, the entire
process is lossless. This means, given the transformed block, the original set
of pixels can be generated by using an inverse transform and reverse
prediction.
The transform block then undergoes a process called quantization that
involves dividing the block values by a fixed number and rounding, which
reduces the precision of the coefficients and drives many of them to zero. The
surviving coefficients can then be efficiently arranged using a scanning
process and then encoded to produce a binary bitstream using an entropy
coding scheme.
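The transform and quantization steps described above can be illustrated with a floating-point DCT (real codecs use integer approximations of such transforms, and the residual values below are made up). Note how the transform concentrates the block's energy into the top-left coefficient and how a single quantization step then zeroes out the remaining small coefficients, which is where the loss is introduced.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix used for the 2D separable transform."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def forward_transform(block):
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T                       # rows and columns transformed

def quantize(coeffs, qstep):
    return np.round(coeffs / qstep).astype(int)  # the lossy step

def dequantize(levels, qstep):
    return levels * qstep

# A smooth 4x4 residual block (illustrative values only).
residual = np.array([[12, 11, 10, 10],
                     [11, 11, 10,  9],
                     [10, 10,  9,  9],
                     [10,  9,  9,  8]], dtype=float)

coeffs = forward_transform(residual)
levels = quantize(coeffs, qstep=8)
print(levels)                            # only the top-left (DC) level survives

d = dct_matrix(4)
recon = d.T @ dequantize(levels, 8) @ d  # inverse transform of the dequantized block
print(np.round(recon))                   # close to, but not equal to, the input
```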
When the decoder receives this video bitstream, it carries out the
complementary processes of entropy decoding, de-scanning, de-quantization,
inverse transform and inverse prediction to produce the decoded raw video
sequence. When B frames are present in the stream, the decoding order (that
is, bitstream order) of pictures is different from the output order (that is,
display order). When this happens, the decoder has to buffer the pictures in
its internal memory until they can be displayed.
It should be noted that the encoder does not employ the input source material
for its prediction process. Instead, it has a built-in decoder processing loop.
This is needed so that the encoder can produce a prediction result that is
identical to that of the decoder, given that the decoder has access only to the
reconstructed pixels derived from the encoded material. To this end, the
encoder performs inverse scaling and inverse transform to duplicate the
residual signal that the decoder would arrive at. The residual is then added to
the prediction and loop-filtered to arrive at the final reconstructed picture.
This is stored in the decoded picture buffer for subsequent prediction. This
exactly matches the process and output of the decoder and prevents any pixel
value drift between the encoder and the decoder.
VP9 supports tiles and, when they are used, the picture is broken along
superblock boundaries. Each tile contains multiple superblocks that are all
processed in raster order and flexible ordering is not permitted. However, the
tiles themselves can be in any order. This means that the ordering of the
superblocks in the picture is dependent on the tile layout. This is illustrated in
Figure 31 below.
It should be noted that tiles and slices are parallelism features intended to
speed up processing. They are not quality improvement functions. This
means that, in order to achieve parallel operations, some operations like
predictions, context sharing, and so on would not be permitted across slices
or tiles. This is to facilitate independent processing. Such limitations may
also lead to some reduction in compression efficiency. As an example, VP9
imposes the restriction that there can be no coding dependencies across
column tiles. This means that two column tiles can be independently coded
and hence decoded. For example, in a frame split into 4 vertical tiles, as
shown in Figure 31, above, there will be no coding dependencies like motion
vector prediction across the vertical tiles. Software decoders can therefore use
four independent threads, one each to decode one full column tile. The tile
size information is transmitted in the picture header for every tile except the
last one. Decoder implementations can use this information to skip ahead and
start decoding the next tile in a separate thread. Encoder implementations can
use the same approach and process superblocks in parallel.
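As a rough illustration of how column tiles divide the work (a simple even split, not the exact tile sizing rules of the VP9 specification), the following Python sketch assigns ranges of superblock columns to a given number of column tiles; each range could then be handed to its own encoding or decoding thread.

```python
def tile_columns(frame_width, num_tiles, sb_size=64):
    """Split a frame into column tiles along superblock boundaries.

    Returns a list of (first_sb_col, last_sb_col) ranges, one per tile.
    This is a simple even split, not the exact VP9 sizing rules.
    """
    sb_cols = -(-frame_width // sb_size)          # ceiling division
    ranges = []
    for t in range(num_tiles):
        start = t * sb_cols // num_tiles
        end = (t + 1) * sb_cols // num_tiles - 1
        ranges.append((start, end))
    return ranges

# A 1920-wide frame has 30 superblock columns; with 4 column tiles each
# decoder thread can work on its own range of superblock columns.
print(tile_columns(1920, 4))   # [(0, 6), (7, 14), (15, 21), (22, 29)]
```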
4.5 Summary
● Video Compression involves an encoder component that converts the
input uncompressed video to a compressed stream. This stream, after
transmission or storage, is received by a complementary decoder
component that converts it back into an uncompressed format.
● Video compression technologies employ frameworks to analyze and
compress every picture individually by categorizing them either as intra
or inter frame. These identify and eliminate the spatial and temporal
redundancies, respectively.
● Inter frames provide significant compression by using unidirectional
prediction frames (P frames) and bi-directional prediction frames (B-
frames).
● An encoder uses a sequence of periodic and structured organization of
I, P and B frames in the encoded stream. This is called a Group of
Pictures or GOP.
● I frames are used in the stream to allow fast seek and random access
through the sequence.
● Video coding standards use a block-based prediction model in which
every frame in turn is split into block-shaped regions of different sizes
(e.g. 16x16 macroblocks in H.264, up to 64x64 CTUs in H.265 or
64x64 superblocks in VP9). These blocks are further broken down and
partitioned for prediction in a recursive fashion.
● Encoders use slices and tiles to split the frame into multiple processing
units so that the encode/decode of these independent units can happen
in parallel. These speed up the computations in parallel architectures
and multi-threaded environments.
4.6 Notes
1. Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High
Efficiency Video Coding (HEVC) standard. IEEE Trans Circuits Syst
Video Technol. 2012;22(12):1649-1668.
https://ieeexplore.ieee.org/document/6316136/?part=1. Accessed
September 21, 2018.
5 Intra Prediction
It should be noted that the prediction block is derived using blocks that are
previously encoded and reconstructed (before filtering). As blocks in a frame
are usually processed in left to right and top to bottom raster fashion, the top
and left blocks of the current block are already encoded and hence these
available pixels can be leveraged to predict the current block. The left and top
pixel sets are double the current block's height and width, respectively, and
different codecs limit the usage of pixels from this set for prediction. For
example, H.265 uses all the left and bottom left pixels whereas VP9 allows
the use of only the left set of pixels.
The concept of using neighboring block pixels for intra prediction is
illustrated in Figure 32. The shaded blocks in the figure are all causal. This
means that they have already been processed and coded (in scan order) and
can be used for prediction. For current block C to be coded, the immediate
neighbor blocks are left (L), top (T), top left (TL) and top right (TR). The
bottom half of the figure also shows the pixel values for a sample 4x4 luma
block and the corresponding luma 4x4 neighboring blocks. The prediction
involves the samples from these neighboring blocks that are closest to the
current block. The highlighting emphasizes that these samples are very
similar and have almost the same values as the luma samples from the current
4x4 block. Intra prediction takes advantage of exactly this pixel redundancy
by finding the best of these neighboring pixels that can be used for optimal
prediction. This enables use of the fewest bits. Now that we know what intra
prediction does, let us explore through an example how it is accomplished.
Example: Let us use the circumstances illustrated in Figure 32 as an example
to understand, from an encoder’s perspective, which of these pixels best help
to predict the current pixels in the given 4x4 block. Let’s try the left set of
pixels meaning the right-most column of the left neighbor L. In this case,
every pixel from this column is duplicated horizontally, as shown in Figure
33(a). Another option is to use the top set of pixels. In this case, this would be
the bottom-most row of the top neighbor block T. In this scenario, every pixel
from this row is duplicated vertically as shown in Figure 33(b). Projections of
pixels along other angles are also possible and an example is shown in Figure
33(c) where the pixels are projected from the bottom-most row of the top (T)
and top right neighbor (TR) block, along a 45-degree diagonal.
We have seen in the above example a few possible prediction candidates.
Every encoding standard defines its own list of permissible prediction
candidates or prediction modes. The challenge for the encoder now is to
choose one for every block. The process of intra prediction in the encoder
therefore involves iterating through all the possible neighbor blocks and
prediction modes allowed by the standard to identify the best prediction mode
and pixels for minimizing the number of resulting residual bits that will be
encoded.
Figure 33: Intra prediction from neighbor blocks by using different directional modes: (a)
horizontal, (b) vertical, (c) diagonal.
To do this, the encoder could, for each of the prediction modes, use a
distortion criterion like minimal sum of absolute differences (min-SAD). This
involves a simple computation and indicates the energy contained in the
residuals. By computing the SAD for all the modes, the encoder can pick the
prediction block that has the least SAD. The SAD, while providing a
distortion metric, does not quantify anything about the resulting residual bits
if the mode were chosen. To overcome this limitation, modern encoders
compute a bit estimate and use it to derive a cost function. This is a
combination of the distortion and the bit estimate. The best prediction mode is
the one that has the lowest cost.
In the example, as illustrated in Figure 33, the residual blocks contain much
smaller numbers. These are cheaper to represent than the original pixels. The
SADs of the residual blocks from the three directional modes are 3, 1 and 7,
respectively, and the encoder in this case could pick the vertical prediction
mode as the best mode if it were using a min-SAD criterion.
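A minimal Python sketch of this selection loop is shown below. The pixel values are made up, so the SAD numbers will not match those quoted for Figure 33, and only three candidate modes are formed; the point is simply how an encoder builds each candidate from neighbor pixels and keeps the one with the lowest SAD.

```python
import numpy as np

def intra_candidates(left_col, top_row):
    """Build a few 4x4 intra prediction candidates from neighbor pixels."""
    return {
        "horizontal": np.tile(left_col.reshape(4, 1), (1, 4)),  # copy left column across
        "vertical":   np.tile(top_row.reshape(1, 4), (4, 1)),   # copy top row down
        "dc":         np.full((4, 4), int(round((left_col.mean() + top_row.mean()) / 2))),
    }

def best_mode(current, left_col, top_row):
    """Pick the candidate with the smallest sum of absolute differences."""
    sads = {mode: int(np.abs(current - pred).sum())
            for mode, pred in intra_candidates(left_col, top_row).items()}
    return min(sads, key=sads.get), sads

# Made-up current block and neighbor pixels (not the values from Figure 33).
current = np.array([[100, 101, 102, 103],
                    [100, 101, 102, 103],
                    [ 99, 101, 102, 104],
                    [100, 100, 102, 103]])
left_col = np.array([100, 100, 100, 100])      # right-most column of block L
top_row  = np.array([100, 101, 102, 103])      # bottom-most row of block T

mode, sads = best_mode(current, left_col, top_row)
print(sads)            # vertical gives the smallest SAD for this block
print("chosen:", mode)
```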
The number of such allowable prediction directional modes and block sizes
for intra prediction differs across codecs. More modes amount to an
increase in encoder complexity but better compression efficiency. For
example, H.264 allows every 4x4 or 8x8 block within a 16x16 macroblock
to select a mode from the defined nine intra modes, and it also offers four
modes for prediction of the full 16x16 luma block. In VP9, ten prediction
modes are defined; these can be applied at block sizes from 4x4 up to 32x32.
H.265, on the other hand, extends it further to use 35 modes for prediction of
4x4, 8x8, 16x16 or 32x32 sub blocks within the CTU. The prediction
direction of the 35 modes is shown in Figure 34. The increased angles in
H.265 are designed such that they provide more options for near-horizontal
and near-vertical angles and fewer options for near-diagonal angles, in accord
with statistical observations. The upcoming AV1 codec has up to 65 different
modes. It follows, therefore, that H.265 and AV1 achieve better prediction
and thereby improved video quality. However, they also have a significant
increase in computational complexity relative to other standards like H.264.
The prediction modes most commonly available across codecs are the
directional modes, covering horizontal, vertical and other angular directions,
along with a DC predictor and other standard-specific specialized modes (e.g., VP9
has a specialized mode called True Motion, or TM mode). Typically, the
number of directional modes varies widely across standards, as we have
discussed above.
While the exact number of pixels and the number of prediction modes differ
slightly across codecs, conceptually they are all the same, as explained above.
When the intra prediction modes are established, the following are then sent
as part of the encoded bit stream for every intra predicted block:
a) The prediction mode
b) Residual values (differentials of current and predicted pixels)
However, if the partition size is 32x32 and the transform block size is 16x16
(this is permitted in H.265), there will be only one intra prediction mode
shared by the entire 32x32 block, while the prediction itself is performed for
each 16x16 transform block. It should be noted that, as transform sizes
are square, intra prediction operations are always square.
Some standards, like VP9, allow separate intra prediction for luma and
chroma in which case the process can be repeated for chroma. H.264 and
H.265, however, use the same luma intra prediction mode for chroma.
Figures 35-37 illustrate the process and efficiency of spatial intra prediction
for a complete intra frame.
Figure 35 shows the raw source frame. Figure 36 shows the image formed by
the predicted pixels. Notice how close the predicted pixels are to the original
source in Figure 35. Finally, Figure 37 shows the residual values and by
looking at this image, we can infer the accuracy of the prediction.
It can be observed that the prediction is quite accurate even at a block level in
flat areas like the sky, whereas in places of detail like the building windows
and the trees, the blocks used in prediction are unable to completely capture
these details. This results in prediction errors that will eventually be encoded
in the bitstream as residual values.
Figure 37: Residual image formed by subtracting original and predicted pixel values.
5.4 Summary
● The term, intra frame coding, implies that the compression operations
like prediction and transforms are all done using data only within the
current frame and no other frames in the video stream.
● Intra prediction exploits the correlation among neighboring pixels by
using the reconstructed pixels within the frame to derive predicted
values through extrapolation from already coded pixels.
● Reconstructed pixels from the last columns and rows of the left and top
neighboring blocks, respectively, are typically used for prediction of the
pixels of the current block.
● Encoders choose the pixels to predict from by selecting from several
directional modes and block sizes. More modes result in an increase in
complexity in the encoder but provide better compression efficiency.
● The prediction mode and residual values are signaled in the encoded bit
stream for every intra predicted block.
● One intra prediction mode is selected for every partition but the
prediction process is done for every transform block. For example, if
transform size is 8x8 and the partition is 16x16, there will be one
prediction mode for the 16x16 partition but intra prediction will be
performed for every 8x8 block.
6 Inter Prediction
Inter frame coding implies that the compression operations like prediction
and transforms are done using data from the current frame and its
neighboring frames in the video stream. Every frame or block that is inter
coded is dependent on temporally neighboring frames (called reference
frames). It can be fully decoded only after the frames on which it depends are
decoded. Unlike in intra prediction, where pre-defined modes in the standard
define from which directions blocks may be used for prediction, inter
prediction has no defined modes. Thus, the encoder has the flexibility to
search a wide area in one or several reference frames to derive a best
prediction match. The partition sizes and shapes within a superblock are also
flexible such that every sub-partition could have its best matching prediction
block spanning different reference frames. In intra prediction, a row or
column of pixels from the neighboring blocks is duplicated to form the
prediction block. In inter prediction, there is no extrapolation mechanism.
Instead, the entire block of pixels from the reference frame that corresponds
to the best match forms the prediction block.
6.1 Motion-based Prediction
As different objects in the scene can move at different speeds, independently
of the frame rate, their actual motion displacements are not necessarily in
integer pel units. This means that limiting the search for a block match in the
reference frames using pixel granularity can lead to imperfect prediction
results. Searching using sub-pixel granularity could give better matching
prediction blocks, thereby improving compression efficiency. The natural
question, then, is how to derive these sub-pixels, given that they don't
actually exist in the reference frames? The encoder will have to use a smart
interpolation algorithm to derive these sub-pel values from the neighboring
full-pel integer samples. The details of this interpolation process are
presented in later sections of this chapter. For now, we assume that the
encoder uses such an algorithm and searches for matching prediction blocks
with sub-pixel accuracy by interpolating pixel values between the
corresponding integer-pixels in the reference frame. If it still turns out that
good temporal matching blocks are unavailable or intra prediction yields
better results, the block is coded as intra.
The process of searching the reference frames to come up with the best
matching prediction block is called motion estimation (ME). The spatial
displacement of the current block from its prediction block is called motion
vector (MV) and is expressed in (X, Y) pixel coordinates. Figure 38
illustrates this concept. The motion search for the sample block that contains
a ball is shown. In the reference frame, among the other blocks in the defined
search area, the highlighted block containing the ball is chosen and the
relative distance between the current and the chosen block is tracked as the
motion vector.
The mechanism of using motion vectors to form the prediction block is called motion compensation (MC). Motion estimation, in particular, is one of the most computationally intensive blocks in an encoder.
The prediction block derived using motion search is not always identical to
the current block. The encoder, therefore, calculates the residual difference
by subtracting the prediction block pixel values from those of the current block. This, along with the motion vector information, is then encoded in the
bitstream. The idea is that a better search algorithm results in a residual block
with minimal bits, resulting in better compression efficiency. If the ME
algorithm can't find a good match, the residual error will be significant. In
this case, other possible options evaluated by the encoder could include intra
prediction of the block or even encoding the raw pixels of the block. Also, as
inter prediction relies on reconstructed pixels from reference frames,
encoding consecutive frames using reference frames that have previously
been encoded using inter prediction often results in residual error propagation
and gradual reduction in quality. An intra frame is inserted at intervals in the
bitstream. These reset the inter prediction process to gracefully maintain
video quality.
Once the motion vector is derived, it needs to be signaled in the bitstream and
this process of encoding a motion vector for each partition block can take a
significant number of bits. As motion vectors for neighboring blocks are
often correlated, the motion vector for the current block can be predicted
from the MVs of nearby, previously coded blocks. Thus, using another
algorithm, which is often a part of the standard specification, the differential
MVs are computed and signaled in the bitstream along with every block.
When the decoder receives the bitstream, it can then use the differential
motion vector and the neighboring MV predictors to calculate the absolute
value of the MV for the block. Using this MV, it can build the prediction
block and then add to it the residuals from the bitstream to recreate the pixels
of the block.
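As a simple illustration of this decoder-side reconstruction, the sketch below combines a predicted MV with the signaled differential MV and then adds the residuals to the motion-compensated prediction. It is illustrative only (integer-pel MVs, no clipping or boundary handling); the function and variable names are not from any standard.

# Minimal sketch of decoder-side MV and block reconstruction (illustrative only;
# the predictor derivation and clipping rules of a real codec are more involved).

def reconstruct_block(ref_frame, residual, pmv, dmv, x, y, size):
    """Rebuild a block from its motion-compensated prediction plus residual.

    ref_frame : 2-D list of reference-frame pixels
    residual  : size x size block of decoded residual values
    pmv, dmv  : (dx, dy) predicted MV and decoded differential MV, integer pels
    x, y      : top-left position of the current block in the frame
    """
    mvx, mvy = pmv[0] + dmv[0], pmv[1] + dmv[1]      # absolute MV = predictor + differential
    block = []
    for j in range(size):
        row = []
        for i in range(size):
            pred = ref_frame[y + mvy + j][x + mvx + i]   # motion-compensated prediction
            row.append(pred + residual[j][i])            # add residual to prediction
        block.append(row)
    return block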
A related challenge is to derive the correct values of the weighted prediction (WP) parameters, the weight W and offset O, used to compensate for brightness changes between the current and reference frames. Once these are found, the SAD can be correspondingly compensated for and the ME process can accurately find the correct matching prediction block. It should be noted
that different situations would warrant slightly different approaches. Scenes
with fade-in and fade-out have global brightness variations across the whole
pictures. This can be compensated for correspondingly with frame level WP
parameters. However, scenes with camera flash will have local illumination
variations within the picture which will obviously require more localized WP
parameters for efficient compensation. This is also true in scenes with fade-in
from white where the brightness variation of blocks with lighter pixels is
smaller than that of darker pixels. Localized WP parameters, however, will
introduce excessive overhead bits in the encoded bitstream and are not
available in H.264 and H.265. To combat this, various approaches have been
developed that effectively use the available structures of multiple reference
frames with different WP parameters to compensate for non-uniform
brightness variations.
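A minimal sketch of how frame-level WP parameters (the weight W and offset O above) might be estimated and applied during matching is shown below. The moment-matching estimate used here is a common heuristic and purely illustrative; it is not the derivation mandated by any standard, and the function names are assumptions.

import numpy as np

# Illustrative frame-level weighted-prediction (WP) parameter estimation and use.

def estimate_wp_params(cur_frame, ref_frame):
    """Estimate weight W and offset O so that W * ref + O approximates cur."""
    w = np.std(cur_frame) / max(np.std(ref_frame), 1e-6)
    o = np.mean(cur_frame) - w * np.mean(ref_frame)
    return w, o

def weighted_sad(cur_block, ref_block, w, o):
    """SAD computed against the brightness-compensated reference block."""
    compensated = w * ref_block.astype(np.float64) + o
    return np.abs(cur_block - compensated).sum()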
6.2 Motion Estimation Algorithms
In differential coding, the prediction error and the number of bits used for
encoding depend on how effectively the current frame can be predicted from
the previous frame. As we discussed in the earlier sections, if there’s motion
involved, the prediction for the moving parts in the sequence involves motion
estimation (ME); that is, finding out the position in the reference frame from
where it has been moved.
Figure 46: Search area with range +/- 128 pixels around a 64x64 block.
It is expected that the best motion search match is to be found in one of the
points within this search range. However, it should be noted that this is not
guaranteed. Sometimes the motion can be quite considerable and fall outside this range; for example, in a fast-moving sports scene. In such cases, the
best match could be the motion point in the search area that is closest to the
actual motion vector. Intra prediction can also be used if there’s no suitable
motion vector match. Depending on how and what motion points are
searched in the motion estimation process, the search algorithms can be
classified as follows: Full search or exhaustive search algorithms employ a
brute force approach in which all the points in the entire search range are
visited. The SAD or similar metrics at all these points are calculated and the
one with the minimum SAD is adjudged the predictor. This is the best of all
the search techniques, as every point in the search range is evaluated
meticulously. The downside of exhaustive search is its excessive
computational complexity.
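A brute-force search of this kind is easy to sketch: the version below evaluates the SAD at every integer-pel offset within the search range and keeps the best one. It is only an illustration of the exhaustive approach; real encoders layer early termination, sub-sampling and fast search patterns on top of this.

import numpy as np

def full_search(cur_frame, ref_frame, x, y, block=16, search_range=16):
    """Return the integer-pel MV (dx, dy) minimizing SAD for the block at (x, y)."""
    cur = cur_frame[y:y + block, x:x + block].astype(np.int64)
    h, w = ref_frame.shape
    best_mv, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + block > w or ry + block > h:
                continue                       # candidate falls outside the frame
            ref = ref_frame[ry:ry + block, rx:rx + block].astype(np.int64)
            sad = np.abs(cur - ref).sum()
            if best_sad is None or sad < best_sad:
                best_mv, best_sad = (dx, dy), sad
    return best_mv, best_sad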
Figure 47: Three step search for motion estimation.
Luma and chroma pixels are not sampled at sub-pixel positions. Thus, pixels
at these precisions don't exist in the reference picture. Block matching
algorithms therefore have to create them using interpolation from the nearest
integer pixels and the accuracy of interpolation depends on the number of
integer pixels and the filter weights that are used in the interpolation process.
Sub-pixel motion estimation and compensation is found to provide
significantly better compression performance than integer-pixel
compensation and ¼-pixel is better than ½-pixel accuracy. While sub-pixel
MVs require more bits to encode compared to integer-pixel MVs, this cost is
usually offset by more accurate prediction and, hence, fewer residual bits.
Figure 48 illustrates how a 4x4 block could be predicted from the reference
frame in two scenarios: integer-pixel accurate and fractional-pixel accurate
MVs. In Figure 48a, the grey dots represent the current 4x4 block. When the
MV is integral (1,1) as shown in Figure 48b, it points to the pixels
corresponding to the black dots that are readily available in the reference
frame. Hence, no interpolation computations are needed in this case. When
the MV is fractional (0.75, 0.5), as shown in Figure 48c, it has to point to
pixel locations as represented by the smaller gray dots. Unfortunately, these
values are not part of the reference frame and have to be computed using
interpolation from the neighboring pixels.
H.265 uses the same MVs for luma and chroma and uses ¼-pixel accurate MVs for luma that are computed using seven- and eight-tap interpolation filters (longer than the six-tap filter of H.264). For YUV 4:2:0, these MVs are scaled accordingly for chroma as ⅛-pixel accurate values. VP9 uses a similar eight-tap interpolation filter and also offers a more accurate ⅛-pixel interpolation mode. In VP9, the luma half-pixel samples are
generated first and are interpolated from neighboring integer-pixel samples
using an eight-tap weighting filter. This means that each half-pixel sample is
a weighted sum of the 8 neighboring integer pixels used by the filter. Once
half-pixel interpolation is complete, quarter-pixel interpolation is performed
using both half and full-pixel samples.
The filter coefficient values are given in Table 8, below. As an example, a half-sample position such as b0,j is derived by applying the HF coefficients to the surrounding integer samples, while a quarter-sample position such as a0,j is derived with the QF coefficients.
Table 8: Interpolation filter coefficients used in HEVC.
Index              -3    -2    -1     0     1     2     3     4
HF (half-pel)      -1     4   -11    40    40   -11     4    -1
QF (quarter-pel)   -1     4   -10    58    17    -5     1
The samples labeled e0,0 to o0,0 can then be derived by applying the same filters to the above computed samples. A simplified sketch of this interpolation process is given below.
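The sketch below applies the Table 8 taps along one row of integer samples to produce a half-pel and a quarter-pel sample. It is illustrative only: the normative H.265 process additionally uses intermediate bit-depth shifts and clipping, which are omitted here, and the sample values are arbitrary.

HF = [-1, 4, -11, 40, 40, -11, 4, -1]   # half-pel taps, indices -3..4
QF = [-1, 4, -10, 58, 17, -5, 1]        # quarter-pel taps, indices -3..3

def interpolate(samples, i, taps, first_index=-3):
    """Weighted sum of the integer samples around position i, normalized by 64."""
    acc = sum(t * samples[i + first_index + k] for k, t in enumerate(taps))
    return (acc + 32) >> 6                 # taps sum to 64, so divide with rounding

row = [100, 102, 104, 110, 120, 130, 135, 138, 140, 141]   # arbitrary integer samples
b = interpolate(row, 4, HF)               # half-pel sample between row[4] and row[5]
a = interpolate(row, 4, QF)               # quarter-pel sample nearer to row[4]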
Figure 50: Motion vectors of neighboring blocks are highly correlated [2].
Encoding absolute values of motion vectors for each partition can consume
significant bits. The smaller the partitions chosen, the greater is this overhead.
The overhead can also be significant in low bit rate scenarios. Fortunately, as
highlighted in Figure 50, the motion vectors of neighboring blocks are
usually similar. This correlation can be leveraged to reduce bits by signaling
only the differential motion vector obtained by subtracting the best neighbor (predicted) motion vector from the motion vector of the current block. A predicted vector, PMV,
is first formed from the neighboring motion vectors. DMV, the difference
between the current MV and the predicted MV, is then encoded in the
bitstream.
The question now is, which neighboring MV is most suitable for prediction
of any block? Different standards allow different mechanisms to derive the
PMV. It usually depends on the block partition size and on the availability of
neighboring MVs. Both HEVC and VP9 have an enhanced motion vector
prediction approach. That is, MVs of several spatially and temporally
neighboring blocks that have been coded earlier are candidates that are
evaluated for selection as the best PMV candidate. In VP9, up to 8 motion
vectors from both spatial and temporal neighbors are searched to arrive at 2
candidates. The first candidate uses spatial neighbors, whereas the second
candidate list consists of temporal neighbors. VP9 specifically prefers to use
candidates using the same reference picture and searches this picture first.
However, candidates from different references are also evaluated if the earlier
search fails to yield enough candidates. If there still aren't enough predictor
MVs, then 0,0 vectors are inferred and used.
Once these motion vector predictors (PMVs) are obtained, they are used to
signal the DMV in the bitstream using one of the four modes available in VP9. Three of the four modes correspond to direct or merge modes. In
these modes, no motion vector differential need be sent in the bitstream.
Based on the signaled inter mode, the decoder just infers the predictor MV
and uses it as the block’s MV. These modes are as follows.
● Nearest MV uses the first predictor intact, with no delta.
● Near MV uses the second predictor with no delta.
● Zero MV uses 0,0 as the MV.
In the fourth mode, called the New MV mode, the DMVs are explicitly sent
in the bitstream. The decoder reads this motion vector difference and adds it
to the nearest motion vector to compute the actual motion vector.
● New MV uses the first predictor of the prediction list and adds a delta
MV to it to derive the final MV. The delta MV is encoded in the
bitstream.
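A decoder-side sketch of these four modes might look as follows. The candidate-list construction and MV precision handling are greatly simplified, and the mode strings used here simply mirror the descriptions above rather than the exact syntax names of the specification.

def derive_mv(mode, candidates, dmv=(0, 0)):
    """candidates: list of predictor MVs, ordered best (nearest) first."""
    nearest = candidates[0] if candidates else (0, 0)
    near = candidates[1] if len(candidates) > 1 else (0, 0)
    if mode == "nearest":
        return nearest                                        # first predictor, no delta
    if mode == "near":
        return near                                           # second predictor, no delta
    if mode == "zero":
        return (0, 0)                                         # explicit zero motion
    if mode == "new":
        return (nearest[0] + dmv[0], nearest[1] + dmv[1])     # predictor + signaled delta
    raise ValueError("unknown inter mode")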
H.265 also uses mechanisms similar to the above, with slightly different terminology and candidate selection processes. In H.265, the following modes are allowed for any CTU.
Merge Mode. This is similar to the first three modes of PMV in VP9 where
no DMV is sent in the bitstream and the decoder infers the motion
information for the block using the set of PMV candidates. The algorithm on
how to arrive at the specific PMV for every block is specified in the standard.
Advanced Motion Vector Prediction. Unlike the merge mode, in this mode
the DMV is also explicitly signaled in the bitstream. This is then added to the
PMV (derived using a process similar to the above for merge mode) to derive
the MV for the block.
Skip Mode. This is a unique mode that is used when there is motion of
objects without any significant change in illumination. While earlier
standards defined skip mode to be used in a perfectly static scenario with zero
motion, newer codecs like H.265 defined it to include motion. In H.265, a
skip mode syntax flag is signaled in the bitstream and, if enabled, the decoder
uses the corresponding PMV candidate as the motion vector and the
corresponding pixels in the reference frame as is, without adding any
residuals.
6.5 Summary
● In inter frame coding, operations like prediction and transforms are
done using data from the current and neighboring frames in the video
stream.
● The process of searching the reference frames to come up with the best
matching prediction block is called motion estimation (ME) and the
mechanism to use the motion vectors (MVs) to form the prediction
block is called motion compensation (MC).
● Sub-pixel precision in motion vectors is needed as different objects in
the video scene can have motion at speeds that are independent of the
frame rate and hence cannot be represented using full pixel MVs.
● To overcome motion vector signaling overhead, the MVs of neighboring blocks, which are usually quite similar, are used to predict the current block's MV and only the resulting differential MV is signaled in the bitstream.
6.6 Notes
1. Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High
Efficiency Video Coding (HEVC) standard. IEEE Trans Circuits Syst
Video Technol. 2012;22(12):1649-1668.
https://ieeexplore.ieee.org/document/6316136/?part=1. Accessed
September 21, 2018.
2. VP9 Analyzer. Two Orioles. https://www.twoorioles.com/vp9-
analyzer/. Accessed September 22, 2018.
7 Residual Coding
Figure 51: Illustration of high frequency and low frequency areas in an image.
7.2 How Can an Image be Broken Down
into its Frequencies?
Within every block of pixels in the image, we can approximate the individual columns and rows of pixels as the sum of a series of frequencies, starting with the lowest frequency and adding progressively higher ones. The block of pixels is thus a superposition of a series of frequencies. The lowest frequency, in effect
the DC or average value of pixels in the block, doesn't add any fine details at
all. With every frequency added, one after another, more and more details are
built up in the picture.
The basic idea here is that a complex signal like an image pixel block can be broken down into a linear weighted sum of its constituent frequency components. Higher frequency components represent more details.
7.2.1 Why move to the frequency domain?
The pixel blocks in the image have components of varying frequencies and the transform process serves to represent each block as a linear weighted sum of its
constituent frequency components. The sections of high detail like edges
correspond to high frequency components. The flat areas correspond to low
frequency components, as described in the previous section. Splitting the
image in this way affords us certain advantages, as follows.
The DCT of a set of 4x4 samples X is given by the expression below, where
Y is the transformed block, A is the DCT basis set matrix, and ATr is its
matrix transpose.
Y = A X ATr
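As a concrete illustration of Y = A X ATr, the sketch below builds the orthonormal DCT-II basis matrix for a 4x4 block and applies it by matrix multiplication. The residual values in X are arbitrary, and real codecs use scaled integer approximations of A rather than this floating-point form.

import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    a = np.zeros((n, n))
    for i in range(n):
        scale = np.sqrt(1.0 / n) if i == 0 else np.sqrt(2.0 / n)
        for j in range(n):
            a[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * n))
    return a

X = np.array([[5, 11, 8, 10],
              [9, 8, 4, 12],
              [1, 10, 11, 4],
              [19, 6, 15, 7]], dtype=float)   # example residual block (arbitrary values)
A = dct_basis(4)
Y = A @ X @ A.T                               # forward transform: Y = A X ATr
X_back = A.T @ Y @ A                          # inverse transform recovers X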
Figure 54 shows a set of 8x8 residual values that is taken from the top left,
8x8 block of a sample 16x16 block of residual values.
Figure 54: Residual samples: top-left 8x8 block.
Performing the 8x8 DCT operation yields the 8x8 matrix shown in Figure 55.
It should be noted from Figure 55 that the larger coefficients are compactly
located around the top left corner, in other words, around the low frequency
DC component. This is the desired energy compaction function of the
transform operation.
H.265 and VP9 define prediction modes in accordance with the transform
sizes and also use a combination of a few transforms to suit different
prediction modes. VP9 supports transform sizes up to the prediction block
size or 32x32, whichever is smaller. Figure 56 shows a screenshot from the
stockholm clip with the grids showing the transform sizes. Larger transform
sizes, up to 32x32, are used in the smooth areas like the sky and water, while
smaller transform sizes are better able to capture the fine details like
buildings and so on. Encoders, in addition to deciding the best prediction
modes for every block, also have to decide the optimal transform size for
every block.
Figure 56: Flexible use of different transform sizes.
7.3 Quantization
Let’s stop here for a moment to understand and record our observations from
this process.
1. The quantized numbers are smaller compared to the original numbers.
2. Several numbers have become zeros.
3. The quantizer value 4 controls the nature of the resulting number set.
The higher this value, the more pronounced will be the above effects
from observations 1 & 2.
4. Information was lost during this process and the original numbers cannot be retrieved.
These observations are important as they form the principles by which
significant compression is achieved in every encoder. In the following section
we shall explore in detail how this is done.
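A minimal sketch of these observations is shown below: forward quantization by integer division with a quantizer step (4, matching observation 3 above) and the corresponding reconstruction by multiplication at the decoder. The coefficient values are arbitrary, and the discarded remainder is exactly the information that is lost.

def quantize(coeffs, qstep):
    # integer division with the remainder discarded (truncation toward zero)
    return [[int(c / qstep) for c in row] for row in coeffs]

def dequantize(levels, qstep):
    # decoder-side reconstruction by multiplication
    return [[l * qstep for l in row] for row in levels]

coeffs = [[35, -13, 6, 1],
          [-11, 5, -2, 0],
          [4, -2, 1, 0],
          [1, 0, 0, 0]]
levels = quantize(coeffs, 4)        # smaller numbers, several become zero
recon = dequantize(levels, 4)       # an approximation only; the originals are gone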
Dividing the original number set replicated in Figure 59a element-by-element by this quantizer set and discarding the remainders (integer division), we get the quantized 4x4 matrix shown in Figure 59b. When we do the reverse operation of multiplication at the receiving end, we obtain the reconstructed matrix of Figure 59c. Clearly, the reconstructed
values at the decoder end for the numbers at the top left of the 4x4 matrix,
which correspond to the low frequency components, have improved
compared to the results provided by a fixed quant value division. While these
numbers have become bigger with a wider range, this can be compensated for
by more aggressive quantization values for the numbers down the series.
Figure 63 shows the 16x16 block after the process of inverse quantization.
This is followed by Figure 64 that shows the final reconstructed residual
values at the decoder after inverse transform. While these values clearly show
patterns and are a fair approximation of the original 16x16 residual block in Figure 60, they are nowhere near an identical representation of the input source. The above operations were performed at a high QP, upwards of 40. This
introduces significant quantization of the transformed values, as we observe
in Figure 62. Let us now illustrate how this QP value affects the results above
by doing the same operations at around QP 30 and also at around QP 20. This
is shown in Figures 65 and 66, below.
As we see in Figure 66, with lower QP values of around 20, the reconstructed
residual values are almost identical to the input source residual values in
Figure 60, except for some rounding differences.
Figure 65: The reconstructed 16x16 block after inverse 16x16 transform in QP 30 case.
Figure 66: The reconstructed 16x16 block after inverse 16x16 transform in QP 20 case.
7.4 Reordering
As we have seen, the quantized transform coefficients have several zero
coefficients and the non-zero coefficients are concentrated around the top left
corner. Instead of transmitting all the values, which invariably include redundant zeros, it becomes beneficial to transmit only the few non-zero coefficient values and signal the remaining values as zeros. The decoder, when it
receives the bitstream, would be able to derive the non-zero coefficients and
then use the zero signaling to add zeros to the remaining values.
4 2 2 1 -1 1 0 0
2 2 1 -1 0 0 0 0
-2 1 1 1 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
We notice that the number of trailing zeros is significantly higher than with the raster scan pattern, which results in more efficient encoding. However, we also
notice that there are still zeros between the coefficients that could potentially
be further optimized.
Scan tables like the classic zig-zag scan shown in Figure 69, thus, provide a
much more efficient parsing of coefficients. All non-zero coefficients are
grouped together first, followed by zero coefficients. Other scan tables are
also possible and in fact VP9 provides a few different pattern options that
organize coefficients more or less by their distance from the top left corner.
The default VP9 scan table, showing the order of the first few coefficients, is shown in Figure 70, below.
As we see below, the earlier 8x8 block of quantized coefficients can also be efficiently represented when the VP9 scanning pattern shown in Figure 70 is
used for reordering.
[4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1, (and forty-six 0s)]
It should be noted that the scan pattern is closely tied to the underlying transform, which removes redundancies among residual samples and provides energy compaction. As VP9 provides flexible combinations of DCT and
ADST for horizontal and vertical transforms, it also provides flexible
scanning options which can be used in conjunction with the transform
combinations.
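The sketch below shows reordering with the classic zig-zag pattern of Figure 69: the 2-D block of quantized coefficients is read out along anti-diagonals so that the mostly non-zero low-frequency values come first and the zeros collect at the end of the list. VP9's actual scan tables differ in detail; this is only a conceptual illustration.

def zigzag_order(n):
    """Generate (row, col) positions of an n x n block in classic zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],                     # anti-diagonal index
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def scan(block):
    """Read a square block out in zig-zag order as a flat list."""
    n = len(block)
    return [block[r][c] for r, c in zigzag_order(n)]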
The first number in every (run, level) pair, namely, run, indicates the number
of immediately preceding zeros and the second number, namely, level,
indicates the non-zero coefficient value. The end of block is a special signal that can be communicated in the bitstream to indicate that there are no more non-zero coefficients and all remaining values are zeros. Another way of organizing the same information could be to append to every (run, level) pair a third value that indicates whether it is the last non-zero coefficient of the block. The previous sequence of numbers would then look as follows:
[(0,4,0)(0,2,0)(0,2,0)(0,-2,0)(0,2,0)(0,2,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)
(0,1,0)(0,1,0)(0,-1,0)(0,-1,0)(0,1,0)(1,1,1)]
Notice that the third value of the final triple is set to 1, indicating the end of block. When the decoder receives this stream, it decodes triples until it encounters this end-of-block flag and then sets the remaining values of the block to zeros.
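A small sketch of this (run, level, last) representation is shown below; applying it to the scanned list above reproduces the sequence of triples shown, ending with (1, 1, 1). The function names are illustrative only.

def run_level_encode(scanned):
    """Convert a scanned coefficient list into (run, level, last) triples."""
    triples, run = [], 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            triples.append([run, value, 0])
            run = 0
    if triples:
        triples[-1][2] = 1                     # mark the last non-zero coefficient
    return [tuple(t) for t in triples]

def run_level_decode(triples, block_size):
    """Rebuild the scanned coefficient list, padding trailing zeros."""
    out = []
    for run, level, _last in triples:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (block_size - len(out)))  # decoder fills in the remaining zeros
    return out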
In the next chapter we shall see how the compactly represented, quantized
signal gets encoded as bits and bytes in the final encoded bitstream.
7.6 Summary
● Transforms take in a block of residual pixel values (after prediction)
and convert them to the frequency domain. This amounts to the same
values being represented differently.
● Pixel values vary in intensity and the rate at which they change from one intensity to another and back again across the picture is represented by frequency. The faster the change of intensity from, say, light to dark and back, the higher the frequency needed to represent that part of the picture.
● An image pixel block can be broken down into a linear weighted sum of
its constituent frequency components with higher frequency
components representing more details.
● Transforms provide energy compaction. This is the fundamental
criterion in their selection.
● The DCT is widely used in video coding standards as it provides a high
degree of energy compaction.
● Quantization is the process of reducing the range of the set of
transformed residual values. It can be achieved by division at the
encoding side and a corresponding multiplication at the decoding side.
● Quantization is an irreversible process.
● Higher quantization results in loss of signal fidelity but higher compression. The control of quantization values is key to striking a balance between preserving signal fidelity and achieving a high compression ratio.
8 Entropy Coding
In the previous chapters we explored how inter and intra pixel redundancies
are removed to minimize the information that needs to be encoded. We also
saw mechanisms to efficiently represent the resulting residuals using
transforms, quantization, scanning and run-level coding. The following
pieces of information have to be sent as part of the bitstream at a block level:
motion vectors, residuals, prediction modes, and filter settings information.
In this chapter, we will study in detail how the run-level values are encoded
using the fewest bits by minimizing statistical coding redundancies. Recall
that, in the previous section, we had the following run level pairs for the
example encoding block:
[(0,4,0)(0,2,0)(0,2,0)(0,-2,0)(0,2,0)(0,2,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)
(0,1,0)(0,1,0)(0,-1,0)(0,-1,0)(0,1,0)(1,1,1)]
The easiest way to encode this to a binary bitstream would be to understand
the range of all possible values of such symbols, then determine the number
of bits needed to encode them with a fixed number of bits per symbol.
Assuming there are 100 such symbols each for last = 0 and last = 1, then we
can create a unique value using 8 bits for every symbol. To encode the above
set we would then need 17 x 8 = 136 bits or 17 bytes.
This implicitly assumes that all the symbols have the same likelihood of
occurrence, in which case assigning the same number of bits to each symbol
makes sense. However, in reality, data including video and image content
rarely have symbols that are equally likely. Instead, they tend to have some
symbols that occur more frequently than others.
Figure 71 illustrates the stages of this process. At the first step, if the input
symbol is not binary-valued, it is mapped to a corresponding binary value in
a process called binarization. The individual bits in the resulting binary value
are called bins. Thus, instead of encoding the symbols themselves, we will
focus on encoding their mapped binary equivalents. In designing the binary value mapping for every symbol, care is taken to ensure that no binary value is the prefix of another, so that every binary value received by the decoder can be uniquely decoded and mapped to an encoded symbol. The next step is to select a suitable model based on the past
symbol distribution through a process called context modeling. The last step
is the adaptive arithmetic encoding stage that adapts itself using the
probability estimates that are provided by the earlier stages. As the
probability distribution of the input symbol is highly correlated to the
probability distribution of the bins of its binary equivalent, the probability
estimates of the neighboring symbols can be used to estimate fairly
accurately the probabilities of the bins that will be encoded. After every bin is
encoded, the probability estimates are immediately updated and will be used
for encoding subsequent bins.
Probability distribution of Symbol ≈ Probability distribution of bins
8.2.1 Binarization
Most of the encoded symbols, for example, residuals, prediction modes, and
motion vectors, are non-binary valued. Binarization is the process of
converting them to binary values before arithmetic coding. It is thus a pre-
processing stage. It is carried out so that subsequently a simpler and uniform
binary arithmetic coding scheme can be used, as opposed to an m-symbol
arithmetic coding that is usually computationally more complex. It should be
noted, however, that this binary code is further encoded by the arithmetic
coder prior to transmission. The result of the binarization process is a
binarized symbol string that consists of several bits. The subsequent stages of
context modeling, arithmetic encoding and context updates are repeated for
each bit of the binarized symbol string. The bits in the binarized string are
also called bins. In H.265, the binarization schemes can be different for
different symbols and can be of varying complexities. In this book, we shall
illustrate a few binarization techniques that have been employed in H.265.
These include Fixed Length binarization technique and a concatenated
binarization technique that combines Truncated Unary and Exp-Golomb
binarization. The same concepts can be extended to other schemes.
x BFL (x)
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
x BTU (x)
0 0
1 1 0
2 1 1 0
3 1 1 1 0
4 1 1 1 1 0
5 1 1 1 1 1 0
6 1 1 1 1 1 1 0
7 1 1 1 1 1 1 1 0
8 1 1 1 1 1 1 1 1
In general, the k-th order EGk code uses the same unary prefix structure across different k values but differs in the number of suffix bits, which depends on k. Every k-th order EGk code scheme starts with k suffix bits for its first value and progresses from there. Examples of EGk codes for k=0
and k=1 are given in Table 12. EG0 code schemes start with 0 bits for suffix
for their first value x=0 and then add 1 bit for suffix for x=1,2 and so on. In
contrast, EG1 schemes start with 1-bit suffix code for x=0,1 and so on.
Now that we know how TU and EGk codes work, let’s conclude this section
with an example of a UEGk binarization code scheme that involves a simple
concatenation of TU and EGk codes. As illustrated in Table 13 below, the
scheme uses a TU prefix with a truncation cut-off value S = 14 and EGk
suffix of order k=0.
Different schemes of this type are deployed in H.265 for different symbols
like MVDs and transform coefficients. These schemes vary primarily in the
cut-off points for the TU scheme and also the order k of the EGk suffix.
These values are chosen after a careful consideration of the typical
magnitudes of these symbols and their probability distributions.
Table 13: Binary codes for UEG0 binarization.
UEGk (x) = TU prefix + EGk suffix
x    BTU (x) (prefix)    EGk (suffix)
0 0
1 1 0
2 1 1 0
3 1 1 1 0
4 1 1 1 1 0
: : : : : :
13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
17 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1
20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0
21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
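To make these constructions concrete, the sketch below implements fixed-length, truncated unary and a concatenated TU + EG0 (UEG0-style) binarizer. The EG0 suffix follows the general Exp-Golomb construction with a leading unary part; the exact normative bit patterns and cut-off handling in H.265 differ slightly per syntax element, so treat this as an illustration of the scheme rather than a reference implementation.

def fl_binarize(x, nbits):
    """Fixed-length binarization, most significant bit first."""
    return [(x >> (nbits - 1 - i)) & 1 for i in range(nbits)]

def tu_binarize(x, cutoff):
    """Truncated unary: x ones, terminated by a zero unless the cut-off is reached."""
    bins = [1] * min(x, cutoff)
    if x < cutoff:
        bins.append(0)
    return bins

def eg0_binarize(x):
    """Exp-Golomb order 0 with a leading unary 'group' part followed by an offset."""
    group, base = 0, 0
    while x >= base + (1 << group):
        base += 1 << group
        group += 1
    bins = [1] * group + [0]
    offset = x - base
    bins += [(offset >> (group - 1 - i)) & 1 for i in range(group)]
    return bins

def ueg0_binarize(x, cutoff=14):
    """Concatenation of a TU prefix (cut-off S) and an EG0 suffix for the remainder."""
    bins = tu_binarize(x, cutoff)
    if x >= cutoff:
        bins += eg0_binarize(x - cutoff)
    return bins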
The VP9 video standard employs a very similar design framework with slight
differences in implementation choices and terminology. Every non-binary
symbol is binarized to construct a binary tree and each internal node of the
binary tree corresponds to a bin. The tree is traversed, and the binary
arithmetic coder is run at each node to encode a particular symbol. Every
node (bin) has an associated probability with 8-bit precision and this set of
probabilities of the nodes is the maintained context for the symbol.
Now that we’ve understood how the non-binary symbols are binarized to
produce a binary bin-stream, let us delve into the details of the next step:
context modeling.
This means that the symbol bin stream is no longer encoded directly; rather, its mapped fractional representation is encoded instead. It is this mapping process that facilitates encoding with better compression efficiency, such that the final bitstream uses a fractional number of bits per symbol of the input bin stream. This process relies critically on the input probability context
model and we will show how this is so shortly. The stages involved in the
binary arithmetic coding process are illustrated in Figure 73.
Let us assume we have the following stream of 7 bits to be coded
[0 1 0 0 0 0 1] and P0 = 0.7 and P1 = 0.3
To start the process, the available interval range is assumed to be [0, 1] and
the goal is to identify a final interval specific to the sequence of symbols and
to pick any number from that interval. To do this, the binary symbols are
taken one by one and assigned sub-intervals based on their probabilities, as
illustrated in Figure 74 for our example bin stream.
Figure 74: Illustration of coding a sample sequence using arithmetic coding.
The first bit is ‘0’. It has P0 = 0.7 and is assigned the interval [0, 0.7]. This
interval is chosen as the initial interval to encode the next bit. In this case, the
next bit is ‘1’ (with P1 = 0.3). Based on this the interval [0,0.7] is further
broken down into [0.49, 0.7]. This then becomes the initial interval for the
next bit and so on.
The process can thus be summarized in the following 3 steps:
1. Initialize the interval to [0,1].
2. Calculate the sub-interval based on the incoming bit and its
probability value.
3. Use the subinterval as the initial interval for the next bit and repeat
step 2.
In our example, the final fractional interval is [0.5252947, 0.540421]. If we pick a number in this interval, say, 0.53125, its binary representation is 0.10001. This can then be sent in the bitstream in the form of 5 bits: [10001]. It’s clear from
this example how a series of 7 input binary symbols can be compactly
represented using just 5 bits, thereby achieving a fractional number of bits per
symbol—0.714 bits per symbol in this case.
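The interval sub-division above can be reproduced with a few lines of code. The floating-point version below is for illustration only; real CABAC engines work with finite-precision integer ranges and renormalization.

def arithmetic_intervals(bits, p0):
    """Successively narrow [0, 1): '0' keeps the lower p0 fraction, '1' the upper part."""
    low, high = 0.0, 1.0
    for b in bits:
        split = low + p0 * (high - low)
        if b == 0:
            high = split                      # keep the lower sub-interval
        else:
            low = split                       # keep the upper sub-interval
    return low, high

low, high = arithmetic_intervals([0, 1, 0, 0, 0, 0, 1], 0.7)
# low, high are approximately (0.5252947, 0.540421); any number inside, e.g.
# 0.53125 = 0.10001 in binary, identifies the whole 7-bit sequence with 5 bits.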
Now we'll explore how context probabilities affect this coding scheme using
the same sequence of input bits but using different probabilities, say, P0 = 0.4
and P1 = 0.6. Figure 75 shows how this sequence will be coded. In this
scenario, the final fractional interval is found to be [0.1624576, 0.166144]. If we pick a number in this interval, say, 0.1640625, its binary representation is 0.0010101. This requires a minimum of 7 bits to represent in the bitstream. This is more than the 5 bits needed with probabilities P0 =
0.7 and P1 = 0.3. This clearly demonstrates the critical importance of accurate
probability context models to provide coding gains using an arithmetic
coding scheme.
Figure 75: Coding the sample sequence using different context probabilities.
Having explored how the arithmetic encoding works, let us now explore how
its reverse operation, namely, arithmetic decoding works. This is illustrated in
Figure 77 and is the reverse of the earlier binary encoding steps. The interval
range is initialized to [0,1] with known probabilities P0 = 0.7 and P1 = 0.3.
These values are usually implicitly and dynamically computed by the
decoder. When the decoder receives the binary sequence [10001], it interprets it as the fraction 0.10001, corresponding to decimal 0.53125. The interval [0,1]
can be successively sub-sectioned based on the context probabilities as
follows. Since 0.53125 lies between 0 and 0.7 the first symbol is determined
as 0. Then the interval is set to [0,0.7] and the sub-intervals are [0,0.49] and
[0.49,0.7] based on the P0 value. Since 0.53125 now lies between 0.49 and
0.7 that corresponds to P1 = 0.3 probability interval, the symbol is determined
to be 1. This step is repeated in succession to arrive at the sequence of
symbol bits, namely, [0 1 0 0 0 0 1] as shown in Figure 77.
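The corresponding decoding steps can be sketched in the same way; again this is an illustrative floating-point version, not the normative integer process.

def arithmetic_decode(value, p0, nbits):
    """Locate the received fraction within successively sub-divided intervals."""
    low, high, bits = 0.0, 1.0, []
    for _ in range(nbits):
        split = low + p0 * (high - low)
        if value < split:
            bits.append(0)
            high = split
        else:
            bits.append(1)
            low = split
    return bits

decoded = arithmetic_decode(0.53125, 0.7, 7)   # yields [0, 1, 0, 0, 0, 0, 1]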
The arithmetic coding process described in the previous section involves
multiplication. This is avoided in CABAC implementations by approximating
the interval values using lookup tables that are specified in the standard. This
approach can potentially impact compression efficiency but is needed to keep
the implementation simple. Also, during the internal encoding/decoding
process, when the interval range drops below thresholds that are specified in
the standard, a reset process is initiated. In this process, the bits from
previous intervals are written to the bitstream and the process continues
further.
8.3 Summary
● Information theory helps us understand how best to send content with minimum bits for symbols having unequal likelihoods. Entropy coding is based on information theory.
● The term entropy is widely used in information theory and can be intuitively thought of as a measure of randomness associated with a specific content. In other words, it is simply the average amount of information from the content and measures the average number of ‘bits’ needed to express the information.
● Context adaptive entropy coding methods keep a running symbol count during encoding and use it to update probabilities (contexts) either at every block level or at the end of coding the frame. In doing so, the entropy coding context probability 'adapts' or changes dynamically as the content is encoded.
● CABAC is a context adaptive entropy coding method which includes the following functions in sequence: binarization, context modeling and arithmetic coding.
● Arithmetic coding involves a transformative operation where the symbols are transformed to coded number intervals and successive symbols are coded by recursive interval sub-division.
8.4 Notes
1. Marpe D, Schwarz H, Wiegand T. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. Oral presentation by C-H Huang at: IEEE CSVT meeting; July 2003.
https://slideplayer.com/slide/5674258/ Accessed September 22, 2018.
9 Filtering
While each of the 4x4 blocks that are separately transformed and quantized
appear to be a reasonable approximation of the original input, the local
averaging has resulted in a stark discontinuity along the 4x4 vertical edge that
was not present in the source. This is quite common in block processing.
Further processing is needed to mitigate the impact of this artificially created
edge. The deblocking process works to identify such edges, analyze their
intensities and then applies filtering across such identified block boundaries
to smooth off these discontinuities. The block boundaries thus filtered usually
include the edges between transform blocks and also the edges between
blocks of different modes.
In HEVC, the in-loop filtering is pipelined internally into two stages, with a first-stage deblocking filter followed immediately by a sample adaptive offset (SAO) filter. The deblocking filter, similar to the one in H.264, is
applied first. It operates on block boundaries to reduce the artifacts resulting
from block-based transform and coding. Subsequently, the picture goes
through the SAO filter. This filter does a sample-based smoothing by
encompassing pixels that are not on block boundaries, in addition to those
that are. These filters operate independently and can be activated and
configured separately. In VP9, an enhanced in-loop filter with a larger number of filter taps is used and the SAO filter is not part of the standard. In the following sections, we shall explore both of these filters in detail.
Figure 78: Order of processing of deblocking for 4x4 blocks of a superblock in VP9.
Figure 80: akiyo clip encoded at 100 kbps with deblocking disabled.
source: https://media.xiph.org/video/derf/
Figure 80 illustrates the output with the filter disabled. Compared with this output, the deblocked version, while not necessarily softening all the blocking artifacts, significantly improves the visual experience. This can be seen especially around the eyes and chin areas of the face and the jaggies along the folds in the clothes. It should be noted that at low bit rates care should be taken to balance the filtering process against the preservation of real edges in the content.
9.3 SAO
The sample adaptive offset (SAO) filter is the second stage filter. It is used
in-loop exclusively in H.265 after the deblocking filter is applied. This is
illustrated in Figure 81, below. While the deblocking filter primarily operates
on transform block edges to fix blocking artifacts, the SAO filter is used to
remove ringing artifacts and to reduce the mean distortion between
reconstructed and original pictures [1]. Thus, the two filters work in a
complementary fashion to provide good cumulative visual benefit.
In signal processing, low-pass filtering operations that involve truncating in
the frequency domain cause ringing artifacts in the time domain. In modern
codecs, the loss of high-frequency components resulting from finite block-
based transforms results in ringing. As high frequency corresponds to sharp
transitions, the ringing is particularly observable around edges. Another source of ringing artifacts is the long interpolation filters (with a large number of taps) used in H.265 and VP9.
Figure 81: Video decoding pipeline with in-loop deblocking and SAO filters.
The SAO filter provides the fix by modifying the reconstructed pixels. It first
divides the region into multiple categories. For each category, an offset is
computed. The filter, then, conditionally adds the offset to each pixel within
every category. It should be noted that the filter may use different offsets for
every sample in a region. It depends on the sample classification. Also, the
filter parameters can vary across regions. By doing this, it reduces the mean
sample distortion of the identified region. The SAO filter offset values can also be generated using criteria other than minimization of the regional mean sample distortion. Two SAO modes are specified in H.265,
namely, edge offset mode and band offset mode. While the edge offset mode
depends on directional information based on the current pixels and the
neighboring pixels, the band offset mode operates without any dependency
on the neighboring samples. Let us explore the concepts behind each of these
two approaches.
In H.265, the edge offset mode uses 4 directional modes (using neighboring
pixels) as shown in Figure 82 (Fu, et al., 2012) [1]. One of these directions is
chosen for every CTU by the encoder using rate distortion optimization.
Figure 83: Pixel categorization to identify local valley, peak, concave or convex corners [1].
For every mode, the pixels in the CTU are analyzed to see if they belong to
one of the four categories, namely, 1) local valley, 2) concave corner, 3) local
peak or 4) convex corner. This is shown in Figure 83. As categories (1) and
(2) have the current pixels in a local minimum compared to their neighboring
pixels, positive offsets for these categories would smooth out the local
minima. A negative offset would in turn work to sharpen the minima. The
effects, on the other hand, would be reversed for categories (3) and (4) where
negative offsets result in smoothing and positive offsets result in sharpening.
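A sketch of this edge-offset classification along one direction is shown below. The category numbering follows the list above, and the offsets applied per category would be chosen and signaled by the encoder; the function names are illustrative, not taken from the specification.

def eo_category(left, cur, right):
    """Classify a pixel against its two neighbors along the chosen EO direction."""
    if cur < left and cur < right:
        return 1          # local valley
    if (cur < left and cur == right) or (cur == left and cur < right):
        return 2          # concave corner
    if cur > left and cur > right:
        return 3          # local peak
    if (cur > left and cur == right) or (cur == left and cur > right):
        return 4          # convex corner
    return 0              # monotonic or flat area: no offset applied

def apply_eo(left, cur, right, offsets):
    """offsets: dict mapping category -> signaled offset value."""
    return cur + offsets.get(eo_category(left, cur, right), 0)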
In Figure 84 [1], the horizontal axis denotes the sample position and the
vertical axis denotes the sample value. The dotted curve is the original
samples. The solid curve is the reconstructed samples. As we see in this
example, for these four bands, the reconstructed samples are shifted slightly
to the left of the original samples. This results in negative errors that can be
corrected by signaling positive BOs for these four bands.
Bitrate is the amount of data in bits that is used to encode one second of
video. It is usually expressed in megabits per second (Mbps) or kilobits per
second (kbps). Bitrate is a critical parameter of the video and affects the file
size and overall quality of the video. In general, the higher the bitrate, the
more bits there are available to encode the video. This means better video
quality but usually comes at the expense of bigger file size. When the
application provides a target bitrate to the encoder, the job of the encoder is
to allocate the available bits intelligently across the video sequence, keeping
track of the video complexity and delivering the best possible picture quality.
In the first section of this chapter, we will deal with the process of mode
decision. The latter half of the chapter will cover topics in rate control. We
will also see in this chapter how these processes are intertwined with one
another.
10.1 Constraints
Given specific settings, including bitrate, latency and so on, the fundamental
challenge for any encoder is how to optimize the output-encoded picture
quality such that it can either:
a) maximize the output video quality for the given bit rate constraints,
or
b) minimize the bit rate for a given output video quality.
While doing the above, the encoder must also ensure that it operates within
its several constraints. Some of these are outlined below:
Bitrate. The encoder has to ensure it produces an average bitrate in line with this setting. Additional constraints may also be imposed, such that it may also be required to operate within a set of maximum and minimum bitrate limits.
This is especially the case in constant bitrate mode where, usually, the
channel capacity is fixed.
Latency. Latency is defined as the total time consumed between the picture
being input to the encoder and being output from the decoder and available
for display. This interval depends on factors like the number of encoding
pipeline stages, the number of buffers at various stages in the encoder
pipeline and how the encoder processes various picture types, such as B-
pictures, with its internal buffering mechanisms. The interval also includes
the corresponding operations from the decoder. The term, latency, usually
refers to the combined latency of both the encoder and the decoder.
Buffer Space. When the decoder receives the encoded bitstream, it stores it
in a buffer. There, the decoder smooths out the variations in the bitrate so as
to provide decoded output at a constant time interval. Conversely, this buffer
also defines the flexibility that the encoder has, in terms of the variability of
its bitrate, both instantaneously and at any defined interval of time. The
buffer fullness at any time is thus a difference between the bits encoded and a
constant rate of removal from the buffer that corresponds to the target bitrate.
The lower boundary of the buffer is zero and the upper boundary is the buffer
capacity. H.264 and H.265 define a hypothetical reference decoder (HRD)
buffer model. This model is used to simulate the fullness of the decoder
buffer. This aids the rate control in producing a compliant bitstream.
The encoder thus has to tightly regulate the number of bits sent in any period
of time. This is to ensure that the decoder buffers are never full or empty.
This is especially true for hardware decoders that often have limited memory
buffers. When the decoder buffer is full, no further bits can be
accommodated, and incoming bits may be dropped. On the other hand, if the
decoder has consumed all the bits and the buffer becomes empty, it may not
have anything to display except the last decoded picture. This may manifest
as undesirable pauses in the output video.
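A simplified leaky-bucket style sketch of this buffer bookkeeping is shown below. It is not the normative HRD model, but it captures the behaviour described above: bits produced for each frame are added and a constant drain corresponding to the target bitrate is removed, with the level to be kept between empty and the buffer capacity.

def simulate_buffer(frame_bits, target_bitrate, fps, buffer_size, initial_fullness):
    """Track virtual buffer fullness frame by frame (all quantities in bits)."""
    drain_per_frame = target_bitrate / fps
    fullness, trace = initial_fullness, []
    for bits in frame_bits:
        fullness += bits - drain_per_frame
        if fullness > buffer_size:
            print("upper bound exceeded: rate control must cut bits for coming frames")
        if fullness < 0:
            print("lower bound hit: rate control can spend more bits (or risk a stall)")
        trace.append(fullness)
    return trace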
Encoding Speed. Typically, encoding applications get classified as either
real-time or non real-time encoding and most encoders are designed for one
or the other. Examples of real-time encoding applications include live event
broadcasting. Here, the camera feed reaches the studios where it’s processed,
encoded and streamed over satellite, cable or the internet in real time. In real-
time encoding, if the output frame rate is 60fps, then the encoder has to
ensure it can produce an encoded output of 60 frames in every second of its
operation. Non-real time encoding, or, offline encoding, has the luxury of
time to perform additional processing in an effort to improve the encoding
quality. A typical example is video on demand streaming where the encoding
is done on all the content offline and stored in servers and the requested video
is fetched, streamed and played back upon demand.
Operating within these constraints, the encoder has to take decisions at all
stages, including selecting picture types, selection of partition types for the
coding blocks, selection of prediction modes, motion vectors and the
corresponding reference pictures they point to, filtering modes, transform
sizes and modes, quantization parameters, and so on. By taking these
decisions at various stages, the encoder strives to optimize the bit spend both
within and across various pictures to provide the best output picture quality.
This video quality is measured objectively by comparing the reconstructed
video (encoded and decoded) to the original input video using a mathematical
formula called distortion measure that is usually computed pixel-by-pixel and
averaged for the frame. As the distortion measure is an indication of how
much different the reconstructed frame or block is from the original, the
higher this number the worse the quality associated with the selected block
and vice versa.
SATD = ∑n ∑m | Tn,m |, where T = H (C − R) HTr
where C is the matrix corresponding to the current picture block samples and R is the matrix representation of the reconstructed block samples. H is the Hadamard transform matrix (HTr being its transpose) and T is thus the result of the Hadamard transform of the residual samples, whose sum of absolute values gives the SATD metric.
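As a concrete illustration, a 4x4 SATD might be computed as in the sketch below, using an order-4 Hadamard matrix. Normalization conventions and the exact Hadamard ordering vary between implementations, so this is only a sketch.

import numpy as np

H = np.array([[1,  1,  1,  1],
              [1,  1, -1, -1],
              [1, -1, -1,  1],
              [1, -1,  1, -1]])          # order-4 Hadamard matrix

def satd4x4(cur, ref):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    t = H @ (cur.astype(np.int64) - ref.astype(np.int64)) @ H.T
    return np.abs(t).sum()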
J (P, λ) = ∑n Jn (Pn, λ)
10.5.2.1 Determination of λ
The rate-distortion model, as we explained earlier, does just this by using a
mathematical model to derive the λ value from the target bitrate for a picture
or BU. It should be noted that the initial values used by the model are not
fixed. Different sequences may have quite different modeling values and
these values will also get updated as the encoding progresses. Thus, the
model adapts dynamically. After encoding one BU or picture, the actual
encoded bitrate is used to update its internal parameters and correspondingly
derive future updated λ values.
At the start of the encoding process, the demanded bitrate input is fed to the
virtual buffer model (if present). This then provides information on buffer
fullness. During the process of encoding, buffer fullness is always kept track
of by monitoring the total bits that have been encoded thus far and the rate of
removal of the bits from the buffer. The buffer fullness information, along
with the target bitrate, are fed as input to the bit allocation units. These are
then used to compute the GOP level, picture level and basic unit-level target
bits.
The BU-level target bits, along with a surrogate for spatial picture complexity
information like MAD (usually stored from previous pictures), are then used
as the inputs for the rate and distortion calculations in the R-D model. The R-
D model also takes as an input an initial QP value. This will have been
computed based on the target bitrate and updated according to the running
bitrate. The output of the R-D model is a corresponding target QP value to
encode the BU. This target QP value then passes through a QP limiter block that compares it against previously used QP values and clips it so that any dramatic changes in QP are smoothed out and a smooth transition is provided. The output of the QP limiter block is the final target QP for the
BU.
When the BU is quantized and encoded, the following parameters are fed
back into the model for future computations.
a) Total bits encoded for the BU
b) Residual bits encoded for the BU
c) Actual residual values
The total bits parameter is used to update the buffer fullness in the virtual
buffer. The residual bits parameter is updated to provide accurate rate
information to the RD model. The actual prediction residuals are fed back
into the complexity estimator (MAD).
This framework allocates QPs and corresponding bits to different picture
types flexibly using the target picture allocation mechanism at the picture
level.
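The following sketch strings these stages together into a toy rate-control loop. The R-D mapping and the QP limiter window used here are illustrative placeholders only, not the model of any particular encoder, and all names and constants are assumptions made for the sake of the example.

class RateControl:
    """Toy rate-control loop: bit allocation -> R-D model -> QP limiter -> feedback."""

    def __init__(self, target_bitrate, fps, buffer_size):
        self.drain = target_bitrate / fps        # constant removal per picture
        self.buffer_size = buffer_size
        self.fullness = buffer_size / 2.0        # start with a half-full virtual buffer
        self.mad = 1.0                           # running complexity estimate (MAD)
        self.prev_qp = 32

    def allocate_bits(self):
        # steer the per-picture budget up or down based on virtual buffer fullness
        correction = (self.buffer_size / 2.0 - self.fullness) / self.buffer_size
        return max(1.0, self.drain * (1.0 + correction))

    def qp_for_unit(self, target_bits):
        qp = 20.0 + 20.0 * self.mad / max(target_bits / 1000.0, 1e-6)   # toy R-D mapping
        qp = min(max(qp, self.prev_qp - 3), self.prev_qp + 3)           # QP limiter
        qp = int(round(min(max(qp, 0.0), 51.0)))
        self.prev_qp = qp
        return qp

    def update(self, bits_used, residual_mad):
        self.fullness += bits_used - self.drain                 # feed back actual bits
        self.mad = 0.9 * self.mad + 0.1 * residual_mad          # update complexity estimate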
Figure 88: Heat map showing quant offset variation using Adaptive Quantization [2].
10.7 Summary
● Bitrate is a critical parameter that affects the file size and overall video
quality. At higher bitrates, more bits are available to encode the video.
This results in better video quality but comes at the expense of bigger
file size.
● Typical constraints that the encoder has to operate with are latency,
bitrate, buffer space, and encoding speed.
● Given a specific setting that includes bitrate, latency, and so on, the
fundamental challenge for any encoder is how to optimize the bitrate
and output encoded picture quality. The encoder has to either maximize
the output video quality for the given bitrate or minimize the bitrate for
a set video quality.
● Video quality is quantified mathematically using a distortion measure.
This is usually computed at every pixel and averaged for the frame.
This provides a good measure of similarity between the blocks that are
compared.
● Encoding is an optimization problem where the distortion between the
input video and its output, reconstructed video is minimized, subject to
a set of constraints including bitrates and coding delay.
● Rate control algorithms maintain a bit budget and allocate bits to every
picture and every block within each picture by analyzing the video
complexity and keeping track of previously allocated bits and the target
bitrate.
● Rate control algorithms contain two important functions: 1)
determining and allocating target bits and 2) achieving target bit rate.
10.8 Notes
1. Rate control and H.264. PixelTools Experts in MPEG.
http://www.pixeltools.com/rate_control_paper.html. Published 2017.
Accessed September 22, 2018.
2. DOTA2. xiph.org. Xiph.org Video Test Media [derf's collection].
https://media.xiph.org/video/derf/. Accessed Oct 30, 2018.
Part III
11 Encoding Modes
In previous chapters we have explored at length how the encoder internally
operates to allocate the target bits across the video sequence. Let us now
review three important, application-level encoding modes and the
mechanisms for bitrate allocation in each of these modes. These encoding
modes are agnostic with respect to encoding standards. This means that every
encoder can be integrated with one or more of the rate control mechanisms.
All these modes are accessible in publicly available x264, x265, and libvpx
versions of H.264, H.265 and VP9, respectively.
CBR Configuration:
./vpxenc test_1920x1080_25.yuv -o test_1920x1080_25_vp9_cbr.webm --
codec=vp9 --i420 -w 1920 -h 1080 -p 1 -t 4 --cpu-used=4 --end-usage=cbr --
target-bitrate=3000 --fps=25000/1001 --undershoot-pct=95 --buf-sz=18000 --
buf-initial-sz=12000 --buf-optimal-sz=15000 -v --kf-max-dist=999999 --min-
q=4 --max-q=56
VBR 2-Pass Configuration:
./vpxenc test_1920x1080_25.yuv -o test_1920x1080_25_vp9_vbr.webm --
codec=vp9 --i420 -w 1920 -h 1080 -p 2 -t 4 --best --target-bitrate=3000 --
end-usage=vbr --auto-alt-ref=1 --fps=25000/1001 -v --minsection-pct=5 --
maxsection-pct=800 --lag-in-frames=16 --kf-min-dist=0 --kf-max-dist=360 --
static-thresh=0 --drop-frame=0 --min-q=0 --max-q=60
11.4.4 Storage
This is used by enterprise and private users to store encoded video in personal
drives or cloud storage for archival purposes. The goal is to achieve the best
possible quality without too much concern about the file size. Real-time or
non-real time CRF encoding would be a good encoding mode to use for this
class of applications. If, however, devices like DVDs or Blu Ray Disks are
used for storage, there are fixed size restrictions that also have to be
considered. The encoding in these cases is done with capped VBR mode,
either in real time or non-real time, using some form of multi-pass encoding.
11.5 Summary
● Different application scenarios define how encoders allocate bits to
frames. Three important bitrate modes are: 1) CBR, 2) VBR, and 3)
CRF.
● In CBR encoding, the encoder imposes more rigorous constraints on the
bitrate around periodic intervals and encodes at a more or less
consistent rate by disallowing dramatic bitrate swings.
● In VBR mode, the encoder allows more bits as necessary to the more
complex segments of the video and uses fewer bits to encode simpler
and more static segments.
● The CRF mode is a newer mode. It is a constant quality encoding mode that prioritizes the quality metric and aims to maintain a consistent quality across all sections of the video.
● CBR is a more predictable mode that is compatible across a wider variety of systems compared to VBR mode.
12 Performance
Video encoding is often an irreversible, lossy process wherein the encoded
video is a good approximation of the source and the quality of this
approximation depends on various encoding parameters like the quantization
parameter (QP) that we discussed earlier in this book. Consequently, the encoded video is degraded relative to the source. The quality of encoding is
gauged using a measure of this perceived video degradation compared to the
source. The distortion or artifacts produced by the encoding process
negatively impact the user experience and this is of paramount importance for
content providers and service providers who deploy these systems.
The most important characteristic of any video encoder is quality. Any video
encoder goes through evaluations to assess how it performs by using input
video sequences that represent a broad variety of content and analyzing the
encoded outputs. These clips are typically encoded using standard settings at
various target bitrates. There are two broad ways to evaluate the output video
quality:
1. Objective Analysis. This uses mathematical models that
approximate a subjective quality assessment. Assessments are
automatically calculated using a computer program. The advantage
of using this method is that it is easily quantified and always
provides a uniform and consistent result for a given set of outputs
and inputs. However, its limitations are usually in terms of how
accurately the model can approximate human perception. While
there are several metrics for objective analysis, three tools that are
increasingly used in the industry, namely, PSNR, SSIM and VMAF
are discussed in this chapter.
2. Subjective Analysis. Here, the set of test video clips is shown to a
group of viewers and their feedback, which is usually in some form
of a scoring system, is averaged into a mean opinion score. While
this method is not easily quantifiable, it’s the most frequently used
method. This is because it’s simpler than objective analysis and it
connects directly to the real-world experiences of users, who are the
ultimate judge of perceived quality. However, the testing procedure
may vary depending on what testing setup is available, what
encoders are used for the testing, and so on. Subjective analysis is
also prone to user bias and opinions.
SSIM, on the other hand, does away with error-based computations. Instead,
it leverages the characteristic of the HVS to focus on structural information.
SSIM defines a model to measure image quality degradation based on
changes in its structural information. The idea of structural information is that
the strong spatial correlations among pixels in an image or video picture
carry important information about the structure of the objects in the picture.
This is ignored by error-based metrics like PSNR that treat every pixel
independently. If x = {xi | i = 1, 2, …, N} is the original signal and y = {yi | i = 1, 2, …, N} is the reconstructed signal, then the SSIM index is calculated using the following formula[1]:

SSIM(x, y) = ((2 μx μy + A)(2 σxy + B)) / ((μx² + μy² + A)(σx² + σy² + B))

In this equation, μx and μy are the means of x and y, σx² and σy² are their variances, and σxy is the covariance of x and y. A and B are small stabilizing constants that are defined based on the bit depth. The value of SSIM ranges
from 0 to 1 with 1 being the best value. A 1 means that the reconstructed
image is identical to the original. In general, SSIM scores of 0.95 and above are found to correspond to imperceptible visual quality degradation (comparable to a PSNR greater than 45 dB). As with PSNR, the SSIM index is computed
frame-by-frame on all three components of the video separately and the
overall SSIM index for the video (for every component) is computed as the
average of all the frame values.
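As an illustration of the formula above, here is a minimal Python sketch that computes one global SSIM index per frame pair and averages the per-frame values, assuming the commonly used constants K1 = 0.01 and K2 = 0.03 for A and B. It is a simplification: practical implementations compute the index over small local windows and average the local results rather than using whole-frame statistics.

import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, bit_depth: int = 8) -> float:
    """Single whole-frame SSIM index for an original frame x and a reconstruction y."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    L = (1 << bit_depth) - 1                   # dynamic range (255 for 8-bit video)
    A = (0.01 * L) ** 2                        # stabilizing constants tied to bit depth
    B = (0.03 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # sigma_x^2 and sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # sigma_xy
    return ((2 * mu_x * mu_y + A) * (2 * cov_xy + B)) / (
        (mu_x ** 2 + mu_y ** 2 + A) * (var_x + var_y + B))

def video_ssim(frames_x, frames_y) -> float:
    """Average the per-frame SSIM indices over a sequence, one component at a time."""
    scores = [global_ssim(fx, fy) for fx, fy in zip(frames_x, frames_y)]
    return sum(scores) / len(scores)

In practice the index is computed on the Y, Cb and Cr planes separately and then averaged over the frames of the sequence, as described above.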
x264 Configuration
./x264 720p5994_stockholm_ter.yuv --output 1000_720p5994_stockholm_ter.264 \
    --input-res 1280x720 --seek 0 --frames 200 --input-depth 8 --ssim \
    --preset veryslow --no-scenecut --tune ssim --keyint -1 --min-keyint 60 \
    --fps 59.94 --bitrate 1000
x265 Configuration
./x265 720p5994_stockholm_ter.yuv --output 1000_720p5994_stockholm_ter.265 \
    --input-res 1280x720 --seek 0 --frames 200 --input-depth 8 --ssim \
    --preset veryslow --no-scenecut --tune ssim --keyint -1 --min-keyint 60 \
    --fps 59.94 --bitrate 1000
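The results discussed next come from repeating these commands at several target bitrates. A small driver script along the following lines can automate such a sweep; the encoder binaries, file names and flags are the ones shown above, while the list of sweep bitrates is an assumption for illustration. With --ssim enabled, each run reports its SSIM summary on the console, from which a table like Table 15 can be compiled.

import subprocess

SOURCE = "720p5994_stockholm_ter.yuv"
COMMON_FLAGS = ["--input-res", "1280x720", "--seek", "0", "--frames", "200",
                "--input-depth", "8", "--ssim", "--preset", "veryslow",
                "--no-scenecut", "--tune", "ssim", "--keyint", "-1",
                "--min-keyint", "60", "--fps", "59.94"]

def encode(encoder: str, bitrate_kbps: int, suffix: str) -> None:
    """Run one encode at the given target bitrate; SSIM is printed by the encoder."""
    output = f"{bitrate_kbps}_720p5994_stockholm_ter.{suffix}"
    cmd = [encoder, SOURCE, "--output", output, *COMMON_FLAGS,
           "--bitrate", str(bitrate_kbps)]
    subprocess.run(cmd, check=True)

for kbps in (500, 1000, 2000, 4000):           # assumed sweep points
    encode("./x264", kbps, "264")
    encode("./x265", kbps, "265")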
Table 15 consolidates the SSIMs and bitrates for each of the test runs. This is
graphically plotted in Figure 91. The figure provides a visual comparison of
SSIM vs. bitrates for both x264 and x265 encoders.
Table 15: SSIM for x264 and x265 encoding.
Figure 91: Comparison of SSIM vs. bitrate for x264 and x265 encoding.
We see that the x265 curve lies to the left of the x264 curve. This means that x265 provides a higher SSIM, and hence higher perceived visual quality, at the same bitrate as x264. Equivalently, for any given SSIM value the x265 curve has a lower corresponding bitrate than the x264 curve, which can be verified by drawing a horizontal line at a fixed SSIM value across the chart. The bitrate difference at equal quality is a measure of the relative compression efficiency of the two encoders.
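One way to turn this horizontal-line comparison into a number is to interpolate each rate-SSIM curve at a common SSIM value and compare the corresponding bitrates, as in the sketch below. The rate-SSIM points used here are placeholders, not the measurements behind Table 15. The Bjøntegaard delta rate (BD-rate) commonly reported in codec comparisons generalizes this idea by averaging the bitrate difference over a range of quality values.

import numpy as np

def bitrate_at_quality(bitrates_kbps, ssim_values, target_ssim):
    """Linearly interpolate the bitrate a curve needs to reach a given SSIM."""
    # np.interp expects the x-coordinates (here the SSIM values) to be increasing.
    return float(np.interp(target_ssim, ssim_values, bitrates_kbps))

# Placeholder rate-SSIM points for the two encoders (illustrative values only).
x264_curve = ([500, 1000, 2000, 4000], [0.930, 0.950, 0.965, 0.975])
x265_curve = ([500, 1000, 2000, 4000], [0.945, 0.962, 0.973, 0.981])

target = 0.95
r264 = bitrate_at_quality(*x264_curve, target)
r265 = bitrate_at_quality(*x265_curve, target)
print(f"Bitrate saving of x265 over x264 at SSIM {target}: {100 * (1 - r265 / r264):.1f}%")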
12.4 Summary
● There are two broad ways of evaluating output video quality: 1)
objective analysis and 2) subjective analysis.
● A majority of the objective VQ metrics assume that the undistorted
source is available for analysis and such metrics are called full
reference (FR) metrics. These include PSNR, SSIM and VMAF.
● PSNR is the most widely used objective metric and is a simple function
of the mean squared error (MSE) value between the source and the
encoded video.
● The HVS is highly specialized in extracting structural information. While traditional metrics like PSNR measure pixel-wise errors, SSIM measures changes in structural information.
● VMAF does not rely on a single method of distortion analysis; instead, it combines several elementary metrics and fuses them using machine learning tools.
● x264 and x265 are widely-used, downloadable software H.264 and
H.265 encoders, respectively. vpxenc from the WebM project is the
free-to-use VP9 encoder.
● Several published comparisons evaluate x264, x265 and libvpx. Both x265 and libvpx use newer coding tools and perform significantly better than earlier-generation H.264 encoding.
12.5 Notes
1. Wang Z, Lu L, Bovik AC. Video quality assessment based on
structural distortion measurement. Signal Process Image Commun.
2004;19(1):1-9.
https://live.ece.utexas.edu/publications/2004/zwang_vssim_spim_2004.pdf
Accessed September 22, 2018.
2. Liu T, Lin W, Kuo C. Image quality assessment using multi-method
fusion. IEEE Trans Image Process. 2013;22(5):1793-1807.
https://www.researchgate.net/publication/234047751_Image_Quality_Assessmen
Method_Fusion. Accessed September 22, 2018.
3. Li Z, Aaron A, Katsavounidis I, et al. Toward a practical perceptual
video quality metric. The Netflix Tech Blog.
https://medium.com/netflix-techblog/toward-a-practical-perceptual-
video-quality-metric-653f208b9652. Published June 6, 2016.
Accessed September 22, 2018.
4. Rassool R. VMAF Reproducibility: Validating a Perceptual
Practical Video Quality Metric. Real Networks.
https://www.realnetworks.com/sites/default/files/vmaf_reproducibility_ieee.pdf
Accessed October 22, 2018.
5. stockholm. xiph.org. Xiph.org Video Test Media [derf's collection].
https://media.xiph.org/video/derf/. Accessed September 22, 2018.
13 Advances in Video
According to industry reports,[1] video data is poised to account for up to 82% of Internet traffic by 2021, with live and VOD video, surveillance video, and VR content driving much of that traffic. As online video consumption grows and infrastructural changes like 5G arrive, high-quality video experiences at UHD and higher resolutions, with higher frame rates and ultralow latencies, will soon be everyday realities. This will be made possible by a combination of advances in a few key areas that, cumulatively, are well suited to drive significant growth. In this chapter we focus on the following three broad areas:
● Advances in machine learning and optimization tools that are being
integrated into existing video encoding frameworks to achieve
compression gains using proven and deployed codecs.
● Newer compression codecs with enhanced tools to address upcoming
video requirements like increased resolutions. We will highlight coding
tools in an upcoming next-generation coding standard called AV1.
● Newer experiential platforms like VR and 360° video, whose inherent video requirements are important topics of ongoing research.
13.1 Per-title Encoder Optimization
Per-title encoding has been around conceptually and in experimental stages
for several years. It was deployed at scale by Netflix in December 2015 as
outlined in a Netflix tech blog article [2] that has also inspired this section.
Internet streaming video services traditionally use a set of bitrate-resolution pairs, also known as a bitrate ladder. This is a table that specifies, for a given codec, what bitrate is sufficient for each fixed resolution, and it therefore also defines the bitrates at which transitions from one resolution to another occur. For example, if the bitrate ladder defines 1280x720 at 1 Mbps and 720x480 at 500 kbps, then as long as the available bitrate remains at about 1 Mbps or above, the player streams the 720p encoding. When network conditions drop the available bitrate below 1 Mbps, the player switches to the 480p version.
resolution used for every bitrate is always fixed. While this is easy to
implement, a fixed ladder may not always be the optimal approach. For
example, if the content or scene is simple with less texture or motion, it will
still be encoded at a fixed bitrate that may be higher than what it really needs.
Conversely, highly complex content or scenes may need more bits than even the highest rung of the fixed ladder allocates. Also, for a given bitrate, a better resolution could be chosen based on the complexity of the content instead of following a fixed resolution ladder. For example, complex scenes may be better encoded at 1280x720 at 2 Mbps, while easier content can be encoded at 1920x1080 at the same bitrate. Thus the fixed approach, while providing good quality, cannot guarantee the best quality for specific content at the requested bitrate. These examples show that the key ingredient missing from traditional fixed bitrate-resolution ladders is content complexity.
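A fixed ladder is essentially a static lookup from available bitrate to resolution, independent of content. The sketch below shows such a selection; the 1 Mbps/720p and 500 kbps/480p rungs mirror the example above, while the other rungs are illustrative assumptions.

# Assumed fixed ladder: (minimum sustainable bitrate in kbps, resolution),
# sorted from highest to lowest rung.
FIXED_LADDER = [
    (3000, "1920x1080"),
    (1000, "1280x720"),
    (500,  "720x480"),
    (0,    "512x384"),
]

def pick_rung(available_kbps: float) -> str:
    """Resolution a fixed ladder would stream at the given available bitrate."""
    for min_kbps, resolution in FIXED_LADDER:
        if available_kbps >= min_kbps:
            return resolution
    return FIXED_LADDER[-1][1]

print(pick_rung(1200))   # -> 1280x720
print(pick_rung(800))    # -> 720x480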
Figure 92: PSNR-bitrate optimal curve for encoding at three resolutions and various bitrates.
The concept can be explained using the following example, where a single
source is encoded at three different resolutions starting with lower and
moving to higher resolutions across various bitrates. From Figure 92, adapted
from the Netflix article, [2] we see that at each resolution, the quality gains
start to diminish beyond a certain bitrate threshold. This means that beyond
this point, the perceptual quality gains are negligible. This level, as seen in
Figure 92, is different for different resolutions. It is clear from this chart that
different resolutions are optimal over different bitrate ranges. The dotted curve joining these optimal operating points represents the ideal operating curve, and selecting bitrate-resolution pairs close to this curve yields the best compression efficiency.
Also, the charts are content specific and the optimal bitrates from this chart
will not necessarily be optimal for other content. Per-title encoding
overcomes this problem posed by fixed ladders by choosing content-specific
bitrate-resolution pairs close to this curve for every title. To do this, experimental results from several data sets can be used to classify source material into complexity classes, and a different bitrate ladder can then be chosen per title based on its content classification, as sketched below.
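The selection of content-specific pairs can be sketched as follows: given measured (bitrate, quality) points for one title at each candidate resolution, pick, for each target bitrate, the resolution that delivers the highest quality at or below that bitrate. The measurements and quality scores below are invented placeholders, not data from the Netflix article.

# Hypothetical per-title measurements: for each candidate resolution, a list of
# (bitrate in kbps, quality score) points for one specific title.
MEASURED = {
    "720x480":   [(300, 78), (500, 84), (800, 87), (1200, 88)],
    "1280x720":  [(500, 80), (800, 86), (1200, 90), (2000, 93)],
    "1920x1080": [(800, 82), (1200, 88), (2000, 94), (3500, 96)],
}

def per_title_ladder(measured, target_bitrates_kbps):
    """Content-specific ladder: best resolution and quality at each target bitrate."""
    ladder = []
    for target in target_bitrates_kbps:
        best = None
        for resolution, points in measured.items():
            # Highest quality this resolution reaches without exceeding the target.
            feasible = [quality for rate, quality in points if rate <= target]
            if feasible and (best is None or max(feasible) > best[1]):
                best = (resolution, max(feasible))
        if best:
            ladder.append((target, *best))
    return ladder

for target, resolution, quality in per_title_ladder(MEASURED, [500, 1200, 2000, 3500]):
    print(f"{target:>5} kbps -> {resolution} (quality {quality})")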
Figure 94: Current 360° video delivery workflow with 2D video encoding.
360° Image source: https://pixabay.com/en/winter-panorama-mountains-snow-2383930/
13.5 Summary
● Per-title encoding overcomes the limitations of a fixed resolution
encoding ladder by incorporating content complexity as a measure to
determine at what bitrate and resolution a specific content will be
encoded.
● At the core of ML prediction algorithms are mathematical optimization techniques that minimize an internal cost function to arrive at the best possible prediction among all candidates. The same machinery can be explored to optimize video encoding decisions.
● AV1 is the new open, royalty-free standard developed by the Alliance for Open Media (AOM). It follows the block-based hybrid model, building on VP9 with several enhancements that, in aggregate, account for its increased compression efficiency.
● Emerging technologies like 360° video and Virtual Reality involve more complex visual scenarios that require orders of magnitude higher video throughput at low latency. Together these impose significant compression requirements, and opportunities therefore exist to develop compression systems suited to these applications.
13.6 Notes
1. Cisco Visual Networking Index: Forecast and Methodology, 2016–
2021. Cisco. June 6, 2017.
https://www.cisco.com/c/en/us/solutions/collateral/service-
provider/visual-networking-index-vni/complete-white-paper-c11-
481360.html. Updated September 15, 2017. Accessed September 22,
2018.
2. Aaron A, Li Z, Manohara M, et al. Per-title encode optimization. The
Netflix Tech Blog. https://medium.com/netflix-techblog/per-title-
encode-optimization-7e99442b62a2. Published December 14, 2015.
Accessed September 22, 2018.
3. Jones A. Netflix introduces dynamic optimizer. Bizety.
https://www.bizety.com/2018/03/15/netflix-introduces-dynamic-
optimizer/. Published March 15, 2018. Accessed September 22,
2018.
4. Escribano GF, Jillani RM, Holder C, Cuenca P. Video encoding and
transcoding using machine learning. MDM'08 Proceedings of the 9th
International Workshop Multimedia Data Mining: Held in
Conjunction with the ACM SIGKDD 2008. New York, NY:
Conference KDD'08 ACM; 2008:53-62.
https://www.researchgate.net/publication/234810713/download.
Accessed September 22, 2018.
5. Massimino P. AOM - AV1 How does it work? AOM-AV1 Video
Tech Meetup. https://parisvideotech.com/wp-
content/uploads/2017/07/AOM-AV1-Video-Tech-meet-up.pdf.
Published July, 2017. Accessed September 22, 2018.
6. Begole, B. Why the internet pipes will burst when virtual reality
takes off. Forbes Valley Voices. Forbes Media LLC.
https://www.forbes.com/sites/valleyvoices/2016/02/09/why-the-
internet-pipes-will-burst-if-virtual-reality-takes-off/#34c7f6e43858.
Published February 9, 2016. Accessed September 22, 2018.
Resources
Bankoski J, Wilkins P, Xu X. Technical overview of VP8, an open source
video codec for the web. Google, Inc.
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37073.
Accessed September 23, 2018.
Bultje RS. Overview of the VP9 video codec. Random Thoughts Blog,
General [Internet]. https://blogs.gnome.org/rbultje/2016/12/13/overview-of-
the-vp9-video-codec/. Published December 13, 2016. Accessed September
23, 2018.
Exponential-Golomb Coding. Wikipedia.
https://wikivisually.com/wiki/Exponential-Golomb_coding. Updated July 9,
2018. Accessed September 23, 2018.
Ghanbari M. History of video coding. Chapter 1 in Standard Codecs: Image
Compression to Advanced Video Coding. London, England: Institution of
Electrical Engineers; 2003.
https://flylib.com/books/en/2.537.1/history_of_video_coding.html. Accessed
September 23, 2018.
Grange A, Alvestrand HT. A VP9 bitstream overview. Network Working
Group Internet-Draft. IETF.org. https://tools.ietf.org/id/draft-grange-vp9-
bitstream-00.html#rfc.section.2.6. Published February 18, 2013. Accessed
September 23, 2018.
Grange A, de Rivaz P, Hunt J. VP9 bitstream and decoding process
specification - v0.6. Google, Inc. webmproject.org.
https://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-
bitstream-specification-v0.6-20160331-draft.pdf. Published March 31, 2016.
Accessed September 23, 2018.
Karwowski D, Grajek T, Klimaszewski K, et al. 20 years of progress in video
compression – from MPEG-1 to MPEG-H HEVC. General view on the path
of video coding development. In Choras RS, ed. Image Processing and
Communications Challenges 8: 8th International Conference, IP&C 2016
Bydgoszcz, Poland, September 2016 Proceedings. Cham, Switzerland:
Springer International Publishing; 2017:3-15.
https://www.researchgate.net/publication/310494503_20_Years_of_Progress_in_Video_C
_from_MPEG-1_to_MPEG-
H_HEVC_General_View_on_the_Path_of_Video_Coding_Development.
Accessed September 23, 2018.
Marpe D, Schwarz H, Wiegand T. Context-Based Adaptive Binary
Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE
Circuits Syst Video Tech.
http://iphome.hhi.de/wiegand/assets/pdfs/csvt_cabac_0305.pdf. Accessed
September 23, 2018.
Melanson M. Video coding concepts: Quantization. Breaking Eggs and
Making Omelettes: Topics on Multimedia Technology and Reverse
Engineering; Multimedia Mike [Internet]. https://multimedia.cx/eggs/video-
coding-concepts-quantization/. Published April 5, 2005. Accessed September
23, 2018.
Mukherjee D, Bankoski J, Bultje RS, et al. A technical overview of VP9—
the latest open-source video codec. SMPTE Motion Imaging J.
2015;124(1):44-54.
https://www.researchgate.net/publication/272399193/download. Accessed
September 22, 2018.
Mukherjee D, Bankoski J, Bultje RS, et al. A technical overview of VP9: The
latest royalty-free video codec from Google. Google, Inc.
http://files.meetup.com/9842252/Overview-VP9.pdf. Accessed September
23, 2018.
Ozer J. Finding the just noticeable difference with Netflix VMAF. Streaming
Learning Center. streaminglearningcenter.com.
https://streaminglearningcenter.com/learning/mapping-ssim-vmaf-scores-
subjective-ratings.html. Published September 4, 2017. Accessed September
23, 2018.
Ozer J. Mapping SSIM and VMAF scores to subjective ratings. Streaming
Learning Center. streaminglearningcenter.com.
https://streaminglearningcenter.com/learning/mapping-ssim-vmaf-scores-
subjective-ratings.html. Published July 5, 2018. Accessed September 23,
2018.
Ozer J. Video Encoding By The Numbers: Eliminate the Guesswork from
Your Streaming Video. Galax, VA: Doceo Publishing, Inc.; 2017.
pieter3d. How VP9 works, technical details & diagrams. Doom9’s Forum
[Internet]. https://forum.doom9.org/showthread.php?t=168947. Published
October 8, 2013. Accessed September 23, 2018.
Rate distortion optimization for encoder control. Fraunhofer Institute for
Telecommunications, Heinrich Hertz Institute, HHI.
https://www.hhi.fraunhofer.de/en/departments/vca/research-groups/image-
video-coding/research-topics/rate-distortion-optimization-rdo-for-encoder-
control.html. Accessed September 23, 2018.
Riabtsev S. Video Compression. www.ramugedia.com.
http://www.ramugedia.com/video-compression. Accessed September 23,
2018.
Richardson I. A short history of video coding. Invited talk at United States
Patent and Trade Office, PETTP 2014 USPTO Tech Week, December 1-5,
2014. SlideShare Technology. slideshare.net.
https://www.slideshare.net/vcodex/a-short-history-of-video-coding. Accessed
September 23, 2018.
Richardson I, Bhat A. Historical timeline of video coding standards and
formats. Vcodex. https://goo.gl/bqyyXd. Accessed September 23, 2018.
Richardson IE. H.264 and MPEG-4 Video Compression: Video Coding for
Next-generation Multimedia. West Sussex, UK: John Wiley & Sons, Ltd.;
2003.
Robitza W. Understanding rate control modes (x264, x265, vpx).
SLHCK.info. https://slhck.info/video/2017/03/01/rate-control.html. Published
March 1, 2017. Updated August, 2018. Accessed September 23, 2018.
Sonnati F. Artificial Intelligence in video encoding optimization. Video
Encoding & Streaming Technologies, Fabio Sonnati on Video Delivery and
Encoding Blog [Internet].
https://sonnati.wordpress.com/2017/10/09/artificial-intelligence-in-video-
encoding-optimization/. Published October 9, 2017. Accessed September 23,
2018.
Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High Efficiency
Video Coding (HEVC) Standard. IEEE Trans Circuits Syst Video Technol.
2012;22(12):1649-1668. https://ieeexplore.ieee.org/document/6316136/?
part=1. Accessed September 21, 2018.
Urban J. Understanding video compression artifacts. biamp.com. Component.
Biamp's Blog [Internet]. http://blog.biamp.com/understanding-video-
compression-artifacts/. Published February 16, 2017. Accessed September
23, 2018.
Vinayagam M. Next generation broadcasting technology - video codec.
SlideShare Technology. slideshare.net.
https://www.slideshare.net/VinayagamMariappan1/video-codecs-62801463.
Published June 7, 2016. Accessed September 23, 2018.
WebM Blog. webmproject.org. http://blog.webmproject.org. Accessed
September 23, 2018.
Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A. Overview of the
H.264/AVC Video Coding Standard. IEEE Trans Circuits Syst Video
Technol. 2003;13(7):560-576.
http://ip.hhi.de/imagecom_G1/assets/pdfs/csvt_overview_0305.pdf. Accessed
September 21, 2018.
Wien M. High Efficiency Video Coding: Coding Tools and Specification.
Berlin and Heidelberg, Germany: Springer-Verlag; 2015.
xiph.org. [YUV video sources]. Xiph.org Video Test Media [derf's
collection]. https://media.xiph.org/video/derf/. Accessed September 23, 2018.
Ye Y. Recent trends and challenges in 360-degree video compression.
Keynote presentation at IEEE International Conference on Multimedia and
Expo (ICME), 9th Hot 3D Workshop. InterDigital Inc. SlideShare
Technology. slideshare.net. https://www.slideshare.net/YanYe5/recent-
trends-and-challenges-in-360degree-video-compression. Published August 1,
2018. Accessed September 23, 2018.
Zhang H, Au OC, Shi Y, et al. Improved sample adaptive offset for HEVC.
Proceedings of the 2013 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference. IEEE.
http://www.apsipa.org/proceedings_2013/papers/142_PID2936291.pdf.
Published 2013. Accessed September 23, 2018.