Deep Affine Motion Compensation Network For Inter Prediction in VVC
Abstract— In video coding, it is a challenge to deal with scenes with complex motions, such as rotation and zooming. Although affine motion compensation (AMC) is employed in Versatile Video Coding (VVC), it is still difficult to handle non-translational motions due to the adopted hand-crafted block-based motion compensation. In this paper, we propose a deep affine motion compensation network (DAMC-Net) for inter prediction in video coding to effectively improve the prediction accuracy. To the best of our knowledge, our work is the first attempt to perform deformable motion compensation based on a CNN in VVC. Specifically, a deformable motion-compensated prediction (DMCP) module is proposed to compensate the current encoding block in a learnable way by estimating accurate motion fields. Meanwhile, the spatial neighboring information and the temporal reference block, as well as the initial motion field, are fully exploited. By effectively fusing the multi-channel feature maps from DMCP, an attention-based fusion and reconstruction (AFR) module is designed to reconstruct the output block. The proposed DAMC-Net is integrated into VVC and the experimental results demonstrate that the proposed method considerably enhances the coding performance.

Index Terms— Video coding, VVC, affine motion compensation, deep neural network, deformable motion compensation.

Manuscript received April 30, 2021; revised July 27, 2021; accepted August 10, 2021. Date of publication August 24, 2021; date of current version June 6, 2022. The work of Jianjun Lei and Bo Peng was supported in part by the National Key R&D Program of China under Grant 2018YFE0203900, in part by the National Natural Science Foundation of China under Grant 61931014 and Grant 61722112, and in part by the Natural Science Foundation of Tianjin under Grant 18JCJQJC45800. The work of Qingming Huang was supported in part by the National Natural Science Foundation of China under Grant 61620106009 and Grant U1636214. This article was recommended by Associate Editor G. Correa. (Corresponding author: Jianjun Lei.)

Dengchao Jin, Jianjun Lei, and Bo Peng are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]; [email protected]; [email protected]).

Wanqing Li is with the Advanced Multimedia Research Laboratory, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]).

Nam Ling is with the Department of Computer Science and Engineering, Santa Clara University, Santa Clara, CA 95053 USA (e-mail: [email protected]).

Qingming Huang is with the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2021.3107135.

Digital Object Identifier 10.1109/TCSVT.2021.3107135

I. INTRODUCTION

WITH the prevalence of high-definition (HD) and ultra-high-definition (UHD) videos, the demand for high-efficiency video compression techniques has increased dramatically. The ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) have developed a series of video compression coding standards [1], [2]. Different from image coding, video coding generally focuses on removing temporal redundancy by inter prediction with motion compensation to effectively boost video coding performance. In the process of motion compensation, the pixels of each block are first predicted from the most similar block in reference frames, and then the residual between the predicted pixels and the real pixels is encoded into the bitstream. Therefore, how to improve the prediction accuracy of motion compensation is highly critical for boosting compression efficiency.

In the latest Versatile Video Coding (VVC) standard, the existing translational motion compensation (TMC) and advanced affine motion compensation (AMC) [3] are jointly exploited for eliminating temporal redundancy. TMC predicts pixels on the assumption that movement between video frames is translational. Therefore, non-translational motions in natural videos will result in a large residual in TMC. To address this issue, AMC is integrated into VVC to improve the ability to deal with complex motions. Although AMC has significantly improved coding performance, there still exist several limitations. First, the subblock-wise motion field is derived from fixed control points by hand-crafted algorithms, thus resulting in blocking artifacts between sub-blocks, and inaccurate prediction in some high-order motions, such as bilinear and perspective motions. Second, existing AMC algorithms pay more attention to the correlation in the temporal domain, while the spatial neighboring information in the current frame is not effectively utilized.

In the past years, deep learning-based methods have achieved promising results in several image and video processing tasks, such as classification, super-resolution, and attention prediction [4]–[8]. Inspired by the success of deep learning, recent research has been devoted to developing learning-based tools for traditional video coding schemes [9]–[25] and learning-based end-to-end compression schemes [26]–[31]. Specifically, several studies [9]–[12] have attempted to substitute or enhance TMC with convolutional neural networks (CNNs) to improve the coding performance. However, there is no report yet on CNN-based AMC to effectively deal with complex motions.

This paper proposes a deep affine motion compensation network (DAMC-Net) to boost the performance of AMC. The main idea of the proposed method is to compensate the current encoding block by estimating accurate motion fields
generate an additional reference frame from coarse to fine. Liu et al. [18] proposed a multi-scale quality attentive factorized kernel convolutional neural network (MQ-FKCNN) to synthesize an additional reference frame. Choi and Bajić [19] utilized both decoded frames and temporal indices to generate a reference frame for the video coding scheme. They also proposed an affine transformation-based scheme [20], in which spatially-varying filters and affine parameters are computed to generate the warped samples for synthesizing the reference frame.

However, to the best of our knowledge, no work has been reported on learning-based AMC. Moreover, the existing learning-based tools for TMC pay little attention to estimating the complex motion field for compensating the current block, while estimating an accurate motion field is essential for motion-compensated prediction.

III. THE PROPOSED METHOD

In this section, the proposed method is presented in detail. First, the architecture of the proposed method is systematically introduced. Second, the deformable motion-compensated prediction module, which plays an important role in the proposed network, is illustrated. Third, the attention-based fusion and reconstruction module is illustrated. Finally, the details of integrating DAMC-Net into VVC are described.

A. Architecture of DAMC-Net

Traditional AMC compensates the current block merely by a parameterized affine model, which has limited capability to model complex motions. To solve this problem, a learning-based model, DAMC-Net, is designed to compensate the current block by explicitly estimating pixel-wise motion fields rather than implicitly deriving the subblock-wise motion field. Specifically, a spatial pixel-wise motion field is estimated to refine the prediction block for alleviating the spatial blocking artifacts, and a temporal pixel-wise motion field between frames is estimated to compensate the current block for alleviating temporal misalignment.

Fig. 2 shows the overall architecture of the proposed DAMC-Net. As shown in the figure, multi-domain information is fully leveraged in the proposed DAMC-Net. In order to improve the prediction accuracy of AMC, the spatial neighboring pixels of the current block as well as its AMC prediction are combined as the first input (I_C) to explore spatial correlations. Besides, to obtain as accurate as possible source pixels in the temporal reference frame of the current block, the most similar block together with its neighboring pixels in the reference frame is constructed based on the control-point motion vectors (CPMVs) and used as the second input (I_R). More importantly, since the initial motion field (I_MF) constructed by the CPMVs contains motion information, I_MF is utilized as the third input. Taking I_C, I_MF, and I_R as the inputs, the proposed DAMC-Net is optimized to jointly utilize the spatial neighboring and temporal correlative information to improve the prediction accuracy.
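The initial motion field I_MF is constructed from the CPMVs with the parametric affine model used by AMC. As a rough illustration only (not the authors' code, and evaluated densely at every pixel, whereas VVC AMC evaluates the model at 4 × 4 subblock centers), the sketch below derives such a motion field from the control-point motion vectors of one block; the function name and array layout are assumptions of this example.

```python
import numpy as np

def affine_motion_field(cpmvs, width, height):
    """Dense motion field of a width x height block from its control-point MVs.
    cpmvs: two rows (top-left v0, top-right v1) for the 4-parameter model,
    or three rows (plus bottom-left v2) for the 6-parameter model."""
    v0 = np.asarray(cpmvs[0], dtype=float)
    v1 = np.asarray(cpmvs[1], dtype=float)
    dx = (v1 - v0) / width                    # horizontal gradient of the MV field
    if len(cpmvs) > 2:                        # 6-parameter model
        dy = (np.asarray(cpmvs[2], dtype=float) - v0) / height
    else:                                     # 4-parameter model (rotation/zoom only)
        dy = np.array([-dx[1], dx[0]])
    xx, yy = np.meshgrid(np.arange(width), np.arange(height))
    mvx = dx[0] * xx + dy[0] * yy + v0[0]
    mvy = dx[1] * xx + dy[1] * yy + v0[1]
    return np.stack([mvx, mvy], axis=-1)      # shape (height, width, 2)

# Example: CPMVs describing a mild zoom for a 16 x 16 block.
imf = affine_motion_field([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]], 16, 16)
```

A two-channel map of this kind is one plausible realization of the I_MF input fed to DAMC-Net.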
In the DMCP module, features F_C, F_MF, and F_R are first extracted from I_C, I_MF, and I_R, respectively. Taking these features as inputs, the motion estimation unit (MEU) is designed to estimate motion fields. Based on the estimated motion fields, deformable convolution is used to compensate F_C and F_R. The compensated features F_Tar^C and F_Tar^R are concatenated with F_C as well as F_R to construct the aggregated feature F_Agg. Finally, the output block, O_DAMC, is reconstructed from F_Agg by an AFR module.

B. Deformable Motion-Compensated Prediction (DMCP)

Due to the limitation of deriving the subblock-wise motion field, the prediction of AMC suffers from misalignment at the pixel level. In order to improve the granularity of AMC, the DMCP module is designed to estimate pixel-wise motion fields for compensating the current block.
In DMCP, to extract deep features with abundant information, the features F_C, F_MF, and F_R are extracted from I_C, I_MF, and I_R by a multi-scale convolution unit [25], respectively. Then, the MEU is designed to estimate accurate motion fields. As shown in Fig. 2, F_C, F_MF, and F_R are first concatenated, and separate convolution operations then follow to generate offsets for each texture branch in the MEU. It should be noted that not only the texture information I_C and I_R but also the initial motion field I_MF is jointly utilized in the MEU to estimate accurate motion fields. Compared to a network which learns motion fields from scratch, the DMCP module estimates accurate motion fields from a coarse input, which helps to reduce the training difficulty of the network and ensure the quality of the learned motion fields.

Let δ_C denote the motion field from F_C to F_Tar^C, which captures the affine motion between I_C and O_DAMC, and let δ_R denote the motion field from F_R to F_Tar^R, which captures the affine motion between I_R and O_DAMC. Since the motion between I_C and O_DAMC is smaller than that between I_R and O_DAMC, the MEU actually estimates a fine motion field from I_C to O_DAMC. The motion fields δ_C and δ_R of the two texture branches are computed as:

δ_C = F_θ1(F_C, F_MF, F_R),   δ_R = F_θ2(F_C, F_MF, F_R)    (1)

where θ1 and θ2 are the parameters learned by the network, and F(·) represents the operation of the motion estimation unit.

Similar to the function of the affine motion compensation module in VVC, the features F_C and F_R of the two texture branches are deformed separately to generate compensated features. Inspired by [34], motion compensation is operated by deformable convolution, which adaptively deforms the kernel sampling under the control of a motion field. Therefore, the compensated features F_Tar^C and F_Tar^R of the two texture branches are computed as follows:

F_Tar^R = DConv(F_R, δ_R),   F_Tar^C = DConv(F_C, δ_C)    (2)

where DConv(·) denotes the deformable convolution [34]. Since DMCP compensates the feature maps rather than the pixels of the target image, it effectively exploits non-local context.
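To make the compensation step concrete, the sketch below shows how one texture branch of DMCP could be realized with an off-the-shelf deformable convolution. It is only an illustration under assumptions: the paper's implementation is in TensorFlow, whereas this sketch uses torchvision's deform_conv2d, and the class name, offset head, and channel count are not taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBranch(nn.Module):
    """Sketch of one texture branch: predict per-pixel offsets (Eq. (1)) from the
    concatenated features, then warp the branch feature map with a deformable
    convolution (Eq. (2)). All layer sizes are illustrative."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        # offset head: maps [F_C, F_MF, F_R] to 2 offsets per kernel tap
        self.offset_conv = nn.Conv2d(3 * channels, 2 * k * k, 3, padding=1)
        self.weight = nn.Parameter(0.01 * torch.randn(channels, channels, k, k))

    def forward(self, f_branch, f_c, f_mf, f_r):
        delta = self.offset_conv(torch.cat([f_c, f_mf, f_r], dim=1))   # motion field
        return deform_conv2d(f_branch, delta, self.weight, padding=1)  # F_Tar

# Toy usage: 64-channel features of a 16 x 16 block for the two branches.
f_c, f_mf, f_r = (torch.randn(1, 64, 16, 16) for _ in range(3))
branch_c, branch_r = DeformableBranch(), DeformableBranch()   # theta_1, theta_2
f_tar_c = branch_c(f_c, f_c, f_mf, f_r)   # compensated current-branch features
f_tar_r = branch_r(f_r, f_c, f_mf, f_r)   # compensated reference-branch features
f_agg = torch.cat([f_c, f_r, f_tar_c, f_tar_r], dim=1)   # aggregated feature F_Agg
```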
C. Attention-Based Fusion and Reconstruction (AFR)

Taking the outputs of the DMCP module as input, the AFR module fuses the multi-channel information and reconstructs the final prediction signal. In order to obtain a feature representation with abundant information and improve the quality of the final prediction signal, the AFR module obtains the aggregated feature F_Agg by fusing the non-deformed features F_C and F_R with the compensated features F_Tar^C and F_Tar^R. Since multiple sources of information are included in F_Agg, an attention mechanism is employed to emphasize useful features and suppress the others. Considering that the residual block can extract deep features effectively, the Res + CBAM structure [35], composed of a CBAM and a residual block, is adopted. Furthermore, a down-sampling layer and an up-sampling layer are employed to increase the receptive field and preserve the low-frequency information, with two Res + CBAM units between them. A skip connection [36] is designed to accelerate the training process. As shown in Fig. 2, the overall framework of the AFR module stacks five Res + CBAM units. To optimize the proposed network, the L2 loss is utilized as the loss function:

L = ||O_GT − O_DAMC||_2^2    (3)

where O_GT is the corresponding block in the raw videos and O_DAMC is the output of the AFR.
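For reference, a compact sketch of one Res + CBAM unit of the AFR module is given below, following the general CBAM design of [35]. The reduction ratio, kernel sizes, and channel count are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module [35]: channel attention followed by
    spatial attention. Reduction ratio and spatial kernel size are illustrative."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: 2-channel map of per-pixel average and maximum
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class ResCBAM(nn.Module):
    """One Res + CBAM unit: a residual block followed by CBAM, with a skip path."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.cbam = CBAM(channels)

    def forward(self, x):
        return x + self.cbam(self.body(x))   # skip connection around the unit
```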
D. Integration of DAMC-Net in VVC

1) The Scope of DAMC-Net in VVC: There are two affine modes in VVC, namely, the affine inter-mode and the affine merge-mode. Since these two modes are both based on AMC, the proposed DAMC-Net is applied to both of them. In addition, since the flexible quadtree nested multi-type tree structure is employed in VVC to split a coding tree unit (CTU) into CUs, there exist CUs with square or rectangular shapes in VVC. Overall, there are 12 sizes of CUs with the affine inter-mode and 19 sizes of CUs with the affine merge-mode. In order to ensure the performance of DAMC-Net for each size of CU, a series of models corresponding to different sizes of CUs is exploited in this paper. Therefore, 12 models are trained for the affine inter-mode and 19 models for the affine merge-mode. The following section illustrates the performance of the two affine modes integrated with DAMC-Net.

2) The Strategy of DAMC-Net in VVC: The DAMC-Net is defined as a new DAMC mode and embedded into the process of CU optimal mode decision. The DAMC mode is utilized as an optional mode for the CUs with the affine inter-mode and the affine merge-mode, and whether it is selected is determined in the encoder based on rate-distortion optimization (RDO). For the DAMC mode in the affine inter-mode, DAMC-Net is first fed with the inputs after AMC and outputs the compensated block O_DAMC. Then, the RDO process determines whether to select the DAMC mode, and a designed flag recording the decision result of the RDO is signalled to the decoder. For the DAMC mode in the affine merge-mode, considering the encoding complexity, whether it is selected is determined by the RDO after the best affine merge candidate is searched. Meanwhile, only the CUs with the affine merge-mode need to encode and decode the flag. Since the DAMC-Net is nested in the process of CU optimal mode decision rather than applied as post-processing after encoding frames, the selection rate of CUs with affine modes increases significantly.
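A simplified sketch of this encoder-side decision is given below: the DAMC-Net output competes with the conventional AMC prediction under a rate-distortion cost, and a one-bit flag records the outcome. The cost model, bit estimates, and function names are illustrative assumptions and not VTM code.

```python
import numpy as np

def rd_cost(original, prediction, bits, lam):
    """Simplified rate-distortion cost J = D + lambda * R, with SSD distortion.
    lam is the Lagrange multiplier (in VTM it is derived from the QP)."""
    distortion = float(np.sum((original.astype(np.float64) - prediction) ** 2))
    return distortion + lam * bits

def choose_damc_mode(original, pred_amc, pred_damc,
                     residual_bits_amc, residual_bits_damc, lam):
    """Illustrative mode decision: select the DAMC mode only if its RD cost,
    including the one-bit flag signalled for both alternatives, is lower."""
    cost_amc = rd_cost(original, pred_amc, residual_bits_amc + 1, lam)    # flag = 0
    cost_damc = rd_cost(original, pred_damc, residual_bits_damc + 1, lam)  # flag = 1
    use_damc = cost_damc < cost_amc
    return use_damc, min(cost_amc, cost_damc)
```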
IV. EXPERIMENTS

A. Experimental Settings

1) Training Data Preparation: To evaluate the proposed DAMC-Net, a training dataset is first collected with 106 videos at different resolutions from [37], [38] and 8 videos at the resolution of 3840 × 2160 from [39], that is, 114 raw video sequences with rich and complex motions in total. Considering the complexity of VVC, the 4K videos are down-sampled to 1280 × 720.
TABLE I
BD-RATE RESULTS OF THE PROPOSED DAMC-NET FOR AFFINE INTER-MODE
Fig. 3. Visual comparison between VTM-6.2 and the proposed Inter & Merge DAMC-Net. Top: the 3rd frame of BQsquare under QP 27. Bottom: the 30th frame of PartyScene under QP 32. (a) BQsquare original image, (b) Inter & Merge AMC (5216 bits, 35.24 dB), (c) Proposed (4440 bits, 35.42 dB), (d) PartyScene original image, (e) Inter & Merge AMC (22408 bits, 29.82 dB), (f) Proposed (21984 bits, 29.91 dB).
Then, VTM-6.2 is used to compress the video sequences configured with Low Delay P (LDP) under four quantization parameters (QPs) {22, 27, 32, 37}. Due to the high similarity between adjacent frames, video frames are sampled at a regular interval of 3 to generate the training samples.
TABLE II
BD-RATE RESULTS OF THE PROPOSED DAMC-NET FOR BOTH AFFINE INTER-MODE AND AFFINE MERGE-MODE
In the process of compression, the inputs I_C, I_MF, and I_R of the CUs with affine modes in the selected frames, together with the corresponding ground-truth blocks in the raw video frames, are utilized to construct training samples. Consequently, 76 sub-datasets in total are obtained, which correspond to 4 QPs and 19 CU sizes (8 × 8, 8 × 16, 8 × 32, 8 × 64, 16 × 8, 16 × 16, 16 × 32, 16 × 64, 32 × 8, 32 × 16, 32 × 32, 32 × 64, 64 × 8, 64 × 16, 64 × 32, 64 × 64, 64 × 128, 128 × 64, 128 × 128).
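A minimal sketch of one way such sub-datasets could be organized is shown below; the naming scheme is a hypothetical choice for illustration only.

```python
from itertools import product

QPS = (22, 27, 32, 37)
CU_SIZES = ((8, 8), (8, 16), (8, 32), (8, 64), (16, 8), (16, 16), (16, 32),
            (16, 64), (32, 8), (32, 16), (32, 32), (32, 64), (64, 8), (64, 16),
            (64, 32), (64, 64), (64, 128), (128, 64), (128, 128))

# One sub-dataset per (QP, CU size) pair: 4 x 19 = 76 in total.
sub_datasets = {f"qp{qp}_{w}x{h}": [] for qp, (w, h) in product(QPS, CU_SIZES)}
assert len(sub_datasets) == 76
```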
2) Encoding Configurations: DAMC-Net is integrated into the VVC reference software VTM (version 6.2). Experiments are performed under the JVET common test conditions (CTC) [40]. Since a single reference frame is utilized in the proposed DAMC-Net, the LDP configuration and Classes B∼E are tested. In the experiments, the testing QPs are set as {22, 27, 32, 37}, and the widely employed BD-rate [41], [42] is used as the objective metric to evaluate the coding performance. A CPU + GPU cluster is used as the test environment, where the VVC coding runs on the CPU and the DAMC-Net runs on the GPU. The CPU is an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, and the GPU is an NVIDIA GeForce GTX 1080Ti.
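For completeness, a small sketch of the Bjontegaard delta-rate (BD-rate) computation [41], [42] is shown below; it follows the common cubic-fit formulation over rate-PSNR points, and the function name and interface are assumptions of this example.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """BD-rate in percent (negative = bitrate savings of the test codec)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # cubic polynomial fit of log-rate as a function of PSNR
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # overlapping PSNR interval of the two curves
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # integrate both fitted curves over the common interval
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_rate_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_rate_diff) - 1.0) * 100.0
```

With the four QPs used here, each curve is fitted from four rate-PSNR points per sequence.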
3) Training Strategy: The proposed DAMC-Net is implemented in TensorFlow [43] and trained on an NVIDIA GeForce GTX 1080Ti GPU. To satisfy all the sizes of CUs with affine modes, 12 models for the affine inter-mode and 19 models for the affine merge-mode are trained for each QP. In the training phase, the base model, with the affine inter-mode, a QP of 22, and a CU size of 16 × 16, is trained first. Specifically, the network is optimized using Adam [44] with a batch size of 128, and the learning rate is initially set to 0.0001 for the first 2,000,000 steps and decayed to 0.00001 for the last 1,000,000 steps. Then, the other models are refined from the base model with a learning rate of 0.00005 for 100,000 steps.
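A brief sketch of the described optimization setup, assuming the TensorFlow Keras API (the paper's exact training code is not available), is:

```python
import tensorflow as tf

# Piecewise-constant schedule matching the base-model training described above:
# 1e-4 for the first 2,000,000 steps, then 1e-5 for the last 1,000,000 steps.
base_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[2_000_000], values=[1e-4, 1e-5])
base_optimizer = tf.keras.optimizers.Adam(learning_rate=base_schedule)

# The other (QP, CU size, affine mode) models are refined from the base model
# with a fixed learning rate for 100,000 steps.
finetune_optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

# L2 loss of Eq. (3) (the Keras MSE differs from a plain sum only by a scale factor).
loss_fn = tf.keras.losses.MeanSquaredError()
```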
B. Comparison Results and Analyses

To validate the effectiveness of the proposed DAMC-Net, the performance of the scheme of DAMC-Net for the affine inter-mode (Inter DAMC-Net) is first compared with the scheme of VTM-6.2 with the affine inter-mode (Inter AMC). Meanwhile, VTM-6.2 without the affine inter-mode is set as the baseline to compute the BD-rate. The coding performance on the Y, Cb, and Cr components of the different methods is reported in Table I. As shown in the table, on the Y component, the “Inter DAMC-Net” achieves 4.11%, 1.59%, 3.43%, and 3.02% BD-rate reduction on average for Classes B, C, D, and E, respectively. Particularly, the “Inter DAMC-Net” achieves up to 6.13% BD-rate reduction on BQsquare, while “Inter AMC” only obtains 3.43% BD-rate reduction. Besides, to further validate the advantage of the DAMC-Net, experiments about
TABLE V
BD-RATE RESULTS OF THE DAMC-NET WITHOUT DMCP MODULE
Fig. 4. Mode selection results for the 21st frame of Cactus under QP 22. (a) Inter AMC. (b) Proposed.
TABLE VIII
BD-RATE RESULTS OF THE PROPOSED METHOD FOR VTM-12.1
4.32% to 4.95% when compared with the results on VTM-6.2. Therefore, the proposed method is proven effective for the new reference software.

V. CONCLUSION

In this paper, a DAMC-Net for inter prediction is proposed to effectively boost the performance of affine motion compensation in VVC. In order to compensate the current encoding block, a DMCP module is designed to estimate accurate motion fields from the spatial neighboring information, the temporal reference block, and the initial motion field. Then, an attention-based fusion and reconstruction module is designed to fuse the multi-channel features from DMCP and reconstruct the final prediction signal. The proposed DAMC-Net is integrated into VVC as an optional mode for CUs with affine modes. Experimental results demonstrate that the proposed DAMC-Net can considerably enhance coding performance.

REFERENCES

[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[2] S. Li, C. Zhu, and M.-T. Sun, “Hole filling with multiple reference views in DIBR view synthesis,” IEEE Trans. Multimedia, vol. 20, no. 8, pp. 1948–1959, Aug. 2018.
[3] L. Li et al., “An efficient four-parameter affine motion model for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 8, pp. 1934–1948, Aug. 2018.
[4] J. Xie, N. He, L. Fang, and P. Ghamisi, “Multiscale densely-connected fusion networks for hyperspectral images classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 1, pp. 246–259, Jan. 2021.
[5] J. Lei et al., “Deep stereoscopic image super-resolution via interaction module,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 8, pp. 3051–3061, Aug. 2021.
[6] L. Wang et al., “Learning parallax attention for stereo image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12242–12251.
[7] Y. Fang, C. Zhang, H. Huang, and J. Lei, “Visual attention prediction for stereoscopic video by multi-module fully convolutional network,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5253–5265, Nov. 2019.
[8] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang, “MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 933–948, Mar. 2021.
[9] S. Huo, D. Liu, F. Wu, and H. Li, “Convolutional neural network-based motion compensation refinement for video coding,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4.
[10] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, “Enhanced bi-prediction with convolutional neural network for high-efficiency video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3291–3301, Nov. 2019.
[11] J. Mao and L. Yu, “Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1856–1870, Jul. 2020.
[12] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, “Neural network based inter prediction for HEVC,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018, pp. 1–4.
[13] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, “Convolutional neural network-based fractional-pixel motion compensation,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 840–853, Mar. 2019.
[14] J. Lin, D. Liu, H. Li, and F. Wu, “Generative adversarial network-based frame extrapolation for video coding,” in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2018, pp. 1–4.
[15] S. Huo, D. Liu, B. Li, S. Ma, F. Wu, and W. Gao, “Deep network-based frame extrapolation with reference frame alignment,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 1178–1192, Mar. 2021.
[16] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Enhanced motion-compensated video coding with deep virtual reference frame generation,” IEEE Trans. Image Process., vol. 28, no. 10, pp. 4832–4844, Oct. 2019.
[17] S. Xia, W. Yang, Y. Hu, and J. Liu, “Deep inter prediction via pixel-wise motion oriented reference generation,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1710–1714.
[18] J. Liu, S. Xia, and W. Yang, “Deep reference generation with multi-domain hierarchical constraints for inter prediction,” IEEE Trans. Multimedia, vol. 22, no. 10, pp. 2497–2510, Oct. 2020.
[19] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1843–1855, Jul. 2020.
[20] H. Choi and I. V. Bajić, “Affine transformation-based deep frame prediction,” IEEE Trans. Image Process., vol. 30, pp. 3321–3334, 2021.
[21] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, “Convolutional neural network-based block up-sampling for HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3701–3715, Dec. 2019.
[22] Y. Li et al., “Convolutional neural network-based block up-sampling for intra frame coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 9, pp. 2316–2330, Sep. 2018.
[23] J. Deng, L. Wang, S. Pu, and C. Zhuo, “Spatio-temporal deformable convolution for compressed video quality enhancement,” in Proc. AAAI, Apr. 2020, pp. 10696–10703.
[24] Z. Pan, X. Yi, Y. Zhang, B. Jeon, and S. Kwong, “Efficient in-loop filtering based on enhanced deep convolutional neural networks for HEVC,” IEEE Trans. Image Process., vol. 29, pp. 5352–5366, 2020.
[25] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu, and Z. Wang, “MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 949–963, Mar. 2021.
[26] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wang, “Image and video compression with neural networks: A review,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1683–1698, Jun. 2020.
[27] J. Lin, D. Liu, H. Li, and F. Wu, “M-LVC: Multiple frames prediction for learned video compression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3543–3551.
[28] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8500–8509.
[29] R. Yang, F. Mentzer, L. Van Gool, and R. Timofte, “Learning for video compression with hierarchical quality and recurrent enhancement,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6627–6636.
[30] Z. Chen, T. He, X. Jin, and F. Wu, “Learning for video compression,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 2, pp. 566–576, Feb. 2020.
[31] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10998–11007.
[32] K. Zhang, Y.-W. Chen, L. Zhang, W.-J. Chien, and M. Karczewicz, “An improved framework of affine motion compensation in video coding,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1456–1469, Mar. 2019.
[33] H. Huang, J. W. Woods, Y. Zhao, and H. Bai, “Control-point representation and differential coding affine-motion compensation,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 10, pp. 1651–1660, Oct. 2013.
[34] J. Dai et al., “Deformable convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[35] S. Woo, J. Park, J. Lee, and I. Kweon, “CBAM: Convolutional block attention module,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Jul. 2018, pp. 3–19.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[37] Xiph.org. (2017). Xiph.org Video Test Media. [Online]. Available: https://media.xiph.org/video/derf
[38] VQEG. (2017). VQEG Video Datasets Organizations. [Online]. Available: https://www.its.bldrdoc.gov/vqeg/video-datasetsand-organizations.aspx/
[39] L. Song, X. Tang, W. Zhang, X. Yang, and P. Xia, “The SJTU 4K video sequence dataset,” in Proc. 5th Int. Workshop Qual. Multimedia Exper. (QoMEX), Jul. 2013, pp. 34–35.
[40] K. Suehring and X. Li, JVET Common Test Conditions and Software Reference Configurations, document JVET-G1010, Aug. 2017.
[41] J. Lei, J. Duan, W. Feng, N. Ling, and C. Hou, “Fast mode decision based on grayscale similarity and inter-view correlation for depth map coding in 3D-HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 3, pp. 706–718, Mar. 2018.
[42] G. Bjontegaard, Calculation of Average PSNR Differences Between RD Curves, document VCEG-M33, Apr. 2001.
[43] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” 2016, arXiv:1603.04467. [Online]. Available: http://arxiv.org/abs/1603.04467
[44] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980

Dengchao Jin received the B.S. degree in mechatronic engineering from Northwestern Polytechnical University, Xi’an, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His research interests include video coding and deep learning.

Jianjun Lei (Senior Member, IEEE) received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2007. He was a Visiting Researcher with the Department of Electrical Engineering, University of Washington, Seattle, WA, USA, from August 2012 to August 2013. He is currently a Professor with Tianjin University, Tianjin, China. His research interests include 3-D video processing, virtual reality, and artificial intelligence.

Nam Ling (Life Fellow, IEEE) received the B.Eng. degree from the National University of Singapore, Singapore, in 1981, and the M.S. and Ph.D. degrees from the University of Louisiana at Lafayette, Lafayette, LA, USA, in 1985 and 1989, respectively. From 2002 to 2010, he was an Associate Dean with the School of Engineering, Santa Clara University, Santa Clara, CA, USA. He was the Sanfilippo Family Chair Professor, and is currently the Wilmot J. Nicholson Family Chair Professor and the Chair of the Department of Computer Science and Engineering, Santa Clara University. He is/was also a Consulting Professor with the National University of Singapore; a Guest Professor with Tianjin University, Tianjin, China, and Shanghai Jiao Tong University, Shanghai, China; Cuiying Chair Professor with Lanzhou University, Lanzhou, China; a Chair Professor and Minjiang Scholar with Fuzhou University, Fuzhou, China; a Distinguished Professor with Xi’an University of Posts and Telecommunications, Xi’an, China; a Guest Professor with Zhongyuan University of Technology, Zhengzhou, China; and an Outstanding Overseas Scholar with Shanghai University of Electric Power, Shanghai. He has authored or coauthored over 220 publications and seven adopted standard contributions. He has been granted nearly 20 U.S. patents so far. He is an IEEE Fellow due to his contributions to video coding algorithms and architectures. He is also an IET Fellow. He was named as an IEEE Distinguished Lecturer twice and also an APSIPA Distinguished Lecturer. He was a recipient of the IEEE ICCE Best Paper Award (First Place) and the Umedia Best/Excellent Paper Award three times, six awards from Santa Clara University, four at the University level (Outstanding Achievement, Recent Achievement in Scholarship, President’s Recognition, and Sustained Excellence in Scholarship), and two at the School/College level (Researcher of the Year and Teaching Excellence). He was a Keynote Speaker of IEEE APCCAS, VCVP (twice), JCPC, IEEE ICAST, IEEE ICIEA, IET FC Umedia, IEEE Umedia, IEEE ICCIT, and Workshop at XUPT (twice); and a Distinguished Speaker of IEEE ICIEA. He served as the General Chair/Co-Chair for IEEE Hot Chips, VCVP (twice), IEEE ICME, Umedia (seven times), and IEEE SiPS. He was an Honorary Co-Chair of IEEE Umedia 2017. He served as the Technical Program Co-Chair for IEEE ISCAS, APSIPA ASC, IEEE APCCAS, IEEE SiPS (twice), DCV, and IEEE VCIP. He was the Technical Committee Chair of IEEE CASCOM TC and IEEE TCMM. He served as a Guest Editor or an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, IEEE ACCESS, JSPS (Springer), and MSSP (Springer).