Bidirectional recurrent deformable alignment network for video super-resolution
Wenli Shui, Hongbin Cai*, Guanghui Lu
University of Electronic Science and Technology of China, Chengdu, Sichuan, China
*[email protected]
ABSTRACT
The challenge in video super-resolution (VSR) lies in exploiting the temporal correlation within video sequences and achieving precise inter-frame alignment. We propose a bidirectional recurrent network that integrates multi-scale deformable alignment with temporal attention fusion. We first design a multi-scale cascading deformable alignment module, which captures spatiotemporal information at different scales through gradual refinement and employs deformable convolutions for adaptive feature alignment to handle complex motion and deformation. To fuse temporal information effectively, we add a temporal attention fusion module, which dynamically weights the information of each frame using an attention mechanism. We carried out a comprehensive series of experiments on multiple standard VSR datasets, and the results demonstrate the effectiveness of our method.
Keywords: Video Super-Resolution, Bidirectional Recurrent Network, Deformable Convolution, Multi-Scale Analysis,
Temporal Attention
1. INTRODUCTION
Single-image super-resolution relies on exploiting internal self-similarity. Videos, being multi-frame, require us to exploit spatiotemporal correlations for reconstruction. As deep learning continues to advance, its application to VSR has become mainstream. BasicVSR [1] decomposed VSR models into four components: propagation, alignment, aggregation, and upsampling. Propagation concerns how temporal information is used and is commonly divided into sliding-window and recurrent schemes. Studies have shown that bidirectional propagation outperforms unidirectional propagation, compensating for the limitations of forward-only feature refinement. There are two main alignment approaches: flow-based (e.g., VESPCN [2], SOF-SRB [3], OFR-BRN [4]) and flow-free (e.g., DUF [5], MOFN [6], RGAN [7]). Flow-based alignment warps supporting frames at the image level but relies heavily on the accuracy of motion estimation; inaccurate flow can introduce artifacts. Flow-free alignment performs motion compensation implicitly at the feature level. Regarding the fusion stage, VESPCN [2] explored early fusion, slow fusion, and 3D convolution. FRVSR [8] and RBPN [9] employed recurrent neural networks for feature fusion. SOF-SRB [3] used sub-pixel convolution to complete the VSR reconstruction.
In this work, we design a VSR network that integrates deformable convolution, attention mechanisms, and multi-scale alignment to restore high-quality video frames. Our key contributions are: 1) we introduce a bidirectional recurrent VSR network with multi-scale cascading deformable alignment and temporal attention fusion; 2) we use deformable convolution for multi-scale alignment that adapts to complex motion; 3) during feature fusion, an attention mechanism weights the information from the preceding and succeeding frames.
2. RELATED WORK
2.1 Video super-resolution
Prominent VSR algorithms, such as FRVSR [8] and VESPCN [2], use optical flow for frame alignment. RBPN [9] proposed a recurrent encoder-decoder that extracts missing information through multiple back-projections. BasicVSR++ [10] used flow-guided deformable alignment to align inter-frame features. Liu et al. [11] employed a dynamic local filtering network for implicit motion compensation. 3D convolution can analyze the spatial and temporal dimensions jointly, which benefits sequential tasks. Zhou et al. [12] combined 3D convolution with deformable convolution to integrate multi-dimensional spatiotemporal information. Liang et al. [13] combined a recurrent network with a Transformer to balance the strengths of both. Lee et al. [14] explored reference-based VSR using multi-camera video triplets.
3. METHOD
3.1 Overview
For a sequence of low-resolution (LR) video frames $\{I_1^{LR}, I_2^{LR}, \ldots, I_T^{LR}\}$, the method aims to restore the corresponding high-resolution (HR) frame sequence $\{I_1^{HR}, I_2^{HR}, \ldots, I_T^{HR}\}$. The architecture of our network, depicted in Figure 1, is a bidirectional recurrent structure that accepts three consecutive LR frames $I_{t-1}^{LR}$, $I_t^{LR}$, $I_{t+1}^{LR}$ as inputs. $I_{t-1}^{LR}$ and $I_t^{LR}$ are fed into the forward propagation branch (FPB), while $I_{t+1}^{LR}$ and $I_t^{LR}$ are processed by the backward propagation branch (BPB). In both the FPB and the BPB, a multi-scale cascading deformable alignment module (MCDM) performs feature alignment, aligning the reference frame with the supporting frames at multiple feature scales. To blend and restore the outputs of the FPB and BPB, we propose a temporal attention fusion module (TAFM). In the final step, $I_t^{HR}$ is generated by combining the bicubic-upsampled $I_t^{LR}$ with the features produced by the TAFM.
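To make the data flow concrete, the following minimal PyTorch sketch traces one time step of the pipeline. The FPB, BPB, and TAFM names follow the paper, but their interfaces, the hidden-state handling, and the additive combination of the TAFM output with the bicubic-upsampled frame are illustrative assumptions rather than the exact implementation.

import torch.nn as nn
import torch.nn.functional as F

class BidirectionalRecurrentVSR(nn.Module):
    # One time step: forward/backward propagation, temporal attention fusion,
    # and a bicubic skip connection. Sub-module internals are defined elsewhere.
    def __init__(self, fpb, bpb, tafm, scale=4):
        super().__init__()
        self.fpb, self.bpb, self.tafm = fpb, bpb, tafm
        self.scale = scale

    def forward(self, lr_prev, lr_cur, lr_next, h_fwd, h_bwd):
        # FPB aligns the previous frame to the current one; BPB aligns the next frame.
        out_f, h_fwd = self.fpb(lr_prev, lr_cur, h_fwd)
        out_b, h_bwd = self.bpb(lr_next, lr_cur, h_bwd)
        # TAFM weights and fuses the two propagated features and predicts an HR residual.
        residual = self.tafm(out_f, out_b, lr_cur)
        # The final HR frame adds the residual to the bicubic-upsampled current frame.
        base = F.interpolate(lr_cur, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)
        return base + residual, h_fwd, h_bwd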
3.2 Forward/Backward propagation branch
The FPB and BPB share a similar architecture, so we describe the detailed configuration using the FPB as an example. In the FPB, a feature extraction module derives features $F_{t-1}$ and $F_t$ from $I_{t-1}^{LR}$ and $I_t^{LR}$, respectively. The module consists of a convolutional layer and three residual blocks, with ReLU serving as the activation function. Next, the MCDM takes $F_{t-1}$ and $F_t$ as inputs to predict $F_t^{f}$, the forward-aligned feature of $I_t^{LR}$. The hidden state $h_{t-1}^{f}$ from the previous time step is then used to further refine $F_t^{f}$. Subsequently, $F_t^{f}$ is passed to two branches to obtain the forward hidden state $h_t^{f}$ and the forward output $O_t^{f}$. $h_t^{f}$ is applied to the FPB at the next time step, while $O_t^{f}$ is passed to the TAFM to synthesize the high-resolution reference frame $I_t^{HR}$.
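A sketch of the FPB under this description is given below. The one-convolution-plus-three-residual-blocks extractor follows the text; the channel width, kernel sizes, and the concatenation-plus-convolution used to refine the aligned feature with the previous hidden state are assumptions, and the MCDM is passed in as a sub-module (sketched in the next subsection).

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class ForwardPropagationBranch(nn.Module):
    # Sketch of the FPB: feature extraction -> MCDM alignment -> hidden-state refinement.
    def __init__(self, mcdm, ch=64):
        super().__init__()
        # Feature extraction: one conv layer followed by three residual blocks (ReLU).
        self.extract = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            *[ResidualBlock(ch) for _ in range(3)])
        self.mcdm = mcdm                                   # multi-scale deformable alignment
        self.refine = nn.Conv2d(2 * ch, ch, 3, padding=1)  # fuse aligned feature with h_{t-1}
        self.to_hidden = nn.Conv2d(ch, ch, 3, padding=1)   # branch 1: next hidden state
        self.to_output = nn.Conv2d(ch, ch, 3, padding=1)   # branch 2: forward output for TAFM

    def forward(self, lr_prev, lr_cur, h_prev):
        f_prev, f_cur = self.extract(lr_prev), self.extract(lr_cur)
        aligned = self.mcdm(f_cur, f_prev)                 # align supporting frame to reference
        refined = self.refine(torch.cat([aligned, h_prev], dim=1))
        return self.to_output(refined), self.to_hidden(refined)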
3.3 Multi-scale cascading deformable alignment module
Deformable convolution augments the regular convolution sampling grid with learned offsets, so that the output at position $p$ is computed as
$y(p) = \sum_{k} w_k \cdot x(p + p_k + \Delta p_k)$  (2)
where $w_k$ denotes the weight at the $k$-th sampling position, $p_k$ the $k$-th position of the regular grid, and $\Delta p_k$ the learned offset for the $k$-th position, respectively. Since $p + p_k + \Delta p_k$ may be fractional, bilinear interpolation is applied.
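Eq. (2) can be realized with the deformable convolution operator provided by torchvision. In the sketch below, predicting the offsets from the concatenated reference and supporting features is a common design and an assumption on our part, since the offset estimator is not specified above.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    # Sketch of Eq. (2): sample the supporting feature at learned, fractional offsets.
    def __init__(self, ch=64, k=3):
        super().__init__()
        # Offsets (2 per sampling position) predicted from concatenated features.
        self.offset_conv = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.dconv = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, f_ref, f_sup):
        offsets = self.offset_conv(torch.cat([f_ref, f_sup], dim=1))
        # DeformConv2d bilinearly interpolates at p + p_k + Δp_k, matching Eq. (2).
        return self.dconv(f_sup, offsets)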
In alignment, it is necessary to handle large motions while preserving detail accuracy. To this end, the MCDM applies three cascading deformable convolutions on multi-scale feature maps for coarse-to-fine motion compensation. Figure 2 (a) shows the structure of the MCDM. The feature maps of the reference frame and its adjacent frame are passed through a convolution layer to produce the first-level feature maps $F_{ref}^{1}$ and $F_{sup}^{1}$. Subsequently, two strided convolutions downsample the features, yielding three levels of feature maps whose resolution decreases by a factor of 2 at each level.
The process begins at the highest-level (lowest-resolution) stage and advances to the lowest-level (highest-resolution) stage, applying a deformable convolution at each level. The resulting feature map is then upsampled and propagated to the next, finer level:
$F_a^{l} = C\left(\mathrm{DConv}\left(F_{ref}^{l}, F_{sup}^{l}\right), \left(F_a^{l+1}\right)\uparrow\right)$  (3)
where $C(\cdot)$ denotes several convolution layers, $\mathrm{DConv}(\cdot)$ denotes the deformable convolution described in Eq. (2), and $(\cdot)\uparrow$ denotes bilinear interpolation for ×2 upsampling.
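The following sketch illustrates the three-level coarse-to-fine cascade of Eq. (3), reusing the DeformableAlign module from the sketch after Eq. (2). The channel width and the concatenation-based fusion inside $C(\cdot)$ are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDM(nn.Module):
    # Multi-scale cascade of Eq. (3): align from the coarsest to the finest level.
    def __init__(self, ch=64):
        super().__init__()
        self.down1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # level-2 features
        self.down2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # level-3 features
        # DeformableAlign is the Eq. (2) sketch defined above.
        self.align = nn.ModuleList([DeformableAlign(ch) for _ in range(3)])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(2)])

    def pyramid(self, f):
        f1 = f
        f2 = self.down1(f1)
        f3 = self.down2(f2)
        return [f1, f2, f3]

    def forward(self, f_ref, f_sup):
        refs, sups = self.pyramid(f_ref), self.pyramid(f_sup)
        # Start at the coarsest level and propagate the aligned feature upward.
        aligned = self.align[2](refs[2], sups[2])
        for lvl in (1, 0):
            up = F.interpolate(aligned, scale_factor=2, mode='bilinear', align_corners=False)
            cur = self.align[lvl](refs[lvl], sups[lvl])
            aligned = self.fuse[lvl](torch.cat([cur, up], dim=1))  # Eq. (3): C(DConv(.), up)
        return aligned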
3.4 Temporal attention fusion module
In the fusion stage, to better utilize the feature maps propagated from the preceding and succeeding frames, we introduce temporal attention. As shown in Figure 2 (b), we first compute attention maps between the reference frame's feature map and those of the preceding and succeeding frames separately. Taking the forward output $O_t^{f}$ as an example, the attention weight is computed as:
$W_f = \mathrm{Sigmoid}\left(\mathrm{Conv}\left(\left[F_t, O_t^{f}\right]\right)\right)$  (4)
where $[\cdot\,,\cdot]$ denotes channel-wise concatenation. The backward weight $W_b$ is computed analogously, and the weighted forward and backward features are fused to form the output of the TAFM.
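A possible realization of the TAFM is sketched below. The sigmoid-of-convolution attention weights, the derivation of the reference feature from the current LR frame, and the pixel-shuffle upsampling of the fused feature are assumptions made for illustration.

import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    # Sketch of the TAFM: weight forward/backward features against the reference, fuse, upsample.
    def __init__(self, ch=64, scale=4):
        super().__init__()
        self.ref_feat = nn.Conv2d(3, ch, 3, padding=1)     # assumed reference feature extractor
        self.att_f = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid())
        self.att_b = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        # Pixel-shuffle upsampling to an HR residual with 3 channels.
        self.up = nn.Sequential(
            nn.Conv2d(ch, 3 * scale * scale, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, out_f, out_b, lr_cur):
        f_ref = self.ref_feat(lr_cur)
        w_f = self.att_f(torch.cat([f_ref, out_f], dim=1))  # Eq. (4), forward branch
        w_b = self.att_b(torch.cat([f_ref, out_b], dim=1))  # analogous backward weight
        fused = self.fuse(torch.cat([w_f * out_f, w_b * out_b], dim=1))
        return self.up(fused)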
4. EXPERIMENTS
4.1 Implementation details
Datasets: We used the Vimeo-90K dataset [15] for training. Widely adopted in video processing tasks, Vimeo-90K contains a rich variety of scenes and motions. We evaluated our method against other advanced methods on two test sets: Vid4 and Vimeo-90K-T. Vid4 comprises four video sequences: city, walk, calendar, and foliage. Vimeo-90K-T is a subset of Vimeo-90K reserved for testing. To diversify the training data, we augmented the dataset with rotation and flipping.
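The rotation and flipping augmentation can be applied jointly to each LR/HR clip, for example as follows; the 90-degree rotation steps and the flip probabilities are typical choices and are not specified in the paper.

import random
import torch

def augment_pair(lr_frames, hr_frames):
    # Apply the same random flip/rotation to paired LR and HR clips of shape (T, C, H, W).
    if random.random() < 0.5:                       # horizontal flip
        lr_frames, hr_frames = lr_frames.flip(-1), hr_frames.flip(-1)
    if random.random() < 0.5:                       # vertical flip
        lr_frames, hr_frames = lr_frames.flip(-2), hr_frames.flip(-2)
    k = random.randint(0, 3)                        # rotation by 0/90/180/270 degrees
    lr_frames = torch.rot90(lr_frames, k, dims=(-2, -1))
    hr_frames = torch.rot90(hr_frames, k, dims=(-2, -1))
    return lr_frames, hr_frames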
Training details: During training, the initial learning rate is set to 4×10⁻⁴ and is decreased by a factor of 0.1 every 25 epochs until epoch 75. The batch size is 64. We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 5×10⁻⁴. A downsampling factor of s = 4 is used in all experiments. We adopt the Charbonnier penalty loss, defined as $\mathcal{L} = \sqrt{\left\| I_t^{HR} - I_t^{GT} \right\|^{2} + \varepsilon^{2}}$, where $I_t^{GT}$ denotes the ground-truth HR frame and ε is set to 1×10⁻³.
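A sketch of this training setup is given below; averaging the Charbonnier penalty over pixels is our assumption about the reduction.

import torch
import torch.nn as nn

class CharbonnierLoss(nn.Module):
    # Charbonnier penalty with eps = 1e-3, as in the paper.
    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps
    def forward(self, pred, target):
        return torch.sqrt((pred - target) ** 2 + self.eps ** 2).mean()

def build_training(model):
    # Adam(0.9, 0.999) with weight decay 5e-4; lr 4e-4 decayed by 0.1 every 25 epochs.
    # `model` is assumed to be the VSR network defined elsewhere.
    criterion = CharbonnierLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4,
                                 betas=(0.9, 0.999), weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)
    return criterion, optimizer, scheduler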
5. CONCLUSION
In this study, we have introduced a new approach to video super-resolution. The multi-scale cascading deformable alignment module extracts and aligns features at different scales, capturing local details and global structure with greater precision, and adaptively handles inter-frame motion, which improves robustness to dynamic scenes. The temporal attention fusion module automatically learns to emphasize the key parts of neighboring frames, better capturing temporal correlations. By integrating these two modules into a bidirectional recurrent network architecture, we achieve high-quality video super-resolution on standard benchmarks.
REFERENCES
[1] Chan K C K, Wang X T, Yu K, Dong C, Loy C C. BasicVSR: the search for essential components in video
super-resolution and beyond. In:2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). Nashville, TN, USA, IEEE, 2021, 4945–4954 DOI: 10.1109/cvpr46437.2021.00491
[2] Caballero J, Ledig C, Aitken A, et al. Real-Time Video Super-Resolution with Spatio-Temporal Networks and
Motion Compensation[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
2017
[3] Fatima N. Accelerating the Super-Resolution Multi-Scale Deep Convolutional Neural Network in the treatment
of Degenerative Spondylolisthesis[C]. Global Spine Conference. 2021.
[4] Y. Zhang, H. Wang, H. Zhu and Z. Chen, "Optical Flow Reusing for High-Efficiency Space-Time Video Super
Resolution," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 5, pp. 2116-2128,
May 2023, doi: 10.1109/TCSVT.2022.3222875.
[5] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim. Deep video super-resolution network using dynamic upsampling filters
without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3224–3232, 2018
[6] Y. Chen, Y. Wang and Y. Liu, "MOFN: Multi-Offset-Flow-Based Network for Video Restoration and
Enhancement," 2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Taipei
City, Taiwan, 2022, pp. 1-6, doi: 10.1109/ICMEW56448.2022.9859519.
[7] Zhu, Y.; Li, G. A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution. Sensors
2023, 23, 8574. https://doi.org/10.3390/s23208574
[8] Mehdi S M Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In
CVPR, 2018
[9] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video
super-resolution. In CVPR, 2019
[10] K. C. K. Chan, S. Zhou, X. Xu and C. C. Loy, "BasicVSR++: Improving Video Super-Resolution with
Enhanced Propagation and Alignment," 2022 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5962-5971, doi: 10.1109/CVPR52688.2022.00588.
[11] Liu X, Kong L, Zhou Y, et al. End-to-end trainable video super-resolution based on a new mechanism for implicit motion estimation and compensation. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020. doi: 10.1109/WACV45572.2020.9093552.
[12] J. Zhou, R. Lan, X. Wang, C. Pang and X. Luo, "3D Deformable Kernels for Video super-resolution," 2022 9th
International Conference on Digital Home (ICDH), Guangzhou, China, 2022, pp. 291-298, doi:
10.1109/ICDH57206.2022.00052.
[13] Liang, J., Fan, Y., Xiang, X., Ranjan, R., Ilg, E., Green, S., Cao, J., Zhang, K., Timofte, R., & Gool, L.V.
(2022). Recurrent Video Restoration Transformer with Guided Deformable Attention. ArXiv, abs/2206.02146.
[14] Lee, J., Lee, M., Cho, S., & Lee, S. (2022). Reference-based Video Super-Resolution Using Multi-Camera
Video Triplets. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17803-
17812.
[15] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-
oriented flow. arXiv preprint arXiv:1711.09078, 2017.