
Bidirectional recurrent deformable alignment network for video super-resolution
Wenli Shui, Hongbin Cai*, Guanghui Lu
University of Electronic Science and Technology of China, Chengdu, Sichuan, China
*[email protected]

ABSTRACT

The challenge in video super-resolution (VSR) lies in exploiting the temporal correlation in video sequences and achieving precise inter-frame alignment. We proposed a bidirectional recurrent network integrating multi-scale deformable alignment with temporal attention fusion. We first designed a multi-scale cascading deformable alignment module, which captures spatiotemporal information at different scales through gradual refinement. It employs deformable convolutions for adaptive feature alignment to address complex motion and deformation. To effectively fuse temporal information, we added a temporal attention fusion module, which dynamically weights the information of each frame using an attention mechanism. We carried out a comprehensive series of experiments on multiple standard VSR datasets. The results indicate that our method achieves superior performance compared with several advanced VSR methods.
Keywords: Video Super-Resolution, Bidirectional Recurrent Network, Deformable Convolution, Multi-Scale Analysis,
Temporal Attention

1. INTRODUCTION
Single image super-resolution relies on exploiting internal self-similarity. Videos, being multi-frame, require us to explore spatiotemporal correlations for reconstruction. As deep learning continues to advance, its application in VSR has become mainstream. BasicVSR [1] categorized VSR models into distinct components: propagation, alignment, aggregation, and upsampling. Propagation concerns the use of temporal information and is divided into sliding-window and recurrent approaches. Studies have shown that bidirectional propagation outperforms unidirectional propagation, compensating for the limitations of forward-only feature refinement. There are two main alignment methods: flow-based (e.g., VESPCN [2], SOF-SRB [3], OFR-BRN [4]) and flow-free (e.g., DUF [5], MOFN [6], RGAN [7]). Flow-based alignment warps supporting frames at the image level but relies heavily on the accuracy of motion estimation, and inaccuracies can cause artifacts. Flow-free alignment performs motion compensation implicitly at the feature level. Regarding the fusion stage, VESPCN [2] explored early fusion, slow fusion, and 3D convolution. FRVSR [8] and RBPN [9] employed recurrent neural networks for feature fusion. SOF-SRB [3] utilized sub-pixel convolution to complete the VSR reconstruction.
In this work, we designed a VSR network that integrates deformable convolution, attention mechanisms, and multi-scale alignment to restore high-quality video frames. Our key contributions are: 1) We introduced a bidirectional recurrent VSR network with multi-scale cascading deformable alignment and temporal attention fusion. 2) We used deformable convolution for multi-scale alignment that adapts to complex motion. 3) When fusing features, an attention mechanism weights the information from the preceding and succeeding frames.

2. RELATED WORK
2.1 Video super-resolution
Prominent VSR algorithms, such as FRVSR [8] and VESPCN [2], utilize optical flow for video frame alignment. RBPN [9] proposed a recurrent encoder-decoder method that extracts missing information through multiple projections. BasicVSR++ [10] utilized flow-guided deformable alignment to align inter-frame features. Liu et al. [11] employed a dynamic local filtering network for implicit motion compensation. 3D convolution can analyze the spatial and temporal dimensions of data, benefiting sequential tasks; Zhou et al. [12] combined 3D convolution with deformable convolution to integrate multi-dimensional spatiotemporal information. Liang et al. [13] combined a recurrent network with a transformer to balance the strengths of the two architectures. Lee et al. [14] explored reference-based VSR using multi-camera video triplets.


Complex motion makes it difficult to capture motion trajectories, which degrades fusion and overall VSR performance and leads to artifacts or blurring. In the fusion stage, the feature information contributed by different frames varies, and incomplete alignment can blur details, so exploiting this information effectively is essential.
2.2 Deformable convolution
Deformable convolution improves spatial sampling with adaptively learned offsets. Integrated into the alignment module,
it adjusts kernel offsets based on feature variations for implicit frame alignment. Excessive deformation might include
irrelevant content; adding more deformable layers and a weight mechanism can refine deformation learning and assign
weights to each feature point.
2.3 Attention mechanism
The attention mechanism focuses the network on relevant information, boosting efficiency and accuracy. It's often paired
with gating (like sigmoid) to refine feature maps. Spatial attention mechanisms distribute diverse weights across feature
map positions. Channel attention mechanisms assign different weight to each channel. Temporal attention mechanisms
calculate the similarity between different frames, assigning more weight to frames with higher similarity.
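As a concrete illustration of the sigmoid-gated weighting described above, the following is a minimal squeeze-and-excitation-style channel attention sketch in PyTorch; the module and layer names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sigmoid-gated channel attention: each channel is re-weighted
    by a scalar learned from its global average response."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: B x C x 1 x 1
        self.gate = nn.Sequential(                   # excite: bottleneck MLP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                            # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(self.pool(x))           # refine the feature map

feat = torch.randn(1, 64, 32, 32)
print(ChannelAttention(64)(feat).shape)              # torch.Size([1, 64, 32, 32])
```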

Figure 1. The overall process of our method.

3. METHOD
3.1 Overview
For a series of low-resolution (LR) video frames $\{I_1^{LR}, I_2^{LR}, \ldots, I_N^{LR}\}$, the method aims to restore the corresponding high-resolution (HR) video frame sequence $\{I_1^{HR}, I_2^{HR}, \ldots, I_N^{HR}\}$. The architecture of our network, depicted in Figure 1, is engineered as a bidirectional recurrent structure that accepts three sequential LR frames $I_{t-1}^{LR}$, $I_t^{LR}$, $I_{t+1}^{LR}$ as inputs. $I_{t-1}^{LR}$ and $I_t^{LR}$ are fed into the forward propagation branch (FPB), while $I_t^{LR}$ and $I_{t+1}^{LR}$ are processed by the backward propagation branch (BPB). In the FPB and BPB, a multi-scale cascading deformable alignment module (MCDM) is implemented for feature alignment, which facilitates the alignment of the reference frame with supporting frames at multiple feature scales. To blend and restore the outputs obtained from the FPB and BPB, we proposed a temporal attention fusion module (TAFM). In the final step, $I_t^{HR}$ is generated by combining the bicubic upsampled result of $I_t^{LR}$ with the features produced by the TAFM.
3.2 Forward/Backward propagation branch
FPB and BPB share a similar architecture, so we delve into the detailed configuration of both modules by using FPB as
an example. In FPB, we employ a feature extraction module to derive features , from , respectively. The
module consists of a convolutional layer and three residual blocks, with ReLU serving as the activation function. Next,
MCDM takes , as inputs to predict the , which is forward feature of . Moreover, use the hidden state
information from the previous frame ℎ to further refine . Subsequently, is passed on to two branches to obtain
the forward hidden state ℎ and the forward output . ℎ is then applied to the FPB for the next time step. is
applied to TAFM to synthesize the high-resolution preference frame .
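To make the propagation step concrete, the PyTorch sketch below outlines one FPB time step under our reading of this section; the channel width, the refinement convolution, and the placeholder alignment are assumptions, and the actual MCDM (Section 3.3) would replace `align_fn`.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class ForwardBranchStep(nn.Module):
    """One FPB time step: extract features, align (abstracted as `align_fn`),
    refine with the previous hidden state, then split into a new hidden state
    and an output for the fusion stage."""
    def __init__(self, ch=64, align_fn=None):
        super().__init__()
        self.extract = nn.Sequential(              # conv + 3 residual blocks
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            ResBlock(ch), ResBlock(ch), ResBlock(ch))
        # placeholder alignment: concatenate and fuse if no MCDM is supplied
        self.align_fn = align_fn or nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.refine = nn.Conv2d(2 * ch, ch, 3, padding=1)   # mix in h_{t-1}
        self.to_hidden = nn.Conv2d(ch, ch, 3, padding=1)    # -> h_t
        self.to_output = nn.Conv2d(ch, ch, 3, padding=1)    # -> O_t

    def forward(self, lr_prev, lr_curr, h_prev):
        f_prev, f_curr = self.extract(lr_prev), self.extract(lr_curr)
        aligned = self.align_fn(torch.cat([f_curr, f_prev], dim=1))
        fused = self.refine(torch.cat([aligned, h_prev], dim=1))
        return self.to_hidden(fused), self.to_output(fused)

step = ForwardBranchStep()
h = torch.zeros(1, 64, 64, 64)
lr0, lr1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
h, out = step(lr0, lr1, h)
print(h.shape, out.shape)
```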

3.3 Multi-scale cascading deformable alignment module
The MCDM utilizes deformable convolution to align the reference frame's features with those of its adjacent frames. First, the offsets used by the deformable convolution of the t-th frame are learned from the adjacent frame:

$$\Delta P_{t+i} = f\big([F_t, F_{t+i}]\big), \quad i \in \{-1, 1\} \qquad (1)$$

where $f$ is a function composed of a succession of convolutional layers, $[\cdot,\cdot]$ denotes the concatenation operation, and $F_{t+i}$ denotes the feature of the front or back frame.

Figure 2. Structure of the multi-scale cascading deformable alignment module (MCDM) (a) and the temporal attention fusion module (TAFM) (b).
For instance, we use a 3×3 deformable convolution kernel, so the deformable convolution has K = 9 sampling positions, $p_k \in \{(-1,-1), (-1,0), \ldots, (1,1)\}$. For each position $p_0$, the aligned feature map is determined by:

$$F^{a}(p_0) = \sum_{k=1}^{K} w_k \cdot F(p_0 + p_k + \Delta p_k) \qquad (2)$$

where $w_k$ denotes the weight at the k-th position and $\Delta p_k$ denotes the learned offset for the k-th position. Since $p_0 + p_k + \Delta p_k$ may be fractional, bilinear interpolation is applied.
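Deformable sampling of this kind is available off the shelf; the sketch below pairs an offset-prediction branch in the spirit of Eq. (1) with torchvision's DeformConv2d for Eq. (2). It is a minimal illustration under assumed channel sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    """Align the supporting frame's feature F_{t+i} to the reference F_t:
    offsets are predicted from the concatenated features, then the supporting
    feature is sampled by a deformable convolution."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        # Eq. (1): offsets learned from [F_t, F_{t+i}] -> 2 * k * k offset maps
        self.offset_pred = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * k * k, 3, padding=1))
        # Eq. (2): weighted sampling at the k*k offset positions
        self.dconv = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, f_ref, f_sup):
        offset = self.offset_pred(torch.cat([f_ref, f_sup], dim=1))
        return self.dconv(f_sup, offset)   # bilinear sampling handles fractions

f_ref, f_sup = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(DeformAlign()(f_ref, f_sup).shape)   # torch.Size([1, 64, 32, 32])
```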
In alignment, it is necessary to handle large motions while preserving the accuracy of fine details. To this end, the MCDM applies three cascading deformable convolutions on multi-scale feature maps for coarse-to-fine motion compensation. Figure 2(a) shows the structure of the MCDM. The reference frame and its adjacent frame are passed through convolution to produce their respective first-level feature maps $F_t^{1}$ and $F_{t+i}^{1}$. Subsequently, we use two strided convolutions to downsample the features, yielding three tiers of feature maps whose resolutions decrease by a factor of 2 at each tier.
The process begins at the high-level/low-resolution stage and advances to the low-level/high-resolution stage, with a deformable convolution at each level. The resulting feature maps are upsampled and propagated to the next finer level:

$$F^{a,l} = g\big(\mathrm{DConv}(F_t^{l}, F_{t+i}^{l}),\, (F^{a,l+1})^{\uparrow 2}\big) \qquad (3)$$

where $g(\cdot)$ denotes several convolution layers, $\mathrm{DConv}$ denotes the deformable convolution described in Eq. (2), and $(\cdot)^{\uparrow 2}$ denotes bilinear interpolation for ×2 upsampling.
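The cascade of Eq. (3) can be sketched as a three-level pyramid in PyTorch, with the per-level deformable alignment abstracted as a plain convolution for brevity; the channel count and fusion convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidCascadeAlign(nn.Module):
    """Coarse-to-fine cascade in the spirit of Eq. (3): features are aligned at
    three scales; each level fuses its own aligned feature with the x2-upsampled
    result of the coarser level. The per-level alignment is a stand-in for the
    deformable convolution used in the paper."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(2)])
        # stand-in for DConv(F_t^l, F_{t+i}^l): takes the concatenated pair
        self.align = nn.ModuleList(
            [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(3)])
        # g(.): fuses the current level with the upsampled coarser result
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(2)])

    def forward(self, f_ref, f_sup):
        refs, sups = [f_ref], [f_sup]
        for d in self.down:                      # build the 3-level pyramid
            refs.append(d(refs[-1]))
            sups.append(d(sups[-1]))
        out = self.align[2](torch.cat([refs[2], sups[2]], dim=1))   # coarsest
        for lvl in (1, 0):                       # propagate to finer levels
            up = F.interpolate(out, scale_factor=2, mode="bilinear",
                               align_corners=False)
            aligned = self.align[lvl](torch.cat([refs[lvl], sups[lvl]], dim=1))
            out = self.fuse[lvl](torch.cat([aligned, up], dim=1))
        return out

f_ref, f_sup = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64)
print(PyramidCascadeAlign()(f_ref, f_sup).shape)   # torch.Size([1, 64, 64, 64])
```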
3.4 Temporal attention fusion module
In the fusion stage, to better utilize the feature maps of the preceding and succeeding frames, we introduced temporal attention. As shown in Figure 2(b), we first calculate attention maps between the reference frame's feature map and those of the preceding and succeeding frames separately. Taking the forward output feature map as an example, the weight is calculated as follows:

$$A^{f} = \mathrm{sigmoid}\big(\theta(O^{f}) \cdot \phi(F_t)\big) \qquad (4)$$
where $\theta(\cdot)$ denotes a convolution applied to the forward output features and $\phi(\cdot)$ denotes a convolution applied to the reference frame features. Then, the temporal attention map is multiplied with the aligned feature on a pixel-by-pixel basis:

$$\tilde{O}^{f} = A^{f} \odot O^{f} \qquad (5)$$

where $\odot$ denotes element-wise multiplication.
The temporal attention-weighted features are fed into the fusion module, which adopts the 3D fusion strategy of VESPCN [2]. Afterwards, residual blocks are utilized to further refine and enhance the fused features, and a sub-pixel convolution network generates the final HR features.
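A compact PyTorch sketch of this fusion step, under our reading of Eqs. (4)-(5), is given below; the similarity operator, the fusion depth, and the sub-pixel upsampling parameters are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Each branch output is gated by a sigmoid attention map computed against
    the reference feature (Eqs. (4)-(5)), then the weighted branches are fused
    and upsampled by a sub-pixel (PixelShuffle) stage."""
    def __init__(self, ch=64, scale=4):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch, 3, padding=1)   # conv on branch output
        self.phi = nn.Conv2d(ch, ch, 3, padding=1)     # conv on reference feature
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.upsample = nn.Sequential(                 # sub-pixel upsampling
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(ch, 3, 3, padding=1))

    def gate(self, o_branch, f_ref):
        # Eq. (4): sigmoid similarity map; Eq. (5): element-wise re-weighting
        attn = torch.sigmoid(self.theta(o_branch) * self.phi(f_ref))
        return attn * o_branch

    def forward(self, o_fwd, o_bwd, f_ref):
        fused = self.fuse(torch.cat([self.gate(o_fwd, f_ref),
                                     self.gate(o_bwd, f_ref)], dim=1))
        return self.upsample(fused)                    # HR residual features

tafm = TemporalAttentionFusion()
o_f, o_b, f_t = (torch.randn(1, 64, 64, 64) for _ in range(3))
print(tafm(o_f, o_b, f_t).shape)   # torch.Size([1, 3, 256, 256])
```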

4. EXPERIMENTS
4.1 Implementation details
Datasets: We used the Vimeo-90K dataset for training. Recognized as a staple in video processing tasks, Vimeo-90K contains a rich variety of scenes and motions. We evaluated our method and other advanced methods on two test datasets: Vid4 and Vimeo-90K-T. Vid4 comprises four video sequences: city, walk, calendar, and foliage. Vimeo-90K-T is a subset drawn from Vimeo-90K for testing purposes. To diversify the training data, we applied rotation and flipping as data augmentation.
Training details: During training, we set the initial learning rate to 4×10⁻⁴ and decreased it by a factor of 0.1 every 25 epochs until epoch 75. The batch size was set to 64. We used the Adam optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 5×10⁻⁴. A downsampling factor of s = 4 was used in all experiments. We chose the Charbonnier penalty loss, defined as $\mathcal{L} = \sqrt{\lVert I^{HR} - I^{SR} \rVert^{2} + \varepsilon^{2}}$, where ε is set to 1×10⁻³.
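The following is a minimal sketch of the loss and optimizer configuration described above, assuming a standard PyTorch training loop; `model` is a stand-in for the full network and the epoch-level decay is driven by StepLR.

```python
import torch
import torch.nn as nn

class CharbonnierLoss(nn.Module):
    """Charbonnier penalty: a smooth, differentiable approximation of L1."""
    def __init__(self, eps: float = 1e-3):
        super().__init__()
        self.eps = eps

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        return torch.sqrt((sr - hr) ** 2 + self.eps ** 2).mean()

# Optimizer and schedule as described above; `model` stands in for the VSR net.
model = nn.Conv2d(3, 3, 3, padding=1)
criterion = CharbonnierLoss(eps=1e-3)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4,
                             betas=(0.9, 0.999), weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)

sr, hr = model(torch.randn(2, 3, 64, 64)), torch.randn(2, 3, 64, 64)
loss = criterion(sr, hr)
loss.backward()
optimizer.step()
scheduler.step()                               # decay 0.1x every 25 epochs
print(float(loss))
```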

4.2 Comparison with state-of-the-art methods


We compared our approach against several advanced VSR methods on the Vimeo-90K-T and Vid4 datasets, including TOFlow [15], RCAN, FRVSR [8], RBPN [9] and DUF [5]. Some of the experimental data were cited directly from the original papers, while the rest were obtained by running publicly available code or reproducing the experiments from the descriptions in the corresponding papers. We measured the results using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) as quantitative assessment metrics. Table 1 summarizes the quantitative results on Vid4, showing that our method achieves superior performance, surpassing the other methods by at least 0.11 dB in average Y-channel PSNR. Table 2 displays the comparative results on Vimeo-90K-T. Our approach improves on DUF by nearly 0.54 dB in the RGB channels. Compared to RBPN, our method improves by 0.12 dB in the RGB channels and by 0.15 dB in the Y channel. Our method outperforms these advanced methods in both PSNR and SSIM, underscoring the efficacy of our approach.
We also conducted a qualitative evaluation of these models. Figure 3 shows the visualization results for the calendar and city scenes in Vid4, and Figure 4 shows the visualization results for Vimeo-90K-T. These results demonstrate that our model performs better in restoring details and edges.
Table 1. Quantitative results of PSNR(dB)/SSIM on the Vid4. ‘–’ denotes results unavailable.
Method Calendar (Y) City (Y) Foliage (Y) Walk (Y) Average (Y) Average (RGB)
Bicubic 20.38/0.5718 25.18/0.6025 23.44/0.5634 26.14/0.7934 23.79/0.6328 22.84/0.6034
TOFLOW 22.39/0.7278 26.79/0.7436 25.30/0.7108 29.02/0.8795 25.88/0.7654 24.39/0.7438
RCAN 22.35/0.7253 26.13/0.6860 24.71/0.6657 28.45/0.8409 25.28/0.7395 24.02/0.7295
FRVSR - - - - 26.69/0.8220 -
RBPN 24.02/0.8088 27.73/0.8025 26.21/0.7577 30.68/0.9111 27.16/0.8200 25.68/0.7996
DUF 24.05/0.8112 28.29/0.8310 26.41/0.7706 30.61/0.912 27.34/0.8312 25.80/0.8137
OURS 24.17/0.8130 28.37/0.8317 26.55/0.7699 30.75/0.9128 27.45/0.8318 26.48/0.8104
Table 2. Quantitative results of PSNR(dB)/SSIM on the Vimeo90K-T.
Bicubic TOFLOW RCAN FRVSR RBPN DUF OURS
Y channel 31.29/0.8688 34.63/0.9212 33.60/0.9103 36.48/0.9402 37.23/0.9454 36.86/0.9443 37.38/0.9457
RGB Channels 29.77/0.8491 32.79/0.9041 35.36/0.9251 34.57/0.9274 35.37/0.9341 34.95/0.9311 35.49/0.9347


Figure 3. Qualitative comparison on the Vid4 dataset for 4× setting.

Figure 4. Qualitative comparison on the Vimeo dataset for 4× setting.


4.3 Ablation study
To validate the effectiveness of our approach, we performed ablation studies on the multi-scale cascading deformable alignment module (MCDM) and the temporal attention fusion module (TAFM).
MCDM: We devised three models for alignment; Table 3 shows the comparison. The baseline (Model 1) uses a single deformable convolution. Model 2 employs three deformable convolutions, yielding a gain of 0.14 dB. Model 3 is our proposed MCDM, which improves on Model 2 by nearly 0.4 dB, demonstrating the efficacy of the MCDM in aligning features.
TAFM: We removed the temporal attention fusion module from our method; the comparison with the complete model is shown in Table 4. The complete model improves performance over the model without the TAFM. Furthermore, we tested the complete model on video sequences of 5 frames, which yields results slightly inferior to those with 7 frames. These findings indicate that the TAFM helps the reference frame exploit information from neighboring frames.
Table 3. Ablation on the efficacy of the MCDM (PSNR in dB).
Model Model 1 Model 2 Model 3
Alignment 1 DConv 3 DConvs MCDM
PSNR 29.73 29.87 30.28
Table 4. Ablation on the efficacy of the TAFM and the impact of varying the number of input frames (PSNR(dB)/SSIM).
Model Model 1 Model 2 Model 3
Frames 5 7 7
TAFM √ × √
Vid4 26.32/0.8088 26.44/0.8101 26.48/0.8104
Vimeo-90K-T 35.29/0.9328 35.36/0.9337 35.42/0.9342

5. CONCLUSION
In this study, we have introduced a novel approach to video super-resolution. The multi-scale cascading deformable alignment module effectively extracts and aligns features at different scales, captures local details and global structural information in videos with greater precision, and adaptively handles inter-frame motion changes, enhancing the algorithm's adaptability to dynamic scenes. The temporal attention fusion module automatically learns and emphasizes the key parts of neighboring frames, improving the capture of temporal correlations. By integrating these two modules into a bidirectional recurrent network architecture, we achieve end-to-end training and optimization. Comprehensive experimental evaluations corroborate the efficacy and superiority of the proposed approach.

REFERENCES

[1] Chan, K. C. K., Wang, X. T., Yu, K., Dong, C. and Loy, C. C., "BasicVSR: the search for essential components in video super-resolution and beyond," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4945-4954 (2021). doi:10.1109/CVPR46437.2021.00491
[2] Caballero, J., Ledig, C., Aitken, A., et al., "Real-time video super-resolution with spatio-temporal networks and motion compensation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
[3] Fatima, N., "Accelerating the super-resolution multi-scale deep convolutional neural network in the treatment of degenerative spondylolisthesis," Global Spine Conference (2021).
[4] Zhang, Y., Wang, H., Zhu, H. and Chen, Z., "Optical flow reusing for high-efficiency space-time video super resolution," IEEE Transactions on Circuits and Systems for Video Technology 33(5), 2116-2128 (2023). doi:10.1109/TCSVT.2022.3222875
[5] Jo, Y., Oh, S. W., Kang, J. and Kim, S. J., "Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3224-3232 (2018).
[6] Chen, Y., Wang, Y. and Liu, Y., "MOFN: multi-offset-flow-based network for video restoration and enhancement," IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6 (2022). doi:10.1109/ICMEW56448.2022.9859519
[7] Zhu, Y. and Li, G., "A lightweight recurrent grouping attention network for video super-resolution," Sensors 23, 8574 (2023). doi:10.3390/s23208574
[8] Sajjadi, M. S. M., Vemulapalli, R. and Brown, M., "Frame-recurrent video super-resolution," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
[9] Haris, M., Shakhnarovich, G. and Ukita, N., "Recurrent back-projection network for video super-resolution," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[10] Chan, K. C. K., Zhou, S., Xu, X. and Loy, C. C., "BasicVSR++: improving video super-resolution with enhanced propagation and alignment," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5962-5971 (2022). doi:10.1109/CVPR52688.2022.00588
[11] Liu, X., Kong, L., Zhou, Y., et al., "End-to-end trainable video super-resolution based on a new mechanism for implicit motion estimation and compensation," IEEE Winter Conference on Applications of Computer Vision (WACV) (2020). doi:10.1109/WACV45572.2020.9093552
[12] Zhou, J., Lan, R., Wang, X., Pang, C. and Luo, X., "3D deformable kernels for video super-resolution," 9th International Conference on Digital Home (ICDH), 291-298 (2022). doi:10.1109/ICDH57206.2022.00052
[13] Liang, J., Fan, Y., Xiang, X., Ranjan, R., Ilg, E., Green, S., Cao, J., Zhang, K., Timofte, R. and Van Gool, L., "Recurrent video restoration transformer with guided deformable attention," arXiv preprint arXiv:2206.02146 (2022).
[14] Lee, J., Lee, M., Cho, S. and Lee, S., "Reference-based video super-resolution using multi-camera video triplets," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17803-17812 (2022).
[15] Xue, T., Chen, B., Wu, J., Wei, D. and Freeman, W. T., "Video enhancement with task-oriented flow," arXiv preprint arXiv:1711.09078 (2017).
