
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment

Kelvin C.K. Chan Shangchen Zhou Xiangyu Xu Chen Change Loy*


S-Lab, Nanyang Technological University
{chan0899, s200094, xiangyu.xu, ccloy}@ntu.edu.sg
arXiv:2104.13371v1 [cs.CV] 27 Apr 2021

[Figure 1 panels: (a) propagation schemes of BasicVSR vs. BasicVSR++; (b) alignment: BasicVSR spatially warps features with optical flow, while BasicVSR++ (ours) applies a DCN whose offsets are the optical flow plus learned residuals; (c) PSNR vs. #Params (5M to 20M), comparing BasicVSR++ against EDVR (CVPRW19), IconVSR (CVPR21), BasicVSR (CVPR21), RBPN (CVPR19), RSDN (ECCV20), DUF (CVPR18), PFNL (ICCV19), RLSP (ICCVW19), and FRVSR (CVPR18).]

(a) Propagation (b) Alignment (c) Performance Gain

Figure 1: Improvements over BasicVSR [2]. (a) Second-order grid propagation in BasicVSR++ allows a more effective propagation of
features. (b) Flow-guided deformable alignment in BasicVSR++ provides a means for more robust feature alignment across misaligned
frames. (c) BasicVSR++ outperforms existing state-of-the-art methods while maintaining efficiency.

Abstract

A recurrent structure is a popular framework choice for the task of video super-resolution. The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video. In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with the enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively. The new components lead to an improved performance under a similar computational constraint. In particular, our model BasicVSR++ surpasses BasicVSR by 0.82 dB in PSNR with a similar number of parameters. In addition to video super-resolution, BasicVSR++ generalizes well to other video restoration tasks such as compressed video enhancement. In NTIRE 2021, BasicVSR++ obtains three champions and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges. Codes and models will be released to MMEditing¹.

1. Introduction

Video super-resolution (VSR) is challenging in that one needs to gather complementary information across misaligned video frames for restoration. One prevalent approach is the sliding-window framework [9, 32, 35, 38], where each frame in the video is restored using the frames within a short temporal window. In contrast to the sliding-window framework, a recurrent framework attempts to exploit the long-term dependencies by propagating the latent features. In general, these methods [8, 10, 11, 12, 14, 27] allow a more compact model compared to those in the sliding-window framework. Nevertheless, the problems of transmitting long-term information and aligning features across frames in a recurrent model remain formidable.

A recent work by Chan et al. [2] studies the problems carefully. It summarizes the common VSR pipelines into four components, namely Propagation, Alignment, Aggregation, and Upsampling, and proposes BasicVSR. In BasicVSR, bidirectional propagation is adopted to exploit information from the entire input video for reconstruction. For alignment, optical flow is adopted for feature warping. BasicVSR serves as a succinct yet strong backbone where components can be easily added for performance gain. However, its rudimentary designs in propagation and alignment limit the efficacy of information aggregation.

* Corresponding author
¹ https://github.com/open-mmlab/mmediting

As a result, the network often struggles to restore fine details, especially when dealing with occluded and complex regions. The shortcomings call for refined designs in propagation and alignment.

In this work, we redesign BasicVSR by devising second-order grid propagation and flow-guided deformable alignment that allow information to be propagated and aggregated more effectively:

1) The proposed second-order grid propagation, as shown in Fig. 1(a), addresses two limitations in BasicVSR: i) we allow more aggressive bidirectional propagation arranged in a grid-like manner, and ii) we relax the assumption of first-order Markov property in BasicVSR, and incorporate a second-order connection [28] into the network so that information can be aggregated from different spatiotemporal locations. Both modifications ameliorate information flow in the network and improve the robustness of the network against occluded and fine regions.

2) BasicVSR shows the advantages of using optical flow for temporal alignment. However, optical flow is not robust to occlusion, and inaccurate flow estimation could jeopardize the restoration performance. Deformable alignment [32, 33, 35] has demonstrated its superiority in VSR, but it is difficult to train in practice [3]. To take advantage of deformable alignment while overcoming the training instability, we propose flow-guided deformable alignment, as shown in Fig. 1(b). In the proposed module, instead of learning the DCN offsets directly [6, 42], we reduce the burden of offset learning by using the optical flow field as base offsets, refined by a flow field residue. The latter can be learned more stably than the original DCN offsets.

The two aforementioned components are novel, and more discussion can be found in the related work section. Benefiting from the more effective designs, BasicVSR++ can adopt a more lightweight backbone than its counterparts. Consequently, BasicVSR++ surpasses existing state-of-the-art methods, including BasicVSR and IconVSR (the more elaborate BasicVSR variant), by a large margin while maintaining efficiency (Fig. 1(c)). In particular, when compared to its precedent BasicVSR, a gain of 0.82 dB in PSNR on REDS4 [35] is obtained with a similar number of parameters. In addition, BasicVSR++ obtains three champions and one runner-up in the NTIRE 2021 Video Super-Resolution [29] and Compressed Video Enhancement [39] Challenges.

2. Related Work

Recurrent Networks. The recurrent framework is a popular structure adopted in various video processing tasks such as super-resolution [8, 10, 11, 12, 14, 27], deblurring [24, 41], and frame interpolation [36]. For instance, RSDN [12] adopts unidirectional propagation with a recurrent detail-structural block and a hidden state adaptation module to enhance the robustness to appearance change and error accumulation. Chan et al. [2] propose BasicVSR. The work demonstrates the importance of bidirectional propagation over unidirectional propagation to better exploit features temporally. In addition, the study also shows the advantage of feature alignment in aligning highly relevant but misaligned features. We refer readers to [2] for the detailed comparisons of these components against the more conventional ways of performing propagation and alignment. In our experiments, we focus on comparing with BasicVSR since it is the state-of-the-art method for VSR.

Grid Connections. Grid-like designs are seen in various vision tasks such as object detection [5, 30, 34], semantic segmentation [7, 30, 34, 43], and frame interpolation [25]. In general, these designs decompose a given image/feature into multiple resolutions, and grids are adopted across resolutions to capture both fine and coarse information. Unlike the aforementioned methods, BasicVSR++ does not adopt a multi-scale design. Instead, the grid structure is designed for propagation across time in a bidirectional fashion. We link different frames with a grid connection to repeatedly refine the features, improving expressiveness.

Higher-Order Propagation. Higher-order propagation has been studied to improve gradient flow [16, 20, 28]. These methods demonstrate improvements in different tasks including classification [16] and language modeling [28]. However, these methods do not consider temporal alignment, which is shown to be critical in the task of VSR [2]. To allow temporal alignment in second-order propagation, we incorporate alignment into our propagation scheme by extending our flow-guided deformable alignment to the second-order settings.

Deformable Alignment. Several works [32, 33, 35, 37] employ deformable alignment. TDAN [32] performs alignment at the feature level using deformable convolution. EDVR [35] further proposes a Pyramid Cascading Deformable (PCD) alignment with a multi-scale design. Recently, Chan et al. [3] analyze deformable alignment and show that the performance gain over flow-based alignment comes from the offset diversity. Motivated by [3], we adopt deformable alignment but with a reformulation to overcome the training instability [3]. Our flow-guided deformable alignment is different from the offset-fidelity loss [3]. The latter uses optical flow as a loss function during training. In contrast, we directly incorporate optical flow into our module as base offsets, allowing a more explicit guidance both during training and inference.

3. Methodology

BasicVSR++ consists of two effective modifications for improving propagation and alignment. As shown in Fig. 2, given an input video, residual blocks are first applied to extract features from each frame. The features are then propagated under our second-order grid propagation scheme, where alignment is performed by our flow-guided deformable alignment. After propagation, the aggregated features are used to generate the output image through convolution and pixel-shuffling.
[Figure 2 diagram: residual blocks extract features g_i from each input frame x_i; the features f_i^j are refined through grid propagation (with second-order connections and flow-guided deformable alignment inside each branch), then upsampled via pixel-shuffle, with a bilinearly upsampled copy of the input added elementwise. Legend: residual blocks, pixel-shuffle, bilinear upsampling, elementwise addition, channel-wise concatenation.]
Figure 2: An Overview of BasicVSR++. BasicVSR++ consists of two modifications to improve propagation and alignment. For propagation, we introduce grid propagation (blue solid lines) to refine features bidirectionally. In addition, a second-order connection (red dotted lines) is adopted to improve the robustness of propagation. Within each propagation branch, flow-guided deformable alignment is proposed to increase the offset diversity while overcoming the offset overflow problem.
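To make the reconstruction stage of Fig. 2 concrete, below is a minimal sketch of a pixel-shuffle upsampling head that combines the legend items above (pixel-shuffle, bilinear upsampling, elementwise addition). The module name and channel widths are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class UpsampleHead(nn.Module):
    """Sketch of the output stage in Fig. 2: convolutions + pixel-shuffle (4x total),
    with a bilinearly upsampled copy of the LR input added elementwise."""
    def __init__(self, c=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.1),
            nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.1),
            nn.Conv2d(c, 3, 3, padding=1))
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, feat, lr):
        # feat: aggregated features (N, c, H, W); lr: input LR frame (N, 3, H, W)
        return self.body(feat) + self.up(lr)
```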

3.1. Second-Order Grid Propagation

Most existing methods adopt unidirectional propagation [12, 14, 27]. Several works [2, 10, 11] adopt bidirectional propagation for exploiting the information available in the video sequence. In particular, IconVSR [2] consists of a coupled propagation scheme with sequentially-connected branches to facilitate information exchange.

Motivated by the effectiveness of the bidirectional propagation, we devise a grid propagation scheme to enable repeated refinement through propagation. More specifically, the intermediate features are propagated backward and forward in time in an alternating manner. Through propagation, the information from different frames can be "revisited" and adopted for feature refinement. Compared to existing works that propagate features only once, grid propagation repeatedly extracts information from the entire sequence, improving feature expressiveness.

To further enhance the robustness of propagation, we relax the assumption of first-order Markov property in BasicVSR and adopt a second-order connection, realizing a second-order Markov chain. With this relaxation, information can be aggregated from different spatiotemporal locations, improving robustness and effectiveness in occluded and fine regions.

Integrating the above two components, we devise our second-order grid propagation as follows. Let x_i be the input image, g_i be the feature extracted from x_i by multiple residual blocks, and f_i^j be the feature computed at the i-th timestep in the j-th propagation branch. In this section, we describe the procedure for forward propagation; the procedure for backward propagation is defined similarly. To compute the feature f_i^j, we first align f_{i-1}^j and f_{i-2}^j (following the second-order Markov chain) using our proposed flow-guided deformable alignment, which will be discussed in the next section:

    \hat{f}_i^j = \mathcal{A}(g_i, f_{i-1}^j, f_{i-2}^j, s_{i \to i-1}, s_{i \to i-2}),    (1)

where s_{i→i-1}, s_{i→i-2} denote the optical flows from the i-th frame to the (i-1)-th and (i-2)-th frames, respectively, and \mathcal{A} represents flow-guided deformable alignment². The features are then concatenated and passed into a stack of residual blocks:

    f_i^j = \hat{f}_i^j + R(c(f_i^{j-1}, \hat{f}_i^j)),    (2)

where f_i^0 = g_i, R denotes the residual blocks, and c denotes concatenation along the channel dimension.

² s_{0→-1} = s_{0→-2} = s_{1→-1} = f_{-1} = f_{-2} = 0.
3.2. Flow-Guided Deformable Alignment

Deformable alignment [33, 35] has demonstrated significant improvements over flow-based alignment [9, 38] thanks to the offset diversity [3] intrinsically introduced in deformable convolution (DCN) [6, 42]. However, the deformable alignment module can be difficult to train [3]. The training instability often results in offset overflow, deteriorating the final performance.

To take advantage of the offset diversity while overcoming the instability, we propose to employ optical flow to guide deformable alignment, motivated by the strong relation between deformable alignment and flow-based alignment [3]. The graphical illustration is shown in Fig. 3. In the rest of this section, we detail the alignment procedure for forward propagation. The procedure for backward propagation is defined similarly. The superscript j is omitted for notational simplicity.

[Figure 3 diagram: the previous feature f_{i-1} is warped by the optical flow s_{i→i-1}, concatenated with g_i, and passed through convolution stacks C^o and C^m to produce the DCN offsets o_{i→i-1} (as a residue to the flow) and masks m_{i→i-1}; a DCN is then applied to the unwarped feature f_{i-1} to produce the aligned feature.]
Figure 3: Flow-guided deformable alignment. Optical flow is used to pre-align the features. The aligned features are then concatenated to produce the DCN offsets (residue to optical flow). A DCN is then applied to the unwarped features. Only first-order connections are drawn; the second-order connections are omitted for simplicity.

At the i-th timestep, given the feature g_i computed from the i-th LR image, the feature f_{i-1} computed for the previous timestep, and the optical flow s_{i→i-1} to the previous frame, we first warp f_{i-1} with s_{i→i-1}:

    \bar{f}_{i-1} = \mathcal{W}(f_{i-1}, s_{i \to i-1}),    (3)

where \mathcal{W} denotes the spatial warping operation. The pre-aligned features are then used to compute the DCN offsets o_{i→i-1} and modulation masks m_{i→i-1}. Instead of directly computing the DCN offsets, we compute the residue to the optical flow:

    o_{i \to i-1} = s_{i \to i-1} + C^o(c(g_i, \bar{f}_{i-1})),
    m_{i \to i-1} = \sigma(C^m(c(g_i, \bar{f}_{i-1}))).    (4)

Here C^o and C^m denote stacks of convolutions, and \sigma denotes the sigmoid function. A DCN is then applied to the unwarped feature f_{i-1}:

    \hat{f}_i = \mathcal{D}(f_{i-1}; o_{i \to i-1}, m_{i \to i-1}),    (5)

where \mathcal{D} denotes a deformable convolution.
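A minimal single-feature sketch of Eqs. (3)-(5) is given below, using torchvision's modulated deformable convolution. The `flow_warp` helper, the inclusion of the flow in the concatenation (the equations omit it for brevity, while Table 5 in the supplementary material suggests the flows are part of the input), the channel widths, and the flip of the flow to the DCN's (y, x) offset order are assumptions drawn from common open-source practice, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

def flow_warp(x, flow):
    """Spatial warping W(x, s): bilinearly sample x at p + flow(p).
    flow is (N, 2, H, W) in pixels, channel order (x, y)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(x.device)          # (2, H, W)
    coords = base.unsqueeze(0) + flow
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(x, torch.stack((gx, gy), dim=3), align_corners=True)

class FlowGuidedAlign(nn.Module):
    """First-order flow-guided deformable alignment, Eqs. (3)-(5)."""
    def __init__(self, c=64, groups=16, k=3):
        super().__init__()
        self.k = k
        self.trunk = nn.Sequential(                      # shared trunk of C^o / C^m
            nn.Conv2d(2 * c + 2, c, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.1))
        self.conv_offset = nn.Conv2d(c, 2 * groups * k * k, 3, padding=1)  # C^o
        self.conv_mask = nn.Conv2d(c, groups * k * k, 3, padding=1)        # C^m
        self.weight = nn.Parameter(torch.randn(c, c, k, k) * 1e-2)         # DCN weight

    def forward(self, g_i, f_prev, flow):
        f_bar = flow_warp(f_prev, flow)                                    # Eq. (3)
        feat = self.trunk(torch.cat([g_i, f_bar, flow], dim=1))
        residue = self.conv_offset(feat)
        # base offsets = optical flow (flipped to (y, x) order) + residue, Eq. (4)
        offset = flow.flip(1).repeat(1, residue.size(1) // 2, 1, 1) + residue
        mask = torch.sigmoid(self.conv_mask(feat))
        # DCN applied to the *unwarped* feature, Eq. (5)
        return deform_conv2d(f_prev, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```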
The above formulation is designed only for aligning one single feature, and hence is not directly applicable to our second-order propagation. The most intuitive way to adapt it to the second-order settings is to apply the above procedure to the two features, f_{i-1}^j and f_{i-2}^j, independently. However, this requires doubled computations, resulting in reduced efficiency. Furthermore, separate alignment potentially ignores the complementary information from the features. Therefore, we allow alignment of the two features simultaneously. More specifically, we concatenate the warped features and flows to compute the offsets o_{i→i-p} (p = 1, 2):

    o_{i \to i-p} = s_{i \to i-p} + C^o(c(g_i, \bar{f}_{i-1}, \bar{f}_{i-2})),
    m_{i \to i-p} = \sigma(C^m(c(g_i, \bar{f}_{i-1}, \bar{f}_{i-2}))).    (6)

A DCN is then applied to the unwarped features:

    o_i = c(o_{i \to i-1}, o_{i \to i-2}),
    m_i = c(m_{i \to i-1}, m_{i \to i-2}),    (7)
    \hat{f}_i = \mathcal{D}(c(f_{i-1}, f_{i-2}); o_i, m_i).

More details of the second-order flow-guided deformable alignment are provided in the supplementary material.
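Under the same assumptions as the sketch above, the second-order forward pass of Eqs. (6)-(7) differs only in that the trunk sees g_i, both warped features, and both flows (3x64 + 2x2 = 196 input channels, which matches Table 5 in the supplementary material), and the offsets and masks are split per reference frame. `self.weight2` is a hypothetical DCN weight over the 128-channel concatenated input.

```python
def second_order_forward(self, g_i, f_prev1, f_prev2, flow_1, flow_2):
    """Sketch of Eqs. (6)-(7); assumes a trunk with 196 input channels and a
    DCN weight self.weight2 of shape (64, 128, 3, 3)."""
    f_bar1, f_bar2 = flow_warp(f_prev1, flow_1), flow_warp(f_prev2, flow_2)
    feat = self.trunk(torch.cat([g_i, f_bar1, f_bar2, flow_1, flow_2], dim=1))
    res1, res2 = torch.chunk(self.conv_offset(feat), 2, dim=1)       # Eq. (6)
    o1 = flow_1.flip(1).repeat(1, res1.size(1) // 2, 1, 1) + res1
    o2 = flow_2.flip(1).repeat(1, res2.size(1) // 2, 1, 1) + res2
    offset = torch.cat([o1, o2], dim=1)                              # o_i
    mask = torch.sigmoid(self.conv_mask(feat))                       # m_i
    return deform_conv2d(torch.cat([f_prev1, f_prev2], dim=1),       # Eq. (7)
                         offset, self.weight2, padding=1, mask=mask)
```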
Discussion. Unlike existing methods [32, 33, 35, 37] that directly compute the DCN offsets, our proposed flow-guided deformable alignment adopts optical flow as guidance. The benefits are two-fold. First, since CNNs are known to have local receptive fields, the learning of offsets can be assisted by pre-aligning the features using optical flow. Second, by learning only the residue, the network needs to learn only small deviations from the optical flow, reducing the burden in typical deformable alignment modules. In addition, instead of directly concatenating the warped features, the modulation masks in DCN act as attention maps to weigh the contributions of different pixels, providing additional flexibility.
Table 1: Quantitative comparison (PSNR/SSIM). All results are calculated on the Y-channel except REDS4 [23] (RGB-channel). Red and blue colors indicate the best and the second-best performance, respectively. The runtime is computed on an LR size of 180×320. A 4× upsampling is performed following previous studies. Blanked entries correspond to results not reported in previous works.

Method | Params (M) | Runtime (ms) | BI: REDS4 [23] | BI: Vimeo-90K-T [38] | BI: Vid4 [21] | BD: UDM10 [40] | BD: Vimeo-90K-T [38] | BD: Vid4 [21]
Bicubic | - | - | 26.14/0.7292 | 31.32/0.8684 | 23.78/0.6347 | 28.47/0.8253 | 31.30/0.8687 | 21.80/0.5246
VESPCN [1] | - | - | - | - | 25.35/0.7557 | - | - | -
SPMC [31] | - | - | - | - | 25.88/0.7752 | - | - | -
TOFlow [38] | - | - | 27.98/0.7990 | 33.08/0.9054 | 25.89/0.7651 | 36.26/0.9438 | 34.62/0.9212 | -
FRVSR [27] | 5.1 | 137 | - | - | - | 37.09/0.9522 | 35.64/0.9319 | 26.69/0.8103
DUF [15] | 5.8 | 974 | 28.63/0.8251 | - | - | 38.48/0.9605 | 36.87/0.9447 | 27.38/0.8329
RBPN [9] | 12.2 | 1507 | 30.09/0.8590 | 37.07/0.9435 | 27.12/0.8180 | 38.66/0.9596 | 37.20/0.9458 | -
EDVR-M [35] | 3.3 | 118 | 30.53/0.8699 | 37.09/0.9446 | 27.10/0.8186 | 39.40/0.9663 | 37.33/0.9484 | 27.45/0.8406
EDVR [35] | 20.6 | 378 | 31.09/0.8800 | 37.61/0.9489 | 27.35/0.8264 | 39.89/0.9686 | 37.81/0.9523 | 27.85/0.8503
PFNL [40] | 3.0 | 295 | 29.63/0.8502 | 36.14/0.9363 | 26.73/0.8029 | 38.74/0.9627 | - | 27.16/0.8355
MuCAN [19] | - | - | 30.88/0.8750 | 37.32/0.9465 | - | - | - | -
TGA [13] | 5.8 | - | - | - | - | - | 37.59/0.9516 | 27.63/0.8423
RLSP [8] | 4.2 | 49 | - | - | - | 38.48/0.9606 | 36.49/0.9403 | 27.48/0.8388
RSDN [12] | 6.2 | 94 | - | - | - | 39.35/0.9653 | 37.23/0.9471 | 27.92/0.8505
RRN [14] | 3.4 | 45 | - | - | - | 38.96/0.9644 | - | 27.69/0.8488
BasicVSR [2] | 6.3 | 63 | 31.42/0.8909 | 37.18/0.9450 | 27.24/0.8251 | 39.96/0.9694 | 37.53/0.9498 | 27.96/0.8553
IconVSR [2] | 8.7 | 70 | 31.67/0.8948 | 37.47/0.9476 | 27.39/0.8279 | 40.03/0.9694 | 37.84/0.9524 | 28.04/0.8570
BasicVSR++ | 7.3 | 77 | 32.39/0.9069 | 37.79/0.9500 | 27.79/0.8400 | 40.72/0.9722 | 38.21/0.9550 | 29.04/0.8753

Table 2: Performance of a lighter BasicVSR++. Our lighter model, BasicVSR++ (S), has a similar complexity to BasicVSR and IconVSR, but still shows considerable improvements. The PSNR and runtime are computed on REDS4.

              BasicVSR [2]   IconVSR [2]   BasicVSR++ (S)
Params (M)    6.3            8.7           6.4
Runtime (ms)  63             70            69
PSNR (dB)     31.42          31.67         32.24

4. Experiments

Two widely-used datasets are adopted for training: REDS [23] and Vimeo-90K [38]. For REDS, following BasicVSR [2], we use REDS4³ as our test set and REDSval4⁴ as our validation set. The remaining clips are used for training. We use Vid4 [21], UDM10 [40], and Vimeo-90K-T [38] as test sets along with Vimeo-90K. All models are tested with 4× downsampling using two degradations: Bicubic (BI) and Blur Downsampling (BD).

We adopt the Adam optimizer [17] and Cosine Annealing scheme [22]. The initial learning rates of the main network and the flow network are set to 1×10⁻⁴ and 2.5×10⁻⁵, respectively. The total number of iterations is 600K, and the weights of the flow network are fixed during the first 5,000 iterations. The batch size is 8 and the patch size of input LR frames is 64×64. We use the Charbonnier loss [4] since it better handles outliers and improves the performance over the conventional ℓ2-loss [18]. We use pre-trained SPyNet [26] as our flow network; its parameters and runtime are considered inclusively in our method. The number of residual blocks for each branch is set to 7, and the number of feature channels is 64. Detailed experimental settings and model architectures are provided in the supplementary material.

³ Clips 000, 011, 015, 020 of REDS training set.
⁴ Clips 000, 001, 006, 017 of REDS validation set.
4.1. Comparisons with State-of-the-Art Methods

We conduct comprehensive experiments by comparing with 16 models, as listed in Table 1. The quantitative results are summarized in Table 1, and the speed and performance comparison is provided in Fig. 1(c). Note that the parameters reported above include those of the optical flow network (if any), so the comparison is fair.

As shown in Table 1, BasicVSR++ achieves state-of-the-art performance on all datasets for both degradations. In particular, BasicVSR++ outperforms EDVR [35], a large-capacity sliding-window method, by up to 1.3 dB in PSNR, while having 65% fewer parameters. When compared to the previous state of the art, IconVSR [2], BasicVSR++ possesses fewer parameters but has improvements of up to 1 dB. As shown in Table 2, even if we train a lighter version of BasicVSR++ (denoted as BasicVSR++ (S)) with comparable network parameters and runtime to BasicVSR and IconVSR, our model still shows an improvement of 0.82 dB over BasicVSR and 0.57 dB over IconVSR. Such gains are considered significant in VSR.

Some qualitative comparisons are shown in Fig. 4 to Fig. 6. BasicVSR++ successfully restores the fine details. In particular, BasicVSR++ is the only method that restores the wheel's spokes in Fig. 4, the stairs in Fig. 5, and the building structure in Fig. 6. More examples are provided in the supplementary material.
[Figure 4 crops (Frame 018, Clip 000), PSNR: Bicubic 24.87 dB, RBPN 29.82 dB, EDVR-M 28.32 dB, EDVR 28.64 dB, BasicVSR 29.03 dB, IconVSR 29.21 dB, BasicVSR++ (ours) 29.82 dB, GT.]
Figure 4: Challenging scenario on REDS4 [35]. Only BasicVSR++ is able to recover the patterns of the wheel's spokes.

[Figure 5 crops (Sequence 0216, Clip 024), PSNR: Bicubic 23.79 dB, RBPN 28.65 dB, EDVR-M 28.06 dB, EDVR 29.64 dB, BasicVSR 28.25 dB, IconVSR 28.79 dB, BasicVSR++ (ours) 30.80 dB, GT.]
Figure 5: Challenging scenario on Vimeo-90K-T [38]. Only BasicVSR++ is able to reconstruct the stairs.

[Figure 6 crops (Frame 017, Clip City), PSNR: Bicubic 22.99 dB, RBPN 25.09 dB, EDVR-M 25.13 dB, EDVR 25.37 dB, BasicVSR 25.16 dB, IconVSR 25.32 dB, BasicVSR++ (ours) 25.52 dB, GT.]
Figure 6: Challenging scenario on Vid4 [21]. Only BasicVSR++ is able to recover the correct structure of the building.

5. Ablation Studies

To understand the contributions of the proposed components, we start with a baseline and gradually insert the components. From Table 3, it is apparent that each component brings considerable improvement, ranging from 0.14 dB to 0.46 dB in PSNR.

In theory, our proposed propagation schemes can be extended to higher orders and more propagation iterations. However, while the performance gain is considerable when increasing from first-order to second-order (i.e. (B)→(C)), and from one to two iterations (i.e. (C)→BasicVSR++), we observe in our preliminary experiments that further increasing the orders and number of iterations does not lead to a significant improvement (0.05 dB in PSNR). Therefore, we keep both the orders and the number of iterations at two.

[Figure 7 crops: LR, w/o 2nd order, w/ 2nd order, GT (left); LR, w/o grid, w/ grid, GT (right). (a) Second-Order Propagation (b) Grid Propagation]
Figure 7: Analysis of second-order grid propagation. By propagating the features more effectively, our second-order grid propagation leads to more details, improving the output quality.

[Figure 8 panels: (a) Optical flow, (b-d) DCN offsets #1-#3, (e) Reference image, (f) Neighboring image, (g) Aligned by optical flow, (h) Aligned by flow-guided deformable alignment.]
Figure 8: Analysis of flow-guided deformable alignment. (a-d) The DCN offsets are highly similar to optical flow, but still with noticeable differences. (e-f) The reference and neighboring images. (g) The feature aligned by optical flow experiences blurry edges. (h) The feature aligned by our proposed module is sharper and preserves more details, as indicated by the red arrows.

Table 3: Ablation studies of the components. Each component brings significant improvements in PSNR, verifying their effectiveness.

                              (A)     (B)     (C)     BasicVSR++
Flow-Guided Deform. Align.            ✓       ✓       ✓
Second-Order Propagation                      ✓       ✓
Grid Propagation                                      ✓
PSNR (dB)                     31.48   31.94   32.08   32.39

Second-Order Grid Propagation. We further provide some qualitative comparisons to understand the contributions of the proposed propagation scheme. As shown in the two examples of Fig. 7, the contribution of both the second-order propagation and grid propagation is more noticeable in regions that contain fine details and complex textures. In those regions, there is limited information from the current frame that can be employed for reconstruction. To improve the output quality of those regions, effective information aggregation from other video frames is necessary. With our second-order propagation scheme, the information can be transmitted via a robust and effective propagation. This complementary information essentially assists the restoration of the fine details. As shown in the examples, the network successfully restores the details with our components, whereas the counterparts without our components produce blurry outputs.

Flow-Guided Deformable Alignment. In Fig. 8(a-d), we compare the offsets with the optical flow computed by the flow estimation module in BasicVSR++. By learning only the residue to optical flow, the network produces offsets that are highly similar to the optical flow, but with observable differences. When compared to the baseline which aggregates information from only one spatial location indicated by the motion (optical flow), our proposed module allows retrieving information from multiple locations around, providing additional flexibility.

This flexibility leads to features with better quality, as shown in Fig. 8(g-h). When the warping is performed by using optical flow, the aligned features contain blurry edges, owing to the interpolation operation in spatial warping. In contrast, by gathering more information from the neighbors, the feature aligned by our proposed module is sharper and preserves more details.
Table 4: Comparison of alignment modules. Using optical flow to guide deformable alignment successfully stabilizes training. BasicVSR++ directly incorporates optical flow into the network, outperforming the offset-fidelity loss [3].

            w/o Flow   Offset-Fidelity Loss [3]   Ours
PSNR (dB)   27.44      30.22                      32.39

To demonstrate the superiority of our designs, we compare our alignment module with two variants: (1) No optical flow is used. (2) Optical flow is used as in the offset-fidelity loss [3], i.e. the flow is merely used as supervision in the loss function (rather than serving as base offsets as in our method). As shown in Table 4, without using optical flow as guidance, the instability causes training to collapse, leading to a very poor PSNR value. When using the offset-fidelity loss, the training is stabilized. However, a drop of 2.17 dB from our full model is observed. Our flow-guided deformable alignment directly incorporates optical flow into the network to provide more explicit guidance, leading to better results.

Temporal Consistency. Here, we examine the temporal consistency, which is another important direction in VSR. The recurrent framework intrinsically maintains a better temporal consistency in comparison to the sliding-window framework. In the sliding-window framework (e.g., EDVR [35]), each frame is reconstructed independently. In such a design, the consistency between the outputs cannot be guaranteed. In contrast, in the recurrent framework (e.g., BasicVSR [2]), the outputs are related through the propagation of the intermediate features. The temporal propagation essentially helps maintain better temporal consistency.

In Fig. 9 we show a comparison of the temporal profiles between BasicVSR++ and two state-of-the-art methods, EDVR and BasicVSR. For the sliding-window method, the temporal profile from EDVR contains significant noise, indicating flickering artifacts in the output video. In contrast, for recurrent networks, even without explicit modeling of temporal consistency, the profiles from BasicVSR and BasicVSR++ demonstrate better consistencies. However, the profile from BasicVSR still contains discontinuity. Benefiting from our enhanced propagation and alignment, BasicVSR++ is able to aggregate richer information from the video frames, showing a smoother temporal transition. The video results are given in the supplementary material.

[Figure 9 strips: temporal profiles along a selected column for EDVR, BasicVSR, BasicVSR++, and GT.]
Figure 9: Comparison of temporal profile. We select a column (orange dotted lines) and observe the changes across time. The profile from EDVR possesses noise, indicating flickering artifacts. The profile from BasicVSR still contains discontinuity. By better aggregating the long-term information, the profile from BasicVSR++ demonstrates a smoother transition.

6. NTIRE 2021 Challenge Results

In NTIRE 2021, BasicVSR++ wins the video super-resolution track [29] with a compact and efficient structure. In addition to VSR, BasicVSR++ generalizes well to other restoration tasks. BasicVSR++ obtains two champions and one runner-up in the compressed video enhancement challenge [39]. Fig. 10 shows the restoration results of three different patches of compressed videos. BasicVSR++ successfully reduces the artifacts and produces outputs with much better qualities. The promising performance in the competitions demonstrates the generalizability and versatility of BasicVSR++.

[Figure 10 strips: compressed inputs, BasicVSR++ outputs, and GT for three patches.]
Figure 10: Results on compressed video enhancement. The outputs clearly possess fewer artifacts, and the details are shown more clearly.

7. Conclusion

In this work, we redesign BasicVSR with two novel components to enhance its propagation and alignment performance for the task of video super-resolution. Our model BasicVSR++ outperforms existing state-of-the-art methods by a large margin while maintaining efficiency. These designs generalize well to other video restoration tasks including compressed video enhancement. The components are generic, and we speculate that they will be useful for other video-based enhancement or restoration tasks such as deblurring and denoising.
References

[1] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR, 2017.
[2] Kelvin C.K. Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. BasicVSR: The search for essential components in video super-resolution and beyond. In CVPR, 2021.
[3] Kelvin C.K. Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In AAAI, 2021.
[4] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In ICIP, 1994.
[5] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. In CVPR, 2018.
[6] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[7] Damien Fourure, Rémi Emonet, Élisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017.
[8] Dario Fuoli, Shuhang Gu, and Radu Timofte. Efficient video super-resolution through recurrent latent space propagation. In ICCVW, 2019.
[9] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In CVPR, 2019.
[10] Yan Huang, Wei Wang, and Liang Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In NIPS, 2015.
[11] Yan Huang, Wei Wang, and Liang Wang. Video super-resolution via bidirectional recurrent convolutional networks. TPAMI, 2018.
[12] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network. In ECCV, 2020.
[13] Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, and Qi Tian. Video super-resolution with temporal group attention. In CVPR, 2020.
[14] Takashi Isobe, Fang Zhu, and Shengjin Wang. Revisiting temporal modeling for video super-resolution. In BMVC, 2020.
[15] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, 2018.
[16] Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In NIPS, 2018.
[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
[19] Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. MuCAN: Multi-correspondence aggregation network for video super-resolution. In ECCV, 2020.
[20] Tsungnan Lin, Bill G. Horne, Peter Tino, and C. Lee Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 1996.
[21] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. TPAMI, 2014.
[22] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[23] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In CVPRW, 2019.
[24] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In CVPR, 2019.
[25] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020.
[26] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.
[27] Mehdi S. M. Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In CVPR, 2018.
[28] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016.
[29] Sanghyun Son, Suyoung Lee, Seungjun Nah, Radu Timofte, Kyoung Mu Lee, Kelvin C.K. Chan, et al. NTIRE 2021 challenge on video super-resolution. In CVPRW, 2021.
[30] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
[31] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In CVPR, 2017.
[32] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. TDAN: Temporally deformable alignment network for video super-resolution. In CVPR, 2020.
[33] Hua Wang, Dewei Su, Longcun Jin, and Chuangchuang Liu. Deformable non-local network for video super-resolution. IEEE Access, 2019.
[34] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. TPAMI, 2020.
[35] Xintao Wang, Kelvin C.K. Chan, Ke Yu, Chao Dong, and Chen Change Loy. EDVR: Video restoration with enhanced deformable convolutional networks. In CVPRW, 2019.
[36] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, and Chenliang Xu. Zooming Slow-Mo: Fast and accurate one-stage space-time video super-resolution. In CVPR, 2020.
[37] Xiangyu Xu, Muchen Li, Wenxiu Sun, and Ming-Hsuan Yang. Learning spatial and spatio-temporal pixel aggregations for image and video denoising. TIP, 2020.
[38] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. Video enhancement with task-oriented flow. IJCV, 2019.
[39] Ren Yang, Radu Timofte, Jing Liu, Yi Xu, Xinjian Zhang, Minyi Zhao, Shuigeng Zhou, Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, et al. NTIRE 2021 challenge on quality enhancement of compressed video: Methods and results. In CVPRW, 2021.
[40] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In ICCV, 2019.
[41] Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In ICCV, 2019.
[42] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, 2019.
[43] Juntang Zhuang, Junlin Yang, Lin Gu, and Nicha Dvornek. ShelfNet for fast semantic segmentation. In ICCVW, 2019.

A. Network Architecture

We use pretrained SPyNet [26] as our flow network. The number of residual blocks for the initial feature extraction is set to 5, and the number of residual blocks for each propagation branch is set to 7. The feature channel is set to 64.

The architecture of our second-order deformable alignment is highly similar to the first-order counterpart (Fig. 3 in the main paper). The only difference is that the pre-aligned features and optical flows from different timesteps are concatenated, and passed to the offset estimation module C^o and mask estimation module C^m. Their architectures are detailed in Table 5. We set the DCN kernel size to 3 and the number of deformable groups to 16. Codes will be released.

Table 5: Architectures of C^o and C^m. The two modules share the first six layers. They can be implemented as a stack of convolutions followed by a channel-splitting. The arguments in the convolution layer are input channels, output channels, and kernel size, respectively.

Layer   C^o                 C^m
1       conv(196, 64, 3)    (shared)
2       LeakyReLU(0.1)      (shared)
3       conv(64, 64, 3)     (shared)
4       LeakyReLU(0.1)      (shared)
5       conv(64, 64, 3)     (shared)
6       LeakyReLU(0.1)      (shared)
7       conv(64, 288, 3)    conv(64, 144, 3)
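Table 5 can be transcribed almost directly into a module. The 196 input channels correspond to c(g_i, f̄_{i-1}, f̄_{i-2}, s_{i→i-1}, s_{i→i-2}) = 3×64 + 2×2, and the 288/144 output channels to 2×16×9 offsets and 16×9 masks (16 deformable groups, 3×3 DCN kernel). A sketch, fusing layer 7 into one convolution followed by the channel split the caption describes (the class and attribute names are illustrative):

```python
import torch.nn as nn

class OffsetMaskEstimator(nn.Module):
    """C^o and C^m from Table 5: a shared six-layer trunk, then a split
    into 288 offset channels and 144 mask channels."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(196, 64, 3, padding=1), nn.LeakyReLU(0.1),  # layers 1-2
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.1),   # layers 3-4
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.1))   # layers 5-6
        self.head = nn.Conv2d(64, 288 + 144, 3, padding=1)        # layer 7, fused

    def forward(self, x):
        offsets, masks = self.head(self.trunk(x)).split([288, 144], dim=1)
        return offsets, masks                                     # C^o and C^m outputs
```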
B. Experimental Settings

Datasets. Two widely-used datasets are adopted for training: REDS [23] and Vimeo-90K [38]. For REDS, following BasicVSR [2], we use REDS4⁵ as our test set and REDSval4⁶ as our validation set. The remaining clips are used for training. We use Vid4 [21], UDM10 [40], and Vimeo-90K-T [38] as test sets along with Vimeo-90K.

⁵ Clips 000, 011, 015, 020 of REDS training set.
⁶ Clips 000, 001, 006, 017 of REDS validation set.

Degradations. All models are tested with 4× downsampling using two degradations: Bicubic (BI) and Blur Downsampling (BD). For BI, the MATLAB function imresize is used for downsampling. For BD, we blur the ground-truth by a Gaussian filter with σ=1.6, followed by a subsampling every four pixels.
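A hedged sketch of the BD degradation described above; the paper fixes only σ=1.6 and the 4× subsampling, so the kernel size here is an assumption:

```python
import torch
import torch.nn.functional as F

def bd_degrade(hr, sigma=1.6, scale=4, ksize=13):
    """Blur Downsampling: Gaussian-blur the ground truth (N, C, H, W), then
    keep every `scale`-th pixel. `ksize` is an assumed kernel size."""
    xs = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    g1d = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    k2d = torch.outer(g1d, g1d)
    kernel = (k2d / k2d.sum()).expand(hr.size(1), 1, ksize, ksize)  # depthwise
    blurred = F.conv2d(F.pad(hr, [ksize // 2] * 4, mode="reflect"),
                       kernel, groups=hr.size(1))
    return blurred[..., ::scale, ::scale]   # subsample every `scale` pixels
```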
Training Settings. We adopt the Adam optimizer [17] and Cosine Annealing scheme [22]. When trained on REDS, the initial learning rates of the main network and the flow network are set to 1×10⁻⁴ and 2.5×10⁻⁵, respectively. The total number of iterations is 600K, and the weights of the flow network are fixed during the first 5,000 iterations. The batch size is 8 and the patch size of input LR frames is 64×64. We use the Charbonnier loss [4] since it better handles outliers and improves the performance over the conventional ℓ2-loss [18]. During training, 30 LR frames are used as inputs. Since Vimeo-90K contains only seven frames per sequence, networks trained solely on Vimeo-90K may not be able to capture long-term dependencies. Therefore, we initialize the model using the weights trained on REDS when training on Vimeo-90K. The number of finetune iterations is 300K.

Test Settings. We take the full video sequence as input to explore information from all video frames for restoration.

C. Qualitative Comparisons

In this section, we provide additional qualitative comparisons on REDS4 [23], UDM10 [40], Vimeo-90K [38], and Vid4 [21]. From the examples, we see that BasicVSR++ is able to restore the fine details, leading to plausible results. A video demo is also provided in the submitted zip file.
[Figure 11 crops (REDS4). Frame 002, Clip 011: Bicubic 25.93 dB, RBPN 30.88 dB, EDVR-M 31.28 dB, EDVR 32.11 dB, BasicVSR 32.25 dB, IconVSR 32.62 dB, BasicVSR++ (ours) 33.61 dB, GT. Frame 061, Clip 020: Bicubic 26.00 dB, RBPN 29.31 dB, EDVR-M 30.26 dB, EDVR 30.82 dB, BasicVSR 31.43 dB, IconVSR 31.65 dB, BasicVSR++ (ours) 32.25 dB, GT.]
Figure 11: Qualitative comparison on REDS4 [35].

[Figure 12 crops (UDM10). Frame 031, Clip auditorium: Bicubic 24.03 dB, EDVR 30.91 dB, BasicVSR 29.42 dB, IconVSR 31.06 dB, BasicVSR++ (ours) 31.93 dB, GT.]
Figure 12: Qualitative comparison on UDM10 [40].

[Figure 13 crops (Vimeo-90K-T). Sequence 0864, Clip 015: Bicubic 20.32 dB, RBPN 24.19 dB, EDVR-M 23.09 dB, EDVR 23.89 dB, BasicVSR 22.16 dB, IconVSR 23.78 dB, BasicVSR++ (ours) 25.34 dB, GT. Sequence 0723, Clip 085: Bicubic 30.08 dB, RBPN 33.34 dB, EDVR-M 31.93 dB, EDVR 30.98 dB, BasicVSR 32.84 dB, IconVSR 31.55 dB, BasicVSR++ (ours) 35.51 dB, GT.]
Figure 13: Qualitative comparison on Vimeo-90K-T [38].

[Figure 14 crops (Vid4). Frame 040, Clip calendar: Bicubic 19.10 dB, RBPN 21.94 dB, EDVR-M 22.01 dB, EDVR 22.12 dB, BasicVSR 21.78 dB, IconVSR 22.09 dB, BasicVSR++ (ours) 22.50 dB, GT.]
Figure 14: Qualitative comparison on Vid4 [21].
