Overview: This paper introduces X4K1000FPS, a high-resolution dataset proposed for the video frame interpolation (VFI) task, together with a new extreme VFI network (XVFI-Net). The X4K1000FPS dataset consists of 4K-resolution, 1000-fps videos containing rich motion, occlusion and texture variation, and is intended to address the poor performance that models trained on existing low-resolution datasets show on high-resolution videos. XVFI-Net adopts a recursive multi-scale shared structure divided into bidirectional optical flow learning modules (BiOF-I and BiOF-T), which can effectively capture large motion and stably estimate optical flow. Experimental results show that XVFI-Net performs well on X4K1000FPS and other benchmark datasets, and significantly outperforms existing methods, especially in extreme-motion scenarios.
Intended audience: Researchers and engineers in computer vision, especially those interested in video processing, optical flow estimation and deep learning model design.
Use cases and goals: (1) research and development of frame interpolation techniques for high-resolution videos; (2) improving the performance of video processing systems, particularly when handling fast-motion scenes; (3) evaluating how different VFI models behave under extreme conditions, as a reference for practical applications.
Reading suggestions: The paper describes the design rationale and technical details of XVFI-Net; readers are advised to focus on the model architecture, training method and experimental results. Understanding the characteristics of the X4K1000FPS dataset and its construction process is also important for related research. All source code and the dataset are publicly available, making it easy to reproduce the experiments and explore further.
XVFI: eXtreme Video Frame Interpolation
Hyeonjun Sim*        Jihyong Oh*        Munchurl Kim†
Korea Advanced Institute of Science and Technology
{flhy5836, jhoh94, mkimee}@kaist.ac.kr
(a) 96.6  (b) 71.0  (c) 34.9  (d) 40.6  (e) 196.5  (f) 152.2
Figure 1. Some examples of our X4K1000FPS dataset, which contain diverse motions in 4K resolution at 1000 fps. The numbers below the examples are the mean magnitudes of optical flows between two input frames at 30 fps. This is a video figure that is best viewed with motion using Adobe™ Reader. It should be noted that the examples are rendered down-scaled at 15 fps for visualization convenience.
Abstract
In this paper, we first present a dataset (X4K1000FPS) of 4K videos of 1000 fps with extreme motion to the research community for video frame interpolation (VFI), and propose an extreme VFI network, called XVFI-Net, that first handles the VFI for 4K videos with large motion.
The XVFI-Net is based on a recursive multi-scale shared structure that consists of two cascaded modules for bidirectional optical flow learning between two input frames (BiOF-I) and for bidirectional optical flow learning from target to input frames (BiOF-T). The optical flows are stably approximated by a complementary flow reversal (CFR) proposed in the BiOF-T module. During inference, the BiOF-I module can start at any scale of input while the BiOF-T module only operates at the original input scale, so that the inference can be accelerated while maintaining highly accurate VFI performance. Extensive experimental results show that our XVFI-Net can successfully capture the essential information of objects with extremely large motions and complex textures while the state-of-the-art methods exhibit poor performance. Furthermore, our XVFI-Net framework also performs comparably on the previous lower-resolution benchmark dataset, which shows the robustness of our algorithm as well. All source codes, pre-trained models, and the proposed X4K1000FPS datasets are publicly available at https://github.com/JihyongOh/XVFI.
* Both authors contributed equally to this work.
† Corresponding author.
1. Introduction
Video frame interpolation (VFI) converts low frame rate (LFR) contents to high frame rate (HFR) videos by synthesizing one or more intermediate frames between two given consecutive frames, so that videos of fast motion can be smoothly rendered at an increased frame rate, thus yielding reduced motion judder [28, 24, 23, 10]. Therefore,
it is widely used for various practical applications, such as
adaptive streaming [46], novel view interpolation synthe-
sis [11], frame rate up conversion [29, 5, 50], slow mo-
tion generation [18, 4, 30, 32, 27, 34] and video restora-
tion [21, 43, 14, 42]. However, VFI is significantly chal-
lenging, which is attributed to diverse factors such as oc-
clusions, large motions and change of light. Recent deep-
learning-based VFI has been actively studied, showing re-
markable performances [48, 4, 7, 37, 25, 13, 31, 51, 6, 33].
However, they are often optimized for existing LFR bench-
mark datasets of low resolution (LR), which may lead to
poor VFI performance, especially for videos of 4K resolu-
tion (4096×2160) or higher with very large motion [1, 21].
Such 4K videos often contain frames of fast motion with extremely large pixel displacements, for which conventional convolutional neural networks (CNNs) with receptive fields of limited sizes do not work effectively.
To solve the above issues for deep learning-based
VFI methods, we directly photographed 4K videos to
construct a high-quality HFR dataset of high resolution,
called X4K1000FPS. Fig. 1 shows some examples of our
X4K1000FPS dataset. As shown, our videos of 4K resolu-
tion have extremely large motions and occlusions.

[Figure 2 panels: overlapped 4K inputs (crop); FeFlow; DAIN; XVFI-Net (Ours); 288 pixels / 179 pixels]
Figure 2. VFI results for extreme motions. Our XVFI-Net can generate a more stable intermediate frame with very large motions than two recent SOTA methods, FeFlow [13] and DAIN [4], which are newly trained on our dataset for fair comparisons.
We also first propose an extreme VFI model, called XVFI-Net, that is effectively designed to handle such a challenging dataset of 4K@1000fps. Instead of directly capturing extreme motions through consecutive feature spaces with deformable convolution, as in recent trends in video restoration [13, 47, 43, 42, 20], or using very large-sized pretrained networks with extra information such as contexts, depths, flows and edges [4, 51, 30, 13], our XVFI-Net is simple but effective, being based on a recursive multi-scale shared structure. The XVFI-Net has two cascaded modules: one for the bidirectional optical flow learning between two input frames (BiOF-I) and the other for the bidirectional optical flow estimation from the target to the input frames (BiOF-T). The BiOF-I and BiOF-T modules are trained in combination with multi-scale losses. However, once trained, the BiOF-I module can start from any down-scaled input upward while the BiOF-T module only operates at the original input scale at inference, which is computationally efficient and helps to generate an intermediate frame at any target time instance. Structurally, the XVFI-Net is adjustable in terms of the number of scales for inference according to the input resolution or the motion magnitude, even after training is over. We also propose a novel optical flow estimation from time t to the input frames, called complementary flow reversal (CFR), which effectively fills the holes by taking complementary flows. Extensive experiments are conducted for fair comparison, and our XVFI-Net, which has a relatively small complexity, outperforms previous VFI SOTA methods on our X4K1000FPS, especially for extreme motions, as shown in Fig. 2. A further experiment on the previous LR-LFR benchmark dataset also demonstrates the robustness of our XVFI-Net. Our contributions are summarized as follows:
• We first propose a high-quality HFR video dataset of 4K resolution, called X4K1000FPS (4K@1000fps), which contains a wide variety of textures, extremely large motions, zooming and occlusions.
• We propose the CFR, which can generate stable optical flow estimates from time t to the input frames, boosting both qualitative and quantitative performance.
• Our proposed XVFI-Net can start from any down-scaled input upward and is adjustable in terms of the number of scales for inference according to the input resolution or the motion magnitude.
• Our XVFI-Net achieves state-of-the-art performance on the testset of X4K1000FPS with a significant margin compared to the previous VFI SOTA methods while being computationally efficient with a small number of filter parameters. All source codes and the proposed X4K1000FPS dataset are publicly available at https://github.com/JihyongOh/XVFI.
2. Related Work
2.1. Video Frame Interpolation
Most VFI methods can be categorized into optical flow-
or kernel-based [27, 32, 18, 30, 34, 48, 4, 1, 2, 25, 33, 31]
and pixel hallucination-based [13, 47, 7, 37, 21] methods.
Flow-based VFI. Super-SloMo [18] first linearly combines
predicted optical flows between two input frames to approx-
imate flows from the target intermediate frame to the input
frames. Quadratic video frame interpolation [48] utilizes
four input frames to cope with nonlinear motion modeling
by quadratic approximation, which limits the VFI general-
ization when two input frames are given. It also proposes
flow reversal (projection) for more accurate image warp-
ing. On the other hand, DAIN [4] gives different weights of
overlapped flow vectors depending on the object depth of
the scene via a flow projection layer. However, DAIN employs and fine-tunes both PWC-Net [41] and MegaDepth [26], which is computationally burdensome when inferring intermediate HR frames. AdaCoF proposes a generalized warping module to deal with complex motion [25]. However, once trained, it cannot adapt to frames of higher resolutions due to its fixed dilation degree.
Pixel Hallucination-based VFI. FeFlow [13] benefits from deformable convolution [9] in its center-frame generator by replacing optical flows with offset vectors. Zooming Slow-Mo [47] also interpolates middle frames with the help of deformable convolution in the feature domain. However, since these methods directly hallucinate pixels, unlike the flow-based VFI methods, the predicted frames tend to be blurry when fast-moving objects are present.
Most importantly, the aforementioned VFI methods are
difficult to operate on the entire HR frames at once, due to
their heavy computational complexity. On the other hand,
our XVFI-Net is designed to efficiently operate on the entire

4K frame input at once with a smaller number of parameters
and is capable of effectively capturing large motions.
2.2. Networks for Large Pixel Displacements
PWC-Net [41] is a state-of-the-art optical flow estimator that has been adopted in several VFI methods as a pretrained flow estimator [48, 4, 31]. Since PWC-Net has a 6-level feature pyramid structure with larger receptive fields, it can effectively predict large motions. IM-Net [34] also adopts a multi-scale structure to cover large displacements of objects in adjacent frames, while the coverage is limited by the size of the adaptive filters. Despite the multi-scale pyramid structures, the above methods lack adaptivity because the coarsest level of each network is fixed once trained, i.e., each scale level has its own (not shared) parameters. The RRPN [51] shares weight parameters across different scale levels in a flexible recurrent pyramid structure. However, it only infers the center frame, not frames at arbitrary time instances, so it can only recursively synthesize intermediate frames at dyadic time positions (powers of 1/2). As a result, prediction errors accumulate as intermediate frames are generated iteratively between the two input frames. Therefore, RRPN has limited temporal flexibility for VFI at an arbitrary target time instance t.
Distinguished from the above methods, our proposed
XVFI-Net has a scalable structure with shared parameters
for various input resolutions. Different from RRPN [51], the
XVFI-Net is structurally divided into the BiOF-I and BiOF-
T modules, which allows predicting an intermediate frame
at arbitrary time t with the help of the complementary flow
reversal in an efficient way. That is, the BiOF-T module can be skipped at the down-scaled levels during inference, so that our model can infer an intermediate 4K frame at once, without any patch-wise iteration unlike all other previous methods, which makes it applicable to real-world applications.
3. Proposed X4K1000FPS Dataset
Although numerous methods for VFI have been both
trained and evaluated over the diverse benchmark datasets,
such as Adobe240fps [40], DAVIS [35], UCF101 [39], Mid-
dlebury [3] and Vimeo90K [49], none of the datasets con-
tains rich amounts of 4K videos with HFR. This limits the study of elaborate VFI methods required for applications targeting very high resolution videos.
To tackle the challenging extreme VFI task, we provide a rich set of 4K@1000fps videos that we photographed using a Phantom Flex4K™ camera at the 4K spatial resolution of 4096×2160 and 1,000 fps, producing 175 video scenes, each with 5,000 frames obtained by shooting for 5 seconds.
In order to select valuable data samples for VFI, we es-
timated bidirectional occlusion maps and optical flows of
every 32 frames of the scenes using IRR-PWC [16]. The
occlusion map indicates the parts of objects that become occluded in the next frames. The occlusion makes optical flow estimation and frame interpolation challenging [44, 4, 16]. Thus, we manually selected 15 scenes as our testset, called X-TEST, by considering the degrees of occlusion, optical flow magnitudes and scene diversity. Each scene of X-TEST contains one test sample that consists of two input frames at a temporal distance of 32 frames, which approximately corresponds to 30 fps. The test evaluation is set to interpolate 7 intermediate frames, which results in consecutive frames at 240 fps. For the training dataset, called X-TRAIN, we cropped and selected 4,408 clips of size 768×768 with lengths of 65 consecutive frames, by considering the amounts of occlusion. More details are described in Supplementary Material.

Dataset              Occlusion [16]           Flow magnitude [16]
                     25th    50th    75th     25th    50th    75th
Vimeo90K [49]         6.8    11.9    18.1      3.1     4.9     7.1
Adobe240fps [40]      0.8     1.7     3.2      3.8     8.9    16.3
X-TEST (ours)         2.1     5.6    17.7     23.9    81.9   138.5
X-TRAIN (ours)        6.9    10.1    15.7      5.5    18.0    59.5
(25th, 50th and 75th represent percentiles of each dataset.)

Table 1. The occlusion and optical flow magnitude statistics of VFI datasets: 3,782 test triplets of Vimeo90K [49], randomly selected 200 clips of Adobe240fps [40], 15 clips of X-TEST and 4,408 clips of X-TRAIN.
Table 1 compares the statistics of datasets: Vimeo90K
[49], Adobe240fps [40], our X-TEST and X-TRAIN. We
estimated the occlusion range in [0,255] and optical flow
magnitudes [16] between input pairs and calculated their
percentiles for each dataset. As shown in Table 1, our
datasets contain comparable occlusion but significantly
larger motion, compared to the previous VFI datasets.
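As a rough sketch of how such statistics can be gathered, the snippet below assumes that the bidirectional occlusion maps (values in [0, 255]) and optical flows have already been estimated with IRR-PWC and stored as NumPy arrays; whether the percentiles are taken over per-clip means or over all pixels is not stated above, so per-clip means are used here as an assumption.

```python
import numpy as np

def dataset_statistics(occ_maps, flows, percentiles=(25, 50, 75)):
    """Sketch of the Table 1 statistics.
    occ_maps: iterable of (H, W) occlusion maps in [0, 255];
    flows:    iterable of (H, W, 2) optical flows in pixel units."""
    occ_vals, mag_vals = [], []
    for occ, flow in zip(occ_maps, flows):
        occ_vals.append(float(occ.mean()))                            # per-clip mean occlusion
        mag_vals.append(float(np.linalg.norm(flow, axis=-1).mean()))  # per-clip mean flow magnitude
    return (np.percentile(occ_vals, percentiles),   # occlusion 25th / 50th / 75th
            np.percentile(mag_vals, percentiles))   # flow magnitude 25th / 50th / 75th
```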
4. Proposed Method: XVFI-Net Framework
4.1. Design Considerations
Our XVFI-Net aims at interpolating an intermediate frame $I_t$ at an arbitrary time t between two consecutive input frames, $I_0$ and $I_1$, of HR with extreme motion.
Scale Adaptivity. An architecture with a fixed number of scale levels, like PWC-Net [41], is difficult to adapt to various spatial resolutions of the input video, because the structure in each scale level is not shared across different scale levels, so a new architecture with an increased scale depth needs to be retrained. In order to have scale adaptivity to various spatial resolutions of input frames, our XVFI-Net is designed to have optical flow estimation starting at any desired coarse scale level, adapting to the degree of motion magnitudes in the input frames. To do so, our XVFI-Net shares its parameters across different scale levels.
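To illustrate this, the following is a minimal sketch of how such a scale-adaptive, coarse-to-fine inference loop could look; the callables biof_i, biof_t, and synthesize, their interfaces, and the handling of the coarsest scale are assumptions made for illustration, not the exact interface of the released code.

```python
def xvfi_inference(C_pyramid, t, S_test, biof_i, biof_t, synthesize):
    """Scale-adaptive inference sketch. C_pyramid[s] holds the contextual
    features (C_0^s, C_1^s) at scale s; S_test is chosen freely at test time
    according to the input resolution or expected motion magnitude, since the
    same shared-parameter modules are reused at every scale."""
    F01, F10 = None, None                           # no initial flows at the coarsest scale
    for s in range(S_test, -1, -1):                 # coarse-to-fine: S_test, ..., 1, 0
        C0_s, C1_s = C_pyramid[s]
        F01, F10, z = biof_i(C0_s, C1_s, F01, F10)  # BiOF-I is applied at every scale
    # BiOF-T (and frame synthesis) only runs at the original input scale s = 0
    Ft0, Ft1 = biof_t(C_pyramid[0], F01, F10, z, t)
    return synthesize(Ft0, Ft1, t)
```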
Capturing Large Motion. In order to effectively capture a large motion between two input frames, the Feature Extraction Block of XVFI-Net first reduces the spatial resolution of the two input frames by a module scale factor M via a strided convolution, thus yielding the spatially reduced features that are then converted to two contextual feature maps $C^0_0$ and $C^0_1$. The Feature Extraction Block in Fig. 3 is composed of the strided convolution and two residual blocks [15]. Then, XVFI-Net at each scale level estimates optical flows from the target frame $I_t$ to the two input frames in the spatial size reduced by M. The predicted flows are upscaled (×M) to warp the input frames at each scale level to time t.
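As a concrete illustration, below is a minimal PyTorch sketch of such a Feature Extraction Block; the channel width nf, the kernel sizes, and the value M = 4 are illustrative assumptions rather than the exact configuration of XVFI-Net.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # Plain two-layer residual block [15]; the channel width is an assumption.
    def __init__(self, nf):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nf, nf, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, nf, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractionBlock(nn.Module):
    """Strided convolution that reduces the spatial size by the module scale
    factor M, followed by two residual blocks, yielding a contextual feature
    map (C_0^0 or C_1^0) from an input frame."""
    def __init__(self, in_ch=3, nf=64, M=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, nf, kernel_size=M, stride=M)  # spatial reduction by M
        self.res = nn.Sequential(ResBlock(nf), ResBlock(nf))

    def forward(self, frame):
        return self.res(self.down(frame))
```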
4.2. XVFI-Net Architecture
BiOF-I module. Fig. 4 shows our XVFI-Net architecture at scale s, where $I^s$ denotes the input bicubically down-scaled by $1/2^s$. First, the contextual pyramid $C = \{C^s\}$ is recurrently extracted from $C^0_0$ and $C^0_1$ via a stride-2 convolution, and then utilized as an input for XVFI-Net at each scale level s (s = 0, 1, 2, ...), where s = 0 denotes the scale of the original input frames. Let $F^s_{t_a t_b}$ denote the optical flow from time $t_a$ to $t_b$ at scale s. $F^s_{01}$ and $F^s_{10}$ are the bidirectional flows between the input frames at scale s. $F^s_{t0}$ and $F^s_{t1}$ are the bidirectional flows from $I^s_t$ to $I^s_0$ and $I^s_1$, respectively. The estimated flows $F^{s+1}_{01}$, $F^{s+1}_{10}$ from the previous scale (s+1) are ×2 bilinearly up-scaled to be set as the initial flows for the current scale s, i.e., $\tilde{F}^s_{01} = F^{s+1}_{01}\uparrow_2$, $\tilde{F}^s_{10} = F^{s+1}_{10}\uparrow_2$. To update the initial flows at the current scale, $C^s_0$ and $C^s_1$ are first warped by the initial flows, that is, $\tilde{C}^s_{01} = W(\tilde{F}^s_{01}, C^s_1)$ and $\tilde{C}^s_{10} = W(\tilde{F}^s_{10}, C^s_0)$, respectively, where W is a backward warping operation [17]. Then $\tilde{C}^s_{01}$, $\tilde{C}^s_{10}$, $C^s_0$, $C^s_1$ together with $\tilde{F}^s_{01}$, $\tilde{F}^s_{10}$ are fed to the auto-encoder-based BiFlownet as in Fig. 4 to output residual flows over the initial flows and a trainable importance mask z [31]. Then $F^s_{01}$, $F^s_{10}$ are obtained. They are then fed as input to the BiOF-T module and are also used as the initial flows for the next scale s − 1.
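A minimal PyTorch sketch of one such BiOF-I update at a scale s is given below; the grid_sample-based backward warping, the ×2 scaling of flow magnitudes along with the spatial up-scaling, and the BiFlownet interface (returning two residual flows and the importance mask z) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Backward warping W(flow, feat): sample feat at positions displaced by flow.
    feat: (B, C, H, W); flow: (B, 2, H, W) in pixel units, channel order (x, y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                             # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0                 # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def biof_i_step(C0_s, C1_s, F01_prev, F10_prev, biflownet):
    """One BiOF-I update at scale s (sketch); F01_prev, F10_prev come from scale s+1."""
    # x2 bilinear up-scaling of the previous-scale flows; displacements doubled as well
    F01_init = 2.0 * F.interpolate(F01_prev, scale_factor=2, mode="bilinear", align_corners=False)
    F10_init = 2.0 * F.interpolate(F10_prev, scale_factor=2, mode="bilinear", align_corners=False)
    # warp the contextual features by the initial flows
    C01_w = backward_warp(C1_s, F01_init)   # ~C_01^s = W(~F_01^s, C_1^s)
    C10_w = backward_warp(C0_s, F10_init)   # ~C_10^s = W(~F_10^s, C_0^s)
    inp = torch.cat([C0_s, C1_s, C01_w, C10_w, F01_init, F10_init], dim=1)
    res01, res10, z = biflownet(inp)        # residual flows and importance mask z (assumed)
    return F01_init + res01, F10_init + res10, z
```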
BiOF-T module. Hereafter, we omit the superscript s in the notation of feature tensors at each scale, unless mentioned. Although the linear approximation with the optical flows $F_{01}$, $F_{10}$ [18] or the flow reversal of $F_{0t}$, $F_{1t}$ [48] allows estimating the flows $F_{t0}$, $F_{t1}$ at an arbitrary time t, there are a few shortcomings. The linear approximation is inaccurate in predicting $F_{t0}$ and $F_{t1}$ for fast-moving objects because the anchor points of $F_{01}$ and $F_{10}$ are severely misaligned. On the other hand, the flow reversal can align the anchor points, but holes may appear in the estimated $F_{t0}$ and $F_{t1}$. To stabilize the performance of the flow reversal, we take complementary advantages of both the linear approximation and the flow reversal. So, a stable optical flow estimate from time t to 0 or 1 can be computed by a normalized linear combination of a negative anchor flow and a complementary flow, which we call a complementary flow reversal (CFR). The resulting complementary reversed optical flow maps $\tilde{F}_{t0}$ and $\tilde{F}_{t1}$, from time t to 0 and 1, are given by
[Figure 3 (schematic): the Feature Extraction Block and recurrent stride-2 convolutions produce contextual features $C^s_0$, $C^s_1$ for scales 0, 1, ..., S; shared-parameter BiOF-I and BiOF-T modules (Fig. 4) run per scale with up-scaled flows passed between scales; training paths and possible test paths are marked.]
Figure 3. Adjustable and efficient scalability of our XVFI-Net framework. Even if the lowest scale depth $S_{trn}$ during training is set to 1 in this example, inference can start from any scale level.
$$\tilde{F}^x_{t0} = \frac{(1-t)\sum_{N_0} w_0\cdot\left(-F^y_{0t}\right) \;+\; t\sum_{N_1} w_1\cdot F^y_{1\cdot(1-t)}}{(1-t)\sum_{N_0} w_0 \;+\; t\sum_{N_1} w_1} \qquad (1)$$

$$\tilde{F}^x_{t1} = \frac{(1-t)\sum_{N_0} w_0\cdot F^y_{0\cdot(1-t)} \;+\; t\sum_{N_1} w_1\cdot\left(-F^y_{1t}\right)}{(1-t)\sum_{N_0} w_0 \;+\; t\sum_{N_1} w_1} \qquad (2)$$
where x denotes a pixel location at time t and y a pixel location at time 0 or 1. $w_i = z^y_i \cdot G(|x - (y + F^y_{it})|)$ is a Gaussian weight depending on the distance between x at time t and $y + F^y_{it}$ at time i (= 0 or 1), while also considering the learnable importance mask of each flow via $z^y_i$ [31]. Also, $-F^y_{0t}$ (or $-F^y_{1t}$) and $F^y_{1\cdot(1-t)}$ (or $F^y_{0\cdot(1-t)}$) in Eq. 1 (or Eq. 2) are defined as a negative anchor flow and a complementary flow, respectively. Furthermore, the anchor flows are normalized flows to the intermediate time t that can be calculated as $F_{0t} = tF_{01}$ and $F_{1t} = (1-t)F_{10}$. It should be noted in Eq. 1 and Eq. 2 that the complementary flows are also normalized, as $F_{1\cdot(1-t)} = tF_{10}$ and $F_{0\cdot(1-t)} = (1-t)F_{01}$, and complementarily fill the holes that occur in the reversed flows. By doing so, we can fully exploit the temporally densely captured X4K1000FPS dataset to train our XVFI-Net for VFI at an arbitrary time t. The neighborhoods of x are defined as
$$N_0 = \{\, y \mid \mathrm{round}(y + F^y_{0t}) = x \,\} \qquad (3)$$

$$N_1 = \{\, y \mid \mathrm{round}(y + F^y_{1t}) = x \,\}. \qquad (4)$$
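To make the forward-splatting structure of Eqs. (1)-(4) explicit, here is a naive per-pixel NumPy reference; it is a sketch written for clarity rather than speed, it is not the released CUDA/PyTorch implementation, and the Gaussian bandwidth sigma is an assumption since it is not specified in the text.

```python
import numpy as np

def gaussian(dist, sigma=1.0):
    # G(.) in the weights w_i; sigma is an assumed bandwidth.
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def complementary_flow_reversal(F01, F10, z0, z1, t, eps=1e-8):
    """Naive reference of CFR (Eqs. 1-4).
    F01, F10: (H, W, 2) bidirectional flows between the inputs, in (x, y) pixel units.
    z0, z1:   (H, W) importance masks. Returns ~F_t0 and ~F_t1, each (H, W, 2)."""
    H, W, _ = F01.shape
    # anchor flows normalized to time t, and complementary flows normalized to (1 - t)
    F0t, F1t = t * F01, (1.0 - t) * F10
    F0c, F1c = (1.0 - t) * F01, t * F10        # F_{0.(1-t)}, F_{1.(1-t)}
    num_t0 = np.zeros((H, W, 2)); num_t1 = np.zeros((H, W, 2)); den = np.zeros((H, W, 1))
    for F_it, F_c, z, coef, from_frame0 in ((F0t, F0c, z0, 1.0 - t, True),
                                            (F1t, F1c, z1, t, False)):
        for yy in range(H):
            for xx in range(W):
                tx = xx + F_it[yy, xx, 0]      # splat target y + F_it^y
                ty = yy + F_it[yy, xx, 1]
                xi, yi = int(round(tx)), int(round(ty))
                if 0 <= xi < W and 0 <= yi < H:
                    w = coef * z[yy, xx] * gaussian(np.hypot(xi - tx, yi - ty))
                    if from_frame0:            # terms over N_0 in Eqs. (1) and (2)
                        num_t0[yi, xi] += w * (-F_it[yy, xx])
                        num_t1[yi, xi] += w * F_c[yy, xx]
                    else:                      # terms over N_1 in Eqs. (1) and (2)
                        num_t0[yi, xi] += w * F_c[yy, xx]
                        num_t1[yi, xi] += w * (-F_it[yy, xx])
                    den[yi, xi] += w
    return num_t0 / (den + eps), num_t1 / (den + eps)
```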
To refine the bidirectional flow approximations $\tilde{F}_{t0}$, $\tilde{F}_{t1}$, we rewarp the feature maps ($C_0$, $C_1$) to $\tilde{C}_{t0}$ and $\tilde{C}_{t1}$ by $\tilde{F}_{t0}$ and $\tilde{F}_{t1}$, respectively. We concatenate and feed $C_0$, $C_1$, $\tilde{C}_{t0}$, $\tilde{C}_{t1}$, and $\tilde{F}_{t0}$, $\tilde{F}_{t1}$ to the auto-encoder-based TFlownet as in Fig. 4 (similarly to the refinement of $\tilde{F}_{01}$, $\tilde{F}_{10}$). The outputs of TFlownet are used to compose the refined flows $F_{t0}$, $F_{t1}$, which are then bilinearly up-scaled (×M) back to the size of the input frames at scale s.
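A short sketch of this refinement step follows, reusing the backward_warp helper from the BiOF-I sketch above; treating the TFlownet outputs as residual flows added to the CFR estimates, and the value M = 4, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def biof_t_refine(C0, C1, Ft0_tilde, Ft1_tilde, tflownet, M=4):
    """Refine the CFR flow approximations and up-scale them back to the frame size."""
    # rewarp the contextual feature maps toward time t by the approximated flows
    Ct0_w = backward_warp(C0, Ft0_tilde)   # backward_warp as defined in the BiOF-I sketch
    Ct1_w = backward_warp(C1, Ft1_tilde)
    inp = torch.cat([C0, C1, Ct0_w, Ct1_w, Ft0_tilde, Ft1_tilde], dim=1)
    res_t0, res_t1 = tflownet(inp)         # assumed: residual flows over the CFR estimates
    Ft0, Ft1 = Ft0_tilde + res_t0, Ft1_tilde + res_t1
    # bilinear up-scaling (xM) back to the input frame size; displacements scaled by M too
    Ft0_up = M * F.interpolate(Ft0, scale_factor=M, mode="bilinear", align_corners=False)
    Ft1_up = M * F.interpolate(Ft1, scale_factor=M, mode="bilinear", align_corners=False)
    return Ft0_up, Ft1_up
```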