Overview: This paper introduces X4K1000FPS, a high-resolution dataset proposed for the video frame interpolation (VFI) task, together with a new extreme VFI network (XVFI-Net). The X4K1000FPS dataset consists of 4K-resolution, 1000-fps videos containing rich motion, occlusion and texture variation, and is intended to address the poor performance that models trained on existing low-resolution datasets show on high-resolution videos. XVFI-Net adopts a recursive multi-scale shared structure divided into bidirectional optical flow learning modules (BiOF-I and BiOF-T), which can effectively capture large motion and stably estimate optical flow. Experimental results show that XVFI-Net performs well on X4K1000FPS and other benchmark datasets, and significantly outperforms existing methods, especially in extreme-motion scenarios.
Intended audience: Researchers and engineers in computer vision, especially those interested in video processing, optical flow estimation and deep learning model design.
Use cases and goals: (1) research and development of frame interpolation techniques for high-resolution videos; (2) improving the performance of video processing systems, particularly when handling fast-motion scenes; (3) evaluating how different VFI models behave under extreme conditions, as a reference for practical applications.
Reading suggestions: The paper describes the design rationale and technical details of XVFI-Net; readers are advised to focus on the model architecture, training method and experimental results. Understanding the characteristics of the X4K1000FPS dataset and its construction process is also important for related research. All source code and the dataset are publicly available, making it easy to reproduce the experiments and explore further.
XVFI: eXtreme Video Frame Interpolation
Hyeonjun Sim*        Jihyong Oh*        Munchurl Kim†
Korea Advanced Institute of Science and Technology
{flhy5836, jhoh94, mkimee}@kaist.ac.kr
(a) 96.6  (b) 71.0  (c) 34.9  (d) 40.6  (e) 196.5  (f) 152.2
Figure 1. Some examples of our X4K1000FPS dataset, which contain diverse motions in 4K resolution at 1000 fps. The numbers below the examples are the mean magnitudes of optical flows between two input frames at 30 fps. This is a video figure that is best viewed with motion using Adobe™ Reader. It should be noted that the examples are rendered down-scaled at 15 fps for visualization convenience.
Abstract
In this paper, we first present a dataset (X4K1000FPS) of 4K videos of 1000 fps with extreme motion to the research community for video frame interpolation (VFI), and propose an extreme VFI network, called XVFI-Net, that first handles the VFI for 4K videos with large motion.
The XVFI-Net is based on a recursive multi-scale shared structure that consists of two cascaded modules for bidirectional optical flow learning between two input frames (BiOF-I) and for bidirectional optical flow learning from target to input frames (BiOF-T). The optical flows are stably approximated by a complementary flow reversal (CFR) proposed in the BiOF-T module. During inference, the BiOF-I module can start at any scale of input while the BiOF-T module only operates at the original input scale, so that the inference can be accelerated while maintaining highly accurate VFI performance. Extensive experimental results show that our XVFI-Net can successfully capture the essential information of objects with extremely large motions and complex textures while the state-of-the-art methods exhibit poor performance. Furthermore, our XVFI-Net framework also performs comparably on the previous lower-resolution benchmark dataset, which shows the robustness of our algorithm as well. All source codes, pre-trained models, and the proposed X4K1000FPS datasets are publicly available at https://github.com/JihyongOh/XVFI.
* Both authors contributed equally to this work.
† Corresponding author.
1. Introduction
Video frame interpolation (VFI) converts low frame rate (LFR) contents to high frame rate (HFR) videos by synthesizing one or more intermediate frames between two given consecutive frames, so that videos of fast motion can be smoothly rendered at an increased frame rate, thus yielding reduced motion judder [28, 24, 23, 10]. Therefore,
it is widely used for various practical applications, such as
adaptive streaming [46], novel view interpolation synthe-
sis [11], frame rate up conversion [29, 5, 50], slow mo-
tion generation [18, 4, 30, 32, 27, 34] and video restora-
tion [21, 43, 14, 42]. However, VFI is significantly chal-
lenging, which is attributed to diverse factors such as oc-
clusions, large motions and change of light. Recent deep-
learning-based VFI has been actively studied, showing re-
markable performances [48, 4, 7, 37, 25, 13, 31, 51, 6, 33].
However, they are often optimized for existing LFR bench-
mark datasets of low resolution (LR), which may lead to
poor VFI performance, especially for videos of 4K resolu-
tion (4096×2160) or higher with very large motion [1, 21].
Such 4K videos often contain frames of fast motion with extremely large pixel displacements, for which conventional convolutional neural networks (CNNs) with receptive fields of limited sizes do not work effectively.
To solve the above issues for deep learning-based
VFI methods, we directly photographed 4K videos to
construct a high-quality HFR dataset of high resolution,
called X4K1000FPS. Fig. 1 shows some examples of our
X4K1000FPS dataset. As shown, our videos of 4K resolu-
tion have extremely large motions and occlusions.

[Figure 2 panels: overlapped 4K inputs (crop); FeFlow; DAIN; XVFI-Net (Ours); 288 pixels / 179 pixels]
Figure 2. VFI results for extreme motions. Our XVFI-Net can generate a more stable intermediate frame with very large motions than two recent SOTA methods, FeFlow [13] and DAIN [4], which are newly trained on our dataset for fair comparisons.
We also first propose an extreme VFI model, called XVFI-Net, that is effectively designed to handle such a challenging dataset of 4K@1000fps. Instead of directly capturing extreme motions through consecutive feature spaces with deformable convolution, as in recent trends in video restoration [13, 47, 43, 42, 20], or using very large-sized pretrained networks with extra information such as contexts, depths, flows and edges [4, 51, 30, 13], our XVFI-Net is simple but effective, being based on a recursive multi-scale shared structure. The XVFI-Net has two cascaded modules: one for the bidirectional optical flow learning between two input frames (BiOF-I) and the other for the bidirectional optical flow estimation from the target to the input frames (BiOF-T). The BiOF-I and BiOF-T modules are trained in combination with multi-scale losses. However, once trained, the BiOF-I module can start from any down-scaled input upward while the BiOF-T module only operates at the original input scale at inference, which is computationally efficient and helps to generate an intermediate frame at any target time instance. Structurally, the XVFI-Net is adjustable in terms of the number of scales for inference according to the input resolution or the motion magnitude, even after training is over. We also propose a novel optical flow estimation from time t to the input frames, called complementary flow reversal (CFR), which effectively fills the holes by taking complementary flows. Extensive experiments are conducted for fair comparison, and our XVFI-Net, which has a relatively small complexity, outperforms previous VFI SOTA methods on our X4K1000FPS, especially for extreme motions, as shown in Fig. 2. A further experiment on the previous LR-LFR benchmark dataset also demonstrates the robustness of our XVFI-Net. Our contributions are summarized as follows:
• We first propose a high-quality HFR video dataset of 4K resolution, called X4K1000FPS (4K@1000fps), which contains a wide variety of textures, extremely large motions, zooming and occlusions.
• We propose the CFR, which can generate stable optical flow estimates from time t to the input frames, boosting both qualitative and quantitative performance.
• Our proposed XVFI-Net can start from any down-scaled input upward and is adjustable in terms of the number of scales for inference according to the input resolution or the motion magnitude.
• Our XVFI-Net achieves state-of-the-art performance on the testset of X4K1000FPS with a significant margin compared to the previous VFI SOTA methods while being computationally efficient with a small number of filter parameters. All source codes and the proposed X4K1000FPS dataset are publicly available at https://github.com/JihyongOh/XVFI.
2. Related Work
2.1. Video Frame Interpolation
Most VFI methods can be categorized into optical flow-
or kernel-based [27, 32, 18, 30, 34, 48, 4, 1, 2, 25, 33, 31]
and pixel hallucination-based [13, 47, 7, 37, 21] methods.
Flow-based VFI. Super-SloMo [18] first linearly combines
predicted optical flows between two input frames to approx-
imate flows from the target intermediate frame to the input
frames. Quadratic video frame interpolation [48] utilizes
four input frames to cope with nonlinear motion modeling
by quadratic approximation, which limits the VFI general-
ization when two input frames are given. It also proposes
flow reversal (projection) for more accurate image warp-
ing. On the other hand, DAIN [4] gives different weights of
overlapped flow vectors depending on the object depth of
the scene via a flow projection layer. However, DAIN employs and fine-tunes both PWC-Net [41] and MegaDepth [26], which is computationally burdensome when inferring intermediate HR frames. AdaCoF proposes a generalized warping module to deal with complex motion [25]. However, once trained, it cannot adapt to frames of higher resolutions due to its fixed dilation degree.
Pixel Hallucination-based VFI. FeFlow [13] benefits from deformable convolution [9] in its center-frame generator by replacing optical flows with offset vectors. Zooming Slow-Mo [47] also interpolates middle frames with the help of deformable convolution in the feature domain. However, since these methods directly hallucinate pixels, unlike the flow-based VFI methods, the predicted frames tend to be blurry when fast-moving objects are present.
Most importantly, the aforementioned VFI methods are
difficult to operate on the entire HR frames at once, due to
their heavy computational complexity. On the other hand,
our XVFI-Net is designed to efficiently operate on the entire

4K frame input at once with a smaller number of parameters
and is capable of effectively capturing large motions.
2.2. Networks for Large Pixel Displacements
PWC-Net [41] is a state-of-the-art optical flow estimator that has been adopted in several VFI methods as a pretrained flow estimator [48, 4, 31]. Since PWC-Net has a 6-level feature pyramid structure with larger receptive fields, it can effectively predict large motions. IM-Net [34] also adopts a multi-scale structure to cover large displacements of objects in adjacent frames, while the coverage is limited by the size of the adaptive filters. Despite the multi-scale pyramid structures, the above methods lack adaptivity because the coarsest level of each network is fixed once trained, i.e., each scale level has its own (not shared) parameters. The RRPN [51] shares weight parameters across different scale levels in a flexible recurrent pyramid structure. However, it only infers the center frame, not frames at arbitrary time instances, so it can only recursively synthesize intermediate frames at dyadic time positions (powers of 1/2). As a result, prediction errors accumulate as intermediate frames are generated iteratively between the two input frames. Therefore, RRPN has limited temporal flexibility for VFI at an arbitrary target time instance t.
Distinguished from the above methods, our proposed
XVFI-Net has a scalable structure with shared parameters
for various input resolutions. Different from RRPN [51], the
XVFI-Net is structurally divided into the BiOF-I and BiOF-
T modules, which allows predicting an intermediate frame
at arbitrary time t with the help of the complementary flow
reversal in an efficient way. That is, the BiOF-T module can be skipped at the down-scaled levels during inference, so that our model can infer an intermediate 4K frame at once, without any patch-wise iteration unlike all other previous methods, which makes it applicable to real-world applications.
3. Proposed X4K1000FPS Dataset
Although numerous methods for VFI have been both
trained and evaluated over the diverse benchmark datasets,
such as Adobe240fps [40], DAVIS [35], UCF101 [39], Mid-
dlebury [3] and Vimeo90K [49], none of the datasets con-
tains rich amounts of 4K videos with HFR. This limits the study of elaborate VFI methods required for applications targeting very high resolution videos.
To tackle the challenging extreme VFI task, we provide a rich set of 4K@1000fps videos that we photographed using a Phantom Flex4K™ camera at the 4K spatial resolution of 4096×2160 and 1,000 fps, producing 175 video scenes, each with 5,000 frames obtained by shooting for 5 seconds.
In order to select valuable data samples for VFI, we es-
timated bidirectional occlusion maps and optical flows of
every 32 frames of the scenes using IRR-PWC [16]. The
occlusion map indicates the parts of objects that become occluded in the next frames. The occlusion makes optical flow estimation and frame interpolation challenging [44, 4, 16]. Thus, we manually selected 15 scenes as our testset, called X-TEST, by considering the degrees of occlusion, optical flow magnitudes and scene diversity. Each scene of X-TEST contains one test sample that consists of two input frames at a temporal distance of 32 frames, which approximately corresponds to 30 fps. The test evaluation is set to interpolate 7 intermediate frames, which results in consecutive frames at 240 fps. For the training dataset, called X-TRAIN, we cropped and selected 4,408 clips of size 768×768 with lengths of 65 consecutive frames, by considering the amounts of occlusion. More details are described in Supplementary Material.

Dataset              Occlusion [16]           Flow magnitude [16]
                     25th    50th    75th     25th    50th    75th
Vimeo90K [49]         6.8    11.9    18.1      3.1     4.9     7.1
Adobe240fps [40]      0.8     1.7     3.2      3.8     8.9    16.3
X-TEST (ours)         2.1     5.6    17.7     23.9    81.9   138.5
X-TRAIN (ours)        6.9    10.1    15.7      5.5    18.0    59.5
(25th, 50th and 75th represent percentiles of each dataset.)

Table 1. The occlusion and optical flow magnitude statistics of VFI datasets: 3,782 test triplets of Vimeo90K [49], randomly selected 200 clips of Adobe240fps [40], 15 clips of X-TEST and 4,408 clips of X-TRAIN.
Table 1 compares the statistics of datasets: Vimeo90K
[49], Adobe240fps [40], our X-TEST and X-TRAIN. We
estimated the occlusion range in [0,255] and optical flow
magnitudes [16] between input pairs and calculated their
percentiles for each dataset. As shown in Table 1, our
datasets contain comparable occlusion but significantly
larger motion, compared to the previous VFI datasets.
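As a rough sketch of how such statistics can be gathered, the snippet below assumes that the bidirectional occlusion maps (values in [0, 255]) and optical flows have already been estimated with IRR-PWC and stored as NumPy arrays; whether the percentiles are taken over per-clip means or over all pixels is not stated above, so per-clip means are used here as an assumption.

```python
import numpy as np

def dataset_statistics(occ_maps, flows, percentiles=(25, 50, 75)):
    """Sketch of the Table 1 statistics.
    occ_maps: iterable of (H, W) occlusion maps in [0, 255];
    flows:    iterable of (H, W, 2) optical flows in pixel units."""
    occ_vals, mag_vals = [], []
    for occ, flow in zip(occ_maps, flows):
        occ_vals.append(float(occ.mean()))                            # per-clip mean occlusion
        mag_vals.append(float(np.linalg.norm(flow, axis=-1).mean()))  # per-clip mean flow magnitude
    return (np.percentile(occ_vals, percentiles),   # occlusion 25th / 50th / 75th
            np.percentile(mag_vals, percentiles))   # flow magnitude 25th / 50th / 75th
```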
4. Proposed Method: XVFI-Net Framework
4.1. Design Considerations
Our XVFI-Net aims at interpolating an intermediate frame $I_t$ at an arbitrary time t between two consecutive input frames, $I_0$ and $I_1$, of HR with extreme motion.
Scale Adaptivity. An architecture with a fixed number of scale levels, like PWC-Net [41], is difficult to adapt to various spatial resolutions of the input video, because the structure in each scale level is not shared across different scale levels, so a new architecture with an increased scale depth needs to be retrained. In order to have scale adaptivity to various spatial resolutions of input frames, our XVFI-Net is designed to have optical flow estimation starting at any desired coarse scale level, adapting to the degree of motion magnitudes in the input frames. To do so, our XVFI-Net shares its parameters across different scale levels.
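To illustrate this, the following is a minimal sketch of how such a scale-adaptive, coarse-to-fine inference loop could look; the callables biof_i, biof_t, and synthesize, their interfaces, and the handling of the coarsest scale are assumptions made for illustration, not the exact interface of the released code.

```python
def xvfi_inference(C_pyramid, t, S_test, biof_i, biof_t, synthesize):
    """Scale-adaptive inference sketch. C_pyramid[s] holds the contextual
    features (C_0^s, C_1^s) at scale s; S_test is chosen freely at test time
    according to the input resolution or expected motion magnitude, since the
    same shared-parameter modules are reused at every scale."""
    F01, F10 = None, None                           # no initial flows at the coarsest scale
    for s in range(S_test, -1, -1):                 # coarse-to-fine: S_test, ..., 1, 0
        C0_s, C1_s = C_pyramid[s]
        F01, F10, z = biof_i(C0_s, C1_s, F01, F10)  # BiOF-I is applied at every scale
    # BiOF-T (and frame synthesis) only runs at the original input scale s = 0
    Ft0, Ft1 = biof_t(C_pyramid[0], F01, F10, z, t)
    return synthesize(Ft0, Ft1, t)
```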
Capturing Large Motion. In order to effectively capture a large motion between two input frames, the Feature Extraction Block of XVFI-Net first reduces the spatial resolution of the two input frames by a module scale factor M via a strided convolution, thus yielding the spatially reduced features that are then converted to two contextual feature maps $C^0_0$ and $C^0_1$. The Feature Extraction Block in Fig. 3 is composed of the strided convolution and two residual blocks [15]. Then, XVFI-Net at each scale level estimates optical flows from the target frame $I_t$ to the two input frames in the spatial size reduced by M. The predicted flows are upscaled (×M) to warp the input frames at each scale level to time t.
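As a concrete illustration, below is a minimal PyTorch sketch of such a Feature Extraction Block; the channel width nf, the kernel sizes, and the value M = 4 are illustrative assumptions rather than the exact configuration of XVFI-Net.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # Plain two-layer residual block [15]; the channel width is an assumption.
    def __init__(self, nf):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nf, nf, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, nf, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractionBlock(nn.Module):
    """Strided convolution that reduces the spatial size by the module scale
    factor M, followed by two residual blocks, yielding a contextual feature
    map (C_0^0 or C_1^0) from an input frame."""
    def __init__(self, in_ch=3, nf=64, M=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, nf, kernel_size=M, stride=M)  # spatial reduction by M
        self.res = nn.Sequential(ResBlock(nf), ResBlock(nf))

    def forward(self, frame):
        return self.res(self.down(frame))
```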
4.2. XVFI-Net Architecture
BiOF-I module. Fig. 4 shows our XVFI-Net architecture at scale s, where $I^s$ denotes the input bicubically down-scaled by $1/2^s$. First, the contextual pyramid $C = \{C^s\}$ is recurrently extracted from $C^0_0$ and $C^0_1$ via a stride-2 convolution, and then utilized as an input for XVFI-Net at each scale level s (s = 0, 1, 2, ...), where s = 0 denotes the scale of the original input frames. Let $F^s_{t_a t_b}$ denote the optical flow from time $t_a$ to $t_b$ at scale s. $F^s_{01}$ and $F^s_{10}$ are the bidirectional flows between the input frames at scale s. $F^s_{t0}$ and $F^s_{t1}$ are the bidirectional flows from $I^s_t$ to $I^s_0$ and $I^s_1$, respectively. The estimated flows $F^{s+1}_{01}$, $F^{s+1}_{10}$ from the previous scale (s+1) are ×2 bilinearly up-scaled to be set as the initial flows for the current scale s, i.e., $\tilde{F}^s_{01} = F^{s+1}_{01}\uparrow_2$, $\tilde{F}^s_{10} = F^{s+1}_{10}\uparrow_2$. To update the initial flows at the current scale, $C^s_0$ and $C^s_1$ are first warped by the initial flows, that is, $\tilde{C}^s_{01} = W(\tilde{F}^s_{01}, C^s_1)$ and $\tilde{C}^s_{10} = W(\tilde{F}^s_{10}, C^s_0)$, respectively, where W is a backward warping operation [17]. Then $\tilde{C}^s_{01}$, $\tilde{C}^s_{10}$, $C^s_0$, $C^s_1$ together with $\tilde{F}^s_{01}$, $\tilde{F}^s_{10}$ are fed to the auto-encoder-based BiFlownet as in Fig. 4 to output residual flows over the initial flows and a trainable importance mask z [31]. Then $F^s_{01}$, $F^s_{10}$ are obtained. They are then fed as input to the BiOF-T module and are also used as the initial flows for the next scale s − 1.
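A minimal PyTorch sketch of one such BiOF-I update at a scale s is given below; the grid_sample-based backward warping, the ×2 scaling of flow magnitudes along with the spatial up-scaling, and the BiFlownet interface (returning two residual flows and the importance mask z) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Backward warping W(flow, feat): sample feat at positions displaced by flow.
    feat: (B, C, H, W); flow: (B, 2, H, W) in pixel units, channel order (x, y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                             # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0                 # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def biof_i_step(C0_s, C1_s, F01_prev, F10_prev, biflownet):
    """One BiOF-I update at scale s (sketch); F01_prev, F10_prev come from scale s+1."""
    # x2 bilinear up-scaling of the previous-scale flows; displacements doubled as well
    F01_init = 2.0 * F.interpolate(F01_prev, scale_factor=2, mode="bilinear", align_corners=False)
    F10_init = 2.0 * F.interpolate(F10_prev, scale_factor=2, mode="bilinear", align_corners=False)
    # warp the contextual features by the initial flows
    C01_w = backward_warp(C1_s, F01_init)   # ~C_01^s = W(~F_01^s, C_1^s)
    C10_w = backward_warp(C0_s, F10_init)   # ~C_10^s = W(~F_10^s, C_0^s)
    inp = torch.cat([C0_s, C1_s, C01_w, C10_w, F01_init, F10_init], dim=1)
    res01, res10, z = biflownet(inp)        # residual flows and importance mask z (assumed)
    return F01_init + res01, F10_init + res10, z
```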
BiOF-T module. Hereafter, we omit the superscript s in the notation of feature tensors at each scale, unless mentioned. Although the linear approximation with the optical flows $F_{01}$, $F_{10}$ [18] or the flow reversal of $F_{0t}$, $F_{1t}$ [48] allows estimating the flows $F_{t0}$, $F_{t1}$ at an arbitrary time t, there are a few shortcomings. The linear approximation is inaccurate in predicting $F_{t0}$ and $F_{t1}$ for fast-moving objects because the anchor points of $F_{01}$ and $F_{10}$ are severely misaligned. On the other hand, the flow reversal can align the anchor points, but holes may appear in the estimated $F_{t0}$ and $F_{t1}$. To stabilize the performance of the flow reversal, we take complementary advantages of both the linear approximation and the flow reversal. So, a stable optical flow estimate from time t to 0 or 1 can be computed by a normalized linear combination of a negative anchor flow and a complementary flow, which we call a complementary flow reversal (CFR). The resulting complementary reversed optical flow maps $\tilde{F}_{t0}$ and $\tilde{F}_{t1}$, from time t to 0 and 1, are given by
[Figure 3 (schematic): the Feature Extraction Block and recurrent stride-2 convolutions produce contextual features $C^s_0$, $C^s_1$ for scales 0, 1, ..., S; shared-parameter BiOF-I and BiOF-T modules (Fig. 4) run per scale with up-scaled flows passed between scales; training paths and possible test paths are marked.]
Figure 3. Adjustable and efficient scalability of our XVFI-Net framework. Even if the lowest scale depth $S_{trn}$ during training is set to 1 in this example, inference can start from any scale level.
$$\tilde{F}^x_{t0} = \frac{(1-t)\sum_{N_0} w_0\cdot\left(-F^y_{0t}\right) \;+\; t\sum_{N_1} w_1\cdot F^y_{1\cdot(1-t)}}{(1-t)\sum_{N_0} w_0 \;+\; t\sum_{N_1} w_1} \qquad (1)$$

$$\tilde{F}^x_{t1} = \frac{(1-t)\sum_{N_0} w_0\cdot F^y_{0\cdot(1-t)} \;+\; t\sum_{N_1} w_1\cdot\left(-F^y_{1t}\right)}{(1-t)\sum_{N_0} w_0 \;+\; t\sum_{N_1} w_1} \qquad (2)$$
where x denotes a pixel location at time t and y a pixel location at time 0 or 1. $w_i = z^y_i \cdot G(|x - (y + F^y_{it})|)$ is a Gaussian weight depending on the distance between x at time t and $y + F^y_{it}$ at time i (= 0 or 1), while also considering the learnable importance mask of each flow via $z^y_i$ [31]. Also, $-F^y_{0t}$ (or $-F^y_{1t}$) and $F^y_{1\cdot(1-t)}$ (or $F^y_{0\cdot(1-t)}$) in Eq. 1 (or Eq. 2) are defined as a negative anchor flow and a complementary flow, respectively. Furthermore, the anchor flows are normalized flows to the intermediate time t that can be calculated as $F_{0t} = tF_{01}$ and $F_{1t} = (1-t)F_{10}$. It should be noted in Eq. 1 and Eq. 2 that the complementary flows are also normalized, as $F_{1\cdot(1-t)} = tF_{10}$ and $F_{0\cdot(1-t)} = (1-t)F_{01}$, and complementarily fill the holes that occur in the reversed flows. By doing so, we can fully exploit the temporally densely captured X4K1000FPS dataset to train our XVFI-Net for VFI at an arbitrary time t. The neighborhoods of x are defined as
$$N_0 = \{\, y \mid \mathrm{round}(y + F^y_{0t}) = x \,\} \qquad (3)$$

$$N_1 = \{\, y \mid \mathrm{round}(y + F^y_{1t}) = x \,\}. \qquad (4)$$
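To make the forward-splatting structure of Eqs. (1)-(4) explicit, here is a naive per-pixel NumPy reference; it is a sketch written for clarity rather than speed, it is not the released CUDA/PyTorch implementation, and the Gaussian bandwidth sigma is an assumption since it is not specified in the text.

```python
import numpy as np

def gaussian(dist, sigma=1.0):
    # G(.) in the weights w_i; sigma is an assumed bandwidth.
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def complementary_flow_reversal(F01, F10, z0, z1, t, eps=1e-8):
    """Naive reference of CFR (Eqs. 1-4).
    F01, F10: (H, W, 2) bidirectional flows between the inputs, in (x, y) pixel units.
    z0, z1:   (H, W) importance masks. Returns ~F_t0 and ~F_t1, each (H, W, 2)."""
    H, W, _ = F01.shape
    # anchor flows normalized to time t, and complementary flows normalized to (1 - t)
    F0t, F1t = t * F01, (1.0 - t) * F10
    F0c, F1c = (1.0 - t) * F01, t * F10        # F_{0.(1-t)}, F_{1.(1-t)}
    num_t0 = np.zeros((H, W, 2)); num_t1 = np.zeros((H, W, 2)); den = np.zeros((H, W, 1))
    for F_it, F_c, z, coef, from_frame0 in ((F0t, F0c, z0, 1.0 - t, True),
                                            (F1t, F1c, z1, t, False)):
        for yy in range(H):
            for xx in range(W):
                tx = xx + F_it[yy, xx, 0]      # splat target y + F_it^y
                ty = yy + F_it[yy, xx, 1]
                xi, yi = int(round(tx)), int(round(ty))
                if 0 <= xi < W and 0 <= yi < H:
                    w = coef * z[yy, xx] * gaussian(np.hypot(xi - tx, yi - ty))
                    if from_frame0:            # terms over N_0 in Eqs. (1) and (2)
                        num_t0[yi, xi] += w * (-F_it[yy, xx])
                        num_t1[yi, xi] += w * F_c[yy, xx]
                    else:                      # terms over N_1 in Eqs. (1) and (2)
                        num_t0[yi, xi] += w * F_c[yy, xx]
                        num_t1[yi, xi] += w * (-F_it[yy, xx])
                    den[yi, xi] += w
    return num_t0 / (den + eps), num_t1 / (den + eps)
```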
To refine the bidirectional flow approximations $\tilde{F}_{t0}$, $\tilde{F}_{t1}$, we rewarp the feature maps ($C_0$, $C_1$) to $\tilde{C}_{t0}$ and $\tilde{C}_{t1}$ by $\tilde{F}_{t0}$ and $\tilde{F}_{t1}$, respectively. We concatenate and feed $C_0$, $C_1$, $\tilde{C}_{t0}$, $\tilde{C}_{t1}$, and $\tilde{F}_{t0}$, $\tilde{F}_{t1}$ to the auto-encoder-based TFlownet as in Fig. 4 (similarly to the refinement of $\tilde{F}_{01}$, $\tilde{F}_{10}$). The outputs of TFlownet are used to compose the refined flows $F_{t0}$, $F_{t1}$, which are then bilinearly up-scaled (×M) back to the size of the input frames at scale s.
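A short sketch of this refinement step follows, reusing the backward_warp helper from the BiOF-I sketch above; treating the TFlownet outputs as residual flows added to the CFR estimates, and the value M = 4, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def biof_t_refine(C0, C1, Ft0_tilde, Ft1_tilde, tflownet, M=4):
    """Refine the CFR flow approximations and up-scale them back to the frame size."""
    # rewarp the contextual feature maps toward time t by the approximated flows
    Ct0_w = backward_warp(C0, Ft0_tilde)   # backward_warp as defined in the BiOF-I sketch
    Ct1_w = backward_warp(C1, Ft1_tilde)
    inp = torch.cat([C0, C1, Ct0_w, Ct1_w, Ft0_tilde, Ft1_tilde], dim=1)
    res_t0, res_t1 = tflownet(inp)         # assumed: residual flows over the CFR estimates
    Ft0, Ft1 = Ft0_tilde + res_t0, Ft1_tilde + res_t1
    # bilinear up-scaling (xM) back to the input frame size; displacements scaled by M too
    Ft0_up = M * F.interpolate(Ft0, scale_factor=M, mode="bilinear", align_corners=False)
    Ft1_up = M * F.interpolate(Ft1, scale_factor=M, mode="bilinear", align_corners=False)
    return Ft0_up, Ft1_up
```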