
DreaMoving: A Human Video Generation Framework based on Diffusion Models

Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo,
Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei,
Miaomiao Cui, Peiran Ren, Xuansong Xie

Alibaba Group

{mengyang.fmy, ljl191782, jinmao.yk, ryan.yy, huizheng.hz, guoxiefan.gxf,
xianhui.lxh, haolan.xhl, zhicheng.sc, lxw262398, liaojie.laj, kangxiaoyang.kxy, biwen.lbw,
miaomiao.cmm, peiran.rpr, xingtong.xxs}@alibaba-inc.com

arXiv:2312.05107v2 [cs.CV] 11 Dec 2023
Abstract

In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given a target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere, driven by the posture sequences. To this end, we propose a Video ControlNet for motion control and a Content Guider for identity preservation. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results. The project page is available at https://dreamoving.github.io/dreamoving.

1. Introduction

Recent text-to-video (T2V) models like Stable-Video-Diffusion¹ and Gen2² have made breakthrough progress in video generation. However, human-centered content generation, especially character dance, remains a challenge. The problem involves the lack of open-source human dance video datasets and the difficulty of obtaining corresponding precise text descriptions, making it challenging to train a T2V model that generates videos with intraframe consistency, longer length, and diversity. Besides, personalization and controllability stand as paramount challenges in the realm of human-centric content generation, attracting substantial scholarly attention. Representative research like ControlNet [13] is proposed to control the structure in conditional image generation, while DreamBooth [10] and LoRA [6] show the ability to control appearance through images. However, these techniques often fail to offer precise control over motion patterns or necessitate hyperparameter fine-tuning specific to target identities, introducing an additional computational burden. Customized video generation is still under investigation and represents a largely uncharted territory. In this paper, we present a human dance video generation framework based on diffusion models (DM), named DreaMoving.

The rest of the paper is organized as follows. Sec. 2 presents a detailed description of how DreaMoving is built. Sec. 3 presents results generated by our method.

¹ https://stability.ai/news/stable-video-diffusion-open-ai-video-model
² https://research.runwayml.com/gen2

2. Architecture

DreaMoving is built upon Stable-Diffusion [9] models. As illustrated in Fig. 1, it consists of three main networks: the Denoising U-Net, the Video ControlNet, and the Content Guider. Inspired by AnimateDiff [5], we insert motion blocks after each U-Net block in the Denoising U-Net and the ControlNet. The Video ControlNet and the Content Guider work as two plug-ins of the Denoising U-Net for controllable video generation. The former is responsible for motion control while the latter is in charge of the appearance representation.

2.1. Data Collection and Preprocessing

To gain better performance in generating human videos, we collected around 1,000 high-quality human dance videos from the Internet. As the training of the temporal module needs continuous frames without any transitions or special effects, we further split the videos into clips and finally obtained around 6,000 short videos (8∼10 seconds). For the text descriptions, we take Minigpt-v2 [3] as the video captioner. Specifically, using the "grounding" version, the instruction is "[grounding] describe this frame in a detailed manner". The generated caption of the centered frame among the keyframes represents the whole video clip's description, mainly describing the content of the subject and background faithfully.
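As an illustration of this preprocessing step, the sketch below splits a video into roughly 10-second clips and captions the center frame of each clip; `caption_frame` is a hypothetical wrapper around the Minigpt-v2 [3] grounding captioner, not its actual API.

```python
# Hedged sketch of the clip-splitting and captioning step in Sec. 2.1.
# `caption_frame` stands in for a Minigpt-v2 [3] call with the instruction
# "[grounding] describe this frame in a detailed manner"; its real API is not shown.
import cv2

def split_and_caption(video_path: str, clip_seconds: int = 10, caption_frame=None):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    clip_len = int(fps * clip_seconds)
    clips, frames = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        if len(frames) == clip_len:
            center = frames[len(frames) // 2]          # caption the centered frame
            caption = caption_frame(center) if caption_frame else ""
            clips.append({"frames": frames, "caption": caption})
            frames = []
        ok, frame = cap.read()
    cap.release()
    return clips  # one caption per ~10 s clip, used as the clip's text description
```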
Figure 1. The overview of DreaMoving. The Video ControlNet is the image ControlNet [13] injected with motion blocks after each U-Net block. The Video ControlNet processes the control sequence (pose or depth) to additional temporal residuals. The Denoising U-Net is a derived Stable-Diffusion [9] U-Net with motion blocks for video generation. The Content Guider transfers the input text prompts and appearance expressions, such as the human face (the cloth is optional), to content embeddings for cross attention. (Example inputs shown: the text prompt "a girl in a light yellow dress, on the beach", a reference face image, and an optional cloth image.)

2.2. Motion Block

To improve the temporal consistency and motion fidelity, we integrate motion blocks into both the Denoising U-Net and the ControlNet. The motion block is extended from AnimateDiff [5], and we enlarge the temporal sequence length to 64. We first initialize the weights of the motion blocks from AnimateDiff (mm_sd_v15.ckpt) and fine-tune them on the private human dance video data.
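To make the design concrete, the following is a minimal sketch of an AnimateDiff-style motion block: temporal self-attention applied along the frame axis only, added residually after a spatial U-Net block. The class name, shapes, and the learned positional embedding are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: an AnimateDiff-style temporal ("motion") block that attends
# across the frame axis only. Shapes and module names are assumptions for
# illustration; the actual DreaMoving/AnimateDiff code may differ.
import torch
import torch.nn as nn


class MotionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8, max_frames: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Learned positional embedding over the (enlarged) temporal axis.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) feature map from a U-Net block.
        b, f, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs along the frame axis only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        q = self.norm(tokens) + self.pos_emb[:, :f]
        attn_out, _ = self.attn(q, q, q)
        tokens = tokens + attn_out  # residual: keeps the pretrained image prior intact
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```

Because attention in this sketch runs only across frames at each spatial location, such a block adds temporal coherence without disturbing the spatial layers inherited from the image model.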
2.3. Content Guider (1)
The Content Guider is designed to control the content of where, Q = ZWq , Kt = ct Wkt , and Vt = ct Wvt are the
the generated video, including the appearance of human and query, key, and values matrices from the text features, Kf =
the background. One simple way is to describe the human cf Wkf , and Vf = cf Wvf are the key, and values matrices
appearance and background with a text prompt, such as ’a from the face features, and Kc = cc Wkc , and Vc = cc Wvc
girl in a white dress, on the beach’. However, it is hard are the key, and values matrices from the cloth features. αf
to describe a personalized human appearance for a normal and αc are the weights factor.
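The sketch below shows one way Eq. (1) could be realized as a decoupled cross-attention layer in the spirit of IP-Adapter [12]: separate key/value projections per content source, with the face and cloth terms weighted by αf and αc. The class and parameter names are assumptions for illustration, not the released code.

```python
# Hedged sketch of the decoupled cross-attention in Eq. (1): separate key/value
# projections for text, face, and cloth features, with the face and cloth terms
# scaled by alpha_f and alpha_c. Names and shapes are assumptions for illustration.
import math
import torch.nn as nn


class ContentCrossAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # One key/value projection pair per content source.
        self.kv = nn.ModuleDict({
            name: nn.ModuleDict({"k": nn.Linear(dim, dim, bias=False),
                                 "v": nn.Linear(dim, dim, bias=False)})
            for name in ("text", "face", "cloth")
        })

    @staticmethod
    def _attend(q, k, v):
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return scores.softmax(dim=-1) @ v

    def forward(self, z, c_text, c_face, c_cloth=None, alpha_f=1.0, alpha_c=1.0):
        # z: (batch, tokens, dim) U-Net features; c_*: (batch, length, dim) content embeddings.
        q = self.to_q(z)
        out = self._attend(q, self.kv["text"]["k"](c_text), self.kv["text"]["v"](c_text))
        out = out + alpha_f * self._attend(q, self.kv["face"]["k"](c_face),
                                           self.kv["face"]["v"](c_face))
        if c_cloth is not None:  # the cloth/body image prompt is optional
            out = out + alpha_c * self._attend(q, self.kv["cloth"]["k"](c_cloth),
                                               self.kv["cloth"]["v"](c_cloth))
        return out
```

Setting alpha_f = alpha_c = 0 reduces the layer to plain text-conditioned cross-attention, matching the behaviour described later in Sec. 2.5.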
2.4. Model Training

2.4.1 Content Guider Training

The Content Guider serves as an independent module for base diffusion models: once trained, it can be generalized to other customized diffusion models. We trained the Content Guider based on SD v1.5 and used OpenCLIP ViT-H/14 [7] as the image encoder, following [12]. To better preserve the identity of the reference face, we employ the ArcFace [4] model to extract the facial correlated features as a supplement to the CLIP features. We collect the human data from LAION-2B, then detect faces and filter out images without them. During training, the data are randomly cropped and resized to 512 × 512. The Content Guider is trained on a single machine with 8 V100 GPUs for 100k steps; the batch size is set to 16 per GPU, and the AdamW optimizer [8] is used with a fixed learning rate of 1e−4 and a weight decay of 1e−2.
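For concreteness, a minimal sketch of the training settings quoted above (AdamW with a fixed learning rate of 1e−4 and weight decay of 1e−2, 512 × 512 random crop-and-resize); the `content_guider` module passed in is a hypothetical placeholder.

```python
# Hedged sketch of the Content Guider training configuration described above
# (AdamW, lr 1e-4, weight decay 1e-2, 512x512 random crop-and-resize).
# `content_guider` is a hypothetical placeholder for the trainable module.
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(512),   # random crop, resized to 512 x 512
    transforms.ToTensor(),
])

def build_optimizer(content_guider: torch.nn.Module) -> torch.optim.Optimizer:
    # Fixed learning rate and decoupled weight decay, as reported in Sec. 2.4.1.
    return torch.optim.AdamW(content_guider.parameters(), lr=1e-4, weight_decay=1e-2)
```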
2.4.2 Long-Frame Pretraining

We first conduct a warm-up training stage to extend the sequence length of the motion module from 16 to 64 on the validation set (5k video clips) of WebVid-10M [1]. We only train the motion module of the Denoising U-Net and freeze the weights of the rest of the network. No ControlNet or image guidance is involved in this stage. The learning rate is set to 1e−4 and the resolution is 256 × 256 (resize & center crop). The training is stopped after 10k steps with a batch size of 1.

2.4.3 Video ControlNet Training

After the long-frame pretraining, we train the Video ControlNet together with the Denoising U-Net by unfreezing both the motion blocks and the U-Net blocks in the Video ControlNet while fixing the Denoising U-Net. The weights of the motion blocks in the Video ControlNet are initialized from the long-frame pretraining stage. In this stage, we train the network on the collected 6k human dance videos. No image guidance is involved in this stage. The human pose or depth is extracted as the input of the Video ControlNet using DWPose [11] or ZoeDepth [2], respectively. The learning rate is set to 1e−4 and the resolution is 352 × 352. The training is stopped after 25k steps with a batch size of 1.
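The control inputs for this stage can be prepared per frame and stacked into a sequence, as in the sketch below; the estimator callable stands in for a DWPose [11] or ZoeDepth [2] wrapper, whose real APIs are not shown here.

```python
# Hedged sketch of building the control sequence for the Video ControlNet.
# The `estimator` callable stands in for a per-frame pose (DWPose [11]) or
# depth (ZoeDepth [2]) extractor; their real APIs are not reproduced here.
from typing import Callable, List
import numpy as np

def build_control_sequence(frames: List[np.ndarray],
                           estimator: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    # Apply the per-frame estimator (pose skeleton map or depth map) and stack
    # the results along a new time axis: (frames, height, width, channels).
    return np.stack([estimator(frame) for frame in frames], axis=0)
```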
2.4.4 Expression Fine-tuning

To gain better performance in human expression generation, we further fine-tune the motion blocks in the Denoising U-Net by training with the Video ControlNet on the collected 6k human dance videos. In this stage, the whole Video ControlNet and the U-Net blocks in the Denoising U-Net are locked, and only the weights of the motion blocks in the Denoising U-Net are updated. The learning rate is set to 5e−5 and the resolution is 512 × 512. The training is stopped after 20k steps with a batch size of 1.
2.5. Model Inference

At the inference stage, the inputs are composed of the text prompt, the reference images, and the pose or depth sequence. The control scale of the Video ControlNet is set to 1.0 when only pose or only depth is used. Our method also supports multi-ControlNet, so the depth and pose Video ControlNets can be used simultaneously. The strength of the face/body guidance is also controllable in the Content Guider by adjusting αf and αc in Eqn. 1: the content is fully controlled by the text prompt if αf = αc = 0.
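To summarize how these inference-time knobs fit together, here is a hedged sketch of an input bundle for a hypothetical pipeline wrapper; the field names are assumptions, while the default values mirror the settings described above.

```python
# Hedged sketch of the inference inputs described in Sec. 2.5. The dataclass and
# its field names are assumptions for illustration, not a documented DreaMoving API;
# the default values mirror the settings stated in the text.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np


@dataclass
class InferenceInputs:
    text_prompt: str                              # background / content description
    control_sequence: List[np.ndarray]            # pose and/or depth frames for the Video ControlNet(s)
    face_image: Optional[np.ndarray] = None       # reference face for the Content Guider
    cloth_image: Optional[np.ndarray] = None      # optional cloth/body reference
    controlnet_scale: float = 1.0                 # 1.0 when only pose or only depth is used
    alpha_f: float = 1.0                          # face guidance weight in Eq. (1)
    alpha_c: float = 1.0                          # cloth guidance weight in Eq. (1)
    # With alpha_f = alpha_c = 0, the content is fully controlled by the text prompt.
```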
3. Results

DreaMoving can generate high-quality, high-fidelity videos given a guidance sequence and a simple content description (text prompt only, image prompt only, or text-and-image prompts) as input. In Fig. 2, we show results with a text prompt only. To keep the face identity, the user can input a face image to the Content Guider to generate a video of a specific person (demonstrated in Fig. 3). Moreover, the user can define both the face content and the clothes content, as exhibited in Fig. 4. We further test the generalization of the proposed method on images of unseen domains: in Fig. 5, we run DreaMoving on unseen stylized images, and our method generates videos in accordance with the style and content of the input image.

Figure 2. The results of DreaMoving with a text prompt as input. Prompts: "a girl with short hair wearing black clothes in the room"; "A cheerleader wearing red and golden uniform on the football field"; "a woman with long hair wearing white suit and pants in the street".

Figure 3. The results of DreaMoving with a text prompt and a face image as inputs. Prompts: "a girl, smiling, dancing in a French town, wearing long light blue dress"; "a girl, smiling, in the park with golden leaves in autumn, wearing coat with long sleeve"; "a girl, smiling, standing on beach, wearing light yellow dress with long sleeves"; "a man, dancing in front of Pyramids of Egypt, wearing a suit with a blue tie".
Figure 4. The results of DreaMoving with face and cloth images as inputs (the input video frame is shown alongside). Prompts: "a girl, in the forest with great sunshine, wearing a long pants"; "a girl, in a fairytale town covered in snow, wearing a long pants"; "a girl, dancing in the center park, wearing a long pants".

Figure 5. The results of DreaMoving with a stylized image as input, driven by the shown pose sequence.


References

[1] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
[2] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
[3] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
[4] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[5] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
[6] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[7] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.
[8] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[10] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.
[11] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
[12] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. 2023.
[13] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
