
Dynamic Future Net: Diversified Human Motion Generation

Wenheng Chen He Wang Yi Yuan∗


NetEase Fuxi AI Lab University of Leeds NetEase Fuxi AI Lab
[email protected] [email protected] [email protected]

Tianjia Shao Kun Zhou


State Key Lab of CAD&CG, Zhejiang University          State Key Lab of CAD&CG, Zhejiang University
[email protected] [email protected]
arXiv:2009.05109v1 [cs.CV] 25 Aug 2020

Figure 1: Given a 20-frame walking motion prefix (white), our model can generate diversified motion: walking (yellow), walking-to-running (blue), walking-to-boxing (green), and walking-to-dancing (red), with arbitrary duration. The corresponding animation can be found in teaser.mp4 in the supplementary video.
∗ Corresponding author

ABSTRACT
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is challenging due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and the long term. In the short term, there is strong randomness within a couple of frames, e.g. one frame followed by multiple possible frames leading to different motion styles; in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model that explicitly focuses on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary durations, and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with the state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method, for its robustness, versatility and high quality.

1 INTRODUCTION
Modeling natural human motions is a central topic in several fields such as computer animation, bio-mechanics and virtual reality, where high-quality motion data is a necessity. Despite the improved accuracy and lowered costs of motion capture systems, it is still highly desirable to make full use of existing data to generate diversified new data. One key challenge in motion generation is dynamics modelling, where it has been shown that a latent space can be found due to the high coordination of body motions [21, 36, 41]. However, as much as the spatial aspect is studied, dynamics modelling, especially with the aim of diversified motion generation, still remains an open problem.
Human motion dynamics manifest several levels of short-term and long-term stochasticity. Given a homogeneous discretization of motions in time, the short-term stochasticity refers to the randomness in the next one or few frames (pose transitions), while the long-term stochasticity refers to the randomness in the next one or few actions (action transitions). Traditional methods model them with Finite State Machines built on carefully organized data [33], which have limited model capacities for large amounts of data and require extensive pre-processing work.
New deep learning methods either ignore this stochasticity [21] or do not model it explicitly [41]. Very recently, dynamics modelling for diversified generation has been investigated [42], but only from the perspective of the overall dynamics, rather than the detailed short-term and long-term stochasticity.
In this paper, we propose a new deep learning model, Dynamic Future Net, or DFN, for automatic and diversified high-quality motion generation based on limited amounts of data. Given a motion, we assume that it can be discretized homogeneously in time and represented by a series of postures and instantaneous velocities. Following the observation that it is easier to learn the dynamics in a natural motion latent space [21], we first embed features in the data space into a latent space. Next, DFN explicitly learns the history, current and future states at any time, where we also model several conditional distributions for the influences of the history and future states on the current state. The state-wise bidirectional modelling (extending into both the past and the future) separates DFN from existing methods and endows us with the ability to model the short-term (next-frame) randomness and the long-term (next-action) randomness. Last, for inference purposes, we propose new loss functions based on distributional similarities as opposed to point-wise estimation [41, 44], which not only capture the dynamics accurately but also keep the randomness that is crucial for diversified motion generation.
We present extensive experimental results to demonstrate DFN's robustness, versatility and high quality. Unlike existing methods which have to be trained on one type of motion at a time [42, 44], DFN can be trained on either a single type of motion or mixed motions, which shows its ability to capture multi-modal dynamics and therefore its versatility in diversified motion generation. Visual evaluation shows that DFN can generate high-quality motions with different dynamics.
In summary, our contributions include:
(1) a new deep learning model, Dynamic Future Net, for automatic, diversified and high-quality human motion generation;
(2) a new dynamic model that captures the transition stochasticity of the past, current and future states in motions;
(3) insights into the importance of both short-term and long-term dynamics in human motion modelling.

2 BACKGROUND AND RELATED WORKS

2.1 Human pose and motion embedding
Given a human motion sequence, it is useful to find a low-dimensional representation of the whole sequence. Holden et al. [22] first used a convolutional neural network to project the entire sequence to a low-dimensional embedding space. Using this more abstract representation, one can blend motions or remove noise from corrupted motions. In [20, 21], the authors further make use of the power of the learned motion manifold and decoder to synthesize motions under constraints. Another important application of motion embedding is motion style generation [8], in which the embedding code can be tuned to match the desired style. Although modelling a motion sequence with auto-encoders is straightforward, how it can model the dynamics of human motion is less clear. In [28], the authors model a motion sequence as a trajectory in a pose embedding manifold, then use a bidirectional RNN to encode the trajectory and model its dynamics, which improves motion classification results. Moreover, they design a graph-like network to better represent the human body components.
The existing methods focus on the embedding of poses and dynamics. However, they do not explicitly model the distributions of these latent variables, which govern the stochasticity of the dynamics. In this paper, we go a level deeper and learn the latent variable distributions for the embedded poses and dynamics.

2.2 Deterministic human motion prediction and synthesis
In the effort of modelling motion dynamics, many methods employ deterministic transitions [2, 4, 9, 13, 17, 19, 24, 30, 31, 38], especially in human motion prediction or generation. They either focus on short-term dynamics modelling or on the spatial-temporal information of the overall dynamics. In [44], the authors propose a training technique for RNNs to generate very long human motions. Although this technique solves the freezing phenomenon of RNNs, their model is deterministic, which makes training difficult: given a past state, if multiple possible future motions are present in the data, the network will average them, which is a common problem in many human motion prediction methods.
One solution to this problem is to introduce control signals [14, 39]. They design several networks and make the character follow a given trajectory in real time. In [35], the control signal becomes the 3D human pose predicted by neural networks, used as a reference for an agent to imitate. In [1], the authors co-embed language and the corresponding motions into a shared manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, such as a 2D human skeleton, one can still expect different motions or different poses corresponding to the same control signal [29], essentially indicating the multi-modal nature of human motion dynamics.
Different from the existing methods, our paper focuses on explicitly modelling the multi-modal nature of motion transitions in human motions. Further, we also aim to learn the stochasticity in those transitions.

2.3 Stochastic human motion synthesis
In [42], the authors combine RNNs and Generative Adversarial Networks (GANs) to generate stochastic human motions. They use a mixture density layer to model the stochastic property, and an adversarial discriminator to judge whether the generated motion is natural or not. In MoGlow [18], the authors for the first time use normalizing flows to directly model the next-frame distribution. One advantage of this method is that it can capture complex distributions without learning an explicit latent space. Given the same initial poses and the same control signals or constraints, the model still generates different motion sequences. Chen et al. [5] combine dynamic movement primitives and a variational Bayesian filter to model human motion dynamics. They show that the latent representations are self-clustering after training. However, the transitions need the information of the whole sequence, which separates it from being a pure generative model.
Our method differs from existing approaches in its treatment of the relations between the past, current and future states of human motions. Unlike the aforementioned methods, we explicitly model the current state based on both the past and the future. Also, we further model their randomness in a latent space that captures the transition multi-modality.

2.4 Stochastic RNN models
Modelling the stochasticity in time-series data such as music, handwriting and human voice has been a long-standing problem [11, 25, 40]. The VRNN [7] for the first time combines the Variational AutoEncoder (VAE) [27] and recurrent neural networks for this purpose. Later, in [23], the authors disentangle the latent variables of the observation and the dynamics, with the observation latent being used to recover the full observation information and the dynamic latent capturing the dynamics.
A key modelling choice in stochastic RNN models is the relation between the past, the current and the future. In early work, the posterior of the current state is inferred from past information only, which makes the model unable to foresee the future. In [3, 10, 37], the authors show that performance can be improved by incorporating the future state with a backward RNN in the inference stage. In [12], the authors design a model that goes beyond step-by-step modelling and predicts multiple steps up to a given horizon. A similar effort is made in reinforcement learning, where the reward function takes the discounted future reward into consideration [12, 15]. In [16], the authors go further and design a model that predicts multiple future scenarios, then chooses the one with the highest predicted reward among all the possibilities.
We observe that human motions follow a similar philosophy: the current state is a result of the past motion but also a particular choice for a certain planned future. Our research is inspired by stochastic RNN models but focuses on human motion transition stochasticity.

3 METHOD OVERVIEW
Our method takes a homogeneous series of human pose representations as input. This representation contains the 3D joint coordinates relative to the root, the root translation velocity over the ground plane and the rotation velocity around the y-axis. We propose the Dynamic Future Net to model the motion dynamics as a future-guided transition and to generate random natural human motions that transit between different actions.
As illustrated in Fig. 2, DFN is composed of three modules: a pose encoder, a pose trajectory encoder and a stochastic latent RNN. The pose encoder (Section 5.1) maps the high-dimensional human pose to a latent space, while the pose trajectory encoder (Section 5.2) embeds the trajectory in the latent pose space into a code. Such compact representations of pose sequences facilitate the learning process [28]. As the key module, the stochastic latent RNN (Section 5.3) deploys a stochastic latent state and a deterministic latent state to learn two latent distributions, one for the pose embedding and one for the future trajectory embedding. Such explicit learning of two different latent distributions on the one hand forces the model to learn strong temporal correlations, and on the other hand generates motions with varied and natural transitions. During inference we combine the past, current and future states to infer the current latent state distribution, and we combine the past and the future to infer the future latent state distribution. In the generation stage, unlike existing methods [10] where the current state is generated from the past state only, we first generate the future state and combine it with the past state to form the current latent state prior, from which we sample the current latent state, then decode it to the pose embedding and recover the current pose and velocity. We regard this process as a self-driving motion generation process guided by the envisioned dynamic future. In this way, the model can learn and generate rich and varied natural motions.

Figure 2: Overview of the proposed Dynamic Future Network. In the learning process, the network takes a human motion sequence as input and predicts the long-term distribution and the short-term (next-frame) distribution.

4 DATA PREPARATION
We train our model on the CMU human motion capture dataset. As the skeletons in the original dataset differ, we first retarget all motions to a chosen skeleton as in [21]. The skeleton contains 24 joints. We extract the X and Z global coordinates of the root and rotate the human pose to the Y-axis direction as in [22]; the global position and orientation of the pose can then be recovered from the X-Z velocity and the rotation velocity around the Y axis. The resulting human pose vector contains 76 degrees of freedom: 72 for the 3D joint positions and 4 for the global translation and rotation velocities.
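As an illustration of this frame representation, the following is a minimal sketch of how such a 76-dimensional frame vector could be assembled from raw data. The function name, the frame rate and the choice of the fourth global component are our own assumptions for illustration, not details taken from the paper, and the facing-direction normalization is omitted.

```python
import numpy as np

def make_frame_vector(joints_world, root_prev, root_curr, yaw_prev, yaw_curr, dt=1.0 / 60.0):
    """Assemble one 76-D frame vector (assumed layout: 72 joint values + 4 global values).

    joints_world: (24, 3) global joint positions of the current frame.
    root_prev, root_curr: (3,) global root positions of the previous and current frame.
    yaw_prev, yaw_curr: facing angles (radians) around the vertical (Y) axis.
    """
    # 3D joint coordinates relative to the root, as described in Section 3.
    joints_local = joints_world - root_curr[None, :]              # (24, 3) -> 72 values

    # Root translation velocity over the ground plane (X and Z components).
    vel_xz = (root_curr - root_prev)[[0, 2]] / dt                  # (2,)

    # Rotation velocity around the vertical (Y) axis.
    vel_yaw = (yaw_curr - yaw_prev) / dt                           # scalar

    # The paper uses 4 values for global translation and rotation velocity; the fourth
    # slot here (root height) is only a placeholder assumption.
    globals_ = np.array([vel_xz[0], vel_xz[1], vel_yaw, root_curr[1]])
    return np.concatenate([joints_local.reshape(-1), globals_])   # (76,)
```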
5 METHODOLOGY
Formally, we describe a motion as a homogeneous time series {X_0, . . . , X_T}, where X_t is the motion frame at time t and contains the joint positions and global velocities. Starting from the joint distribution P(X_{<t}, X_t, X_{t+1:t+H}), we model the influence of the past frames X_{<t} and the future frames X_{t+1:t+H} on the current frame X_t by the transition probability distributions P(X_t | X_{t+1:t+H}) and P(X_t | X_{<t}), where H is the duration of a short-horizon future.
The key reason for this modelling choice lies in two observations. First, the current frame is a result of the past motion and is therefore conditioned on it, captured by P(X_t | X_{<t}). Second, the current frame is also a choice made for a certain planned future, e.g. needing to stop swinging the legs when aiming for a transition from walking to standing, captured by P(X_t | X_{t+1:t+H}). In addition, since the past motion also limits the possibilities of the future motion, there is an impact of the past on the future, P(X_{t+1:t+H} | X_{<t}). Overall, the joint probability is

P(X_{<t}, X_t, X_{t+1:t+H}) ∝ P(X_t | X_{t+1:t+H}) P(X_{t+1:t+H}, X_t | X_{<t})   (1)

Note that the two probabilities on the right-hand side play different roles. P(X_{t+1:t+H}, X_t | X_{<t}) is the probability of unrolling from the past to the future; given a known past, it is a joint probability over both the current and the future, containing all the possible transitions. On top of it, P(X_t | X_{t+1:t+H}) dictates that if the future is also known, then the current can be inferred. This explicit modelling of the transition probability distributions between the past, current and future helps capture the transition stochasticity, which facilitates diversified motion generation, as shown in the experiments.
Learning the transition probabilities in the data space, however, is difficult due to the curse of dimensionality. We therefore project motions to a latent space, which involves embedding the frames as well as the dynamics, and learn the transition distributions in that latent space. During inference, we recover motions from sampled states in the latent space back to the original data space. DFN is naturally divided into three components: spatial (frame) embedding, dynamics embedding and dynamics modelling.

5.1 Spatial Embedding
We use an auto-encoder for frame embedding, z_t = PoseEnc(X_t) and X̂_t = PoseDec(z_t), shown in Figure 3. PoseEnc is a multi-layer perceptron that projects the data into the latent space. We then separate the latent feature into two components representing the pose code and the global velocity code. PoseDec contains two components, the quaternion decoder and the velocity decoder. The quaternion decoder takes the pose latent feature as input and outputs joint angles (represented by quaternions), and the velocity decoder takes the latent velocity feature as input and outputs the velocity. The quaternion decoder is essentially a differentiable Inverse Kinematics module. As stated in [34], using joint rotations instead of joint positions maintains the bone lengths. After the reconstruction, we use a Forward Kinematics layer to compute the 3D joint positions. To train the auto-encoder, we use a Mean Squared Error loss function:

L_sl = (1/T) Σ_{t=0}^{T} ||X_t − X̂_t||_2^2   (2)

where T is the number of frames in a motion.

Figure 3: The pose-velocity auto-encoder network. The input of the encoder is the 76-dimensional pose-velocity vector. The encoder outputs two codes, one for the pose and the other for the velocity. The pose code and the velocity code are fed into a quaternion decoder and a velocity decoder separately. The 3D joint positions are recovered from the quaternions by the Forward Kinematics (FK) layer.
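The structure of this auto-encoder can be sketched as follows in PyTorch. The code sizes, layer widths and the forward_kinematics placeholder are our own illustrative assumptions (they are not specified at this level of detail here); only the overall split into a pose branch with quaternion decoding plus FK and a separate velocity branch follows the description above.

```python
import torch
import torch.nn as nn

def forward_kinematics(quats, offsets):
    # Placeholder for a differentiable FK layer: maps per-joint unit quaternions and
    # constant bone offsets to root-relative 3D joint positions (24 x 3 = 72 values).
    raise NotImplementedError

class PoseVelocityAutoEncoder(nn.Module):
    def __init__(self, in_dim=76, pose_code=12, vel_code=4, n_joints=24):
        super().__init__()
        self.pose_code, self.vel_code = pose_code, vel_code
        self.encoder = nn.Sequential(                      # PoseEnc: MLP into the latent space
            nn.Linear(in_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, pose_code + vel_code))
        self.quat_decoder = nn.Sequential(                 # pose code -> joint quaternions
            nn.Linear(pose_code, 256), nn.LeakyReLU(),
            nn.Linear(256, n_joints * 4))
        self.vel_decoder = nn.Sequential(                  # velocity code -> 4 global values
            nn.Linear(vel_code, 64), nn.LeakyReLU(),
            nn.Linear(64, 4))

    def forward(self, x, offsets):
        z = self.encoder(x)
        z_pose, z_vel = z[..., :self.pose_code], z[..., self.pose_code:]
        quats = self.quat_decoder(z_pose).view(*x.shape[:-1], -1, 4)
        quats = quats / quats.norm(dim=-1, keepdim=True)    # normalize to unit quaternions
        joints = forward_kinematics(quats, offsets)         # (..., 24, 3), bone lengths preserved
        x_hat = torch.cat([joints.flatten(-2), self.vel_decoder(z_vel)], dim=-1)
        return z, x_hat

# Training follows Eq. 2: loss = ((x_hat - x) ** 2).sum(-1).mean() over all frames.
```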

5.2 Dynamics Embedding

Figure 4: A seq-to-seq network for trajectory embedding.

After learning the pose latent space, we project the dynamics as trajectories in this space using a recurrent neural network, shown in Figure 4. We employ a sequence-to-sequence architecture, as it forces the model to learn long-term dynamics. The RNN consists of Gated Recurrent Units (GRU) [6]; it encodes a sequence of encoded frames {z_t, . . . , z_{t+H}} into a latent representation m_t, and then unrolls to reconstruct the same sequence {z'_{t+1}, . . . , z'_{t+H}} from m_t given z_t. Whereas z_t = PoseEnc(X_t) encodes a single frame, m_t is a future summary over multiple frames. We use the following loss function:

L_tl = L_rec + L_smooth   (3)

where L_rec = (1/T) Σ_{t=0}^{T} ||z_t − z'_t||_2^2 and L_smooth = (1/T) Σ_{t=0}^{T} ||V_t − V̂_t||_2^2,

where T is the number of frames in a motion, and V_t and V̂_t are the original and reconstructed joint velocities. To facilitate training, we use Eq. 2 to pre-train the posture auto-encoder and fix its weights when training the RNN module.
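A minimal sketch of this trajectory embedding as a GRU encoder-decoder is given below; the hidden size, the simple unrolling scheme and the names are our own assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedder(nn.Module):
    """Seq-to-seq embedding of a pose-code sequence {z_t, ..., z_{t+H}} into a summary m_t."""

    def __init__(self, z_dim=16, m_dim=64):
        super().__init__()
        self.encoder = nn.GRU(z_dim, m_dim, batch_first=True)
        self.decoder_cell = nn.GRUCell(z_dim, m_dim)
        self.readout = nn.Linear(m_dim, z_dim)

    def forward(self, z_seq):
        # z_seq: (batch, H+1, z_dim); the final encoder hidden state is the summary m_t.
        _, m_t = self.encoder(z_seq)
        m_t = m_t.squeeze(0)                               # (batch, m_dim)

        # Unroll from m_t, starting at z_t, to reconstruct {z'_{t+1}, ..., z'_{t+H}}.
        h, z_in, recon = m_t, z_seq[:, 0], []
        for _ in range(z_seq.shape[1] - 1):
            h = self.decoder_cell(z_in, h)
            z_in = self.readout(h)
            recon.append(z_in)
        recon = torch.stack(recon, dim=1)                  # (batch, H, z_dim)

        # Reconstruction term of Eq. 3; the velocity smoothness term is omitted in this sketch.
        loss_rec = ((recon - z_seq[:, 1:]) ** 2).sum(-1).mean()
        return m_t, loss_rec
```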
5.3 Dynamics modelling
Generative Model. After embedding the poses and dynamics into a latent space, we now explain the dynamics modelling, which is the key technical contribution of this paper. We propose a new dynamics model that captures the joint distribution P(X_{<t}, X_t, X_{t+1:t+H}). Rather than directly learning the distribution in the data space, we aim to learn the latent joint distribution P(z_{<t}, z_t, z_{t+1:t+H}), where we abstract the past, current and future features separately. First, given the Markov property, we assume that all past information z_{<t} is encoded into h_t, which is a deterministic (known) past state. Next, we assume that the future information z_{t+1:t+H} can be summarized into a future state f_t, and that f_t is drawn from a distribution over all possible future states conditioned on h_t. Last, we also assume that there is a current state s_t which captures the current information z_t, and that s_t is drawn from a distribution over all possible current states. We can therefore write:

P(z_{<t}, z_t, z_{t+1:t+H}) = P(z_{t+1:t+H}, z_t | z_{<t}) P(z_{<t})
                            ∝ P(z_t | z_{t+1:t+H}) P(z_t | z_{<t}) P(z_{t+1:t+H} | z_{<t})
                            = P(s_t | f_t) P(s_t | h_t) P(f_t | h_t)
                            = P(s_t | f_t, h_t) P(f_t | h_t)   (4)

where we directly use s_t, h_t and f_t to replace the corresponding z variables by assuming mappings between them, which will be explained later. Different from existing methods [10, 37], our direct conditioning of the current state on the future and past states, P(s_t | f_t, h_t), and of the future state on the past state, P(f_t | h_t), gives us great flexibility in modelling the stochasticity of transitions. The generation model is shown in Fig. 5 a.
Future Feature Prior. Given h_t, we first predict the future state via P(f_t | h_t). Here we assume a diagonal multivariate Gaussian prior over f_t [5, 14, 26]:

p(f_t | h_t) = Gaussian(f_t; μ_t^f, σ_t^f), where [μ_t^f, σ_t^f] = g_f^p(h_t)   (5)

where μ_t^f and σ_t^f are the mean and covariance, and g_f^p is a three-layer MLP with hidden dimension 256 and LeakyReLU activation. This prior covers all possible future states given the past. It can represent a goal or a driving signal for the generative process, and it forces the model to learn rich motion transitions and long-term correlations, overcoming the freezing problem of traditional RNNs [44].
Current Feature Prior. Next, we explain P(s_t | f_t, h_t). Although h_t is a known (deterministic) past, f_t is random. We therefore first sample a specific future state f_t, then decode it to an unrolled future summary m_t, and finally condition the current state s_t on h_t and m_t:

P(s_t | f_t, h_t) ∝ P(s_t | m_t, h_t) P(m_t | f_t, h_t)   (6)
P(s_t | m_t, h_t) = Gaussian(s_t; μ_t^s, σ_t^s), [μ_t^s, σ_t^s] = g_s^p(h_t, m_t)   (7)

where P(m_t | f_t, h_t) is parameterized by a four-layer MLP with hidden dimension 128 and LeakyReLU activation, μ_t^s and σ_t^s are the mean and covariance, and g_s^p is a two-layer MLP with hidden dimension 128. After sampling the current state s_t, we compute the current feature z_t via z_t = MLP(s_t, h_t), where the MLP has three hidden layers with 128 dimensions and LeakyReLU activation.
Finally, given the current and future states, the past state is updated as follows (Fig. 5 c):

h_{t+1} = GRU(h_t, s_t, f_t)   (8)

where the GRU has two stacked layers and a hidden state of dimension 128. The generation model in Fig. 5 a is now complete.
Different from existing methods [7, 10], where the prior of the current state is a function of the past state h_t only and where the future state is shared with the current state, we let the model learn two different distributions for the current and future states. The prior of the current state is also a function of the future state, which forces the model to make use of the future information.
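Taken together, Eqs. 5-8 describe one self-driven generation step. The following PyTorch-style sketch illustrates that step; the hidden-layer counts are simplified and the latent dimensions (other than the 128-dimensional h_t mentioned in the text) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DFNGeneratorStep(nn.Module):
    """One generation step: h_t -> sample f_t -> decode m_t -> sample s_t -> z_t -> h_{t+1}."""

    def __init__(self, h_dim=128, f_dim=32, s_dim=32, m_dim=64, z_dim=16):
        super().__init__()
        self.future_prior = nn.Sequential(                  # g_f^p, Eq. 5
            nn.Linear(h_dim, 256), nn.LeakyReLU(), nn.Linear(256, 2 * f_dim))
        self.future_to_summary = nn.Sequential(             # P(m_t | f_t, h_t), Eq. 6
            nn.Linear(h_dim + f_dim, 128), nn.LeakyReLU(), nn.Linear(128, m_dim))
        self.current_prior = nn.Sequential(                 # g_s^p, Eq. 7
            nn.Linear(h_dim + m_dim, 128), nn.LeakyReLU(), nn.Linear(128, 2 * s_dim))
        self.decode_z = nn.Sequential(                      # z_t = MLP(s_t, h_t)
            nn.Linear(h_dim + s_dim, 128), nn.LeakyReLU(), nn.Linear(128, z_dim))
        self.transition = nn.GRUCell(s_dim + f_dim, h_dim)  # Eq. 8

    @staticmethod
    def sample(stats):
        mu, log_sigma = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * log_sigma.exp()  # reparameterized Gaussian sample

    def step(self, h_t):
        f_t = self.sample(self.future_prior(h_t))                        # envisioned future state
        m_t = self.future_to_summary(torch.cat([h_t, f_t], dim=-1))      # unrolled future summary
        s_t = self.sample(self.current_prior(torch.cat([h_t, m_t], dim=-1)))
        z_t = self.decode_z(torch.cat([s_t, h_t], dim=-1))               # current pose-velocity code
        h_next = self.transition(torch.cat([s_t, f_t], dim=-1), h_t)     # update the past state
        return z_t, h_next

# Open-loop generation repeats step() and decodes each z_t back to a frame with PoseDec.
```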
5.4 Inference
In the generation model in Fig. 5 a, the key variables to be inferred are s_t and f_t, as shown in Fig. 5 b. The posterior of the future state f_t depends on the past state h_t and its unrolled future summary m_t. The posterior of the current state s_t depends on the feature z_t, the past state h_t and the future summary m_t. We first factorize the dynamics as follows:

q(s_{≤T} | z_{≤T}) ≈ Π_{t=1}^{T} q(s_t | z_{≤t−1}, z_t, z_{≤t+H}) = Π_{t=1}^{T} q(s_t | h_t, z_t, m_t)   (9)

where T is the total length of the motion. Here we approximate q(s_t | z_{≤T}) with q(s_t | z_{≤t+H}), as the correlation between s_t and the far future is likely to be small, so we only consider frames up to t + H. Then, for each time step, we use an MLP to parameterize the posterior:

q(s_t | h_t, z_t, m_t) = Gaussian(μ_t^s, σ_t^s), [μ_t^s, σ_t^s] = MLP(h_t, z_t, m_t)   (10)

where the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. For the future state we approximate its posterior as follows:

q(f_t | h_t, m_t) = Gaussian(μ_t^f, σ_t^f), [μ_t^f, σ_t^f] = MLP(h_t, m_t)   (11)

where the MLP has two hidden layers with 512 dimensions and LeakyReLU activation.
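These posteriors are simple Gaussian heads; a minimal sketch is shown below, together with the KL terms they contribute to the training objective in Section 5.6. The hidden sizes follow the text, while the class name and the use of torch.distributions are our own assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianHead(nn.Module):
    """MLP with two hidden layers that outputs a diagonal Gaussian (used for Eqs. 10 and 11)."""

    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 2 * out_dim))

    def forward(self, *inputs):
        mu, log_sigma = self.net(torch.cat(inputs, dim=-1)).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

# Posterior of the current state s_t given (h_t, z_t, m_t), Eq. 10 (hidden size 32):
#   q_s = GaussianHead(h_dim + z_dim + m_dim, hidden=32, out_dim=s_dim)
# Posterior of the future state f_t given (h_t, m_t), Eq. 11 (hidden size 512):
#   q_f = GaussianHead(h_dim + m_dim, hidden=512, out_dim=f_dim)
# Their KL divergences against the corresponding priors enter L_d in Eq. 15, e.g.
#   kl = kl_divergence(q_s(h, z, m), p_s).sum(-1) + kl_divergence(q_f(h, m), p_f).sum(-1)
```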
5.5 Temporal difference loss
Besides inferring s_t and f_t, we also constrain the dynamics of s. We assume a relation between two states s_{t1} and s_{t2} at times t_1 and t_2 with t_1 < t_2, similar to [12]:

p(s_{t1}, s_{t2}, t_1, t_2) ∝ p(s_{t2} | s_{t1}, h_{t1}, h_{t2})   (12)

where we parameterize the posterior

q(s_{t1} | s_{t2}, h_{t1}, h_{t2}) = Gaussian(s_{t1}; μ_{s_{t1}}, σ_{s_{t1}}), where [μ_{s_{t1}}, σ_{s_{t1}}] = MLP(s_{t2}, h_{t1}, h_{t2})   (13)

where μ_{s_{t1}} and σ_{s_{t1}} are the mean and covariance, and the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. This way we can sample s_{t1}^{posterior} during inference. Meanwhile, we reconstruct s_{t2} from it using the time difference δt = |t_1 − t_2|:

s_{t2}^{rec} = MLP(s_{t1}^{posterior}, δt)   (14)

where the MLP for this skip prediction has three hidden layers, each with 32 dimensions and LeakyReLU activation.

Figure 5: The stochastic latent RNN. a) Generation model: the current pose embedding feature z_t depends on the current and past latent states s_t and h_t; s_t depends on the past state h_t and the future summary m_t, which in turn depends on the future latent state f_t and the past state h_t. b) Inference of s_t and f_t. c) Transition of h_t (GRU). In the diagram, h_t is the deterministic latent for the history state, z_t the pose-velocity code, m_t the code for a future sequence (of length 8 or 16), s_t the stochastic latent for the current state, and f_t the stochastic latent for the future state.

5.6 Learning
Finally, we compose all terms of the loss function:

max_Φ  L_d + L_T   (15)

where Φ is the set of all learnable parameters in our networks, and

L_d = Σ_{t=1}^{T} [ −KL(q(s_t | h_t, z_t, m_t) || p(s_t | h_t, m_t)) − KL(q(f_t | h_t, m_t) || p(f_t | h_t)) + log p(z_t | s_t, h_t) + log p(m_t | f_t, h_t) ]

L_T = Σ_{t_1, t_2} [ −KL(q(s_{t1} | s_{t2}, h_{t1}, h_{t2}) || p(s_{t1})) + p(s_{t2} | s_{t1}, t_1, t_2) ]

where KL denotes the Kullback-Leibler divergence.
After training the pose auto-encoder and the sequence auto-encoder (Sections 5.1-5.2), we freeze their parameters and train the dynamics model (Section 5.3).

6 EXPERIMENT AND RESULTS
For all our experiments, we use the CMU MoCap database¹. The CMU dataset is a high-quality dataset acquired with optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories. Its high quality serves our purposes well, as it provides good data ‘seeds’ for motion generation. Also, the tremendous effort that went into capturing the data shows the need for tools such as DFN for data augmentation. To carefully evaluate DFN, we select motion classes with different features and dynamics, shown in Tab. 1, and show that DFN can generate new high-quality motions with arbitrary lengths using different motion prefixes. Next, we evaluate DFN on data with single and mixed motion classes to assess its ability to learn the transition stochasticity of a single type of dynamics and of mixed types of dynamics. Last, we push the limits of DFN by reducing the training data, to show that DFN can make use of a small amount of data to generate high-quality and diversified data, which is crucial for data augmentation. More examples can be found in the supplementary video.

¹ http://mocap.cs.cmu.edu/

6.1 Open-loop Motion Generation
We first show open-loop motion generation, where we do not moderate accumulative errors. We use an 8 to 20-frame motion prefix to start generation and produce 900 frames (dfn_run2box_2char and dfn_boxing_3char in the video). The motion stability indicates that DFN does not suffer from the cumulative-error problem that is common in time-series generation [41]. Given the same prefix, the diversity is shown in the transitions between different postures (short-term) and different actions (long-term).

6.2 Dynamics Multi-modality
We investigate how well DFN can capture different transition stochasticity in different motions, using several types of motions with different properties (shown in Tab. 1). We first train DFN on them separately, then jointly. The results can be found in dfn_walk_top, dfn_walk_close, walking1-walking2, running1-running3, dancing1 and boxing1 in the video. We observe that DFN learns the transition stochasticity well when trained on a single type of motion. The diversity can be found in the short-term and long-term transitions, which are the two levels of multi-modality captured by DFN. In walking (dfn_walk_top and dfn_walk_close in the video), the short-term stochasticity is shown in within-cycle motion randomness, which enriches the walking style. The long-term stochasticity is shown when a turn is generated. The action-level transition is also captured and generated. Similar observations are found in other motions.
When trained on mixed data (combining all motions in Tab. 1), DFN learns higher-level action transitions between different motion classes. We can see examples (action_transition in the video and frame-level images in the supplementary file) that transit from dancing to running, from boxing to dancing, or from slow walking to running, showing the modelling capacity at two levels. Within a single action, diversified styles are learned well. Between different actions, transitions are learned well too. This demonstrates the benefits of modelling the randomness explicitly between the past, current and future states, without which it would be hard to capture the multi-modality and the model would average over all types of dynamics, resulting in meaningless mean poses and motions.
Motion Cyclic Main Body Part Rhythmic Dynamics
Walking Yes Lower No Low
Boxing No Upper No High
Dancing No Full Yes High
Running Yes Lower No High
Table 1: Motion types and their features.

Figure 7: Pose embedding trajectory of random generated se-


quences given same initialization with 20 frame. The circle
mark the first frame. We see that as time goes, these trajec-
tory depart from each other.

Pose embedding distribution Pose embedding latent distribution

Figure 6: Randomly selected training motions in the latent


space. Color indicate different motion class. Smooth trajec-
tories are universally obtained by embedding.

future state, which would otherwise make it hard to capture the


multi-modality and lead it to average over all types of dynamics, t=32 t=128
resulting in meaningless mean poses and motions.

6.3 Diversified Generation Figure 8: Four groups of motions generated from four dif-
Although visually it is clear that DFN can generate diversified ferent motion prefixes, each group with 4096 motions, and
motions, we also numerically show the diversity especially when their z (Left) and s (Right) at t = 32 and t = 128. We can see
the duration of generation becomes long. First, we randomly select that the earlier distributions are more concentrated and di-
training data of different classes, and show their latent feature verge fast as time passes.
trajectory in Fig. 6, with embedding dimension of 16. We then
use a PCA model to project the embedding features to 2D. The
trajectories are continues and smooth without extra constraints on For both z and s, the red (t = 32) are concentrated more, showing
the auto-encoder. It shows that the motion dynamics are captured that the difference between the 4096 generated motions are still
well in the latent space, which is critical in motion generation. Next, somewhat similar in the early stage. However, the yellow (t = 128)
we show a group of generated motions in Fig. 7. Even with the show that the generated motions start to diversify later. Not only
same motion prefix, motions start to diversify from the beginning, they shift out of the original red region, indicating that they are in
which is a distinct property lacking in deterministic generators in now in different pose regions, they also start to diverge more, shown
most action-prediction models such as [32], and our sequence is in by different modes in yellow areas, meaning they have diverged
3d which is more difficult than 2d [43]. into several different pose regions.
Distributional Shift in Time. The motion diversity increases Distribution Matching in Time. Another way to test the di-
in time. To show that there is a distributional shift of poses, we versity of generated motions is to see their statistical similarity to
randomly pick an initial sequence from training data, then randomly the training motions. Since the motion prefix is from one particular
generate 4096 sequences each with 128 frames. We visualize the motion, the more similar the generated motions are to the whole
current latent state s and latent feature z at t = 32 and t = 128 in training dataset, the more diverse they are, because the generated
Fig. 8. Note that the distributions of s and z capture the stochasticity motions have leave the original motion region where the motion
at two different levels, one at the stochastic state level and the other prefix is.
at the latent feature level. The red dots represent them at t = 32 We employ the mean-distance distribution as a measure, as in
and the yellow dots at t = 128. [43]. For each time step, we calculate the mean pose of all generated
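The mean-distance statistics used here can be computed directly from the stacked pose data; below is a small sketch under the assumption that all generated (or training) poses are stored as a single array of shape (num_motions, num_frames, pose_dim), with names of our own choosing.

```python
import numpy as np

def mean_distance_distribution(poses):
    """Per-frame mean and spread of the distance to the average pose (as plotted in Figure 9).

    poses: array of shape (num_motions, num_frames, pose_dim), e.g. 4096 generated
           motions of 128 frames each.
    Returns (mean, std), each of shape (num_frames,).
    """
    mean_pose = poses.mean(axis=0, keepdims=True)          # (1, num_frames, pose_dim)
    dists = np.linalg.norm(poses - mean_pose, axis=-1)     # (num_motions, num_frames)
    return dists.mean(axis=0), dists.std(axis=0)

# Computing the same statistics over the training set gives the blue reference band in
# Figure 9; comparing the two curves over time shows whether the generated distribution
# gradually matches the training distribution.
```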
6.4 Generation on Limited Training Data
DFN aims to alleviate data scarcity, so it should require as little data as possible for generation. We therefore push DFN to its limit by reducing the training data, to see the minimal amount of data needed. To investigate each individual type of motion, we train DFN on walking, running and boxing data separately. We start from the full training data, where the longest sequence lasts around 10 minutes, and gradually reduce the duration by sampling until the quality of the generated motions starts to deteriorate. Although DFN responds to reduced training data slightly differently for different motions, we are finally able to reduce the training data to a tiny amount, with the longest sequence being only 15 seconds (12 seconds for walking, 15 seconds for boxing and 7 seconds for running). DFN can still generate stable motions even when trained on merely a 7-second motion (the result can be seen in reduced_data in the video). The impact of reducing the training data is mainly on the diversity of the motion (although the supplementary video shows that the generated boxing motion still retains a certain diversity). Less training data contains fewer transition varieties (both short-term and long-term), so the generated motions are less diverse. This is understandable, as DFN cannot deviate too much from the original data distribution while ensuring motion quality.

6.5 Comparison
To the best of our knowledge, the paper most similar to ours is [42], which also focuses on diversified motion generation. However, the biggest difference is that DFN explicitly models the influence of the future on the current state. This enables DFN to explicitly model the transition randomness at different stages and levels, and it is the key reason why DFN can be trained well on multiple types of motions, separately and jointly, which has not been shown in [42]. A direct numerical comparison is difficult, however, due to the lack of widely accepted metrics for diversified motion generation. In addition, the method in [42] uses heavy post-processing while DFN does not.

7 CONCLUSION AND DISCUSSIONS
In this paper, we propose a new generative model, DFN, for diversified human motion generation. DFN can generate motions with arbitrary lengths. It successfully captures the transition stochasticity in the short and long term, and it is capable of learning the multi-modal randomness in different motions. The training data needed is small. We have conducted extensive evaluations to show DFN's robustness, versatility and diversity in motion generation.
There are two main limitations in our method: there is no control signal, and it sometimes overly smooths high-frequency motions. We will address them in future work. Our explicit modelling of the future makes it convenient to introduce a desired future as a control signal, while replacing some of the Gaussian components with multi-modal priors might mitigate the over-smoothing issue.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable comments. This work is partially supported by the National Key Research & Development Program of China (No. 2016YFB1001403), NSF China (No. 61772462, No. 61572429, No. U1736217), the 100 Talents Program of Zhejiang University, the Strategic Priorities Fund of Research England, and EPSRC (Ref: EP/R031193/1).
DFN to its limit by reducing the training data, to see the minimal [1] Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural
amounts of data needed. To investigate each individual type of Language Grounded Pose Forecasting. In 2019 International Conference on 3D
Vision (3DV). IEEE, 719–728.
motions, we train DFN on walking, running, and boxing data sepa- [2] Federico Bartoli, Giuseppe Lisanti, Lamberto Ballan, and Alberto Del Bimbo. 2018.
rately. We start from full training data where the longest sequence Context-aware trajectory prediction. In 2018 24th International Conference on
lasts for around 10 minutes, and gradually reduce the duration by Pattern Recognition (ICPR). IEEE, 1941–1946.
[3] Justin Bayer and Christian Osendorfer. 2014. Learning Stochastic Recurrent
sampling until the quality of the generated motions start to dete- Networks. stat 1050 (2014), 27.
riorate. Although DFN responds to reduced training data slightly [4] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. 2017.
Deep representation learning for human motion prediction and classification.
differently on different motions, we finally able to reduce the train- In Proceedings of the IEEE conference on computer vision and pattern recognition.
ing data to a tiny amount, with the longest sequence being only 6158–6166.
15 seconds (12 second for walking, 15 second for boxing and 7 sec- [5] Nutan Chen, Maximilian Karl, and Patrick Van Der Smagt. 2016. Dynamic
movement primitives in latent space of time-dependent variational autoencoders.
ond for running). DFN can still generate stable motions even when In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids).
trained on merely a 7-second long motion. (The result can be seen IEEE, 629–636.
in reduced_data in video) The impact of reducing the training data [6] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio.
2014. On the Properties of Neural Machine Translation: Encoder–Decoder Ap-
is mainly on the diversity of the motion. (However we can see in proaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and
supplementary video that the generated boxing motion still has a Structure in Statistical Translation. Association for Computational Linguistics,
Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012
certain of diversity). Less training data contains fewer transition [7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville,
diversities (both short-term and long-term). The generated motions and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data.
therefore are less diverse. This is understandable as DFN cannot In Advances in neural information processing systems. 2980–2988.
[8] Han Du, Erik Herrmann, Janis Sprenger, Noshaba Cheema, Somayeh Hosseini,
deviate too much from the original data distribution to ensure the Klaus Fischer, and Philipp Slusallek. 2019. Stylistic Locomotion Modeling with
motion quality. Conditional Variational Autoencoder.. In Eurographics (Short Papers). 9–12.
[9] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
[10] Anirudh Goyal Alias Parth Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems. 6713–6723.
[11] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1308.html#Graves13
[12] Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal Difference Variational Auto-Encoder. In International Conference on Learning Representations. https://openreview.net/forum?id=S1x4ghC9tQ
[13] Xiao Guo and Jongmoo Choi. 2019. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2580–2587.
[14] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, Taku Komura, Jun Saito, Ikuo Kusajima, Xi Zhao, Myung-Geol Choi, Ruizhen Hu, et al. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. In BMVC.
[15] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2020. Dream to Control: Learning Behaviors by Latent Imagination. In International Conference on Learning Representations. https://openreview.net/forum?id=S1lOTC4tDS
[16] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning. 2555–2565.
[17] Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. 1–4.
[18] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2019. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598 (2019).
[19] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.
[20] Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.
[21] Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.
[22] Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. 1–4.
[23] Wei-Ning Hsu, Yu Zhang, and James Glass. 2017. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems. 1878–1889.
[24] Junfeng Hu, Zhencheng Fan, Jun Liao, and Li Liu. 2019. Predicting Long-Term Skeletal Motions by a Spatio-Temporal Hierarchical Recurrent Network. arXiv preprint arXiv:1911.02404 (2019).
[25] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In International Conference on Learning Representations. https://openreview.net/forum?id=rJe4ShAcF7
[26] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2017. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. stat 1050 (2017), 3.
[27] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.
[28] Jogendra Nath Kundu, Maharshi Gor, Phani Krishna Uppala, and Venkatesh Babu Radhakrishnan. 2019. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1459–1467.
[29] Chen Li and Gim Hee Lee. 2019. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9887–9895.
[30] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
[31] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE International Conference on Computer Vision. 9489–9497.
[32] Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
[33] Jianyuan Min and Jinxiang Chai. 2012. Motion Graphs++: A Compact Generative Model for Semantic Motion Analysis and Synthesis. ACM Trans. Graph. 31, 6, Article 153 (Nov. 2012), 12 pages. https://doi.org/10.1145/2366145.2366172
[34] Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion.
[35] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.
[36] A. Safonova, Jessica Hodgins, and Nancy Pollard. 2004. Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics (TOG) (2004).
[37] Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin Networks: Matching the Future for Sequence Generation. In International Conference on Learning Representations. https://openreview.net/forum?id=BydLzGb0Z
[38] Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. 2019. Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction.
[39] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
[40] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. [n.d.]. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125–125.
[41] He Wang, Edmond SL Ho, Hubert PH Shum, and Zhanxing Zhu. 2019. Spatio-temporal Manifold Learning for Human Motions via Long-horizon Modeling. IEEE Transactions on Visualization and Computer Graphics (2019).
[42] Zhiyong Wang, Jinxiang Chai, and Shihong Xia. 2019. Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Transactions on Visualization and Computer Graphics (2019).
[43] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. 2019. Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions. CoRR abs/1912.10150 (2019). arXiv:1912.10150 http://arxiv.org/abs/1912.10150
[44] Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2018. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In International Conference on Learning Representations. https://openreview.net/forum?id=r11Q2SlRW
