Figure 1: Given a 20-frame walking motion prefix (white), our model can generate diversified motions: walking (yellow), walking-to-running (blue), walking-to-boxing (green), and walking-to-dancing (red), with arbitrary duration. The corresponding animation can be found in teaser.mp4 in the supplementary video.
ABSTRACT
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is challenging due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and long terms. In the short term, there is strong randomness within a couple of frames, e.g., one frame followed by multiple possible frames leading to different motion styles; in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model that explicitly focuses on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity for temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method in robustness, versatility and quality.

1 INTRODUCTION
Modeling natural human motions is a central topic in several fields, such as computer animation, bio-mechanics and virtual reality, where high-quality motion data is a necessity. Despite the improved accuracy and lowered cost of motion capture systems, it is still highly desirable to make full use of existing data to generate diversified new data. One key challenge in motion generation is dynamics modelling, where it has been shown that a latent space can be found due to the high coordination of body motions [21, 36, 41]. However, as much as the spatial aspect has been studied, dynamics modelling, especially with the aim of diversified motion generation, remains an open problem.

Human motion dynamics manifest several levels of short-term and long-term stochasticity. Given a homogeneous discretization of motion in time, short-term stochasticity refers to the randomness in the next one or few frames (pose transitions), while long-term stochasticity refers to the randomness in the next one or few actions (action transitions). Traditional methods model them with Finite State Machines over carefully organized data [33], which have limited model capacity for large amounts of data and require extensive pre-processing. Newer deep learning methods either ignore them [21] or do not model them explicitly [41]. Very recently, dynamics modelling for diversified generation has been investigated [42], but only from the perspective of the overall dynamics, rather than the detailed short/long-term stochasticity.
In this paper, we propose a new deep learning model, Dynamic Future Net, or DFN, for automatic and diversified high-quality motion generation based on limited amounts of data. Given a motion, we assume that it can be discretized homogeneously in time and represented by a series of postures and instantaneous velocities. Following the observation that it is easier to learn the dynamics in a natural motion latent space [21], we first embed features in the data space into a latent space. Next, DFN explicitly learns the history, current and future states at any time, where we also model several conditional distributions for the influences of the history and future states on the current state. This state-wise bidirectional modelling (extending into both the past and the future) separates DFN from existing methods and endows us with the ability to model short-term (next-frame) randomness and long-term (next-action) randomness. Last, for inference purposes, we propose new loss functions based on distributional similarities, as opposed to point-wise estimation [41, 44], which not only captures the dynamics accurately but also keeps the randomness that is crucial for diversified motion generation.
We show extensive experimental results demonstrating DFN's robustness, versatility and quality. Unlike existing methods which have to be trained on one type of motion at a time [42, 44], DFN can be trained on either a single type of motion or mixed motions, which shows DFN's ability to capture multi-modal dynamics and therefore its versatility in diversified motion generation. Visual evaluation shows that DFN can generate high-quality motions with different dynamics.

In summary, our contributions include:
(1) a new deep learning model, Dynamic Future Net, for automatic, diversified and high-quality human motion generation;
(2) a new dynamic model that captures the transition stochasticity of the past, current and future states in motions;
(3) insights into the importance of both short-term and long-term dynamics in human motion modelling.

2 BACKGROUND AND RELATED WORKS

2.1 Human pose and motion embedding
Given a human motion sequence, it is useful to find a low-dimensional representation of the whole sequence. Holden et al. [22] for the first time used a convolutional neural network to project an entire sequence into a low-dimensional embedding space. Using this more abstract representation, one can blend motions or remove noise from corrupted motions. In [20, 21], the authors further exploit the power of the learned motion manifold and decoder to synthesize motions under constraints. Another important application of motion embedding is motion style generation [8], in which the embedding code can be tuned to match the desired style. Although modeling a motion sequence with auto-encoders is straightforward, how it can model the dynamics of human motion is not clear. In [28], the authors model a motion sequence as a trajectory in a pose-embedding manifold, then use a bidirectional RNN to encode the trajectory and model its dynamics, which improves motion classification results. Moreover, they design a graph-like network to better represent the human body components.

The existing methods focus on the embedding of the poses and dynamics. However, they do not explicitly model the distributions of these latent variables, which govern the stochasticity of the dynamics. In this paper, we go a level deeper and learn the latent variable distributions for the embedded poses and dynamics.

2.2 Deterministic human motion prediction and synthesis
In the effort of modelling motion dynamics, many methods employ deterministic transitions [2, 4, 9, 13, 17, 19, 24, 30, 31, 38], especially in human motion prediction or generation. They focus either on short-term dynamics modelling or on the spatial-temporal information of the overall dynamics. In [44], the authors propose a training technique for RNNs to generate very long human motions. Although this technique solves the freezing phenomenon of RNNs, their model is deterministic, which makes training difficult: given a past state, if multiple possible future motions are present in the data, the network will average them, which is a common problem in many human motion prediction methods.

One solution to this problem is to introduce control signals [14, 39]. They design several networks and make the character follow a given trajectory in real time. In [35], the control signal becomes the 3D human pose predicted by neural networks, used as a reference for an agent to imitate. In [1], the authors co-embed language and the corresponding motions into a shared manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, such as a 2D human skeleton, one can still expect different motions or different poses corresponding to the same control signal [29], essentially indicating the multi-modal nature of human motion dynamics.

Different from the existing methods, our paper focuses on explicitly modelling the multi-modal nature of motion transitions in human motions. Further, we also aim to learn the stochasticity in those transitions.

2.3 Stochastic human motion synthesis
In [42], the authors combine RNNs and Generative Adversarial Networks (GANs) to generate stochastic human motions. They use a mixture density layer to model the stochastic property, and an adversarial discriminator to judge whether the generated motion is natural. In MoGlow [18], the authors for the first time use normalizing flows to directly model the next-frame distribution. One advantage of this method is that it can capture complex distributions without learning an explicit latent space. Given the same initial poses and the same control signals or constraints, the model still generates different motion sequences. Chen et al. [5] combine dynamic motion primitives and a variational Bayesian filter to model human motion dynamics. They show that the latent representations are self-clustering after training. However, for the transitions it needs information about the whole sequence, which keeps it from being a pure generative model.
Our method differs from existing approaches in its treatment of the relations between the past, current and future states of human motions. Unlike the aforementioned methods, we explicitly model the current state based on both the past and the future. Also, we further model their randomness in a latent space that captures the transition multi-modality.

2.4 Stochastic RNN model
Modelling the stochasticity in time-series data, such as music, handwriting and human voice, has been a long-standing problem [11, 25, 40]. The VRNN [7] was the first to combine the Variational AutoEncoder (VAE) [27] with recurrent neural networks for this purpose. Later, in [23], the authors disentangle the latent variables of the observation and the dynamics, with the observation latent being used to recover the full observation information, and the dynamic latent capturing the dynamics.

A key modelling choice in stochastic RNN models is the relation between the past, current and future. In early work, the posterior of the current state is inferred from past information only, which deprives the model of the ability to foresee the future. In [3, 10, 37], the authors show that performance can be improved by incorporating the future state with a backward RNN in the inference stage. In [12], the authors design a model that goes beyond step-by-step modelling and predicts multiple steps up to a given horizon. A similar effort is also made in reinforcement learning, where the reward function takes the discounted future reward into consideration [12, 15]. In [16], the authors went further and designed a model that predicts multiple future scenarios, then chooses the one with the highest predicted reward from all the possibilities.

We observe that human motions follow a similar philosophy: the current state is a result of the past motion but also a particular choice for a certain planned future. Our research is inspired by stochastic RNN models but focuses on human motion transition stochasticity.

Figure 2: Overview of the proposed Dynamic Future Network. In the learning process, the network takes a human motion sequence as input and predicts the long-term distribution and the short-term (next-frame) distribution.

In the inference stage, we combine the past, current and future states to infer the current latent state distribution, and we combine the past and future to infer the future latent state distribution. In the generation stage, unlike existing methods [10] where the current state is generated from the past state only, we first generate the future state and combine it with the past state to generate the current latent state prior, from which we sample the current latent state, then decode it to the pose embedding and recover the current pose and velocity. We regard this process as a self-driving motion generation process guided by the envisioned dynamic future. In this way, the model can learn and generate rich and varied natural motions.
Figure 5: The stochastic latent RNN. a) Generation model: the current pose-embedding feature z_t depends on the current latent state s_t and the past state h_t; s_t depends on the past state h_t and the future summary m_t, which in turn depends on the future latent state f_t and the past state h_t. b) Inference of s_t and f_t. c) Transition of h_t (GRU). Legend: z_t: code for pose and velocity; m_t: code for the future sequence (length 8/16); s_t: stochastic latent for the current state; f_t: stochastic latent for the future state.
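To make the dependency structure of Figure 5a/5c concrete, the following is a minimal PyTorch-style sketch of one generation step, assuming every conditional is a diagonal Gaussian parameterized by a small MLP. All class names, module names and dimensions here (GaussianMLP, prior_f, summary_m, prior_s, decoder_z, and the sizes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianMLP(nn.Module):
    """Maps its (concatenated) inputs to the mean and std of a diagonal Gaussian."""
    def __init__(self, dim_in, dim_out, dim_hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.LeakyReLU(),
                                 nn.Linear(dim_hidden, 2 * dim_out))

    def forward(self, *xs):
        mu, log_std = self.net(torch.cat(xs, dim=-1)).chunk(2, dim=-1)
        return mu, log_std.exp()

class DFNGenerationStep(nn.Module):
    """One step of the generation model in Figure 5a: h_t -> f_t -> m_t -> s_t -> z_t,
    with the deterministic state updated as h_{t+1} = GRU(z_t, h_t) (Figure 5c)."""
    def __init__(self, dim_h=128, dim_z=16, dim_s=32, dim_f=32, dim_m=32):
        super().__init__()
        self.prior_f = GaussianMLP(dim_h, dim_f)             # p(f_t | h_t)
        self.summary_m = nn.Linear(dim_f + dim_h, dim_m)     # m_t from (f_t, h_t)
        self.prior_s = GaussianMLP(dim_h + dim_m, dim_s)     # p(s_t | h_t, m_t)
        self.decoder_z = GaussianMLP(dim_s + dim_h, dim_z)   # p(z_t | s_t, h_t)
        self.gru = nn.GRUCell(dim_z, dim_h)                  # transition of h_t

    def forward(self, h_t):
        f_t = Normal(*self.prior_f(h_t)).rsample()           # envision a future state
        m_t = self.summary_m(torch.cat([f_t, h_t], dim=-1))  # summarize the future
        s_t = Normal(*self.prior_s(h_t, m_t)).rsample()      # current stochastic state
        z_t = Normal(*self.decoder_z(s_t, h_t)).rsample()    # pose-embedding code
        h_next = self.gru(z_t, h_t)                          # deterministic transition
        return z_t, h_next
```

Rolling this step forward from a hidden state seeded by the encoded motion prefix yields motion of arbitrary length; each z_t is then mapped back to a pose and velocity by the frozen pose decoder, and the inference networks of Figure 5b are only needed during training.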
where the MLP for skip prediction has three hidden layers, each with 32 dimensions and LeakyReLU activations.
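For concreteness, a sketch of such an MLP is below; the layer widths and activation are from the text, while the input and output dimensions are our assumptions.

```python
import torch.nn as nn

def make_skip_mlp(dim_in=32, dim_out=32):
    # Three hidden layers, each 32-dimensional, with LeakyReLU activations,
    # as stated in the text; dim_in and dim_out are assumed, not specified.
    return nn.Sequential(
        nn.Linear(dim_in, 32), nn.LeakyReLU(),
        nn.Linear(32, 32), nn.LeakyReLU(),
        nn.Linear(32, 32), nn.LeakyReLU(),
        nn.Linear(32, dim_out),
    )
```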
5.6 Learning
Finally, we compose all terms into the loss function:

$$\max_{\Phi}\; L_d + L_T \quad (15)$$

where $\Phi$ is the set of all learnable parameters in our networks and

$$L_d = \sum_{t=1}^{T}\Big[ -\mathrm{KL}\big(q(s_t \mid h_t, z_t, m_t)\,\big\|\,p(s_t \mid h_t, m_t)\big) - \mathrm{KL}\big(q(f_t \mid h_t, m_t)\,\big\|\,p(f_t \mid h_t)\big) + \log p(z_t \mid s_t, h_t) + \log p(m_t \mid f_t, h_t) \Big]$$

$$L_T = \sum_{t_1, t_2}\Big[ -\mathrm{KL}\big(q(s_{t_1} \mid s_{t_2}, h_{t_1}, h_{t_2})\,\big\|\,p(s_{t_1})\big) + \log p(s_{t_2} \mid s_{t_1}, t_1, t_2) \Big]$$

KL is the Kullback–Leibler divergence.

After training the pose auto-encoder and the sequence auto-encoder (Sections 5.1 and 5.2), we freeze their parameters and train the dynamics model (Section 5.3).
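To make the structure of Eq. (15) concrete, here is a minimal PyTorch-style sketch of one summand of L_d, assuming each conditional above is a diagonal Gaussian whose mean and standard deviation are produced by a network. The module names (q_s, p_s, q_f, p_f, dec_z, dec_m) are illustrative assumptions, not the paper's actual code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def step_loss_Ld(q_s, p_s, q_f, p_f, dec_z, dec_m, h_t, z_t, m_t):
    """One summand of L_d in Eq. (15); each module returns (mean, std)."""
    q_st = Normal(*q_s(h_t, z_t, m_t))   # q(s_t | h_t, z_t, m_t)
    p_st = Normal(*p_s(h_t, m_t))        # p(s_t | h_t, m_t)
    q_ft = Normal(*q_f(h_t, m_t))        # q(f_t | h_t, m_t)
    p_ft = Normal(*p_f(h_t))             # p(f_t | h_t)

    # Reparameterized samples feed the two reconstruction terms.
    s_t, f_t = q_st.rsample(), q_ft.rsample()
    log_pz = Normal(*dec_z(s_t, h_t)).log_prob(z_t).sum(-1)   # log p(z_t | s_t, h_t)
    log_pm = Normal(*dec_m(f_t, h_t)).log_prob(m_t).sum(-1)   # log p(m_t | f_t, h_t)

    kl_s = kl_divergence(q_st, p_st).sum(-1)
    kl_f = kl_divergence(q_ft, p_ft).sum(-1)
    return -kl_s - kl_f + log_pz + log_pm   # maximized together with L_T
```

Summing this over t = 1, ..., T, adding the L_T skip-prediction terms, and negating the total recovers a standard ELBO-style minimization objective.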
6 EXPERIMENT AND RESULTS
For all our experiments, we use the CMU MoCap database¹. The CMU dataset is a high-quality dataset acquired with optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories. Its high quality serves our purposes well, as it provides good data 'seeds' for motion generation. Also, the tremendous effort that went into capturing the data shows the need for tools such as DFN for data augmentation. To carefully evaluate DFN, we select motion classes with different features and dynamics, shown in Tab. 1, to show that DFN can generate new high-quality motions of arbitrary length from different motion prefixes. Next, we evaluate DFN on data with single and mixed motion classes to test its ability to learn different transition stochasticity on data with a single type of dynamics and with mixed types of dynamics. Last, we push the limits of DFN by reducing the training data, to show that DFN can make use of a small amount of data to generate high-quality and diversified data, which is crucial for data augmentation. More examples can be found in the supplementary video.

¹ http://mocap.cs.cmu.edu/

6.1 Open-loop Motion Generation
We first show open-loop motion generation, where we do not moderate accumulative errors. We use an 8-to-20-frame motion prefix to start motion generation and generate 900 frames (dfn_run2box_2char and dfn_boxing_3char in the video). The motion stability indicates that DFN does not suffer from the cumulative-error problem that is common in time-series generation [41]. Given the same prefix, the diversity is shown in the transitions between different postures (short-term) and different actions (long-term).

6.2 Dynamics Multi-modality
We investigate how well DFN captures different transition stochasticity in different motions, using several types of motions with different properties (shown in Tab. 1). We first train DFN on them separately, then jointly. The results can be found in dfn_walk_top, dfn_walk_close, walking1-walking2, running1-running3, dancing1 and boxing1 in the video. We observe that DFN learns the transition stochasticity well when trained on a single type of motion. The diversity can be found in short-term and long-term transitions, which are two levels of multi-modality captured well by DFN. In walking (dfn_walk_top and dfn_walk_close in the video), the short-term stochasticity shows in within-cycle motion randomness, which enriches the walking style. The long-term stochasticity shows when a turn is generated. The action-level transition has also been captured and generated. Similar observations are found in other motions.

When trained on mixed data (combining all motions in Tab. 1), DFN learns higher-level action transitions between different motion classes. We can see examples (action_transition in the video and frame-level images in the supplementary file) that transition from dancing to running, from boxing to dancing, or from slow walking to running, showing the modelling capacity at both levels. Within a single action, diversified styles are learned well. Between different actions, transitions are learned well too. This demonstrates the benefits of modelling randomness explicitly between the past, current and future states.
Motion  | Cyclic | Main Body Part | Rhythmic | Dynamics
Walking | Yes    | Lower          | No       | Low
Boxing  | No     | Upper          | No       | High
Dancing | No     | Full           | Yes      | High
Running | Yes    | Lower          | No       | High

Table 1: Motion types and their features.
6.3 Diversified Generation
Although it is visually clear that DFN can generate diversified motions, we also show the diversity numerically, especially when the generation duration becomes long. First, we randomly select training data of different classes and show their latent feature trajectories in Fig. 6, with an embedding dimension of 16. We then use a PCA model to project the embedding features to 2D. The trajectories are continuous and smooth without extra constraints on the auto-encoder. This shows that the motion dynamics are captured well in the latent space, which is critical for motion generation. Next, we show a group of generated motions in Fig. 7. Even with the same motion prefix, motions start to diversify from the beginning, a property lacking in the deterministic generators of most action-prediction models such as [32]; moreover, our sequences are in 3D, which is more difficult than 2D [43].
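As a rough illustration of this visualization step (the paper does not spell out the exact pipeline), one might fit scikit-learn's PCA on all per-frame embedding codes and draw each sequence as a 2D curve; the array shapes and names below are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_latent_trajectories(latents):
    """latents: (num_sequences, num_frames, 16) per-frame embedding codes."""
    pca = PCA(n_components=2).fit(latents.reshape(-1, latents.shape[-1]))
    for seq in latents:
        xy = pca.transform(seq)                  # (num_frames, 2) trajectory
        plt.plot(xy[:, 0], xy[:, 1], alpha=0.6)  # one smooth curve per motion
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.show()
```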
Distributional Shift in Time. The motion diversity increases over time. To show that there is a distributional shift of poses, we randomly pick an initial sequence from the training data, then randomly generate 4096 sequences, each with 128 frames. We visualize the current latent state s and the latent feature z at t = 32 and t = 128 in Fig. 8. Note that the distributions of s and z capture the stochasticity at two different levels: one at the stochastic state level and the other at the latent feature level. The red dots represent them at t = 32 and the yellow dots at t = 128.

Figure 8: Four groups of motions generated from four different motion prefixes, each group with 4096 motions, and their z (Left) and s (Right) at t = 32 and t = 128. We can see that the earlier distributions are more concentrated and diverge fast as time passes.

For both z and s, the red dots (t = 32) are more concentrated, showing that the 4096 generated motions are still somewhat similar in the early stage. However, the yellow dots (t = 128) show that the generated motions diversify later. Not only do they shift out of the original red region, indicating that they are now in different pose regions, they also diverge more, shown by the different modes in the yellow areas, meaning they have diverged into several different pose regions.

Distribution Matching in Time. Another way to test the diversity of generated motions is to measure their statistical similarity to the training motions. Since the motion prefix is from one particular motion, the more similar the generated motions are to the whole training dataset, the more diverse they are, because the generated motions have left the original motion region where the motion prefix lies. We employ the mean-distance distribution as a measure, as in [43]. For each time step, we calculate the mean pose of all generated
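The sentence above is truncated in the source, so the exact statistic is not fully specified. Under our reading of the mean-distance measure of [43], the sketch below computes, for each time step, the distance of every generated pose to the mean pose at that step; comparing the resulting distributions against the same statistic on the training set would quantify the statistical similarity described above. This reconstruction is an assumption, not a restatement of [43].

```python
import numpy as np

def mean_distance_distribution(poses):
    """poses: (num_motions, num_frames, pose_dim) array of generated poses.

    Returns per-frame distances of each motion to the mean pose, an
    assumed reading of the measure that the truncated passage describes.
    """
    mean_pose = poses.mean(axis=0, keepdims=True)       # (1, T, D)
    return np.linalg.norm(poses - mean_pose, axis=-1)   # (num_motions, T)
```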
6.5 Comparison
To the best of our knowledge, the only work similar to ours is [42], which also focuses on diversified motion generation. However, the biggest difference is that DFN explicitly models the influence of the future on the current state. This enables DFN to explicitly model the transition randomness at different stages and levels, and it is the key reason why DFN can be trained well on multiple types of motions, separately and jointly, which has not been shown in [42]. However, a direct numerical comparison is difficult due to the lack of widely accepted metrics for diversified motion generation. In addition, the method in [42] uses heavy post-processing while DFN does not.