Multi-Task Deep Learning For Real-Time 3D Human Pose Estimation and Action Recognition
Abstract—Human pose estimation and action recognition are related tasks, since both problems are strongly dependent on the representation and analysis of the human body. Nonetheless, most recent methods in the literature handle the two problems separately. In this article, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can be used to solve both problems in an efficient way and still achieve state-of-the-art or comparable results at each task while running with a throughput of more than 100 frames per second. The proposed method benefits from high parameter sharing between the two tasks by unifying the processing of still images and video clips in a single pipeline, allowing the model to be trained with data from different categories simultaneously and in a seamless way. Additionally, we provide important insights for end-to-end training of the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar.
Index Terms—Human action recognition, human pose estimation, multitask deep learning, neural networks
1 INTRODUCTION
HUMAN action recognition has been intensively studied in the last years, especially because it is a very challenging problem, but also due to the several applications that can benefit from it. Similarly, human pose estimation has also rapidly progressed with the advent of powerful methods based on convolutional neural networks (CNN) and deep learning. Despite the fact that action recognition benefits from precise body poses, the two problems are usually handled as distinct tasks in the literature [1], or action recognition is used as a prior for pose estimation [2], [3]. To the best of our knowledge, there is no recent method in the literature that tackles both problems in a joint way to the benefit of action recognition. In this paper, we propose a unique end-to-end trainable multi-task framework to handle human pose estimation and action recognition jointly, as illustrated in Fig. 1.

One of the major advantages of deep learning methods is their capability to perform end-to-end optimization. This is all the more true for multi-task problems, where related tasks can benefit from one another, as suggested by Kokkinos [4]. Action recognition and pose estimation are usually hard to stitch together into a beneficial joint optimization, usually requiring 3D convolutions [5] or heat map transformations [6]. Detection based approaches require the non-differentiable argmax function to recover the joint coordinates as a post-processing stage, which breaks the backpropagation chain needed for end-to-end learning. We propose to solve this problem by extending the differentiable soft-argmax [7], [8] for joint 2D and 3D pose estimation. This allows us to stack action recognition on top of pose estimation, resulting in a multi-task framework trainable end-to-end.
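For reference, the soft-argmax replaces the hard argmax over a joint's heat map by a spatial softmax followed by the expectation of the coordinate grid, so joint localization stays differentiable. The snippet below is only a minimal 2D NumPy illustration of this idea; the function and argument names (soft_argmax_2d, heatmap, beta) are ours and do not come from the paper's implementation.

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=1.0):
    """Differentiable 2D soft-argmax over a single-joint heat map.

    heatmap: array of shape (H, W) with unnormalized scores.
    beta:    softmax sharpness (higher values approach the hard argmax).
    Returns (x, y) in normalized image coordinates in [0, 1].
    """
    h, w = heatmap.shape
    # Spatial softmax turns the heat map into a probability distribution.
    probs = np.exp(beta * (heatmap - heatmap.max()))
    probs /= probs.sum()
    # Normalized pixel-center coordinates along each axis.
    xs = (np.arange(w) + 0.5) / w
    ys = (np.arange(h) + 0.5) / h
    # Expected coordinates under the distribution (differentiable w.r.t. heatmap).
    x = float((probs.sum(axis=0) * xs).sum())
    y = float((probs.sum(axis=1) * ys).sum())
    return x, y
```

Because the coordinates are obtained as an expectation rather than an index lookup, gradients can flow from the action recognition part back into the pose estimation layers, which is what makes the stacked multi-task framework trainable end-to-end.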
In comparison with our previous work [9], we propose a new network architecture carefully designed for pose and action prediction simultaneously at different feature map resolutions. Each prediction is supervised and re-injected into the network for further refinement. Differently from [9], where we first predict poses and then actions, here poses and actions are predicted in parallel and successively refined, strengthening the multi-task aspect of our method. Another improvement is the proposed depth estimation approach for 3D poses, which allows us to depart from learning the costly volumetric heat maps while improving the overall accuracy of the method.

The main contributions of our work are presented as follows: First, we propose a new multi-task method for jointly estimating 2D/3D human poses and recognizing the associated actions. Our method is simultaneously trained end-to-end for both tasks with multimodal data, including still images and video clips. Second, we propose a new regression approach for 3D pose estimation from single frames, benefiting at the same time from images "in-the-wild" with 2D annotated poses and from 3D data. This has been proven a very efficient way to learn good visual features, which is also very important for action recognition. Third, our action recognition approach is based only on RGB images, from which we extract 3D poses and visual information. Despite that, our multi-task method achieves state-of-the-art on both 2D and 3D scenarios, even when compared with methods using
Fig. 2. Overview of the proposed multi-task network architecture. The entry-flow extracts feature maps from the input images, which are fed through a sequence of CNNs composed of prediction blocks (PB), downscaling and upscaling units (DU and UU), and simple (skip) connections. Each PB outputs supervised pose and action predictions that are refined by further blocks and units. The information flows related to pose estimation and action recognition are propagated independently from one prediction block to another, respectively depicted by blue and red arrows. See Fig. 3 and Fig. 4 for details about DU, UU, and PB.
2.2 Action Recognition

2.2.1 2D Action Recognition
In this section we revisit some methods that exploit pose information for action recognition. For example, classical methods for feature extraction have been used in [49], [50], where the key idea is to use body joint locations to select visual features in space and time. 3D convolutions have been stated as the best option to handle the temporal dimension of image sequences [51], [52], [53], but they involve a high number of parameters and cannot efficiently benefit from the abundant still images during training. Another option to integrate the temporal aspect is by analysing motion from image sequences [1], [54], but these methods require the difficult estimation of optical flow. Unconstrained temporal and spatial analysis is also a promising approach to tackle action recognition, since it is very likely that, in a sequence of frames, some very specific regions in a few frames are more relevant than the remaining parts. Inspired by this observation, Baradel et al. [55] proposed an attention model called Glimpse Clouds, which learns to focus on specific image patches in space and time, aggregating the patterns and soft-assigning each feature to workers that contribute to the final action decision. The influence of occlusions could be alleviated by multi-view videos [56], and inaccurate pose sequences could be replaced by heat maps for better accuracy [57]. However, this improvement is not observed when pose predictions are sufficiently precise.

2D action recognition methods usually use the body joint information only to extract localized visual features [1], [49], as an attention mechanism. Methods that directly explore the body joints usually do not generate them [50] or present lower precision with estimated poses [51]. Our approach removes these limitations by performing pose estimation together with action recognition. As such, our model only needs the input RGB frames, while still performing discriminative visual recognition guided by the estimated body joints.

2.2.2 3D Action Recognition
Differently from video based action recognition, 3D action recognition is mostly based on skeleton data as the primary information [58], [59]. With depth sensors such as the Microsoft Kinect, it is possible to capture 3D skeletal data without the complex installation procedure frequently required by motion capture (MoCap) systems. However, due to the required infrared projector, depth sensors are limited to indoor environments, have a low range of operation, and are not robust to occlusions, frequently resulting in noisy skeletons. To cope with the noisy skeletons, Spatio-Temporal LSTM networks [60] have been widely used to learn the reliability of skeleton sequences or as an attention mechanism [61], [62]. In addition to the skeleton data, multimodal approaches can also benefit from visual cues [63]. In that direction, pose-conditioned attention mechanisms have been proposed [64] to focus on image patches centered around the hands.

Since our architecture predicts precise 3D poses from RGB frames, we do not have to cope with the noisy skeletons from Kinect. Moreover, we show in the experiments that, despite being based on temporal convolutions instead of the more common LSTM, our system is able to reach state-of-the-art performance on 3D action recognition, indicating that action recognition does not necessarily require long term memory.

3 PROPOSED MULTI-TASK APPROACH
The goal of the proposed method is to jointly handle human pose estimation and action recognition, prioritizing the use of predicted poses for action recognition and benefiting from shared computations between the two tasks. For convenience, we define the input of our method as either a still RGB image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ or a video clip (sequence of images) $\mathbf{V} \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ is the number of frames in a video clip and $H \times W$ is the frame size. This distinction is important because we handle pose estimation as a single frame problem. The outputs of our method for each frame are the predicted human pose $\hat{\mathbf{p}} \in \mathbb{R}^{N_j \times 3}$ and a per body joint confidence score $\hat{\mathbf{c}} \in \mathbb{R}^{N_j \times 1}$, where $N_j$ is the number of body joints. When taking a video clip as input, the method also outputs a vector of action probabilities $\hat{\mathbf{a}} \in \mathbb{R}^{N_a \times 1}$, where $N_a$ is the number of action classes. To simplify notation, in this section we omit batch normalization layers and ReLU activations, which are used in between convolutional layers as a common practice in deep neural networks.

3.1 Network Architecture
Differently from our previous work [9], where poses and actions are predicted sequentially, here we want to strengthen the multi-task aspect of our method by predicting and refining poses and actions in parallel. This is implemented by the proposed architecture, illustrated in Fig. 2. Input images are fed through the entry-flow, which extracts low level visual features.
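To make the notation above concrete, the snippet below sketches, at the level of tensor shapes only, the two input modes and the per-frame outputs defined in this section; the helper name multitask_outputs and the example values of Nj, Na, and T are ours and serve purely as an illustration, not as the actual model.

```python
import numpy as np

H, W = 256, 256        # input resolution used in the experiments (Section 4.2.2)
Nj, Na, T = 16, 15, 8  # example numbers of body joints, action classes, clip frames

def multitask_outputs(inputs):
    """Shape-level sketch of the interface described in Section 3."""
    if inputs.ndim == 3:              # still image I in R^{H x W x 3}
        frames = inputs[None]         # handled as a one-frame clip
    else:                             # video clip V in R^{T x H x W x 3}
        frames = inputs
    t = frames.shape[0]
    outputs = {
        "poses": np.zeros((t, Nj, 3)),        # per-frame pose (x, y, z) per joint
        "confidences": np.zeros((t, Nj, 1)),  # per-joint confidence scores
    }
    if inputs.ndim == 4:              # action probabilities only for video clips
        outputs["actions"] = np.full((Na,), 1.0 / Na)
    return outputs

# A still image yields poses and confidences; a clip additionally yields actions.
still_out = multitask_outputs(np.zeros((H, W, 3)))
clip_out = multitask_outputs(np.zeros((T, H, W, 3)))
```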
Fig. 3. Network elementary units: in (a) residual unit (RU), in (b) downscaling unit (DU), and in (c) upscaling unit (UU). $N_f^{in}$ and $N_f^{out}$ represent the input and output number of features, $H_f \times W_f$ is the feature map size, and $k$ is the filter size.
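Fig. 3 only names the elementary units, so the sketch below gives one plausible PyTorch instantiation of a downscaling unit (DU) that halves the spatial resolution and an upscaling unit (UU) that doubles it. The exact layer composition (depthwise-separable 5x5 convolutions, max pooling, nearest-neighbor upsampling) is our assumption, loosely based on Section 4.2.2, and not the implementation from the paper's repository.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise-separable convolution followed by batch norm and ReLU
    (cf. the 5x5 separable convolutions mentioned in Section 4.2.2)."""
    def __init__(self, n_in, n_out, k=5):
        super().__init__()
        self.depthwise = nn.Conv2d(n_in, n_in, k, padding=k // 2, groups=n_in)
        self.pointwise = nn.Conv2d(n_in, n_out, 1)
        self.bn = nn.BatchNorm2d(n_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class DownscalingUnit(nn.Module):
    """DU sketch: (N_f_in, H_f, W_f) -> (N_f_out, H_f/2, W_f/2)."""
    def __init__(self, n_in, n_out, k=5):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.conv = SeparableConv2d(n_in, n_out, k)

    def forward(self, x):
        return self.conv(self.pool(x))

class UpscalingUnit(nn.Module):
    """UU sketch: (N_f_in, H_f, W_f) -> (N_f_out, 2*H_f, 2*W_f)."""
    def __init__(self, n_in, n_out, k=5):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = SeparableConv2d(n_in, n_out, k)

    def forward(self, x):
        return self.conv(self.up(x))

# Example: a level-1 feature map of size 32x32x288 (Section 4.2.2).
x = torch.randn(1, 288, 32, 32)
y = DownscalingUnit(288, 384)(x)  # -> (1, 384, 16, 16)
z = UpscalingUnit(384, 288)(y)    # -> (1, 288, 32, 32)
```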
[44], [45], [47] by taking five subjects for training (S1, S5, S6, S7, S8) and evaluating on two subjects (S9, S11) on one every 64 frames. We use ground truth person bounding boxes for a fair comparison with previous methods on single person pose estimation. We report results using a single cropped bounding box per sample.

On action recognition, we report results using the percentage of correct action classification score. We use the proposed evaluation protocol for Penn Action [49], splitting the data as 50/50 for training/testing, and the more realistic cross-subject scenario for NTU, on which 20 subjects are used for training and the remaining are used for testing. Our method is evaluated on single-clip and/or multi-clip settings. In the first case, we crop a single clip with T frames in the middle of the video. In the second case, we crop multiple video clips temporally spaced by T/2 frames from one another, and the final predicted action is the average decision among all clips from one video.
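As a small illustration of the multi-clip protocol described above, the sketch below crops clips of T frames spaced by T/2 frames and averages the per-clip class probabilities; the function and variable names are ours, and the per-clip predictor is a dummy placeholder.

```python
import numpy as np

def multi_clip_prediction(video, predict_clip, T=8):
    """Average the action decision over clips of T frames spaced by T/2 frames."""
    starts = range(0, max(len(video) - T, 0) + 1, T // 2)
    probs = [predict_clip(video[s:s + T]) for s in starts]
    return np.mean(probs, axis=0)  # final prediction: average over all clips

# Example with a dummy per-clip predictor, a 40-frame video, and 4 classes.
video = np.random.rand(40, 256, 256, 3)
predict_clip = lambda clip: np.full(4, 0.25)
print(multi_clip_prediction(video, predict_clip).shape)  # (4,)
```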
In our experiments, we consider two scenarios: A) 2D pose estimation and action recognition, on which we use respectively the MPII and Penn Action datasets, and B) 3D pose estimation and action recognition, using the MPII, Human3.6M, and NTU datasets.

4.2 Implementation and Training Details

4.2.1 Loss Function
For the pose estimation task, we train the network using the elastic net loss [70] on predicted poses:

$$L_p = \frac{1}{N_j} \sum_{j=1}^{N_j} \left( \lVert \hat{\mathbf{p}}_j - \mathbf{p}_j \rVert_1 + \lVert \hat{\mathbf{p}}_j - \mathbf{p}_j \rVert_2^2 \right), \qquad (11)$$

where $\hat{\mathbf{p}}_j$ and $\mathbf{p}_j$ are respectively the estimated and the ground truth positions of the $j$th body joint. The same loss is used for both 2D and 3D cases, but only the available values ($(x, y)$ for 2D and $(x, y, z)$ for 3D) are taken into account for backpropagation, depending on the dataset. We use poses in the camera coordinate system, with $(x, y)$ lying on the image plane and $z$ corresponding to the depth distance, normalized in the interval [0, 1], where the top-left image corner corresponds to (0, 0) and the bottom-right image corner corresponds to (1, 1). For depth normalization, the root joint is assumed to have $z = 0.5$, and a range of 2 meters is used to represent the remaining joints. If a given body joint falls outside the cropped bounding box during training, we set the ground truth confidence flag $c_j$ to zero, otherwise we set it to one. The ground truth confidence information is used to supervise the predicted joint confidence scores $\hat{\mathbf{c}}$ with the binary cross entropy loss. Despite giving additional information, the supervision on confidence scores has negligible influence on the precision of estimated poses. For the action recognition part, we use the categorical cross entropy loss on predicted actions.
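A compact PyTorch sketch of this loss and of the depth normalization described above is given next; zeroing the error of unavailable coordinates through a mask, as well as the millimeter convention in normalize_depth, are our assumptions about one reasonable implementation, not code taken from the paper's repository.

```python
import torch

def elastic_net_pose_loss(p_hat, p, valid):
    """Eq. (11): per-joint L1 plus squared L2 error, averaged over the N_j joints.

    p_hat, p : (N_j, 3) predicted and ground truth joint positions.
    valid    : (N_j, 3) mask with 1 where a coordinate is annotated
               (e.g., only (x, y) on 2D-only datasets) and 0 otherwise,
               so unavailable values do not contribute to the gradients.
    """
    err = (p_hat - p) * valid
    return (err.abs().sum(dim=-1) + (err ** 2).sum(dim=-1)).mean()

def normalize_depth(z_mm, root_z_mm, z_range_mm=2000.0):
    """Map absolute depths (assumed in millimeters) to [0, 1], with the root
    joint at z = 0.5 and a 2 meter range covering the remaining joints."""
    return 0.5 + (z_mm - root_z_mm) / z_range_mm

# Example with 16 joints where only (x, y) is annotated (2D data).
p_hat, p = torch.rand(16, 3), torch.rand(16, 3)
valid = torch.tensor([[1.0, 1.0, 0.0]] * 16)
loss = elastic_net_pose_loss(p_hat, p, valid)
```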
4.2.2 Network Architecture
Since the pose estimation part is the most computationally expensive, we chose to use separable convolutions with a kernel size of 5 × 5 for single frame layers and standard convolutions with a kernel size of 3 × 3 for video clip processing layers (action recognition layers). We performed experiments with the network architecture using 4 levels and up to 8 pyramids ($L = 4$ and $P = 8$). No further significant improvement was noticed on pose estimation by using more than 8 pyramids. On action recognition, this limit was observed at 4 pyramids. For that reason, when using the full model with 8 pyramids, the action recognition part starts only at the 5th pyramid, reducing the computational load.

In our experiments, we used normalized RGB images of size 256 × 256 × 3 as input, which are reduced to a feature map of size 32 × 32 × 288 by the entry flow network, corresponding to level $l = 1$. At each level, the spatial resolution is reduced by a factor of 2 and the number of features is arithmetically increased by 96. For action recognition, we used $N_v = 160$ and $N_v = 192$ features for Penn Action and NTU, respectively.

4.2.3 Multi-Task Training
For all the experiments, we first initialize the network by training pose estimation only, for about 32k iterations with mini batches of 32 images (equivalent to 40 epochs on MPII). Then, all the weights related to pose estimation are fixed and only the action recognition part is trained, for 2 and 50 epochs, respectively, for Penn Action and NTU. Finally, the full network is trained in a multi-task scenario, simultaneously for pose estimation and action recognition, until the validation scores plateau. Training the network on pose estimation for a few epochs provides a good general initialization and a better convergence of the action recognition part. The intermediate training stage of action recognition has two objectives: first, it is useful to allow a good initialization of the action part, since it is built on top of the pre-initialized pose estimator; and second, it is about 3 times faster than performing multi-task training directly, while resulting in similar scores. This process is especially useful for NTU, due to the large amount of training data. The training procedure takes about one day for the pose estimation initialization, then two/three days for the remaining process for Penn Action/NTU, using a desktop GeForce GTX 1080Ti GPU.

For the initialization on pose estimation, the network was optimized with RMSprop and an initial learning rate of 0.001. For action and multi-task training, we use RMSprop for Penn Action with the learning rate reduced by a factor of 0.1 after 15 and 25 epochs, and, for NTU, vanilla SGD with Nesterov momentum of 0.9 and an initial learning rate of 0.01, reduced by a factor of 0.1 after 50 and 55 epochs. We weight the loss on body joint confidence scores and action estimations by a factor of 0.01, since the gradients from the cross entropy loss are much stronger than the gradients from the elastic net loss on pose estimation. This parameter was chosen empirically, and we did not observe a significant variation in the results with slightly different values (e.g., with 0.02). Each iteration is performed on 4 batches of 8 frames, composed of random images for pose estimation and video clips for action. We train the model by alternating one batch containing pose estimation samples only and another batch containing action samples only. This strategy resulted in slightly better results compared to batches composed of mixed pose and action samples.
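The alternating-batch schedule and the 0.01 loss weighting described above can be illustrated with the toy PyTorch loop below; the linear model, the data functions, and the L1 stand-in for the elastic net loss are placeholders of ours, kept deliberately small so that only the scheduling logic is shown.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder for the multi-task network
opt = torch.optim.RMSprop(model.parameters(), lr=0.001)

def pose_batch():    # placeholder: random still images with pose targets
    return torch.randn(8, 10), torch.randn(8, 4)

def action_batch():  # placeholder: video clips with action labels
    return torch.randn(8, 10), torch.randint(0, 4, (8,))

pose_loss = nn.L1Loss()              # stand-in for the elastic net loss of Eq. (11)
action_loss = nn.CrossEntropyLoss()  # categorical cross entropy on actions
action_weight = 0.01                 # loss weight discussed in Section 4.2.3

for step in range(100):
    # Alternate one batch of pose-only samples with one batch of action-only
    # samples, instead of mixing both kinds of samples in the same batch.
    if step % 2 == 0:
        x, y = pose_batch()
        loss = pose_loss(model(x), y)
    else:
        x, y = action_batch()
        loss = action_weight * action_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```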
We augment training data by performing random rotations from −40° to +40°, scaling from 0.7 to 1.3, video temporal subsampling by a factor from 3 to 10, random horizontal flipping, and random color shifting. On evaluation, we also subsampled Penn Action/NTU videos by a factor of 6/8, respectively.

TABLE 1
Comparison With Previous Work on Human3.6M Evaluated Using the Mean Per Joint Position Error (MPJPE, in Millimeters) Metric on Reconstructed Poses

4.3 Evaluation on 3D Pose Estimation
Our results compared to previous approaches are shown in Table 1. Our multi-task method achieves the state-of-the-art average prediction error of 48.6 millimeters on Human3.6M for 3D pose estimation, improving our previous work [9] by 4.6 mm. Considering only the pose estimation task, our average error is 49.5 mm, 0.9 mm higher than the multi-task result, which shows the benefit of multi-task training for 3D pose estimation. For the activity "Sit down", which is the most challenging case, we improve previous methods (e.g., Yang et al. [47]) by 21 mm. The generalization of our method is demonstrated by qualitative results of 3D pose estimation for all datasets in Fig. 10. Note that a single model and a single training procedure were used to produce all the images and scores, including 3D pose estimation and 3D action recognition, as discussed in the following.

4.4 Evaluation on Action Recognition
For action recognition, we evaluate our method considering both 2D and 3D scenarios. For the first, a single model was trained using MPII for single frames (pose estimation) and Penn Action for video clips. In the second scenario, we use Human3.6M for 3D pose supervision, MPII for data augmentation, and NTU video clips for action. Similarly, a single model was trained for all the reported 3D pose and action results.

For 2D, the pose estimation was trained using mixed data from MPII (80 percent) and Penn Action (20 percent), using 16 body joints. Results are shown in Table 2. We reached the state-of-the-art action classification score of 98.7 percent on Penn Action, improving our previous work [9] by 1.3 percent. Our method outperformed all previous methods, including the ones using ground truth (manually annotated) poses.

For 3D, we trained our multi-task network using mixed data from Human3.6M (50 percent), MPII (37.5 percent) and NTU (12.5 percent) for pose estimation and NTU video clips for action recognition. Our results compared to previous methods are presented in Table 3. Our approach reached 89.9 percent of correctly classified actions on NTU, which is a strong result considering the hard task of classifying among 60 different actions in the cross-subject split. Our method improves previous results by at least 3.3 percent.

TABLE 2
Results for Action Recognition on Penn Action

Methods                  RGB   Optical Flow   Annot. poses   Estimated poses   Acc.
Nie et al. [49]           ✓         -              -               ✓           85.5
Iqbal et al. [3]          -         -              -               ✓           79.0
                          ✓         ✓              -               ✓           92.9
Cao et al. [51]           ✓         -              ✓               -           98.1
                          ✓         -              -               ✓           95.3
Du et al. [54]*           ✓         ✓              -               ✓           97.4
Liu et al. [57]†          ✓         -              ✓               -           98.2
                          ✓         -              -               ✓           91.4
Our previous work [9]     ✓         -              ✓               -           98.6
                          ✓         -              -               ✓           97.4
Ours (single-clip)        ✓         -              -               ✓           98.2
Ours (multi-clip)         ✓         -              -               ✓           98.7

Results are given as the percentage of correctly classified actions. Our method uses extra 2D pose data from MPII for training.
* Including UCF101 data; † using additional deep features.
TABLE 3
Comparison Results on NTU Cross-Subject for 3D Action Recognition

Methods                  RGB   Kinect poses   Estimated poses   Acc. cross-subject
Shahroudy et al. [69]     -         ✓               -                 62.9
Liu et al. [60]           -         ✓               -                 69.2
Song et al. [62]          -         ✓               -                 73.4
Liu et al. [61]           -         ✓               -                 74.4
Shahroudy et al. [63]     ✓         ✓               -                 74.9
Liu et al. [57]           ✓         -               ✓                 78.8
Baradel et al. [64]       -         ✓               -                 77.1
                          ✓         ✓*              -                 75.6
                          ✓         ✓               -                 84.8
Baradel et al. [71]       ✓         -               -                 86.6
Our previous work [9]     ✓         -               ✓                 85.5
Ours                      ✓         -               ✓                 89.9

Results are given as the percentage of correctly classified actions. Our method uses extra pose data from MPII and H36M for training.
* Ground truth poses used on test to select visual features.

TABLE 5
Results With Pose and Appearance Features Alone, Combined Pose and Appearance Features, and Decoupled Poses

Action features               MPII val. PCKh   Penn Action Acc.
Pose features only                 84.9              97.7
Appearance features only           85.2              97.9
Combined                           85.1              98.1
Combined + decoupled poses         85.4              98.2

Experiments with a Multi-PB network with P = 2 and L = 4.

TABLE 4
The Influence of the Network Architecture on Pose Estimation and on Action Recognition, Evaluated Respectively on MPII Validation Set ([email protected], Single-Crop) and on Penn Action (Classification Accuracy, Single-Clip)

Single-PB are indexed by pyramid p and level l, and P and L represent the total number of pyramids and levels on the Multi-PB scheme.

Fig. 7. Two sequences of RGB images (top), predicted supervised poses (middle), and decoupled action poses (bottom).

TABLE 6
Results Comparing the Effect of Single and Multi-Task Training for Action Recognition

Training protocol                    Penn Action Acc.   NTU Acc.
Single-task (action only)                  87.5           88.0
Multi-task (same dataset)                  97.4            –
Multi-task (+MPII +H36M for 3D)            98.2           89.9

TABLE 7
Results on All Tasks With the Proposed Multi-Task Model Compared to Recent Approaches Using RGB Images and/or Estimated Poses on MPII PCKh Validation Set (Higher is Better), Human3.6M MPJPE (Lower is Better), Penn Action and NTU RGB+D Action Classification Accuracy (Higher is Better)

Fig. 9. Inference speed of the proposed method considering 2D (a) and 3D (b,c) scenarios. A single multi-task model was trained for each scenario. The trained models were cut a posteriori for inference analysis. Markers with gradient colors from purple to red represent respectively network inferences from faster to slower.
Fig. 10. Predicted 3D poses from RGB images for both 2D and 3D datasets.
inference speed. Note that our method is the only one to perform both pose and action estimation in a single prediction, while achieving state-of-the-art results at a very high speed.

5 CONCLUSION
In this work, we presented a new approach for human pose estimation and action recognition using multi-task deep learning. The proposed method for 3D pose provides highly precise estimations with low resolution feature maps and departs from requiring the expensive volumetric heat maps by predicting specialized depth maps per body joint. The proposed CNN architecture, along with the pose regression method, allows multi-scale pose and action supervision and re-injection, resulting in a highly efficient densely supervised approach. Our method can be trained with mixed 2D and 3D data, benefiting from precise indoor 3D data, as well as "in-the-wild" images manually annotated with 2D poses. This has demonstrated significant improvements for 3D pose estimation. The proposed method can also be trained with single frames and video clips simultaneously and in a seamless way.

More importantly, we show that the hard problem of multi-tasking human poses and action recognition can be handled by a carefully designed architecture, resulting in a better solution for each task than learning them separately. In addition, we show that jointly learning human poses results in a consistent improvement of action recognition. Finally, with a single training procedure, our multi-task model can be cut at different levels for pose and action predictions, resulting in a highly scalable approach.

ACKNOWLEDGMENTS
This work was partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) – Grant 233342/2014-1.

REFERENCES
[1] G. Cheron, I. Laptev, and C. Schmid, "P-CNN: Pose-based CNN features for action recognition," in Proc. ICCV, 2015, pp. 3218–3226.
[2] A. Yao, J. Gall, and L. Van Gool, "Coupled action recognition and pose estimation from multiple views," Int. J. Comput. Vis., vol. 100, no. 1, pp. 16–37, Oct. 2012.
[3] U. Iqbal, M. Garbade, and J. Gall, "Pose for action - action for pose," in Proc. 12th IEEE Int. Conf. Autom. Face & Gesture Recognit., 2017, pp. 438–445.
[4] I. Kokkinos, "Ubernet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5454–5463.
[5] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2923–2932.
[6] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose motion representation for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7024–7033.
[7] D. C. Luvizon, H. Tabia, and D. Picard, "Human pose regression by combining indirect part detection and contextual information," Comput. Graph., vol. 85, pp. 15–22, 2019.
[8] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 467–483.
[9] D. C. Luvizon, D. Picard, and H. Tabia, "2D/3D pose estimation and action recognition using multitask deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5137–5146.
[10] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris, "3D human pose estimation: A review of the literature and analysis of covariates," Comput. Vis. Image Understanding, vol. 152, no. Supplement C, pp. 1–20, 2016.
[11] S. Herath, M. Harandi, and F. Porikli, "Going deeper into action recognition: A survey," Image Vis. Comput., vol. 60, no. Supplement C, pp. 4–21, 2017.
[12] M. Andriluka, S. Roth, and B. Schiele, "Pictorial structures revisited: People detection and articulated pose estimation," in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 1014–1021.
[13] M. Dantone, J. Gall, C. Leistner, and L. V. Gool, "Human pose estimation using body parts dependent joint regressors," in Proc. Comput. Vis. Pattern Recognit., 2013, pp. 3041–3048.
[14] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in Proc. Comput. Vis. Pattern Recognit., 2013, pp. 588–595.
[15] G. Ning, Z. Zhang, and Z. He, "Knowledge-guided deep fractal neural networks for human pose estimation," IEEE Trans. Multimedia, vol. 20, no. 5, pp. 1246–1259, May 2018.
[16] I. Lifshitz, E. Fetaya, and S. Ullman, Human Pose Estimation Using Deep Consensus Voting. Cham, Switzerland: Springer, 2016, pp. 246–260.
[17] L. Pishchulin et al., "DeepCut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4929–4937.
[18] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 34–50.
[19] U. Rafi, I. Kostrikov, J. Gall, and B. Leibe, "An efficient convolutional network for human pose estimation," in Proc. Brit. Mach. Vis. Conf., 2016, vol. 1, Art. no. 2.
[20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4724–4732.
[21] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, "Robust optimization for deep regression," in Proc. Int. Conf. Comput. Vis., 2015, pp. 2830–2838.
[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 648–656.
[23] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proc. Comput. Vis. Pattern Recognit., 2014, pp. 1653–1660.
[24] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman, "Deep convolutional neural networks for efficient pose estimation in gesture videos," in Proc. Asian Conf. Comput. Vis., 2014, pp. 538–552.
[25] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 717–732.
[26] G. Gkioxari, A. Toshev, and N. Jaitly, "Chained predictions using convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 728–743.
[27] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 483–499.
[28] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1831–1840.
[29] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1290–1299.
[30] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[31] I. Goodfellow et al., "Generative adversarial nets," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[32] C. Chou, J. Chien, and H. Chen, "Self adversarial training for human pose estimation," in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., 2017, pp. 17–30.
[33] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang, "Adversarial posenet: A structure-aware convolutional network for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1221–1230.
[34] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, "Human pose estimation with iterative error feedback," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4733–4742.
[35] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis, "Monocap: Monocular human motion capture using a CNN coupled with a geometric prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 901–914, Apr. 2017.
[36] D. Tome, C. Russell, and L. Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5689–5698.
[37] J. Martinez, R. Hossain, J. Romero, and J. J. Little, "A simple yet effective baseline for 3D human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2659–2668.
[38] B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua, "Fusing 2D uncertainty and 3D cues for monocular body pose estimation," CoRR, vol. abs/1611.05708, 2016. [Online]. Available: http://arxiv.org/abs/1611.05708
[39] D. Mehta et al., "Monocular 3D human pose estimation in the wild using improved CNN supervision," CoRR, vol. abs/1611.09813, pp. 506–516, Oct. 2017.
[40] A.-I. Popa, M. Zanfir, and C. Sminchisescu, "Deep multitask architecture for integrated 2D and 3D human sensing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4714–4723.
[41] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, Jul. 2014.
[42] D. Mehta et al., "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Trans. Graph., vol. 36, 2017, Art. no. 4.
[43] C.-H. Chen and D. Ramanan, "3D human pose estimation = 2D pose estimation + matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5759–5767.
[44] X. Sun, J. Shang, S. Liang, and Y. Wei, "Compositional human pose regression," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2621–2630.
[45] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3D human pose," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1263–1272.
[46] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 536–553.
[47] W. Yang, W. Ouyang, X. Wang, J. S. J. Ren, H. Li, and X. Wang, "3D human pose estimation in the wild by adversarial learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5255–5264.
[48] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, "Hand pose estimation via latent 2.5D heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 125–143.
[49] B. Xiaohan Nie, C. Xiong, and S.-C. Zhu, "Joint action recognition and pose estimation from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1293–1301.
[50] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, "Towards understanding action recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3192–3199.
[51] C. Cao, Y. Zhang, C. Zhang, and H. Lu, "Body joint guided 3D deep convolutional descriptors for action recognition," IEEE Trans. Cybern., vol. 48, no. 3, pp. 1095–1108, Mar. 2018.
[52] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4724–4733.
[53] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, Jun. 2017.
[54] W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3745–3754.
[55] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity recognition from unstructured feature points," in Proc. Comput. Vis. Pattern Recognit., 2018, pp. 469–478.
[56] D. Wang, W. Ouyang, W. Li, and D. Xu, "Dividing and aggregating network for multi-view action recognition," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 457–473.
[57] M. Liu and J. Yuan, "Recognizing human actions as the evolution of pose estimation maps," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1159–1168.
[58] D. C. Luvizon, H. Tabia, and D. Picard, "Learning features combination for human action recognition from skeleton sequences," Pattern Recognit. Lett., vol. 99, pp. 13–20, 2017.
[59] L. L. Presti and M. L. Cascia, "3D skeleton-based human action classification: A survey," Pattern Recognit., vol. 53, pp. 130–147, 2016.
[60] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 816–833.
[61] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, "Global context-aware attention LSTM networks for 3D action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3671–3680.
[62] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4263–4270.
[63] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, "Deep multimodal feature analysis for action recognition in RGB+D videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1045–1058, May 2017.
[64] F. Baradel, C. Wolf, and J. Mille, "Pose-conditioned spatio-temporal attention for human action recognition," CoRR, vol. abs/1703.10106, 2017. [Online]. Available: http://arxiv.org/abs/1703.10106
[65] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1800–1807.
[66] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "A new representation of skeleton sequences for 3D action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4570–4579.
[67] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3686–3693.
[68] W. Zhang, M. Zhu, and K. G. Derpanis, "From actemes to action: A strongly-supervised representation for detailed action understanding," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2248–2255.
[69] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1010–1019.
[70] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Statist. Soc. Ser. B, vol. 67, pp. 301–320, 2005.
[71] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity recognition from unstructured feature points," in Proc. Comput. Vis. Pattern Recognit., 2018, pp. 469–478.

Diogo Carbonera Luvizon received the BSc degree in electrical engineering and the MSc degree in image processing and graphics from the Federal University of Technology - Paraná (Brazil), in 2015, and the PhD degree in computer science from Cergy Paris Université, France, in 2019. His main research interests include machine learning and deep learning algorithms for computer vision, humans, and 3D scene understanding.

David Picard received the MSc degree in electrical engineering, in 2005, the PhD degree in image and signal processing, in 2008, and the Habilitation in computer science, in 2017. He joined the ETIS laboratory at ENSEA Graduate School (France) in 2010 as an associate professor. Since 2019, he is a senior research scientist at École des Ponts ParisTech (France). His research interests include computer vision and machine learning, with a focus on kernel methods, deep learning, and distributed learning.

Hedi Tabia received the MS degree in computer science from the INSA of Rouen - Public school of engineers, France, in 2008, and the PhD degree in computer science from the University of Lille, in 2011. From October 2011 to August 2012, he held a postdoctoral research associate position at the IEF laboratory (University of Paris-Sud). During 2012-2019, he was an associate professor at the ENSEA. Since September 2019, he is a professor with Université Paris-Saclay.