
Author’s Accepted Manuscript

Nrityabodha: Towards understanding Indian classical dance using a deep learning approach

Aparna Mohanty, Pratik Vaishnavi, Prerana Jana, Anubhab Majumdar, Alfaz Ahmed, Trishita Goswami, Rajiv R. Sahay

www.elsevier.com/locate/image

PII: S0923-5965(16)30084-4
DOI: http://dx.doi.org/10.1016/j.image.2016.05.019
Reference: IMAGE15097
To appear in: Signal Processing: Image Communication
Received date: 15 October 2015
Revised date: 26 May 2016
Accepted date: 29 May 2016

Cite this article as: Aparna Mohanty, Pratik Vaishnavi, Prerana Jana, Anubhab Majumdar, Alfaz Ahmed, Trishita Goswami and Rajiv R. Sahay, Nrityabodha: Towards understanding Indian classical dance using a deep learning approach, Signal Processing: Image Communication, http://dx.doi.org/10.1016/j.image.2016.05.019
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
Nrityabodha: Towards understanding Indian classical
dance using a deep learning approach

Aparna Mohanty
Computational Vision Laboratory
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, India

Pratik Vaishnavi
Department of Electronics and Communication Engineering
Sardar Vallabhbhai National Institute of Technology, Surat, India

Prerana Jana, Anubhab Majumdar, Alfaz Ahmed, Trishita Goswami
Department of Computer Science and Technology
Indian Institute of Engineering Science and Technology, Shibpur, Kolkata, India

Rajiv R. Sahay
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, India

Abstract

Indian classical dance has existed for over 5000 years and is widely practised
and performed all over the world. However, the semantic meaning of the dance
gestures and body postures, as well as the intricate steps accompanied by music
and the recital of poems, is fully understood only by connoisseurs. The common
masses who watch a concert rarely appreciate or understand the ideas conveyed
by the dancer. Can machine learning algorithms aid a novice in understanding the
semantic intricacies being expertly conveyed by the dancer? In this work, we
aim to address this highly challenging problem and propose deep learning based
algorithms to identify body postures and hand gestures in order to comprehend
the intended meaning of the dance performance. Specifically, we propose a
convolutional neural network and validate its performance on standard datasets
for poses and hand gestures as well as on constrained and real-world datasets
of classical dance. We use transfer learning to show that pre-trained deep
networks can reduce the time taken during training and also improve accuracy.

Interestingly, with experiments performed using the Kinect in constrained
laboratory settings and with data from Youtube, we show that it is possible to
identify the body poses and hand gestures of the performer and thus understand
the semantic meaning of the enacted dance piece.
Keywords: Deep learning, convolutional neural network, gesture recognition,
body pose estimation, Kinect, histogram-of-gradients.

1. Introduction

India is the home of ancient civilizations such as the Indus valley and Mohenjodaro/Harappa settlements dated around 6000 B.C. Indian classical dance (ICD) forms have existed since these ancient times, and their importance can be gauged from the patronage received from rulers and society in general since times immemorial. In fact, one of the most famous excavations of Mohenjodaro was the statuette of a dancing girl striking a sensuous pose. Temples of ancient and medieval India depict sculptures with intricate details of dance postures and, in particular, the Chidambaram temple in the southern Indian province of Tamil Nadu has preserved the postures of a popular classical dance form, Bharatnatyam. The Natya Shastra is the most celebrated and comprehensive treatise encompassing the performing arts of dance, theatre and music. It is widely believed to be 2000 years old, with detailed instructions outlining the grammar and rules associated with classical dance, theatre, music and virtually every aspect of stagecraft. With the passage of time, various artistes belonging to disparate schools of art/dance have given their own interpretations of the basic rules outlined in the Natya Shastra. However, there exists, in particular, a set of 108 dance postures named Karanas in the original Natya Shastra, enshrined in the Chidambaram temple [1], which was constructed around the 12th century A.D. These dance poses are depicted by performers of Bharatnatyam even today, as shown in Fig. 1.
Figure 1: (a) The Nataraaj posture [1]. (b) A sculpture depicting the Nataraaj pose. (c) The vartita Karana [1]. (d) A sculpture depicting the same Karana.

As an example, we show the Nataraaj pose depicted by a dancer in Fig. 1 (a) and the corresponding sculpture in Fig. 1 (b). Fig. 1 (c) depicts the vartita Karana from the Bharatnatyam dance form enacted by a dancer. On the walls of an ancient temple, a sculpture shows the same Karana as in Fig. 1 (c). These body postures used by the dancer convey semantic meaning and are accompanied by the recital of a poem/song set to music. In this work, we seek to identify a subset of the 108 Karanas from data recorded in the laboratory using the Kinect sensor as well as real-world data from videos of dance performances obtained from Youtube.
However, this is not an easy task, since the dancer wears long dresses which occlude the legs, and camera viewpoint changes affect classification performance. Compounding this is the presence of multiple dancers on the stage, which causes occlusions. Clutter in the background also causes pose estimation algorithms to produce erroneous results. To illustrate these challenges, we show the performance of state-of-the-art pose estimation algorithms [2, 3] on a few images from our dataset of dance poses. The result of the approach proposed by Ramanan et al. [2] for pose estimation is depicted in Fig. 2 (a). As can be seen in Fig. 2 (a), the skeleton estimated by the method in [2] does not correctly reflect the left leg of the dancer. The result of another recent technique using tree models for pose estimation proposed by Wang et al. [3] on our dataset is depicted in Fig. 2 (b). The cluttered background and the complex posture of the dancer in Fig. 2 (b) result in an inaccurate estimation of the skeletal configuration when the method of [3] is used. These results suggest that the existing state-of-the-art approaches are not able to estimate pose correctly on the dance posture dataset due to occlusions, clothing on the body of the dancer, clutter in the background etc. Hence, in this work we adopt an image recognition approach using deep learning.

Figure 2: (a) Failure of a state-of-the-art approach [2] on one image from our dataset. (b) Failure of the pose estimation approach of [3] on another image from our dataset.
Dances are accompanied by song and governed by strict rules. To comprehend the meaning of a dance, it is necessary to interpret hand gestures in addition to body posture. The Natya Shastra mentions 28 single hand gestures or Asamyukta Hastah Mudras: Pataaka, Tripataaka, Ardhapataaka, Kartarimukha, Mayuram, Ardhachandram, Araalam, Shukatunda, Mushthi, Shikhara, Kapitta, Katakaamukha, Suchi, Chandrakalaa, Padmakosha, Sarpashirsha, Mrigashirsha, Simhamukha, Kangula, Alapadma, Chatura, Bhramara, Hamsasye, Hansapakshika, Sandamsha, Mukula, Tamrachuda, Trishula. Furthermore, over time four new hand gestures were added to this list, namely, Kataka, Vyagraha, Ardhasuchi and Palli.

Unlike these single hand gestures, Samyukta Hastah mudras require the use of both palms to convey a message or a particular meaning. There are 24 double hand gestures described in the Natya Shastra: Anjali, Kapota, Karkata, Swastika, Dola, Pushpaputa, Utsanga, Shivalinga, Kataka-vardhana, Kartari-swastika, Shakata, Shankha, Chakra, Pasha, Kilaka, Samputa, Matsya, Kurma, Varaha, Garuda, Nagabandha, Khatava, Bhairunda, Avahitta. Both single and double hand gestures are used by the dancer to convey the meaning of the poem/song which is being enacted.
Recently, deep learning has emerged as a powerful paradigm for complex machine learning tasks such as object/image recognition [4, 5], handwritten character recognition [6] etc. The ability of deep learning algorithms to abstract appropriate features from images for classification is exemplary, and so we are motivated to use it for both pose and hand gesture identification. Firstly, we create a dataset collected using the Kinect sensor in laboratory settings containing images of twelve dance postures of ICD and propose a convolutional neural network (CNN) to classify them. This collection of images is named the Kinect (Laboratory) pose data. Next, using videos from Youtube, we show that a trained CNN model can recognize a subset of fourteen body postures (Karanas) with high accuracy. The images corresponding to these Karanas are collectively referred to as the Youtube (RGB) pose dataset.

Images of a subset of hand gestures pertaining to ICD are captured under controlled laboratory settings and grouped into the Computational Vision Lab single hand (CVLSH) and Computational Vision Lab double hand (CVLDH) gesture datasets. We train the proposed CNN model on both the CVLSH and CVLDH gesture datasets. We show that it is possible to identify with high accuracy the 10 single hand gestures comprising the CVLSH dataset and 9 Asamyukta Hastah mudras from a dataset of Youtube videos. For the case of double hand gestures, we show the recognition performance for the 14 Samyukta Hastah mudras of the CVLDH dataset. For real-world Youtube videos of actual concerts, we estimate identification rates of 7 double hand gestures (Samyukta Hastah mudras). We also conduct several experiments using our trained CNN model over three standard single hand gesture datasets [7, 8, 9]. It is observed that the proposed CNN with a simple architecture is able to perform well over datasets of hand gestures with both uniform and complex backgrounds [7, 8, 9].

We compare the traditional approach of using hand-crafted features such as the scale-invariant feature transform (SIFT), speeded up robust features (SURF), binary robust invariant scalable keypoints (BRISK) and histogram-of-oriented gradients (HoG), used with SVM-based classification, against the proposed CNN model. For this purpose we used both the Kinect (Laboratory) pose and Youtube (RGB) pose datasets as well as the CVLSH, CVLDH and real-world hand gesture datasets in our experiments.
Since we have limited labeled data (for both body poses and hand gestures), we used transfer learning by resorting to pre-trained models which are trained with a large labeled dataset such as MNIST [10] or CIFAR-10 [11]. Hence, beginning from random initialization of the network parameters, we obtain a good initial estimate by pre-training over the MNIST [10]/CIFAR-10 [11] datasets, and we then use this pre-trained model to facilitate faster training and avoid over-fitting on our pose and hand gesture databases. We observed much faster convergence during training as well as improved generalization and higher accuracy by using transfer learning.

Once the classifiers are trained, we use them to understand the semantic meaning of real-world dance performances from Hindu mythology. Specifically, to demonstrate the possibility of parsing and understanding the meaning of dance pieces depicting Shlokas (couplets/short poems), we show how our system can recognize the body postures and hand gestures enacted by the performer using videos from Youtube. Note that in this work we aim to identify the body postures and hand gestures of the dancer independently of each other. Indian classical dance uses an amalgamation of postures, hand gestures, facial expressions, movement of the eye pupils, neck, torso and sophisticated motion of the entire body to present an extravagant spectacle of mythical or even contemporary themes. We recognize that conveying a complete semantic experience of the dance performances is beyond the scope of this work.

Presently, the proposed scheme is semi-automated in the sense that the pre-processing steps for extraction of images containing dance postures from real-world dance videos have to be performed offline. As part of future work, we seek to extend our method to become fully automatic so that continuous parsing of video data can be done, with an explanation of the semantic meaning to the user to enable him/her to obtain a richer understanding of the dance performance.
The primary contributions of this paper are summarized as follows:

• We created a dataset of poses using the Kinect sensor in laboratory settings and also from real-world Youtube videos. We proposed a CNN for classification of these postures from ICD.

• We also created datasets for both single hand and double hand gestures under controlled settings in the laboratory as well as in uncontrolled scenarios from real-world Youtube videos. The proposed CNN architecture is shown to perform well for classification of these hand gestures.

• We show the superiority of the CNN over shallow learning approaches using hand-crafted features such as HoG, SIFT, SURF and BRISK. We used the proposed datasets of body poses and hand gestures to obtain comparison results.

• We show faster convergence during training as well as improved generalization and higher accuracy by using transfer learning in the proposed CNN framework.

• Finally, we investigate the interesting aspect of understanding ICD from body postures and hand gestures using the proposed CNN-based approach.

This paper is organised as follows. Section 2 describes the prior work pertaining to the recognition of poses and hand gestures. The architecture details of the proposed CNN model are given in section 3. Section 4 gives a comparison of the experimental results obtained using shallow learning techniques and the proposed CNN-based approach on the Kinect (Laboratory) pose and Youtube (RGB) pose datasets. In this section we show failure cases of state-of-the-art pose estimation approaches such as [2] and [3] on the proposed pose datasets. We also demonstrate the impact of transfer learning on the proposed CNN-based framework in this section. The semantic interpretation of a shloka from the postures is described in section 5. This section also introduces the proposed hand gesture datasets along with comparison results between shallow learning and the proposed CNN-based approach. Here we present experimental results showing the effect of varying the number of classes as well as the utility of transfer learning. The semantic understanding of a shloka by recognition of hand gestures is demonstrated in section 6. Section 7 concludes the work presented.

2. Prior work

We place our work in the context of related works in the recent literature pertaining to both the identification of pose and the recognition of hand gestures. In the literature, works pertaining to dance involve dance form classification as well as recognition and estimation of the poses of the performer. A multimedia database retrieval system to preserve the living heritage of Indian classical dance is reported in [12]. However, unlike our work, they do not identify body postures or hand gestures to semantically understand the dance. For classification of ICD, a sparse representation based dictionary learning technique is proposed in [13]. Classification of folk dances is attempted in [14]. Activity recognition was attempted using a bag-of-words approach with a dataset consisting of Greek traditional dances [15].

There are very few significant works addressing the problem of recognition of body postures in ICD, but a vast literature on general pose identification of humans exists. Initial works for 2D pose estimation in the image/video domains appear in [16, 17]. Entire human shapes have been matched in [18]. Image segmentation has been used in [19], with detection of human skin color regions in [20]. A discriminatively trained, multi-scale, deformable parts based model for pose estimation is proposed in [21]. This idea is also used for object detection in [22]. Felzenszwalb et al. [23] describe a statistical framework for representing the visual appearance of objects composed of rigid parts arranged in a deformable configuration. Andriluka et al. [24] propose a generic approach for human detection and pose estimation based on the pictorial structures framework. The work of [25] proposed discriminative learning of visual words for 3D human pose estimation. In [26], an efficient model for pose estimation was formulated using higher order dependencies. Johnson et al. [27] proposed a scheme to achieve high quality annotation at low cost. An efficient method for pose estimation using tree models is given in [3]. A new hierarchical spatial model that can capture an exponential number of poses with a compact mixture representation is proposed in [28]. 2D human pose is estimated from still images by Dantone et al. [29] by proposing novel, nonlinear joint regressors.
Pischchulin et al. [30] gave a method for the automatic generation of training examples from an arbitrary set of images and proposed a new challenge of joint detection and pose estimation of multiple articulated people in cluttered sport scenes. Eichner et al. [31] estimate upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person. A novel approach for estimating articulated body posture and motion from monocular video sequences is proposed in [32]. A learning based method for recovering 3D human body pose from a single image and monocular image sequences is given by [33]. Human pose estimation in static images based on a novel representation of part models is proposed by [34]. The work in [35] proposes a conditional Bayesian mixture of experts Markov model for discriminative visual tracking.

Researchers have also used depth data for predicting human poses. Shotton et al. [36] proposed an efficient method to accurately predict human pose from a single depth image. The performance is further improved by using a feedback loop by Markus et al. [37]. CNNs have also been used by researchers for pose estimation. Real-time continuous pose recovery from a single depth map is attempted in [38] by extracting dense features using a CNN, followed by a decision forest classifier.

The various methods proposed for image/video based hand gesture identification can be divided roughly into four categories: (i) hidden Markov model (HMM) based [39, 40], (ii) neural network and learning based [41, 42], (iii) other methods such as graph-based [8, 43] or 3D-model based [44], and (iv) model based optimization approaches [45]. A complete survey of all the existing approaches for hand pose estimation is given in [46]. A system for person independent recognition of hand postures against complex backgrounds is given by [43]. A system for the classification of hand postures against complex backgrounds in grayscale images is presented in [8]. A hand gesture recognition system [47] claims real time performance in unconstrained environments. Pisharady et al. [48] proposed a system which utilizes a Bayesian model of visual attention to generate a saliency map for detecting and identifying the hand region. Very recently, hand gesture recognition using CNNs has been attempted in [49]. A simple five layer CNN proposed in [49] is used to classify seven different grasp types. However, [49] does not exploit transfer learning or use dropout [50].
Various optimization approaches have been considered for tracking hands. The work in [45] recovers and tracks the 3D position, orientation and articulation of a human hand using a Kinect sensor. However, the work in [51] eliminates the need for any external wearable hardware. Sharp et al. [52] have used a single depth camera for accurately and robustly tracking hands in real-time over a significant range of distances with arbitrary camera placements. A fast method for accurate tracking of hands and pose using a single depth camera is proposed in [53]. A new dataset and an approach to accurately study each feature without a full tracking pipeline is proposed by Tzionas et al. [54]. Bray et al. [55] track hands without any constraint on the degrees of freedom.

Keskin et al. [56] used depth sensors to overcome problems associated with vision based articulated hand pose estimation and hand shape classification. The work in [56] used a randomized decision forest based hand shape classifier for articulated hand pose estimation. Xu et al. [57] deal with hand pose estimation using a single noisy depth map. A structured approach for locating all skeletal joints guided by a latent tree model is proposed by [58]. The work in [59] estimates the motion of an articulated object filmed by two or more fixed low quality cameras, given only an approximation of the geometric model of the tracked object. The approach of [60] simultaneously handles initialization, tracking, and recovery of hand motion with no prior information of the hand pose. A scale, translation and rotation-invariant approach for recognizing various single hand gestures of a dancer is proposed by Hariharan et al. [61]. However, [61] does not address the problem of estimating the pose of the dancer.

Our work bears some overlap with the area of fine-grained activity recognition [62, 63]. Although a significant amount of literature exists for the general problem of human body pose estimation and hand gesture identification, semantic understanding of the poses and gestures in ICD has not received enough attention. Recently, research on fine-grained activity recognition has gathered momentum, with various works [62, 63] proposing datasets for semantic activities, e.g. those involved during cooking [62]. We believe that our work is the first to address the highly challenging problem of semantically understanding ICD using a computer vision approach.

3. Deep learning framework: Convolutional Neural Network

Convolutional neural networks, originally proposed by LeCun [6], have been shown to be accurate and versatile for several challenging real-world machine learning problems [5, 6]. According to LeCun [4, 6], CNNs can be effectively trained to recognize objects directly from their images with robustness to scale, shape, camera viewpoint, noise etc. This motivates us to use CNNs for our problem, since in real-world scenarios image data of body postures and hand gestures in ICD will be affected by such variations.
3.1. Architecture

The general architecture of the proposed CNN is shown in Fig. 3. Apart from the input and the output layers, it consists of two convolution and two pooling layers. The input is a 32 × 32 pixel image of a dance posture or hand gesture (single/double hand). The output layer consists of as many nodes as there are classes in the specific classification problem for which the CNN is being used.

Figure 3: Architecture of the proposed CNN model used for both pose and hand gesture classification.

As shown in Fig. 3, the input image of 32 × 32 pixels is convolved with 10 filter maps of size 5 × 5 pixels to produce 10 output maps of 28 × 28 pixels in layer 1. These feature maps are downsampled with max-pooling over 2 × 2 regions to yield 10 output maps of 14 × 14 pixels in layer 2. The 10 output maps of layer 2 are convolved with each of the 20 kernels of size 5 × 5 pixels to obtain 20 maps of size 10 × 10 pixels. These maps are further downsampled by a factor of 2 by max-pooling to produce the 20 output maps of size 5 × 5 pixels of layer 4. The output maps from layer 4 are concatenated to form a single vector during training and fed to the next layer. The number of neurons in the final output layer depends upon the number of classes in the database. The output neurons are fully connected by weights to the previous layer. Akin to the neurons in the convolutional layers, the responses of the output layer neurons are also modulated by a non-linear activation function to produce the resultant score for each class.
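For concreteness, the layer sizes above can be written down directly. The following is a minimal PyTorch sketch of this architecture; the paper itself uses MATLAB toolboxes [65, 66], and the single-channel input is our assumption.

import torch
import torch.nn as nn

class DancePoseCNN(nn.Module):
    # Two-convolution/two-pooling network of Fig. 3.
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),   # 32x32 input -> 10 maps of 28x28
            nn.ReLU(),                         # or nn.Sigmoid(); see section 3.2
            nn.MaxPool2d(2),                   # -> 10 maps of 14x14
            nn.Conv2d(10, 20, kernel_size=5),  # -> 20 maps of 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 20 maps of 5x5
        )
        self.classifier = nn.Linear(20 * 5 * 5, num_classes)  # fully connected output

    def forward(self, x):
        x = self.features(x)        # x: (N, 1, 32, 32)
        x = x.flatten(1)            # concatenate the 20 maps into a 500-dim vector
        return self.classifier(x)

model = DancePoseCNN(num_classes=12)        # e.g. the 12 Karanas
scores = model(torch.randn(4, 1, 32, 32))   # a batch of 4 images -> (4, 12) class scores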

3.2. Activation function

A layer of non-linear activation functions follows each of the convolutional layers in the proposed CNN depicted in Fig. 3. We use both the sigmoid and rectified linear unit (ReLU) activation functions in this work. It has been shown in the literature [64] that the ReLU non-linear activation function helps the network learn faster. We observed faster convergence during training and higher test set accuracy by using rectified linear units (ReLU) as the activation function instead of the non-linear sigmoid function before the sub-sampling layer in the architecture. We show in detail the advantage of using the ReLU activation function in Table 4.

3.3. Dropout

Sometimes the CNN suffers from over-fitting, in which case the training accuracy is high but the testing accuracy is poor. This can be ameliorated using dropout [50]. In our work, we show improvements in accuracy over the testing dataset by using dropout after the pooling layers of the proposed CNN shown in Fig. 3. Dropout makes the network avoid over-fitting by randomly dropping some nodes and their connections during training [50].
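As an illustration, a variant of the feature extractor above with dropout inserted after each pooling layer (and after the fully connected layer, mirroring the placement described in section 3.4.2) might look as follows; the dropout probabilities are our assumptions, since the paper does not report them.

import torch.nn as nn

# Feature extractor of Fig. 3 with a dropout layer after each pooling layer.
features = nn.Sequential(
    nn.Conv2d(1, 10, 5), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(p=0.25),
    nn.Conv2d(10, 20, 5), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(p=0.25),
)
# Third dropout layer placed after the fully connected layer, as in the paper.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(20 * 5 * 5, 10), nn.Dropout(p=0.5))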

3.4. Transfer learning

We note that CNNs use a supervised learning paradigm. As reported in the literature, using a pre-trained model aids the convergence of the network parameters as well as prevents overfitting. Hence, we used transfer learning with two large labeled datasets, namely, MNIST [10] and CIFAR-10 [11], to boost the performance of the proposed CNN model.

3.4.1. Pre-training using MNIST [10]

Since we have limited labeled training data, we pre-trained the proposed CNN from randomly initialized weights using MNIST [10], which contains 50,000 labeled training images of hand-written digits. The CNN is trained for 100 epochs with this data, yielding an MSE of 0.0034 and a testing accuracy of 99.08% over 10,000 images. We used the sigmoid activation function in our simulations with [65]. The converged weights of this trained network are used to initialize the weights of the CNN model to which our datasets of dance postures and hand gestures are fed as input, as discussed in sections 4.5 and 5.3, respectively. Interestingly, we observed much faster convergence during training with a pre-trained network and also improved accuracies on the test datasets using [65, 66].

3.4.2. Pre-training using CIFAR-10 [11]

We also pre-trained the proposed CNN from randomly initialized weights using CIFAR-10 [11], which contains 50,000 training images and 10,000 test images belonging to 10 different classes, namely, airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The CNN is trained for 500 epochs with three dropout layers. Two dropout layers are placed after the respective pooling layers in the proposed architecture. Note that the third dropout layer is placed after the fully connected layer in Fig. 3. We used the ReLU activation function to speed up convergence and to avoid over-fitting. This yields an MSE of 0.483 and a testing accuracy of 61.6%. Akin to the case wherein we used the MNIST data for pre-training, the converged weights of the CNN model trained on the CIFAR-10 [11] dataset are used to initialize the weights of the CNN to which we feed our datasets of dance postures and hand gestures, as described in sections 4.5 and 5.3, respectively. We again observed faster convergence during the training phase and improved accuracies on the test datasets using MatConvNet [66].

Note that the converged weights of the CNN obtained after pre-training on the MNIST [10] and CIFAR-10 [11] datasets are further fine-tuned using our datasets of dance postures and hand gestures. This procedure is further elucidated in sections 4.5 and 5.3, respectively.
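The fine-tuning step can be sketched as follows, reusing the DancePoseCNN class from section 3.1; the checkpoint file name and the decision to copy only the convolutional weights are our assumptions for illustration.

import torch

pretrained = DancePoseCNN(num_classes=10)   # 10 MNIST digit classes
pretrained.load_state_dict(torch.load("mnist_pretrained.pt"))  # hypothetical checkpoint

model = DancePoseCNN(num_classes=12)        # 12 Karanas for the pose data
model.features.load_state_dict(pretrained.features.state_dict())  # reuse converged conv weights
# The fully connected output layer is re-initialized for the new label set,
# and the whole network is then fine-tuned on the dance postures/hand gestures.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-6)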

4. Experimental results

4.1. Kinect (Laboratory) pose dataset: Handcrafted features

Initially, we show the performance of our classification methodology using the skeletal data, i.e., the joint configuration obtained from a Kinect sensor, for a constrained dataset covering a small subset of the 108 Karanas outlined originally in the Natya Shastra. We consider 12 Karanas performed by 7 different actors who repeat each body posture 12 times. For this dataset, created under controlled conditions in the laboratory, we captured the coordinates of the 20 joints tracked by the Kinect camera. In Fig. 4 we show the color images of the 12 poses with the corresponding HoG features. We also recorded the depth maps using the Kinect sensor for each dance pose enacted by all the dancers. This set of skeletal configurations, RGB images and depth maps is named the Kinect (Laboratory) pose dataset in this work.

We recorded the coordinates of all 20 joints tracked by Kinect and estimated the 19 joint angles made with respect to the hip centre. The entire database of 12 × 7 × 12 = 1008 images was split into a training set and a test set without any overlapping images between them. A support vector machine (SVM) classifier was trained with a linear kernel, and the classification performance on the Kinect (Laboratory) pose dataset consisting of 12 Karanas, with 864 training and 144 test images, is 95.83%.
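A sketch of this skeletal feature + linear SVM pipeline follows, assuming `joints` is an (N, 20, 3) array of Kinect joint coordinates with the hip centre at index 0 and `labels` holds the Karana class of each sample; the exact angle definition is our assumption, as the paper only states that 19 joint angles are measured with respect to the hip centre.

import numpy as np
from sklearn.svm import SVC

def joint_angle_features(joints):
    rel = joints[:, 1:, :] - joints[:, :1, :]    # vectors from hip centre to the other 19 joints
    return np.arctan2(rel[..., 1], rel[..., 0])  # one in-plane angle per joint -> (N, 19)

X = joint_angle_features(joints)
clf = SVC(kernel="linear").fit(X[:864], labels[:864])  # 864 training images
print("accuracy:", clf.score(X[864:], labels[864:]))   # 144 test images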

Figure 4: Twelve Karanas along with the HoG features derived from the silhouettes.

We used histogram-of-gradients (HoG) features extracted from the RGB images recorded by the camera in the Kinect sensor. For this purpose, we first segmented the dancer from the background using the technique of [67] and binarized the image to obtain the silhouette of the dancer. Each binarized frame is then resized to 100 × 200 pixels. Considering a 9-bin histogram of gradients over cells of 8 × 8 pixels, and blocks of 2 × 2 cells, we extracted HoG feature vectors of total length 9504 over a dense grid from the silhouette images. The discriminative nature of the HoG features corresponding to each dance posture is demonstrated in Fig. 4. It is observed that the HoG features capture the general shape of each of the postures. A support vector machine (SVM) classifier was trained with a linear kernel, and the classification performance on the Kinect (Laboratory) pose dataset consisting of 12 Karanas, with 720 training and 144 test images, is 86.11%, as shown in Table 1.

We also extracted HoG features from the depth maps recorded by Kinect corresponding to each body posture. The depth maps were re-sized to 100 × 200 pixels. Akin to the case of RGB images, we extracted HoG feature vectors of total length 9504 using a dense grid on the 100 × 200 depth maps. With 720 training and 144 test depth images, we obtained an accuracy of 84.1% using an SVM classifier with a linear kernel.
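A sketch of this HoG + linear SVM pipeline follows, assuming `silhouettes` is a list of binary silhouette images already resized to 200 rows × 100 columns (i.e., 100 × 200 pixels in width × height) and `labels` their Karana classes; both names are ours.

from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(img):
    # 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks over a dense grid:
    # (25 - 1) x (12 - 1) blocks x 4 cells x 9 bins = 9504 values, matching the paper.
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

X = [hog_features(s) for s in silhouettes]
clf = SVC(kernel="linear").fit(X[:720], labels[:720])
print("accuracy:", clf.score(X[720:], labels[720:]))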

Data                          | Kinect joints+SVM | HoG+SVM | Depth+HoG+SVM
Kinect (Laboratory) pose data | 95.83%            | 86.11%  | 84%
Youtube (RGB) pose data       | —                 | 88.89%  | —

Table 1: Performance using hand-crafted features on both the Kinect (Laboratory) and Youtube (RGB) databases of poses of Indian classical dance.

4.2. Youtube (RGB) pose data: Handcrafted features

We then performed experiments using a dataset obtained from frames of Youtube videos. We adopted a procedure similar to that used for the images of the Kinect (Laboratory) pose dataset and obtained silhouette images of the dance poses by segmentation using the GrabCut technique [67]. Using HoG features extracted from the binary silhouettes of 288 training images, we trained an SVM classifier. The recognition accuracy on 72 test images was 88.89%. The classification results using the above described hand-crafted features are summarized in Table 1. Since the joint locations and depth data are not available for the real-world dataset collected from Youtube, the corresponding entries in Table 1 are blank.

It is to be noted that, for all the above cases, the parameters of the SVM classifier were tuned using the standard cross-validation procedure [68].
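The segmentation step used for both pose datasets can be sketched with OpenCV's GrabCut implementation; the file name and the rectangle used to seed the algorithm are our assumptions, since the paper does not describe how [67] is initialized.

import cv2
import numpy as np

img = cv2.imread("frame.png")                    # hypothetical extracted video frame
mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)
fgd = np.zeros((1, 65), np.float64)
rect = (50, 20, 300, 440)                        # assumed bounding box around the dancer
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
silhouette = np.where(fg, 255, 0).astype(np.uint8)
silhouette = cv2.resize(silhouette, (100, 200))  # width x height, as in section 4.1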

4.3. CNN model: Pose data

4.3.1. Training phase: Kinect (Laboratory) pose dataset

The constrained database used for training the proposed CNN architecture shown in Fig. 3 consists of 864 images which were captured using the Kinect RGB camera, originally at 640 × 480 pixels resolution. We used images of 12 different poses, as shown in Fig. 4, enacted 12 times by 6 different volunteers. The training set is composed of 10 images of each pose by 6 different persons, leading to a total of 720 photos. The test set is made up of the remaining 144 images. There is no overlap between the training and the test datasets. All images are down-sampled to 32 × 32 pixels before being fed to the CNN.

The weights of the proposed CNN are trained from a random initialization by the conventional back-propagation method using the package in [65]. The learnable parameters of the proposed CNN model are the kernel and bias weights of the convolutional layers and the fully connected output layer. The total number of learnable network parameters in the proposed CNN architecture is 6282. We chose a batch size of 4 and a constant learning rate α throughout all the layers. The network is trained from random initial weights for 300 epochs. The variation of the mean squared error (MSE) versus epochs during the training phase is shown in Fig. 5. The blue curve represents the variation of the MSE for learning rate α = 0.5 and the orange plot represents the variation of the MSE for α = 1. The final MSE over the training set for α = 1 and α = 0.5 is 1.53 and 1.54, respectively. Since the learning rate α = 1 trained the network better, as seen in Fig. 5, we use this trained CNN to report the accuracy on the testing set in Table 2. The package [65] does not have the option to use the ReLU activation function. So, to enable a fair comparison regarding the effect of the sigmoid and ReLU activation functions, we used the package MatConvNet [66].

Figure 5: Mean squared error (MSE) versus epochs for the CNN trained on the Kinect (Laboratory) pose dataset.
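For reference, the training setup just described (MSE objective over one-hot targets, constant learning rate, small batches) can be sketched as follows in PyTorch; `model` is the network from section 3.1 and `loader` an iterator over the 720 training images, both assumptions for illustration.

import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)   # constant alpha = 1

for epoch in range(300):
    for images, labels in loader:                         # images: (4, 1, 32, 32)
        targets = nn.functional.one_hot(labels, 12).float()
        loss = criterion(model(images), targets)          # squared error vs. one-hot codes
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()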

4.3.2. Testing phase

For the testing phase, we give images from the test dataset as input to the trained CNN model. Given a test image, the label with the maximum score at the output layer of the CNN is chosen. The accuracy over 144 images is 97.22%, obtained using [65], as given in Table 2. We observe that the accuracy of the CNN is better than that of the ‘shallow’ learning modality of HoG features derived from images/depth maps used in conjunction with an SVM classifier (first row of Table 1). If we choose the sigmoid activation function in [66], we obtain an accuracy of 93.7% with a CNN trained for 2000 epochs using a learning rate α = 75 × 10−5, as shown in Table 3. The improvement in accuracy and time complexity obtained by using the ReLU activation function is shown in Table 4. Comparing the first rows of Tables 3 and 4, we observe that by using the ReLU activation function the testing accuracy is improved to 97.2% in 1000 epochs.

4.3.3. Training phase: Real-world data

We downloaded several dance videos from Youtube and extracted the dancer in each frame from the background using GrabCut [67]. The extracted frame is then re-sized to 100 × 200 pixels. We created a dataset of such real-world images for 14 different poses performed by 6 different dancers, extracting 15 frames per pose for each dancer. A snapshot of the 14 postures is depicted in Fig. 6. To create the training set, we used 12 frames per pose for each of the 6 performers, leading to 1008 images. The testing set consisted of the remaining 252 images. There is no overlap between the training and testing sets. All images were further re-sized to 32 × 32 pixels before being fed to the CNN.

Figure 6: A snapshot of fourteen Karanas extracted from Youtube videos.

The CNN model was trained for 200 epochs from random initial weights with batch size 6 and a constant learning rate α = 0.5 throughout all the layers using [65]. The variation of the MSE versus epochs during the training phase is shown in Fig. 7 for two different choices of α. The blue curve represents the variation of the MSE for a learning rate of α = 0.5, while the orange curve represents the MSE variation for a learning rate of α = 0.6. As can be seen from Fig. 7, training of the proposed CNN is better for α = 0.5. The final MSE yielded by the CNN during the training phase for α = 0.5 is 0.0258. Hence, this network was chosen to obtain the accuracy on the testing set of the real-world pose data.

Figure 7: MSE versus epochs for the CNN trained on the Youtube (RGB) pose data obtained from Youtube videos.

Data                          | α   | Batch size | Epochs | No. of classes | Training set | Testing set | Accuracy on testing set
Kinect (Laboratory) pose data | 1   | 5          | 300    | 12             | 720          | 144         | 97.22%
Youtube (RGB) pose data       | 0.5 | 4          | 200    | 14             | 1008         | 252         | 93.25%

Table 2: Performance of the proposed CNN with randomly initialized weights on both the Kinect (Laboratory) and Youtube (RGB) databases of poses of ICD using [65].

Data                          | α         | Batch size | Epochs | No. of classes | Training set | Testing set | Accuracy on testing set
Kinect (Laboratory) pose data | 75 × 10−5 | 10         | 2000   | 12             | 720          | 144         | 93.7%
Youtube (RGB) pose data       | 75 × 10−5 | 10         | 1500   | 14             | 1008         | 252         | 92.5%

Table 3: Performance of the proposed CNN (using weights initialized with random numbers) on both the Kinect (Laboratory) and Youtube (RGB) databases of poses of ICD with the sigmoid activation function in MatConvNet [66].

Data                          | α        | Batch size | Epochs | No. of classes | Training set | Testing set | Accuracy on testing set
Kinect (Laboratory) pose data | 5 × 10−6 | 10         | 1000   | 12             | 720          | 144         | 97.2%
Youtube (RGB) pose data       | 5 × 10−6 | 10         | 1000   | 14             | 1008         | 252         | 98.4%

Table 4: Performance of the proposed CNN (randomly initialized weights) on both the Kinect (Laboratory) and Youtube (RGB) datasets of poses of ICD with the ReLU activation function in MatConvNet [66].

4.3.4. Testing phase

The test set containing 252 images is given as input to the trained CNN, which yields an overall accuracy of 93.25%. The accuracy obtained for this larger dataset is superior to that obtained using HoG and SVM, as shown in the second row of Table 2. Hence, we observe that the proposed simple CNN architecture is deep enough to outperform the traditional ‘shallow’ machine learning techniques using handcrafted features. To visualize the features learnt automatically by the trained CNN model, we show in Fig. 8 (a) one input image from the real-world pose dataset. The filter kernels of the first layer of the convolutional neural network obtained after 200 epochs using the sigmoid activation function in [65] are shown in Fig. 8 (b). These kernels, when convolved with the original image of Fig. 8 (a), give the feature maps shown in Fig. 8 (c). As we had already demonstrated for the Kinect (Laboratory) pose dataset in subsection 4.3, we used the ReLU activation function in MatConvNet [66] for training the CNN with the real-world pose data. For the real-world pose data we also observed a reduction in the number of epochs needed for training the CNN and an improvement in accuracy over the test dataset with the use of ReLU (second rows of Tables 3 and 4).

Figure 8: (a) Original input image of a dance pose from the real-world dataset created using Youtube videos. (b) First layer filter kernels in the proposed CNN architecture using the sigmoid activation function in [65]. (c) Feature maps at the first convolutional layer of the CNN obtained by convolving the filter kernels in (b) with the input image in (a).

4.4. Comparison results with state-of-the-art pose estimation methods

The state-of-the-art approaches to pose estimation work well on datasets where there is little clutter, the viewpoint is frontal, and clothing does not occlude body parts. But our proposed Indian classical dance dataset is complex because of the variations in viewing angle, clutter in the background, multiple dancers on the stage etc. Hence, state-of-the-art approaches such as [2, 3] fail to perform on our dataset, as depicted in Figs. 9 and 10.

Figs. 9 (a) through (j) depict images from our Kinect (Laboratory) pose as well as Youtube (RGB) pose datasets illustrating the weakness of the existing state-of-the-art method of Ramanan et al. [2], which failed due to the complexity of the images. The method of Ramanan et al. [2] fails to estimate the pose accurately if there is occlusion due to clothing, as can be seen in Fig. 9 (e), where the dancer’s clothing occludes the leg. The effect of clutter in the image can be seen in Fig. 9 (h), where an idol on the stage affects the pose estimation badly. Similarly, Figs. 10 (a) through (d) show the output of another state-of-the-art approach by Wang et al. [3] on our proposed dataset. The approach of [3] fails on both our pose datasets due to the complexity involved in terms of clothing, occlusion, viewpoint variations etc.

Furthermore, in order to provide a fair comparison with the state-of-the-art pose estimation methods, we used a nearest neighbour classifier to identify the pose in both the Kinect (Laboratory) pose dataset and the Youtube (RGB) pose data from Youtube videos. A total of 14 significant joint coordinates are extracted from each image of the training sets of both pose datasets and used for training the nearest neighbour classifier. The test data consisted of the 14 coordinates extracted from the skeleton obtained from [2] for each image of the test sets of the Kinect (Laboratory) pose and Youtube (RGB) pose databases. Classification is performed by computing the minimum Euclidean distance between the joint coordinates of the skeletons estimated by [2] from the test images and the joint coordinates of the training data. For the Kinect (Laboratory) pose data the nearest neighbour classifier performed badly, giving an accuracy of only 16% for the pose estimation technique of [2]. The performance of the nearest neighbour classifier in identifying the dance poses in the Youtube (RGB) pose data is far worse, with an accuracy of 4% for the poses estimated by [2].
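A sketch of this nearest neighbour baseline follows, assuming `train_joints` holds the 14 (x, y) joint coordinates per training image, `test_joints` the coordinates recovered by the pose estimator of [2] on the test images, and the label arrays their pose classes; all variable names are ours.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 1-nearest-neighbour classification by minimum Euclidean distance
# between 28-dim vectors of flattened joint coordinates.
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(train_joints.reshape(len(train_joints), -1), train_labels)
pred = knn.predict(test_joints.reshape(len(test_joints), -1))
print("accuracy:", np.mean(pred == test_labels))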

Figure 9: A snapshot of the failure cases of the approach in [2].

Figure 10: A snapshot of Karanas extracted from our proposed dataset where the state-of-the-art approach of Wang et al. [3] fails due to poor illumination, clutter in the background, clothing etc.

4.5. Transfer learning on pose datasets

Initially, we used the CNN model pre-trained on the MNIST [10] data. Note that here we used the package in [65] with the sigmoid activation function and without any dropout layers to pre-train the proposed CNN. We obtained an accuracy of 97.92% within only 10 epochs using this pre-trained CNN on the Kinect (Laboratory) pose dataset. Contrast this with the 300 epochs taken by our CNN model trained from random initialization, which yielded an inferior accuracy of 96.53%, as shown in Fig. 11 (a). Similarly, for the real-world pose dataset, the model pre-trained on MNIST [10] converged within 20 epochs, giving a better testing accuracy of 97.62% than the randomly initialized CNN, which gave a testing accuracy of 93.25% after 200 epochs, as can be seen in Fig. 11 (b).

Figure 11: (a) Variation of the MSE for the CNN pre-trained on MNIST [10] for the Kinect (Laboratory) pose data. (b) Performance of the CNN pre-trained on MNIST [10] on the Youtube (RGB) pose dataset.

Data                          | α        | Batch size | Epochs | No. of classes | Training set | Testing set | Accuracy on testing set
Kinect (Laboratory) pose data | 5 × 10−6 | 10         | 100    | 12             | 720          | 144         | 98.6%
Youtube (RGB) pose data       | 5 × 10−6 | 10         | 350    | 14             | 1008         | 252         | 98.4%

Table 5: Results obtained with the CNN pre-trained on the CIFAR-10 [11] dataset for both the Kinect (Laboratory) pose data and the Youtube (RGB) pose database using MatConvNet [66].

Apart from using MNIST [10] as described above, the proposed CNN is also pre-trained from random initial weights in MatConvNet [66] using the labeled CIFAR-10 [11] dataset. In this case, we used three dropout layers in the proposed CNN architecture of Fig. 3. Two dropout layers were placed after the respective pooling layers. The third dropout layer is located just after the fully connected layer in the CNN. We used the ReLU activation function during pre-training. The proposed architecture resulted in an MSE of 0.483 after 500 epochs and achieved a testing accuracy of 61.6% on the CIFAR-10 [11] data. The weights of the pre-trained model obtained above are used to initialize the weights of the CNN to be trained on our Kinect (Laboratory) pose data and Youtube (RGB) pose dataset. From Tables 4 and 5, it is observed that a pre-trained model accelerates convergence compared to random weight initialization.

We next demonstrate the effectiveness of using transfer learning for classification of pose images of the recently proposed challenging database in [69]. The dataset in [69] consists of six classes, namely, “cricket batting”, “cricket bowling”, “croquet shot”, “tennis forehand”, “tennis serve” and “volleyball smash”, with 50 images per class.

Figure 12: Dataset [69] comprising 6 classes. First row: “cricket batting”, “cricket bowling”, “croquet shot”. Second row: “tennis forehand”, “tennis serve” and “volleyball smash”.

The CNN model pre-trained on CIFAR-10 [11] with the ReLU activation function and three dropout layers is used on the database of [69]. This pre-trained CNN is trained with a variable learning rate of α = 5 × 10−6 up to 600 epochs and α = 9 × 10−7 thereafter until 1000 epochs. The resulting trained CNN achieved an accuracy of 75.3% on the testing set of 120 images from [69].
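The two-stage learning-rate schedule just mentioned could be written, for instance, as follows (0.18 = 9 × 10−7 / 5 × 10−6; `train_one_epoch` is a hypothetical helper, and `model` is the pre-trained network).

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[600], gamma=0.18)

for epoch in range(1000):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()                    # lr drops from 5e-6 to 9e-7 after epoch 600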

5. Semantic understanding of a shloka using postures

Next, we demonstrate the possibility of recognizing the postures of the dancer to comprehend the meaning of the enacted dance piece.

The shloka (short poem/couplet in the Sanskrit language) enacted during testing was:

Vande Deva Umaa Pathim Suragurum Vande Jagat Kaaranam Vande Pannaga Bhooshanam Mruga Dharam Vande Pashoonam Pathim Vande Soorya Shashanka Vahni Nayanam Vande Mukunda Priyam Vande Bhakta Jana Ashrayam Cha Varadam Vande Shiva Shankaram

Meaning: I bow down to the Lord of Uma (Parvathi), the divine Guru, the cause of the universe. I bow down to the Lord who is adorned with a snake and wears tiger skin, the Lord of all creatures. I bow down to the Lord whose three eyes are the sun, moon and fire and who is dear to Lord Vishnu. I bow down to the Lord who is the refuge of all devotees and the giver of boons, Shiva Shankara.

The whole shloka was enacted using various Karanas, out of which we could identify 6 poses as belonging to our training set of 12 Karanas with which we trained the SVM classifier. These 6 Karanas are: 1. Samanakha, denoting the beginning of a dance piece; 2. Lina, paying respects; 3. Danda Rechita, the cobra bed of Lord Vishnu; 4. Chatura, the cobra on Lord Shiva’s body; 5. Talasamphotita, Lord Shiva; 6. Valita, blessing.

Figure 13: Sequence of poses that are enacted by a performer to convey a shloka.

We recorded a video using Kinect such that skeletal data could be extracted for every pose enacted. We extracted the skeletal feature vector from each frame of the video recorded with the Kinect sensor and passed the feature vector as test data to the trained SVM classifier. As the dancer transitions from one pose to another, there are some frames which do not correspond to any particular pose in the training set. We used a two-class SVM to eliminate the frames which were outside the training set and passed the remaining candidate frames to the trained multi-class SVM classifier.
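A sketch of this two-stage classification follows; `frames` are the skeletal feature vectors of the recorded video, and the training arrays for the rejection stage (in-set poses vs. transition frames) are our assumptions, since the paper does not detail how the binary SVM is trained.

from sklearn.svm import SVC

rejector = SVC(kernel="linear").fit(X_inout, y_inout)     # 1 = known pose, 0 = transition frame
karana_clf = SVC(kernel="linear").fit(X_train, y_train)   # multi-class Karana classifier

candidates = [f for f in frames if rejector.predict([f])[0] == 1]
recognized = karana_clf.predict(candidates)               # Karana label per retained frame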

Figure | Pose           | Reference to shloka      | Significance of pose
13 (a) | Samanakha      | —                        | Beginning of the dance recital
13 (b) | Lina           | Vande                    | Paying respects
13 (c) | Valita         | Deva Uma Patim Suragurum | The Lord Almighty who blesses his devotees
13 (d) | Chatura        | Pannaga Bhooshanam       | The snake around the neck of Lord Shiva
13 (e) | Danda Rechita  | Mukunda Priyam           | The cobra bed of Lord Vishnu
13 (f) | Talasamphotita | —                        | The classic Nataraj pose of Lord Shiva in Hindu mythology

Table 6: Decomposition of the enactment of a shloka using dance postures.

As the dancer enacts the shloka using several postures, we attempt to identify them and thereby interpret the semantic meaning of the dance piece. In Fig. 13 we show the 6 poses that the SVM classifier correctly identified out of a sequence of frames obtained from the recorded video. The various dance poses and the associated semantic meanings are summarized in Table 6.

5.1. Hand gesture datasets

We consider a subset of the hand gestures of ICD and capture images of both single and double hand postures under controlled laboratory conditions. We grouped these images into two categories, the CVLSH and CVLDH datasets. We also used real-world images of ICD hand gestures from Youtube videos. We now present classification results on these hand gesture datasets using both handcrafted features as input to an SVM classifier and the proposed CNN architecture.

Figure 14: (a) Sample images showing the 10 single hand gestures used in our CVLSH dataset. (b) Nine single hand gestures for the dataset constructed using Youtube videos.

5.1.1. Asamyukta Hastah Mudra (Single hand gestures)

CVLSH data: In the Natya Shastra, 28 Asamyukta Hastah Mudras are listed. We captured images of a subset of 10 hand gestures, each performed 10 times by 14 different persons, for a total of 1400 images. In Fig. 14 (a) we show a sample of the 10 mudras recorded for a single person. The single hand gestures chosen for the CVLSH dataset are Pataaka, Mayuram, Ardhachandra, Mushthi, Shikharam, Suchi, Padmakosha, Mrigashirsha, Bhramaram and Chandrakalaa.

Youtube videos (single hand gestures): For the case of real-world data, we selected 10 Youtube videos wherein frame regions corresponding to 9 single hand gestures were isolated manually to obtain a dataset of 972 images. The Asamyukta Hastah mudras chosen are Pataaka, Ardhachandra, Mushthi, Suchi, Mrigashirsham, Chandrakalaa, Trishulam, Alapadma and Hamsasye. Fig. 14 (b) shows sample images corresponding to the 9 Asamyukta Hastah mudras.

Figure 15: (a) Sample images showing the 14 double hand gestures used in our CVLDH dataset. (b) Seven double hand gestures for the dataset constructed using Youtube videos.

Figure 16: The ten single hand gestures in the proposed CVLSH dataset along with their HoG features.

No. of classes | Training set | Test set | Feature vector | Accuracy
9              | 756          | 216      | HoG + SVM      | 95.37%
9              | 756          | 216      | SIFT           | 84.5%
9              | 756          | 216      | SURF           | 71.43%
9              | 756          | 216      | BRISK          | 61.23%

Table 7: Single hand gestures, Asamyukta Hastah Mudra (Youtube video dataset).

5.1.2. Samyukta Hastah Mudra (Double hand gestures)

CVLDH data: In the Natya Shastra, 24 Samyukta Hastah Mudras are listed. We captured images of a subset of 14 double hand gestures, each performed 10 times by 6 individuals, for a total of 840 observations. In Fig. 15 (a) we show a snapshot of all 14 mudras. This subset of mudras comprises Anjali, Kapota, Swastika, Karkata, Pushpaputa, Shivalinga, Shankha, Shakata, Kurma, Chakra, Pasha, Garuda, Bherunda and Matsya.

Youtube videos (double hand gestures): We collected 630 images of 7 double hand gestures performed by 9 different dancers in several Youtube videos by isolating 10 separate instances of each mudra. The mudras enacted were Anjali, Karkata, Swastika, Pushpaputa, Shivalinga, Chakra and Matsya. Sample images of the individual hand gestures are shown in Fig. 15 (b).

Several features, such as histogram of gradients (HoG), scale-invariant feature transform (SIFT), speeded up robust features (SURF) and binary robust invariant scalable keypoints (BRISK), have been extracted from the input data, and an SVM with a linear kernel is trained to recognize the hand gestures. We show the HoG features corresponding to each of the 10 single hand gestures of our CVLSH dataset in Fig. 16 to depict their discriminative nature. It is observed that the gradients capture the general shape of each of the hand gestures.

Similarly, we observe that the HoG features encode the shape of the double hand gestures as well, leading to a good classification performance. For the CVLSH dataset corresponding to single hand gestures, we obtained an accuracy of 99.75% considering 10 different mudras with 1000 training and 400 test images using HoG features. The performance with the SIFT, SURF and BRISK image descriptors is 91%, 85.75% and 87%, respectively. Similarly, for the CVLDH dataset of double hand gestures containing images of 14 different mudras, with a training set of 560 and a testing set of 280 images, we obtained identification performance of 98.57%, 96.78%, 99.98% and 99.92% with HoG, SIFT, SURF and BRISK, respectively. We have compiled the results for the real-world datasets (Youtube videos) for both single hand and double hand gestures in Tables 7 and 8, respectively.

No. of classes | Training set | Test set | Feature vector | Accuracy
7              | 490          | 140      | HoG + SVM      | 100%
7              | 490          | 140      | SIFT           | 77.65%
7              | 490          | 140      | SURF           | 80.71%
7              | 490          | 140      | BRISK          | 72.86%

Table 8: Double hand gestures, Samyukta Hastah Mudra (Youtube video dataset).

Following the procedure mentioned in section 4.1, we extracted HoG feature vectors of total length 4356 considering a dense grid on each hand gesture image of size 100 × 100 pixels. In contrast, we did not compute the SIFT, SURF and BRISK features over a dense grid. Rather, we detected keypoints in the input image. Hence, the total dimension of the feature vectors for SIFT, SURF and BRISK depends upon the number of keypoints detected in the given image. Note that the descriptor length at each keypoint is 128 for SIFT and 64 for SURF and BRISK features.
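Keypoint-based description, as opposed to the dense HoG grid, can be sketched with OpenCV as follows; `gray` is an assumed 100 × 100 grayscale hand gesture image. SIFT and BRISK ship with modern OpenCV builds, whereas SURF requires the contrib build and is omitted here; how the variable-length descriptor sets are aggregated into a fixed-length vector for the SVM is our assumption, not detailed in the paper.

import cv2

sift = cv2.SIFT_create()
brisk = cv2.BRISK_create()
kp_sift, desc_sift = sift.detectAndCompute(gray, None)     # desc_sift: (n_keypoints, 128)
kp_brisk, desc_brisk = brisk.detectAndCompute(gray, None)  # desc_brisk: (n_keypoints, 64)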

635 5.2. CNN model: Indian classical dance hand gestures

We now evaluate the performance of the proposed CNN model on the databases
of hand gestures for ICD considered in this work. Specifically, we show the
performance of the proposed CNN on the CVLSH, CVLDH and real-world
databases of static single and double hand gestures in Table 9.

Data                    α     Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     0.5   5            500      10               1120           280           98.57%
Double hand (CVLDH)     0.5   4            200      14               672            168           96.43%
Single hand (Youtube)   0.5   3            100      9                1155           105           97.14%
Double hand (Youtube)   0.5   5            500      7                1323           567           100%

Table 9: Performance of the proposed CNN with random initial weights on the CVLSH, CVLDH
datasets and real-world databases of single and double hand gestures of Indian classical dance
using the sigmoid activation function in [65].

Note that these results have been obtained using [65] with the sigmoid activation
function and a random initialization of weights. We observe that the performance
of the deep learning algorithm is quite comparable to the accuracies obtained by
'shallow' learning algorithms which use handcrafted features in conjunction with
an SVM classifier. Several researchers have reported the advantages of the ReLU
activation function. Hence, to enable a fair comparison, the proposed CNN is
trained using [66] with both the sigmoid and the ReLU activation functions, and
the performance is shown in Tables 10 and 11, respectively. Note that we
initialized the CNN with random weights for comparison purposes. We observe
faster convergence during training and better accuracy over the test dataset with
the ReLU activation function.
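As an illustration of the two settings compared in Tables 10 and 11, the sketch below builds the sigmoid and ReLU variants in PyTorch. The LeNet-style layer sizes and the dropout probability are assumptions made for the example, not verified details of the architecture in Fig. 3; the dropout placement follows the description accompanying Table 15.

```python
import torch.nn as nn

def gesture_cnn(num_classes, act=nn.ReLU, p_drop=0.0):
    # Optional dropout after each pooling stage, as described for Table 15.
    drop = lambda: nn.Dropout(p_drop) if p_drop > 0 else nn.Identity()
    return nn.Sequential(
        nn.Conv2d(1, 6, 5), act(), nn.MaxPool2d(2), drop(),   # 32x32 -> 14x14
        nn.Conv2d(6, 12, 5), act(), nn.MaxPool2d(2), drop(),  # 14x14 -> 5x5
        nn.Flatten(),
        nn.Linear(12 * 5 * 5, num_classes))

sigmoid_net = gesture_cnn(10, act=nn.Sigmoid)        # setting of Table 10
relu_net = gesture_cnn(10, act=nn.ReLU, p_drop=0.5)  # setting of Table 11
```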

5.3. Transfer learning

As described in section 3.2, we obtain a pre-trained CNN model by using the
large labeled training data of MNIST [10]. We used this pre-trained model on
the CVLSH dataset to obtain an accuracy of 98.93% with only 2 epochs using
the toolbox in [65].

Data                    α           Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     75 × 10⁻⁵   10           500      10               1120           280           98.6%
Double hand (CVLDH)     75 × 10⁻⁵   10           1000     14               672            168           97.6%
Single hand (Youtube)   75 × 10⁻⁵   10           500      9                1155           105           100%
Double hand (Youtube)   75 × 10⁻⁵   10           1000     7                1323           567           99.8%

Table 10: Performance of the proposed CNN from random initial weights on the CVLSH,
CVLDH and real-world databases of single and double hand gestures of ICD using the sigmoid
activation function in MatConvNet [66].

Data                    α          Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     5 × 10⁻⁶   10           200      10               1120           280           99.6%
Double hand (CVLDH)     5 × 10⁻⁶   10           500      14               672            168           98.8%
Single hand (Youtube)   5 × 10⁻⁶   10           100      9                1155           105           100%
Double hand (Youtube)   5 × 10⁻⁶   10           100      7                1323           567           100%

Table 11: Performance of the proposed CNN (with random initialization of weights) on the
CVLSH, CVLDH and real-world databases of single and double hand gestures of ICD using
the ReLU activation function in MatConvNet [66].

Contrast this with the 200 epochs taken by a CNN model trained from random
initialization, which yielded an inferior accuracy of only 98.57%. For the CVLDH
dataset, pre-training gave an accuracy of 98.81% in only 2 epochs with [65], as
compared to 96.43% achieved after 200 epochs from random initialization of the
network weights. For the real-world single and double hand gesture datasets,
pre-training achieved better results, giving an accuracy of 100% in both cases in
only 2 and 50 epochs, respectively, as shown in Table 12.

Data                    α     Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     0.5   5            2        10               1120           280           98.93%
Double hand (CVLDH)     0.5   10           2        14               672            168           98.81%
Single hand (Youtube)   0.5   5            2        9                1155           105           100%
Double hand (Youtube)   0.5   5            50       7                1323           567           100%

Table 12: Effect of pre-training using MNIST [10]: performance of the proposed CNN on the
CVLSH, CVLDH and real-world databases of single and double hand gestures of Indian
classical dance with the sigmoid activation function using [65].

Data                    α          Batch size   Epochs   No. of Classes   Training set   Testing set   Pre-training with CIFAR-10 [11]   Pre-training with MNIST [10]
Single hand (CVLSH)     5 × 10⁻⁶   10           20       10               1120           280           100%                              98.2%
Double hand (CVLDH)     5 × 10⁻⁶   10           50       14               672            168           100%                              97.6%
Single hand (Youtube)   5 × 10⁻⁶   10           50       9                1155           105           100%                              100%
Double hand (Youtube)   5 × 10⁻⁶   10           40       7                1323           567           100%                              99.6%

Table 13: Effect of pre-training using CIFAR-10 [11]: performance of the proposed CNN on
the CVLSH, CVLDH and real-world databases of single and double hand gestures of Indian
classical dance, pre-trained using CIFAR-10 [11] with the ReLU activation function and three
dropout layers using MatConvNet [66].

Comparing Tables 9 and 12, we observe that pre-training the proposed CNN
with the MNIST data [10] yields faster convergence for the CVLSH, CVLDH
and real-world gesture datasets. We also used transfer learning with the aid of
the CIFAR-10 [11] database and investigated the performance of the pre-trained
CNN model on the various hand gesture databases considered in our work. In
Table 13, we compile the results of transfer learning using CIFAR-10 [11] with
the CNN toolbox of [66]. Note that here, in addition to using the pre-trained
model, we also incorporated three dropout layers to reduce over-fitting and used
the ReLU activation function to speed up convergence.
Data                                         α                              Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
MU Hand Images ASL gestures [9] (10 class)   75 × 10⁻⁵                      10           1000     10               560            140           77.9%
MU Hand Images ASL gestures [9] (15 class)   75 × 10⁻⁵                      10           2500     15               840            210           81.4%
MU Hand Images ASL gestures [9] (20 class)   25 × 10⁻⁵                      10           3000     20               1120           280           80.4%
MU Hand Images ASL gestures [9] (25 class)   75 × 10⁻⁵                      10           3000     25               1400           350           74.9%
MU Hand Images ASL gestures [9] (36 class)   5 × 10⁻⁶ (up to 1000 epochs),  10           1200     36               2012           503           66.8%
                                             5 × 10⁻⁶ (up to 1200 epochs)

Table 14: Performance of the proposed CNN (with random initial weights) on the MU Hand
Images ASL gestures dataset [9] with the sigmoid activation function in MatConvNet [66].

Data                                         α                              Batch size   Epochs   No. of Classes   Training set   Testing set   Accuracy on testing set
MU Hand Images ASL gestures [9] (10 class)   5 × 10⁻⁶                       10           4500     10               560            140           89.6%
MU Hand Images ASL gestures [9] (15 class)   5 × 10⁻⁶                       10           3000     15               840            210           91.9%
MU Hand Images ASL gestures [9] (20 class)   5 × 10⁻⁶                       10           2000     20               1120           280           89.3%
MU Hand Images ASL gestures [9] (25 class)   5 × 10⁻⁶                       10           3000     25               1400           350           86.0%
MU Hand Images ASL gestures [9] (36 class)   5 × 10⁻⁶ (up to 1000 epochs),  10           1200     36               2012           503           66.8%
                                             5 × 10⁻⁶ (up to 1200 epochs)

Table 15: Performance of the proposed CNN (with random initial weights) on the MU Hand
Images ASL gestures dataset [9] with the ReLU activation function and two dropout layers
(after the pooling layers in the proposed architecture) in MatConvNet [66].



For the sake of comparison, we also report results with transfer learning using
MNIST [10] in the last column of Table 13. Note that the CNN is pre-trained
for 100 epochs with batch size 10 and α = 5 × 10⁻⁶ using the ReLU activation
function and three dropout layers. This resulted in an accuracy of 99.1% over
the testing set of MNIST [10]. We observe that the performance of the proposed
CNN with transfer learning using CIFAR-10 [11] is marginally better than that
using MNIST [10].
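The transfer-learning step itself reduces to copying every shape-compatible layer from the pre-trained network and re-initializing only the classifier head. The sketch below illustrates this in PyTorch under the same assumed LeNet-style architecture as before; it is not the code used in our experiments.

```python
import torch.nn as nn

def gesture_cnn(num_classes):
    return nn.Sequential(
        nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(6, 12, 5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(12 * 5 * 5, num_classes))

source = gesture_cnn(num_classes=10)  # assumed already trained on MNIST/CIFAR-10
target = gesture_cnn(num_classes=14)  # e.g. the 14 CVLDH double hand mudras

state = target.state_dict()
for name, w in source.state_dict().items():
    if state[name].shape == w.shape:  # skip the 10-class vs 14-class head
        state[name] = w.clone()
target.load_state_dict(state)
# `target` is then fine-tuned on the mudra images for the few epochs
# reported in Tables 12 and 13.
```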

Data                                         α                             Batch size   Epochs   No. of Classes   Training set   Testing set   Pre-training with CIFAR-10 [11]   Pre-training with MNIST [10]
MU Hand Images ASL gestures [9] (10 class)   5 × 10⁻⁶                      10           150      10               560            140           90.5%                             88.5%
MU Hand Images ASL gestures [9] (15 class)   5 × 10⁻⁶                      10           200      15               840            210           93.5%                             88.0%
MU Hand Images ASL gestures [9] (20 class)   5 × 10⁻⁶                      10           300      20               1120           280           91.3%                             91.1%
MU Hand Images ASL gestures [9] (25 class)   5 × 10⁻⁶ (up to 300 epochs),  10           400      25               1400           350           88.0%                             85.0%
                                             5 × 10⁻⁷ (up to 400 epochs)
MU Hand Images ASL gestures [9] (36 class)   5 × 10⁻⁶                      10           100      36               2012           503           65%                               66.9%

Table 16: Performance of the proposed CNN model on the MU Hand Images ASL gestures
dataset [9] with pre-training using CIFAR-10 [11] and MNIST [10] data.

5.4. CNN model: Comparison results with standard hand gesture datasets [7, 8, 9]

We demonstrate that the proposed CNN architecture is deep enough to yield
recognition rates comparable with the state-of-the-art on standard hand gesture
datasets. We have trained a CNN with images of hand gestures from three
standard datasets [7, 8, 9], considering the cases of both plain/uniform and
complex backgrounds.

5.4.1. Standard database [8]


The architecture of the CNN is identical to that shown in Fig. 3. Initially, we
considered the dataset in [8], depicted in Fig. 17, for training the CNN model,
consisting of hand gestures against a uniformly dark background. The base
dataset of [8] consisted of 240 images of 10 distinct hand gestures performed by
24 persons. We used 200 out of the 240 images as the un-augmented training set.
To train the CNN model, we augmented this training dataset by 8 times using
the operations of cropping and re-sizing in five different ways, flipping along the
vertical axis, and addition of Gaussian noise with standard deviations of 0.1
and 0.2. The augmented training database consisted of 1600 images. The test
dataset consists of the remaining 40 images of the original dataset of 240 images.

36
Figure 17: Sample images of ten hand gestures in [8].

We did not perform augmentation of the test database. There is no overlap
between the training and the testing datasets.
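Our reading of the 8× augmentation above can be sketched as follows: five crop-and-resize variants, one flip about the vertical axis, and two Gaussian-noise copies per image. The crop geometry below is an assumption made for illustration; the noise standard deviations are those stated in the text.

```python
import numpy as np
from skimage.transform import resize   # pip install scikit-image

def augment_8x(img, rng):
    # img: HxW grayscale array with values in [0, 1] -> list of 8 images.
    h, w = img.shape
    crops = [img[:int(.9*h), :int(.9*w)], img[int(.1*h):, int(.1*w):],
             img[:int(.9*h), int(.1*w):], img[int(.1*h):, :int(.9*w)],
             img[int(.05*h):int(.95*h), int(.05*w):int(.95*w)]]
    out = [resize(c, (h, w)) for c in crops]              # 5 crop + resize
    out.append(img[:, ::-1].copy())                       # vertical-axis flip
    for sigma in (0.1, 0.2):                              # 2 noisy copies
        out.append(np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0))
    return out

rng = np.random.default_rng(0)
train_images = [rng.random((32, 32)) for _ in range(200)]  # placeholder data
augmented = [a for im in train_images for a in augment_8x(im, rng)]
assert len(augmented) == 1600                              # 200 images x 8
```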
The weights of the proposed CNN are trained by the back-propagation method
utilizing stochastic gradient descent with the package in [66]. We chose a batch
size of 10 and a constant learning rate α = 9 × 10⁻⁶ throughout all the layers of
the proposed CNN for the dataset in [8]. The network is trained for 2000 epochs
from random initial weights. The final MSE achieved during training on the
un-augmented training data is 0.025. However, using a learning rate
α = 5 × 10⁻⁶ with batch size 10, the MSE achieved is 0.016 after training the
CNN for 700 epochs on the augmented training data containing 1600 images.
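The training loop implied here is plain mini-batch SGD with a constant learning rate. A hedged PyTorch sketch is given below; the MSE objective over sigmoid outputs mirrors the "final MSE" quoted above, though the exact MatConvNet loss and output non-linearity may differ.

```python
import torch
import torch.nn as nn

def train_sgd(net, batches, lr=9e-6, epochs=2000):
    # `batches`: iterable of (images, one_hot_targets) pairs with batch size 10.
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss = mse(torch.sigmoid(net(x)), y)  # squared error on outputs
            loss.backward()
            opt.step()
    return net
```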
During the test phase, 40 images of hand gestures with a uniformly dark
background were fed as input to the trained CNN after re-sizing each of them
to 32 × 32 pixels. The accuracy obtained on the test dataset is 85%. The
accuracy improved further to 93% on the test data by using the CNN trained
with the augmented training dataset containing 1600 training images. For both
the cases of augmented and un-augmented training data, the results are obtained
with a ReLU activation function along with two dropout layers, each following
the respective pooling layer in the proposed architecture of Fig. 3.

37
Figure 18: Marcel dataset [7].

5.4.2. Marcel dataset [7]


Next, we trained a CNN model of identical architecture as above on the dataset
in [7], shown in Fig. 18, consisting of a total of 5149 images of 6 different hand
gestures with cluttered, complex backgrounds. The full dataset was split, without
any overlapping images, into a training set consisting of 3608 images and a test
dataset of 1541 examples.
The efficacy of our CNN for images of hand gestures with complex backgrounds
on a large dataset is further demonstrated by the accuracy obtained on the test
data of [7]. We obtained an accuracy of 85.98% using [65], with 216 out of 1541
images being misclassified on this challenging dataset. This performance of the
proposed model on the challenging dataset of [7] is better than the
state-of-the-art result of 76.10% reported in [7]. We trained the proposed CNN
using [66] utilizing the sigmoid activation function with a learning rate of
5 × 10⁻³ and a batch size of 10 for 100 epochs. We obtained an accuracy of
86.8% over the test dataset. We interchanged the sigmoid non-linearity with the
ReLU activation function and re-trained the CNN using [66] with a learning rate
of α = 5 × 10⁻⁶ and a batch size of 10 for 500 epochs. This CNN yielded an
improved accuracy of 89.5% on the test data. Note that we had initialized the
CNN with random weights.

38
Figure 19: MU Hand Images ASL gestures [9].

5.4.3. MU Hand Images ASL gesture dataset [9]


The Massey University (MU) Hand Images American Sign Language (ASL)
dataset (Fig. 19) is a 36-class dataset having 70 images in each category. The
classes have low inter-class variation, making it a tough dataset for classification.
We chose different sets of classes from the full dataset and split the set of
images, without any overlap, into respective training and test datasets. The
proposed CNN architecture with a random initialization of weights was used on
this dataset to show the variation in accuracy with an increasing number of
classes. We observe from Table 14 that with an increasing number of classes the
accuracy decreased if the new classes had less inter-class variation. The variation
of accuracy with increasing number of classes using the sigmoid and ReLU
activation functions is reported in Tables 14 and 15, respectively.

5.5. Transfer Learning results for [7, 8, 9]

As mentioned in section 3.4, transfer learning was used to boost the performance
of the proposed CNN on the datasets of [7, 8, 9]. For the dataset in [8], we used
the CIFAR-10 [11] labeled dataset to pre-train the proposed CNN.

5.5.1. Pre-training using MNIST [10]


Using a CNN pre-trained with MNIST [10], we obtained a test accuracy of 83%
on the dataset of [8] using [66]. This pre-trained model was trained on the hand
gesture data in [8] using a learning rate of 5 × 10⁻⁶ with a batch size of 10.
Similarly, the CNN pre-trained on MNIST [10] was further trained on the
dataset of [7] with a learning rate of 5 × 10⁻⁶ for 50 epochs with a batch size of
10. It yielded an accuracy of 84.3% on the test dataset of [7]. The impact of
using a CNN pre-trained with MNIST [10] on the MU Hand Images ASL gesture
dataset [9] for various choices of the number of classes is reported in the last
column of Table 16.

5.5.2. Pre-training using CIFAR-10 [11]


We also pre-trained the proposed CNN from randomly initialized weights using
CIFAR-10 [11], which contains 50,000 training images and 10,000 test images
belonging to 10 different classes, namely, airplane, automobile, bird, cat, deer,
dog, frog, horse, ship and truck. Akin to the case wherein we used the MNIST
[10] data for pre-training, the converged weights of the CNN model trained on
the CIFAR-10 [11] dataset are used for initialization before training on the
datasets of [7, 8, 9]. We again observed that the pre-trained model helped in
faster convergence during the training phase and yielded improved accuracies on
the test datasets using MatConvNet [66].
It was also observed that the CNN obtained using transfer learning on the
CIFAR-10 [11] dataset performed better over the hand gesture datasets in
[7, 8, 9] than the CNN pre-trained using MNIST [10]. Transfer learning using
CIFAR-10 [11] yielded an accuracy of 96% for a learning rate of 5 × 10⁻⁶ with
batch size 10 after being trained for only 100 epochs on the dataset of [8].
Contrast this result with the accuracy of 83% over the test data of [8] obtained
using transfer learning on the MNIST [10] dataset. The CNN pre-trained with
the CIFAR-10 [11] dataset was trained over [7] with α = 5 × 10⁻⁶ and batch size
10 for 50 epochs. This network yielded an accuracy of 89.5% on the test dataset
from [7]. Note that it required 500 epochs to train the proposed CNN over the
dataset of [7] from a random initialization of weights to obtain the same
accuracy as the pre-trained model. We observe that transfer learning using
MNIST [10] only yielded an accuracy of 84.3% over the testing set of [7].

The pre-trained CNN using the CIFAR-10 [11] dataset, when used for the MU
Hand Images ASL gesture dataset [9], also yielded superior performance in terms
of accuracy and resulted in reduced training time compared to training the
network from random initial weights. The results of transfer learning using
CIFAR-10 [11] on the MU Hand Images ASL gesture dataset [9] are reported in
Table 16. We note that these results are superior to those obtained by transfer
learning using MNIST [10] (last column of Table 16).

6. Semantic understanding of a shloka using hand gestures

We chose a video from Youtube in order to demonstrate the possibility of
recognizing hand gestures of the dancer to comprehend the meaning of the
enacted dance piece. Here a dancer performs the Guru Stuti, which is a very
important shloka (short poem/invocation) in Hindu scriptures:
Gurur Brahma Gurur Vishnum Gurur Devo Maheshwaraha Guru Sakshat
Parabrahma Tasmai Shree Gurave Namaha
Meaning:
Oh teacher, I see you as Brahma. Teacher, I see you as Vishnu. Hey Guru, I see
you as Maheshwaraha (Shiva). You are the lord of lords. I bow unto thee.
We can decompose the Youtube video of the dancer performing this shloka into
frames and identify the hand gestures using our previously trained SVM or CNN
classifier. As shown in Fig. 20 (a), the gesture of the right hand of the dancer is
Hamsasye (refer to Fig. 14), corresponding to the words 'Gurur Brahma'; it has
been successfully recognized by both the trained SVM and CNN models. In
Figs. 20 (b) and (c), the double hand gestures enacted by the dancer are
Shankha and Chakra, respectively, and they have been correctly identified by
the SVM. However, the CNN model misclassified them. These two gestures refer
to the words Gurur Vishnum in the shloka. In the Sanskrit language, Chakra
means a disc or a wheel. This double hand gesture is used to represent Lord
Vishnu's Sudarshan Chakra. Fig. 20 (d) shows the dancer enacting the word
Devo in the shloka with the Pataka single hand gesture in each hand.

Figure 20: Identification of hand gestures to comprehend the meaning of a shloka (Guru
Stuti).

We are able to identify the left hand gesture with the trained SVM, but the
right hand mudra is not recognized by either the SVM or the CNN due to the
large variation in the viewing angle. Fig. 20 (e) shows the enactment of the
word Maheshwaraha with the double hand gesture Shivalinga. Both the CNN
and the SVM trained models are able to accurately identify this gesture. The
words Gurur Sakshat have been depicted in Fig. 20 (f) as Pataka single hand
gestures in each hand. This gesture, enacted by the performer using her left
hand, has been successfully identified by both the CNN and the SVM.
Fig. 20 (g) shows the dancer enacting the word Parabrahma, which refers to
'salutation to the Almighty'. Since this gesture is not present in our training
database, both the CNN and SVM classifiers fail to detect it. Finally, in
Fig. 20 (h), a double hand gesture Anjali is identified by only the SVM
classifier. This gesture is used to denote the words Gurave Namaha, which is the
final salutation to the teacher.
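The per-frame recognition used in this demonstration can be sketched as below, with OpenCV standing in for the video decoding; `classify_mudra` is a placeholder for the trained SVM/CNN classifier, and classifying the whole resized frame is a simplification of the hand localization actually performed.

```python
import cv2   # pip install opencv-python

def mudra_sequence(video_path, classify_mudra, stride=30):
    # Decode the video, sample every `stride`-th frame, and label each
    # sampled frame with the previously trained gesture classifier.
    cap = cv2.VideoCapture(video_path)
    labels, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            labels.append(classify_mudra(cv2.resize(gray, (100, 100))))
        idx += 1
    cap.release()
    return labels   # e.g. ['Shankha', 'Chakra', 'Shivalinga', ...]
```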

7. Conclusions

In this work we have presented a novel application of computer vision
algorithms for semantic understanding of Indian classical dance. The interplay of
hand gestures and body poses to express profound ideas in religious shlokas and
poems from classical literature is the hallmark of ICD. We showed that, by
taking a deep learning approach to this challenging problem, we can outperform
traditional 'shallow' algorithms which use handcrafted image feature detectors
and a traditional SVM classifier. The proposed CNN model has been
demonstrated to recognize both body postures and hand gestures to a high
degree of accuracy, on standard popular datasets and specifically on the ICD
datasets. We showed that transfer learning overcomes a key disadvantage of
supervised learning in convolutional neural networks, enabling them to learn
better by transferring the knowledge of an already trained model. Finally, using
real-world videos of a dancer enacting various shlokas, we demonstrated that it
is possible to comprehend their meaning by identifying the body postures and
hand gestures. Several challenges remain in the problem addressed here, such as
occlusions, varying viewpoints, changes of illumination and ambiguity in the
meanings of the hand gestures. ICD employs dynamic hand gestures and facial
expressions to express deep semantic connotations of the lyrics and music which
are integral to dance performances. We aim to address these challenges in our
future work.

References

[1] P. Subrahmanyam, Karana Prakaranam - Marga Tradition Revived, Swathi's Sanskriti Series (DVD).

[2] D. Ramanan, Y. Yang, Articulated pose estimation using flexible mixtures of parts, in: CVPR, IEEE, 2011.

[3] F. Wang, Y. Li, Beyond physical connections: Tree models in human pose estimation, in: CVPR, IEEE, 2013, pp. 596–603.

[4] Y. LeCun, F. J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, in: CVPR, IEEE Press, 2004.

[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012.

[6] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[7] S. Marcel, Hand posture recognition in a body-face centered space, in: CHI '99 Extended Abstracts on Human Factors in Computing Systems, CHI EA '99, ACM, New York, NY, USA, 1999, pp. 302–303. doi:10.1145/632716.632901. URL https://doi.acm.org/10.1145/632716.632901

[8] J. Triesch, C. Von Der Malsburg, Robust classification of hand postures against complex backgrounds, in: 2nd International Conference on Automatic Face and Gesture Recognition (FG), IEEE, 1996, pp. 170–175.

[9] A. L. C. Barczak, N. H. Reyes, M. Abastillas, A. Piccio, T. Susnjak, A new 2D static hand gesture colour image dataset for ASL gestures, Research Letters in the Information and Mathematical Sciences 15 (2011) 12–20. URL https://www.massey.ac.nz/massey/fms/Colleges/College%20of%20Sciences/IIMS/RLIMS/Volume%2015/GestureDatasetRLIMS2011.pdf

[10] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[11] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Master's thesis, Computer Science Department, University of Toronto (2009).

[12] A. Mallik, S. Chaudhury, H. Ghosh, Nrityakosha: Preserving the intangible heritage of Indian classical dance, Journal on Computing and Cultural Heritage (JOCCH) 4 (3) (2011) 11.

[13] S. Samanta, P. Purkait, B. Chanda, Indian classical dance classification by learning dance pose bases, in: Applications of Computer Vision (WACV), 2012 IEEE Workshop on, IEEE, 2012, pp. 265–270.

[14] I. Kapsouras, S. Karanikolos, N. Nikolaidis, A. Tefas, Folk dance recognition using a bag of words approach and ISA/STIP features, in: Proceedings of the 6th Balkan Conference in Informatics, ACM, 2013, pp. 71–74.

[15] I. Kapsouras, S. Karanikolos, N. Nikolaidis, A. Tefas, Feature comparison and feature fusion for traditional dances recognition, in: Engineering Applications of Neural Networks, Springer, 2013, pp. 172–181.

[16] D. A. Forsyth, M. M. Fleck, Body plans, in: CVPR, IEEE, 1997, pp. 678–683.

[17] J. O'Rourke, N. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. Patt. Anal. Mach. Intell. PAMI-2 (6) (1980) 522–536.

[18] G. Mori, J. Malik, Estimating human body configurations using shape context matching, in: Proc. 7th European Conference on Computer Vision - Part III, ECCV '02, Springer-Verlag, London, UK, 2002, pp. 666–680.

[19] X. Ren, A. Berg, J. Malik, Recovering human body configurations using pairwise constraints between parts, in: Proc. ICCV, Vol. 1, 2005, pp. 824–831.

[20] G. Hua, M.-H. Yang, Y. Wu, Learning to estimate human pose with data driven belief propagation, in: IEEE CVPR, Vol. 2, 2005, pp. 747–754.

[21] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: CVPR, IEEE, 2008, pp. 1–8.

[22] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, Pattern Anal. Mach. Intell., IEEE Trans. 32 (9) (2010) 1627–1645.

[23] P. F. Felzenszwalb, D. P. Huttenlocher, Pictorial structures for object recognition, Intnl. Jrnl. Comp. Vis. 61 (1) (2005) 55–79.

[24] M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, in: CVPR, IEEE, 2009, pp. 1014–1021.

[25] H. Ning, W. Xu, Y. Gong, T. Huang, Discriminative learning of visual words for 3D human pose estimation, in: CVPR, IEEE, 2008, pp. 1–8.

[26] L. Pishchulin, M. Andriluka, P. Gehler, B. Schiele, Poselet conditioned pictorial structures, in: CVPR, IEEE, 2013, pp. 588–595.

[27] S. Johnson, M. Everingham, Learning effective human pose estimation from inaccurate annotation, in: CVPR, IEEE, 2011, pp. 1465–1472.

[28] Y. Tian, C. L. Zitnick, S. G. Narasimhan, Exploring the spatial hierarchy of mixture models for human pose estimation, in: Computer Vision - ECCV 2012, Springer, 2012, pp. 256–269.

[29] M. Dantone, J. Gall, C. Leistner, L. Van Gool, Human pose estimation using body parts dependent joint regressors, in: CVPR, IEEE, 2013, pp. 3041–3048.

[30] L. Pishchulin, A. Jain, M. Andriluka, T. Thormahlen, B. Schiele, Articulated people detection and pose estimation: Reshaping the future, in: CVPR, IEEE, 2012, pp. 3178–3185.

[31] M. Eichner, M. Marin-Jimenez, A. Zisserman, V. Ferrari, 2D articulated human pose estimation and retrieval in (almost) unconstrained still images, Intl. Jrnl. Comp. Vis. 99 (2) (2012) 190–214.

[32] R. Rosales, S. Sclaroff, Inferring body pose without tracking body parts, in: CVPR, Vol. 2, IEEE, 2000, pp. 721–727.

[33] A. Agarwal, B. Triggs, Recovering 3D human pose from monocular images, Pattern Anal. Mach. Intell., IEEE Trans. 28 (1) (2006) 44–58.

[34] Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, in: CVPR, IEEE, 2011, pp. 1385–1392.

[35] C. Sminchisescu, A. Kanaujia, D. Metaxas, BM³E: Discriminative density propagation for visual tracking, Pattern Anal. Mach. Intell., IEEE Trans. 29 (11) (2007) 2030–2044.

[36] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore, Real-time human pose recognition in parts from single depth images, Communications of the ACM 56 (1) (2013) 116–124.

[37] M. Oberweger, P. Wohlhart, V. Lepetit, Training a feedback loop for hand pose estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3316–3324.

[38] J. Tompson, M. Stein, Y. LeCun, K. Perlin, Real-time continuous pose recovery of human hands using convolutional networks, ACM Transactions on Graphics (TOG) 33 (5) (2014) 169.

[39] K. H. Lee, J. H. Kim, An HMM based threshold model approach for gesture recognition, IEEE Trans. Patt. Anal. Mach. Intell. 21 (10) (1999) 961–973.

[40] A. Just, S. Marcel, A comparative study of two state-of-the-art sequence processing techniques for hand gesture recognition, CVIU 113 (4) (2009) 532–543.

[41] J. Alon, V. Athitsos, Q. Yuan, S. Sclaroff, A unified framework for gesture recognition and spatiotemporal gesture segmentation, IEEE Trans. Patt. Anal. Mach. Intell. 31 (9) (2009) 1685–1699.

[42] A. Licsar, T. Sziranyi, User-adaptive hand gesture recognition system with interactive training, Image and Vision Computing 23 (2005) 1102–1114.

[43] J. Triesch, C. Von Der Malsburg, A system for person-independent hand posture recognition against complex backgrounds, IEEE Trans. Pattern Anal. Mach. Intell. 23 (12) (2001) 1449–1453.

[44] V. Athitsos, S. Sclaroff, Estimating 3D hand pose from a cluttered image, in: IEEE CVPR, 2003, pp. 432–439.

[45] I. Oikonomidis, N. Kyriazis, A. A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, in: BMVC, Vol. 1, 2011, p. 3.

[46] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, X. Twombly, Vision-based hand pose estimation: A review, Computer Vision and Image Understanding 108 (1) (2007) 52–73.

[47] T. H. Maung, et al., Real-time hand tracking and gesture recognition system using neural networks, World Academy of Science, Engineering and Technology 50 (2009) 466–470.

[48] P. K. Pisharady, P. Vadakkepat, A. P. Loh, Attention based detection and recognition of hand postures against complex backgrounds, Intl. Jrnl. Comp. Vis. 101 (3) (2013) 403–419.

[49] Y. Yang, C. Fermuller, Y. Li, Y. Aloimonos, Grasp type revisited: A modern perspective on a classical feature for vision, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 400–408.

[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.

[51] S. Melax, L. Keselman, S. Orsten, Dynamics based 3D skeletal hand tracking, in: Proceedings of Graphics Interface 2013, Canadian Information Processing Society, 2013, pp. 63–70.

[52] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al., Accurate, robust, and flexible real-time hand tracking, in: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, ACM, 2015, pp. 3633–3642.

[53] S. Sridhar, F. Mueller, A. Oulasvirta, C. Theobalt, Fast and robust hand tracking using detection-guided optimization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3213–3221.

[54] D. Tzionas, J. Gall, A comparison of directional distances for hand pose estimation, in: Pattern Recognition, Springer, 2013, pp. 131–141.

[55] M. Bray, E. Koller-Meier, P. Müller, L. Van Gool, N. N. Schraudolph, 3D hand tracking by rapid stochastic gradient descent using a skinning model, in: 1st European Conference on Visual Media Production (CVMP), 2004.

[56] C. Keskin, F. Kıraç, Y. E. Kara, L. Akarun, Hand pose estimation and hand shape classification using multi-layered randomized decision forests, in: Computer Vision - ECCV 2012, Springer, 2012, pp. 852–863.

[57] C. Xu, L. Cheng, Efficient hand pose estimation from a single depth image, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3456–3462.

[58] D. Tang, H. Chang, A. Tejani, T.-K. Kim, Latent regression forest: Structured estimation of 3D articulated hand posture, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3786–3793.

[59] Q. Delamarre, O. Faugeras, 3D articulated models and multiview tracking with physical forces, Computer Vision and Image Understanding 81 (3) (2001) 328–357.

[60] B. Stenger, A. Thayananthan, P. H. Torr, R. Cipolla, Model-based hand tracking using a hierarchical Bayesian filter, Pattern Analysis and Machine Intelligence, IEEE Transactions on 28 (9) (2006) 1372–1384.

[61] D. Hariharan, T. Acharya, S. Mitra, Recognizing hand gestures of a dancer, in: Pattern Recognition and Machine Intelligence, Springer, 2011, pp. 186–192.

[62] M. Rohrbach, S. Amin, M. Andriluka, B. Schiele, A database for fine grained activity detection of cooking activities, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1194–1201.

[63] J. Lei, X. Ren, D. Fox, Fine-grained kitchen activity recognition using RGB-D, in: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, ACM, 2012, pp. 208–211.

[64] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[65] R. B. Palm, Prediction as a candidate for learning deep hierarchical models of data, Master's thesis (2012). URL https://github.com/rasmusbergpalm/DeepLearnToolbox

[66] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, ACM, 2015, pp. 689–692.

[67] C. Rother, V. Kolmogorov, A. Blake, GrabCut: Interactive foreground extraction using iterated graph cuts, in: ACM SIGGRAPH, Vol. 23, 2004, pp. 309–314. doi:10.1145/1015706.1015720.

[68] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3) (2011) 27.

[69] A. Gupta, A. Kembhavi, L. S. Davis, Observing human-object interactions: Using spatial and functional compatibility for recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 (10) (2009) 1775–1789.