Author's Accepted Manuscript: Signal Processing: Image Communication
PII: S0923-5965(16)30084-4
DOI: http://dx.doi.org/10.1016/j.image.2016.05.019
Reference: IMAGE15097
To appear in: Signal Processing: Image Communication
Received date: 15 October 2015
Revised date: 26 May 2016
Accepted date: 29 May 2016
Cite this article as: Aparna Mohanty, Pratik Vaishnavi, Prerana Jana, Anubhab
Majumdar, Alfaz Ahmed, Trishita Goswami and Rajiv R. Sahay, Nrityabodha:
Towards understanding Indian classical dance using a deep learning approach,
Signal Processing: Image Communication,
http://dx.doi.org/10.1016/j.image.2016.05.019
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
Nrityabodha: Towards understanding Indian classical
dance using a deep learning approach
Aparna Mohanty
Computational Vision Laboratory
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, India
Pratik Vaishnavi
Department of Electronics and Communication Engineering
Sardar Vallabhbhai National Institute of Technology, Surat, India
Rajiv R. Sahay
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, India
Abstract
Indian classical dance has existed for over 5000 years and is widely practised
and performed all over the world. However, the semantic meaning of the dance
gestures and body postures, as well as of the intricate steps accompanied by
music and the recital of poems, is fully understood only by connoisseurs. The
common masses who watch a concert rarely appreciate or understand the ideas conveyed
by the dancer. Can machine learning algorithms aid a novice to understand the
semantic intricacies being expertly conveyed by the dancer? In this work, we
aim to address this highly challenging problem and propose deep learning based
algorithms to identify body postures and hand gestures in order to comprehend
the intended meaning of the dance performance. Specifically, we propose a
convolutional neural network and validate its performance on standard datasets
for poses and hand gestures as well as on constrained and real-world datasets
of classical dance. We use transfer learning to show that the pre-trained deep
networks can reduce the time taken during training and also improve accuracy.
1. Introduction
India is the home of ancient civilizations such as the Indus valley and
Mohenjodaro/Harappa settlements dated around 6000 B.C. Indian classical dance
(ICD) forms have existed since these ancient times and their importance can be
gauged from the patronage they have received from rulers and society in general
since times immemorial. In fact, one of the most famous excavations of Mohenjodaro
was the statuette of a dancing girl striking a sensuous pose. Temples of ancient
and medieval India depict sculptures with intricate details of dance postures;
in particular, the Chidambaram temple in the southern Indian province of Tamil
Nadu has preserved the postures of a popular classical dance form, Bharatnatyam.
The Natya Shastra is the most celebrated and comprehensive treatise encompassing
the performing arts of dance, theatre and music. It is widely believed to be
2000 years old, with detailed instructions outlining the grammar and rules
associated with classical dance, theatre, music and virtually every aspect of
stagecraft. With the passage of time, various artistes belonging to disparate
schools of art/dance have given their own interpretations of the basic rules
outlined in the Natya Shastra. In particular, however, the original Natya
Shastra describes a set of 108 dance postures named Karanas, which are enshrined
in the Chidambaram temple [1], constructed around the 12th century A.D. These
dance poses are depicted by performers of Bharatnatyam even today, as shown in
Fig. 1.
As an example, we show the Nataraaj pose depicted by a dancer in Fig. 1 (a)
and the corresponding sculpture in Fig. 1 (b). Fig. 1 (c) depicts the vartita
Karana and Fig. 1 (d) the corresponding sculpture.
Figure 1: (a) The Nataraaj posture [1]. (b) A sculpture depicting the Nataraaj pose. (c) The
vartita Karana [1]. (d) A sculpture depicting the same Karana.
Figure 2: (a) Failure of the state-of-the-art approach of [2] on one image from our dataset.
(b) Failure of the pose estimation approach of [3] on another image from our dataset.
State-of-the-art approaches are not able to estimate pose correctly on the dance
posture dataset due to occlusions, the clothing on the body of the dancer,
clutter in the background, etc. Hence, in this work we adopt an image
recognition approach using deep learning.
Dances are accompanied by song and governed by strict rules. To comprehend the
meaning of a dance, it is necessary to interpret hand gestures in addition to
body posture. The Natya Shastra mentions 28 single hand gestures or Asamyukta
Hastah mudras: Pataaka, Tripataaka, Ardhapataaka, Kartarimukha, Mayuram,
Ardhachandram, Araalam, Shukatunda, Mushthi, Shikhara, Kapitta, Katakaamukha,
Suchi, Chandrakalaa, Padmakosha, Sarpashirsha, Mrigashirsha, Simhamukha,
Kangula, Alapadma, Chatura, Bhramara, Hamsasye, Hansapakshika, Sandamsha,
Mukula, Tamrachuda, Trishula. Furthermore, over time four new hand gestures
were added to this list, namely, Kataka, Vyagraha, Ardhasuchi and Palli.

Unlike these single hand gestures, Samyukta Hastah mudras require the use of
both palms to convey a message or a particular meaning. There are 24 double
hand gestures described in the Natya Shastra: Anjali, Kapota, Karkata,
Swastika, Dola, Pushpaputa, Utsanga, Shivalinga, Kataka-vardhana,
Kartari-swastika, Shakata, Shankha, Chakra, Pasha, Kilaka, Samputa, Matsya,
Kurma, Varaha, Garuda, Nagabandha, Khatava, Bhairunda, Avahitta. Both single
and double hand gestures are used by the dancer to convey the meaning of the
poem/song which is being enacted.
Recently, deep learning has emerged as a powerful paradigm for complex machine
learning tasks such as object/image recognition [4, 5], handwritten character
recognition [6], etc. The ability of deep learning algorithms to abstract
appropriate features from images for classification is exemplary, and so we are
motivated to use it for both pose and hand gesture identification. Firstly, we
create a dataset, collected using the Kinect sensor in laboratory settings,
containing images of twelve dance postures of ICD and propose a convolutional
neural network (CNN) to classify them. This collection of images is named the
Kinect (Laboratory) pose dataset. Next, using videos from Youtube, we show that
a trained CNN model can recognize a subset of fourteen body postures (Karanas)
with high accuracy. The images corresponding to these Karanas are collectively
referred to as the Youtube (RGB) pose dataset.
Images of a subset of hand gestures pertaining to ICD are captured under
controlled laboratory settings and grouped into the Computational Vision Lab
single hand (CVLSH) and Computational Vision Lab double hand (CVLDH) gesture
datasets. We train the proposed CNN model on both the CVLSH and CVLDH gesture
datasets. We show that it is possible to identify with high accuracy the 10
single hand gestures comprising the CVLSH dataset and 9 Asamyukta Hastah mudras
from a dataset of Youtube videos. For the case of double hand gestures, we show
the recognition performance for 14 Samyukta Hastah mudras of the CVLDH dataset.
For real-world Youtube videos of actual concerts, we report identification
rates for 7 double hand gestures (Samyukta Hastah mudras). We also conduct
several experiments using our trained CNN model on three standard single hand
gesture datasets [7, 8, 9]. It is observed that the proposed CNN, despite its
simple architecture, is able to perform well on datasets of hand gestures with
both uniform and complex backgrounds [7, 8, 9].
We compare the traditional approach of using hand-crafted features such as the
scale-invariant feature transform (SIFT), speeded up robust features (SURF),
binary robust invariant scalable keypoints (BRISK) and histogram-of-oriented-gradient
(HoG) features with SVM-based classification against the proposed CNN model.
For this purpose we used both the Kinect (Laboratory) pose and Youtube (RGB)
pose datasets as well as the CVLSH, CVLDH and real-world hand gesture datasets
in our experiments.
Since we have limited labeled data (for both body poses and hand gestures), we
used transfer learning by resorting to models pre-trained on large labeled
datasets such as MNIST [10] and CIFAR-10 [11]. Hence, beginning from a random
initialization of the network parameters, we obtain a good initial estimate by
pre-training on the MNIST [10]/CIFAR-10 [11] datasets, and we then use this
pre-trained model to facilitate faster training and avoid over-fitting on our
pose and hand gesture databases. We observed much faster convergence during
training as well as improved generalization and higher accuracy by using
transfer learning.
Once the classifiers are trained, we use them to understand the semantic
meaning of real-world dance performances from Hindu mythology. Specifically,
to demonstrate the possibility of parsing and understanding the meaning of
dance pieces depicting Shlokas (couplets/short poems), we show how our system
can recognize the body postures and hand gestures enacted by the performer
using videos from Youtube. Note that in this work we aim to identify the body
postures and hand gestures of the dancer independently of each other. Indian
classical dance uses an amalgamation of postures, hand gestures, facial
expressions, movement of the eye pupils, neck and torso, and sophisticated
motion of the entire body to present an extravagant spectacle of mythical or
even contemporary themes. We recognize the fact that conveying a complete
semantic experience of the dance performances is beyond the scope of this work.
Presently, the proposed scheme is semi-automated in the sense that the
pre-processing steps for extraction of images containing dance postures from
real-world dance videos have to be performed offline. As part of future work,
we seek to extend our method to become fully automatic so that continuous
parsing of video data can be done, with an explanation of the semantic meaning
presented to the user to enable him/her to obtain a richer understanding of the
dance performance.
The primary contributions of this paper are summarized as follows:
• We created datasets for both single hand and double hand gestures under
controlled settings in the laboratory as well as in uncontrolled scenarios
from real-world Youtube videos. The proposed CNN architecture is shown to
perform well for classification of these hand gestures.

• We show the superiority of the CNN over shallow learning approaches using
hand-crafted features such as HoG, SIFT, SURF and BRISK. We used the proposed
datasets of body poses and hand gestures to obtain comparison results.
This paper is organised as follows. Section 2 describes the prior work
pertaining to the recognition of poses and hand gestures. The architectural
details of the proposed CNN model are given in Section 3. Section 4 gives a
comparison of the experimental results obtained using shallow learning
techniques and the proposed CNN-based approach on the Kinect (Laboratory) pose
and Youtube (RGB) pose datasets. In this section we show failure cases of
state-of-the-art pose estimation approaches such as [2] and [3] on the proposed
pose datasets. We also demonstrate the impact of transfer learning on the
proposed CNN-based framework in this section. The semantic interpretation of a
shloka from the postures is described in Section 5. This section also
introduces the proposed hand gesture datasets along with comparison results
between shallow learning and the proposed CNN-based approach. Here we present
experimental results showing the effect of varying the number of classes as
well as the utility of transfer learning. The semantic understanding of a
shloka by recognition of hand gestures is demonstrated in Section 6. Section 7
concludes the work presented.
2. Prior work
We place our work in the context of related works in the recent literature
pertaining to both the identification of pose and the recognition of hand
gestures. In the literature, works pertaining to dance involve dance form
classification as well as recognition and estimation of the poses of the
performer. A multimedia database retrieval system to preserve the living
heritage of Indian classical dance is reported in [12]. However, unlike our
work, they do not identify body postures or hand gestures to semantically
understand the dance. For the classification of ICD, a sparse representation
based dictionary learning technique is proposed in [13]. Classification of
folk dances is attempted in [14]. Activity recognition was attempted using a
bag-of-words approach with a dataset consisting of Greek traditional dances [15].
There are very few significant works addressing the problem of recognition of
body postures in ICD, but a vast literature on general pose identification of
humans exists. Initial works on 2D pose estimation in the image/video domains
appear in [16, 17]. Entire human shapes have been matched in [18]. Image
segmentation has been used in [19], with detection of human skin color regions
in [20]. A discriminatively trained, multi-scale, deformable-parts-based model
for pose estimation is proposed in [21]. This idea is also used for object
detection in [22]. Felzenszwalb et al. [23] describe a statistical framework
for representing the visual appearance of objects composed of rigid parts
arranged in a deformable configuration. Andriluka et al. [24] propose a generic
approach for human detection and pose estimation based on the pictorial
structures framework. The work of [25] proposed discriminative learning of
visual words for 3D human pose estimation. In [26], an efficient model for
pose estimation was formulated using
higher order dependencies. Johnson et al. [27] proposed a scheme to achieve
high quality annotation at low cost. An efficient method for pose estimation
using tree models is given in [3]. A new hierarchical spatial model that can
capture an exponential number of poses with a compact mixture representation
is proposed in [28]. 2D human pose is estimated from still images by Dantone
et al. [29], who propose novel, nonlinear joint regressors.

Pischchulin et al. [30] gave a method for automatic generation of training
examples from an arbitrary set of images and proposed the new challenge of
joint detection and pose estimation of multiple articulated people in
cluttered sport scenes. Eichner et al. [31] estimate upper body pose in highly
challenging uncontrolled images, without prior knowledge of background,
clothing, lighting, or the location and scale of the person. A novel approach
for estimating articulated body posture and motion from monocular video
sequences is proposed in [32]. A learning based method for recovering 3D human
body pose from a single image and from monocular image sequences is given by
[33]. Human pose estimation in static images based on a novel representation
of part models is proposed by [34]. The work in [35] proposes a conditional
Bayesian mixture of experts Markov model for discriminative visual tracking.
Researchers have also used depth data for predicting human poses. Shotton et
al. [36] proposed an efficient method to accurately predict human pose from a
single depth image. The performance is further improved by using a feedback
loop by Markus et al. [37]. CNNs have also been used by researchers for pose
estimation. Real-time continuous pose recovery from a single depth map is
attempted in [38] by extracting dense features using a CNN, followed by a
decision forest classifier.
Various methods proposed for image/video based hand gesture identification can
be divided roughly into four categories: (i) hidden Markov model (HMM) based
[39, 40]; (ii) neural network and learning based [41, 42]; (iii) other methods
such as graph-based [8, 43] or 3D-model based [44]; and (iv) model based
optimization approaches [45]. A complete survey of the existing approaches for
hand pose estimation is given in [46]. A system for person independent
recognition of hand postures against complex backgrounds is given by [43]. A
system for the classification of hand postures against complex backgrounds in
grayscale images is presented in [8]. A hand gesture recognition system [47]
claims real-time performance in unconstrained environments. Pisharady et al.
[48] proposed a system which utilizes a Bayesian model of visual attention to
generate a saliency map for detecting and identifying the hand region. Very
recently, hand gesture recognition using CNNs has been attempted in [49]. A
simple five-layer CNN proposed in [49] is used to classify seven different
grasp types. However, [49] does not exploit transfer learning or use dropout [50].
Various optimization approaches have been considered for tracking hands. The
work in [45] recovers and tracks the 3D position, orientation and articulation
of the human hand using a Kinect sensor. However, the work in [51] eliminates
the need for any external wearable hardware. Sharp et al. [52] have used a
single depth camera for accurately and robustly tracking hands in real time
over a significant range of distances with arbitrary camera placements. A fast
method for accurately tracking hands and pose using a single depth camera is
proposed in [53]. A new dataset and an approach to accurately study each
feature without a full tracking pipeline are proposed by Tzionas et al. [54].
Bray et al. [55] track hands without any constraint on the degrees of freedom.
Keskin et al. [56] used depth sensors to overcome problems associated with
vision based articulated hand pose estimation and hand shape classification.
The work in [56] used a randomized decision forest based hand shape classifier
for articulated hand pose estimation. Xu et al. [57] deal with hand pose
estimation using a single noisy depth map. A structured approach for locating
all skeletal joints guided by a latent tree model is proposed by [58]. The work
in [59] estimates the motion of an articulated object filmed by two or more
fixed low quality cameras, given only an approximation of the geometric model
of the tracked object. The approach of [60] simultaneously handles
initialization, tracking, and recovery of hand motion with no prior information
of the hand pose. A scale, translation and rotation-invariant approach for
recognizing various single hand gestures of a dancer is proposed by Hariharan
et al. [61].
However, [61] does not address the problem of estimating the pose of the dancer.

Our work bears some overlap with the area of fine-grained activity recognition
[62, 63]. Although a significant amount of literature exists for the general
problem of human body pose estimation and hand gesture identification, semantic
understanding of the poses and gestures in ICD has not received enough
attention. Recently, research on fine-grained activity recognition has gathered
momentum, with various works [62, 63] proposing datasets for semantic
activities, e.g., those involved in cooking [62]. We believe that our work is
the first to address the highly challenging problem of semantically
understanding ICD using a computer vision approach.
Convolutional neural nets, originally proposed by LeCun [6], have been shown
to be accurate and versatile for several challenging real-world machine
learning problems [5, 6]. According to LeCun [4, 6], CNNs can be effectively
trained to recognize objects directly from their images with robustness to
scale, shape, camera viewpoint, noise, etc. This motivates us to use CNNs for
our problem, since in real-world scenarios image data of body postures and
hand gestures in ICD will be affected by such variations.
3.1. Architecture
The general architecture of the proposed CNN is shown in Fig. 3. Apart from
the input and output layers, it consists of two convolution and two pooling
layers. The input is a 32 × 32 pixel image of a dance posture or hand gesture
(single/double hand). The number of nodes in the output layer depends on the
number of classes in the specific classification problem for which the CNN is
being used.

As shown in Fig. 3, the input image of 32 × 32 pixels is convolved with 10
filter maps of size 5 × 5 pixels to produce 10 output maps of 28 × 28 pixels
in layer 1. These feature maps are downsampled with max-pooling of 2 × 2 regions
Figure 3: Architecture of the proposed CNN model used for both pose and hand gesture
classification.
learn faster. We observed faster convergence during training and higher
testing set accuracy by using rectified linear units (ReLU) as the activation
function instead of the non-linear sigmoid function before the sub-sampling
layer in the architecture. We show in detail the advantage of using the ReLU
activation function in Table 4.

Sometimes the CNN suffers from over-fitting, in which case the training
accuracy is high but the testing accuracy is poor. This can be ameliorated
using dropout [50]. In our work, we show improvements in accuracy over the
testing dataset by using dropout after the pooling layer of the proposed CNN
shown in Fig. 3. Dropout makes the network avoid over-fitting by randomly
dropping some nodes and their connections [50].
with a pre-trained network and also improved accuracies on test datasets using
[65, 66].
who repeat each body posture 12 times. For this dataset, created under
controlled conditions in the laboratory, we captured the coordinates of the 20
joints tracked by the Kinect camera. In Fig. 4 we show the color images of the
12 poses along with the corresponding HoG features. We also recorded the depth
maps from the Kinect sensor for each dance pose enacted by all the dancers.
This set of skeletal configurations, RGB images and depth maps is named the
Kinect (Laboratory) pose dataset in this work.

We recorded the coordinates of all 20 joints tracked by Kinect and estimated
the 19 joint angles made with respect to the hip centre. The entire database
of 12 × 7 × 12 = 1008 images was split into a training set and a test set
without any overlapping images between them. A support vector machine (SVM)
classifier was trained with a linear kernel; the classification performance on
the Kinect (Laboratory) pose dataset, consisting of 12 Karanas with 864
training and 144 test images, is 95.83%.
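A sketch of this skeletal-feature pipeline is given below. The joint ordering
and the angle convention (measured against the vertical axis) are assumptions,
since the text does not specify them, and random skeletons stand in for the
recorded Kinect data.

import numpy as np
from sklearn.svm import SVC

HIP_CENTRE = 0  # assumed index of the hip-centre joint among the 20 Kinect joints

def joint_angles(joints: np.ndarray) -> np.ndarray:
    """(20, 3) Kinect joint coordinates -> 19 angles (in radians) of the
    remaining joints with respect to the hip centre; the angle convention
    used here is an assumption."""
    vecs = np.delete(joints - joints[HIP_CENTRE], HIP_CENTRE, axis=0)
    cos = vecs @ np.array([0.0, 1.0, 0.0]) / np.linalg.norm(vecs, axis=1)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Random skeletons standing in for the 864 training frames of the 12 Karanas.
rng = np.random.default_rng(0)
X = np.stack([joint_angles(rng.normal(size=(20, 3))) for _ in range(864)])
y = rng.integers(0, 12, size=864)

clf = SVC(kernel="linear").fit(X, y)  # linear-kernel SVM, as in the text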
Figure 4: Twelve Karanas along with the HoG features derived from the silhouettes.
We used histogram-of-oriented-gradient (HoG) features extracted from the RGB
images recorded by the camera in the Kinect sensor. For this purpose, we first
segmented the dancer from the background using the technique of [67] and
binarized the image to obtain the silhouette of the dancer. Each binarized
frame was then resized to 100 × 200 pixels. Considering a 9-bin histogram of
gradients over cells of 8 × 8 pixels and blocks of 2 × 2 cells, we extracted
HoG feature vectors of total length 9504 over a dense grid on the silhouette
images. The discriminative nature of the HoG features corresponding to each
dance posture is demonstrated in Fig. 4. It is observed that the HoG features
capture the general shape of each of the postures. A support vector machine
(SVM) classifier was trained with a linear kernel; the classification
performance on the Kinect (Laboratory) pose dataset, consisting of 12 Karanas
with 720 training and 144 test images, is 86.11%, as shown in Table 1.

We also extracted HoG features from the depth maps recorded by Kinect
corresponding to each body posture. The depth maps were resized to 100 × 200
pixels. As in the case of the RGB images, we extracted HoG feature vectors of
total length 9504 using a dense grid on the 100 × 200 depth maps. With 720
training and 144 test depth images we obtained an accuracy of 84.1% using an
SVM classifier with a linear kernel.
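One possible realization of this extraction, using scikit-image's hog routine
as a stand-in for our actual implementation, is sketched below; the chosen
cell/block geometry reproduces the quoted 9504-dimensional descriptor.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def hog_descriptor(silhouette: np.ndarray) -> np.ndarray:
    """Binarized silhouette -> HoG vector with 9 bins, 8 x 8-pixel cells and
    2 x 2-cell blocks on a 100 x 200 frame: (24 x 11 blocks) x (2 x 2 cells)
    x 9 bins = 9504 entries, matching the length quoted above."""
    frame = resize(silhouette.astype(float), (200, 100))  # rows x columns
    return hog(frame, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Random silhouettes standing in for the 720 training frames of 12 Karanas.
rng = np.random.default_rng(0)
X = np.stack([hog_descriptor(rng.random((200, 100)) > 0.5) for _ in range(48)])
y = rng.integers(0, 12, size=48)

clf = SVC(kernel="linear").fit(X, y)  # linear-kernel SVM, as in Table 1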
Table 1: Performance using hand-crafted features on both the Kinect (Laboratory) and Youtube
(RGB) pose databases of Indian classical dance.
from the binary silhouette images; with 288 training images we trained an SVM
classifier. The recognition accuracy on 72 test images was 88.89%. The
classification results using the above described hand-crafted features are
summarized in Table 1. Since the joint locations and depth data are not
available for the real-world dataset collected from Youtube, the corresponding
entries are blank in Table 1.

It is to be noted that for all the above cases, the parameters of the SVM
classifier were tuned using the standard cross-validation procedure [68].
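The tuning step can be realized, for example, with a grid search over the SVM
regularization constant; scikit-learn is used here as a stand-in for the
LIBSVM-based procedure of [68], and the parameter grid and synthetic data are
illustrative assumptions.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic HoG-like features: 10 samples for each of the 12 pose classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 9504))
y = np.repeat(np.arange(12), 10)

# 5-fold cross-validated search over the regularization constant C.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)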
Figure 5: Mean squared error (MSE) versus epochs for the CNN trained on the Kinect
(Laboratory) pose dataset.
seen in Fig. 5, we use this trained CNN to report the accuracy on the testing
set in Table 2. The package [65] does not offer the ReLU activation function.
So, to enable a fair comparison of the effect of the sigmoid and ReLU
activation functions, we used the package MatConvNet [66].
Figure 6: A snapshot of fourteen Karanas extracted from Youtube videos.
Table 2: Performance of the proposed CNN with randomly initialized weights on both the
Kinect (Laboratory) and Youtube (RGB) pose databases of ICD, using [65].

Table 3: Performance of the proposed CNN (with weights initialized with random numbers) on
both the Kinect (Laboratory) and Youtube (RGB) pose databases of ICD, with the sigmoidal
activation function in MatConvNet [66].

Table 4: Performance of the proposed CNN (randomly initialized weights) on both the Kinect
(Laboratory) and Youtube (RGB) pose datasets of ICD, with the ReLU activation function in
MatConvNet [66].
for 14 different poses performed by 6 different dancers, extracting 15 frames
per pose for each dancer. A snapshot of the 14 postures is depicted in Fig. 6.
To create the training set, we used 12 frames per pose for each of the 6
performers, leading to 1008 images. The testing set consisted of the remaining
252 images. There
Figure 7: MSE versus epochs for the CNN trained on the Youtube (RGB) pose dataset.
is no overlap between the training and testing sets. All images were further
resized to 32 × 32 pixels before being fed to the CNN.

The CNN model was trained for 200 epochs from random initial weights with
batch size 6 and a constant learning rate α = 0.5 throughout all the layers
using [65]. The variation of the MSE versus epochs during the training phase
is shown in Fig. 7 for two different choices of α. The blue curve represents
the variation of the MSE for a learning rate of α = 0.5, while the orange
curve represents the MSE variation for a learning rate of α = 0.6. As can be
seen from Fig. 7, training of the proposed CNN is better for α = 0.5. The
final MSE yielded by the CNN during the training phase for α = 0.5 is 0.0258.
Hence, this network was chosen to obtain the accuracy on the testing set of
the real-world pose data.
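A minimal sketch of this training configuration is given below, reusing the
DancePoseCNN sketch from Section 3.1; plain SGD and the one-hot MSE target
encoding are assumptions about the internals of [65].

import torch
import torch.nn as nn

model = DancePoseCNN(num_classes=14)                     # 14 Youtube Karana classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)  # constant alpha = 0.5
mse = nn.MSELoss()                                       # the loss plotted in Fig. 7

def train_epoch(loader) -> float:
    total = 0.0
    for images, labels in loader:                 # batches of 6 (1, 32, 32) images
        targets = nn.functional.one_hot(labels, 14).float()
        optimizer.zero_grad()
        loss = mse(model(images), targets)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)                    # mean MSE for the epoch

# for epoch in range(200): train_epoch(train_loader)  # 200 epochs, as reported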
Figure 8: (a) Original input image of a dance pose from the real-world dataset created
using Youtube videos. (b) First layer filter kernels in the proposed CNN architecture using
sigmoid activation function in [65]. (c) Feature maps at the first convolutional layer of the
CNN obtained by convolving the filter kernels in (b) with the input image in (a).
The first-layer filter kernels learned using the sigmoid activation function
in [65] are shown in Fig. 8 (b). These kernels, when convolved with the
original image of Fig. 8 (a), give the feature maps shown in Fig. 8 (c). As we
had already demonstrated for the Kinect (Laboratory) pose dataset in
subsection 4.3, we used the ReLU activation function in MatConvNet [66] for
training the CNN with the real-world pose data. For the real-world pose data
we also observed a reduction in the number of epochs needed for training the
CNN and an improvement in accuracy on the test dataset with the use of ReLU
(second rows of Tables 3 and 4).
images. The method of Ramanan et al. [2] fails to estimate the pose accurately
if there is occlusion due to clothing, as can be seen in Fig. 9 (e), where the
dancer's clothing occludes the leg. The effect of clutter in the image can be
seen in Fig. 9 (h), where an idol on the stage affects the pose estimation
badly. Similarly, Fig. 10 (a) through (d) shows the output of another
state-of-the-art approach, by Wang et al. [3], on our proposed dataset. The
approach of [3] fails on both our pose datasets due to the complexity involved
in terms of clothing, occlusion, viewpoint variations, etc.
Furthermore, in order to provide a fair comparison with the state-of-the-art
pose estimation methods, we used a nearest neighbour classifier to identify
the pose in both the Kinect (Laboratory) pose dataset and the Youtube (RGB)
pose data from Youtube videos. A total of 14 significant joint coordinates are
extracted from each image of the training sets of both pose datasets and used
for training the nearest neighbour classifier. The test data consisted of the
14 coordinates extracted from the skeleton obtained from [2] for each image of
the test datasets of the Kinect (Laboratory) pose and Youtube (RGB) pose
databases. Classification is performed by computing the minimum Euclidean
distance between the joint coordinates of the skeletons estimated by [2] from
the test images and the joint coordinates of the training data. For the Kinect
(Laboratory) pose data the nearest neighbour classifier performed badly,
giving an accuracy of only 16% for the pose estimation technique of [2]. The
performance of the nearest neighbour classifier for identifying the dance
poses in the Youtube (RGB) pose data is far worse, with an accuracy of 4% for
the poses estimated by [2].
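This minimum-distance rule amounts to the following few lines; the use of 2D
image coordinates and the random stand-in data are assumptions made purely for
illustration.

import numpy as np

def nearest_neighbour_label(test_joints, train_joints, train_labels):
    """test_joints: (14, 2) joint coordinates of the skeleton estimated by [2];
    train_joints: (n, 14, 2) training skeletons. Returns the label of the
    training skeleton at minimum Euclidean distance, as described above."""
    dists = np.linalg.norm(train_joints - test_joints, axis=(1, 2))
    return train_labels[np.argmin(dists)]

# Toy illustration with random coordinates standing in for real skeletons.
rng = np.random.default_rng(0)
train = rng.normal(size=(864, 14, 2))
labels = rng.integers(0, 12, size=864)
print(nearest_neighbour_label(rng.normal(size=(14, 2)), train, labels))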
Initially, we used the CNN model pre-trained on the MNIST [10] data. Note that
here we used the package in [65] with the sigmoid activation function and
without any dropout layers to pre-train the proposed CNN. We obtained an
accuracy of 97.92% with only 10 epochs using this pre-trained CNN on the
Kinect (Laboratory) pose dataset. Contrast this with the 300 epochs taken by our
Figure 10: A snapshot of Karanas extracted from our proposed dataset where the state-of-
the-art approach of Wang et al. [3] fails due to poor illumination, clutter in the background,
clothing etc.
CNN model which had been trained from random initialization, yielding an
inferior accuracy of 96.53%, as shown in Fig. 11 (a). Similarly, for the
real-world pose dataset, the model pre-trained using MNIST [10] converged
within 20 epochs, giving a better testing accuracy of 97.62% than the randomly
initialized CNN, which gave a testing accuracy of 93.25% after 200 epochs.
Figure 11: (a) Variation of MSE for the CNN pre-trained on MNIST [10] for the case of
Kinect (Laboratory) pose data. (b) Performance of the CNN pre-trained on MNIST [10] on
the Youtube (RGB) pose dataset.
Table 5: Results obtained with the CNN pre-trained on the CIFAR-10 [11] dataset for both the
Kinect (Laboratory) pose data and the Youtube (RGB) pose database, using MatConvNet [66].
Apart from using MNIST [10] as described above, the proposed CNN was also
pre-trained from random initial weights in MatConvNet [66] using the CIFAR-10
[11] labeled dataset. In this case, we used three dropout layers in the
proposed CNN architecture of Fig. 3. Two dropout layers were placed after the
respective pooling layers; the third dropout layer is located just after the
fully connected layer in the CNN. We used the ReLU activation function during
pre-training. The proposed architecture resulted in an MSE of 0.483 after 500
epochs and achieved a testing accuracy of 61.6% on the CIFAR-10 [11] data. The
weights of this pre-trained model are used to initialize the weights of the
CNN to be trained on our Kinect (Laboratory) pose data and Youtube (RGB) pose
dataset. From Tables 4 and 5, it is observed that a pre-trained model
accelerates convergence compared to random weight initialization.
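The weight-transfer step itself can be sketched as follows, reusing the
DancePoseCNN sketch from Section 3.1. Converting CIFAR-10 images to grayscale
so that the input layers match is an assumption, as the text does not state
how the channel mismatch is handled; only the output layer differs between the
source and target networks.

import torch

source = DancePoseCNN(num_classes=10)   # pre-trained on CIFAR-10 (10 classes)
# ... pre-training of `source` happens here (reported: MSE 0.483, 61.6%) ...

target = DancePoseCNN(num_classes=12)   # e.g., 12 Karanas in the Kinect pose data
state = target.state_dict()
pretrained = source.state_dict()
# Initialize with every pre-trained tensor whose shape matches; the final
# classification layer (10 vs. 12 outputs) keeps its random initialization.
state.update({k: v for k, v in pretrained.items() if v.shape == state[k].shape})
target.load_state_dict(state)
# ... fine-tuning on the pose dataset now starts from these weights ...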
Figure 12: Dataset [69] comprising 6 classes. First row: “cricket batting”, “cricket bowling”,
“croquet shot”. Second row: “tennis forehand”, “tennis serve” and “volleyball smash”.
Figure 13: Sequence of poses that are enacted by a performer to convey a shloka.
Meaning:
I bow down to the Lord of Uma (Parvathi), the divine Guru, the cause of the
universe. I bow down to the Lord who is adorned with a snake and wears tiger
skin, the Lord of all creatures. I bow down to the Lord whose three eyes are
the sun, moon and fire and to whom Lord Vishnu is near. I bow down to the Lord
who is the refuge of all devotees and the giver of boons, Shiva Shankara.
The whole shloka was enacted using various Karanas, out of which we could
identify 6 poses as belonging to our training set of originally 12 Karanas
with which we trained the SVM classifier. These 6 Karanas are: 1. Samanakha,
denoting the beginning of a dance piece; 2. Lina, paying respects; 3. Danda
Rechita, the cobra bed of Lord Vishnu; 4. Chatura, the cobra on Lord Shiva's
body; 5. Talasamphotita, Lord Shiva; 6. Valita, blessing.
We recorded a video using Kinect such that skeletal data could be extracted
for every pose enacted. We extracted the skeletal feature vector from each
frame of the video recorded with the Kinect sensor and passed the feature
vector as test data to the trained SVM classifier. As the dancer transitions
from one pose to another, there are some frames which do not correspond to any
particular pose in the training set. We used a two-class SVM to eliminate the
frames which were outside the training set and passed the remaining candidate
frames to the trained multi-class SVM classifier.
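The two-stage filtering can be sketched as follows; the random features
standing in for the skeletal descriptors, and the use of scikit-learn rather
than our actual toolchain, are assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 19))              # stand-ins for 19-angle skeletal features
is_pose = rng.integers(0, 2, size=400)      # 1 = frame shows a trained Karana
karana = rng.integers(0, 12, size=400)      # pose label for such frames

gate = SVC(kernel="linear").fit(X, is_pose)  # two-class "in training set?" SVM
pose_clf = SVC(kernel="linear").fit(X[is_pose == 1], karana[is_pose == 1])

def classify_frame(feature: np.ndarray):
    """Return the recognized Karana index, or None for a transition frame."""
    if gate.predict(feature[None])[0] == 0:
        return None
    return pose_clf.predict(feature[None])[0]

print(classify_frame(rng.normal(size=19)))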
Table 6: Poses identified in the enactment of the shloka and their semantic significance.

Figure   Pose             Reference to shloka        Significance of pose
13 (a)   Samanakha                                   Beginning of the dance recital
13 (b)   Lina             Vande                      Paying respects
13 (c)   Valita           Deva Uma Patim Suragurum   The Lord Almighty who blesses his devotees
13 (d)   Chatura          Pannaga Bhooshanam         The snake around the neck of Lord Shiva
13 (e)   Danda Rechita    Mukunda Priyam             The cobra bed of Lord Vishnu
13 (f)   Talasamphotita   Talasamphotita             The classic Nataraj pose of Lord Shiva in Hindu mythology
As the dancer enacts the shloka using several postures, we attempt to identify
them and thereby interpret the semantic meaning of the dance piece. In Fig. 13
we show the 6 poses that the SVM classifier correctly identified out of the
sequence of frames obtained from the recorded video. The various dance poses
and the associated semantic meanings are summarized in Table 6.
handcrafted features as input to an SVM classifier as well as the proposed CNN
architecture.
Figure 14: (a) Sample images showing 10 single hand gestures used in our CVLSH dataset.
(b) Nine single hand gestures for the dataset constructed using Youtube videos.
padma and Hamsasye. Fig. 14 (b) shows sample images corresponding to the 9
Asamyukta Hastah mudras.
Figure 15: (a) Sample images showing 14 double hand gestures used in our CVLDH dataset.
(b) Seven double hand gestures for the dataset constructed using Youtube videos.
Figure 16: Ten single hand gestures in the proposed CVLSH dataset along with their HoG
features.
Table 7: Single hand gestures (Asamyukta Hastah mudras) from the Youtube video dataset.

No. of classes   Training set   Test set   Feature vector   Accuracy
9                756            216        HoG + SVM        95.37%
                                           SIFT             84.5%
                                           SURF             71.43%
                                           BRISK            61.23%
Table 8: Double hand gestures (Samyukta Hastah mudras) from the Youtube video dataset.

No. of classes   Training set   Test set   Feature vector   Accuracy
7                490            140        HoG + SVM        100%
                                           SIFT             77.65%
                                           SURF             80.71%
                                           BRISK            72.86%
We now evaluate the performance of the proposed CNN model on the databases of
hand gestures for ICD considered in this work. Specifically, we show the
performance of the proposed CNN on the CVLSH, CVLDH and real-world databases
of single and double hand static gestures in Table 9. Note that
Table 9: Performance of the proposed CNN with random initial weights on the CVLSH, CVLDH
and real-world databases of single and double hand gestures of Indian classical dance, using
the sigmoid activation function in [65].

Data                    α     Batch size   Epochs   No. of classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     0.5   5            500      10               1120           280           98.57%
Double hand (CVLDH)     0.5   4            200      14               672            168           96.43%
Single hand (Youtube)   0.5   3            100      9                1155           105           97.14%
Double hand (Youtube)   0.5   5            500      7                1323           567           100%
these results have been obtained using [65] with the sigmoid activation
function and a random initialization of weights. We observe that the
performance of the deep learning approach is quite comparable to the
accuracies obtained by 'shallow' learning algorithms which use hand-crafted
features in conjunction with an SVM classifier. Several researchers have
reported the advantages of using the ReLU activation function. Hence, to
enable a fair comparison, the proposed CNN was trained with [66] using both
the sigmoid and ReLU activation functions, and the performance is shown in
Tables 10 and 11, respectively. Note that we initialized the CNN with random
weights for comparison purposes. We observe faster convergence during training
and better accuracy on the test dataset with the ReLU activation function.
Table 10: Performance of the proposed CNN from random initial weights on the CVLSH, CVLDH
and real-world databases of single and double hand gestures of ICD, using the sigmoid
activation function in MatConvNet [66].

Data                    α           Batch size   Epochs   No. of classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     75 × 10⁻⁵   10           500      10               1120           280           98.6%
Double hand (CVLDH)     75 × 10⁻⁵   10           1000     14               672            168           97.6%
Single hand (Youtube)   75 × 10⁻⁵   10           500      9                1155           105           100%
Double hand (Youtube)   75 × 10⁻⁵   10           1000     7                1323           567           99.8%
Table 11: Performance of the proposed CNN (with random initialization of weights) on the
CVLSH, CVLDH and real-world databases of single and double hand gestures of ICD, using the
ReLU activation function in MatConvNet [66].
Table 12: Effect of pre-training using MNIST [10]: performance of the proposed CNN on the
CVLSH, CVLDH and real-world databases of single and double hand gestures of Indian classical
dance, with the sigmoid activation function, using [65].

Data                    α     Batch size   Epochs   No. of classes   Training set   Testing set   Accuracy on testing set
Single hand (CVLSH)     0.5   5            2        10               1120           280           98.93%
Double hand (CVLDH)     0.5   2            10       14               672            168           98.81%
Single hand (Youtube)   0.5   5            2        9                1155           105           100%
Double hand (Youtube)   0.5   5            50       7                1323           567           100%
Table 13: Effect of pre-training using CIFAR-10 [11]: performance of the proposed CNN on the
CVLSH, CVLDH and real-world databases of single and double hand gestures of Indian classical
dance, pre-trained using CIFAR-10 [11] with the ReLU activation function and three dropout
layers, using MatConvNet [66].

Data                    α          Batch size   Epochs   No. of classes   Training set   Testing set   Pre-training with CIFAR-10 [11]   Pre-training with MNIST [10]
Single hand (CVLSH)     5 × 10⁻⁶   10           20       10               1120           280           100%                              98.2%
Double hand (CVLDH)     5 × 10⁻⁶   10           50       14               672            168           100%                              97.6%
Single hand (Youtube)   5 × 10⁻⁶   10           50       9                1155           105           100%                              100%
Double hand (Youtube)   5 × 10⁻⁶   10           40       7                1323           567           100%                              99.6%
Table 14: Performance of the proposed CNN (with random initial weights) on the MU Hand
Images ASL gestures dataset [9], with the sigmoid activation function, in MatConvNet [66].

Data                                         α                                                            Batch size   Epochs   No. of classes   Training set   Testing set   Accuracy on testing set
MU Hand Images ASL gestures [9] (10 class)   75 × 10⁻⁵                                                    10           1000     10               560            140           77.9%
MU Hand Images ASL gestures [9] (15 class)   75 × 10⁻⁵                                                    10           2500     15               840            210           81.4%
MU Hand Images ASL gestures [9] (20 class)   25 × 10⁻⁵                                                    10           3000     20               1120           280           80.4%
MU Hand Images ASL gestures [9] (25 class)   75 × 10⁻⁵                                                    10           3000     25               1400           350           74.9%
MU Hand Images ASL gestures [9] (36 class)   5 × 10⁻⁶ (up to 1000 epochs); 5 × 10⁻⁶ (up to 1200 epochs)   10           1200     36               2012           503           66.8%
Table 15: Performance of the proposed CNN (with random initial weights) on the MU Hand
Images ASL gestures dataset [9], with the ReLU activation function and two dropout layers
(after the pooling layers in the proposed architecture), in MatConvNet [66].
Table 16: Performance of the proposed CNN model on the MU Hand Images ASL gestures
dataset [9] with pre-training using CIFAR-10 [11] and MNIST [10].

Data                                         α                                                          Batch size   Epochs   No. of classes   Training set   Testing set   Pre-training with CIFAR-10 [11]   Pre-training with MNIST [10]
MU Hand Images ASL gestures [9] (10 class)   5 × 10⁻⁶                                                   10           150      10               560            140           90.5%                             88.5%
MU Hand Images ASL gestures [9] (15 class)   5 × 10⁻⁶                                                   10           200      15               840            210           93.5%                             88.0%
MU Hand Images ASL gestures [9] (20 class)   5 × 10⁻⁶                                                   10           300      20               1120           280           91.3%                             91.1%
MU Hand Images ASL gestures [9] (25 class)   5 × 10⁻⁶ (up to 300 epochs); 5 × 10⁻⁷ (up to 400 epochs)   10           400      25               1400           350           88.0%                             85.0%
MU Hand Images ASL gestures [9] (36 class)   5 × 10⁻⁶                                                   10           100      36               2012           503           65%                               66.9%
5.4. CNN model: Comparison results with standard hand gesture datasets [7, 8, 9]

We demonstrate that the proposed CNN architecture is deep enough to yield
recognition rates comparable with the state-of-the-art on standard hand
gesture datasets. We have trained the CNN with images of hand gestures from
three standard datasets [7, 8, 9], considering the cases of both plain/uniform
and complex backgrounds.
Figure 17: Sample images of ten hand gestures in [8].
Figure 18: The Marcel dataset [7].
Figure 19: MU Hand Images ASL gestures [9]
As mentioned in Section 3.4, transfer learning was used to boost the
performance of the proposed CNN on the datasets of [7, 8, 9]. For the dataset
in [8], we used the CIFAR-10 [11] labeled dataset to pre-train the proposed CNN.
The pre-trained CNN was then further trained on the hand gesture data in [8]
using a learning rate of 5 × 10⁻⁶ with a batch size of 10. Similarly, the CNN
pre-trained on MNIST [10] was further trained on the dataset of [7] with a
learning rate of 5 × 10⁻⁶ for 50 epochs with a batch size of 10. It yielded an
accuracy of 84.3% on the test dataset of [7]. The impact of using a CNN
pre-trained with MNIST [10] on the MU Hand Images ASL gestures dataset [9],
for various choices of the number of classes, is reported in the last column
of Table 16.
The CNN pre-trained using the CIFAR-10 [11] dataset, when used for the MU Hand
Images ASL gestures dataset [9], also yielded superior accuracy and reduced
training time compared to training the network from random initial weights.
The results of transfer learning using CIFAR-10 [11] on the MU Hand Images ASL
gestures dataset [9] are reported in Table 16. We note that these results are
superior to those obtained by transfer learning using MNIST [10] (last column
of Table 16).
Figure 20: Identification of hand gestures to comprehend the meaning of a shloka (Guru
Stuti).
We are able to identify the left hand gesture with the trained SVM, but the
right hand mudra is not recognized by either the SVM or the CNN due to the
large variation in viewing angle. Fig. 20 (e) shows the enactment of the word
Maheshwaraha with the double hand gesture Shivalinga. Both the CNN and the SVM
trained models are able to accurately identify this gesture. The words Gurur
Sakshat are depicted in Fig. 20 (f) as Pataka single hand gestures in each
hand. This gesture, enacted by the performer using her left hand, has been
successfully identified by both the CNN and the SVM. Fig. 20 (g) shows the
dancer enacting the word Parabrahma, which refers to 'salutation to the
Almighty'. Since this gesture is not present in our training database, both
the CNN and SVM classifiers fail to detect it. Finally, in Fig. 20 (h), a
double hand gesture, Anjali, is identified by only the SVM classifier. This
gesture is used to denote the words Guruve Namahe, which is the final
salutation to the Teacher.
7. Conclusions
The use of hand gestures and body poses to express profound ideas in religious
shlokas and poems from classical literature is the hallmark of ICD. We showed
that, by taking a deep learning approach to this challenging problem, we can
outperform traditional 'shallow' algorithms which use hand-crafted image
feature detectors and a traditional SVM classifier. The proposed CNN model has
been demonstrated to recognize both body postures and hand gestures with a
high degree of accuracy, both on standard popular datasets and specifically on
the ICD datasets. We showed that transfer learning overcomes a key
disadvantage of supervised learning in convolutional neural networks, enabling
them to learn better by transferring the knowledge of an already trained
model. Finally, using real-world videos of a dancer enacting various shlokas,
we demonstrated that it is possible to comprehend their meaning by identifying
the body postures and hand gestures. There are several challenges in the
problem addressed here, such as occlusions, varying viewpoint, changes of
illumination and ambiguity in the meanings of the hand gestures. ICD employs
dynamic hand gestures and facial expressions to express deep semantic
connotations of the lyrics and music which are integral to the dance
performances. We aim to address these challenges in our future work.
References
[3] F. Wang, Y. Li, Beyond physical connections: Tree models in human pose
estimation, in: CVPR, IEEE, 2013, pp. 596–603.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with
deep convolutional neural networks, in: Advances in Neural Information
Processing Systems, 2012.
[7] S. Marcel, Hand posture recognition in a body-face centered space, in: CHI
'99 Extended Abstracts on Human Factors in Computing Systems, CHI EA '99,
ACM, New York, NY, USA, 1999, pp. 302–303. doi:10.1145/632716.632901.
URL http://doi.acm.org/10.1145/632716.632901
[11] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny
images, Master's thesis, Computer Science Department, University of Toronto
(2009).
[13] S. Samanta, P. Purkait, B. Chanda, Indian classical dance classification
by learning dance pose bases, in: Applications of Computer Vision (WACV),
2012 IEEE Workshop on, IEEE, 2012, pp. 265–270.
[16] D. A. Forsyth, M. M. Fleck, Body plans, in: CVPR, IEEE, 1997,
pp. 678–683.
[17] J. O'Rourke, N. Badler, Model-based image analysis of human motion using
constraint propagation, IEEE Trans. Patt. Anal. Mach. Intell., PAMI-2 (6)
(1980) 522–536.
[18] G. Mori, J. Malik, Estimating human body configurations using shape con-
text matching, in: Proc. 7th European Conference on Computer Vision-
900 Part III, ECCV ’02, Springer-Verlag, London, UK, 2002, pp. 666–680.
[20] G. Hua, M.-H. Yang, Y. Wu, Learning to estimate human pose with data
driven belief propagation, in: IEEE CVPR, Vol. 2, 2005, pp. 747–754.
[23] P. F. Felzenszwalb, D. P. Huttenlocher, Pictorial structures for object recog-
nition, Intnl. Jrnl. Comp. Vis. 61 (1) (2005) 55–79.
[32] R. Rosales, S. Sclaroff, Inferring body pose without tracking body parts,
in: CVPR, Vol. 2, IEEE, 2000, pp. 721–727.
[34] Y. Yang, D. Ramanan, Articulated pose estimation with flexible
mixtures-of-parts, in: CVPR, IEEE, 2011, pp. 1385–1392.
[39] K. H. Lee, J. H. Kim, An HMM based threshold model approach for gesture
recognition, IEEE Trans. Patt. Anal. Mach. Intell. 21 (10) (1999) 961–973.
[44] V. Athitsos, S. Sclaroff, Estimating 3D hand pose from a cluttered image,
in: IEEE CVPR, 2003, pp. 432–439.
[47] T. H. Maung, et al., Real-time hand tracking and gesture recognition
system using neural networks, World Academy of Science, Engineering and
Technology 50 (2009) 466–470.
[53] S. Sridhar, F. Mueller, A. Oulasvirta, C. Theobalt, Fast and robust hand
tracking using detection-guided optimization, in: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp. 3213–3221.
[57] C. Xu, L. Cheng, Efficient hand pose estimation from a single depth image,
in: Proceedings of the IEEE International Conference on Computer Vision,
2013, pp. 3456–3462.
[58] D. Tang, H. Chang, A. Tejani, T.-K. Kim, Latent regression forest:
Structured estimation of 3D articulated hand posture, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp.
3786–3793.
[61] D. Hariharan, T. Acharya, S. Mitra, Recognizing hand gestures of a
dancer, in: Pattern Recognition and Machine Intelligence, Springer, 2011,
pp. 186–192.
[63] J. Lei, X. Ren, D. Fox, Fine-grained kitchen activity recognition using
RGB-D, in: Proceedings of the 2012 ACM Conference on Ubiquitous Computing,
ACM, 2012, pp. 208–211.
[66] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for
MATLAB, in: Proceedings of the 23rd Annual ACM Conference on Multimedia
Conference, ACM, 2015, pp. 689–692.
[68] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines,
ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3)
(2011) 27.