Real-Time Vision-Based Hand Tracking and Gesture Recognition
by
Qing Chen
Doctor of Philosophy
in Electrical and Computer Engineering
virtual objects, he can use a set of hand gestures to select the target traffic sign and open
a window to check the information of the corresponding learning object. This application
demonstrates that the gesture-based interface can achieve an improved interaction that is
more intuitive and flexible for the user.
Acknowledgements
First of all, I thank my supervisors Professor Nicolas D. Georganas and Professor Emil
M. Petriu for their guidance, advice and encouragement throughout my Ph.D. study. I
have benefited tremendously from their vision, technical insights and profound thinking.
They are my great teachers.
I wish to thank the members of the DiscoverLab for their suggestions and help on
my work. I owe lots of thanks to Francois Malric, who helped so much on my cameras,
computers and software. I also wish to thank Professor Abdulmotaleb El Saddik, who
suggested and helped the project of navigating the virtual environment by hand gestures.
I thank Dr. Xiaojun Shen and ASM Mahfujur Rahman for their cooperation in this
project.
Last but not least, I thank my parents and family; I would never have made it
without them.
Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Thesis Organization
  1.4 Contributions
  1.5 Publications Arising from the Thesis

2 Literature Review
  2.1 Introduction
  2.2 Appearance-Based Approaches
    2.2.1 Colors and Shapes
    2.2.2 Hand Features
    2.2.3 Optical Flow
    2.2.4 Mean Shift
    2.2.5 SIFT Features
    2.2.6 Stereo Image
    2.2.7 The Viola-Jones Algorithm
  2.3 3D Hand Model-Based Approaches
    2.3.1 Analysis-by-Synthesis
    2.3.2 Image Retrieval
  2.4 Statistical Approaches
  2.5 Syntactic Approaches
  2.6 Discussions
    2.6.1 Appearance vs. 3D Hand Model
    2.6.2 Statistical vs. Syntactic
  2.7 Summary

3 A Two-Level Architecture
  3.1 Selection of Postures and Gestures
  3.2 System Architecture
  3.3 Summary

7 Conclusions
List of Tables
List of Figures

2.1 The hand skeleton and joints with associated DOF (from [18]).
2.2 The signs for “d” and “z” in ASL (from [22]).
2.3 Hand tracking using the color cue (from [33]).
2.4 The “Visual Panel” system (from [45]).
2.5 Gesture recognition based on fingertips (from [46]).
2.6 Hand motion analysis using the optical flow algorithm.
2.7 Hand tracking using motion residue and hand color (from [52]).
2.8 The mean shift algorithm.
2.9 Hand tracking using the CamShift algorithm.
2.10 The SIFT features proposed by Lowe.
2.11 The robustness of the SIFT features against image rotation.
2.12 Ye’s stereo gesture recognition system (from [28]).
2.13 The “Flocks of Features” (from [62]).
2.14 The block diagram of 3D hand model-based approaches.
2.15 Clutter-tolerant image retrieval experiment results (from [64]).
2.16 3D posture estimation by matching hand contours (from [65]).
2.17 The block diagram of statistical approaches.
2.18 The Chomsky hierarchy of grammars.
2.19 The block diagram of syntactic approaches.
2.20 Shaw’s picture description language (from [74]).
2.21 A partial tree for hand postural features (from [76]).
3.2 Different semantic meanings for different global hand motions (from [84]).
3.3 The gesture taxonomy in the context of human-computer interaction.
3.4 The two-level architecture for hand gesture recognition.
3.5 The block diagram of the two-level architecture.
3.6 The web-camera for the video input.
4.30 The depth information recovered according to the perspective projection.
4.31 The background subtraction and noise removal.
Chapter 1
Introduction
1.1 Background
Human-computer interfaces (HCI) have evolved from text-based interfaces through 2D
graphical interfaces and multimedia-supported interfaces to fully fledged multimodal
3D virtual environment (VE) systems. While providing a sophisticated new par-
adigm for communication, learning, training and entertainment, VEs also pose new
challenges for human-computer interaction. The traditional 2D HCI devices such as key-
boards and mice are not adequate for the latest VE applications. Instead, VE systems
provide the opportunity to integrate different communication modalities and sensing
technologies together to provide a more immersive user experience [1, 2]. As shown in
Figure 1.1, devices that can sense body position and orientation, speech and sound, facial
and gesture expression, haptic feedback and other aspects of human behavior or state
can be used for more powerful and effective interactions between humans and computers.
To achieve natural and immersive human-computer interaction, the human hand
could be used as an interface device [3, 4, 5]. Hand gestures are a powerful human-
to-human communication channel, which forms a major part of the information transfer in
our everyday life. Hand gestures are an easy-to-use and natural means of interaction.
For example, sign languages have been used extensively among speech-disabled people.
People who can speak also use many kinds of gestures to aid their communication.
However, the expressiveness of hand gestures has not been fully explored for human-
computer interaction. Compared with traditional HCI devices, hand gestures are less
intrusive and more convenient for users to interact with computers and explore the 3D
virtual worlds [6]. Hand gestures can be used in a wide range of applications such as
Figure 1.2: The CyberGlove from the Immersion Corporation (from [7]).
and manipulate the virtual objects with a set of gesture commands without the help of
keyboards, mice or joysticks [12, 13, 14]. Early research on vision-based hand tracking
and gesture recognition usually needed the help of markers or colored gloves [15, 16]. In
current state-of-the-art vision-based hand tracking and gesture recognition techniques,
research focuses more on tracking the bare hand and identifying hand gestures without
the help of any markers or gloves.
1.2 Motivation
The latest computer vision technologies and the advanced computer hardware capacity
make real-time, accurate and robust hand tracking and gesture recognition promising.
Many different approaches have been proposed such as appearance-based approaches and
3D hand model-based approaches [17]. Most of these approaches treat the hand gesture
as a whole object and try to extract the corresponding mathematical description from
a large number of training samples. These approaches analyze hand gestures without
breaking them into their constituent atomic elements, which would simplify the complexity
of hand gestures. As a result, many current approaches are still limited by the lack
of speed, accuracy and robustness. They are either too fragile or demand too many
prerequisites such as markers, clean backgrounds or complex camera calibration steps,
and thus make the gesture interaction indirect and unnatural. Currently there is no
real-time vision-based hand tracking and gesture recognition system that can track and
identify hand gestures in a fast, accurate, robust and easily accessible manner.
The goal of this work is to build a real-time 3D hand tracking and gesture recognition
system for the purpose of human-computer interaction. To achieve this goal, the charac-
teristics of hand gestures need to be taken into account and the principles that can improve
the system’s performance in terms of speed, accuracy and robustness need to be applied.
Hand gestures are complex human actions with rich structure information, which can
be exploited for a syntactic pattern recognition approach. To fully use the composite
property of hand gestures, we prefer a divide-and-conquer strategy by using a hybrid
approach based on statistical and syntactic analysis as illustrated in Figure 1.3. This
approach decomposes the complex hand gesture recognition problem into their atomic
elements that would be easier to handle. The statistical analysis is responsible for the
extraction of the primitives of hand gestures such as hand postures and motion direc-
tions. The syntactic analysis focuses on the structural analysis of hand gestures based
on the extracted primitives and a set of grammars which define the composition rules.
The prerequisite for our system is that it will focus on bare hand gestures without the
help of any markers or colored gloves. Meanwhile, the system will use only one regular
video camera (e.g. web-camera) as input device to be cost-efficient. The performance
requirements to be met by the 3D hand tracking and gesture recognition system are:
• Accuracy: some mistakes and errors made by a hand gesture recognition system can
be tolerated. However, the system still needs to be accurate enough in order to be viable.
For instance, the system should be able to achieve a detection rate above 90% while
maintaining a low false positive rate for each hand gesture. Meanwhile, the system
should also be able to recognize different hand gestures without confusion among
them.
1.3 Thesis Organization

• Chapter 1 introduces the background and motivation of this work. The goals to
be achieved by the work are also given.
• Chapter 2 first introduces two fundamental concepts – hand posture and hand
gesture. A comprehensive literature review is given based on two dichotomies for
vision-based hand tracking and gesture recognition approaches: appearance-based
approaches versus 3D hand model-based approaches; statistical approaches versus
syntactic approaches. The pros and cons of these different approaches are compared
and discussed.
• Chapter 3 proposes the overall architecture for the system: a two-level architecture
which decouples hand gesture recognition into low-level hand posture detection and
tracking and high-level hand gesture recognition and motion analysis. The low-
level of the architecture detects hand postures using a statistical approach based
on Haar-like features and a boosting algorithm. The high-level of the architecture
employs a syntactic approach to recognize hand gestures and analyze hand motions
based on grammars.
• Chapter 5 presents the high-level hand gesture recognition and motion analysis us-
ing stochastic context-free grammars (SCFG). With the input string converted from
the postures detected by the low-level of the architecture, the hand gestures are
recognized based on a defined gesture SCFG. For the global hand motion analy-
sis, two SCFGs are used to analyze two structured hand gestures with different
trajectory patterns. Based on the different probabilities associated with these two
grammars, the SCFGs can effectively disambiguate the distorted trajectories and
classify them correctly.
1.4 Contributions
This thesis proposes a new architecture to solve the problem of real-time vision-based
hand tracking and gesture recognition with the combination of statistical and syntactic
analysis. The fundamental idea is to use a divide-and-conquer strategy based on the
hierarchical composition property of hand gestures so that the problem can be decoupled
into two levels. The low-level of the architecture focuses on hand posture detection and
tracking with Haar-like features and the AdaBoost learning algorithm. The high-level
of the architecture focuses on the hand gesture recognition and motion analysis using
syntactic analysis based on SCFGs. The original contributions of this thesis are:
3. The hand gestures are analyzed based on an SCFG, which defines the composite prop-
erties based on the constituent hand postures. The probability assigned to each
production rule of the SCFG can be used to control the “wanted” and “unwanted”
gestures: smaller probabilities can be assigned to the “unwanted” gestures and greater
values to the “wanted” gestures, so that the resulting SCFG generates the “wanted”
gestures with higher probabilities.
4. For hand motion analysis, where hand trajectories are uncertain and distorted, the
ambiguous versions can be identified by looking for the SCFG that has the higher
probability of generating the input string. The motion patterns can be controlled by
adjusting the probabilities associated with the production rules so that the resulting
SCFG generates the standard motion patterns with higher probabilities; a toy sketch
of this idea follows.
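To make the idea concrete, the following toy sketch (the grammars, movement symbols and probabilities are invented for illustration and are not the SCFGs defined in Chapter 5; it relies on NLTK's PCFG and ViterbiParser) scores the same observed motion string under two SCFGs and keeps the grammar with the higher generation probability:

from nltk import PCFG
from nltk.parse import ViterbiParser

# Two toy stochastic grammars over movement primitives r/d/l/u
# (right/down/left/up); each grammar generates its "standard" pattern
# with high probability and a distorted variant with low probability.
square = PCFG.fromstring("""
    S -> 'r' 'd' 'l' 'u' [0.9]
    S -> 'r' 'd' 'r' 'u' [0.1]
""")
zigzag = PCFG.fromstring("""
    S -> 'r' 'd' 'r' 'd' [0.9]
    S -> 'r' 'd' 'l' 'u' [0.1]
""")

def generation_prob(grammar, tokens):
    # Probability of the most likely derivation of `tokens`, 0 if no parse.
    parses = list(ViterbiParser(grammar).parse(tokens))
    return parses[0].prob() if parses else 0.0

observed = ['r', 'd', 'l', 'u']          # an observed (possibly distorted) trajectory
scores = {'square': generation_prob(square, observed),
          'zigzag': generation_prob(zigzag, observed)}
print(max(scores, key=scores.get), scores)   # classified as 'square'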
1.5 Publications Arising from the Thesis

3. Q. Chen, E. M. Petriu and N. D. Georganas, “3D hand tracking and motion analysis
with a combination approach of statistical and syntactic analysis,” in Proc. IEEE
Intl. Workshop on Haptic, Audio and Visual Environments and their Applications,
2007.
Chapter 2

Literature Review
2.1 Introduction
The human hand is a complex articulated structure consisting of many connected links
and joints. The skeleton structure and the joints of the human hand are shown in
Figure 2.1 [18]. Each finger consists of three joints whose names are indicated.

Figure 2.1: The hand skeleton and joints with associated DOF (from [18]).

For the thumb, there are 3 degrees of freedom (DOF) for the trapeziometacarpal joint, and 1
DOF for the thumb IP joint and the thumb MP joint respectively. For the remaining four
fingers, there are 2 DOF for the MCP joints, and 1 DOF for each PIP joint and DIP joint.
There is 1 DOF for each metacarpocarpal joint at the bottom of the ring finger and the
little finger. Considering the 4 DOF for the hand wrist, there are 27 DOF for the human
hand altogether [6].
Due to the high DOF of the human hand, hand gesture recognition becomes a very
challenging problem. To better understand hand gestures and hand motions, there are
two important concepts that need to be clarified [19, 20]:

• Hand Posture: a hand posture is a static hand pose and its current location without
any movements involved.

• Hand Gesture: a hand gesture is a dynamic hand movement, such as waving goodbye,
composed of a sequence of hand postures connected by continuous motions.
A hand posture is defined as a static hand pose. For example, making a fist and
holding it in a certain position is considered a hand posture. A hand gesture is defined
as a dynamic movement, such as waving goodbye. The dynamic movement of hand
gestures includes two aspects: global hand motions and local finger motions [21]. Global
hand motions change the position or orientation of the hand. Local finger motions involve
moving the fingers in some way without changing the position or orientation of
the hand. For example, moving the index finger back and forth to urge someone to
come closer. Compared with hand postures, hand gestures can be viewed as complex
composite hand actions constructed by global hand motions and a series of hand postures
that act as transition states. To further explain the difference between hand postures
and hand gestures, consider the signs for “d” and “z” in American Sign Language shown
in Figure 2.2 [22]. According to the above definitions, the sign for “d” is a hand posture,
while the sign for “z” is a hand gesture since it involves drawing the letter’s shape with
the index finger.

Figure 2.2: The signs for “d” and “z” in ASL (from [22]).
• The intuition aspect means the selected gestures should be intuitive and comfort-
able for the user to learn and to remember. The gestures should be straightfor-
ward so that the least effort is required for the user to learn the gestures. The
user should be able to use their natural hand configurations and not be required
to learn any specific or complex hand configurations, which can easily cause
fatigue and make the user uncomfortable.
• The articulatory aspect means the selected gestures should be easy for recognition
and should not cause confusion for the user. Gestures involving complicated hand
poses and finger movements should be avoided due to the difficulty of articulating
and repeating them.
• The technology aspect refers to the fact that in order to be viable, the selected
gestures must take into account the properties of employed algorithms and tech-
niques. The required data and information can be extracted and analyzed from
the selected gesture commands without causing excessive computation cost for the
employed approach.
To recognize hand gestures, we need a good set of characteristic features and the
knowledge of how they interrelate in representing hand gestures. Many algorithms and
approaches have been proposed. We summarize the latest trends and ideas for vision-
based hand tracking and gesture recognition systems proposed by different researchers
in Table 2.1.
One dichotomy to differentiate vision-based hand tracking and gesture recognition
algorithms is appearance-based approaches versus 3D hand model-based approaches.
Another dichotomy is based on the methodology used to describe hand gestures, which
could be statistical approaches or syntactic approaches. We will give detailed introduc-
tions of these different approaches in the following sections.
Table 2.1: Latest vision-based hand tracking and gesture recognition systems.

Barczak et al. 2005 [24]. Problem: real-time hand tracking. Method: the Viola-Jones method. Approach: appearance-based, statistical. Speed: real-time. Limitations: single posture (hand palm).

Zhou et al. 2005 [25]. Problem: articulated object (e.g. body/hand postures) recognition. Method: inverted indexing in an image database using local image features. Approach: hand model-based, statistical. Speed: 3 s/query. Limitations: non real-time.

Derpanis et al. 2004 [26]. Problem: hand gesture recognition from a monocular temporal sequence of images. Method: use linguistic analysis to decompose dynamic gestures into their static and dynamic components. Approach: appearance-based, syntactic. Speed: 8 s/frame. Limitations: non real-time.

Lin et al. 2004 [27]. Problem: tracking the articulated hand motion in a video sequence. Method: searching for an optimal motion estimate in a high-dimensional configuration space. Approach: hand model-based, statistical. Speed: 2 s/frame. Limitations: non real-time.

Ye et al. 2004 [28]. Problem: classify manipulative and controlling gestures. Method: compute 3D hand appearance using a region-based coarse stereo matching algorithm. Approach: stereo vision, statistical. Speed: real-time. Limitations: two cameras and calibration required.

Kölsch et al. 2004 [29]. Problem: fast tracking for non-rigid and highly articulated objects such as hands. Method: use a flock of KLT features/colors to facilitate 2D hand tracking and posture recognition from a monocular view. Approach: appearance-based, statistical. Speed: real-time. Limitations: strict requirement on hand pose configuration for recognition.

Bowden et al. 2004 [30]. Problem: sign language recognition. Method: use a two-stage classification by structuring the classification model around a linguistic definition of signed words and the Markov chain to encode temporal transitions. Approach: appearance-based, syntactic. Speed: real-time. Limitations: clean background; a long-sleeve shirt is required for the user.

Tomasi et al. 2003 [31]. Problem: 3D tracking for hand finger spelling motions. Method: use a combination of 2D image classification and 3D motion interpolation. Approach: hand model-based, statistical. Speed: real-time. Limitations: clean background.
Figure 2.3: Hand tracking using the color cue (from [33]).

As shown in Figure 2.3, the color cue is used for the hand segmentation step due to its computational
simplicity. To prevent errors from hand segmentation, they add a second step: hand
tracking. Tracking is performed assuming a constant velocity model and using a pixel
labeling approach. Several hand features are extracted and fed to a finite state classifier
to identify the hand configuration. The hand can be classified into one of the four gesture
classes or one of the four movement directions.
For shape-based algorithms, global shape descriptors such as Zernike moments and
Fourier descriptors are used to represent different hand shapes [43]. Most shape descrip-
tors are pixel-based and the computation cost is usually too high to implement real-time
systems [50]. Another disadvantage for shape-based approaches is the requirement for
noise-free image segmentation, which is a difficult task for the usually cluttered back-
ground images. Bowden et al. proposed an approach for sign language recognition that
provides high classification rates on minimal training data [30]. Key to this approach is
a 2 stage classification procedure where an initial classification stage extracts a high level
description of hand shape and motion. This high level description is based upon sign
linguistics and actions description at a conceptual level easily understood by humans.
The second stage of classification is to model the temporal transitions of individual signs
using a bank of Markov chains combined with Independent Component Analysis. The
classification rates of their approach can reach 97.67% for a lexicon of 43 words using
only single-instance training, which outperforms previous approaches where thousands
of training examples are required.
interface depending on accurate, real-time hand and fingertip tracking for seamless in-
tegration between real objects and associated digital information [44]. They introduce
a method for locating fingertip positions in image frames and measuring fingertip tra-
jectories across image frames. By using an infrared camera, their method can track
multiple fingertips reliably even on a complex background under changing lighting con-
ditions without invasive devices or color markers. A mechanism for combining direct
manipulation and symbolic gestures based on multiple fingertip motions was proposed.
Zhang et al. presented a vision-based interface system named “Visual Panel” (see
Figure 2.4), which employs an arbitrary quadrangle-shaped panel (e.g. an ordinary piece
of paper) and a fingertip pointer as an intuitive input device [45]. The system can
accurately and reliably track the panel and the fingertip pointer. By detecting the
clicking and dragging hand actions, the system can fulfill many tasks such as controlling
a remote large display, and simulating a physical keyboard. Users can naturally use their
fingers to issue commands and type text. Furthermore, by tracking the 3D position and
orientation of the visual panel, the system can also provide 3D information, serving as a
virtual joystick to control 3D virtual objects.
Malik et al. designed a plane-based augmented reality system that tracks planar pat-
terns in real-time, onto which virtual 2D and 3D objects can be augmented. Interaction
with the virtual objects is possible via a fingertip-based gesture recognition system [46].
As illustrated in Figure 2.5, the basis of the mechanism to capture the gesture is the num-
ber of detected fingertips. A single detected fingertip represents the gesture of pointing,
whereas multiple detected fingertips represent the gesture of selecting. The fingers of the
hand are detected by background subtraction and scanning the binary image for pixels
of full intensity. Each time such a pixel is found, a subroutine is called to perform a
neighborhood flood-fill operation to collect all neighboring pixels. The orientation of the
finger can be calculated using the central moments of the detected finger blob. The axis
line is then defined by forcing it through the blob’s centroid. The fingertip location is
recovered by finding the farthest point in the blob from the root point, and is used as
the pointer location.
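As a rough illustration of this fingertip-finding idea, the sketch below (a loose adaptation, not the implementation of [46]; it assumes a stored grayscale background image and a supplied root point, and uses a contour search in place of the flood-fill step) computes the blob centroid and orientation from central moments and takes the blob point farthest from the root as the fingertip:

import cv2
import numpy as np

def find_fingertip(frame_gray, background_gray, root_point, thresh=40):
    # Background subtraction followed by binarization of the difference image.
    diff = cv2.absdiff(frame_gray, background_gray)
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)

    # Collect the largest connected blob (stand-in for the flood-fill step).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    blob = max(contours, key=cv2.contourArea)

    # Centroid and orientation of the blob from its central moments.
    m = cv2.moments(blob)
    if m['m00'] == 0:
        return None
    cx, cy = m['m10'] / m['m00'], m['m01'] / m['m00']
    angle = 0.5 * np.arctan2(2 * m['mu11'], m['mu20'] - m['mu02'])

    # Fingertip location: the blob point farthest from the given root point.
    pts = blob.reshape(-1, 2).astype(np.float64)
    dists = np.linalg.norm(pts - np.asarray(root_point, dtype=np.float64), axis=1)
    tip = tuple(pts[np.argmax(dists)].astype(int))
    return tip, (cx, cy), angle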
Huang et al. introduced a model-based hand gesture recognition system, which con-
sists of three phases: feature extraction, training, and recognition [47]. In the feature
extraction phase, a hybrid technique combines the hand edge and hand motion information
of each frame to extract the feature images. Then, in the training phase, they use
principal component analysis (PCA) to characterize the spatial shape variations and
hidden Markov models (HMM) to describe the temporal shape variations. Finally, in
the recognition phase, with the pre-trained PCA models and HMMs, the observation pat-
terns can be generated from the input sequences, and the Viterbi algorithm is then applied
to identify the gesture.
For feature-based approaches, a clean image segmentation is generally a necessary
step to recover the hand features. This is not a trivial task when the background is
cluttered. On the other hand, for the highly articulated human hand, it is sometimes
difficult to find local hand features and heuristics that can handle the large variety of
hand gestures. It is not always clear how to correlate local hand features with
different hand gestures in an efficient manner.
Expanding the intensity function in a Taylor series and ignoring the higher-order terms, we have:

I(x + dx, y + dy, t + dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt
If the intensity value at (x + dx, y + dy, t + dt) is a translation of the intensity value at
(x, y, t), then I(x + dx, y + dy, t + dt) = I(x, y, t), so it must follow:
(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt = 0
which is equivalent to:

−∂I/∂t = (∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) = (∂I/∂x)u + (∂I/∂y)v
∂I/∂t at a given pixel is just how fast the intensity is changing with time. ∂I/∂x and
∂I/∂y are the spatial rates of the intensity change, i.e. how rapidly the intensity changes
across the picture. All three of these quantities can be computed for each pixel from
I(x, y, t). The goal is to compute the velocity:
(u, v) = (dx/dt, dy/dt)
Unfortunately the above constraint gives us only one equation for two unknowns, which
is not enough to get the answer. Thus an additional constraint is required to determine
the value of (u, v). One popular constraint is the smoothness constraint proposed by
Horn et al. in [51]. According to this constraint, when objects of finite size undergo rigid
motion or deformation, neighboring points on the objects have similar velocities and the
velocity field of the brightness patterns in the image varies smoothly almost everywhere.
One way to express this additional constraint is to minimize the square of the magnitude
of the gradient of the optical flow velocity:
(∂u/∂x)² + (∂u/∂y)²   and   (∂v/∂x)² + (∂v/∂y)²
Taking advantage of this constraint, the two unknowns (u, v) can be computed by an
iteration process based on minimizing the square of the magnitude.
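A minimal sketch of such an iteration in the spirit of the Horn-Schunck formulation is given below; the regularization weight alpha, the 3x3 averaging window and the iteration count are illustrative choices rather than values taken from [51]:

import numpy as np
from scipy.ndimage import uniform_filter

def horn_schunck(frame1, frame2, alpha=1.0, n_iter=100):
    I1 = frame1.astype(np.float64)
    I2 = frame2.astype(np.float64)

    # Spatial and temporal intensity derivatives (dI/dy, dI/dx, dI/dt).
    Iy, Ix = np.gradient(I1)
    It = I2 - I1

    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    for _ in range(n_iter):
        # Local averages approximate the smoothness constraint.
        u_avg = uniform_filter(u, size=3)
        v_avg = uniform_filter(v, size=3)

        # Update derived from jointly minimizing the brightness-constancy
        # and smoothness error terms.
        common = (Ix * u_avg + Iy * v_avg + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_avg - Ix * common
        v = v_avg - Iy * common

    return u, v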
Figure 2.6 shows two consecutive frames of the hand motion analysis using the optical
flow algorithm. We can easily tell the hand is moving left according to the directions
of the optical flow.

Figure 2.6: Hand motion analysis using the optical flow algorithm.

Cutler et al. presented a body gesture recognition system using optical flow in [48].
To recognize different body gestures, the optical flow is estimated and segmented into
motion blobs. Gestures are recognized using a rule-based technique with characteristics
of the motion blobs such as the relative motion and size of the arm.
Yuan et al. described a 2D hand tracking method that extracts trajectories of un-
ambiguous hand locations [52]. Candidate hand bounding squares are detected using
a novel feature based on motion residue. This feature is combined with skin detection
in color video. A temporal filter employs the Viterbi algorithm to identify consistent
hand trajectories. An additional consistency check is added to the Viterbi algorithm
to increase the likelihood that each extracted trajectory will contain hand locations cor-
responding to the same hand. Their experiments on video sequences of several hundred
frames demonstrate the system’s ability to track hands robustly (see Figure 2.7).
Figure 2.7: Hand tracking using motion residue and hand color (from [52]).
Figure 2.11: The robustness of the SIFT features against image rotation.
each image pair. Forward HMMs and neural networks are used to model the dynamics
of the gestures. A real-time system for gesture recognition shown in Figure 2.12 is
implemented to analyze the performance with different combinations of appearance and
motion features. Their experiment results show the system can achieve a recognition
accuracy of 96%.
The Viola-Jones algorithm was originally proposed for real-time face detection systems
and is approximately 15 times faster than any previous approach while achieving accuracy
equivalent to the best published results [57]. However, limited research has been done to
extend the method to hand detection and gesture recognition.
Kölsch and Turk studied view-specific hand posture detection with the Viola-Jones
algorithm in [59]. They presented a frequency analysis-based method for instantaneous
estimation of the “detectability” of different hand postures without the need for compute-
intensive training. Their experiment results show the classification accuracy increases
with a more expressive frequency-based feature type such as a closed hand palm. Kölsch
and Turk also evaluated the in-plane rotational robustness of the Viola-Jones algorithm
for hand detection in [60]. They found the in-plane rotation bounds for hand detection
are around ±15° for the same performance without increasing the classifier’s complexity.
A vision-based hand gesture interface named “HandVu” was developed by them, which
can recognize standard hand postures [61]. The “HandVu” system divides hand posture
recognition into three steps: hand detection, hand tracking and posture recognition [62].
To detect the hand, the user is required to put his hand palm on a predefined area of
the camera scene, and the system is able to extract corresponding features that can
represent the hand palm. As illustrated in Figure 2.13, the “HandVu” system employs
“Flocks of Features” based on gradients and colors to facilitate 2D tracking from a
monocular view [29]. After successful hand detection and tracking, the system attempts
posture classification using a fanned detector based on the Viola-Jones method. The
“HandVu” system uses a sequential process in which the posture recognition depends
on the previous results of hand detection and tracking. The posture recognition will
fail if the hand tracking is lost, and the whole process needs to start over from
the beginning. Another limitation is the compulsory hand detection step in this system,
which reduces the ease and naturalness of the system.
Barczak et al. introduced a method to efficiently generate hand image samples for
training the classifier based on the Viola-Jones algorithm [24]. The classifiers trained by
the generated samples are able to track and recognize the hand posed in a single posture.
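For illustration, the sketch below runs a trained Viola-Jones-style cascade on live video with OpenCV's CascadeClassifier; the file fist_cascade.xml is a placeholder for a cascade trained on a hand posture (e.g. with OpenCV's cascade training tools) and is not shipped with the library:

import cv2

# Placeholder file: a cascade trained on a hand posture, e.g. a fist.
cascade = cv2.CascadeClassifier("fist_cascade.xml")
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Multi-scale sliding-window detection with the boosted cascade.
    hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    for (x, y, w, h) in hands:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("posture detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()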
2.3 3D Hand Model-Based Approaches

Figure 2.14: The block diagram of 3D hand model-based approaches.
2.3.1 Analysis-by-Synthesis
Tomasi et al. implemented an analysis-by-synthesis 3D hand tracking system with a
detailed 3D hand model which can reflect the shape and articulations of a hand [31].
The 3D hand model can be animated by specifying pose parameters and joint angles.
A database of known hand poses and configurations is set up in advance. With each
configuration (and therefore for a whole set of views and samples for the same configura-
tion), the database stores the set of joint angles and pose parameters that describe that
configuration, or at least a similar one, obtained manually through the same graphics
package used for rendering. During tracking, they compare input video frames to these
samples, and whenever they see a “familiar” view, they retrieve the corresponding hand
configuration. In their experiments, a database with 15 views for each of 24 hand signs
is used. Vector quantization principal component analysis is employed to reduce the
dimensionality of the feature space. Because hand motions often occur too quickly in
the video, and the hand configurations are often too complex to track from each single
frame, 3D configuration interpolation between familiar views is used. This system re-
quires a single user, restricted lighting and background conditions, as well as a fairly
disciplined way of signing the gestures.
To alleviate the computation load of searching a high dimensional space, Wu et al.
proposed an approach by decoupling hand poses and finger articulations and integrating
them in an iterative framework [63]. They treat the palm as a rigid planar object and use
a 3D cardboard hand model to determine the hand pose based on the Iterative Closest
Point (ICP) algorithm. Since the finger articulation is also highly constrained, they
proposed an articulation prior model that reduces the dimensionality of the joint angle
space and characterizes the articulation manifold in the lower-dimensional configuration
space. To effectively incorporate the articulation prior into the tracking process, they
proposed a sequential Monte Carlo tracking algorithm using the importance sampling
technique. In their implementation, the hand gestures are performed in front of a clean
black background, and the proposed cardboard hand model cannot handle large out-
of-plane rotations and scaling invariance very well. In addition, the system requires a
user-specific calibration of the hand model that is done manually.
2.3.2 Image Retrieval

With a large database of synthetic hand images, the problem of hand pose estimation
can be converted to an image retrieval problem by finding the best match between the
input image and the database. Zhou et al. proposed an approach
to integrate the powerful text retrieval tools with computer vision techniques to improve
the efficiency for hand images retrieval [25]. An Okapi-Chamfer matching algorithm is
used in their work based on the inverted index technique. With the inverted index, a
training image is treated as a document, and a test image is treated as a query. In the
matching process, only documents that contain query terms are accessed and used so
that the query image can be identified at constant computational cost. This approach
can accelerate the database matching and improve the efficiency of image retrieval. To
enable inverted indexing in an image database, they built a lexicon of local visual features
by clustering the features extracted from the training images. Given a query image, they
extract visual features and quantize them based on the lexicon, and then look up the
inverted index to identify the subset of training images with non-zero matching score.
The Okapi weighting formula matches a query with a document based on a weighted
sum of the terms that appear in both the document and the query. To use the Okapi
weighting formula for image matching, they proposed to combine the Okapi weighting
formula with the Chamfer distance, assign a spatial tag to each local feature to record
its relative image position, and use this tag to calculate the Chamfer distance. The local
features they used are the binary patches along the boundary between foreground and
background. The geometry of the boundary is modeled by the spatial tags. The trade-off
of their approach is that labeling real-world images is time-consuming and error-prone,
and their tests on real-world query images are not very extensive.
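The lookup idea can be shown with a toy inverted index over quantized local features ("visual words"); the image identifiers and word ids below are made up, and the feature quantization step is assumed to have already produced integer word ids:

from collections import defaultdict

def build_inverted_index(database):
    """database: {image_id: set of visual-word ids}."""
    index = defaultdict(set)
    for image_id, words in database.items():
        for w in words:
            index[w].add(image_id)
    return index

def candidate_matches(index, query_words):
    """Only images sharing at least one word with the query are accessed."""
    scores = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, ()):
            scores[image_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

db = {'train_01': {3, 17, 42}, 'train_02': {5, 17}, 'train_03': {8, 9}}
index = build_inverted_index(db)
print(candidate_matches(index, {17, 42}))  # [('train_01', 2), ('train_02', 1)]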
Athitsos et al. proposed a method that can generate a ranked list of three-dimensional
hand configurations that best match an input image in [64]. Hand pose estimation is
formulated as an image database indexing problem, where the closest matches for an
input hand image are retrieved from a large database of synthetic hand images. The
novelty of their system is the ability to handle the presence of clutter by using two
clutter-tolerant indexing methods. First, a computationally efficient approximation of
the image-to-model chamfer distance is obtained by embedding binary edge images into a
high-dimensional Euclidean space. Second, a general-purpose, probabilistic line matching
method identifies those line segment correspondences between model and input images
that are the least likely to have occurred by chance. The performance of this clutter
tolerant approach is demonstrated in experiments with hundreds of real hand images as
Figure 2.15 shows. The total processing time, including hand segmentation, extraction
of line segments, and the two retrieval steps, was about 15 seconds per input image on
a PC with a 1.2GHz Athlon processor.
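For reference, a generic chamfer distance between two binary edge maps can be computed with a distance transform as sketched below; this is the plain formulation rather than the embedding-based approximation used in [64]:

import cv2
import numpy as np

def chamfer_distance(model_edges, input_edges):
    """Both inputs are uint8 edge maps of the same size (255 marks an edge pixel)."""
    # Distance transform of the non-edge pixels gives, for every pixel,
    # the distance to the closest input edge pixel.
    dist = cv2.distanceTransform(255 - input_edges, cv2.DIST_L2, 3)
    ys, xs = np.nonzero(model_edges)
    if len(xs) == 0:
        return np.inf
    # Average distance from each model edge pixel to its nearest input edge.
    return float(dist[ys, xs].mean())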
To overcome the high computation cost of the large number of appearance varia-
tions due to the high DOF of the 3D hand model and different viewpoints, Imai et al.
proposed a 2D appearance-based method by using hand contours to estimate 3D hand
posture in [65]. In their method, the variations of possible hand contours around the
registered typical appearances are trained from a number of computer graphic images
generated from a 3D hand model. The possible variations are efficiently represented as
2.4 Statistical Approaches

The two families of approaches differ in the pattern representations and recognition techniques they use:

Statistical approaches. Pattern representation: raw data, feature vectors, etc. Recognition techniques: direct matching, minimum distance, nearest neighbor, maximum likelihood, Bayes rule, etc.

Syntactic approaches. Pattern representation: strings, trees, graphs, etc. Recognition techniques: parsing, string matching, tree matching, graph matching, etc.
Figure 2.17: The block diagram of statistical approaches.
The interaction between the feature selection module and the learning module allows the user to opti-
mize the training process by adjusting the training strategies and parameters. In the
recognition component, the feature measurement module extracts the features from the
test sample and the trained classifier recognizes them based on the measured features
and the corresponding decision rules.
There are four different methods to design a classifier [67]:
• Similarity measurement: this is the most intuitive and simplest method. With
this approach, patterns that are similar should belong to the same class. Template
matching and minimum distance classifiers are typical methods of this category (a
minimal sketch of a minimum-distance classifier follows this list).
• Probabilistic classifiers: the most popular probabilistic classifier is the Bayes clas-
sifier that uses the Bayes rule to estimate the conditional probability of a class.
template.
• Decision tree classifiers: this type of classifiers can deduce the conclusion using
an iterative multistage decisions based on individual features at each node of the
decision tree. The basic idea is to break up a complex decision problem into a
series of several simpler decisions, so that the final conclusion would resemble the
desired solution. The most popular decision tree classifiers are binary decision tree
classifiers, which make true or false decisions at each stage based on the single
corresponding feature at the node.
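A minimal sketch of the first category, a minimum-distance (nearest class mean) classifier over synthetic feature vectors, is given below:

import numpy as np

def fit_class_means(X, y):
    # One "template" per class: the mean feature vector of its training samples.
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(means, x):
    # Assign x to the class whose mean is closest in Euclidean distance.
    return min(means, key=lambda label: np.linalg.norm(x - means[label]))

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
means = fit_class_means(X, y)
print(predict(means, np.array([0.1, 0.2])))   # -> 0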
To improve the overall classification accuracy, different classifiers can be combined
so that the overall performance can be optimized. A classifier combination can achieve
better performance especially when the individual classifiers are largely independent [67].
A typical example is the boosting algorithm, which combines a series of weak classifiers
(whose accuracies are only slightly better than 50%) into a strong classifier which has
a very small error rate on the training data [69]. In the boosting algorithm, individual
classifiers are invoked in a linear sequence. The inaccurate but cheap classifiers (low
computational cost) are applied first, followed by more accurate and expensive classifiers.
The number of mistakenly classified samples is reduced gradually as more individual
classifiers have been invoked and added to the sequence. The final strong classifier
(whose accuracy meets the requirement) is a linear combination of the invoked individual
classifiers.
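The scikit-learn sketch below illustrates this weak-to-strong combination on synthetic data standing in for feature vectors; the default weak learner is a one-level decision stump, and the sample counts and feature dimension are arbitrary choices:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 10)                          # synthetic "feature vectors"
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # a simple ground-truth rule

# The default weak learner is a one-level decision stump, only slightly better
# than chance on its own; AdaBoost reweights misclassified samples at each
# round and combines the stumps linearly into a strong classifier.
strong = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
strong.fit(X, y)
print("training accuracy:", strong.score(X, y))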
Besides the discussed statistical models, there is a special model that needs to be intro-
duced: neural networks, which have become an important tool in computer vision.
Neural networks are parallel computing systems consisting of a large number of intercon-
nected neurons (elementary processors), which mimic the complex structure of neurons
in human brains. Neural networks are able to learn complex nonlinear input-output
relationships using sequential training procedures and adapt themselves to the data. In
spite of the seemingly different mechanisms, there is considerable overlap between neural
networks and statistical models in the field of pattern recognition. Most of the well
known neural network models are implicitly equivalent or similar to classical statisti-
cal approaches [67]. Neural networks implement pattern recognition as black boxes by
concealing the complex statistics from the user.
2.5 Syntactic Approaches
• Primitives should correspond with significant natural elements of the object struc-
ture being described.
After the primitives are extracted, a grammar representing a set of rules must be
defined so that different patterns and activities can be constructed based on the extracted
primitives. This syntactic approach can be better explained by analogy with the English
language. A mathematical model for the grammar’s structure can be defined as:
G = [Vt, Vn, P, S]
In this model:
• Vt is a set of terminals, which are the most primitive symbols in the grammar.
Figure 2.18: The Chomsky hierarchy of grammars.
• S is the start symbol, from which all valid sequences of symbols we want to
produce can be derived.
αAβ → αγβ
A→γ
(1) S → aB    (2) S → bA
(3) A → aS    (4) A → bAA
(5) A → a     (6) B → bS
(7) B → aBB   (8) B → b
This grammar defines a language L(G) which is the set of strings consisting of an
equal number of a’s and b’s such as ab, ba, abba and bbaa.
A → αB or A → β
where A and B are nonterminals, α and β are terminals or empty strings. Besides
limiting the left side consisting of only a single nonterminal, this substitution rule
also restricts the right side: it may be an empty string, or a single terminal symbol,
or a single terminal symbol followed by a nonterminal symbol, but nothing else.
The languages generated by regular grammars are called regular or finite state
languages. One example is: G = (VN , VT , P, S), where VN = {S, A}, VT = {a, b},
and P :
S → aA
A → aA
A→b
This grammar defines the language L(G) = {aⁿb | n = 1, 2, . . .}, which includes strings
such as ab, aab and aaaab.
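Because this grammar is regular, membership in L(G) can be checked by a trivial finite-state recognizer; the sketch below mirrors the three productions directly:

def accepts(s: str) -> bool:
    state = 'S'
    for ch in s:
        if state == 'S' and ch == 'a':
            state = 'A'          # S -> aA
        elif state == 'A' and ch == 'a':
            state = 'A'          # A -> aA
        elif state == 'A' and ch == 'b':
            state = 'ACCEPT'     # A -> b
        else:
            return False         # no applicable production
    return state == 'ACCEPT'

assert accepts("ab") and accepts("aaaab")
assert not accepts("ba") and not accepts("aabb")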
All regular languages can be recognized by a finite state machine, and for context-free
grammars, there are many efficient parsers to recognize the corresponding languages.
After appropriate primitives are selected and the grammar is defined, a pattern or an
activity can be analyzed according to the block diagram shown in Figure 2.19. There are
two major components included in this diagram: the pattern representation component
and the structural/syntactic analysis (parsing) component. The pattern representation
component includes the decomposition module and the primitive recognition module.
After all of the primitives are extracted and recognized, the structural/syntactic analysis
component will analyze the relationship of the extracted primitives according to the
defined grammar. Different from statistical approaches, the training component is not
required for syntactic approaches.
One classic example for the application of syntactic approaches in computer vision
is the Picture Description Language (PDL) proposed by Shaw in [74]. As illustrated
in Figure 2.20, Shaw’s PDL accepts the description of line drawings in terms of a set
of primitives, operations and a grammar generating strings. The grammar is used to
direct the analysis or parse, and to control the calls on pattern classification routines for
primitive picture components. The benefits of Shaw’s PDL include ease of implemen-
tation and modification of picture processing systems, and simplification of the pattern
recognition problem by automatically taking advantage of contextual information.
Hand et al. believe it is feasible to employ a syntactic approach to understand hand
gestures in [75]. To do this, they defined a set of hand postures as the terminals of the
language. These terminals included hand postures like HO – hand open, HF – hand
fist, IF – index finger outstretched. Besides these hand posture terminals, they added
several hand movement terminals such as: MU, MD – move up and down; ML, MR –
move left and right; and MT, MA – move towards and away. After all of the terminals
are decided, a set of production rules is defined:
<Stop> ::= HF HO
According to the production rules, if the user wants to perform a “stop” command,
he would make a fist and then release it. To drag an object, the user needs to hold the
thumb and index finger closed and continue to move it.
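A hypothetical sketch of how such production rules could drive command recognition over a stream of recognized terminals is given below; the rules other than <Stop> are invented for illustration and are not the grammar defined in [75]:

RULES = {
    ('HF', 'HO'): 'stop',          # <Stop> ::= HF HO
    ('IF', 'MU'): 'scroll_up',     # index finger outstretched, move up (assumed rule)
    ('IF', 'MD'): 'scroll_down',   # assumed rule
}

def parse_terminals(terminals):
    """Scan a terminal string left-to-right and emit recognized commands."""
    commands, i = [], 0
    while i < len(terminals) - 1:
        pair = (terminals[i], terminals[i + 1])
        if pair in RULES:
            commands.append(RULES[pair])
            i += 2                  # consume both terminals
        else:
            i += 1
    return commands

print(parse_terminals(['HF', 'HO', 'IF', 'MU']))  # ['stop', 'scroll_up']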
Figure 2.21: A partial tree for hand postural features (from [76]).
Jones et al. suggest a method for sequentially parsing the structure of hand postures
given a set of observed values [76]. The approach is based on constructing and parsing a
feature tree like the one shown in Figure 2.21. Key features are represented by branches
such as “extended index”, and as the observed values are examined, a tree parsing takes
place. For example, if the user’s index finger is extended, the parsing first takes the left
branch, and all of the postures with this feature would be represented under this branch.
Derpanis et al. proposed an approach to exploit previous linguistic theory to represent
complex gestures in terms of their primitive components [26]. In their approach, dynamic
gestures are decomposed into static and dynamic components in terms of three sets of
primitives: hand shape, location and movement. An algorithm is proposed, which can
recognize gesture movement primitives given data captured with a single video camera.
By working with a finite set of primitives, which can be combined in a wide variety
of ways, their approach has the potential to deal with a large vocabulary of gestures.
They demonstrated that given a monocular gesture sequence, kinematic features can be
recovered from the apparent motion that provide distinctive signatures for 14 primitive
movements of ASL.
Another important application of syntactic approaches is for activity recognition.
Ivanov et al. used the stochastic context-free grammar parsing to recognize activities
taking place over extended sequences such as car parking and structured gestures com-
posed of simple hand trajectories [77]. The authors first split activities extended over time
into events using probabilistic event detectors. The sequence of events was then sent to
a stochastic context-free grammar parsing mechanism for recognition. The grammar
parsing mechanism provides longer range temporal constraints, disambiguates uncertain
detections of the events, and allows the inclusion of a priori knowledge about the structure of tem-
poral events in a given domain. They demonstrated how the system correctly interprets
activities of single and multiple interacting objects in experiments on gesture recognition
and video surveillance.
Ryoo et al. proposed a general methodology for automated recognition of complex
human activities [78]. They use a context-free grammar based representation scheme
to represent composite actions and interactions. The context-free grammar describes
complex human activities based on simple actions or movements. Human activities
are classified into three categories: atomic action, composite action, and interaction.
The system was tested to represent and recognize eight types of interactions: approach,
depart, point, shake-hands, hug, punch, kick, and push. The experiments show that the
system can recognize sequences of represented composite actions and interactions with a
high recognition rate.
Moore et al. presented a model of stochastic context-free grammar for characterizing
complex, multi-tasked activities that require both exemplars and models [79]. Exem-
plars are used to represent object context, image features, and motion appearances to
label domain-specific events. Then, by representing each event with a unique symbol, a
sequence of interactions can be described as an ordered symbolic string. The stochastic
context-free grammar, which is developed using underlying rules of an activity, provides
the structure for recognizing semantically meaningful behavior over extended periods.
Symbolic strings are parsed using the Earley-Stolcke algorithm to determine the most
likely semantic derivation for recognition. Parsing substrings allows the system to recog-
nize patterns that describe high-level, complex events taking place over segments of the
video sequence. The performance of the system is shown through experiments with a
popular card game by identifying player strategies and behavior extracted from real-time
video input.
Minnen et al. implemented a system that uses an extended stochastic grammar to
recognize a person performing the Towers of Hanoi task from a video sequence by ana-
lyzing object interaction events [80]. In this system, they extend stochastic grammars
by adding event parameters, state checks, and sensitivity to an internal scene model.
Experimental results from several videos showed robust recognition for the full task and
the constituent sub-tasks even though no appearance models are provided for the objects
in the video.
Yamamoto et al. proposed a new approach for recognition of task-oriented actions
based on a stochastic context-free grammar [81]. They use a context-free grammar to
recognize the action in Japanese tea services. A segmentation method was proposed to
segment the tea service action into a string of finer actions corresponding to terminal
symbols with very few errors. A stochastic context-free grammar-based parsing and a
Bayesian classifier were used to analyze and identify the action with a maximum posterior
probability.
2.6 Discussions
2.6.1 Appearance vs. 3D Hand Model
Generally speaking, it is easier for appearance-based approaches to achieve real-time
performance due to the comparatively simpler 2D image features. 3D hand model-
based approaches offer a rich description that potentially allows a wide class of hand
gestures. However, as the 3D hand model is a complex articulated deformable object
with many degrees of freedom, a very large image database is required to cover all the
characteristic hand images under different views. Matching the query images from the
video input with all hand images in the database is time-consuming and computationally
expensive. Another limitation of 3D hand model-based approaches is the lack of the
capability to deal with singularities that arise from ambiguous views [82]. Based on the
literature review, rather than hand gesture recognition, most current 3D hand model-
based approaches focus on real-time tracking for global hand motions and local finger
motions with restricted lighting and background conditions. Another issue for 3D hand
model-based approaches is the scalability problem, where a 3D hand model with specific
kinematic parameters cannot deal with a wide variety of hand sizes from different people.
2.6.2 Statistical vs. Syntactic

The statistical approach is suitable for primitive extraction and identification; the syntactic
approach is responsible for analyzing the structural relationship among the identified
primitives so that the whole pattern can be recognized.
Most current gesture recognition systems treat the hand gesture as a whole element
without considering its hierarchical composite property and breaking it into its simpler
constituent components that would be easier to process. This results in a rather slow
and inefficient system unsuited for real-time applications. To solve the problem, the
advantages brought by syntactic approaches need to be considered. An appropriate
combination of statistical and syntactic approaches can result in an efficient and effective
hand gesture recognition system.
2.7 Summary
This chapter introduces the concepts of hand posture, hand gesture, and how to select
appropriate gesture commands. Previous research work is reviewed along two dichotomies:
appearance-based approaches versus 3D hand model-based approaches, and
statistical approaches versus syntactic approaches.
For appearance-based approaches, it is generally easier to achieve real-time perfor-
mance. The tradeoff is their limited ability to cover different classes of hand gestures
due to the simpler image features employed. 3D hand model-based approaches offer a rich
description that potentially allows a wide class of hand gestures. However, the higher
computation cost often reduces the system’s processing speed.
For computer vision problems involving complex patterns and activities, it is more
appropriate and effective to use a syntactic approach to describe each pattern or activity
in terms of its primitives. The recognition of the primitives themselves may be better
accomplished by statistical approaches since little structural information is involved.
Chapter 3
A Two-Level Architecture
3.1 Selection of Postures and Gestures

Figure 3.2: Different semantic meanings for different global hand motions (from [84]).

As shown in Figure 3.3, all hand movements are divided into two categories: hand gestures
and unintentional movements. Unintentional movements are hand motions that are not
intended to communicate information. Hand gestures can be divided into two
groups: manipulative gestures and communicative gestures. Manipulative gestures are
the ones used to act on objects in an environment (such as picking up a box). Commu-
nicative gestures intend to communicate information. Communicative gestures can be
further divided into acts and symbols. Acts are gestures that are directly related to the
interpretation of the movement itself. Symbols are the gestures that have a linguistic
role and symbolize some referential actions. For the applications of human-computer
interaction, communicative gestures are the most commonly used since they can often
be represented by different static hand postures and movements.
In order to implement a system for hand gesture recognition, it is important to first
build an architecture that can provide a framework for the task.

Figure 3.3: The gesture taxonomy in the context of human-computer interaction.

The goal of our architecture is not to build a system capable of recognizing all manipulative and communicative
hand gestures, but to demonstrate an effective methodology which can achieve accurate
and robust recognition of a set of intuitive communicative hand gestures for the purpose
of human-computer interaction. For communicative hand gestures, only a relatively small
subset of all possible hand postures play a fundamental role [31]. The key hand postures
selected for our system are shown in Table 3.1. The selection of this posture set is based
on our experiment results which show no confusion is caused among these postures by
the employed algorithm. This set of key hand postures is sufficient for performing the
set of hand gestures listed in Table 3.2. Each gesture is composed of two postures. The
order of the postures can be reversed so that each gesture can be a repetitive action,
which improves the comfort for the user.
3.2 System Architecture

Figure 3.5: The block diagram of the two-level architecture.

The video input of the system comes from a regular web-camera, as shown
in Figure 3.6. This web-camera provides video capture with a maximum resolution of
640 × 480 at up to 15 frames per second. For our system, we set the camera resolution at
320 × 240 with 15 frames per second.
After the input image is loaded, the preprocessing module will segment the hand
posture from the background, remove noise, and perform any other operation which will
contribute to a clean image of the hand posture. The classifiers will detect and recognize
the postures so that they can be converted into terminal strings for the syntactic analysis.
The hand tracking module tracks global hand motions and keeps the trajectory for the
high-level global hand motion analysis.
To meet the requirement of real-time performance, we train our classifiers with a statistical approach based on a set of Haar-like features, which can be computed very quickly using the "Integral Image" technique. To achieve detection accuracy and further improve the computation speed, we use a boosting algorithm that efficiently rejects false images and detects target images with selected Haar-like features stage by stage. With a set of positive samples (i.e., images containing hand postures) and negative samples (i.e., random images that do not contain hand postures), the training module selects a series of Haar-like features that achieve the best classification accuracy and combines them into a final accurate classifier.
The high level of the architecture is responsible for the syntactic analysis of hand gestures. The local finger motion is analyzed according to the primitives represented by the different hand postures detected at the low level and a grammar that defines the relationship between the gestures and the primitives. The goal of the global hand motion analysis module is to identify different patterns of hand motion trajectories, which can have different semantic meanings. The global hand motion is analyzed based on the primitives represented by the movement directions detected by the hand tracking module at the low level. The global hand motion grammar defines the relationship between the hand motion trajectories and the correspondent primitives.
3.3 Summary
In this chapter, we propose a hybrid two-level architecture to solve the problem of hand gesture recognition. Considering the hierarchical composite property of hand gestures, this architecture decouples hand gesture recognition into two levels: low-level hand posture detection and tracking, and high-level hand gesture recognition and motion analysis. The low level of the architecture detects hand postures using a statistical approach based on Haar-like features and a boosting algorithm. The high level of the architecture employs a syntactic approach to recognize hand gestures and analyze hand motions based on defined grammars.
Chapter 4
center-surround features, one special diagonal line feature and eight line features. There
are three reasons for employing Haar-like features rather than raw pixels. The first reason is that Haar-like features can encode ad-hoc domain knowledge that is difficult to learn from a finite quantity of training data. Haar-like features are effective at capturing the characteristics represented by the difference between dark and bright areas within an image. A typical example is that the eye region of the human face is darker than the cheek region, and one Haar-like feature can efficiently capture that characteristic. Compared with raw pixels, Haar-like features can efficiently reduce the in-class variability and increase the out-of-class variability, thus making classification easier [87]. The second reason is that a Haar-like feature-based system can operate much faster than a pixel-based system. The third advantage brought by Haar-like features is that they are more robust against noise and lighting variations than other image features such as colors. Haar-like features focus on the difference between two or three connected areas (i.e., the white and black rectangles) inside an image rather than the value of each single pixel. Noise and lighting variations affect the pixel values of all areas, and this influence can be effectively counteracted by computing the difference between them.
The value of a Haar-like feature is the difference between the weighted sums of the pixel values within the black and white rectangular regions:

f(x) = W_{black} \cdot \sum_{blackRec} pixelValue - W_{white} \cdot \sum_{whiteRec} pixelValue

where W_{black} and W_{white} are weights that meet the compensation condition between the two types of rectangular areas. For example, for the Haar-like feature shown in Figure 4.2(a), W_{black} = 2 \cdot W_{white}; for the Haar-like feature shown in Figure 4.2(b), W_{black} = 8 \cdot W_{white}. If we set W_{white} = 1, then the value of the Haar-like feature in Figure 4.2(a) is:

f(a) = 2 \cdot \sum_{blackRec} pixelValue - 1 \cdot \sum_{whiteRec} pixelValue
The concept of “Integral Image” is used to compute the Haar-like features containing
upright rectangles [57]. The “Integral Image” at the location of p(x, y) contains the sum
of the pixel values above and left of this pixel inclusive (see Figure 4.3(a)):
P(x, y) = \sum_{x' \le x,\, y' \le y} p(x', y')

According to the definition of the "Integral Image", the sum of the grey level values within the area "D" in Figure 4.3(b) can be computed as:

P_1 + P_4 - P_2 - P_3

since

P_1 + P_4 - P_2 - P_3 = (A) + (A + B + C + D) - (A + B) - (A + C) = D
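To make the four-look-up computation above concrete, the following is a minimal Python sketch of the "Integral Image" technique; the toy array and the helper name rect_sum are ours, not part of the system.

import numpy as np

def integral_image(img):
    """P(x, y): sum of all pixel values above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(P, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four look-ups (P1 + P4 - P2 - P3)."""
    total = P[bottom, right]
    if top > 0:
        total -= P[top - 1, right]
    if left > 0:
        total -= P[bottom, left - 1]
    if top > 0 and left > 0:
        total += P[top - 1, left - 1]
    return total

img = np.arange(25, dtype=np.int64).reshape(5, 5)     # toy 5x5 "image"
P = integral_image(img)
assert rect_sum(P, 1, 1, 3, 3) == img[1:4, 1:4].sum()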
For Haar-like features containing 45◦ rotated rectangles, the concept of “Rotated
Summed Area Table (RSAT)” was introduced by Lienhart in [87]. RSAT is defined as
the sum of the pixels of a rotated rectangle with the bottom most corner at p(x, y) and
extending upwards to the boundaries of the image, which is illustrated in Figure 4.4(a):
R(x, y) = \sum_{y' \le y - |x - x'|} p(x', y')

According to the definition of the "RSAT", the sum of the grey level values within the area "D" in Figure 4.4(b) can be computed as:

R_1 + R_4 - R_2 - R_3
Figure 4.3: The "Integral Image": (a) the value P(x, y) at point p(x, y); (b) the sum over area D computed from P1, P2, P3 and P4.
Figure 4.4: The "Rotated Summed Area Table": (a) the value R(x, y) at point p(x, y); (b) the sum over area D computed from R1, R2, R3 and R4.
(Figure: the hierarchy used for detection: Haar-like features form weak classifiers, which are combined into cascade classifiers applied at different locations.)
The AdaBoost learning algorithm maintains a distribution of weights over each training sample (in our case, the hand posture images). It should be
noted that a Haar-like feature could be used repeatedly in the training process. We start
with the selection of the first Haar-like feature in the boosting process. The training
algorithm keeps the Haar-like feature that yields the best classification accuracy in the
first iteration. The classifier based on this Haar-like feature is added to the linear com-
bination with a strength proportional to the resulting accuracy. For the next iteration,
the training samples are re-weighted: training samples missed by the first weak classi-
fier are boosted in importance so that the second selected Haar-like feature must pay
more attention to these misclassified samples. To be selected, the second Haar-like feature must achieve a better accuracy on these misclassified training samples so that the error can be reduced. The iteration continues, adding new weak classifiers based on selected Haar-like features to the linear combination, until the required overall accuracy is achieved. The final training result is a strong classifier composed of a linear combination of the selected weak classifiers.
The detailed steps of the AdaBoost learning algorithm are given in [57] (a simplified sketch of the boosting loop is given after the listing):
• Given training images (x1, y1), . . . , (xn, yn) where yi = 0, 1 for negative and positive examples respectively.
• For t = 1, . . . , T :
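The remaining steps (weight normalization, weak classifier selection and re-weighting) follow the pattern described above. As a rough, self-contained illustration, the following is a simplified discrete AdaBoost over a toy feature matrix; it is a sketch of the general boosting idea, not the exact formulation used for the Haar-like feature training.

import numpy as np

def train_adaboost(F, y, T=5):
    """Simplified discrete AdaBoost with threshold ("stump") weak classifiers.
    F: (n_samples, n_features) matrix standing in for Haar-like feature values.
    y: labels in {0, 1}. Returns a list of (feature, threshold, polarity, alpha)."""
    n, m = F.shape
    w = np.full(n, 1.0 / n)                       # initial sample weights
    y_pm = 2 * y - 1                              # labels mapped to {-1, +1}
    strong = []
    for _ in range(T):
        w /= w.sum()                              # normalize the weights
        best = None
        for j in range(m):                        # exhaustively try stump classifiers
            for thr in np.unique(F[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * F[:, j] < pol * thr, 1, 0)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)     # strength of this weak classifier
        w *= np.exp(-alpha * y_pm * (2 * pred - 1))   # boost misclassified samples
        strong.append((j, thr, pol, alpha))
    return strong

# Toy example: six samples described by two feature values each.
F = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [7.0, 1.0], [8.0, 2.0], [9.0, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(train_adaboost(F, y, T=3))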
Figure 4.9: The cascade of classifiers: all sub-windows enter the trained cascade, and sub-windows rejected at each stage are discarded.
The cascade structure greatly reduces the overall computation time, as the initial stages of classifiers quickly remove a large fraction of background sub-windows, as illustrated in Figure 4.9; more computation is then focused on the more difficult sub-windows that pass the scrutiny of the initial stages of the cascade.
The object of interest is detected and tracked frame by frame from the images of
the video input. To detect the object of interest, each frame of the input image series is swept from the top-left corner to the bottom-right corner by a stretchable sub-window.
The sub-window of a specific size scans the picture pixel by pixel.
Figure 4.10: Stretching the sub-window's size with the scale factor (scale factor = 1, 1.05, 1.05^2, . . . , 1.05^n).
As the hand size
in the video input varies, the size of the sub-window also changes accordingly so that different hand sizes can be detected (see Figure 4.10). For the stretchable sub-window, the ideal approach is to start from a very small initial kernel size and increase its width and height by one pixel at each scan. However, this strategy results in a very large number of sub-windows and is not feasible if the real-time requirement is to be met. To avoid the excessive computation cost, a number of sub-windows have to be discarded. A scale factor is used to implement this strategy. For instance, to detect a hand palm within an image frame, we can start the sub-window with an initial kernel size of 15 × 30 and a scale factor of 1.05. For the next scan, the sub-window's size is stretched to (15 × 1.05) × (30 × 1.05). This process goes on until one side of the sub-window reaches the boundary of the image frame. According to the example shown in Figure 4.10, if the size of the input image is W × H, the initial kernel size is M × N and the scale factor is f, the total number
of sub-windows that have to be processed is:
\sum_{i=0}^{n} (W - M \cdot f^i)(H - N \cdot f^i)
where n is the number of times the scale factor has been applied, which satisfies M · f^n ≤ W or N · f^n ≤ H. The bigger the scale factor, the faster the computation.
However, if the scale factor is too big, the object with a size in between may be missed
by the sub-window.
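As a small numerical illustration of this formula, the following sketch counts the sub-windows for the example above (a 320 × 240 frame, a 15 × 30 initial kernel, a scale factor of 1.05); the stopping condition is our reading of when the scaled kernel no longer fits inside the frame.

# Total sub-windows processed by the multi-scale scan:
# sum over i of (W - M*f^i) * (H - N*f^i).
W, H = 320, 240      # frame size used in our system
M, N = 15, 30        # initial kernel size from the palm example
f = 1.05             # scale factor

total, i = 0, 0
while M * f ** i <= W and N * f ** i <= H:
    total += int(W - M * f ** i) * int(H - N * f ** i)
    i += 1

print(f"scales tried: {i}, sub-windows processed: {total}")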
Each stage of the cascade is trained to reject about half of the negative samples while correctly accepting 99.9% of the face samples. The final trained cascade consists of 20 stages [91].
We first evaluate the robustness against image rotations, which include in-plane rotations and out-of-plane rotations. In-plane rotation means the image is rotated by a certain degree around the "Z" axis perpendicular to the image plane, such as the example shown in Figure 4.11(a). Out-of-plane rotations are rotations around the "X" axis or the "Y" axis (e.g., Figure 4.11(b) and Figure 4.11(c)).
Figure 4.11: Test samples with (a) rotation around the "Z" axis, (b) rotation around the "X" axis, and (c) rotation around the "Y" axis.
Figure 4.12: Test results for face images rotated around “Z” axis.
Figure 4.13: Test results for face images rotated around “X” axis.
We generate 500 test samples with a rotated face image superimposed on random
backgrounds as Figure 4.11 shows. The rotation range is from 0◦ to 20◦ with a step of
5°. For out-of-plane rotations, we also generate 500 test images with rotations from 0° to 40° at a step of 10°. Some test results are shown in Figure 4.12, Figure 4.13 and Figure 4.14.
Figure 4.14: Test results for face images rotated around the "Y" axis.
Figure 4.15 shows the detection rates corresponding to different in-plane rotation degrees. The detection rate reduces to 83.8% when the rotation reaches 10°, and further reduces to 62% when the rotation reaches 15°. Figure 4.16 and Figure 4.17 show the detection rates corresponding to different out-of-plane rotations. The detection rates remain around 99% when the rotation reaches 20°. When the rotation reaches 30°, the detection rates reduce to 87.8% and 82.4% for rotations around the X axis and the Y axis, respectively.
Figure 4.15: The robustness evaluation for images rotated around the "Z" axis (detection rate: 100% at 0°, 98.8% at 5°, 83.8% at 10°, 62% at 15°, 48% at 20°).
Figure 4.16: The robustness evaluation for images rotated around the "X" axis (detection rate: 100% at 0°, 99.6% at 10°, 99% at 20°, 87.8% at 30°, 68.6% at 40°).
Figure 4.17: The robustness evaluation for images rotated around the "Y" axis (detection rate: 100% at 0°, 100% at 10°, 98.4% at 20°, 82.4% at 30°, 64.2% at 40°).
To evaluate the robustness against different lighting conditions, we adopt the model
of HSV (Hue, Saturation, Value) color space shown in Figure 4.18 [92]. In HSV color
space, hue is the color reflected from an object. Saturation is the strength or purity of
the color. Value is the relative brightness or darkness of the color, usually measured as
a percentage from 0% (black) to 100% (white).
We vary the brightness of the test images by adjusting the "Value" range. For example, the "Value" range of an original image is from 0% to 100%; to brighten the image, we can narrow this range to 60% to 100% (i.e., +60%), and this shift brightens the image accordingly. Figure 4.19 shows the test images with different brightness values. We narrow the range of the input image from the original to ±40%, ±60% and ±80%. At each brightness value, 500 test images are generated by superimposing the image on random backgrounds. Based on the evaluation results, the frontal face cascade classifier achieved a 100% detection rate for all of the test images of different brightness. Figure 4.20 shows some test results.
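The brightness variation described above can be reproduced by remapping the V channel linearly into a narrowed range. A minimal OpenCV sketch (the input filename is a placeholder):

import cv2
import numpy as np

def shift_brightness(bgr, low_pct, high_pct):
    """Remap the V channel into [low_pct, high_pct] of its full range,
    e.g. (60, 100) is the "+60%" case and (0, 40) is the "-60%" case."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)              # uint8, V in [0, 255]
    v = hsv[:, :, 2].astype(np.float32) / 255.0             # V rescaled to [0, 1]
    lo, hi = low_pct / 100.0, high_pct / 100.0
    hsv[:, :, 2] = np.clip((lo + v * (hi - lo)) * 255.0, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

img = cv2.imread("face_sample.jpg")                         # placeholder filename
brighter = shift_brightness(img, 60, 100)                   # the "+60%" case
darker = shift_brightness(img, 0, 40)                       # the "-60%" case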
Figure 4.20: Test results for face images with different brightness values: (a) −80%, (b) −60%, (c) −40%, (d) original, (e) +40%, (f) +60%, (g) +80%.
Figure 4.21: A fraction of the positive samples for the “two fingers” posture.
achieve a better accuracy by getting rid of false samples with a large variety. All negative
samples are provided through a background description file which contains the filenames (relative to the directory of the description file) of all negative sample images.
The numbers of positive and negative samples are based on our experiment results: cascades trained with 450 positive and 500 negative samples already come close to their representation power, and larger training sets do not affect the training result significantly. However, the training time increases dramatically with larger training sets.
Figure 4.22: A fraction of the negative samples used in the training process.
Table 4.2: The initial Haar-like features selected by the trained cascades of classifiers (for each of the four postures, the first selected feature is a line feature).
In order to evaluate the performance of the cascade classifiers, 100 marked-up test images for each posture are collected. The scale factor is set to 1.05 since it provides the best balance between the detection rate and the processing speed. Table 4.1 shows the performance of the four cascades of classifiers and the time spent detecting all 100 test images. Figure 4.23, Figure 4.24, Figure 4.25 and Figure 4.26 show some of the detection results for each posture.
By analyzing the detection results, we find that some of the missed positive samples are caused by excessive in-plane rotations. The majority of the false alarms happen in very small areas of the image scene which contain black and bright patterns similar to those detected by the selected Haar-like features. These false detections can be easily removed by defining a threshold on the minimum size of the detected object. Among the four classifiers, the "fist" classifier achieves the most stable performance. The major reason is that the "fist" posture has a comparatively uniform round contour (similar to the human face) and displays a set of stable image features which can be detected more easily by the Haar-like features.
The times required for the classifiers to detect the 100 test images are all within 3 seconds. We tested the real-time performance with live input from the web-camera at 15 frames per second and a resolution of 320 × 240; there is no detectable pause or latency when tracking and detecting the hand postures with all of our trained classifiers.
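A minimal sketch of such a live detection loop with OpenCV, assuming the four trained cascades have been saved to the (placeholder) .xml files below; the parallel structure simply runs every cascade on each frame:

import cv2

cascade_files = {"palm": "palm.xml", "fist": "fist.xml",
                 "two_fingers": "two_fingers.xml", "little_finger": "little_finger.xml"}
cascades = {name: cv2.CascadeClassifier(path) for name, path in cascade_files.items()}

cap = cv2.VideoCapture(0)                          # live input from the web-camera
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 320)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 240)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for name, cascade in cascades.items():
        # minSize acts as the minimum-size threshold used to suppress small false alarms.
        hits = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=4,
                                        minSize=(30, 30))
        for (x, y, w, h) in hits:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, name, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
                        0.5, (0, 255, 0), 1)
    cv2.imshow("postures", frame)
    if cv2.waitKey(1) & 0xFF == 27:                # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()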
Figure 4.23: A fraction of the testing results for the "two fingers" cascade classifier.
Figure 4.24: A fraction of the testing results for the "palm" cascade classifier.
Figure 4.25: A fraction of the testing results for the "fist" cascade classifier.
Figure 4.26: A fraction of the testing results for the "little finger" cascade classifier.
Figure 4.27: The parallel cascades structure for hand posture classification.
displayed at the center coordinates of the detected posture showing the motion direction
of the hand movement. For our implementation, the major challenge is to recover the
depth information z, which is the distance between the camera and the hand. The camera
performs a linear transformation from the 3D projective space to the 2D projective space.
With the known camera’s intrinsic parameters, according to the perspective projection
illustrated by Figure 4.29, the projected point (x0 , y 0 , z 0 ) on the image plane satisfies the
triangle equations:
\frac{x'}{x} = \frac{y'}{y} = \frac{f}{z}
where f is the focal length of the camera. Based on the triangle equation, we can get
the perspective projection equations:
x' = \frac{f}{z}\, x, \qquad y' = \frac{f}{z}\, y
According to the above equations, if the real size of the user's hand and the size of the detected sub-window are known, the depth z between the camera and the hand can be recovered from their ratio, and the 3D position of the hand follows from the projection equations.
Figure 4.29: The perspective projection: a 3D point (x, y, z) is projected to (x', y', z') on the image plane at focal length f along the Z axis.
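As an illustration of this depth recovery, a minimal sketch under the pinhole model above; the focal length in pixels (focal_px) and the real hand width (hand_width_cm) are made-up example values that would have to be calibrated and measured in practice.

def hand_position_3d(box_x, box_y, box_w, box_h, frame_w, frame_h,
                     focal_px=500.0, hand_width_cm=8.0):
    """Recover the hand's (x, y, z) from the detected sub-window using x'/x = y'/y = f/z."""
    # Image-plane coordinates of the box centre, relative to the principal point.
    u = (box_x + box_w / 2.0) - frame_w / 2.0
    v = (box_y + box_h / 2.0) - frame_h / 2.0
    # Depth from the ratio of the real hand width to its projected width.
    z = focal_px * hand_width_cm / box_w
    # Back-project the centre: x = x' * z / f, y = y' * z / f.
    return u * z / focal_px, v * z / focal_px, z

# Example: a 60-pixel-wide detection near the centre of a 320 x 240 frame.
print(hand_position_3d(box_x=130, box_y=90, box_w=60, box_h=60,
                       frame_w=320, frame_h=240))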
Figure 4.30: The depth information recovered according to the perspective projection.
4.8 Summary
This chapter presents the low-level of the architecture: hand posture detection and tracking. To meet the real-time requirement, we trained a series of cascades of classifiers for
each hand posture based on Haar-like features and the AdaBoost learning algorithm,
which can achieve fast, accurate and robust performance. A parallel cascades structure
is implemented to identify the selected hand postures from the camera’s live input. The
3D position of the hand is recovered according to the location of the sub-window and the
perspective projection based on the camera’s intrinsic parameters. Background subtrac-
tion and noise removal are used to achieve the robustness against cluttered backgrounds.
Chapter 5
(Figure: a bank of syntactic classifiers, each testing whether the input string x belongs to the language L(Gi) generated by grammar Gi.)
Given an unknown pattern represented by a string x and a set of grammars G1, . . . , Gn, the classification question is:
Is x ∈ L(Gi), for i = 1, . . . , n?
If the language generated by a grammar is finite with a reasonable size, the syntactic
classifier can search for a match between the unknown pattern and all the words of the
language. When the grammar is complex and the generated language involves a large
number of words, it is more appropriate to use a syntactic analysis approach (i.e. parsing)
based on whether the production rules of the grammar can generate the unknown pattern.
If the parsing process is successful, which means the unknown pattern can be generated
by this grammar, we can classify this unknown pattern to the class represented by this
grammar. If the parsing process is not successful, then the unknown pattern is not
accepted as an object of this class.
A syntactic classifier can also be constructed based on the production rules of a grammar. With this approach, each unknown pattern corresponds to a specific word generated by the grammar. By analyzing the correspondent production rule as well as the associated probability (e.g., for stochastic grammars), the unknown pattern can be classified according to the production rule which has the greatest probability of generating the pattern.
In practical applications such as hand motion analysis, a certain amount of uncer-
tainty exists such as distorted movement trajectories. Grammars used to describe dis-
torted patterns are often ambiguous in the sense that a distorted pattern might be
generated by more than one grammar. Under these situations, stochastic grammars can
be used in order to avoid the confusions caused by the distorted patterns.
Stochastic context-free grammars (SCFG) are used in our system to describe the
structural and dynamic information about hand gestures. The SCFG is an extension of
the context-free grammar (CFG). The difference between SCFGs and CFGs is that for
each production rule in SCFGs, there is a probability associated with it. Each SCFG is
a four tuple:
GS = (VN , VT , PS , S)
where VN and VT are finite sets of nonterminals and terminals; S ∈ VN is the start symbol; PS is a finite set of stochastic production rules, each of which is of the form:

X → µ_j with probability P_j, for j = 1, . . . , k,

where µ_1, . . . , µ_k are all of the strings that can be derived from X and the probabilities satisfy P_1 + · · · + P_k = 1. In an SCFG, the notion of context-free essentially means that the production rules are conditionally independent [93].
If a string y ∈ L(GS ) is unambiguous and has a derivation with production rules
r1 , r2 , . . . , rk ∈ PS , then the probability of y with respect to GS is:
P(y | G_S) = \prod_{i=1}^{k} P(r_i)
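For instance, a derivation that uses rules with probabilities 25%, 80%, 80%, 80% and 10% has probability 0.25 × 0.8 × 0.8 × 0.8 × 0.1 = 1.28%; a one-line illustration (the probabilities are arbitrary example values):

import math

# P(y | G_S) is the product of the probabilities of the rules used in the derivation.
rule_probabilities = [0.25, 0.80, 0.80, 0.80, 0.10]           # example values only
print(f"P(y | G_S) = {math.prod(rule_probabilities):.4f}")    # 0.0128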
SCFGs extend CFGs in the same way that Hidden Markov models (HMMs) extend
regular grammars. The relation between SCFGs and HMMs is very similar to that be-
tween CFGs and non-probabilistic Finite State Machines (FSMs), where CFGs relax
some of the structural limitations imposed by FSMs; because of this, SCFGs have more flexibility than HMMs [77]. Compared with non-stochastic CFGs, with the prob-
ability attached to each production rule, SCFGs provide a quantitative basis for ranking
and pruning syntactic analysis [79]. With SCFGs, we just need to compute the proba-
bility of the pattern belonging to different classes (each class is described with its own
grammar), and then choose the class with the highest probability value. The probabilis-
tic information allows us to use statistical techniques not only for finding best matches,
but also for implementing robust feature detection guided by the grammar’s probability
structure.
For our gesture set, the SCFG is defined with
V_NG = {S}, V_TG = {p, f, t, l}
and P_G:
r1: S → pf (40%), r2: S → tf (35%), r3: S → lf (25%)
The terminals p, f, t, l stand for the four postures: “palm”, “fist”, “two fingers” and
“little finger”. The probabilities assigned to different production rules take two aspects
into account: the user’s preference for different gestures and the avoidance of confusion
for distorted gestures.
(Figure: the gesture set, each gesture composed of two postures: p–f, t–f, and l–f.)
A pipe structure shown in Figure 5.3 is implemented to convert the input postures
into a sequence of terminal strings.
After a string “x” is converted from the postures, we can decide the most likely
product rule that can generate this string by computing the probability:
D(zr , x) is the similarity between the input string “x” and “zr ”, which is the string
derived by the production rule “r”. D(zr , x) can be computed according to:
Count(zr ∩ x)
D(zr , x) =
Count(zr ) + Count(x)
Figure 5.3: The pipe structure to convert hand postures to terminal strings.
Example 5.3: If an input string is detected as “pl”, which does not match any of
The flexibility of the SCFG allows the user to easily change the grammar so that other
gestures with different combinations of detected postures or more complex gestures can
be described. The assignment of the probability to each production rule can also be
used to control the “wanted” gestures and the “unwanted” gestures. Greater values of
probability could be assigned to the “wanted” gestures such as the “Grasp” gesture in
previous examples. Smaller probability could be assigned to the “unwanted” gestures so
that the resulting SCFG would generate the “wanted” gestures with higher probabilities.
Example 5.4: If we noticed the “Quote” gesture is more often performed by users
and the “J” gesture is least popular, we can define PG as:
r1: S → pf (35%), r2: S → tf (40%), r3: S → lf (25%)
where “r2 ” (corresponding to the “Quote” gesture) is assigned with the highest proba-
bility. If an input string is detected as “tl”, since
0 35
P (r1 ⇒ tl) = × =0
4 100
1 40
P (r2 ⇒ tl) = × = 10%
4 100
1 25
P (r3 ⇒ tl) = × = 6.25%
4 100
Based on the highest probability, the gesture represented by the string "tl" is recognized as a "Quote" gesture. Even though the production rule "r3" also shares a posture primitive ("l") with the input string, its associated probability is only 25%, which results in a smaller probability of generating the string "tl".
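A minimal sketch of this classification step, using the similarity measure D(z_r, x) and the rule probabilities of Example 5.4; the dictionary below simply transcribes r1, r2 and r3:

def similarity(z_r, x):
    """D(z_r, x) = Count(z_r ∩ x) / (Count(z_r) + Count(x))."""
    return len(set(z_r) & set(x)) / (len(z_r) + len(x))

def classify(x, rules):
    """Pick the production rule that maximizes D(z_r, x) * P(r)."""
    scores = {name: similarity(z_r, x) * p for name, (z_r, p) in rules.items()}
    return max(scores, key=scores.get), scores

rules = {"r1": ("pf", 0.35), "r2": ("tf", 0.40), "r3": ("lf", 0.25)}
best, scores = classify("tl", rules)
print(best, scores)    # r2 wins: 1/4 * 0.40 = 0.10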
Figure 5.6: The assignment of direction primitives according to the slope values (eight directions numbered 0-7 in the dx-dy plane).
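A minimal sketch of quantizing a movement vector into one of the eight direction primitives; the particular numbering below (0 along +dx, codes increasing counter-clockwise in 45° steps, with dy measured upwards) is our reading of Figure 5.6 and may differ from the exact assignment used in the system.

import math

def direction_primitive(dx, dy):
    """Map a movement vector to a direction code 0-7."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return int(((angle + 22.5) % 360.0) // 45.0)

# Left, down, right, up: the standard rectangle trajectory "4602".
moves = [(-1, 0), (0, -1), (1, 0), (0, 1)]
print("".join(str(direction_primitive(dx, dy)) for dx, dy in moves))    # 4602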
The diamond trajectory starts from the top vertex. The rectangle and the diamond can be drawn with any of the four hand postures. When moving the hand in front of the camera, it is very hard to keep the hand moving strictly along the lines so that perfect shapes are drawn. As a result, many noisy versions will be drawn, like the distorted shapes shown in Figure 5.8. Among these noisy versions, some distorted shapes can be clustered into
either the rectangle class or the diamond class, and thus cause ambiguity for our motion analysis.
Figure 5.8: The distorted versions of the two standard trajectories: (a) the rectangle's distorted versions; (b) the diamond's distorted versions.
To solve the problem, we define two SCFGs to describe the structured gestures. The
SCFG defined for the rectangle gesture is G_rectangle = (VN_rectangle, VT_rectangle, R_rectangle, S), where
VN rectangle = {S, A1 , A2 , A3 , A4 },
VT rectangle = {t, p, l, f, 0, 1, 2, 3, 4, 5, 6, 7}
The terminals p, f, t, l stand for the four postures: “palm”, “fist”, “two fingers” and “little
finger”. The numbers are the primitives for the motion directions defined in Figure 5.4.
The grammar Rrectangle is defined by:
S → tA1 (25%), S → pA1 (25%), S → lA1 (25%), S → fA1 (25%)
A1 → 4A2 (80%), A1 → 3A2 (10%), A1 → 5A2 (10%)
A2 → 6A3 (80%), A2 → 5A3 (10%), A2 → 7A3 (10%)
A3 → 0A4 (80%), A3 → 1A4 (10%), A3 → 7A4 (10%)
A4 → 2 (80%), A4 → 1 (10%), A4 → 3 (10%)
The SCFG defined for the diamond gesture is G_diamond = (VN_diamond, VT_diamond, R_diamond, S), where
VN diamond = {S, A1 , A2 , A3 , A4 },
VT diamond = {t, p, l, f, 0, 1, 2, 3, 4, 5, 6, 7}
The grammar Rdiamond is:
S → tA1 (25%), S → pA1 (25%), S → lA1 (25%), S → fA1 (25%)
A1 → 5A2 (70%), A1 → 4A2 (15%), A1 → 6A2 (15%)
A2 → 7A3 (70%), A2 → 0A3 (15%), A2 → 6A3 (15%)
A3 → 1A4 (70%), A3 → 0A4 (15%), A3 → 2A4 (15%)
A4 → 3 (70%), A4 → 1 (15%), A4 → 2 (15%)
In Rrectangle and Rdiamond , there are four possibilities for the start symbol S to begin
with, which means both gestures can be performed by any of the four hand postures with
the same probability of 25%. The string for the standard rectangle gesture is “4602”
according to the defined primitives. Considering the noisy versions, we set a probability
of 80% for the first primitive to be 4, and 10% for its distorted version that can be
either 3 or 5. The same probabilities are also assigned to the other three primitives and
their noisy versions. The string for the standard diamond gesture is “5713”. We set a
probability of 70% for the standard primitives, and 15% for the distorted versions.
Example 5.5: Consider the distorted shape "4603" drawn with the "little finger" posture in Figure 5.9(a). The parsing according to the grammar Rrectangle is:
S → lA1 (25%)
A1 → 4A2 (80%)
A2 → 6A3 (80%)
A3 → 0A4 (80%)
A4 → 3 (10%)
Based on the parsing, the probability for this distorted shape to be generated by the rectangle gesture class is:
P(4603 | G_rectangle) = 25% × 80% × 80% × 80% × 10% = 1.28%
It is noticed that this distorted shape can also be generated by the grammar Rdiamond
with the parsing:
S → lA1 (25%)
A1 → 4A2 (15%)
A2 → 6A3 (15%)
A3 → 0A4 (15%)
A4 → 3 (70%)
The probability for this distorted shape to be generated by the diamond gesture class is:
P(4603 | G_diamond) = 25% × 15% × 15% × 15% × 70% ≈ 0.06%
Figure 5.10: The classification results for the strings (a) "4603" and (b) "5603".
Since
P(4603 | G_rectangle) > P(4603 | G_diamond),
the distorted shape represented by the string "4603" should be classified as a rectangle gesture, as Figure 5.10(a) shows.
Example 5.6: Now consider another distorted shape “5603” drawn by the “fist” pos-
ture in Figure 5.9 (b). The probabilities for the grammars to generate this distorted
shape are:
P(5603 | G_rectangle) = 25% × 10% × 80% × 80% × 10% = 0.16%
P(5603 | G_diamond) = 25% × 70% × 15% × 15% × 70% ≈ 0.28%
Since
P(5603 | G_diamond) > P(5603 | G_rectangle),
the distorted shape represented by the string "5603" should be classified as a diamond gesture, as Figure 5.10(b) shows.
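A minimal sketch that reproduces Examples 5.5 and 5.6 by scoring an input string against both grammars; the tables below simply transcribe the probabilities of R_rectangle and R_diamond, with the posture step taken as the common 25%.

# Probabilities of each direction primitive at steps A1..A4 of the two grammars.
RECTANGLE = [{"4": 0.80, "3": 0.10, "5": 0.10},
             {"6": 0.80, "5": 0.10, "7": 0.10},
             {"0": 0.80, "1": 0.10, "7": 0.10},
             {"2": 0.80, "1": 0.10, "3": 0.10}]
DIAMOND = [{"5": 0.70, "4": 0.15, "6": 0.15},
           {"7": 0.70, "0": 0.15, "6": 0.15},
           {"1": 0.70, "0": 0.15, "2": 0.15},
           {"3": 0.70, "1": 0.15, "2": 0.15}]

def trajectory_probability(directions, grammar, posture_prob=0.25):
    """P(string | grammar): the posture step times one production per direction."""
    p = posture_prob
    for step, symbol in zip(grammar, directions):
        p *= step.get(symbol, 0.0)      # symbols the grammar cannot produce get 0
    return p

for s in ("4603", "5603"):
    pr = trajectory_probability(s, RECTANGLE)
    pd = trajectory_probability(s, DIAMOND)
    label = "rectangle" if pr > pd else "diamond"
    print(f"{s}: P(rect) = {pr:.4%}, P(diamond) = {pd:.4%} -> {label}")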
5.4 Summary
This chapter introduces the high-level hand gesture recognition and motion analysis
based on SCFGs. The postures detected by the low level are first converted into a string of terminals according to the grammar. Based on the similarity measurement and the probability associated with the correspondent production rule, the gesture can be identified by looking for the production rule that has the highest probability of generating the input string. For the global hand motion analysis, we
use SCFGs to analyze two structured hand gestures with different trajectory patterns:
the rectangle gesture and the diamond gesture. Based on the probabilities associated
with the production rules, given an input string, the SCFGs can effectively disambiguate
the distorted trajectory patterns and identify them correctly.
Chapter 6
Visual metaphors such as graphs and charts make abstract information easier for people to understand and learn [98, 99]. These visual metaphors can motivate people, improve memorization and focus the learner's attention. To exploit the advantages brought by visual metaphors, appropriate information visualization tools need to be selected to represent the abstract data efficiently. With the immersive experience provided by VEs, learning object repositories can be mapped to a 3D gaming VE, which provides a novel access paradigm and improves the user's learning experience significantly. Students and teachers can search, access and interact with learning objects in an engaging 3D gaming VE rather than formulating complex search criteria and reading a long list of search results.
To facilitate the information transformation process, as illustrated in Figure 6.1, we implement a 3D gaming VE structure to search the learning object repository with hand gestures. The layout of the 3D gaming VE is shown in Figure 6.2, which allows the user to visualize search results in an attractive display space. When a search query is sent by the user, as Figure 6.3 shows, the search results are displayed as different traffic signs grouped along the virtual highways according to the keywords. This visually organized representation of the information allows users to gain insight into the data and interact with it directly. The user is represented by the avatar car, which can navigate through the gaming VE along the virtual highways. As Figure 6.4 shows, the attributes of a search result can be displayed by selecting its traffic sign, and an attribute window will pop up with information including the result's title, format, keyword, etc. If the user
Figure 6.1: The user, the gesture-based interface, the 3D gaming VE and the task.
Figure 6.3: Search results are mapped to traffic signs along the virtual highway.
Figure 6.4: Checking the attributes and detailed information of the search result.
feels it necessary to have a detailed look at the correspondent learning object, he can open a separate explorer window that goes to the location and displays all of the information of this learning object. An overall world layout is provided to show the user his current position. The learning experience is enhanced by presenting the game-like avatar model so that the user can be entertained while learning.
Figure 6.6: The work envelopes: trackball vs hand gesture (adapted from [100]).
(c) Turning right by “little finger”. (d) Turning left by “two fingers”.
the learning object. We map these manipulation commands to our hand gestures as Table 6.2 shows. To select a traffic sign, the user first drives the avatar car close to the target. As both sides of the virtual highway are aligned with traffic signs, the user needs to specify on which side the targeted traffic sign is located. For example, as Figure 6.8 shows, if the user is interested in the traffic sign with a PDF mark on the right side, he just needs to perform the "J" gesture, and this traffic sign will be highlighted with an attribute window showing its attributes. Figure 6.9 shows the example of selecting the left traffic sign with the "Quote" gesture. If the user finds that the left traffic sign is what he is looking for according to its attributes, he can access this learning object using the "Grasp" gesture, and an explorer window will be displayed with the detailed information of this learning object (see Figure 6.10).
Table 6.2: The mapping of manipulation commands to hand gestures (select right traffic sign: "J"; select left traffic sign: "Quote"; open learning object: "Grasp").
The system has been demonstrated to and used by different people. Most users find the gesture-based interface intuitive to use and can control the application after a short instruction. Feedback from users is positive, and many users are interested and excited about being able to interact with the 3D gaming VE by simply moving their hand into the work envelope, at which point navigation and manipulation commence. Figure 6.11 shows one user navigating the 3D gaming VE and manipulating virtual objects with his hand.
Figure 6.8: Selecting the right traffic sign with the “J” gesture.
Figure 6.9: Selecting the left traffic sign with the “Quote” gesture.
6.5 Summary
This chapter presents an application of a gesture-based interface for interaction with a 3D gaming VE. With this system, the user can navigate the 3D gaming world by driving the avatar car with a set of hand postures. When the user wants to manipulate the virtual objects, he can use a set of hand gestures to select the target traffic sign and open a window to check the information of the correspondent learning object. This application demonstrates that the gesture-based interface achieves an improved interaction that is more intuitive and flexible for the user.
Conclusions
This thesis presents a two-level architecture to recognize hand gestures in real-time with
one camera as the input device. This architecture considers the hierarchical composite
properties of hand gestures and combines the advantages of statistical and syntactic
approaches to achieve real-time vision-based hand tracking and gesture recognition.
The low-level of the architecture focuses on hand posture detection and tracking with
Haar-like features and the AdaBoost learning algorithm. The Haar-like features can effectively capture the appearance properties of the selected hand postures. Meanwhile, the
Haar-like features also display very good robustness against different lighting conditions
and a certain degree of robustness against image rotations. The AdaBoost learning
algorithm can significantly speed up the performance and construct strong cascades of
classifiers by combining a sequence of weak classifiers. A parallel cascades structure
is implemented to classify different hand postures. The experiment results show this
structure can achieve satisfactory performance for hand posture detection and tracking.
The 3D position of the hand is recovered according to the camera’s perspective projection.
To achieve the robustness against cluttered backgrounds, background subtraction and
noise removal are applied.
For the high-level hand gesture recognition and motion analysis, we use an SCFG to
analyze the syntactic structure of the hand gestures based on the input strings converted
from the postures detected by the low-level of the architecture. Based on the similarity
measurement and the probability associated with each production rule, given an input
string, the corresponding hand gesture can be identified by looking for the production
rule that has the greatest probability of generating it. For the hand motion analysis, we use
SCFGs to analyze two structured hand gestures with different trajectory patterns: the
rectangle gesture and the diamond gesture. Based on the probabilities associated with
the production rules, the SCFGs can effectively disambiguate the distorted trajectories
and identify them correctly.
The major contributions of this thesis are:
3. The hand gestures are analyzed based on an SCFG, which defines their composite properties in terms of the constituent hand postures. The assignment of a probability to each production rule of the SCFG can be used to control the "wanted" and "unwanted" gestures: smaller probabilities can be assigned to the "unwanted" gestures and greater values to the "wanted" gestures, so that the resulting SCFG generates the "wanted" gestures with higher probabilities.
4. For hand motion analysis, given the uncertainty of hand trajectories, ambiguous versions can be identified by looking for the SCFG that has the higher probability of generating the input string. The motion patterns can be controlled by adjusting the probabilities associated with the production rules so that the resulting SCFG generates the standard motion patterns with higher probabilities.
A gesture performed within different contexts and environments can have different semantic meanings. For example, with the background extracted from the video, if a computer is detected, we can say that a pointing gesture means turning on the computer in an office. However, if a stove is detected in the background, we can be reasonably sure that the user is in a kitchen and the pointing gesture probably means turning on the stove. To analyze the context information, the syntactic approach can again be a good candidate: all of the relationships between the gesture and the context can be described by defining a grammar over the primitives (e.g., the pointing gesture, the computer, the stove, etc.).
Another possible improvement is to track and recognize multiple objects such as human faces, eye gaze and hand gestures at the same time. With this multimodal tracking and recognition strategy, the relationships and interactions among the tracked objects can be defined and assigned different semantic meanings so that a richer command set can be covered. By integrating this richer command set with other communication modalities such as speech recognition and haptic feedback, the human-computer interaction experience can be greatly enriched and made much more interesting.
The system developed in this work can be extended to many other research topics in
the field of computer vision. We hope this research could trigger more investigations to
make computers see and think better.
Bibliography
[6] Y. Wu and T. S. Huang, “Hand modeling analysis and recgonition for vision-based
human computer interaction,” IEEE Signal Processing Magazine, Special Issue on
Immersive Interactive Technology, vol. 18, no. 3, pp. 51–60, 2001.
[10] J. Yang, Y. Xu, and C. S. Chen, “Gesture interface: Modeling and learning,” in
Proc. IEEE International Conference on Robotics and Automation, vol. 2, 1994,
pp. 1747–1752.
[11] D. Geer, “Will gesture-recognition technology point the way,” IEEE Computer,
pp. 20–23, 2004.
[16] C. Keskin, A. Erkan, and L. Akarun, “Real time hand tracking and gesture recog-
nition for interactive interfaces using HMM,” in Proc. ICANN/ICONIP, 2003.
[17] H. Zhou and T. S. Huang, “Tracking articulated hand motion with Eigen dynamics
analysis,” in Proc. of International Conference on Computer Vision, vol. 2, 2003,
pp. 1102–1109.
[21] J. Lin, Y. Wu, and T. S. Huang, “Modeling the constraints of human hand motion,”
in Proc. IEEE Workshop on Human Motion, 2000, pp. 121–126.
[25] H. Zhou and T. Huang, “Okapi-Chamfer matching for articulate object recogni-
tion,” in Proc. International Conference on Computer Vision, 2005, pp. 1026–1033.
[27] J. Y. Lin, Y. Wu, and T. S. Huang, “3D model-based hand tracking using stochastic
direct search method,” in Proc. IEEE International Conference on Automatic Face
and Gesture Recognition, 2004, pp. 693–698.
[29] M. Kölsch and M. Turk, “Fast 2D hand tracking with flocks of features and multi-
cue integration,” in Proc. Conference on Computer Vision and Pattern Recognition
Workshop, vol. 10, 2004, pp. 158–165.
[33] C. Manresa, J. Varona, R. Mas, and F. J. Perales, “Real time hand tracking and
gesture recognition for human-computer interaction,” Electronic Letters on Com-
puter Vision and Image Analysis, pp. 1–7, 2000.
[36] L. Bretzner, I. Laptev, and T. Lindeberg, “Hand gesture recognition using multi-
scale colour features, hierarchical models and particle filtering,” in Proc. 5th IEEE
International Conference on Automatic Face and Gesture Recognition, 2002, pp.
405–410.
[37] S. Jung, Y. Ho-Sub, W. Min, and B. W. Min, “Locating hands in complex images
using color analysis,” in Proc. IEEE International Conference on Systems, Man,
and Cybernetics, vol. 3, 1997, pp. 2142–2146.
[39] Y. Cui and J. Weng, “Appearance-based hand sign recognition from intensity image
sequences,” Computer Vision Image Understanding, vol. 78, no. 2, pp. 157–176,
2000.
[41] E. Ong and R. Bowden, “Detection and segmentation of hand shapes using boosted
classifiers,” in Proc. IEEE 6th International Conference on Automatic Face and
Gesture Recognition, 2004, pp. 889–894.
[42] F. Chen, C. Fu, and C. Huang, “Hand gesture recognition using a real-time tracking
method and Hidden Markov Models,” Image and Vision Computing, vol. 21, no. 8,
pp. 745–758, 2003.
[44] K. Oka, Y. Sato, and H. Koike, “Real-time fingertip tracking and gesture recog-
nition,” in Proc. IEEE Computer Graphics and Applications, vol. 22, no. 6, 2002,
pp. 64–71.
[45] Z. Zhang, Y. Wu, Y. Shan, and S. Shafer, “Visual panel: Virtual mouse keyboard
and 3D controller with an ordinary piece of paper,” in Proc. Workshop on Percep-
tive User Interfaces, 2001.
[46] S. Malik, C. McDonald, and G. Roth, “Hand tracking for interactive pattern-based
augmented reality,” in Proc. International Symposium on Mixed and Augmented
Reality, 2002.
[47] C. Huang and S. Jeng, “A model-based hand gesture recognition system,” Machine
Vision and Application, vol. 12, no. 5, pp. 243–258, 2001.
[48] R. Cutler and M. Turk, “View-based interpretation of real-time optical flow for
gesture recognition,” in Proc. Third IEEE Conference on Face and Gesture Recog-
nition, 1998.
[49] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis, “Using multiple cues for hand
tracking and model refinement,” in Proc. IEEE Conference on Computer Vision
and Pattern Recognition, 2003, pp. 443–450.
[51] B. Horn and B. Schunk, “Determining optical flow,” Artifical Intelligence, vol. 17,
pp. 185–204, 1981.
[54] G. Bradski, “Real time face and object tracking as a component of a perceptual user
interface,” in Proc. IEEE Workshop on Applications of Computer Vision, 1998, pp.
214–219.
[56] H. Zhou, D. J. Lin, and T. S. Huang, “Static hand gesture recognition based on
local orientation histogram feature distribution model,” in Proc. Conference on
Computer Vision and Pattern Recognition Workshop, vol. 10, 2004, pp. 161–169.
[57] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition,
2001, pp. 511–518.
[58] ——, “Robust real-time object detection,” Cambridge Research Laboratory Tech-
nical Report Series CRL2001/01, pp. 1–24, 2001.
[59] M. Kölsch and M. Turk, “Robust hand detection,” in Proc. 6th IEEE Conference
on Automatic Face and Gesture Recognition, 2004.
[60] ——, “Analysis of rotational robustness of hand detection with a Viola-Jones de-
tector,” in Proc. International Conference on Pattern Recognition, vol. 3, 2004, pp.
107–110.
[63] Y. Wu, J. Lin, and T. S. Huang, “Analyzing and capturing articulated hand mo-
tion in image sequences,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, pp. 1910–1922, 2005.
[64] V. Athitsos and S. Sclaroff, “Estimating 3D hand pose from a cluttered image,”
in Proc. IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2003.
[65] A. Imai, N. Shimada, and Y. Shirai, “3-D hand posture recognition by training
contour variation,” in Proc. 6th IEEE International Conference on Automatic Face
and Gesture Recognition, 2004, pp. 895–900.
[70] K. S. Fu, Syntactic Pattern Recognition and Applications. New Jersey: Prentice-
Hall, 1982.
[71] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision. PWS Publishing, 1999.
[74] A. C. Shaw, “A formal picture description scheme as a basis for picture processing
systems,” Information and Control, vol. 14, no. 1, pp. 9–52, 1969.
[76] M. Jones, R. Doyle, and P. O'Neill, "The gesture interface module," Technical Report, GLAD-IN-ART Deliverable 3.3.2, Trinity College, Dublin, Ireland, 1993.
[79] D. Moore and I. Essa, “Recognizing multitasked activities using stochastic context-
free grammar,” in Proc. IEEE CVPR Workshop on Models vs Exemplars, 2001.
[82] D. D. Morris and J. M. Rehg, “Singularity analysis for articulated object tracking,”
in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998, pp.
289–296.
[83] K. S. Fu, “A step towards unification of syntactic and statistical pattern recogni-
tion,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 8, no. 3,
pp. 398–404, 1986.
[85] F. Quek, “Toward a vision-based hand gesture interface,” in Proc. 7th Virtual
Reality Software and Technology Conference, 1994, pp. 17–31.
[87] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object
detection,” in Proc. IEEE International Conference on Image Processing, vol. 1,
2002, pp. 900–903.
[97] J. Klerkx, E. Duval, and M. Meire, “Visualization for accessing learning object
repositories,” in Proc. 8th International Conference on Information Visualization,
vol. 14, no. 5, 2004, pp. 465–470.
[98] M. Bauer and P. Johnson-Laird, “How diagrams can improve reasoning,” Psycho-
logical Science, vol. 4, no. 6, pp. 372–378, 1993.
[99] J. Larkin and H. Simon, “Why a diagram is (sometimes) worth ten thousand
words,” Cognitive Science, vol. 11, no. 1, pp. 65–99, 1987.
[100] G. C. Burdea and P. Coiffet, Virtual Reality Technology. New Jersey: John Wiley
& Sons, 2003.