Human Body Posture Recognition
Student Name:
Title: Human Body Posture Recognition Using the OpenPose Deep Learning Algorithm
1. Legislation Basis:
(Including research purpose, significance, research status at home and abroad; to discuss
the scientific significance of the proposal topic based on its development trend, or to
discuss the application prospects of the topic by considering urgent key technology
problem in the economic and social development.) (No less than 480 words.)
Research Purpose/Significance
Image processing basically involves three steps: acquiring the input image, analyzing and
manipulating the image, and producing the result. Image processing systems are becoming
popular owing to the availability of powerful personal computers, large memory, graphics
software and so on. With the advances in image processing techniques over recent decades, it
has become possible to analyze human behaviour by recognizing the posture of the human body,
and this has become one of the most researched topics in the field of pattern recognition.
Pose recognition is a field that is vast in both depth and breadth. Basic human postures
include standing, sleeping, sitting, kneeling and so on. Recently, many studies have devoted
considerable attention to body postures because they also express emotions to a greater or
lesser degree.
Human body pose recognition is an evolving and challenging problem in the field of
artificial intelligence. It deals with the localization of human joints in a skeletal
representation. It is normally difficult to determine a person's pose in an image because it
depends on various factors such as image resolution, background clutter, illumination
variations, the surroundings and so on. A human posture also conveys a certain amount of
information through non-verbal communication, and many studies have related body postures to
human emotions. Physical posture can be studied using static and dynamic methods. In dynamic
methods, the posture is identified while the person performs certain actions. In static
methods, the person simply stays in one position while the pose is identified.
Dissertation Proposal of Master's Degree
This project lays the foundation for understanding how the OpenPose deep learning approach
estimates the posture of one or more persons, either in real time or in pre-stored images and
pre-recorded videos. The proposed work uses a convolutional neural network to generate the
confidence maps and affinity fields that play a key part in estimating the pose of the
skeletal structure in the given image. Human body posture recognition has made huge advances
in the past years and has evolved from 2D to 3D estimation and from single-person to
multi-person estimation. Formally, body posture estimation predicts the positions of the
different parts, that is, the joints, of a human body. Since it is one of the most rapidly
evolving research areas, it can be used in predicting human emotions, pattern recognition,
medical image segmentation, human tracking and so on. Among the different algorithms for pose
estimation, the OpenPose algorithm is used with TensorFlow in this approach. This algorithm
is a real-time system that jointly detects human body parts in a single image. The system
uses an RGB image to generate the whole-body keypoints for every person detected. It is a
real-time multi-person system that jointly detects human body, foot, hand and facial
keypoints, 135 keypoints in total, in a single image. Some available 2D pose estimation
libraries such as AlphaPose require the user to implement most of the pipeline, including the
display used to visualize the results; moreover, the body and face keypoint detectors are not
combined, so a separate library is required for each purpose. OpenPose addresses all of these
problems. Another advantage of using OpenPose in this research is that it runs on various
platforms such as Ubuntu, Windows and macOS as well as on embedded systems, and it supports
different hardware. Human posture detection using OpenCV is also cheaper than many other
methods, and the system requires minimal equipment compared with others.
Today's real-world applications of human posture recognition require both a high degree of
accuracy and real-time inference. OpenPose, which was developed by researchers at Carnegie
Mellon University, can be regarded as the state of the art for real-time human body posture
recognition. The original OpenPose paper was submitted in 2016 and the most recent version in
2018; the latter made minor changes to the neural network architecture and some other
aspects, resulting in improved speed and accuracy. The basic pipeline of OpenPose is as
follows: first the image is taken as input, then the part confidence maps and the part
affinity fields are predicted. After that, bipartite matching is performed, and lastly the
results are combined and parsed. This helps to identify the human body joints using an RGB
camera. OpenPose keypoints include the ears, eyes, neck, nose, elbows, shoulders, knees,
wrists, ankles and hips. It outputs the results obtained by processing input from a camera in
real time, from pre-recorded videos or from static images. Hence, it finds use in various
applications such as surveillance and activity recognition. The proposed work uses OpenPose
for keypoint identification, followed by a convolutional neural network to recognize the
different poses.
The first step is detecting the keypoints of every person in the image, which is followed by
assigning the parts to each distinct individual. The OpenPose network starts by extracting
features from the image using its initial layers. These features are then passed to two
branches of convolutional layers that run in parallel. The first branch predicts a set of
confidence maps, each of which represents a specific part of the human body. The second
branch predicts the part affinity fields, which denote the degree of association between
parts and encode the data in the form of pairwise connections between body parts. Further
stages are then used to refine the predictions that resulted from the previous stage. Next,
bipartite graphs are formed between the different parts using the part confidence maps. With
these steps, the human skeleton is estimated for each person in a single frame.
A confidence map is a 2D representation of the belief that a particular body part is located
at any given pixel. Each map is a Gaussian curve with gradual changes, where sigma controls
the spread of the peak. The map predicted by the network is an aggregation of the individual
confidence maps by a maximum operator.
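The equations that originally accompanied this description did not survive in the text. A
hedged reconstruction, assuming the standard Gaussian formulation used in the part affinity
field literature, is given below. The symbols $p$ (a pixel location) and $x_{j,k}$ (the
position of body part $j$ of person $k$) are notation introduced here for illustration.

```latex
S^{*}_{j,k}(p) = \exp\!\left( -\frac{\lVert p - x_{j,k} \rVert_2^{2}}{\sigma^{2}} \right),
\qquad
S^{*}_{j}(p) = \max_{k} \, S^{*}_{j,k}(p)
```

The first expression is the Gaussian peak for a single person, with $\sigma$ controlling its
spread; the second aggregates the per-person maps with the maximum operator, so that peaks of
people standing close together remain distinct instead of being averaged into one.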
The convolutional neural network is divided into three basic parts:
1. The first set of stages predicts and refines the part affinity fields $L^t$ from the
feature maps $F$ of the base network.
2. The second set of stages takes the part affinity fields output by the previous layers and
uses them to refine the prediction of the confidence maps.
3. The third stage parses the body parts with the help of a matching algorithm.
A CNN is used because it is a highly promising choice: it can be trained on the keypoints
that give the joint locations of the human skeleton.
For keypoints, the CNN extracts features from the 2D coordinates of the OpenPose keypoints
using convolutional filters. Based on the filter size, the convolutional filter slides to the
next set of inputs. After the convolution, an activation function is generally applied to add
non-linearity to the CNN, because real-world data is mostly non-linear.
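The sliding filter and the activation step can be illustrated with a minimal sketch in plain
Python. It is an illustration under stated assumptions, not the network used in this project:
the function names `conv1d` and `relu`, the difference kernel and the keypoint coordinates
are all hypothetical.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation of a list with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Element-wise ReLU: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


# Hypothetical x-coordinates of five keypoints, normalised to [0, 1].
xs = [0.50, 0.48, 0.40, 0.62, 0.55]

# Hypothetical difference filter: responds when the next keypoint lies to the right.
diff_kernel = [-1.0, 1.0]

# The filter slides over the coordinates, then the non-linearity is applied.
features = relu(conv1d(xs, diff_kernel))
```

With a filter of size 2 and a stride of 1, the five inputs give four responses; only the
position where the coordinate increases survives the ReLU.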
The OpenPose algorithm is divided into three parts: body detection, face detection and hand
detection. The core block is the combined body keypoint detector; alternatively, the original
body-only detector trained on the available datasets can be used. Based on the output of the
body detector, facial bounding box proposals can be roughly estimated from the locations of
certain body parts, in particular the eyes, ears, nose and neck.
In previous work, both the part affinity field branch and the confidence map branch were
refined at each stage. In the improved approach used here, only the part affinity field
branch is refined, and the confidence maps are predicted in a single stage. Hence the amount
of computation per stage is reduced by half. As a result, the refined affinity field
predictions improve the confidence map results. By looking at the outputs of the part
affinity field channels, the body part locations can be guessed. However, if a set of body
parts is seen with no extra information, it cannot be parsed into different people. In
addition, the network depth is increased. The proposed project needs to be both accurate and
fast. Training an individual part affinity field based network to predict each individual set
of joints would achieve the accuracy goal. However, there is a risk of inefficiency, so
extending the body part affinity field framework to pose estimation requires several
modifications to the network architecture and its training.
MobileNet_V2 Model
MobileNet_V2 is one of the MobileNet models, and it consists of two types of blocks that will
be used in this project. The first is a residual block with a stride of 1, and the other is a
block with a stride of 2 for downsizing. Both blocks consist of three layers. The first layer
is a 1 × 1 convolution with ReLU6, whose purpose is to expand the number of channels in the
data before it enters the depth-wise convolution; this expansion layer therefore always has
more output channels than input channels. The second layer is the depth-wise convolution,
which is similar to the one in the MobileNet_V1 architecture. The third layer is another
1 × 1 convolution, but without any non-linearity; the claimed reason is that if ReLU were
used again, the deep network would only have the power of a linear classifier.
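The channel arithmetic of such a block can be checked with a short sketch. The function name
`inverted_residual_summary`, the expansion factor of 6 and the 3 × 3 depth-wise kernel are
assumptions made for illustration; bias and batch-normalisation parameters are ignored.

```python
def inverted_residual_summary(c_in, c_out, stride, expansion=6, kernel=3):
    """Summarise the three layers of one MobileNet_V2-style block.

    Each entry is (name, input channels, output channels, weight count).
    expansion=6 and kernel=3 are assumed defaults, not values from this proposal.
    """
    c_mid = c_in * expansion  # the expansion layer widens the channel dimension
    layers = [
        ("1x1 expand + ReLU6", c_in, c_mid, c_in * c_mid),
        ("depth-wise conv", c_mid, c_mid, kernel * kernel * c_mid),
        ("1x1 project (linear)", c_mid, c_out, c_mid * c_out),
    ]
    # A skip connection is only possible when the tensor shape is preserved.
    has_residual = stride == 1 and c_in == c_out
    total_params = sum(params for _, _, _, params in layers)
    return layers, has_residual, total_params


# Stride-1 block that keeps 24 channels, and a stride-2 block that downsizes.
same_layers, same_residual, same_params = inverted_residual_summary(24, 24, stride=1)
down_layers, down_residual, down_params = inverted_residual_summary(24, 32, stride=2)
```

In this sketch only the stride-1 block with matching channel counts keeps the residual
connection, which mirrors the two block types described above.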
Research Goals
As discussed earlier, information technology has grown and advanced in various fields,
including machine learning, computer vision and image processing. Human body posture
recognition is the task of inferring the pose of a person in an image or video. Pose
estimation can also be thought of as the problem of determining the position of the camera
relative to the subject of an image. The basic goal of the proposed system is to track the
human body keypoints in an image or video. A further purpose is to advance the area of
posture estimation and to find more approaches and techniques that could be used to improve
performance in this field. There is a key distinction to be made between 2D and 3D
estimation. 2D pose estimation simply estimates the locations of keypoints in 2D space
relative to an image or video frame. 3D pose estimation transforms an object in a 2D image
into a 3D object by adding a z-dimension to the predictions. Pose recognition matters a great
deal today because it makes it possible to track an object or a person, or even multiple
people, in real-world space at an incredibly granular level. Pose estimation differs from
other computer vision tasks such as object detection. Object detection also locates an object
within an image, but this localization is typically coarse-grained, consisting of a bounding
box that encompasses the object. Pose recognition goes further, predicting the precise
locations of the keypoints associated with the object. As for the other objective, the power
of pose recognition can be appreciated by considering its application in automatically
tracking human movement. In addition to tracking human movement, pose estimation opens up a
wide range of applications such as animation, augmented reality, gaming and robotics.
Limitations/Problems to be Solved
1. Pose estimation is classified into single-person and multi-person pose estimation
depending on the number of people in an image. Some previous works focused more on
single-person pose estimation; however, with the availability of huge multi-person datasets,
multi-person pose estimation is receiving increased attention. Multi-task learning is
proposed for training in this work; it modifies the definition of the confidence map to be
the concatenation of the body, face and hand confidence maps. An interconnection between the
different annotation tasks must be created that allows the different sets of keypoints of the
same person to be assembled together. To solve this problem, multi-task learning is proposed.
Multi-Task Learning
Multi-task learning is an approach that improves generalization by using the domain
information contained in the training signals of related tasks. It is a subfield of machine
learning in which multiple learning tasks are solved at the same time. It has been one of the
successful approaches across machine learning applications, and it has been successful in
body keypoint detection. Different multi-task learning models have been introduced to improve
and speed up the learning process. In various fields such as computer vision, bioinformatics,
speech and natural language processing, multi-task learning is used to improve applications.
Multi-task learning can also be viewed as one way for machines to mimic human learning, since
people often transfer knowledge from one task to another. In this work, a multi-task learning
method is integrated with an updated CNN architecture design that is able to train a unified
model from various keypoint detection tasks with features at different scales, and this
results in the first single-network method for human body posture estimation. Moreover, it is
trained in a single stage rather than requiring independent network training for each
individual task, which reduces the total training time by almost half. The proposed approach
yields higher accuracy than the previous OpenPose work, especially for face and hand keypoint
detection. The multi-network approach, by contrast, uses the existing body, face and hand
keypoint detection algorithms and therefore suffers from early commitment: if the body
detector fails there is no way to recover, and it is prone to fail when only a hand or face
is visible in the image. In addition, its overall runtime is proportional to the number of
people in the image, which makes human body posture recognition costly. Body detection
requires a large receptive field to learn the complex interactions among people, while face
and hand detection requires high resolution in the hand and face regions.
2. Everyday RGB images are the most common type of input for pose recognition, and systems
that work on RGB inputs have a huge advantage over others in terms of the mobility of input
sources. However, colour information is not often exploited, and this needs to be improved
because identifying objects or anatomical structures is important in image processing.
Therefore, preprocessing techniques will be discussed that enhance the image information by
suppressing unwanted distortion and strengthening some of the image features. The
complete-linkage method, a classic clustering method, will be used: each point corresponds to
one or more pixels in the given image, and the method merges those points into clusters.
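A minimal sketch of complete-linkage clustering on pixel colours follows. It is an
illustration only: the function name `complete_linkage`, the RGB values and the choice of
two clusters are hypothetical, and the exhaustive pair search is only suitable for a small
number of points.

```python
import math


def complete_linkage(points, num_clusters):
    """Agglomerative clustering with complete (maximum-distance) linkage."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical RGB values of five pixels: three dark, two bright.
pixels = [(10, 10, 10), (12, 11, 9), (200, 190, 195), (205, 198, 190), (11, 9, 12)]
groups = complete_linkage(pixels, num_clusters=2)
```

On this toy input the three dark pixels end up in one cluster and the two bright pixels in
the other.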
This figure shows the backbone of the network architecture, which takes a preprocessed RGB
image of size $w \times h$ and generates the human body keypoints for every person detected
in the image. The convolutional neural network predicts a set of two-dimensional confidence
maps $S$ of body part locations and a set of two-dimensional vector fields $L$ of part
affinities. The set $S = (S_1, S_2, \ldots, S_J)$ has $J$ confidence maps, one per part, where
$S_j \in \mathbb{R}^{w \times h}$, $j \in \{1, \ldots, J\}$. The set
$L = (L_1, L_2, \ldots, L_C)$ has $C$ vector fields, one per limb, where
$L_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1, \ldots, C\}$. The steps are:
2. After the feature map is extracted, it is processed by a multi-stage convolutional neural
network that integrates multi-task learning to generate the confidence maps and part affinity
fields.
3. After these maps and fields are generated, they are passed to a matching algorithm for
parsing in order to estimate the postures in the given image.
Confidence Maps:
A confidence map is a 2D representation of the belief that a particular body part is located
at any given pixel. Each confidence map is described by a Gaussian function whose spread is
controlled by sigma.
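As a hedged illustration, the sketch below renders such a map in plain Python, assuming a
Gaussian peak per person that is aggregated with a maximum operator. The function name
`confidence_map`, the value of `sigma` and the part positions are hypothetical.

```python
import math


def confidence_map(width, height, part_positions, sigma=1.5):
    """Confidence map for one body part type.

    part_positions holds one (x, y) pixel location per visible person.
    Returns a height x width grid of values in [0, 1].
    """
    grid = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # One Gaussian peak per person, aggregated with max rather than sum
            # so that nearby peaks stay distinct.
            grid[y][x] = max(
                (
                    math.exp(-((x - px) ** 2 + (y - py) ** 2) / sigma ** 2)
                    for px, py in part_positions
                ),
                default=0.0,
            )
    return grid


# Two hypothetical people whose noses are at pixels (2, 2) and (7, 4).
nose_map = confidence_map(10, 6, [(2, 2), (7, 4)])
```

The map reaches its maximum of 1 exactly at each part location and decays smoothly around
it; a larger `sigma` gives a wider peak.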
Matching Algorithm
Once the candidates for each body part have been detected, the next step is to connect them
into pairs, for which a bipartite matching algorithm is used. A weighted bipartite graph
represents all possible connections between the candidates of two body parts and holds a
score for every connection. The matching is then treated as an assignment problem: finding
the set of connections that maximizes the total score. The final step is to transform the
detected connections and joints into the final skeleton structure that estimates the posture
in the given image.
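The assignment step can be sketched for a very small graph as follows. This is a minimal
illustration, not the matching routine of any particular library: the function name
`best_assignment` and the score values are hypothetical, and the exhaustive search over
permutations is only practical for a handful of candidates.

```python
from itertools import permutations


def best_assignment(scores):
    """Exhaustively solve a small assignment problem.

    scores[i][j] is the connection score between candidate i of the first body
    part and candidate j of the second; requires len(scores) <= len(scores[0]).
    Returns (total_score, pairs) for the one-to-one matching with maximum total.
    """
    rows, cols = len(scores), len(scores[0])
    best_total, best_pairs = float("-inf"), []
    for chosen in permutations(range(cols), rows):
        total = sum(scores[i][j] for i, j in enumerate(chosen))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(chosen))
    return best_total, best_pairs


# Hypothetical scores between two neck candidates (rows) and two
# right-shoulder candidates (columns).
scores = [
    [0.9, 0.2],
    [0.3, 0.8],
]
total, pairs = best_assignment(scores)
```

Each candidate is used at most once, so two people cannot share the same shoulder. For
larger graphs a dedicated solver such as the Hungarian algorithm would be the usual choice.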
Feasibility Analysis
The recent growth of Internet technology and the World Wide Web makes it appear that the
world is witnessing the arrival of a completely new technology.
Previous work on this task has had problems with both accuracy and speed. The multi-task
learning proposed here is therefore expected to improve and speed up the learning process and
to help the network learn the tasks accurately, yielding high accuracy.
In contrast to previous work on posture estimation, at test time this network approach
provides constant real-time inference regardless of the number of people detected in the
frame. Training time has also always been an issue in previous work, but in this project the
network is trained in a single stage rather than requiring independent network training for
the individual tasks, which is expected to reduce the total training time. This approach also
yields higher accuracy than the previous OpenPose, especially for face and hand joint
detection. Analogous to Fast R-CNN, it brings together multiple, currently independent
keypoint detection tasks into a unified framework.
Going through the different papers, I found that 2D posture estimation libraries such as Mask
R-CNN or AlphaPose require the user to implement most of the pipeline, including the display
used to visualize the results and the generation of output files with the results. In
addition, existing facial and body keypoint detectors are not combined, so a different
library is required for each purpose. OpenPose overcomes these limitations.
I am carrying out my research work under the supervision of my supervisor from my home
country, Nepal, as I am currently unable to be present at the university owing to the
worldwide spread of Covid. Frankly speaking, working on research from home is not very
effective; however, I have managed to set up a proper and favourable working environment with
internet access for my research work.
Technology improves day by day, so I have tried my best to include some new suggestions and
ideas in my research work that will make it better.
Comments of Supervisor:
Signature:
Date:
Signature:
Date:
Comments of School:
Signature:
Date: