
Communication Papers of the Federated Conference on Computer Science and Information Systems (Poznań, 2018), pp. 129–134
DOI: 10.15439/2018F48
ISSN 2300-5963, ACSIS, Vol. 17

Soccer Object Motion Recognition based on 3D Convolutional Neural Networks

Jiwon Lee, Do-Won Nam, and Wonyoung Yoo
SW·Content Research Laboratory, Electronics and Telecommunications Research Institute,
218 Gajeong-ro, Yuseong-gu, Daejeon, Republic of Korea
Email: {ez1005, dwnam, zero2}@etri.re.kr

Yoonhyung Kim, Minki Jeong, and Changick Kim
Electrical Engineering, Korea Advanced Institute of Science and Technology,
291 Daehak-ro, Yuseong-gu, Daejeon, Republic of Korea
Email: {yhkim, rhm033, changick}@kaist.ac.kr

Abstract—Due to the development of the video understanding and big data analysis research fields using deep learning techniques, intelligent machines have replaced tasks that people used to perform in areas such as traffic, surveillance, and security. In the sports field, especially in soccer, quantitative analysis of players and games through deep learning and big data analysis is also being attempted. However, because of the nature of soccer analysis, sophisticated automatic analysis is still difficult due to technical limitations. In this paper, we propose a deep learning based motion recognition technique that forms the basis of high-level automatic soccer analysis. For sophisticated motion recognition, we maximize recognition accuracy by processing the data sequentially in three steps: data acquisition, data augmentation, and 3D CNN based motion classifier learning. As can be seen from the experimental results, the proposed method guarantees real-time speed and satisfactory accuracy.

I. INTRODUCTION

In the past, the professional sports field was a human-oriented area. Player training was done through subjective guidance based on the know-how and experience of the manager and the coaching staff. Even game judgements were made through the intuition and observation of the referee, and the occasional misjudgement was accepted as part of sports. In addition, sports audiences could enjoy sports only through unilateral delivery of sports content. In recent years, however, many changes have been made in professional sports as a result of quantitative analysis through sports science and ICT technology. The manager and coaching staff can use data and video-based match analysis tools (e.g., the Dartfish video analysis tool [1]) to check objective player performance and condition in detail, and to adapt player training methods or tactics. Technology also assists referee judgements, such as high-speed camera readings (e.g., Hawk-Eye technology [2]), and produces engaging content using advanced visualization tools (e.g., freeD technology in the NFL [3]) to give a sense of immersion. These sports analytics technologies are being developed to reflect the needs of people in many directions; as a result, the sports analysis market size was $4.7 billion in 2017 [4].

This trend has also affected the professional soccer market. At the 2014 World Cup, the German national team took advantage of Match Insights, a technology from the big data analytics company SAP, to improve home team performance and to analyze the strengths and weaknesses of the opposing teams on the way to winning the tournament [5]. In addition to SAP, many international companies such as Chyronhego, OPTA, Deltatre, GPSports, and StatSport offer technologies and services that perform quantitative analyses of soccer matches and players.

In general, quantitative analysis of a soccer game consists of three steps: multi-object tracking, event analysis, and tactical analysis. Multi-object tracking can be performed automatically thanks to technological advances. However, in the cases of event analysis and tactical analysis, which require understanding of high-level semantics from a given match, data is still extracted through the manual work of expert groups, and only the big data extracted by hand is secondarily processed and visualized. There are many reasons why these steps are not automated, but one of the biggest is that a soccer event can be recognized only from the motion information of the players or referees. For example, it is necessary to be able to recognize a tackle motion of a player, a movement of the head referee's hand, and a flag motion of the assistant referee so that the tackle event, the foul event, and the offside event can be recognized. To solve this problem, this paper proposes a soccer object motion recognition technique.

This paper is composed as follows. In Sec. II, we describe the related research. In Sec. III, we propose a soccer object motion recognition pipeline based on 3D convolutional neural networks (CNNs). Sec. IV shows the experimental results of the proposed method. Finally, Sec. V gives the concluding remarks.

II. RELATED WORKS

Motion recognition is a computer vision field that recognizes human poses or actions. The general process of motion recognition is as follows: 1) extract the feature points necessary for motion recognition from a given input source; 2) analyze the pattern of the obtained feature points; 3) calculate the similarity to a predefined motion list; and 4) determine the final motion with the highest similarity for the given input source. It is a kind of image classification technology in that the purpose of video-based motion recognition is to determine the final motion based on similarity to a predefined motion list.
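The four generic steps above can be sketched in a few lines. This is an illustrative toy, not the paper's method: the feature vectors, the template values, and the choice of cosine similarity are all assumptions made for demonstration.

```python
import numpy as np

def recognize(features, motion_templates):
    """Steps 2-4 of the generic pipeline: compare an extracted feature
    vector against a predefined motion list and pick the best match."""
    names = list(motion_templates)
    mats = np.stack([motion_templates[n] for n in names])
    # cosine similarity between the input features and each template
    sims = mats @ features / (np.linalg.norm(mats, axis=1) * np.linalg.norm(features))
    return names[int(np.argmax(sims))], sims

templates = {  # hypothetical feature templates for a 3-motion list
    "walk": np.array([1.0, 0.1, 0.0]),
    "run":  np.array([0.2, 1.0, 0.1]),
    "kick": np.array([0.0, 0.2, 1.0]),
}
label, _ = recognize(np.array([0.1, 0.9, 0.2]), templates)
print(label)  # "run"
```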

© 2018, PTI



Fig. 1. System outline of the proposed method
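In outline, the system of Fig. 1 composes tracking, optional training-time augmentation, and per-object classification. A minimal sketch with stand-in functions follows; every name here is illustrative, none comes from the paper:

```python
def run_pipeline(video_frames, tracker, classifier, augment=None, train=False):
    """Outline of Fig. 1: track objects, optionally augment the cropped
    object regions for training, then classify each object's motion."""
    results = []
    for frame_bundle in video_frames:            # consecutive frames fed together
        boxes = tracker(frame_bundle)            # step 1: data acquisition
        for box in boxes:
            samples = augment(box) if (train and augment) else [box]  # step 2
            results.extend(classifier(s) for s in samples)            # step 3
    return results

# toy stand-ins, just to show the data flow
labels = run_pipeline(
    video_frames=[["f1", "f2"]],
    tracker=lambda bundle: ["crop-a", "crop-b"],
    classifier=lambda crop: "Walk",
)
print(labels)  # ['Walk', 'Walk']
```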

Conventional motion recognition technologies are divided into several subdivisions according to several criteria. As a first criterion, they can be classified into two-dimensional (2D) and three-dimensional (3D) motion recognition according to the dimension of the input source. 2D motion recognition operates on 2D video sources taken with general camera equipment [6], [7], [8]. 3D motion recognition operates on stereoscopic video sources taken with special equipment such as Microsoft's Kinect [9], [10]. As a second criterion, they can be classified into techniques that recognize a human action from the human pose and techniques that recognize a gesture of a specific part of the human body. Motion recognition based on human pose tries to recognize motions such as arm movements, arm extension, waist bending, and jumping from video sources of human action [6], [7], [8], [9], [10]. Motion recognition for a specific gesture recognizes a partial movement of a specific body part (hands, legs, etc.) [11], [12]. The third criterion is the feature extraction method, which is divided into hand-crafted feature extraction and data-driven feature extraction. A feature point is a clue used to distinguish different labels when performing motion classification, and the accuracy of motion classification depends on the quality of the feature points. In the hand-crafted approach, the user manually designs and extracts feature points according to a given classification purpose [6], [7], [8], [9], [10], [11], [12]. Hand-crafted feature extraction is advantageous when direct design by the user is easy and the patterns of the motions to be classified are monotonous, but its performance drops significantly for motions with complex patterns. More recently, data-driven feature extraction automatically learns the feature points necessary for classification from given information (video clips and labels) [13]. Although this approach requires a large amount of computation and huge input data for learning, it performs much better than hand-crafted feature extraction in terms of accuracy and execution speed.

According to the above classification criteria, we can specify the category of motion recognition technique needed to solve the problem defined in this paper. We use 2D video sources taken from camera equipment installed in the stadium. The target areas of the field players and referees are tracked, and the goal is to recognize the motion based on the tracking data. In addition, it is possible to construct large-scale learning data, which is suitable for data-driven feature learning and extraction. According to this analysis, the motion recognition technology proposed in this paper can be specified as 1) 2D video source based, 2) data-driven feature extraction, and 3) technology that recognizes human pose.

III. PROPOSED METHOD

In this section, the method of performing motion recognition on object regions tracked from a soccer game video is described in detail. Figure 1 depicts the system outline of the proposed method. For motion recognition specialized for soccer objects, we constructed the motion recognition system through three steps: data acquisition, data processing, and motion classifier learning [14]. A detailed description of each step is given in the corresponding subsection.

A. Data Acquisition

Data acquisition is performed first. To do this, we need to define motion classification criteria. We classify the motions of each soccer object and generate learning data based on the following principles:
• The objects are categorized into field player, head referee, and assistant referee.
• All the motions that each object can take on the field must be included in the motion list.
• For the same motion, data covering at least four body directions of the object is secured.

TABLE I
DEFINED MOTION LIST FOR EACH SOCCER OBJECT

Field player   Head referee       Assistant referee
Stand                             Sidle
Walk           Walk               Walk
Run            Run                Run
Kick           One arm pointing   Flag up
Tackle/Lie     Card               Flag chest
Throw in                          Flag side

Fig. 3. Designed data augmentation scheme
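The scheme of Fig. 3, as described in Sec. III-B, can be sketched in numpy. This is a minimal sketch under assumptions: nearest-neighbour resizing stands in for whatever interpolation the authors used (the paper does not specify), and the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def resize(img, size):
    # Nearest-neighbour resize, a simplified stand-in for real interpolation.
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def random_crop(img, size, rng):
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def augment(box_image, rng):
    """Scale augmentation per Sec. III-B: one tracked bounding-box image
    yields three training samples at three relative object sizes."""
    base = resize(box_image, 140)          # normalize to 140x140
    crop = random_crop(base, 112, rng)     # imitate jittering box position
    samples = [crop]                       # first way: keep as-is (scale 1.0)
    for scale in (1.25, 1.5):              # second and third ways: up-scale
        up = resize(crop, int(112 * scale))
        samples.append(random_crop(up, 112, rng))
    # final normalization to the classifier input size
    return [resize(s, 120) for s in samples]

out = augment(rng.random((200, 160)), rng)
print(len(out), out[0].shape)  # 3 (120, 120)
```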

Fig. 2. Location of cameras in the stadium for data acquisition


• Motions with a duration of less than one second are excluded.

The motion classification list based on the above principles is shown in Table I.

After that, we acquired match videos through four cameras installed in the soccer stadium as shown in Fig. 2. In the figure, the first and fourth cameras shoot the right half and the left half of the pitch respectively, while the second and third cameras shoot the whole pitch. The reason the video was taken from various angles is to increase the recognition rate of the flag motion of the assistant referee. Then, the data necessary for motion classifier learning is acquired based on the object positions extracted by the multi-object tracker [15].

B. Data Augmentation

After acquiring the initial motion learning data, we extend the scale of the learning data through data augmentation [16]. Motion data augmentation is closely related to the stability of the motion classifier. In general, the input to the motion classifier is a bounding box image containing a soccer object, which is an output of the multi-object tracker. Due to the nature of the tracking algorithm, the size and position of the bounding box fluctuate irregularly, and such trembling may degrade the motion recognition results. Therefore, a technique for effectively processing and extending the motion learning data is needed to make the motion classifier robust to changes in bounding box size and position. We design a data processing algorithm suitable for this problem and incorporate it into motion classifier learning.

The designed data processing scheme is shown in Fig. 3, and it operates as follows. First, a given image is normalized to 140 pixels in width and height. Then, a random crop of 112 pixels in the horizontal and vertical directions is taken. This step imitates the phenomenon that the position of the bounding box shakes irregularly in the tracker output. Next, image up-scaling is performed in three ways on the image obtained through the random cropping: the first is to use the image as-is, and the second and third up-scale the pre-processed image by factors of 1.25 and 1.5, respectively, followed by another random crop (again 112 pixels in the horizontal and vertical directions). As shown in Fig. 3, as a result of this process the relative sizes of the objects in the pre-processed image are divided into three, and three images are obtained from one original image. This process, called scale augmentation, imitates the irregular changes of bounding box size in the tracker output. By diversifying the relative size of the object to be recognized through scale augmentation, the classifier becomes robust to the perspective of the tracked object, and by learning from randomly cropped data, it becomes robust to the trembling of the tracking result. The learning data processed through this augmentation is finally used as input to the motion classifier after being normalized to 120×120 pixels.

C. Motion Classifier Learning

Finally, we learn a deep learning based motion classifier from the acquired and augmented learning data. A motion classifier learns according to a predetermined number of labels in the training stage and maps a given input image to one of the learned motion labels in the testing stage. Since adjacent frames carry motion information, we propose a 3D CNN based motion classifier, a deep learning architecture that can exploit the correlation between adjacent frames [17], [18].

Figure 4 shows a comparison of 2D convolution and 3D convolution. In the case of 2D convolution, a feature map

Fig. 4. Comparison of 2D convolution and 3D convolution

is extracted using only spatial information from a single image, whereas 3D convolution extracts a feature map using not only the spatial information but also the temporal information of a plurality of continuous images. Based on this property, the 3D CNN structure can learn spatiotemporal information, which contributes to the performance enhancement and stabilization of the motion classifier.

Fig. 5. Deep learning based motion classifier structure with 3D convolution (when N = 6)

Figure 5 shows the network structure of the proposed motion classifier designed with the 3D CNN structure. The inputs to the network are bundles of Fb consecutive frames (the frame bundle unit may vary depending on the application), and a hierarchical feature map is extracted over a total of five layers for a given input video. In each feature map extraction step, 3D convolution is applied; a kernel with a different depth (denoted by d in Fig. 5) is applied at each layer. The last three layers form a fully connected network and apply a softmax function to finally output the similarity for each of the N motions. Here, the softmax function is a generalization of the logistic function that normalizes a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range [0, 1] that sum to 1. The function is given by

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k},  for j = 1, ..., K.   (1)

IV. EXPERIMENTAL RESULTS

In order to verify the performance of the proposed motion classifier, we captured 4K videos at four different locations in the soccer stadium (see Fig. 2) during 9 K-League Classic matches from 2016 to 2017. The captured videos then went through object tracking, and the tracked data was divided into field player, head referee, and assistant referee to generate learning data. A total of 170,000 pieces of initial learning data were generated, and about 600,000 pieces of final learning data were constructed after the data augmentation process. Example screenshots of the test videos and of the generated learning data are shown in Fig. 6 and Fig. 7, respectively.

Fig. 6. A sample screenshot of four cameras in the test video clip: (a) first camera, (b) second camera, (c) third camera, (d) fourth camera

Fig. 7. An example of extracted data for each object: (a) field player, (b) head referee, (c) assistant referee

In the case of a soccer game, the 3D CNN based motion classifiers are designed to enable parallel processing using tensorflow [19], since many objects appear in one frame at the same time. In order to improve the accuracy of the motion classifier, we consider not only spatial clues but also temporal clues, and we provide different temporal clues according to the characteristics of the object to be recognized. In more detail, since the motion of the field player occurs over a shorter duration than the motion of the referees, the field player classifier groups 4 frames into one unit (Fb = 4), while for the referees, 6 frames are grouped into one unit (Fb = 6) to perform motion classification.

To evaluate the performance of the proposed motion classifier, we used an i7-6770 core processor, DDR3 64GB RAM, and three different GPUs. The performance of the motion classifier measured for each GPU is shown in Table II.

TABLE II
PERFORMANCE VARIATIONS OF MOTION CLASSIFIER FOR EACH GPU

GPU types             ops     fps
GeForce GTX 1070      800     32
GeForce GTX TITAN X   930     37.2
NVIDIA TITAN X        1200    48
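The softmax of Eq. (1) can be checked numerically. The max-subtraction below is a standard numerical-stability trick; it does not change the result because the factor e^{-max(z)} cancels in the numerator and denominator.

```python
import numpy as np

def softmax(z):
    """Eq. (1): normalize a K-dimensional real vector to values in
    [0, 1] that sum to 1."""
    e = np.exp(z - np.max(z))  # stable; the shift cancels out
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p.round(3), p.sum())  # probabilities in [0, 1] summing to 1
```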
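The fps column of Table II follows directly from ops and the 25 objects (22 field players plus 3 referees) that must be processed per frame:

```python
OBJECTS_PER_FRAME = 22 + 3  # field players + referees visible in one frame

def fps_from_ops(ops):
    # Every object in a frame must be classified before the frame is done,
    # so frames per second = objects per second / objects per frame.
    return ops / OBJECTS_PER_FRAME

measurements = {"GeForce GTX 1070": 800,
                "GeForce GTX TITAN X": 930,
                "NVIDIA TITAN X": 1200}
for gpu, ops in measurements.items():
    print(gpu, fps_from_ops(ops))  # 32.0, 37.2, 48.0, as in Table II
```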

TABLE V
CONFUSION MATRIX OF MOTION CLASSIFIER FOR ASSISTANT REFEREE

Out\GT       Sidle    Walk     Run      Flag up  Flag chest  Flag side
Sidle        12,610   131      126      491      962         1,279
Walk         116      12,564   107      374      723         1,369
Run          51       159      11,887   216      430         342
Flag up      0        0        4        9,033    869         621
Flag chest   2        0        3        1,153    16,030      442
Flag side    46       18       5        1,616    1,317       11,724
Accuracy     0.983    0.976    0.980    0.701    0.788       0.743
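The Accuracy row of each confusion matrix equals the diagonal entry divided by its column (ground-truth) total, which can be verified for Table V:

```python
import numpy as np

# Table V (assistant referee): rows = classifier output, cols = ground truth
conf = np.array([
    [12610,   131,   126,   491,   962,  1279],
    [  116, 12564,   107,   374,   723,  1369],
    [   51,   159, 11887,   216,   430,   342],
    [    0,     0,     4,  9033,   869,   621],
    [    2,     0,     3,  1153, 16030,   442],
    [   46,    18,     5,  1616,  1317, 11724],
])

# Per-class accuracy: correct predictions over all samples of that class.
accuracy = np.diag(conf) / conf.sum(axis=0)
print(accuracy.round(3))  # matches the Accuracy row of Table V
```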

Here, ops and fps refer to objects per second and frames per second, respectively. Since the numbers of field players and referees in one frame are 22 and 3, respectively, the fps is calculated by dividing the ops by 25, as shown in Table II. This means the proposed classifier has real-time motion recognition capability.

The confusion matrices finally obtained for each motion classifier are shown in Tables III, IV, and V, and output examples of the motion classifiers are depicted in Fig. 8.

Fig. 8. Output examples of each motion classifier: (a) field player, (b) head referee, (c) assistant referee

As can be seen from the experimental results, the average motion recognition accuracy for the field player, the head referee, and the assistant referee was 0.449, 0.851, and 0.872, respectively. The accuracy for the field player is relatively low compared to the referees. In the case of the referees, it is easy to distinguish the motions because the number of motions to be recognized is small and the motions themselves are stereotyped. However, in the case of the field player, the number of motions to be recognized is relatively large and the duration of each motion is also shorter than that of the referees (recognition uses only 66% of the frames compared to the referees). In addition, the field player's motions have a high degree of similarity to one another, which is considered to have affected the accuracy.

TABLE III
CONFUSION MATRIX OF MOTION CLASSIFIER FOR FIELD PLAYER

Out\GT       Stand   Walk    Run     Kick    Tackle/Lie  Throw in
Stand        1,779   0       0       1       0           0
Walk         58      325     1,158   261     6           0
Run          0       0       0       4       0           0
Kick         569     694     702     1,440   1,147       648
Tackle/Lie   70      1       0       207     768         357
Throw in     9       254     37      308     35          989
Accuracy     0.716   0.255   0       0.648   0.393       0.496

TABLE IV
CONFUSION MATRIX OF MOTION CLASSIFIER FOR HEAD REFEREE

Out\GT            Walk     Run      One arm pointing  Card
Walk              11,244   202      376               17
Run               1,210    10,579   446               132
One arm pointing  395      499      8,254             143
Card              388      295      402               546
Accuracy          0.850    0.914    0.871             0.652

V. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we introduced a data acquisition, data processing, and motion classifier learning method to recognize the motions of soccer objects from soccer video. In particular, to design motion classifiers with high accuracy, we used a 3D CNN, a structure that extracts spatio-temporal features well, and developed the motion classifier with real-time operation in mind by using parallel processing techniques. As can be seen from the experimental results, the proposed method satisfies the real-time speed requirement and achieves high motion recognition accuracy for the referees. However, the accuracy of field player motion recognition is rather low, and further research is needed.

In the future, we will design a sophisticated motion classifier with high accuracy even for objects with ambiguous motion classes such as the field player, and will try to apply the developed motion classifier to other sports such as basketball and figure skating.

ACKNOWLEDGMENT

This research is supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research & Development Program 2016 (R2016030044, Development of Context-Based Sport Video Analysis, Summarization, and Retrieval Technologies).

REFERENCES
[1] DartFish sports analysis tool. [Online]. Available: http://www.dartfish.com
[2] Hawk-Eye innovations. [Online]. Available: https://www.hawkeyeinnovations.com
[3] FreeD on NFL. [Online]. Available: https://newsroom.intel.com/news/intel-nfl-kickoff-freed-technology-11-stadiums-create-immersive-highlights-2017-season/
[4] "Sports analytics: market shares, strategies, and forecasts, worldwide, 2015 to 2021," Wintergreen Research, 472 pages, May 2015.
[5] A. Ghosh, "How 'Match Insight' is changing soccer," 6 Aug. 2014. [Online]. Available: https://blogs.sap.com/2014/08/06/how-software-is-making-football-even-more-beautiful/

[6] C. P. Huang, C. H. Hsieh, K. T. Lai, and W. Y. Huang, "Human action recognition using histogram of oriented gradient of motion history image," in International Conference on Instrumentation, Measurement, Computer, Communication and Control, pp. 353-356, Oct. 2011.
[7] L. Hu, W. Liu, B. Li, and W. Xing, "Robust motion detection using histogram of oriented gradients for illumination variations," in Proc. ICIMA 2010, pp. 443-447, May 2010.
[8] P. Banerjee and S. Sengupta, "Human motion detection and tracking for video surveillance," in National Conference for Communication, 2008.
[9] O. Patsadu, C. Nukoolkit, and B. Watanapa, "Human gesture recognition using Kinect camera," in Proc. JCSSE 2012, pp. 28-32, May 2012.
[10] E. E. Stone and M. Skubic, "Fall detection in homes of older adults using the Microsoft Kinect," IEEE Jour. Biomedical and Health Informatics, vol. 19, no. 1, pp. 290-301, Mar. 2014.
[11] N. C. Kiliboz and U. Gudukbay, "A hand gesture recognition for human computer interaction," Jour. Visual Communication and Image Representation, vol. 28, pp. 97-104, Apr. 2015.
[12] M. B. Brahem, B. J. Menelas, and M. D. Otis, "Use of 3DOF accelerometer for foot tracking and gesture recognition in mobile HCI," Procedia Computer Science, vol. 19, pp. 453-460, 2013.
[13] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[14] J. Lee, Y. Kim, M. Jeong, C. Kim, D. Nam, J. Lee, S. Moon, and W. Yoo, "3D convolutional neural networks for soccer object motion recognition," in Proc. ICACT 2018, pp. 354-358, Feb. 2018.
[15] W. Kim, S. Moon, J. Lee, D. Nam, and C. Jung, "Multiple player tracking in soccer videos: an adaptive multiscale sampling approach," Multimedia Systems, pp. 1-13, Feb. 2018.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS 2012, pp. 1-9, Dec. 2012.
[17] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, Mar. 2012.
[18] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. ICCV 2015, pp. 4489-4497, Dec. 2015.
[19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467v2, 2016.
