

A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Zhonghan Zhao∗ , Wenhao Chai∗ , Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao,
Mingli Song, Senior Member, IEEE, Jenq-Neng Hwang, Fellow, IEEE, Gaoang Wang† , Member, IEEE

arXiv:2307.03353v1 [cs.CV] 7 Jul 2023

Abstract—Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance, which includes perception, comprehension, and decision, while comparing their strengths and weaknesses. Secondly, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.

Index Terms—Sports Performance, Internet of Things, Computer Vision, Deep Learning, Survey

I. INTRODUCTION

Artificial Intelligence (AI) has found wide-ranging applications and holds a bright future in the world of sports. Its ever-growing involvement is set to revolutionize the industry in myriad ways, enabling new heights of efficiency and precision.

A prominent application of AI in sports is the use of deep learning techniques. Specifically, these advanced algorithms are utilized in areas like player performance analysis, injury prediction, and game strategy formulation [1]. Through capturing and processing large amounts of data, deep learning models can predict outcomes, uncover patterns, and formulate strategies that might not be evident to the human eye. This seamless integration of deep learning and the sports industry [2], [3] exemplifies how technology is enhancing our ability to optimize sporting performance and decision-making.

Fig. 1. The examples of the applications in sports performance in perception (re-id and tracking, camera calibration), comprehension (action recognition, action quality assessment), and decision (play forecasting, video synthesizing).
Although predicting and optimizing athletic performance has numerous advantages, it remains a complex problem. Traditionally, sports experts like coaches, managers, scouts, and sports health professionals have relied on conventional analytical methods to tackle these challenges. However, gathering statistical data and analyzing decisions manually is a demanding and time-consuming endeavor [4]. Consequently, an automated system powered by machine learning emerges as a promising solution that can revolutionize the sports industry by automating the processing of large-scale data.

In recent years, there has been a notable increase in comprehensive surveys exploring the applications of machine learning and deep learning in sports performance. These surveys cover a wide range of topics, including the recognition of sports-specific movements [5], mining sports data [6], and employing AI techniques in team sports [7]. While some surveys focus on specific sports like soccer [7] and badminton [8], others concentrate on particular tasks within computer vision, such as video action recognition [9], video action quality assessment [10], and ball tracking [11]. Furthermore, several studies explore the usage of wearable technology [12], [13]

∗ Equal contribution.
† Corresponding author: Gaoang Wang.
Zhonghan Zhao, Shengyu Hao, Wenhao Hu, Guanhong Wang, and Mingli Song are with the College of Computer Science and Technology, Zhejiang University. Shidong Cao is with the Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University. Gaoang Wang is with the Zhejiang University-University of Illinois Urbana-Champaign Institute and the College of Computer Science and Technology, Zhejiang University. Wenhao Chai and Jenq-Neng Hwang are with the University of Washington.

Benchmark (Sec. V & VI): A. real-world datasets; B. virtual environments. Perception (Sec. II): A. localization; B. tracking; C. re-id; D. segmentation; E. human pose estimation; F. camera calibration. Comprehension (Sec. III): A. action recognition; B. action quality assessment; C. summarization; D. captioning. Decision (Sec. IV): A. match evaluation; B. play forecasting; C. game simulator; D. synthesizing.
Fig. 2. Taxonomy. A hierarchical structure that contains three categories of tasks: Perception, Comprehension, and Decision, as well as Benchmark.

and motion capture systems [14] in sports, with a particular emphasis on the Internet of Things (IoT).

Previous studies [15], [16] have employed a hierarchical approach to analyze sports performance, starting from lower-level aspects and progressing to higher-level components, while also providing training recommendations. In order to comprehend the utilization of deep learning in sports, we have segmented it into three levels: Perception, Comprehension, and Decision. Additionally, we have categorized diverse datasets according to specific sports disciplines and outlined the primary challenges associated with deep learning methodologies and datasets. Furthermore, we have highlighted the future directions of deep learning in motion, based on the current work built upon foundational models.

The contributions of this comprehensive survey of deep learning in sports performance can be summarized in three key aspects.

A. Localization: identifying and determining the spatial location of players and balls. B. Tracking: following and identifying the location and motion of objects across consecutive frames. C. Re-id: matching and recognizing individuals across time and different views. D. Segmentation: assigning pixel-level labels to each region or object. E. Human Pose Estimation: predicting the body joint locations and their spatial relationships. F. Camera Calibration: estimating the intrinsic and extrinsic parameters of camera.
Fig. 3. Taxonomy and description of perception tasks.
• We propose a hierarchical structure that systematically divides deep learning tasks into three categories: Perception, Comprehension, and Decision, covering low-level to high-level tasks.
• We provide a summary of sports datasets and virtual environments. Meanwhile, this paper covers dozens of sports scenarios, processing both visual information and IoT sensor data.
• We summarize the current challenges and feasible future research directions for deep learning in various sports fields.

The paper is organized as follows: Sections II, III, and IV introduce different tasks with methods for perception, comprehension, and decision tasks in sports. Sections V and VI discuss the sports-related datasets and virtual environments. In Sections VII and VIII, we highlight the current challenges and future trends of deep learning in sports. Lastly, we conclude the paper in Section IX.

II. PERCEPTION

Perception involves the fundamental interpretation of acquired data. This section presents different deep-learning methodologies tailored to specific sports tasks at the perception level, as shown in Figure 3. The subsequent perception segment will encompass tasks such as player tracking, player pose estimation, player instance segmentation, ball localization, camera calibration, etc.

A. Player and Ball Localization

Player and ball localization aims at identifying and determining the spatial location of players and balls, which is an essential undertaking in sports video analysis. Precisely identifying these entities can provide valuable insights into team performance, enabling coaches to make well-informed decisions using data. In recent years, numerous deep learning-based techniques have emerged, specifically designed for accurately localizing players and balls in a variety of sports, such as soccer, basketball, and cricket.

1) Player Localization: Player localization or detection [17]–[19] serves as a foundation for various downstream applications within the field of sports analysis. These applications include identifying player jersey numbers [20]–[22] and teams [23], [24], and predicting movements and intentions [25]–[27]. Some works [28] leverage advancements in generic object detection to enhance the understanding of soccer broadcasts. Others [24] focus on unsupervised methods to differentiate player teams and employ multi-modal and multi-view distillation approaches for player detection in amateur sports [29]. Vandeghen et al. [30] introduce a distillation method for semi-supervised learning, which significantly reduces the reliance on labeled data. Moreover, certain studies [31], [32] utilize player localization for action recognition and spotting. Object tracking [33] is also crucial for the temporal localization of players.

2) Ball Localization: Ball localization provides crucial 3D positional information about the ball, which offers comprehensive insights into its movement state [11]. This task involves estimating the ball's diameter in pixels within an image patch centered on the ball, and it finds applications in various aspects of game analytics [34]. These applications include automated offside detection in soccer [35], release point localization in basketball [36], and event spotting in table tennis [37].

Existing solutions often rely on multiple viewpoints [38]–[40] to triangulate the 2D positions of the ball detected in individual frames, providing robustness against occlusions that are prevalent in team sports such as basketball or American football.

However, in single-view ball 3D localization, occlusion becomes a significant challenge. Most approaches resort to fitting 3D ballistic trajectories based on the 2D detections [40], [41], limiting their effectiveness to detecting the ball during free fall when it follows ballistic paths. Nonetheless, in many game situations, the ball may be partially visible or fully occluded during free fall. Van et al. [36], [42] address these limitations by deviating from assumptions of ballistic trajectory, time consistency, and clear visibility. They propose an image-based method that detects the ball's center and estimates its size within the image space, bridging the gap between trajectory predictions offered by ballistic approaches. Additionally, there are also works on reconstructing 3D shuttle trajectories in badminton [43].

B. Player and Ball Tracking

Player and ball tracking is the process of consistently following and identifying the location and motion of objects across consecutive frames. This tracking operation is integral to facilitating an automated understanding of sports activities.

1) Player Tracking: Tracking players in the temporal dimension is immensely valuable for gathering player-specific statistics. Recent works [44], [45] utilize the SORT algorithm [46], which combines Kalman filtering with the Hungarian algorithm to associate overlapping bounding boxes. Additionally, Hurault et al. [47] employ a self-supervised approach, fine-tuning an object detection model trained on generic objects specifically for soccer player detection and tracking.

In player tracking, a common challenge arises from similar appearances that make it difficult to associate detections and maintain identity consistency. Intuitively, integrating information from other tasks can assist in tracking. Some works [48] explore patterns in jersey numbers, team classification, and pose-guided partial features to handle player identity switches and correlate player IDs using the K-shortest path algorithm. In dance scenarios, incorporating skeleton features from human pose estimation significantly improves tracking performance in challenging scenes with uniform costumes and diverse movements [49].

To address identity mismatches during occlusions, Naik et al. [44] utilize the difference in jersey color between teams and referees in soccer. They update color masks in the tracker module from frame to frame, assigning tracker IDs based on jersey color. Additionally, other works [45], [50] tackle occlusion issues using DeepSort [51].

2) Ball Tracking: Accurately recognizing and tracking a high-speed, small ball from raw video poses significant challenges. Huang et al. [52] propose a heatmap-based deep learning network [53], [54] to identify the ball image in a single frame and learn its flight patterns across consecutive frames. Furthermore, precise ball tracking is essential for assisting other tasks, such as recognizing spin actions in table tennis [55] by combining ball tracking information.

C. Player Re-identification

Player re-identification (ReID) is the task of matching and recognizing individuals across time and different views. In technical terms, this involves comparing an image of a person, referred to as the query, against a collection of other images within a large database, known as the gallery, taken from various camera viewpoints. In sports, the ReID task aims to re-identify players, coaches, and referees across images captured successively from moving cameras [36], [56]. Challenges such as similar appearances, occlusions, and the low resolution of player details in broadcast videos make player re-identification a challenging task.

Addressing these challenges, many approaches have focused on recognizing jersey numbers as a means of identifying players [22], [57], or have employed part-based classification techniques [58]. Recently, Teket et al. [59] proposed a real-time capable pipeline for player detection and identification using a Siamese network with a triplet loss to distinguish players from each other, without relying on fixed classes or jersey numbers. An et al. [60] introduced a multi-granularity network with an attention mechanism for player ReID, while Habel et al. [61] utilized CLIP with InfoNCE loss as an objective, focusing on class-agnostic approaches.

To address the issue of low-resolution player details in multi-view soccer match broadcast videos, Comandur et al. [56] proposed a model that re-identifies players by ranking replay frames based on their distance to a given action frame, incorporating a centroid loss, triplet loss, and cross-entropy loss to increase the margin between clusters.

In addition, some researchers have explored semi-supervised or weakly supervised methods. Maglo et al. [62] developed a semi-interactive system using a transformer-based architecture for player ReID. Similarly, in hockey, Vats et al. [63] employed a weakly-supervised training approach with cross-entropy loss to predict jersey numbers as a form of classification.

D. Player Instance Segmentation

Player instance segmentation aims at assigning pixel-level labels to each player. In player instance segmentation, occlusion is the key problem, especially in crowded regions, like basketball [36]. Some works [64], [65] utilize an online specific copy-paste method [66] to address the occlusion issue.

Moreover, instance segmentation features can be used to distinguish different players in team sports with different actions [24], [67]. In hockey, Koshkina et al. [24] use Mask R-CNN [68] to detect and segment each person on the playing surface. Zhang et al. [67] utilize the segmentation task to enhance throw action recognition [67] and event spotting [37].
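The association step at the heart of SORT-style player tracking (Sec. II-B) matches current detections to existing tracks by bounding-box overlap. The sketch below is a simplified stand-in, not any cited implementation: it uses greedy IoU matching in place of the Hungarian assignment and omits the Kalman motion model; the box format and threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily pair each track box with its best-overlapping detection."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for overlap, ti, di in pairs:
        if overlap < iou_thresh:
            break  # remaining pairs overlap too little to be the same object
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    # Unmatched detections would spawn new tracks; unmatched tracks age out.
    new_tracks = [di for di in range(len(detections)) if di not in used_d]
    return matches, new_tracks

tracks = [(0, 0, 10, 10), (100, 0, 110, 10)]
detections = [(101, 1, 111, 11), (1, 1, 11, 11), (200, 200, 210, 210)]
print(associate(tracks, detections))  # track 0 pairs with det 1, track 1 with det 0
```

Production trackers replace the greedy loop with an optimal assignment (e.g., the Hungarian algorithm) and gate matches by a motion-predicted box rather than the last observed one.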

E. Player Pose Estimation

Player pose estimation contributes to predicting the body joint locations and their spatial relationships. It often serves as a foundational component for various tasks [69], but there are limited works that specifically address the unique characteristics of sports scenes, such as their long processing times, reliance on appearance models, and sensitivity to calibration errors and noisy detections.

Recent approaches have employed OpenPose [70] for action detection or positional predictions of different elements in sports practice [71]–[73]. For sports with rapidly changing player movements, such as table tennis, some works [74] utilize a long short-term pose prediction network [75] to ensure real-time performance. In the analysis of specific actions in sports videos, certain works [76] use pose estimation techniques. Furthermore, Thilakarathne et al. [77] utilize tracked poses as input to enhance group activity recognition in volleyball. In sports where less action or movement is present but more complexity lies in the poses, researchers focus on providing practitioners with tools to verify the correctness of their poses for more efficient learning, such as in Taichi [78] and Yoga [79].

F. Camera Calibration

Camera calibration in sports, also known as field registration, aims at estimating the intrinsic and extrinsic parameters of cameras. Homography provides a mapping between a planar field and the corresponding visible area within an image. Field calibration plays a crucial role in tasks that benefit from position information within the stadium, such as 3D player tracking on the field. Various approaches have been employed to solve sport-field registration in different sports domains, including tennis, volleyball, and soccer [80], [81], often relying on keypoint retrieval methods.

With the emergence of deep learning, recent approaches focus on learning a representation of the visible sports field through various forms of semantic segmentation [32], [82]–[84]. These approaches either directly predict or regress an initial homography matrix [85]–[87], or search for the best matching homography in a reference database [84], [88] that contains synthetic images with known homography matrices or camera parameters. In other cases [83], [84], a dictionary of camera views is utilized, connecting an image projection of a synthetic reference field model to a homography. The segmentation is then linked to the closest synthetic view in the dictionary, providing an approximate camera parameter estimate, which is further refined for the final prediction.

III. COMPREHENSION

Comprehension can be defined as the process of understanding and analyzing data. It involves higher-level tasks compared to the perception stage discussed in Section II. In order to achieve a comprehensive understanding of sports, the implementation can utilize raw data and directly or indirectly incorporate the tasks from the perception layer. Namely, it can utilize the outputs obtained from the perception network, such as human skeletons, depth images, etc.

A. Action Recognition: classifying and detecting specific human action. B. Action Quality Assessment: evaluating and quantifying the overall performance or proficiency of human actions based on the analysis of video or motion data. C. Summarization: generating concise and coherent summaries that capture the key information. D. Captioning: generating descriptive and coherent textual descriptions.
Fig. 4. Taxonomy and description of comprehension tasks.

In this section, we delve into specific tasks related to understanding and analyzing sports, as shown in Figure 4. These tasks include individual and group action recognition, action quality assessment, action spotting, sports video summarization, and captioning.

A. Individual Action Recognition

Player action recognition targets classifying and detecting specific human actions. Individual action recognition is commonly used for automated statistical analysis of individual sports, such as counting the occurrences of specific actions. Moreover, it plays a crucial role in analyzing tactics, identifying key moments in matches, and tracking player activity, including metrics like running distance and performance. This analysis can assist players and coaches in identifying the essential technical factors required for achieving better results. In team sports, coaches need to monitor all players on the field and their respective actions, particularly how they execute them. Therefore, an automated system capable of tracking all these elements could greatly contribute to the players' success. However, this casts a significant challenge for computers due to the simultaneous occurrence of different actions by multiple players on the sports field, leading to issues such as occlusion and confusing scenes.

While end-to-end models [96], [97], [110] are commonly employed in the literature on video action recognition, they are often better suited for coarse-grained classification tasks [111]–[113], which focus on broader categories like punches or kicks. In contrast, most sports require more fine-grained methods capable of distinguishing between specific techniques within these broader categories [114], [115].

Fine-grained action recognition within a single sport can help mitigate contextual biases present in coarse-grained tasks, making it an increasingly important research area [114], [116], [117]. Skeleton-based methods [118]–[120] have gained popularity for fine-grained action recognition in body-centric sports. These approaches utilize 2D or 3D human pose as input for recognizing human actions. By representing the human skeleton as a graph with joint positions as nodes and modeling the movement as changes in these graph coordinates over time, both the spatial and temporal aspects of the action can be captured. Additionally, some works [121]–[123] focus on fine-

TABLE I
DEEP LEARNING MODELS FOR SPORTS COMPREHENSION. "IAR", "GAR", "AQA" STAND FOR INDIVIDUAL ACTION RECOGNITION, GROUP ACTION RECOGNITION, AND ACTION QUALITY ASSESSMENT.

Task | Method | Venue | Benchmark
IAR | TSM [89] | ICCV-2019 | FineGym, P2A
IAR | CSN [90] | ICCV-2019 | Sports-1M
IAR | SlowFast [91] | ICCV-2019 | P2A, Diving48
IAR | G-Blend [92] | CVPR-2020 | Sports-1M
IAR | AGCN [93] | TIP-2020 | FSD-10
IAR | ResGCN [94] | MM-2020 | FSD-10
IAR | MoViNet [95] | CVPR-2021 | P2A
IAR | TimeSformer [96] | ICML-2021 | P2A, Diving48
IAR | ViSwin [97] | arXiv-2021 | P2A
IAR | ORViT [98] | arXiv-2021 | Diving48
IAR | BEVT [99] | arXiv-2021 | Diving48
IAR | VIMPAC [100] | arXiv-2021 | Diving48
IAR | CTR-GCN [101] | ICCV-2021 | FSD-10
GAR | DIN [102] | ICCV-2021 | Diving48, HierVolleyball-v2
GAR | PoseC3D [103] | CVPR-2022 | FineGym, FSD-10, HierVolleyball-v2
AQA | S3D [104] | ICIP-2018 | AQA-7
AQA | C3D-LSTM [105] | WACV-2019 | AQA-7
AQA | C3D-AVG-MTL [106] | CVPR-2019 | MTL-AQA
AQA | C3D-MSLSTM [107] | TCSVT-2020 | FisV, MIT-Skate
AQA | I3D-USDL [108] | CVPR-2020 | AQA-7, MTL-AQA
AQA | TSA [109] | MM-2021 | FR-FS, AQA-7, MTL-AQA
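The skeleton-based recognizers in Table I (e.g., the GCN family: AGCN, ResGCN, CTR-GCN) treat the body as a graph of joints. As a hedged toy illustration, not any cited architecture, the sketch below builds joint adjacency from bone links and runs one unweighted graph-convolution aggregation step, averaging each joint's feature with its neighbors'; the 5-joint skeleton and coordinates are invented for the example.

```python
# Toy skeleton: 5 joints (head, neck, left arm, right arm, right leg),
# with bones given as undirected edges between joint indices.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]
N_JOINTS = 5

def neighbors(n_joints, edges):
    """Adjacency as neighbor sets, with a self-loop on every joint."""
    adj = [{i} for i in range(n_joints)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def gcn_step(features, adj):
    """One aggregation step: mean-pool each joint's feature over its neighborhood."""
    out = []
    for i, nbrs in enumerate(adj):
        pooled = [sum(features[j][k] for j in nbrs) / len(nbrs)
                  for k in range(len(features[i]))]
        out.append(pooled)
    return out

# 2D joint coordinates of one frame serve as the input features.
frame = [[0.0, 2.0], [0.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
smoothed = gcn_step(frame, neighbors(N_JOINTS, EDGES))
```

Real skeleton GCNs add learned weights per aggregation, stack many such layers, and convolve over time across frames; this sketch shows only the spatial message passing that the paragraph following Table I describes.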

grained action recognition in sports that do not involve body-centric actions.

B. Group Action Recognition

Group activity recognition involves recognizing activities performed by multiple individuals or objects. It plays a significant role in automated human behavior analysis in various fields, including sports, healthcare, and surveillance. Unlike multi-player activity recognition, group / team action recognition focuses on identifying a single group action that arises from the collective actions and interactions of each player within the group. This poses greater challenges compared to individual action recognition and requires the integration of multiple computer vision techniques.

Due to the involvement of multiple players, modeling player interaction relations becomes essential in group action analysis. In general, actor interaction relations can be modeled using graph convolutional networks (GCN) or Transformers in various methods. Transformer-based methods [124]–[129] often explicitly represent spatiotemporal relations and employ attention-based techniques to model individual relations for inferring group activity. GCN-based methods [102], [130] construct relational graphs of the actors and simultaneously explore spatial and temporal actor interactions using graph convolution networks.

Among them, Yan et al. [126] construct separate spatial and temporal relation graphs to model actor relations. Gavrilyuk et al. [124] encode temporal information using I3D [111] and establish spatial relations among actors using a vanilla transformer. Li et al. [129] introduce a cluster attention mechanism. Dual-AI [131] proposes a dual-path role interaction framework for group behavior recognition, incorporating temporal encoding of the actor into the transformer architecture. Moreover, the use of simple multi-layer perceptrons (MLP) for feature extraction in group activity analysis [132] is an emerging approach with great potential.

Moreover, some other works focus more on specific action recognition through temporal localization rather than classification. Several automated methods have been proposed to identify important actions in a game by analyzing camera shots or semantic information. Studies [133]–[135] have explored human activity localization in sports videos, salient game action identification [136], [137], and automatic identification and summarization of game highlights [138]–[140]. Recent methods focus mostly on soccer. For instance, Giancola et al. [141] introduce the concept of accurately identifying and localizing specific actions within uncut soccer broadcast videos. More recently, innovative methodologies have emerged in this field, aiming to automate the process. Cioppa et al. [142] propose the application of a context-aware loss function to enhance model performance. They later demonstrated how integrating camera calibration and player localization features can improve spotting capabilities [32]. Hong et al. [143] propose an efficient end-to-end training approach, while Darwish et al. [144] utilize spatiotemporal encoders. Alternative strategies, such as graph-based techniques [145] and transformer-based methods [146], offer fresh perspectives, particularly in handling relational data and addressing long-range dependencies. Lastly, Soares et al. [147], [148] have highlighted the potential of anchor-based methods in precise action localization and categorization.

C. Action Quality Assessment

Action quality assessment (AQA) is a method used to evaluate and quantify the overall performance or proficiency of human actions based on the analysis of video or motion data. AQA takes into account criteria such as technique, speed, and control to assess the movement and assign a score, which can be used to guide training and rehabilitation programs. AQA

has proven to be reliable and valid for assessing movement quality across various sports. Research in this field primarily focuses on analyzing the actions of athletes in the Olympic Games, such as diving, gymnastics, and other sports mentioned in Section V. Existing methods typically approach AQA as a regression task using various video representations supervised by scores.

Some studies concentrate on enhancing network structures to extract more distinct features. For instance, Xu et al. [107] propose self-attentive LSTM and multi-scale convolutional skip LSTM models to predict the Total Element Score (TES) and Total Program Component Score (PCS) in figure skating by capturing local and global sequential information in long-term videos. Xiang et al. [104] divide the diving process into four stages and employ four independent P3D models for feature extraction. Pan et al. [149] develop a graph-based joint relation model that analyzes human node motion using the joint commonality module and the joint difference module. Parisi et al. [150] propose a recurrent neural network with a growing self-organizing structure to learn body motion sequences and facilitate matching. Kim et al. [151] model the action as a structured process and encode action units using an LSTM network. Wang et al. [109] introduce a tube self-attention module for feature aggregation, enabling efficient generation of spatial-temporal contextual information through sparse feature interactions. Yu et al. [152] construct a contrastive regression framework based on video-level features to rank videos and predict accurate scores.

Other studies focus on improving the performance of action quality assessment by designing network loss functions. Li et al. [153] propose an end-to-end framework that employs C3D as a feature extractor and integrates a ranking loss with the mean squared error (MSE) loss. Parmar et al. [106] explore the AQA model in a multi-task learning scenario by introducing three parallel prediction tasks: action recognition, comment generation, and AQA score regression. Tang et al. [108] propose an uncertainty-aware score distribution learning approach that takes into account difficulty levels during the modeling process, resulting in a more realistic simulation of the scoring process.

Furthermore, some studies focus on comparing the quality of paired actions. Bertasius et al. [154] propose a model for basketball games based on first-person perspective videos, utilizing a convolutional-LSTM network to detect events and evaluate the quality of any two movements.

A. Match Evaluation: analyzing and assessing various aspects of a sports match, such as player performance, team strategies, and game dynamics. B. Play Forecasting: predicting the future actions, strategies, or outcomes of a game or play, leveraging machine learning models to anticipate player movements, team tactics, and potential game-changing events. C. Game Simulator: creating virtual environments that mimic real sports games, allowing for realistic simulations and the generation of training data. D. Synthesizing: generating realistic and immersive content, such as player movements or game scenarios.
Fig. 5. Taxonomy and description of decision tasks.

D. Sports Video Summarization

Sports video summarization aims at generating concise and coherent summaries that capture the key information. It often prioritizes the recognition of player actions [155]. This research field aims to generate highlights of broadcast sports videos, as these videos are often too lengthy for audiences to watch in their entirety. Given that many sports matches can have durations of 90-180 minutes, it becomes a challenging task to create a summary that includes only the most interesting and exciting events.

Agyeman et al. [155] employ a 3D ResNet CNN and LSTM-based deep model to detect five different soccer sports action classes. Rafiq et al. [156] propose a transfer learning-based classification framework for categorizing cricket match clips [157] into five classes, utilizing a pre-trained AlexNet CNN and data augmentation. Shingrakhia et al. [158] present a multimodal hybrid approach for classifying sports video segments, utilizing the hybrid rotation forest deep belief network and a stacked RNN with deep attention for the identification of key events. Li et al. [159] propose a supervised action proposal guided Q-learning based hierarchical refinement approach for structure-adaptive summarization of soccer videos. While current research in sports video summarization focuses on specific sports, further efforts are needed to develop a generic framework that can support different types of sports videos.

E. Captioning

Sports video captioning involves generating descriptive and coherent textual descriptions. Sports video captioning models are designed to generate sentences that provide specific details related to a particular sport, which is a multimodal [160] task. For instance, in basketball, Yu et al. [161] propose a structure that consists of a CNN model for categorizing pixels into classes such as the ball, teams, and background, a model that captures player movements using optical flow features, and a component that models player relationships. These components are combined in a hierarchical structure to generate captions for NBA basketball videos. Similarly, attention mechanisms and hierarchical recurrent neural networks have been employed for captioning volleyball videos [162]. Furthermore, the utilization of multiple modalities can be extended to explore the creation of detailed captions or narratives for sports videos. Qi et al. [163] and Yu et al. [164] have successfully generated fine-grained textual descriptions for sports videos by incorporating attention mechanisms that consider motion modeling and contextual information related
7

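To make the loss-function designs discussed above for action quality assessment concrete, the following is a minimal pure-Python sketch (our illustration, not code from the cited works) of a combined regression-plus-ranking objective in the spirit of Li et al. [153]: an MSE term fits absolute scores, while a pairwise margin term penalizes mis-ranked video pairs.

```python
def aqa_loss(pred, target, margin=0.1, alpha=0.5):
    """Combined objective for action quality assessment (AQA):
    a mean-squared-error term on absolute scores plus a pairwise
    margin-ranking term that penalizes mis-ordered video pairs."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n

    # For every pair (i, j) whose ground-truth scores satisfy
    # target[i] > target[j], require pred[i] > pred[j] + margin.
    rank_terms = [
        max(0.0, margin - (pred[i] - pred[j]))
        for i in range(n)
        for j in range(n)
        if target[i] > target[j]
    ]
    rank = sum(rank_terms) / len(rank_terms) if rank_terms else 0.0
    return alpha * mse + (1 - alpha) * rank


# A well-ordered, accurate batch incurs zero loss, while swapping the
# ranking of the same predictions is penalized by both terms.
assert aqa_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
assert aqa_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 1.0
```

In the surveyed methods, the predictions would come from deep video features (e.g., C3D); here plain lists of clip-level scores stand in for a batch.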
IV. DECISION

The decision or decision-making process in sports involves the highest level of tasks, where the deployment or implicit perception and understanding of sports are essential before generating more abstract decisions. This section encompasses tasks such as match evaluation, play forecasting, game simulation, player motion synthesizing, and sport video synthesizing, as shown in Figure 5.

A. Match Evaluation

Match evaluation involves analyzing and assessing various aspects of a sports match, such as player performance, team strategies, and game dynamics. This task requires match modeling, often employing deep reinforcement learning methods. For instance, Wang et al. [165] develop a deep reinforcement learning model to study NBA games with the goal of minimizing offensive scores. Luo et al. [166] combine Q-function learning and inverse reinforcement learning to devise a unique ranking method and an alternating learning framework for a multi-agent ice hockey Markov game. Liu et al. [167] value player actions under different game contexts using Q-function learning and introduce a new player evaluation metric called the Game Impact Metric. Yanai et al. [168] model basketball games by extending the DDPG [169] architecture to evaluate the performance of players and teams.

B. Play Forecasting

Play forecasting aims at predicting the future actions, strategies, or outcomes of a game or play, leveraging machine learning models to anticipate player movements, team tactics, and potential game-changing events. The availability of accurate player and ball tracking data in professional sports venues has generated interest in assisting coaches and analysts with data-driven predictive models of player or team behavior [170], [171]. Several studies have utilized multiple years of match data to predict various aspects, such as the ball placement in tennis [172], [173] and the likelihood of winning a point [174]. Le et al. [175] focus on predicting how NBA defenses will react to different offensive plays, while Power et al. [176] analyze the risk-reward of passes in soccer. In a more recent work, Wang et al. [177] delve into the analysis of where and what strokes to return in badminton.

C. Game Simulators

Game simulators typically aim at creating virtual environments that mimic real sports games, allowing for realistic simulations and the generation of training data [178]–[181]. These virtual environments, which are discussed in detail in Section VI, allow agents to move freely based on specific algorithms, simulating real-world sports scenarios. Within such environments, deep reinforcement learning (DRL) algorithms have shown remarkable performance in sport-related tasks. Zhao et al. [182] propose a hierarchical learning approach within a multi-agent reinforcement framework to emulate human performance in sports games. Jia et al. [183] address the challenges of asynchronous real-time scenarios in a basketball sports environment, supporting both single-agent and multi-agent training.

The soccer virtual environment GFootball has gained significant attention in recent years [181]. In the 2020 Google Research Football Competition, the winning team, WeKick [184], developed a powerful agent using imitation learning and distributed league training. However, WeKick is specifically designed for single-agent AI and cannot be extended to multi-agent control. To address this limitation, Huang et al. [185] propose TiKick, an offline multi-agent algorithm that completes full games in GFootball using replay data generated by WeKick [185]. Another approach, TiZero [186], trains agents from scratch without pre-collected data and employs a self-improvement process to develop high-quality AI for multi-agent control [186].

Although DRL systems have made significant progress, they continue to encounter challenges in several areas, including multi-agent coordination, long-term planning, and non-transitivity [187]–[189]. These challenges highlight the complexity of developing AI systems that can effectively coordinate with multiple agents, make strategic decisions over extended periods, and account for non-transitive relationships in dynamic environments. Further research and advancements in these areas are crucial for enhancing the capabilities of DRL systems.

D. Player Motion Synthesizing

Utilizing video-based sequences to capture and analyze player movements is a powerful approach to enhancing data diversity in sports, with the potential to make a positive impact on the development of sports disciplines. Through detailed analysis and reproduction of player movements, we can gain valuable insights that can improve techniques, elevate athletic performance, and drive progress in the world of sports, benefiting athletes and sports enthusiasts alike.

1) Auto Choreographer: Creating choreography involves the creative design of dance movements. However, automating the choreography process computationally is a challenging task. It requires generating continuous and complex motion that captures the intricate relationship with the accompanying music.

Music-to-dance motion generation can be approached from both 2D and 3D perspectives. 2D approaches [190]–[192] rely on accurate 2D pose detectors [193] but have limitations in terms of expressiveness and downstream applications. On the other hand, 3D dance generation methods utilize techniques such as LSTMs [194]–[198], GANs [199], [200], transformer encoders with an RNN decoder [201] or a transformer decoder [202], and convolutional sequence-to-sequence models [203], [204] to generate motion from audio.

Early works [191], [198], [204] in this field could predict future motion deterministically from audio but struggled when the same audio had multiple corresponding motions. However, recent advancements, such as the work by Li et al. [202], have addressed this limitation by formulating the problem with seed motion. This enables the generation of multiple motions from the same audio, even with a deterministic model. Li et al. [202] propose a novel cross-modal transformer-based model that better preserves the correlation between music and 3D motion. This approach results in more realistic and globally translated long human motion.

E. Sport Video Synthesizing

The goal of artificially synthesizing sports videos is to generate realistic and immersive content, such as player movements or game scenarios. Early works in this field train models using annotated videos where each time step is labeled with the corresponding action. However, these approaches use a discrete representation of actions, which makes it challenging to define prior knowledge for real-world environments. Additionally, devising a suitable continuous action representation for an environment is also complex. To address the complexity of action representation in tennis, Menapace et al. [205] propose a discrete action representation. Building upon this idea, Huang et al. [206] model actions as a learned set of geometric transformations. Davtyan et al. [207] take a different approach by separating actions into a global shift component and a local discrete action component. More recent works in tennis have utilized a NeRF-based renderer [208], which allows for the representation of complex 3D scenes. Among these works, Menapace et al. [209] employ a text-based action representation that provides precise details about the specific ball-hitting action being performed and the destination of the ball.

V. DATASETS AND BENCHMARKS

In the era of deep learning, having access to effective data is crucial for training and evaluating models. To facilitate this, we have compiled a list of commonly used public sports datasets, along with their corresponding details, as shown in Table II. Below, we provide a more detailed description of each dataset.

TABLE II
A list of video-based sports-related datasets used in the published papers. Note that some of them are not publicly available, and "General" means that the dataset contains various sports instead of only one specific type of sport. "det.", "cls.", "tra.", "ass.", "seg.", "loc.", "cal.", "cap.", and "pos." stand for player/ball detection, action classification, player/ball tracking, action quality assessment, object segmentation, temporal action localization, camera calibration, captioning, and pose estimation, respectively.

Sport          | Dataset                 | Year | Task           | # Videos | Avg. length
---------------|-------------------------|------|----------------|----------|------------
Soccer         | SoccerNet [141]         | 2018 | loc.&cls.      | 500      | 5,400
Soccer         | SSET [210]              | 2020 | det.&tra.      | 350      | 0.8h
Soccer         | SoccerDB [211]          | 2020 | cls.&loc.      | 346      | 1.5h
Soccer         | SoccerNet-v2 [212]      | 2021 | cls.&loc.      | 500      | 1.5h
Soccer         | SoccerKicks [213]       | 2021 | pos.           | 38       | -
Soccer         | SoccerNet-v3 [34]       | 2022 | cls.&tra.      | 346      | 1.5h
Soccer         | SoccerNet-Tracking [33] | 2022 | cls.&tra.      | 21       | 45.5m
Soccer         | SoccerTrack [214]       | 2022 | tra.&loc.      | 20       | 30s
Basketball     | BPAD [215]              | 2017 | ass.           | 48       | 13m
Basketball     | NBA [126]               | 2020 | cls.           | 181      | -
Basketball     | NPUBasketball [216]     | 2021 | cls.           | 2,169    | -
Basketball     | DeepSportradar-v1 [36]  | 2022 | seg.&cal.      | -        | -
Basketball     | NSVA [217]              | 2022 | cls.&cap.      | 32,019   | 9.5s
Tennis         | PE-Tennis [218]         | 2022 | det.&cal.      | 14,053   | 3s
Tennis         | LGEs-Tennis [209]       | 2023 | cal.&tra.&cap. | 7,112    | 7.8s
Figure Skating | FisV-5 [219]            | 2020 | ass.&cls.      | 500      | 2m50s
Figure Skating | FR-FS [220]             | 2021 | ass.&cls.      | 417      | -
Diving         | MTL-AQA [221]           | 2019 | ass.           | 1,412    | -
Diving         | FineDiving [222]        | 2022 | ass.&cls.      | 3,000    | 52s
Dance          | GrooveNet [194]         | 2017 | pos.           | 2        | 11.5m
Dance          | Dance with Melody [195] | 2018 | pos.           | 61       | 92s
Dance          | EA-MUD [200]            | 2020 | pos.           | 17       | 74s
Dance          | AIST++ [202]            | 2021 | det.&pos.      | 1,408    | 13s
Dance          | DanceTrack [49]         | 2022 | tra.           | 100      | 52.9s
Golf           | GolfDB [223]            | 2019 | cls.           | 1,400    | -
Gymnastics     | FineGym [114]           | 2020 | cls.&loc.      | -        | -
Rugby          | Rugby sevens [119]      | 2022 | tra.           | 346      | 40s
Baseball       | MLB-YouTube [224]       | 2018 | cls.           | 5,111    | -
General        | Sports 1M [225]         | 2014 | cls.           | 1M       | 36s
General        | OlympicSports [226]     | 2014 | ass.           | 309      | -
General        | SVW [227]               | 2015 | det.&cls.      | 4,100    | 11.6s
General        | OlympicScoring [228]    | 2017 | ass.           | 716      | -
General        | MADS [229]              | 2017 | ass.           | 30       | -
General        | MultiTHUMOS [230]       | 2017 | cls.           | 400      | 4.5m
General        | AQA-7 [231]             | 2019 | ass.           | 1,189    | -
General        | C-Sports [232]          | 2020 | cls.&loc.      | 2,187    | -
General        | MultiSports [233]       | 2021 | cls.&loc.      | 3,200    | 20.9s
General        | ASPset-510 [234]        | 2021 | pos.           | 510      | -
General        | HAA-500 [235]           | 2021 | cls.           | 10,000   | 2.12s
General        | SMART [236]             | 2021 | cls.           | 5,000    | -
General        | Win-Fail [237]          | 2022 | cls.           | 817      | 3.3s
General        | SportsPose [238]        | 2023 | pos.           | 25       | 11m
General        | SportsMOT [239]         | 2023 | tra.           | 240      | 25s

A. Soccer

In soccer, most video-based datasets benefit active tasks like player tracking and action recognition, while some datasets focus on field localization and registration or player depth maps and meshes.

Some datasets focus more on player detection and tracking. Soccer-ISSIA [240] is an early work and a relatively small dataset with player bounding box annotations. SVPP [241] provides a multi-sensor dataset that includes body sensor data and video data. Soccer Player [242] is specifically designed for player detection and tracking, while SoccerTrack [214] is a novel dataset with multi-view and super high definition.

Other datasets like Football Action [137] and SoccerDB [211] benefit action recognition, and ComprehensiveSoccer [243] and SSET [210] can be used for various video analysis tasks, such as action classification, localization, and player detection. SoccerKicks [213] provides player pose estimation. GOAL [244] supports knowledge-grounded video captioning.

The SoccerNet series [33], [34], [141], [212] is the largest, including a variety of spatial annotations and cross-view correspondences. It covers multiple vision-based tasks, including player understanding (e.g., player tracking and re-identification), broadcast video understanding (e.g., action spotting and video captioning), and field understanding (e.g., camera calibration).

In recent years, the combination of large-scale datasets and deep learning models has become increasingly popular in the field of soccer tasks, raising the popularity of the SoccerNet series datasets [34], [141], [212]. Meanwhile, SoccerDB [211], SSET [210], and ComprehensiveSoccer [243] are more suitable for tasks that require player detection. However, there are few datasets like SoccerKicks [213] for soccer player pose estimation. It is hoped that more attention can be paid to the recognition and understanding of player skeletal movements in the future.

B. Basketball

Basketball datasets have been developed for various tasks such as player and ball detection, action recognition, and pose estimation. APIDIS [40], [245] is a challenging dataset with annotations for player and ball positions, and clock and non-clock actions. Basket-1,2 [38] consists of two frame sequences for action recognition and ball detection. NCAA [246] is a large dataset with action categories and bounding boxes for player detection. SPIROUDOME [215] focuses on player detection and localization. BPAD [154] is a first-person perspective dataset with labeled basketball events. SpaceJam [247] is for action recognition with estimated player poses. FineBasketball [248] is a fine-grained dataset with 3 broad and 26 fine-grained categories. NBA [126] is a dataset for group activity recognition, where each clip belongs to one of nine group activities, and no individual annotations, such as separate action labels and bounding boxes, are provided. NPUBasketball [216] contains RGB frames, depth maps, and skeleton information for various types of action recognition models. DeepSportradar-v1 [36] is a multi-label dataset for 3D localization, calibration, and instance segmentation tasks. For the captioning task, NSVA [217] is the largest open-source dataset in the basketball domain. Compared to SVN [249] and SVCDV [162], NSVA is publicly accessible and has the most sentences among the three datasets, with five times more videos than both SVN and SVCDV. Additionally, there are some special datasets that focus on reconstructing the player. The NBA2K dataset [250] includes body meshes and texture data of several NBA players.

C. Volleyball

Despite being a popular sport, there are only a few volleyball datasets available, most of which are on small scales. Volleyball-1,2 [38] contains two sequences with manually annotated ball positions. HierVolleyball [251] and its extension HierVolleyball-v2 [252] are developed for team activity recognition, with annotated player actions and positions. The Sports Video Captioning Dataset-Volleyball (SVCDV) [162] is a dataset for captioning tasks, with 55 videos from YouTube, each containing an average of 9.2 sentences. However, this dataset is not available for download.

D. Hockey

The Hockey Fight dataset [253] contains 1,000 video clips from National Hockey League (NHL) games for binary classification of fight and non-fight. The Player Tracklet dataset [254] consists of 84 video clips from NHL games with annotated bounding boxes and identity labels for players and referees and is suitable for player tracking and identification.

E. Tennis

Various datasets have been constructed for tennis video analysis. ACASVA [255] is designed for tennis action recognition and consists of six broadcast videos of tennis games with labeled player positions and time boundaries of actions. THETIS [256] includes 1,980 self-recorded videos of 12 tennis actions with RGB, depth, 2D skeleton, and 3D skeleton videos, which can be used for multiple types of action recognition models. TenniSet [257] contains five Olympic tennis match videos with six labeled event categories and textual descriptions, making it suitable for recognition, localization, and action retrieval tasks.

It should be noted that some recent works focus more on generative tasks, like PVG [258], which obtained a tennis dataset through YouTube videos. PE-Tennis [218] builds upon PVG and introduces camera calibration resulting from reconstruction, making it possible to edit the viewpoint. LGEs-Tennis [209] enables generation from text editing on player movement, shot type, and location.
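Tracking-oriented benchmarks mentioned throughout this section (e.g., SoccerNet-Tracking [33], SoccerTrack [214], and SportsMOT [239]) score methods with metrics built on bounding-box intersection-over-union (IoU). The snippet below is a simplified, self-contained greedy matcher for a single frame (our illustration, not the official evaluation code of any benchmark; real MOT metrics additionally track identities across frames).

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def match_detections(gt, pred, thr=0.5):
    """Greedily match predicted boxes to ground-truth boxes:
    a prediction counts as a true positive if it overlaps a
    still-unmatched ground-truth box with IoU >= thr, the common
    convention in detection and tracking evaluation."""
    matched, used = 0, set()
    for p in pred:
        best, best_i = 0.0, None
        for i, g in enumerate(gt):
            if i in used:
                continue
            o = iou(p, g)
            if o > best:
                best, best_i = o, i
        if best >= thr and best_i is not None:
            used.add(best_i)
            matched += 1
    return matched  # number of true positives in this frame

# Example: two players; one prediction well aligned, one drifted away.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(1, 1, 11, 11), (28, 28, 38, 38)]
assert match_detections(gt, pred) == 1
```

Per-frame counts like this feed into aggregate scores (precision, recall, MOTA-style metrics); the hypothetical box tuples above stand in for annotations loaded from a dataset.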
F. Table Tennis suitable for model evaluation. In contrast, MTL-AQA [221]


Various datasets have been developed for table tennis stroke consists of 1,412 samples annotated with action quality scores,
recognition, such as TTStroke-21 [259], which comprises 129 class labels, and textural commentary, making it suitable for
self-recorded videos of 21 categories, and SPIN [55], which multiple tasks. In addition, FineDiving [222] is a recent dataset
includes 53 hours of self-recorded videos with annotations of consisting of 3,000 video samples covering 52 types of actions,
ball position and player joints. OpenTTGames [37] consists 29 sub-action types, and 23 difficulty levels, providing fine-
of 12 HD videos of table tennis games, labeled with ball grained annotations including action types, sub-action types,
coordinates and events. Stroke Recognition [260] is similar coarse and fine time boundaries, and action scores. It is the
to TTStroke-21, but much larger, and P2 A [261] is one of the first fine-grained motion video dataset for the AQA task,
largest datasets for table tennis analysis, with annotations of filling the gap in fine-grained annotations in AQA and suitable
each stroke in 2,721 broadcasting videos. for designing competition strategies and better showcasing
athletes’ strengths.

G. Gymnastics
K. Dance
The FineGym [114] is a recent work developed for gymnas-
tic action recognition and localization. It contains 303 videos The field of deep learning has several research tasks for
with around 708-hour length and is annotated hierarchically, dance, including music-oriented choreography, dance mo-
making it suitable for fine-grained action recognition and tion synthesis, and multiple object tracking. Researchers
localization. On the other hand, AFG-Olympics [262] provides propose several datasets to promote research in this field.
challenging scenarios with extensive background, viewpoint, GrooveNet [194] consists of approximately 23 minutes of
and scale variations over an extended sample duration of up to motion capture data recorded at 60 frames per second and four
2 minutes. Additionally, a discriminative attention module is performances by a dancer. Dance with Melody [195] includes
proposed to embed long-range spatial and temporal correlation 40 complete dance choreographies for four types of dance,
semantics. totaling 907,200 frames collected with optical motion capture
equipment. EA-MUD [200] includes 104 video sequences
of 12 dancing genres, while AIST++ [202] is a large-scale
H. Badminton 3D human dance motion dataset with frame-level annotations
The Badminton Olympic [263] provides annotations for including 9 views of camera intrinsic and extrinsic parameters,
player detection, point localization, action recognition, and 17 COCO-format human joint locations in both 2D and 3D,
localization tasks. It comprises 10 YouTube videos of singles and 24 SMPL pose parameters. These datasets can be used for
badminton matches, each approximately an hour long. The tasks such as dance motion recognition, tracking, and quality
dataset includes annotations for player positions, temporal assessment.
locations of point wins, and time boundaries and labels of
strokes. Meanwhile, Stroke Forecasting [177] contains 43,191 L. Sport Related Datasets for General Purpose
trimmed video clips of badminton strokes categorized into 10
types, which can be used for both action recognition and stroke There are several datasets for sports action recognition
forecasting. and assessment tasks, including UCF sports [268], MSR
Action3D [269], Olympic [270], Sports 1M [225], SVW [227],
MultiSports [233], OlympicSports [226], OlympicScor-
I. Figure skating ing [228], and AQA [231]. These datasets cover different
There are 5 datasets proposed for figure skating action sports, including team sports and individual sports, and provide
recognition in recent years. FineSkating [264] is a hierarchical- various annotations, such as action labels, quality scores, and
labeled dataset of 46 videos of figure skating competitions bounding boxes.
for action recognition and action quality assessment. FSD- Additionally, Win-Fail [237] is a dataset specifically de-
10 [265] comprises ten categories of figure skating actions and signed for recognizing the outcome of actions, while Sport-
provides scores for action quality assessment. FisV-5 [107] is sPose [238] is the largest markerless dataset for 3D human
a dataset of 500 figure skating competition videos labeled with pose estimation in sports, containing 5 short sports-related
scores by 9 professional judges. FR-FS [109] is designed to activities recorded from 7 cameras, totaling 1.5 million frames.
recognize figure skating falls, with 417 videos containing the SportsMOT [239] is a large-scale and high-quality multi-
movements of take-off, rotation, and landing. MCFS [266] has object tracking dataset comprising detailed annotations for
three-level annotations of figure skating actions and their time each player present on the field in diverse sports scenarios.
boundaries, allowing for action recognition and localization. These datasets provide valuable resources for researchers to
develop and evaluate algorithms for various sports-related
J. Diving tasks.
There are three diving datasets available for action recogni-
tion and action quality assessment. Diving48 [267] contains M. Others
18,404 video segments covering 48 fine-grained categories CVBASE Handball [271] and CVBASE Squash [271]
of diving actions, making it a relatively low-bias dataset are developed for handball and squash action recognition,
11

respectively, with annotated trajectories of players and ac- the performance and reliability of deep learning models in
tion categories. GolfDB [223] facilitates the analysis of golf sports applications.
swings, providing 1,400 high-quality golf swing video seg- b) Datasets Standardization: Standardizing datasets for
ments, action labels, and bounding boxes of players. Lastly, various sports is a daunting task, as each sport has unique
FenceNet [119] consists of 652 videos of expert-level fencers technical aspects and rules that make it difficult to create
performing six categories of actions, with RGB frames, 3D a unified benchmark for specific tasks. For example, taking
skeleton data, and depth data provided. Rugby sevens [62] action recognition tasks as an example, in diving [222], only
is a public sports tracking dataset with tracking ground truth the movement of the athlete needs to be focused on, and
and the generated tracks. MLB-YouTube [224] is introduced attention should be paid to the details of role actions. However,
for fine-grained action recognition in baseball videos. in team sports such as volleyball [251], more attention is
needed to distinguish and identify targets and cluster the
VI. V IRTUAL E NVIRONMENTS same actions after identification. Given the varying emphases
of tasks, there are substantial differences in the dataset re-
Researchers can utilize virtual environments for simulation.
quirements. To go further, action recognition of the same
In a virtual environment that provides agents with simulated
sport type, involves nuanced differences in label classification,
motion tasks, multiple data information can be continuously
making it challenging to develop a one-size-fits-all solution or
generated and retained in the simulation. For example, Fever
benchmark. The creation of standardized, user-friendly, open-
Basketball [183] is an asynchronous environment, which sup-
source, high-quality, and large-scale datasets is crucial for
ports multiple characters, multiple positions, and both the
advancing research and enabling fair comparisons between dif-
single-agent and multi-agent player control modes.
ferent models and approaches in sports performance analysis.
There are many virtual soccer games, such as rSoccer [178],
RoboCup Soccer Simulator [272], the DeepMind MuJoCo c) Data Utilization: The sports domain generates vast
Multi-Agent Soccer Environment [179], [180] and JiDi amounts of fine-grained data through sensors and IoT devices.
Olympics Football [273]. rSoccer [178] and JiDi Olympics However, current data processing methods primarily focus on
Football [273] are two toy football games in which plays are computer vision and do not fully exploit the potential of end-
just rigid bodies and can just move and push the ball. However, to-end deep learning approaches. To fully harness the power of
players in GFootball [181] have more complex actions, such these rich data sources, researchers must develop methods that
as dribbling, sliding, and sprinting. Besides, environments like combine fine-grained sensor data with visual information. This
RoboCup Soccer Simulator [272] and DeepMind MuJoCo fusion of diverse data streams can enable more comprehensive
Multi-Agent Soccer Environment [179], [180] focus more on and insightful analysis, leading to significant advancements
low-level control of a physics simulation of robots, while in the field of sports performance. Some studies have shown
GFootball focuses more on developing high-level tactics. To that introducing multi-modal data can benefit the analysis of
improve the flexibility and control over environment dynamics, athletic performance. For example, in table tennis, visual and
SCENIC [274] is proposed to model and generate diverse sce- IOT signals can be simultaneously used to analyze athlete
narios in a real-time strategy environment programmatically. performance [275]. In dance, visual and audio signals are both
important [202]. More attention is needed on how to utilize
diverse data, so as to achieve better fusion. Meanwhile, multi-
VII. C HALLENGES modal algorithms and datasets [202] are both necessary.
In recent years, deep learning has emerged as a powerful
tool in the analysis and enhancement of sports performance.
VIII. F UTURE TREND
The application of these advanced techniques has revolution-
ized the way athletes, coaches, and teams approach training, The integration of deep learning methodologies into sports
strategy, and decision-making. By leveraging the vast amounts analytics can empower athletes, coaches, and teams with
of data generated in sports, deep learning models have the unprecedented insights into performance, decision-making,
potential to uncover hidden patterns, optimize performance, and injury prevention. This future work aims to explore the
and provide valuable insights that can inform decision-making transformative impact of deep learning techniques in sports
processes. However, despite its promising potential, the im- performance, focusing on data generation methods, multi-
plementation of deep learning in sports performance faces modality and multi-task models, foundation models, applica-
several challenges that need to be addressed to fully realize tions, and practicability.
its benefits. a) Multi-modality and Multi-task: By harnessing the
a) Task Challenge: The complex and dynamic nature power of multi-modal data and multi-task learning, robust
of sports activities presents unique challenges for computer and versatile models capable of handling diverse and complex
vision tasks in tracking and recognizing athletes and their sports-related challenges can be fulfilled. Furthermore, we will
movements. Issues such as identity mismatch due to similar investigate the potential of large-scale models in enhancing
appearances [48], [49], blurring [52] caused by rapid motion, predictive and analytical capabilities. It consists of practical
and occlusion [44], [45] from other players or objects in the applications and real-world implementations that can improve
scene can lead to inaccuracies and inconsistencies in tracking athlete performance and overall team dynamics. Ultimately,
and analysis. Developing robust and adaptable algorithms that this work seeks to contribute to the growing body of research
can effectively handle these challenges is essential to improve on deep learning in sports performance, paving the way for
12

novel strategies and technologies that can revolutionize the sports for everyone. There are already some works [282]–
world of sports analytics. [284] focusing on sports performance analysis, data recording
b) Foundation Model: The popularity of ChatGPT has visualization, energy expenditure estimation, and many other
demonstrated the power of large language models [276], while aspects. At the same time, in professional sports, there are also
the recent segment-anything project showcases the impressive some works [16], [275] that focus on combining various data
performance of large models in visual tasks [277]. The prompt- and methods to help improve athletic performance. Broadly
based paradigm is highly capable and flexible in natural language processing and even image segmentation, offering unprecedentedly rich functionality. For example, recent work has applied segment-anything to medical images [278]–[280], achieving promising results by providing point or bounding-box prompts for a preliminary zero-shot capability assessment, and demonstrating that the segment anything model (SAM) generalizes well to medical imaging. The development of large models in the sports domain should therefore consider both how to combine existing large models to explore new applications and how to create large models specifically for sports.

Combining large models requires considering the adaptability of the task. Compared to the medical field, sports involve a high level of human participation and thus naturally accommodate methods and data of different levels and modalities. We believe that both large language models in natural language processing and large image segmentation models in computer vision can be highly compatible with sports. In short, there is clear potential in downstream tasks such as using ChatGPT for performance evaluation and feedback: ChatGPT can generate natural language summaries of player or team performance and provide personalized feedback and recommendations for improvement.

Foundation models built directly for the sports domain require a vast amount of data for the corresponding tasks. For visual tasks, for example, it is essential to ensure good scalability, adopt a prompt-based paradigm, and maintain powerful capabilities while remaining flexible and offering rich functionality. Note that a large model does not necessarily imply a large number of parameters, but rather a strong ability to solve tasks: recent work on segment-anything has shown that even relatively simple models achieve excellent performance when the data volume is sufficiently large. Creating large-scale, high-quality datasets for sports therefore remains a crucial task.

c) Data Generation: High-quality generated data can significantly reduce manual labeling costs while demonstrating the diversity that generative models can bring. Many studies [202], [281] have focused on generating sports videos, offering easily editable, high-quality generation methods, as elaborated in Sections IV-D and IV-E. Meanwhile, large models can be combined at this stage to perform additional annotation work and, where possible, to generate new usable data.

d) Applications: Although there are many excellent automatic algorithms for different tasks in the field of sports, they are still insufficient when it comes to deployment for specific tasks. For the daily exercise of ordinary people, who generally lack professional guidance, there should be more applications that make good use of these deep learning algorithms and promote them through user-friendly, intelligent methods. Generally speaking, in both daily life and professional fields, there is a need for more applications relating to health and fitness assessment.

e) Practicability: In more challenging, high-level tasks with real-world applications, practicality becomes increasingly important. Many practical challenges remain unexplored or under-explored in applying deep learning to sports performance. In decision-making, for example, current solutions often rely on simulation-based approaches, yet multi-agent decision-making techniques hold great potential for enhancing real-world sports decision-making. Tasks such as ad-hoc teamwork [285] in multi-agent systems and zero-shot human-machine interaction are crucial for enabling effective and practical real-world applications. Further research is needed to bridge the gap between theoretical advances and their practical implications for sports performance analysis and decision-making. For example, RoboCup [272] aims to defeat human players in the World Cup by 2050. This complex task requires robots to perceive their environment, gather and understand information, and execute specific actions; such agents must generalize sufficiently, engage in extensive human-machine interaction, and respond quickly to performance and environmental changes in real time.

IX. CONCLUSION

In this paper, we present a comprehensive survey of deep learning in sports, focusing on four main aspects: algorithms, datasets, challenges, and future works. We innovatively summarize the taxonomy and divide methods into perception, comprehension, and decision, from low-level to high-level tasks. In the challenges and future works, we present cutting-edge methods and give insights into the future trends and challenges of deep learning in sports.

ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program of China under Grant No. 2022ZD0162000 and the National Natural Science Foundation of China under Grant No. 62106219.
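One way to realize the ChatGPT-based evaluation above is to serialize match statistics into a prompt and send it to a chat-completion endpoint. A minimal sketch, assuming illustrative statistic names and prompt wording (the actual API call is deliberately left out):

```python
def build_performance_prompt(player: str, stats: dict) -> str:
    """Serialize per-match statistics into a prompt asking an LLM for a
    plain-language summary plus personalized improvement advice."""
    lines = [f"- {name}: {value}" for name, value in sorted(stats.items())]
    return (
        f"You are a sports performance analyst. Summarize {player}'s match "
        "in plain language, then give two personalized recommendations "
        "for improvement.\nStatistics:\n" + "\n".join(lines)
    )

prompt = build_performance_prompt(
    "Player 7", {"shots": 5, "goals": 2, "pass accuracy (%)": 86}
)
# `prompt` would then be sent to a chat model (e.g., via an OpenAI-style
# chat-completions client) to obtain the summary and feedback text.
print(prompt.splitlines()[0])
```

Keeping prompt construction separate from the model call makes the statistic schema easy to test and swap without touching the LLM client.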
[142] A. Cioppa, A. Deliege, S. Giancola, B. Ghanem, M. V. Droogenbroeck, [164] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X. Yang, “Fine-grained
R. Gade, and T. B. Moeslund, “A context-aware loss function for action video captioning for sports narrative,” in 2018 IEEE/CVF Conference
spotting in soccer videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6006–6015.
on Computer Vision and Pattern Recognition, 2020, pp. 13 126–13 136. [165] J. Wang, I. Fox, J. Skaza, N. Linck, S. Singh, and J. Wiens, “The
[143] J. Hong, H. Zhang, M. Gharbi, M. Fisher, and K. Fatahalian, “Spotting advantage of doubling: a deep reinforcement learning approach to
temporally precise, fine-grained events in video,” in Computer Vision– studying the double team in the nba,” arXiv preprint arXiv:1803.02940,
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23– 2018.
27, 2022, Proceedings, Part XXXV. Springer, 2022, pp. 33–51. [166] Y. Luo, “Inverse reinforcement learning for team sports: Valuing
[144] A. Darwish and T. El-Shabrway, “Ste: Spatio-temporal encoder for ac- actions and players,” 2020.
tion spotting in soccer videos,” in Proceedings of the 5th International [167] G. Liu and O. Schulte, “Deep reinforcement learning in ice hockey
ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. for context-aware player evaluation,” arXiv preprint arXiv:1805.11088,
87–92. 2018.
[145] A. Cartas, C. Ballester, and G. Haro, “A graph-based method for [168] C. Yanai, A. Solomon, G. Katz, B. Shapira, and L. Rokach, “Q-
soccer action spotting using unsupervised player classification,” in ball: Modeling basketball games using deep reinforcement learning,” in
Proceedings of the 5th International ACM Workshop on Multimedia Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36,
Content Analysis in Sports, 2022, pp. 93–102. no. 8, 2022, pp. 8806–8813.
[146] H. Zhu, J. Liang, C. Lin, J. Zhang, and J. Hu, “A transformer-based [169] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
system for action spotting in soccer videos,” in Proceedings of the D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
5th International ACM Workshop on Multimedia Content Analysis in ment learning,” arXiv preprint arXiv:1509.02971, 2015.
Sports, 2022, pp. 103–109. [170] “statsperform-optical-tracking,” https://www.statsperform.com/
[147] J. V. Soares and A. Shah, “Action spotting using dense detection team-performance/football/optical-tracking.
anchors revisited: Submission to the soccernet challenge 2022,” arXiv [171] “secondspectrum,” https://www.secondspectrum.com.
preprint arXiv:2206.07846, 2022. [172] X. Wei, P. Lucey, S. Morgan, and S. Sridharan, “Forecasting the next
[148] J. V. Soares, A. Shah, and T. Biswas, “Temporally precise action shot location in tennis using fine-grained spatiotemporal tracking data,”
spotting in soccer videos using dense detection anchors,” in 2022 IEEE IEEE Transactions on Knowledge and Data Engineering, vol. 28,
International Conference on Image Processing (ICIP). IEEE, 2022, no. 11, pp. 2988–2997, 2016.
pp. 2796–2800. [173] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Memory aug-
[149] J. H. Pan, J. Gao, and W. S. Zheng, “Action assessment by joint relation mented deep generative models for forecasting the next shot location
graphs,” in ICCV, 2019. in tennis,” IEEE Transactions on Knowledge and Data Engineering,
[150] G. I. Parisi, S. Magg, and S. Wermter, “Human motion assessment vol. 32, no. 9, pp. 1785–1797, 2019.
in real time using recurrent self-organization,” in IEEE International [174] X. Wei, P. Lucey, S. Morgan, M. Reid, and S. Sridharan, “The thin edge
Symposium on Robot and Human Interactive Communication (RO- of the wedge: Accurately predicting shot outcomes in tennis using style
MAN), 2016. and context priors,” in Proceedings of the 10th Annu MIT Sloan Sport
[151] S. T. Kim and M. R. Yong, “Evaluationnet: Can human skill be Anal Conf, Boston, MA, USA, 2016, pp. 1–11.
evaluated by deep networks?” arXiv:1705.11077, 2017. [175] H. M. Le, P. Carr, Y. Yue, and P. Lucey, “Data-driven ghosting using
[152] X. Yu, Y. Rao, W. Zhao, J. Lu, and J. Zhou, “Group-aware con- deep imitation learning,” 2017.
trastive regression for action quality assessment,” in Proceedings of [176] P. Power, H. Ruiz, X. Wei, and P. Lucey, “Not all passes are created
the IEEE/CVF International Conference on Computer Vision, 2021, equal: Objectively measuring the risk and reward of passes in soccer
pp. 7919–7928. from tracking data,” in Proceedings of the 23rd ACM SIGKDD inter-
[153] Y. Li, X. Chai, and X. Chen, “End-to-end learning for action quality national conference on knowledge discovery and data mining, 2017,
assessment,” in Advances in Multimedia Information Processing – pp. 1605–1613.
PCM, 2018. [177] W.-Y. Wang, H.-H. Shuai, K.-S. Chang, and W.-C. Peng, “Shuttlenet:
[154] G. Bertasius, H. S. Park, S. X. Yu, and J. Shi, “Am i a baller? basketball Position-aware fusion of rally progress and player styles for stroke
performance assessment from first-person videos,” in ICCV, 2019. forecasting in badminton,” in Proceedings of the AAAI Conference on
[155] R. Agyeman, R. Muhammad, and G. S. Choi, “Soccer video summa- Artificial Intelligence, 2022.
rization using deep learning,” in 2019 IEEE Conference on Multimedia [178] F. B. Martins, M. G. Machado, H. F. Bassani, P. H. M. Braga, and E. S.
Information Processing and Retrieval (MIPR), 2019, pp. 270–273. Barros, “rsoccer: A framework for studying reinforcement learning in
[156] M. Rafiq, G. Rafiq, R. Agyeman, G. S. Choi, and S.-I. Jin, “Scene small and very small size robot soccer,” 2021.
classification for sports video summarization using transfer learning,” [179] S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Grae-
Sensors, vol. 20, no. 6, p. 1702, Mar 2020. [Online]. Available: pel, “Emergent coordination through competition,” arXiv preprint
http://dx.doi.org/10.3390/s20061702 arXiv:1902.07151, 2019.
[157] A. A. Khan, J. Shao, W. Ali, and S. Tumrani, “Content-aware summa- [180] S. Liu, G. Lever, Z. Wang, J. Merel, S. Eslami, D. Hennes, W. M.
rization of broadcast sports videos: An audio–visual feature extraction Czarnecki, Y. Tassa, S. Omidshafiei, A. Abdolmaleki et al., “From
approach,” Neural Processing Letters, pp. 1–24, 2020. motor control to team play in simulated humanoid football,” arXiv
[158] H. Shingrakhia and H. Patel, “Sgrnn-am and hrf-dbn: A hybrid preprint arXiv:2105.12196, 2021.
machine learning model for cricket video summarization,” Vis. [181] K. Kurach, A. Raichuk, P. Stańczyk, M. Zajac,˛ O. Bachem, L. Espeholt,
Comput., vol. 38, no. 7, p. 2285–2301, jul 2022. [Online]. Available: C. Riquelme, D. Vincent, M. Michalski, O. Bousquet et al., “Google
https://doi.org/10.1007/s00371-021-02111-8 research football: A novel reinforcement learning environment,” in
17

Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10 061–
no. 04, 2020, pp. 4501–4510. 10 070.
[182] Y. Zhao, I. Borovikov, J. Rupert, C. Somers, and A. Beirami, “On multi- [206] J. Huang, Y. Jin, K. M. Yi, and L. Sigal, “Layered controllable video
agent learning in team sports games,” arXiv preprint arXiv:1906.10124, generation,” in "Proceedings of the European Conference of Computer
2019. Vision (ECCV)", S. Avidan, G. Brostow, M. Cissé, G. M. Farinella,
[183] H. Jia, Y. Hu, Y. Chen, C. Ren, T. Lv, C. Fan, and C. Zhang, and T. Hassner, Eds., 2022.
“Fever basketball: A complex, flexible, and asynchronized sports game [207] A. Davtyan and P. Favaro, “Controllable video generation through
environment for multi-agent reinforcement learning,” arXiv preprint global and local motion dynamics,” in Proceedings of the European
arXiv:2012.03204, 2020. Conference of Computer Vision (ECCV), 2022.
[184] F. Z. Ziyang Li, Kaiwen Zhu, “Wekick,” https://www.kaggle.com/c/ [208] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor-
google-football/discussion/202232, 2020. thi, and R. Ng, “Nerf: Representing scenes as neural radiance fields
[185] S. Huang, W. Chen, L. Zhang, Z. Li, F. Zhu, D. Ye, T. Chen, and for view synthesis,” in Proceedings of the European Conference of
J. Zhu, “Tikick: Towards playing multi-agent football full games from Computer Vision (ECCV), 2020.
single-agent demonstrations,” arXiv preprint arXiv:2110.04507, 2021. [209] W. Menapace, A. Siarohin, S. Lathuilière, P. Achlioptas, V. Golyanik,
[186] F. Lin, S. Huang, T. Pearce, W. Chen, and W.-W. Tu, “Tizero: Mastering E. Ricci, and S. Tulyakov, “Plotting behind the scenes: Towards
multi-agent football with curriculum learning and self-play,” arXiv learnable game engines,” arXiv preprint arXiv:2303.13472, 2023.
preprint arXiv:2302.07515, 2023. [210] N. Feng, Z. Song, J. Yu, Y.-P. P. Chen, Y. Zhao, Y. He, and T. Guan,
[187] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu, “The “Sset: a dataset for shot segmentation, event detection, player tracking
surprising effectiveness of mappo in cooperative, multi-agent games,” in soccer videos,” Multimedia Tools and Applications, vol. 79, pp.
arXiv preprint arXiv:2103.01955, 2021. 28 971–28 992, 2020.
[188] M. Wen, J. G. Kuba, R. Lin, W. Zhang, Y. Wen, J. Wang, and Y. Yang, [211] Y. Jiang, K. Cui, L. Chen, C. Wang, and C. Xu, “Soccerdb: A large-
“Multi-agent reinforcement learning is a sequence modeling problem,” scale database for comprehensive video understanding,” in Proceedings
arXiv preprint arXiv:2205.14953, 2022. of the 3rd International Workshop on Multimedia Content Analysis in
[189] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “First Sports, 2020, pp. 1–8.
return, then explore,” Nature, vol. 590, no. 7847, pp. 580–586, 2021. [212] A. Deliege, A. Cioppa, S. Giancola, M. J. Seikavandi, J. V. Dueholm,
[190] P. Tendulkar, A. Das, A. Kembhavi, and D. Parikh, “Feel the music: K. Nasrollahi, B. Ghanem, T. B. Moeslund, and M. Van Droogen-
Automatically generating a dance for an input song,” arXiv preprint broeck, “Soccernet-v2: A dataset and benchmarks for holistic under-
arXiv:2006.11905, 2020. standing of broadcast soccer videos,” in Proceedings of the IEEE/CVF
[191] X. Ren, H. Li, Z. Huang, and Q. Chen, “Self-supervised dance video Conference on Computer Vision and Pattern Recognition, 2021, pp.
synthesis conditioned on music,” in Proceedings of the 28th ACM 4508–4519.
International Conference on Multimedia, 2020, pp. 46–54. [213] N. M. Lessa, E. L. Colombini, and A. D. S. Simões, “Soccerkicks:
a dataset of 3d dead ball kicks reference movements for humanoid
[192] J. P. Ferreira, T. M. Coutinho, T. L. Gomes, J. F. Neto, R. Azevedo,
robots,” in 2021 IEEE International Conference on Systems, Man, and
R. Martins, and E. R. Nascimento, “Learning to dance: A graph
Cybernetics (SMC). IEEE, 2021, pp. 3472–3478.
convolutional adversarial network to generate realistic dance motions
[214] A. Scott, I. Uchida, M. Onishi, Y. Kameda, K. Fukui, and K. Fujii,
from audio,” Computers & Graphics, vol. 94, pp. 11–21, 2021.
“Soccertrack: A dataset and tracking algorithm for soccer with fish-
[193] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person
eye and drone videos,” in Proceedings of the IEEE/CVF Conference
2d pose estimation using part affinity fields,” in Proceedings of the
on Computer Vision and Pattern Recognition, 2022, pp. 3569–3579.
IEEE conference on computer vision and pattern recognition, 2017,
[215] P. Parisot and C. De Vleeschouwer, “Scene-specific classifier for
pp. 7291–7299.
effective and efficient team sport players detection from a single
[194] O. Alemi, J. Françoise, and P. Pasquier, “Groovenet: Real-time music-
calibrated camera,” Computer Vision and Image Understanding, vol.
driven dance movement generation using artificial neural networks,”
159, pp. 74–88, 2017.
networks, vol. 8, no. 17, p. 26, 2017.
[216] C. Ma, J. Fan, J. Yao, and T. Zhang, “Npu rgb+ d dataset and a
[195] T. Tang, J. Jia, and H. Mao, “Dance with melody: An lstm-autoencoder feature-enhanced lstm-dgcn method for action recognition of basketball
approach to music-oriented dance synthesis,” in Proceedings of the 26th players,” Applied Sciences, vol. 11, no. 10, p. 4426, 2021.
ACM international conference on Multimedia, 2018, pp. 1598–1606. [217] D. Wu, H. Zhao, X. Bao, and R. P. Wildes, “Sports video analysis on
[196] N. Yalta, S. Watanabe, K. Nakadai, and T. Ogata, “Weakly-supervised large-scale data,” in ECCV, Oct. 2022.
deep recurrent neural networks for basic dance step generation,” in [218] W. Menapace, S. Lathuiliere, A. Siarohin, C. Theobalt, S. Tulyakov,
2019 International Joint Conference on Neural Networks (IJCNN). V. Golyanik, and E. Ricci, “Playable environments: Video manipulation
IEEE, 2019, pp. 1–8. in space and time,” in Proceedings of the IEEE/CVF Conference on
[197] W. Zhuang, Y. Wang, J. Robinson, C. Wang, M. Shao, Y. Fu, and S. Xia, Computer Vision and Pattern Recognition, 2022, pp. 3584–3593.
“Towards 3d dance motion synthesis and control,” arXiv preprint [219] C. Xu, Y. Fu, B. Zhang, Z. Chen, Y.-G. Jiang, and X. Xue, “Learning
arXiv:2006.05743, 2020. to score figure skating sport videos,” IEEE transactions on circuits and
[198] H.-K. Kao and L. Su, “Temporally guided music-to-body-movement systems for video technology, vol. 30, no. 12, pp. 4578–4590, 2019.
generation,” in Proceedings of the 28th ACM International Conference [220] S. Wang, D. Yang, P. Zhai, C. Chen, and L. Zhang, “Tsa-net: Tube
on Multimedia, 2020, pp. 147–155. self-attention network for action quality assessment,” in Proceedings
[199] H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, of the 29th ACM International Conference on Multimedia, 2021, pp.
and J. Kautz, “Dancing to music,” Advances in neural information 4902–4910.
processing systems, vol. 32, 2019. [221] P. Parmar and B. T. Morris, “What and how well you performed? a
[200] G. Sun, Y. Wong, Z. Cheng, M. S. Kankanhalli, W. Geng, and X. Li, multitask learning approach to action quality assessment,” in Proceed-
“Deepdance: music-to-dance motion choreography with adversarial ings of the IEEE/CVF Conference on Computer Vision and Pattern
learning,” IEEE Transactions on Multimedia, vol. 23, pp. 497–509, Recognition, 2019, pp. 304–313.
2020. [222] J. Xu, Y. Rao, X. Yu, G. Chen, J. Zhou, and J. Lu, “Finediving: A
[201] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, and D. Jiang, “Dance fine-grained dataset for procedure-aware action quality assessment,” in
revolution: Long-term dance generation with music via curriculum Proceedings of the IEEE/CVF Conference on Computer Vision and
learning,” arXiv preprint arXiv:2006.06119, 2020. Pattern Recognition, 2022, pp. 2949–2958.
[202] R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “Ai choreographer: [223] W. McNally, K. Vats, T. Pinto, C. Dulhanty, J. McPhee, and A. Wong,
Music conditioned 3d dance generation with aist++,” 2021. “Golfdb: A video database for golf swing sequencing,” in Proceedings
[203] H. Ahn, J. Kim, K. Kim, and S. Oh, “Generative autoregressive of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
networks for 3d dancing move synthesis from music,” IEEE Robotics nition Workshops, 2019, pp. 0–0.
and Automation Letters, vol. 5, no. 2, pp. 3501–3508, 2020. [224] A. Piergiovanni and M. S. Ryoo, “Fine-grained activity recognition in
[204] Z. Ye, H. Wu, J. Jia, Y. Bu, W. Chen, F. Meng, and Y. Wang, baseball videos,” in Proceedings of the ieee conference on computer
“Choreonet: Towards music to dance synthesis with choreographic vision and pattern recognition workshops, 2018, pp. 1740–1748.
action unit,” in Proceedings of the 28th ACM International Conference [225] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
on Multimedia, 2020, pp. 744–752. L. Fei-Fei, “Large-scale video classification with convolutional neural
[205] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, and E. Ricci, networks,” in Proceedings of the IEEE conference on Computer Vision
“Playable video generation,” in Proceedings of the IEEE Conference on and Pattern Recognition, 2014, pp. 1725–1732.
18

[226] H. Pirsiavash, C. Vondrick, and A. Torralba, “Assessing the quality of [246] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and
actions,” in Computer Vision–ECCV 2014: 13th European Conference, L. Fei-Fei, “Detecting events and key actors in multi-person videos,”
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. in Proceedings of the IEEE conference on computer vision and pattern
Springer, 2014, pp. 556–571. recognition, 2016, pp. 3043–3053.
[227] S. M. Safdarnejad, X. Liu, L. Udpa, B. Andrus, J. Wood, and [247] S. Francia, S. Calderara, and D. F. Lanzi, “Classificazione di azioni
D. Craven, “Sports videos in the wild (svw): A video dataset for sports cestistiche mediante tecniche di deep learning,” URL: https://www.
analysis,” in 2015 11th IEEE International Conference and Workshops researchgate. net/publication/330534530_Classificazione_di_Azioni_
on Automatic Face and Gesture Recognition (FG), vol. 1. IEEE, 2015, Cestistiche_mediante_Tecniche_di_Deep_Learning, 2018.
pp. 1–7. [248] X. Gu, X. Xue, and F. Wang, “Fine-grained action recognition on a
[228] P. Parmar and B. Tran Morris, “Learning to score olympic events,” in novel basketball dataset,” in ICASSP 2020-2020 IEEE International
Proceedings of the IEEE conference on computer vision and pattern Conference on Acoustics, Speech and Signal Processing (ICASSP).
recognition workshops, 2017, pp. 20–28. IEEE, 2020, pp. 2563–2567.
[229] W. Zhang, Z. Liu, L. Zhou, H. Leung, and A. B. Chan, “Martial arts, [249] Y. Yan, N. Zhuang, B. Ni, J. Zhang, M. Xu, Q. Zhang, Z. Zhang,
dancing and sports dataset: A challenging stereo and multi-view dataset S. Cheng, Q. Tian, Y. Xu et al., “Fine-grained video captioning via
for 3d human pose estimation,” Image and Vision Computing, vol. 61, graph-based multi-granularity interaction learning,” IEEE transactions
pp. 22–39, 2017. on pattern analysis and machine intelligence, vol. 44, no. 2, pp. 666–
[230] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei- 683, 2019.
Fei, “Every moment counts: Dense detailed labeling of actions in [250] L. Zhu, K. Rematas, B. Curless, S. Seitz, and I. Kemelmacher-
complex videos,” International Journal of Computer Vision, vol. 126, Shlizerman, “Reconstructing nba players,” in Proceedings of the Euro-
pp. 375–389, 2018. pean Conference on Computer Vision (ECCV), August 2020.
[231] P. Parmar and B. Morris, “Action quality assessment across multiple [251] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori,
actions,” in 2019 IEEE winter conference on applications of computer “A hierarchical deep temporal model for group activity recognition,”
vision (WACV). IEEE, 2019, pp. 1468–1476. in Proceedings of the IEEE conference on computer vision and pattern
[232] C. Zalluhoglu and N. Ikizler-Cinbis, “Collective sports: A multi- recognition, 2016, pp. 1971–1980.
task dataset for collective activity recognition,” Image and Vision [252] Ibrahim, Mostafa S and Muralidharan, Srikanth and Deng, Zhiwei and
Computing, vol. 94, p. 103870, 2020. Vahdat, Arash and Mori, Greg, “Hierarchical deep temporal models
[233] Y. Li, L. Chen, R. He, Z. Wang, G. Wu, and L. Wang, “Multisports: for group activity recognition,” CoRR, vol. abs/1607.02643, 2016.
A multi-person video dataset of spatio-temporally localized sports [Online]. Available: http://arxiv.org/abs/1607.02643
actions,” in Proceedings of the IEEE/CVF International Conference [253] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, and R. Suk-
on Computer Vision, 2021, pp. 13 536–13 545. thankar, “Violence detection in video using computer vision tech-
[234] A. Nibali, J. Millward, Z. He, and S. Morgan, “Aspset: An outdoor niques,” in Computer Analysis of Images and Patterns: 14th Inter-
sports pose video dataset with 3d keypoint annotations,” Image and national Conference, CAIP 2011, Seville, Spain, August 29-31, 2011,
Vision Computing, vol. 111, p. 104196, 2021. Proceedings, Part II 14. Springer, 2011, pp. 332–339.
[254] K. Vats, P. Walters, M. Fani, D. A. Clausi, and J. Zelek, “Player tracking
[235] J. Chung, C.-h. Wuu, H.-r. Yang, Y.-W. Tai, and C.-K. Tang, “Haa500:
and identification in ice hockey,” arXiv preprint arXiv:2110.03090,
Human-centric atomic action dataset with curated videos,” in Proceed-
2021.
ings of the IEEE/CVF International Conference on Computer Vision,
[255] T. De Campos, M. Barnard, K. Mikolajczyk, J. Kittler, F. Yan,
2021, pp. 13 465–13 474.
W. Christmas, and D. Windridge, “An evaluation of bags-of-words and
[236] X. Chen, A. Pang, W. Yang, Y. Ma, L. Xu, and J. Yu,
spatio-temporal shapes for action recognition,” in 2011 IEEE Workshop
“Sportscap: Monocular 3d human motion capture and fine-grained
on Applications of Computer Vision (WACV). IEEE, 2011, pp. 344–
understanding in challenging sports videos,” International Journal of
351.
Computer Vision, Aug 2021. [Online]. Available: https://doi.org/10.
[256] S. Gourgari, G. Goudelis, K. Karpouzis, and S. Kollias, “Thetis: Three
1007/s11263-021-01486-4
dimensional tennis shots a human action dataset,” in Proceedings of
[237] P. Parmar and B. Morris, “Win-fail action recognition,” in Proceedings the IEEE Conference on Computer Vision and Pattern Recognition
of the IEEE/CVF Winter Conference on Applications of Computer Workshops, 2013, pp. 676–681.
Vision, 2022, pp. 161–171. [257] H. Faulkner and A. Dick, “Tenniset: a dataset for dense fine-grained
[238] C. K. Ingwersen, C. Mikkelstrup, J. N. Jensen, M. R. Hannemose, event recognition, localisation and description,” in 2017 International
and A. B. Dahl, “Sportspose: A dynamic 3d sports pose dataset,” in Conference on Digital Image Computing: Techniques and Applications
Proceedings of the IEEE/CVF International Workshop on Computer (DICTA). IEEE, 2017, pp. 1–8.
Vision in Sports, 2023. [258] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, and E. Ricci,
[239] Y. Cui, C. Zeng, X. Zhao, Y. Yang, G. Wu, and L. Wang, “Sportsmot: “Playable video generation,” in Proceedings of the IEEE/CVF Confer-
A large multi-object tracking dataset in multiple sports scenes,” arXiv ence on Computer Vision and Pattern Recognition, 2021, pp. 10 061–
preprint arXiv:2304.05170, 2023. 10 070.
[240] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, and P. L. Mazzeo, “A [259] P.-E. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier, “Sport action
semi-automatic system for ground truth generation of soccer video recognition with siamese spatio-temporal cnns: Application to table
sequences,” in 2009 Sixth IEEE International Conference on Advanced tennis,” in 2018 International Conference on Content-Based Multime-
Video and Signal Based Surveillance. IEEE, 2009, pp. 559–564. dia Indexing (CBMI). IEEE, 2018, pp. 1–6.
[241] S. A. Pettersen, D. Johansen, H. Johansen, V. Berg-Johansen, V. R. [260] K. M. Kulkarni and S. Shenoy, “Table tennis stroke recognition
Gaddam, A. Mortensen, R. Langseth, C. Griwodz, H. K. Stensland, using two-dimensional human pose estimation,” in Proceedings of the
and P. Halvorsen, “Soccer video and player position dataset,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Proceedings of the 5th ACM Multimedia Systems Conference, 2014, 2021, pp. 4576–4584.
pp. 18–23. [261] J. Bian, Q. Wang, H. Xiong, J. Huang, C. Liu, X. Li, J. Cheng, J. Zhao,
[242] K. Lu, J. Chen, J. J. Little, and H. He, “Light cascaded convolu- F. Lu, and D. Dou, “P2a: A dataset and benchmark for dense action
tional neural networks for accurate player detection,” arXiv preprint detection from table tennis match broadcasting videos,” arXiv preprint
arXiv:1709.10230, 2017. arXiv:2207.12730, 2022.
[243] J. Yu, A. Lei, Z. Song, T. Wang, H. Cai, and N. Feng, “Comprehensive [262] S. Zahan, G. M. Hassan, and A. Mian, “Learning sparse temporal
dataset of broadcast soccer videos,” in 2018 IEEE Conference on video mapping for action quality assessment in floor gymnastics,” arXiv
Multimedia Information Processing and Retrieval (MIPR). IEEE, preprint arXiv:2301.06103, 2023.
2018, pp. 418–423. [263] A. Ghosh, S. Singh, and C. Jawahar, “Towards structured analysis
[244] J. Qi, J. Yu, T. Tu, K. Gao, Y. Xu, X. Guan, X. Wang, Y. Dong, of broadcast badminton videos,” in 2018 IEEE Winter Conference on
B. Xu, L. Hou et al., “Goal: A challenging knowledge-grounded video Applications of Computer Vision (WACV). IEEE, 2018, pp. 296–304.
captioning benchmark for real-time soccer commentary generation,” [264] Z. T. L. Shan, “Fineskating: A high-quality figure skating dataset
arXiv preprint arXiv:2303.14655, 2023. and multi-task approach for sport action,” Peng Cheng Laboratory
[245] C. De Vleeschouwer, F. Chen, D. Delannay, C. Parisot, C. Chaudy, Commumications, vol. 1, no. 3, p. 107, 2020.
E. Martrou, A. Cavallaro et al., “Distributed video acquisition and [265] S. Liu, X. Liu, G. Huang, L. Feng, L. Hu, D. Jiang, A. Zhang, Y. Liu,
annotation for sport-event summarization,” NEM summit, vol. 8, no. and H. Qiao, “Fsd-10: a dataset for competitive sports content analysis,”
10.1016, 2008. arXiv preprint arXiv:2002.03312, 2020.
19

[266] S. Liu, A. Zhang, Y. Li, J. Zhou, L. Xu, Z. Dong, and R. Zhang, Zhonghan Zhao received the BE degree from Com-
“Temporal segmentation of fine-gained semantic action: A motion- munication University of China. He is currently
centered figure skating dataset,” in Proceedings of the AAAI conference working toward the PhD degree with Zhejiang Uni-
on artificial intelligence, vol. 35, no. 3, 2021, pp. 2163–2171. versity - University of Illinois Urbana-Champaign
[267] Y. Li, Y. Li, and N. Vasconcelos, “Resound: Towards action recog- Institute, Zhejiang University. His research interests
nition without representation bias,” in Proceedings of the European include machine learning, reinforcement learning
Conference on Computer Vision (ECCV), 2018, pp. 513–528. and computer vision.
[268] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action mach a spatio-
temporal maximum average correlation height filter for action recog-
nition,” in 2008 IEEE conference on computer vision and pattern
recognition. IEEE, 2008, pp. 1–8.
[269] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d
points,” in 2010 IEEE computer society conference on computer vision
and pattern recognition-workshops. IEEE, 2010, pp. 9–14.
[270] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal structure
of decomposable motion segments for activity classification,” in Com-
puter Vision–ECCV 2010: 11th European Conference on Computer Wenhao Chai received the BE degree from Zhe-
Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, jiang University, China. He is currently working
Part II 11. Springer, 2010, pp. 392–405. toward the Master degree with University of Wash-
[271] J. Pers, “Cvbase 06 dataset: a dataset for development and testing of ington. His research interests include 3D human pose
computer vision based methods in sport environments,” SN, Ljubljana, estimation, generative models, and multi-modality
2005. learning.
[272] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa, “Robocup:
The robot world cup initiative,” in Proceedings of the first international
conference on Autonomous agents, 1997, pp. 340–347.
[273] JiDi, “Jidi olympics football,” https://github.com/jidiai/ai_lib/blob/
master/env/olympics_football.py, 2022.
[274] A. S. Azad, E. Kim, Q. Wu, K. Lee, I. Stoica, P. Abbeel,
A. Sangiovanni-Vincentelli, and S. A. Seshia, “Programmatic modeling
and generation of real-time strategic soccer environments for reinforce-
ment learning,” in Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 36, no. 6, 2022, pp. 6028–6036.
[275] J. Wang, J. Ma, K. Hu, Z. Zhou, H. Zhang, X. Xie, and Y. Wu, "Tac-trainer: A visual analytics system for iot-based racket sports training," IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 1, pp. 951–961, 2022.
[276] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, "Summary of chatgpt/gpt-4 research and perspective towards the future of large language models," 2023.
[277] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
[278] R. Deng, C. Cui, Q. Liu, T. Yao, L. W. Remedios, S. Bao, B. A. Landman, L. E. Wheless, L. A. Coburn, K. T. Wilson et al., "Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging," arXiv preprint arXiv:2304.04155, 2023.
[279] S. Roy, T. Wald, G. Koehler, M. R. Rokuss, N. Disch, J. Holzschuh, D. Zimmerer, and K. H. Maier-Hein, "Sam.md: Zero-shot medical image segmentation capabilities of the segment anything model," 2023.
[280] Y. Liu, J. Zhang, Z. She, A. Kheradmand, and M. Armand, "Samm (segment any medical model): A 3d slicer integration to sam," 2023.
[281] J. Z. Wu, Y. Ge, X. Wang, W. Lei, Y. Gu, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation," arXiv preprint arXiv:2212.11565, 2022.
[282] J. Liu, N. Saquib, Z. Chen, R. H. Kazi, L.-Y. Wei, H. Fu, and C.-L. Tai, "Posecoach: A customizable analysis and visualization system for video-based running coaching," IEEE Transactions on Visualization and Computer Graphics, 2022.
[283] Z. Zhao, S. Lan, and S. Zhang, "Human pose estimation based speed detection system for running on treadmill," in 2020 International Conference on Culture-oriented Science & Technology (ICCST). IEEE, 2020, pp. 524–528.
[284] T. Perrett, A. Masullo, D. Damen, T. Burghardt, I. Craddock, M. Mirmehdi et al., "Personalized energy expenditure estimation: Visual sensing approach with deep learning," JMIR Formative Research, vol. 6, no. 9, p. e33606, 2022.
[285] D. Radke and A. Orchard, "Presenting multiagent challenges in team sports analytics," arXiv preprint arXiv:2303.13660, 2023.

Shengyu Hao received the MS degree from Beijing University of Posts and Telecommunications, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include machine learning and computer vision.

Wenhao Hu received the BS degree from Zhejiang University, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include generative models and 3D reconstruction.

Guanhong Wang received the MS degree from Huaqiao University, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include deep learning and computer vision.
Shidong Cao received the BE degree from Beijing University of Posts and Telecommunications, China. He is currently working toward the MS degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include machine learning and computer vision.

Dr. Mingli Song received the Ph.D. degree in computer science from Zhejiang University, China, in 2006. He is currently a Professor with the Microsoft Visual Perception Laboratory, Zhejiang University. His research interests include face modeling and facial expression analysis. He received the Microsoft Research Fellowship in 2004.

Dr. Jenq-Neng Hwang received the BS and MS degrees, both in electrical engineering, from the National Taiwan University, Taipei, Taiwan, in 1981 and 1983, respectively. He then received his Ph.D. degree from the University of Southern California. In the summer of 1989, Dr. Hwang joined the Department of Electrical and Computer Engineering (ECE) of the University of Washington in Seattle, where he was promoted to Full Professor in 1999. He is the Director of the Information Processing Lab (IPL), which has won several AI City Challenge and BMTT Tracking awards in past years. Dr. Hwang has served as an associate editor for IEEE T-SP, T-NN, T-CSVT, T-IP, and the Signal Processing Magazine (SPM). He was the General Co-Chair of the 2021 IEEE World AI IoT Congress, as well as a Program Co-Chair of IEEE ICME 2016, ICASSP 1998, and ISCAS 2009. Dr. Hwang has been a Fellow of the IEEE since 2001.

Dr. Gaoang Wang joined the international campus of Zhejiang University as an Assistant Professor in September 2020. He is also an Adjunct Assistant Professor at UIUC. Gaoang Wang received a B.S. degree from Fudan University in 2013, an M.S. degree from the University of Wisconsin-Madison in 2015, and a Ph.D. degree from the Information Processing Laboratory of the Electrical and Computer Engineering department at the University of Washington in 2019. After that, he joined the Megvii US office in July 2019 as a research scientist working on multi-frame fusion. He then joined Wyze Labs in November 2019, working on deep neural network design for edge-cloud collaboration. His research interests include computer vision, machine learning, and artificial intelligence, spanning multi-object tracking, representation learning, and active learning. Gaoang Wang has published papers in many renowned journals and conferences, including IEEE T-IP, IEEE T-MM, IEEE T-CSVT, IEEE T-VT, CVPR, ICCV, ECCV, ACM MM, and IJCAI.