A Survey of Deep Learning in Sports Applications
2) Ball Localization: Ball localization provides crucial 3D positional information about the ball, offering comprehensive insight into its movement state [11]. This task involves estimating the ball's diameter in pixels within an image patch centered on the ball, and it finds applications in various aspects of game analytics [34], including automated offside detection in soccer [35], release point localization in basketball [36], and event spotting in table tennis [37].

Existing solutions often rely on multiple viewpoints [38]–[40] to triangulate the 2D positions of the ball detected in individual frames, providing robustness against the occlusions that are prevalent in team sports such as basketball or American football.

In single-view 3D ball localization, however, occlusion becomes a significant challenge. Most approaches resort to fitting 3D ballistic trajectories to the 2D detections [40], [41], limiting their effectiveness to detecting the ball during free fall, when it follows a ballistic path. Yet in many game situations the ball may be partially visible or fully occluded during free fall. Van et al. [36], [42] address these limitations by deviating from the assumptions of ballistic trajectory, time consistency, and clear visibility: they propose an image-based method that detects the ball's center and estimates its size within the image space, complementing the trajectory predictions offered by ballistic approaches. Additionally, there are also works on reconstructing 3D shuttle trajectories in badminton [43].

B. Player and Ball Tracking

Player and ball tracking is the process of consistently following and identifying the location and motion of objects across consecutive frames. This tracking operation is integral to an automated understanding of sports activities.

1) Player Tracking: Tracking players in the temporal dimension is immensely valuable for gathering player-specific statistics. Recent works [44], [45] utilize the SORT algorithm [46], which combines Kalman filtering with the Hungarian algorithm to associate overlapping bounding boxes. Additionally, Hurault et al. [47] employ a self-supervised approach, fine-tuning an object detection model trained on generic objects specifically for soccer player detection and tracking.

In player tracking, a common challenge arises from similar appearances, which make it difficult to associate detections and maintain identity consistency. Intuitively, integrating information from other tasks can assist tracking. Some works [48] explore patterns in jersey numbers, team classification, and pose-guided partial features to handle player identity switches, and correlate player IDs using the K-shortest-path algorithm. In dance scenarios, incorporating skeleton features from human pose estimation significantly improves tracking performance in challenging scenes with uniform costumes and diverse movements [49].

To address identity mismatches during occlusions, Naik et al. [44] utilize the difference in jersey color between teams and referees in soccer: they update color masks in the tracker module from frame to frame, assigning tracker IDs based on jersey color. Other works [45], [50] tackle occlusion issues using DeepSORT [51].

2) Ball Tracking: Accurately recognizing and tracking a high-speed, small ball from raw video poses significant challenges. Huang et al. [52] propose a heatmap-based deep learning network [53], [54] to identify the ball in a single frame and learn its flight patterns across consecutive frames. Furthermore, precise ball tracking is essential for assisting other tasks, such as recognizing spin actions in table tennis [55] by combining ball tracking information.

C. Player Re-identification

Player re-identification (ReID) is the task of matching and recognizing individuals across time and different views. In technical terms, it involves comparing an image of a person, referred to as the query, against a collection of other images within a large database, known as the gallery, taken from various camera viewpoints. In sports, the ReID task aims to re-identify players, coaches, and referees across images captured successively from moving cameras [36], [56]. Challenges such as similar appearances, occlusions, and the low resolution of player details in broadcast videos make player re-identification difficult.

Addressing these challenges, many approaches have focused on recognizing jersey numbers as a means of identifying players [22], [57], or have employed part-based classification techniques [58]. Recently, Teket et al. [59] proposed a real-time-capable pipeline for player detection and identification, using a Siamese network with a triplet loss to distinguish players from each other without relying on fixed classes or jersey numbers. An et al. [60] introduced a multi-granularity network with an attention mechanism for player ReID, while Habel et al. [61] utilized CLIP with an InfoNCE loss as the objective, focusing on class-agnostic approaches.

To address the issue of low-resolution player details in multi-view soccer match broadcast videos, Comandur et al. [56] proposed a model that re-identifies players by ranking replay frames based on their distance to a given action frame, incorporating a centroid loss, a triplet loss, and a cross-entropy loss to increase the margin between clusters.

In addition, some researchers have explored semi-supervised or weakly supervised methods. Maglo et al. [62] developed a semi-interactive system using a transformer-based architecture for player ReID. Similarly, in hockey, Vats et al. [63] employed a weakly supervised training approach with a cross-entropy loss to predict jersey numbers as a form of classification.

D. Player Instance Segmentation

Player instance segmentation aims at assigning pixel-level labels to each player. Here, occlusion is the key problem, especially in crowded regions, as in basketball [36]. Some works [64], [65] utilize an online, sport-specific copy-paste method [66] to address the occlusion issue.

Moreover, instance segmentation features can be used to distinguish players performing different actions in team sports [24], [67]. In hockey, Koshkina et al. [24] use Mask R-CNN [68] to detect and segment each person on the playing surface. Zhang et al. [67] utilize the segmentation task to enhance throw action recognition [67] and event spotting [37].
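The size-to-depth relation behind the ball localization task above (estimating the ball's diameter in pixels and inferring its 3D position) can be sketched with a standard pinhole-camera model. This is an illustrative sketch, not the method of any cited work; the function names, focal length, and ball diameter below are assumptions for the example.

```python
def ball_depth_from_diameter(d_px: float, ball_diameter_m: float, focal_px: float) -> float:
    """Depth Z (meters) of a ball whose image diameter is d_px pixels.

    Pinhole model: d_px = focal_px * D / Z  =>  Z = focal_px * D / d_px,
    where D is the ball's physical diameter in meters.
    """
    if d_px <= 0:
        raise ValueError("pixel diameter must be positive")
    return focal_px * ball_diameter_m / d_px


def ball_center_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project the detected image center (u, v) at a known depth
    into camera coordinates (X, Y, Z)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Example: a basketball (~0.24 m diameter) imaged 24 px wide by a camera
# with a 1000 px focal length lies about 10 m from the camera.
depth = ball_depth_from_diameter(24.0, 0.24, 1000.0)
```

This also illustrates why the single-view problem is ill-posed under occlusion: once the apparent diameter is unreliable, the depth estimate degrades with it.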
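The association step of SORT-style trackers described under Player Tracking (Kalman-predicted track boxes matched to new detections by bounding-box overlap) can be sketched as follows. This is a minimal illustration, not the SORT implementation of [46]: brute-force search over permutations stands in for the Hungarian algorithm and is only viable for a handful of boxes, and the Kalman prediction step is omitted.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def associate(track_boxes, det_boxes, min_iou=0.3):
    """One-to-one track/detection assignment maximizing total IoU.

    Brute force over detection permutations replaces the Hungarian
    algorithm for this sketch (only the first min(n_tracks, n_dets)
    tracks are considered; unmatched-track handling is elided).
    """
    if not track_boxes or not det_boxes:
        return []
    n = min(len(track_boxes), len(det_boxes))
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(det_boxes)), n):
        pairs = list(zip(range(n), perm))
        score = sum(iou(track_boxes[t], det_boxes[d]) for t, d in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return [(t, d) for t, d in best_pairs
            if iou(track_boxes[t], det_boxes[d]) >= min_iou]


# Two tracks, two detections arriving in swapped order:
# matches track 0 <-> detection 1 and track 1 <-> detection 0.
matches = associate([(0, 0, 10, 10), (20, 20, 30, 30)],
                    [(21, 21, 31, 31), (1, 1, 11, 11)])
```

The `min_iou` gate is what makes low-overlap pairings (e.g., after a long occlusion) fall through to the identity-recovery cues discussed above, such as jersey color or pose features.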
TABLE I: Deep learning models for sports comprehension. "IAR", "GAR", and "AQA" stand for Individual Action Recognition, Group Action Recognition, and Action Quality Assessment.
grained action recognition in sports that do not involve body-centric actions.

B. Group Action Recognition

Group activity recognition involves recognizing activities performed by multiple individuals or objects. It plays a significant role in automated human behavior analysis in various fields, including sports, healthcare, and surveillance. Unlike multi-player activity recognition, group/team action recognition focuses on identifying a single group action that arises from the collective actions and interactions of each player within the group. This poses greater challenges compared to individual action recognition and requires the integration of multiple computer vision techniques.

Due to the involvement of multiple players, modeling player interaction relations becomes essential in group action analysis. In general, actor interaction relations can be modeled using graph convolutional networks (GCNs) or Transformers. Transformer-based methods [124]–[129] often explicitly represent spatiotemporal relations and employ attention-based techniques to model individual relations for inferring the group activity. GCN-based methods [102], [130] construct relational graphs of the actors and simultaneously explore spatial and temporal actor interactions using graph convolution networks.

Among them, Yan et al. [126] construct separate spatial and temporal relation graphs to model actor relations. Gavrilyuk et al. [124] encode temporal information using I3D [111] and establish spatial relations among actors using a vanilla transformer. Li et al. [129] introduce a cluster attention mechanism. Dual-AI [131] proposes a dual-path role interaction framework for group behavior recognition, incorporating temporal encoding of the actors into the transformer architecture. Moreover, the use of simple multi-layer perceptrons (MLPs) for feature extraction in group activity analysis [132] is an emerging approach with great potential.

Beyond classification, some other works focus on recognizing specific actions through temporal localization. Several automated methods have been proposed to identify important actions in a game by analyzing camera shots or semantic information. Studies [133]–[135] have explored human activity localization in sports videos, salient game action identification [136], [137], and automatic identification and summarization of game highlights [138]–[140]. Recent methods focus mostly on soccer. For instance, Giancola et al. [141] introduce the task of accurately identifying and localizing specific actions within uncut soccer broadcast videos. More recently, innovative methodologies have emerged in this field, aiming to automate the process. Cioppa et al. [142] propose the application of a context-aware loss function to enhance model performance; they later demonstrated how integrating camera calibration and player localization features can improve spotting capabilities [32]. Hong et al. [143] propose an efficient end-to-end training approach, while Darwish et al. [144] utilize spatiotemporal encoders. Alternative strategies, such as graph-based techniques [145] and transformer-based methods [146], offer fresh perspectives, particularly in handling relational data and addressing long-range dependencies. Lastly, Soares et al. [147], [148] have highlighted the potential of anchor-based methods in precise action localization and categorization.

C. Action Quality Assessment

Action quality assessment (AQA) is a method used to evaluate and quantify the overall performance or proficiency of human actions based on the analysis of video or motion data. AQA takes into account criteria such as technique, speed, and control to assess the movement and assign a score, which can be used to guide training and rehabilitation programs. AQA
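The GCN-based modeling of actor relations described under Group Action Recognition (a relational graph over the players whose features are propagated along its edges) can be sketched with a single graph-convolution layer. This is a minimal NumPy sketch under assumed shapes, not the architecture of any cited method; the adjacency and feature sizes are invented for the example.

```python
import numpy as np


def gcn_layer(X, A, W):
    """One graph-convolution layer over actor features.

    X: (N, F) per-actor features, A: (N, N) symmetric actor adjacency,
    W: (F, F_out) learnable weights. Uses the common symmetric
    normalization D^-1/2 (A + I) D^-1/2 followed by a ReLU.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^-1/2 as a vector
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)         # aggregate, project, ReLU


# Example: 4 actors with 8-dim appearance features, connected when nearby.
# A group-activity head would pool the output H over the actor axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(8, 16))
H = gcn_layer(X, A, W)  # (4, 16) relation-aware actor features
```

Stacking such layers (or alternating spatial and temporal graphs, as in the methods above) lets each actor's representation absorb context from interacting players before the group-level prediction.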
motion. This enables the generation of multiple motions from the same audio, even with a deterministic model. Li et al. [202] propose a novel cross-modal transformer-based model that better preserves the correlation between music and 3D motion. This approach results in more realistic and globally translated long human motion.

E. Sport Video Synthesizing

The goal of artificially synthesizing sports videos is to generate realistic and immersive content, such as player movements or game scenarios. Early works in this field train models using annotated videos where each time step is labeled with the corresponding action. However, these approaches use a discrete representation of actions, which makes it challenging to define prior knowledge for real-world environments. Additionally, devising a suitable continuous action representation for an environment is also complex. To address the complexity of action representation in tennis, Menapace et al. [205] propose a discrete action representation. Building upon this idea, Huang et al. [206] model actions as a learned set of geometric transformations. Davtyan et al. [207] take a different approach by separating actions into a global shift component and a local discrete action component. More recent works in tennis have utilized a NeRF-based renderer [208], which allows for the representation of complex 3D scenes. Among these works, Menapace et al. [209] employ a text-based action representation that provides precise details about the specific ball-hitting action being performed and the destination of the ball.

V. DATASETS AND BENCHMARKS

In the era of deep learning, access to effective data is crucial for training and evaluating models. To facilitate this, we have compiled a list of commonly used public sports datasets, along with their corresponding details, as shown in Table II. Below, we provide a more detailed description of each dataset.

A. Soccer

In soccer, most video-based datasets target active tasks like player tracking and action recognition, while some datasets focus on field localization and registration, or on player depth maps and meshes.

Some datasets focus more on player detection and tracking. Soccer-ISSIA [240] is an early and relatively small dataset with player bounding box annotations. SVPP [241] provides a multi-sensor dataset that includes body sensor data and video data. Soccer Player [242] is specifically designed for player detection and tracking, while SoccerTrack [214] is a novel dataset with multi-view and super-high-definition recordings.

Other datasets like Football Action [137] and SoccerDB [211] benefit action recognition, and ComprehensiveSoccer [243] and SSET [210] can be used for various video analysis tasks, such as action classification, localization, and player detection. SoccerKicks [212] provides player pose estimation. GOAL [244] supports knowledge-grounded video captioning.

The SoccerNet series [33], [34], [141], [212] is the largest, including a variety of spatial annotations and cross-view correspondences. It covers multiple vision-based tasks: player understanding (player tracking and re-identification), broadcast video understanding (action spotting and video captioning), and field understanding (camera calibration).

In recent years, the combination of large-scale datasets and deep learning models has become increasingly popular in soccer tasks, raising the popularity of the SoccerNet series datasets [34], [141], [212]. Meanwhile, SoccerDB [211], SSET [210], and ComprehensiveSoccer [243] are more suitable for tasks that require player detection. However, there are few datasets like SoccerKick [213] for soccer player pose estimation. It is hoped that more attention can be paid to the recognition and understanding of player skeletal movements in the future.

B. Basketball

Basketball datasets have been developed for various tasks such as player and ball detection, action recognition, and pose estimation. APIDIS [40], [245] is a challenging dataset with annotations for player and ball positions, and clock and non-clock actions. Basket-1,2 [38] consists of two frame sequences for action recognition and ball detection. NCAA [246] is a large dataset with action categories and bounding boxes for player detection. SPIROUDOME [215] focuses on player detection and localization. BPAD [154] is a first-person-perspective dataset with labeled basketball events. SpaceJam [247] is for action recognition with estimated player poses. FineBasketball [248] is a fine-grained dataset with 3 broad and 26 fine-grained categories. NBA [126] is a dataset for group activity recognition, where each clip belongs to one of nine group activities; no individual annotations, such as separate action labels and bounding boxes, are provided. NPUBasketball [216] contains RGB frames, depth maps, and skeleton information for various types of action recognition models. DeepSportradar-v1 [36] is a multi-label dataset for 3D localization, calibration, and instance segmentation tasks.

For the captioning task, NSVA [217] is the largest open-source dataset in the basketball domain. Compared to SVN [249] and SVCDV [162], NSVA is publicly accessible and has the most sentences among the three datasets, with five times more videos than both SVN and SVCDV. Additionally, there are some special datasets that focus on reconstructing the player: the NBA2K dataset [250] includes body meshes and texture data of several NBA players.

C. Volleyball

Despite volleyball being a popular sport, only a few volleyball datasets are available, most of them small-scale. Volleyball-1,2 [38] contains two sequences with manually annotated ball positions. HierVolleyball [251] and its extension HierVolleyball-v2 [252] are developed for team activity recognition, with annotated player actions and positions. The Sports Video Captioning Dataset-Volleyball (SVCDV) [162] is a dataset for captioning tasks, with 55 videos from YouTube,
TABLE II: A list of video-based sports-related datasets used in the published papers. Note that some of them are not publicly available, and "multiple" means that the dataset contains various sports instead of only one specific type of sport. "det.", "cls.", "tra.", "ass.", "seg.", "loc.", "cal.", and "cap." stand for player/ball detection, action classification, player/ball tracking, action quality assessment, object segmentation, temporal action localization, camera calibration, and captioning, respectively.
each containing an average of 9.2 sentences. However, this dataset is not available for download.

D. Hockey

The Hockey Fight dataset [253] contains 1,000 video clips from National Hockey League (NHL) games for binary classification of fight and non-fight. The Player Tracklet dataset [254] consists of 84 video clips from NHL games with annotated bounding boxes and identity labels for players and referees, and is suitable for player tracking and identification.

E. Tennis

Various datasets have been constructed for tennis video analysis. ACASVA [255] is designed for tennis action recognition and consists of six broadcast videos of tennis games with labeled player positions and time boundaries of actions. THETIS [256] includes 1,980 self-recorded videos of 12 tennis actions with RGB, depth, 2D skeleton, and 3D skeleton videos, which can be used for multiple types of action recognition models. TenniSet [257] contains five Olympic tennis match videos with six labeled event categories and textual descriptions, making it suitable for recognition, localization, and action retrieval tasks.

It should be noted that some recent works focus more on generative tasks, like PVG [258], which obtained a tennis dataset through YouTube videos. PE-Tennis [218] builds upon PVG and introduces camera calibration derived from reconstruction, making it possible to edit the viewpoint. LGEs-Tennis [209] enables generation from text edits on player movement, shot type, and location.
G. Gymnastics

FineGym [114] is a recent work developed for gymnastic action recognition and localization. It contains 303 videos with a total length of around 708 hours and is annotated hierarchically, making it suitable for fine-grained action recognition and localization. On the other hand, AFG-Olympics [262] provides challenging scenarios with extensive background, viewpoint, and scale variations over an extended sample duration of up to 2 minutes; additionally, a discriminative attention module is proposed to embed long-range spatial and temporal correlation semantics.

H. Badminton

Badminton Olympic [263] provides annotations for player detection, point localization, action recognition, and localization tasks. It comprises 10 YouTube videos of singles badminton matches, each approximately an hour long. The dataset includes annotations for player positions, temporal locations of point wins, and time boundaries and labels of strokes. Meanwhile, Stroke Forecasting [177] contains 43,191 trimmed video clips of badminton strokes categorized into 10 types, which can be used for both action recognition and stroke forecasting.

I. Figure Skating

Five datasets have been proposed for figure skating action recognition in recent years. FineSkating [264] is a hierarchically labeled dataset of 46 videos of figure skating competitions for action recognition and action quality assessment. FSD-10 [265] comprises ten categories of figure skating actions and provides scores for action quality assessment. FisV-5 [107] is a dataset of 500 figure skating competition videos labeled with scores by 9 professional judges. FR-FS [109] is designed to recognize figure skating falls, with 417 videos containing the movements of take-off, rotation, and landing. MCFS [266] has three-level annotations of figure skating actions and their time boundaries, allowing for action recognition and localization.

J. Diving

There are three diving datasets available for action recognition and action quality assessment. Diving48 [267] contains 18,404 video segments covering 48 fine-grained categories of diving actions, making it a relatively low-bias dataset

K. Dance

The field of deep learning has several research tasks for dance, including music-oriented choreography, dance motion synthesis, and multiple object tracking. Researchers have proposed several datasets to promote research in this field. GrooveNet [194] consists of approximately 23 minutes of motion capture data recorded at 60 frames per second and four performances by a dancer. Dance with Melody [195] includes 40 complete dance choreographies for four types of dance, totaling 907,200 frames collected with optical motion capture equipment. EA-MUD [200] includes 104 video sequences of 12 dancing genres, while AIST++ [202] is a large-scale 3D human dance motion dataset with frame-level annotations including 9 views of camera intrinsic and extrinsic parameters, 17 COCO-format human joint locations in both 2D and 3D, and 24 SMPL pose parameters. These datasets can be used for tasks such as dance motion recognition, tracking, and quality assessment.

L. Sport-Related Datasets for General Purpose

There are several datasets for sports action recognition and assessment tasks, including UCF Sports [268], MSR Action3D [269], Olympic [270], Sports-1M [225], SVW [227], MultiSports [233], OlympicSports [226], OlympicScoring [228], and AQA [231]. These datasets cover different sports, including team sports and individual sports, and provide various annotations, such as action labels, quality scores, and bounding boxes.

Additionally, Win-Fail [237] is a dataset specifically designed for recognizing the outcome of actions, while SportsPose [238] is the largest markerless dataset for 3D human pose estimation in sports, containing 5 short sports-related activities recorded from 7 cameras, totaling 1.5 million frames. SportsMOT [239] is a large-scale, high-quality multi-object tracking dataset comprising detailed annotations for each player present on the field in diverse sports scenarios. These datasets provide valuable resources for researchers to develop and evaluate algorithms for various sports-related tasks.

M. Others

CVBASE Handball [271] and CVBASE Squash [271] are developed for handball and squash action recognition,
respectively, with annotated trajectories of players and action categories. GolfDB [223] facilitates the analysis of golf swings, providing 1,400 high-quality golf swing video segments, action labels, and bounding boxes of players. FenceNet [119] consists of 652 videos of expert-level fencers performing six categories of actions, with RGB frames, 3D skeleton data, and depth data provided. Rugby Sevens [62] is a public sports tracking dataset with tracking ground truth and the generated tracks. Lastly, MLB-YouTube [224] is introduced for fine-grained action recognition in baseball videos.

VI. VIRTUAL ENVIRONMENTS

Researchers can utilize virtual environments for simulation. In a virtual environment that provides agents with simulated motion tasks, diverse data can be continuously generated and retained in the simulation. For example, Fever Basketball [183] is an asynchronous environment which supports multiple characters, multiple positions, and both single-agent and multi-agent player control modes.

There are many virtual soccer games, such as rSoccer [178], the RoboCup Soccer Simulator [272], the DeepMind MuJoCo Multi-Agent Soccer Environment [179], [180], and JiDi Olympics Football [273]. rSoccer [178] and JiDi Olympics Football [273] are two toy football games in which players are simple rigid bodies that can only move and push the ball. In contrast, players in GFootball [181] have more complex actions, such as dribbling, sliding, and sprinting. Besides, environments like the RoboCup Soccer Simulator [272] and the DeepMind MuJoCo Multi-Agent Soccer Environment [179], [180] focus more on low-level control of a physics simulation of robots, while GFootball focuses more on developing high-level tactics. To improve flexibility and control over environment dynamics, SCENIC [274] is proposed to model and generate diverse scenarios in a real-time strategy environment programmatically.

VII. CHALLENGES

In recent years, deep learning has emerged as a powerful tool in the analysis and enhancement of sports performance. The application of these advanced techniques has revolutionized the way athletes, coaches, and teams approach training, strategy, and decision-making. By leveraging the vast amounts of data generated in sports, deep learning models have the potential to uncover hidden patterns, optimize performance, and provide valuable insights that can inform decision-making processes. However, despite its promising potential, the implementation of deep learning in sports performance faces several challenges that need to be addressed to fully realize its benefits.

a) Task Challenge: The complex and dynamic nature of sports activities presents unique challenges for computer vision tasks in tracking and recognizing athletes and their movements. Issues such as identity mismatch due to similar appearances [48], [49], blurring [52] caused by rapid motion, and occlusion [44], [45] from other players or objects in the scene can lead to inaccuracies and inconsistencies in tracking and analysis. Developing robust and adaptable algorithms that can effectively handle these challenges is essential to improve the performance and reliability of deep learning models in sports applications.

b) Datasets Standardization: Standardizing datasets for various sports is a daunting task, as each sport has unique technical aspects and rules that make it difficult to create a unified benchmark for specific tasks. Taking action recognition as an example: in diving [222], only the movement of the athlete needs to be focused on, with attention paid to the details of the actions, whereas in team sports such as volleyball [251], more attention is needed to distinguish and identify targets and to cluster the same actions after identification. Given the varying emphases of tasks, there are substantial differences in dataset requirements. Going further, action recognition within the same sport type involves nuanced differences in label classification, making it challenging to develop a one-size-fits-all solution or benchmark. The creation of standardized, user-friendly, open-source, high-quality, and large-scale datasets is crucial for advancing research and enabling fair comparisons between different models and approaches in sports performance analysis.

c) Data Utilization: The sports domain generates vast amounts of fine-grained data through sensors and IoT devices. However, current data processing methods primarily focus on computer vision and do not fully exploit the potential of end-to-end deep learning approaches. To fully harness the power of these rich data sources, researchers must develop methods that combine fine-grained sensor data with visual information. This fusion of diverse data streams can enable more comprehensive and insightful analysis, leading to significant advancements in the field of sports performance. Some studies have shown that introducing multi-modal data can benefit the analysis of athletic performance. For example, in table tennis, visual and IoT signals can be used simultaneously to analyze athlete performance [275]; in dance, visual and audio signals are both important [202]. More attention is needed on how to utilize such diverse data so as to achieve better fusion; meanwhile, multi-modal algorithms and datasets [202] are both necessary.

VIII. FUTURE TREND

The integration of deep learning methodologies into sports analytics can empower athletes, coaches, and teams with unprecedented insights into performance, decision-making, and injury prevention. This section explores the transformative impact of deep learning techniques on sports performance, focusing on data generation methods, multi-modality and multi-task models, foundation models, applications, and practicability.

a) Multi-modality and Multi-task: By harnessing the power of multi-modal data and multi-task learning, robust and versatile models capable of handling diverse and complex sports-related challenges can be built. Furthermore, we will investigate the potential of large-scale models in enhancing predictive and analytical capabilities, including practical applications and real-world implementations that can improve athlete performance and overall team dynamics. Ultimately, this work seeks to contribute to the growing body of research on deep learning in sports performance, paving the way for
novel strategies and technologies that can revolutionize the world of sports analytics.

b) Foundation Model: The popularity of ChatGPT has demonstrated the power of large language models [276], while the recent Segment Anything project showcases the impressive performance of large models in visual tasks [277]. The prompt-based paradigm is highly capable and flexible in natural language processing and even image segmentation, offering unprecedentedly rich functionality. For example, some recent work has leveraged Segment Anything in medical imaging [278]–[280], achieving promising results by providing point or bounding-box prompts for preliminary zero-shot capability assessment, demonstrating that the Segment Anything Model (SAM) has good generalization performance in medical imaging. Therefore, the development of large models in the sports domain should consider both how to combine existing large models to explore applications, and how to create large models specifically for the sports domain.

Combining large models requires considering the adaptability of the task. Compared to the medical field, sports involve a high level of human participation, inherently accommodating different levels and modalities of methods and data. We believe that both large language models in natural language processing and large image segmentation models in computer vision should have strong compatibility with sports. In short, we believe there is potential for exploring downstream tasks, such as using ChatGPT for performance evaluation and feedback: employing ChatGPT to generate natural language summaries of player or team performance, as well as to provide personalized feedback and recommendations for improvement.

Foundation models directly related to the sports domain require a vast amount of data corresponding to the specific tasks. For visual tasks, for example, it is essential to ensure good scalability, adopt a prompt-based paradigm, and maintain powerful capabilities while being flexible and offering richer functionality. It is important to note that large models do not necessarily imply a large number of parameters, but rather a strong ability to solve tasks. Recent work on Segment Anything has proven that even relatively simple models can achieve excellent performance when the data volume is sufficiently large. Therefore, creating large-scale, high-quality datasets in

sports for everyone. There are already some works [282]–[284] focusing on sports performance analysis, data-recording visualization, energy expenditure estimation, and many other aspects. At the same time, in professional sports, there are also some works [16], [275] that focus on combining various data and methods to help improve athletic performance. Broadly speaking, in both daily life and professional fields, there is a need for more applications relating to health and fitness assessments.

e) Practicability: In more challenging, high-level tasks with real-world applications, practicality becomes increasingly important. Many practical challenges remain unexplored or under-explored in applying deep learning to sports performance. In decision-making, for example, current solutions often rely on simulation-based approaches. However, multi-agent decision-making techniques hold great potential for enhancing real-world sports decision-making. Tasks such as ad-hoc teamwork [285] in multi-agent systems and zero-shot human-machine interaction are crucial for enabling effective and practical real-world applications. Further research is needed to bridge the gap between theoretical advancements and their practical implications in sports performance analysis and decision-making. For example, RoboCup [272] aims to defeat human players in the World Cup by 2050. This complex task requires robots to perceive their environment, gather information, understand it, and execute specific actions. Such agents must exhibit sufficient generalization, engage in extensive human-machine interaction, and respond quickly to performance and environmental changes in real time.

IX. CONCLUSION

In this paper, we present a comprehensive survey of deep learning in sports, focusing on four main aspects: algorithms, datasets, challenges, and future works. We innovatively summarize the taxonomy and divide methods into perception, comprehension, and decision, from low-level to high-level tasks. In the challenges and future works, we provide cutting-edge methods and give insights into the future trends and challenges of deep learning in sports.

ACKNOWLEDGMENTS
the sports domain remains a crucial task. This work is supported by National Key R&D Program of
c) Data Generation: High-quality generated data can China under Grant No.2022ZD0162000, and National Natural
significantly reduce manual labor costs while demonstrating Science Foundation of China No.62106219.
the diversity that generative models can bring. Many stud-
ies [202], [281] have focused on generating sports videos, of- R EFERENCES
fering easily editable, high-quality generation methods, which
[1] N. Chmait and H. Westerbeek, “Artificial intelligence and machine
are elaborated upon in the relevant Section IV-D and IV-E. learning in sport research: An introduction for non-data scientists,”
Meanwhile, by combining large models, additional annotation Frontiers in Sports and Active Living, p. 363, 2021.
work can be performed at this stage, and if possible, new [2] “Smt,” https://www.smt.com/.
[3] “vizrt,” https://www.vizrt.com/.
usable data can be generated. [4] A. Duarte, C. Micael, S. Ludovic, S. Hugo, and D. Keith, Artificial
d) Applications: Though there are many excellent auto- Intelligence in Sport Performance Analysis, 2021.
matic algorithms for different tasks in the field of sports, they [5] E. E. Cust, A. J. Sweeting, K. Ball, and S. Robertson, “Machine and
deep learning for sport-specific movement recognition: a systematic
are still insufficient when it comes to deployment for specific review of model development and performance,” Journal of sports
tasks. In the daily exercise of ordinary people, who generally sciences, vol. 37, no. 5, pp. 568–600, 2019.
lack professional guidance, there should be more applications [6] R. P. Bonidia, L. A. Rodrigues, A. P. Avila-Santos, D. S. Sanches, J. D.
Brancher et al., “Computational intelligence in sports: A systematic
that make good use of these deep learning algorithms, and literature review,” Advances in Human-Computer Interaction, vol.
use more user-friendly and intelligent methods to promote 2018, 2018.
[7] R. Beal, T. J. Norman, and S. D. Ramchurn, “Artificial intelligence for team sports: a survey,” The Knowledge Engineering Review, vol. 34, p. e28, 2019.
[8] D. Tan, H. Ting, and S. Lau, “A review on badminton motion analysis,” in 2016 International Conference on Robotics, Automation and Sciences (ICORAS). IEEE, 2016, pp. 1–4.
[9] F. Wu, Q. Wang, J. Bian, N. Ding, F. Lu, J. Cheng, D. Dou, and H. Xiong, “A survey on video action recognition in sports: Datasets, methods and applications,” IEEE Transactions on Multimedia, 2022.
[10] S. Wang, D. Yang, P. Zhai, Q. Yu, T. Suo, Z. Sun, K. Li, and L. Zhang, “A survey of video-based action quality assessment,” in 2021 International Conference on Networking Systems of AI (INSAI). IEEE, 2021, pp. 1–9.
[11] P. R. Kamble, A. G. Keskar, and K. M. Bhurchandi, “Ball tracking in sports: a survey,” Artificial Intelligence Review, vol. 52, no. 3, pp. 1655–1705, 2019.
[12] Y. Adesida, E. Papi, and A. H. McGregor, “Exploring the role of wearable technology in sport kinematics and kinetics: A systematic review,” Sensors, vol. 19, no. 7, p. 1597, 2019.
[13] M. Rana and V. Mittal, “Wearable sensors for real-time kinematics analysis in sports: a review,” IEEE Sensors Journal, vol. 21, no. 2, pp. 1187–1207, 2020.
[14] E. Van der Kruk and M. M. Reijne, “Accuracy of human motion capture systems for sport applications; state-of-the-art review,” European Journal of Sport Science, vol. 18, no. 6, pp. 806–819, 2018.
[15] A. M. Turing, Computing machinery and intelligence. Springer, 2009.
[16] J. Wang, K. Qiu, H. Peng, J. Fu, and J. Zhu, “Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 374–382.
[17] U. Rao and U. C. Pati, “A novel algorithm for detection of soccer ball and player,” in 2015 International Conference on Communications and Signal Processing (ICCSP). IEEE, 2015, pp. 0344–0348.
[18] Y. Yang, M. Xu, W. Wu, R. Zhang, and Y. Peng, “3d multiview basketball players detection and localization based on probabilistic occupancy,” in 2018 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2018, pp. 1–8.
[19] M. Şah and C. Direkoğlu, “Evaluation of image representations for player detection in field sports using convolutional neural networks,” in 13th International Conference on Theory and Application of Fuzzy Systems and Soft Computing—ICAFS-2018 13. Springer, 2019, pp. 107–115.
[20] S. Gerke, A. Linnemann, and K. Müller, “Soccer player recognition using spatial constellation features and jersey number recognition,” Computer Vision and Image Understanding, vol. 159, pp. 105–115, 2017.
[21] G. Li, S. Xu, X. Liu, L. Li, and C. Wang, “Jersey number recognition with semi-supervised spatial transformer network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1783–1790.
[22] H. Liu and B. Bhanu, “Pose-guided r-cnn for jersey number recognition in sports,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[23] M. Istasse, J. Moreau, and C. De Vleeschouwer, “Associative embedding for team discrimination,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[24] M. Koshkina, H. Pidaparthy, and J. H. Elder, “Contrastive learning for sports video: Unsupervised player classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4528–4536.
[25] M. Manafifard, H. Ebadi, and H. A. Moghaddam, “A survey on player tracking in soccer videos,” Computer Vision and Image Understanding, vol. 159, pp. 19–46, 2017.
[26] R. Theagarajan, F. Pala, X. Zhang, and B. Bhanu, “Soccer: Who has the ball? generating visual analytics and player statistics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1749–1757.
[27] A. Arbues-Sanguesa, A. Martín, J. Fernández, C. Ballester, and G. Haro, “Using player’s body-orientation to model pass feasibility in soccer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 886–887.
[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 91–99.
[29] A. Cioppa, A. Deliege, M. Istasse, C. De Vleeschouwer, and M. Van Droogenbroeck, “Arthus: Adaptive real-time human segmentation in sports through online distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[30] R. Vandeghen, A. Cioppa, and M. Van Droogenbroeck, “Semi-supervised training to improve player and ball detection in soccer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3481–3490.
[31] R. Sanford, S. Gorji, L. G. Hafemann, B. Pourbabaee, and M. Javan, “Group activity detection from trajectory and video data in soccer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 898–899.
[32] A. Cioppa, A. Deliege, F. Magera, S. Giancola, O. Barnich, B. Ghanem, and M. Van Droogenbroeck, “Camera calibration and player localization in soccernet-v2 and investigation of their representations for action spotting,” in Proceedings of the IEEE/CVF Conference on CVPR, 2021, pp. 4537–4546.
[33] A. Cioppa, S. Giancola, A. Deliege, L. Kang, X. Zhou, Z. Cheng, B. Ghanem, and M. Van Droogenbroeck, “Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3491–3502.
[34] A. Cioppa, A. Deliège, S. Giancola, B. Ghanem, and M. Van Droogenbroeck, “Scaling up soccernet with multi-view spatial localization and re-identification,” Scientific Data, vol. 9, no. 1, p. 355, 2022.
[35] I. Uchida, A. Scott, H. Shishido, and Y. Kameda, “Automated offside detection by spatio-temporal analysis of football videos,” in Proceedings of the 4th International Workshop on Multimedia Content Analysis in Sports, 2021, pp. 17–24.
[36] G. Van Zandycke, V. Somers, M. Istasse, C. D. Don, and D. Zambrano, “Deepsportradar-v1: Computer vision dataset for sports understanding with high quality annotations,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 1–8.
[37] R. Voeikov, N. Falaleev, and R. Baikulov, “Ttnet: Real-time temporal and spatial video analysis of table tennis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 884–885.
[38] A. Maksai, X. Wang, and P. Fua, “What players do with the ball: A physically constrained interaction modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 972–981.
[39] X. Cheng, N. Ikoma, M. Honda, and T. Ikenaga, “Simultaneous physical and conceptual ball state estimation in volleyball game analysis,” in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
[40] P. Parisot and C. De Vleeschouwer, “Consensus-based trajectory estimation for ball detection in calibrated cameras systems,” Journal of Real-Time Image Processing, vol. 16, no. 5, pp. 1335–1350, 2019.
[41] J. Sköld, “Estimating 3d-trajectories from monocular video sequences,” 2015.
[42] G. Van Zandycke and C. De Vleeschouwer, “3d ball localization from a single calibrated image,” in Proceedings of the IEEE/CVF Conference on CVPR, 2022, pp. 3472–3480.
[43] P. Liu and J.-H. Wang, “Monotrack: Shuttle trajectory reconstruction from monocular badminton video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3513–3522.
[44] B. T. Naik, M. F. Hashmi, Z. W. Geem, and N. D. Bokde, “Deepplayer-track: player and referee tracking with jersey color recognition in soccer,” IEEE Access, vol. 10, pp. 32494–32509, 2022.
[45] B. T. Naik and M. F. Hashmi, “Yolov3-sort: detection and tracking player/ball in soccer sport,” Journal of Electronic Imaging, vol. 32, no. 1, pp. 011003–011003, 2023.
[46] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3464–3468.
[47] S. Hurault, C. Ballester, and G. Haro, “Self-supervised small soccer player detection and tracking,” in Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports, 2020, pp. 9–18.
[48] R. Zhang, L. Wu, Y. Yang, W. Wu, Y. Chen, and M. Xu, “Multi-camera multi-player tracking with deep player identification in sports video,” Pattern Recognition, vol. 102, p. 107260, 2020.
[49] P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, and P. Luo, “Dancetrack: Multi-object tracking in uniform appearance and diverse motion,” arXiv preprint arXiv:2111.14690, 2021.
[50] M. Buric, M. Ivasic-Kos, and M. Pobar, “Player tracking in sports videos,” in 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2019, pp. 334–340.
[51] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649.
[52] Y.-C. Huang, I.-N. Liao, C.-H. Chen, T.-U. İk, and W.-C. Peng, “Tracknet: A deep learning network for tracking high-speed and tiny objects in sports applications,” in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8.
[53] V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017, pp. 468–475.
[54] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913–1921.
[55] S. Schwarcz, P. Xu, D. D’Ambrosio, J. Kangaspunta, A. Angelova, H. Phan, and N. Jaitly, “Spin: A high speed, high resolution vision dataset for tracking and action recognition in ping pong,” arXiv preprint arXiv:1912.06640, 2019.
[56] B. Comandur, “Sports re-id: Improving re-identification of players in broadcast videos of team sports,” arXiv preprint arXiv:2206.02373, 2022.
[57] A. Nady and E. E. Hemayed, “Player identification in different sports,” in VISIGRAPP, 2021.
[58] A. Senocak, T.-H. Oh, J. Kim, and I. S. Kweon, “Part-based player identification using deep convolutional representation and multi-scale pooling,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1813–18137.
[59] O. M. Teket and I. S. Yetik, “A fast deep learning based approach for basketball video analysis,” in Proceedings of the 2020 4th International Conference on Vision, Image and Signal Processing, ser. ICVISP 2020. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3448823.3448882
[60] Q. An, K. Cui, R. Liu, C. Wang, M. Qi, and H. Ma, “Attention-aware multiple granularities network for player re-identification,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 137–144.
[61] K. Habel, F. Deuser, and N. Oswald, “Clip-reident: Contrastive training for player re-identification,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 129–135.
[62] A. Maglo, A. Orcesi, and Q.-C. Pham, “Efficient tracking of team sport players with few game-specific annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3461–3471.
[63] K. Vats, W. McNally, P. Walters, D. A. Clausi, and J. S. Zelek, “Ice hockey player identification via transformers,” arXiv preprint arXiv:2111.11535, 2021.
[64] B. Yan, Y. Li, X. Zhao, and H. Wang, “Dual data augmentation method for data-deficient and occluded instance segmentation,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 117–120.
[65] B. Yan, F. Qi, Z. Li, Y. Li, and H. Wang, “Strong instance segmentation pipeline for mmsports challenge,” arXiv preprint arXiv:2209.13899, 2022.
[66] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, “Simple copy-paste is a strong data augmentation method for instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
[67] C. Zhang, M. Wang, and L. Zhou, “Recognition method of basketball players’ throwing action based on image segmentation,” International Journal of Biometrics, vol. 15, no. 2, pp. 121–133, 2023.
[68] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
[69] W. Chai, Z. Jiang, J.-N. Hwang, and G. Wang, “Global adaptation meets local generalization: Unsupervised domain adaptation for 3d human pose estimation,” arXiv preprint arXiv:2303.16456, 2023.
[70] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2021.
[71] N. Promrit and S. Waijanya, “Model for practice badminton basic skills by using motion posture detection from video posture embedding and one-shot learning technique,” in Proceedings of the 2019 2nd Artificial Intelligence and Cloud Computing Conference, 2019, pp. 117–124.
[72] S. Suda, Y. Makino, and H. Shinoda, “Prediction of volleyball trajectory using skeletal motions of setter player,” in Proceedings of the 10th Augmented Human International Conference 2019, 2019, pp. 1–8.
[73] T. Shimizu, R. Hachiuma, H. Saito, T. Yoshikawa, and C. Lee, “Prediction of future shot direction using pose and position of tennis player,” in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 59–66.
[74] E. Wu and H. Koike, “Futurepong: Real-time table tennis trajectory forecasting using pose prediction network,” in Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–8.
[75] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” 2014.
[76] M. Einfalt, C. Dampeyrou, D. Zecha, and R. Lienhart, “Frame-level event detection in athletics videos with pose-based convolutional sequence networks,” in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 42–50.
[77] H. Thilakarathne, A. Nibali, Z. He, and S. Morgan, “Pose is all you need: The pose only group activity recognition system (pogars),” Machine Vision and Applications, vol. 33, no. 6, p. 95, 2022.
[78] A. Tharatipyakul, K. T. Choo, and S. T. Perrault, “Pose estimation for facilitating movement learning from online videos,” in Proceedings of the International Conference on Advanced Visual Interfaces, 2020, pp. 1–5.
[79] E. W. Trejo and P. Yuan, “Recognition of yoga poses through an interactive system with kinect based on confidence value,” in 2018 3rd International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2018, pp. 606–611.
[80] D. Farin, S. Krabbe, W. Effelsberg et al., “Robust camera calibration for sport videos using court models,” in Storage and Retrieval Methods and Applications for Multimedia 2004, vol. 5307. SPIE, 2003, pp. 80–91.
[81] Q. Yao, A. Kubota, K. Kawakita, K. Nonaka, H. Sankoh, and S. Naito, “Fast camera self-calibration for synthesizing free viewpoint soccer video,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 1612–1616.
[82] N. Homayounfar, S. Fidler, and R. Urtasun, “Sports field localization via deep structured models,” in Proceedings of the IEEE Conference on CVPR, 2017, pp. 5212–5220.
[83] J. Chen and J. J. Little, “Sports camera calibration via synthetic data,” in Proceedings of the IEEE/CVF Conference on CVPR Workshops, 2019, pp. 0–0.
[84] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, “End-to-end camera calibration for broadcast videos,” in Proceedings of the IEEE/CVF Conference on CVPR, 2020, pp. 13627–13636.
[85] X. Nie, S. Chen, and R. Hamid, “A robust and efficient framework for sports-field registration,” in Winter Conference on Applications of Computer Vision, WACV. IEEE, 2021, pp. 1935–1943. [Online]. Available: https://doi.org/10.1109/WACV48630.2021.00198
[86] F. Shi, P. Marchwica, J. C. G. Higuera, M. Jamieson, M. Javan, and P. Siva, “Self-supervised shape alignment for sports field registration,” in Winter Conference on Applications of Computer Vision, WACV. IEEE, 2022, pp. 3768–3777. [Online]. Available: https://doi.org/10.1109/WACV51458.2022.00382
[87] Y.-J. Chu, J.-W. Su, K.-W. Hsiao, C.-Y. Lien, S.-H. Fan, M.-C. Hu, R.-R. Lee, C.-Y. Yao, and H.-K. Chu, “Sports field registration via keypoints-aware label condition,” in Conference on Computer Vision and Pattern Recognition Workshops, CVPRW. IEEE/CVF, 2022, pp. 3523–3530. [Online]. Available: https://doi.org/10.1109/CVPRW56347.2022.00396
[88] N. Zhang and E. Izquierdo, “A high accuracy camera calibration method for sport videos,” in International Conference on Visual Communications and Image Processing, VCIP. IEEE, 2021, pp. 1–5. [Online]. Available: https://doi.org/10.1109/VCIP53242.2021.9675379
[89] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.
[90] D. Tran, H. Wang, L. Torresani, and M. Feiszli, “Video classification with channel-separated convolutional networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5552–5561.
[91] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6202–6211.
[92] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12695–12705.
[93] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Skeleton-based action recognition with multi-stream adaptive graph convolutional networks,” IEEE Transactions on Image Processing, vol. 29, pp. 9532–9545, 2020.
[94] Y.-F. Song, Z. Zhang, C. Shan, and L. Wang, “Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1625–1633.
[95] D. Kondratyuk, L. Yuan, Y. Li, L. Zhang, M. Tan, M. Brown, and B. Gong, “Movinets: Mobile video networks for efficient video recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16020–16030.
[96] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” arXiv preprint arXiv:2102.05095, vol. 2, no. 3, p. 4, 2021.
[97] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” arXiv preprint arXiv:2106.13230, 2021.
[98] R. Herzig, E. Ben-Avraham, K. Mangalam, A. Bar, G. Chechik, A. Rohrbach, T. Darrell, and A. Globerson, “Object-region video transformers,” arXiv preprint arXiv:2110.06915, 2021.
[99] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y.-G. Jiang, L. Zhou, and L. Yuan, “Bevt: Bert pretraining of video transformers,” arXiv preprint arXiv:2112.01529, 2021.
[100] H. Tan, J. Lei, T. Wolf, and M. Bansal, “Vimpac: Video pre-training via masked token prediction and contrastive learning,” arXiv preprint arXiv:2106.11250, 2021.
[101] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13359–13368.
[102] H. Yuan, D. Ni, and M. Wang, “Spatio-temporal dynamic inference network for group activity recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7476–7485.
[103] H. Duan, Y. Zhao, K. Chen, D. Shao, D. Lin, and B. Dai, “Revisiting skeleton-based action recognition,” arXiv preprint arXiv:2104.13586, 2021.
[104] X. Xiang, Y. Tian, A. Reiter, G. D. Hager, and T. D. Tran, “S3d: Stacking segmental p3d for action quality assessment,” in ICIP, 2018, pp. 928–932.
[105] P. Parmar and B. T. Morris, “Action quality assessment across multiple actions,” in WACV, 2018.
[106] ——, “What and how well you performed? a multitask learning approach to action quality assessment,” in CVPR, 2019.
[107] C. Xu, Y. Fu, B. Zhang, Z. Chen, and X. Xue, “Learning to score figure skating sport videos,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. PP, no. 99, pp. 1–1, 2019.
[108] Y. Tang, Z. Ni, J. Zhou, D. Zhang, and J. Zhou, “Uncertainty-aware score distribution learning for action quality assessment,” in CVPR, 2020.
[109] S. Wang, Y. D., Z. P., C. C., and Z. L., “Tsa-net: Tube self-attention network for action quality assessment,” in ACM MM, 2021.
[110] Z. Qi, R. Zhu, Z. Fu, W. Chai, and V. Kindratenko, “Weakly supervised two-stage training scheme for deep video fight detection model,” arXiv preprint arXiv:2209.11477, 2022.
[111] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[112] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick et al., “Moments in time dataset: one million videos for event understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 502–508, 2019.
[113] G. Wang, K. Lu, Y. Zhou, Z. He, and G. Wang, “Human-centered prior-guided and task-dependent multi-task representation learning for action recognition pre-training,” in 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022, pp. 1–6.
[114] D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine-grained action understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2616–2625.
[115] S. Sun, F. Wang, Q. Liang, and L. He, “Taichi: A fine-grained action recognition dataset,” in Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 2017, pp. 429–433.
[116] J. Choi, C. Gao, J. C. Messou, and J.-B. Huang, “Why can’t i dance in the mall? learning to mitigate scene bias in action recognition,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[117] P. Weinzaepfel and G. Rogez, “Mimetics: Towards understanding human actions out of context,” International Journal of Computer Vision, vol. 129, no. 5, pp. 1675–1690, 2021.
[118] Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang, “Disentangling and unifying graph convolutions for skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 143–152.
[119] K. Zhu, A. Wong, and J. McPhee, “Fencenet: Fine-grained footwork recognition in fencing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3589–3598.
[120] J. Hong, M. Fisher, M. Gharbi, and K. Fatahalian, “Video pose distillation for few-shot, fine-grained sports action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9254–9263.
[121] Y. Ben-Shabat, X. Yu, F. Saleh, D. Campbell, C. Rodriguez-Opazo, H. Li, and S. Gould, “The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 847–859.
[122] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 720–736.
[123] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850.
[124] K. Gavrilyuk, R. Sanford, M. Javan, and C. G. Snoek, “Actor-transformers for group activity recognition,” in CVPR, 2020, pp. 839–848.
[125] G. Hu, B. Cui, Y. He, and S. Yu, “Progressive relation learning for group activity recognition,” in CVPR, 2020, pp. 980–989.
[126] R. Yan, L. Xie, J. Tang, X. Shu, and Q. Tian, “Social adaptive module for weakly-supervised group activity recognition,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. Springer, 2020, pp. 208–224.
[127] M. Ehsanpour, A. Abedin, F. Saleh, J. Shi, I. Reid, and H. Rezatofighi, “Joint learning of social groups, individuals action and sub-group activities in videos,” in ECCV. Springer, 2020, pp. 177–195.
[128] R. R. A. Pramono, Y. T. Chen, and W. H. Fang, “Empowering relational network by self-attention augmented conditional random fields for group activity recognition,” in ECCV. Springer, 2020, pp. 71–90.
[129] S. Li, Q. Cao, L. Liu, K. Yang, S. Liu, J. Hou, and S. Yi, “Groupformer: Group activity recognition with clustered spatial-temporal transformer,” ICCV, 2021.
[130] J. Wu, L. Wang, L. Wang, J. Guo, and G. Wu, “Learning actor relation graphs for group activity recognition,” in CVPR, 2019, pp. 9964–9974.
[131] M. Han, D. J. Zhang, Y. Wang, R. Yan, L. Yao, X. Chang, and Y. Qiao, “Dual-ai: dual-path actor interaction learning for group activity recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2990–2999.
[132] G. Xu and J. Yin, “Mlp-air: An efficient mlp-based method for actor interaction relation learning in group activity recognition,” arXiv preprint arXiv:2304.08803, 2023.
[133] V. Bettadapura, C. Pantofaru, and I. Essa, “Leveraging contextual cues for generating basketball highlights,” in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 908–917.
[134] F. C. Heilbron, W. Barrios, V. Escorcia, and B. Ghanem, “Scc: Semantic context cascade for efficient action detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3175–3184.
[135] P. Felsen, P. Agrawal, and J. Malik, “What will happen next? forecasting player moves in sports videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3342–3351.
[136] A. Cioppa, A. Deliege, and M. Van Droogenbroeck, “A bottom-up approach based on semantics for the interpretation of the main camera stream in soccer games,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1765–1774.
[137] T. Tsunoda, Y. Komori, M. Matsugu, and T. Harada, “Football action recognition using hierarchical lstm,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 99–107.
[138] Z. Cai, H. Neher, K. Vats, D. A. Clausi, and J. Zelek, “Temporal hockey action recognition via pose and optical flows,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[139] M. Sanabria, F. Precioso, and T. Menguy, “A deep architecture for multimodal summarization of soccer games,” in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 16–24.
[140] F. Turchini, L. Seidenari, L. Galteri, A. Ferracani, G. Becchi, and A. Del Bimbo, “Flexible automatic football filming and summarization,” in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 108–114.
[141] S. Giancola, M. Amine, T. Dghaily, and B. Ghanem, “Soccernet: A scalable dataset for action spotting in soccer videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1711–1721.
[142] A. Cioppa, A. Deliege, S. Giancola, B. Ghanem, M. V. Droogenbroeck, R. Gade, and T. B. Moeslund, “A context-aware loss function for action spotting in soccer videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13126–13136.
[143] J. Hong, H. Zhang, M. Gharbi, M. Fisher, and K. Fatahalian, “Spotting temporally precise, fine-grained events in video,” in Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV. Springer, 2022, pp. 33–51.
[144] A. Darwish and T. El-Shabrway, “Ste: Spatio-temporal encoder for action spotting in soccer videos,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 87–92.
[145] A. Cartas, C. Ballester, and G. Haro, “A graph-based method for soccer action spotting using unsupervised player classification,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 93–102.
[146] H. Zhu, J. Liang, C. Lin, J. Zhang, and J. Hu, “A transformer-based system for action spotting in soccer videos,” in Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 2022, pp. 103–109.
[147] J. V. Soares and A. Shah, “Action spotting using dense detection anchors revisited: Submission to the soccernet challenge 2022,” arXiv preprint arXiv:2206.07846, 2022.
[148] J. V. Soares, A. Shah, and T. Biswas, “Temporally precise action spotting in soccer videos using dense detection anchors,” in 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 2796–2800.
[149] J. H. Pan, J. Gao, and W. S. Zheng, “Action assessment by joint relation graphs,” in ICCV, 2019.
[150] G. I. Parisi, S. Magg, and S. Wermter, “Human motion assessment in real time using recurrent self-organization,” in IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2016.
[151] S. T. Kim and M. R. Yong, “Evaluationnet: Can human skill be evaluated by deep networks?” arXiv preprint arXiv:1705.11077, 2017.
[152] X. Yu, Y. Rao, W. Zhao, J. Lu, and J. Zhou, “Group-aware contrastive regression for action quality assessment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7919–7928.
[153] Y. Li, X. Chai, and X. Chen, “End-to-end learning for action quality assessment,” in Advances in Multimedia Information Processing – PCM, 2018.
[154] G. Bertasius, H. S. Park, S. X. Yu, and J. Shi, “Am i a baller? basketball performance assessment from first-person videos,” in ICCV, 2019.
[155] R. Agyeman, R. Muhammad, and G. S. Choi, “Soccer video summarization using deep learning,” in 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 270–273.
[156] M. Rafiq, G. Rafiq, R. Agyeman, G. S. Choi, and S.-I. Jin, “Scene classification for sports video summarization using transfer learning,” Sensors, vol. 20, no. 6, p. 1702, Mar. 2020. [Online]. Available: http://dx.doi.org/10.3390/s20061702
[157] A. A. Khan, J. Shao, W. Ali, and S. Tumrani, “Content-aware summarization of broadcast sports videos: An audio–visual feature extraction approach,” Neural Processing Letters, pp. 1–24, 2020.
[158] H. Shingrakhia and H. Patel, “Sgrnn-am and hrf-dbn: A hybrid machine learning model for cricket video summarization,” Vis. Comput., vol. 38, no. 7, pp. 2285–2301, Jul. 2022. [Online]. Available: https://doi.org/10.1007/s00371-021-02111-8
[159] W. Li, G. Pan, C. Wang, Z. Xing, and Z. Han, “From coarse to fine: Hierarchical structure-aware video summarization,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 18, no. 1s, Jan. 2022. [Online]. Available: https://doi.org/10.1145/3485472
[160] W. Chai and G. Wang, “Deep vision multimodal learning: Methodology, benchmark, and trend,” Applied Sciences, vol. 12, no. 13, p. 6588, 2022.
[161] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X. Yang, “Fine-grained video captioning for sports narrative,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6006–6015.
[162] M. Qi, Y. Wang, A. Li, and J. Luo, “Sports video captioning via attentive motion representation and group relationship modeling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 8, pp. 2617–2633, 2019.
[163] ——, “Sports video captioning via attentive motion representation and group relationship modeling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 8, pp. 2617–2633, 2020.
[164] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X. Yang, “Fine-grained video captioning for sports narrative,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6006–6015.
[165] J. Wang, I. Fox, J. Skaza, N. Linck, S. Singh, and J. Wiens, “The advantage of doubling: a deep reinforcement learning approach to studying the double team in the nba,” arXiv preprint arXiv:1803.02940, 2018.
[166] Y. Luo, “Inverse reinforcement learning for team sports: Valuing actions and players,” 2020.
[167] G. Liu and O. Schulte, “Deep reinforcement learning in ice hockey for context-aware player evaluation,” arXiv preprint arXiv:1805.11088, 2018.
[168] C. Yanai, A. Solomon, G. Katz, B. Shapira, and L. Rokach, “Q-ball: Modeling basketball games using deep reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8806–8813.
[169] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[170] “statsperform-optical-tracking,” https://www.statsperform.com/team-performance/football/optical-tracking.
[171] “secondspectrum,” https://www.secondspectrum.com.
[172] X. Wei, P. Lucey, S. Morgan, and S. Sridharan, “Forecasting the next shot location in tennis using fine-grained spatiotemporal tracking data,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 11, pp. 2988–2997, 2016.
[173] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Memory augmented deep generative models for forecasting the next shot location in tennis,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 9, pp. 1785–1797, 2019.
[174] X. Wei, P. Lucey, S. Morgan, M. Reid, and S. Sridharan, “The thin edge of the wedge: Accurately predicting shot outcomes in tennis using style and context priors,” in Proceedings of the 10th Annual MIT Sloan Sports Analytics Conference, Boston, MA, USA, 2016, pp. 1–11.
[175] H. M. Le, P. Carr, Y. Yue, and P. Lucey, “Data-driven ghosting using deep imitation learning,” 2017.
[176] P. Power, H. Ruiz, X. Wei, and P. Lucey, “Not all passes are created equal: Objectively measuring the risk and reward of passes in soccer from tracking data,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1605–1613.
[177] W.-Y. Wang, H.-H. Shuai, K.-S. Chang, and W.-C. Peng, “Shuttlenet: Position-aware fusion of rally progress and player styles for stroke forecasting in badminton,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[178] F. B. Martins, M. G. Machado, H. F. Bassani, P. H. M. Braga, and E. S. Barros, “rsoccer: A framework for studying reinforcement learning in small and very small size robot soccer,” 2021.
[179] S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel, “Emergent coordination through competition,” arXiv preprint arXiv:1902.07151, 2019.
[180] S. Liu, G. Lever, Z. Wang, J. Merel, S. Eslami, D. Hennes, W. M. Czarnecki, Y. Tassa, S. Omidshafiei, A. Abdolmaleki et al., “From motor control to team play in simulated humanoid football,” arXiv preprint arXiv:2105.12196, 2021.
[181] K. Kurach, A. Raichuk, P. Stańczyk, M. Zając, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet et al., “Google research football: A novel reinforcement learning environment,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4501–4510.
[182] Y. Zhao, I. Borovikov, J. Rupert, C. Somers, and A. Beirami, “On multi-agent learning in team sports games,” arXiv preprint arXiv:1906.10124, 2019.
[183] H. Jia, Y. Hu, Y. Chen, C. Ren, T. Lv, C. Fan, and C. Zhang, “Fever basketball: A complex, flexible, and asynchronized sports game environment for multi-agent reinforcement learning,” arXiv preprint arXiv:2012.03204, 2020.
[184] F. Z. Ziyang Li, Kaiwen Zhu, “Wekick,” https://www.kaggle.com/c/google-football/discussion/202232, 2020.
[185] S. Huang, W. Chen, L. Zhang, Z. Li, F. Zhu, D. Ye, T. Chen, and J. Zhu, “Tikick: Towards playing multi-agent football full games from single-agent demonstrations,” arXiv preprint arXiv:2110.04507, 2021.
[186] F. Lin, S. Huang, T. Pearce, W. Chen, and W.-W. Tu, “Tizero: Mastering multi-agent football with curriculum learning and self-play,” arXiv preprint arXiv:2302.07515, 2023.
[187] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of mappo in cooperative, multi-agent games,” arXiv preprint arXiv:2103.01955, 2021.
[188] M. Wen, J. G. Kuba, R. Lin, W. Zhang, Y. Wen, J. Wang, and Y. Yang, “Multi-agent reinforcement learning is a sequence modeling problem,” arXiv preprint arXiv:2205.14953, 2022.
[189] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “First return, then explore,” Nature, vol. 590, no. 7847, pp. 580–586, 2021.
[190] P. Tendulkar, A. Das, A. Kembhavi, and D. Parikh, “Feel the music: Automatically generating a dance for an input song,” arXiv preprint arXiv:2006.11905, 2020.
[191] X. Ren, H. Li, Z. Huang, and Q. Chen, “Self-supervised dance video synthesis conditioned on music,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 46–54.
[192] J. P. Ferreira, T. M. Coutinho, T. L. Gomes, J. F. Neto, R. Azevedo, R. Martins, and E. R. Nascimento, “Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio,” Computers & Graphics, vol. 94, pp. 11–21, 2021.
[193] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
[194] O. Alemi, J. Françoise, and P. Pasquier, “Groovenet: Real-time music-driven dance movement generation using artificial neural networks,” networks, vol. 8, no. 17, p. 26, 2017.
[195] T. Tang, J. Jia, and H. Mao, “Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis,” in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1598–1606.
[196] N. Yalta, S. Watanabe, K. Nakadai, and T. Ogata, “Weakly-supervised deep recurrent neural networks for basic dance step generation,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[197] W. Zhuang, Y. Wang, J. Robinson, C. Wang, M. Shao, Y. Fu, and S. Xia, “Towards 3d dance motion synthesis and control,” arXiv preprint arXiv:2006.05743, 2020.
[198] H.-K. Kao and L. Su, “Temporally guided music-to-body-movement generation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 147–155.
[199] H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, and J. Kautz, “Dancing to music,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[200] G. Sun, Y. Wong, Z. Cheng, M. S. Kankanhalli, W. Geng, and X. Li, “Deepdance: music-to-dance motion choreography with adversarial learning,” IEEE Transactions on Multimedia, vol. 23, pp. 497–509, 2020.
[201] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, and D. Jiang, “Dance revolution: Long-term dance generation with music via curriculum learning,” arXiv preprint arXiv:2006.06119, 2020.
[202] R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “Ai choreographer: Music conditioned 3d dance generation with aist++,” 2021.
[203] H. Ahn, J. Kim, K. Kim, and S. Oh, “Generative autoregressive networks for 3d dancing move synthesis from music,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3501–3508, 2020.
[204] Z. Ye, H. Wu, J. Jia, Y. Bu, W. Chen, F. Meng, and Y. Wang, “Choreonet: Towards music to dance synthesis with choreographic action unit,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 744–752.
[205] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, and E. Ricci, “Playable video generation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10061–10070.
[206] J. Huang, Y. Jin, K. M. Yi, and L. Sigal, “Layered controllable video generation,” in Proceedings of the European Conference on Computer Vision (ECCV), S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds., 2022.
[207] A. Davtyan and P. Favaro, “Controllable video generation through global and local motion dynamics,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
[208] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[209] W. Menapace, A. Siarohin, S. Lathuilière, P. Achlioptas, V. Golyanik, E. Ricci, and S. Tulyakov, “Plotting behind the scenes: Towards learnable game engines,” arXiv preprint arXiv:2303.13472, 2023.
[210] N. Feng, Z. Song, J. Yu, Y.-P. P. Chen, Y. Zhao, Y. He, and T. Guan, “Sset: a dataset for shot segmentation, event detection, player tracking in soccer videos,” Multimedia Tools and Applications, vol. 79, pp. 28971–28992, 2020.
[211] Y. Jiang, K. Cui, L. Chen, C. Wang, and C. Xu, “Soccerdb: A large-scale database for comprehensive video understanding,” in Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports, 2020, pp. 1–8.
[212] A. Deliege, A. Cioppa, S. Giancola, M. J. Seikavandi, J. V. Dueholm, K. Nasrollahi, B. Ghanem, T. B. Moeslund, and M. Van Droogenbroeck, “Soccernet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4508–4519.
[213] N. M. Lessa, E. L. Colombini, and A. D. S. Simões, “Soccerkicks: a dataset of 3d dead ball kicks reference movements for humanoid robots,” in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2021, pp. 3472–3478.
[214] A. Scott, I. Uchida, M. Onishi, Y. Kameda, K. Fukui, and K. Fujii, “Soccertrack: A dataset and tracking algorithm for soccer with fisheye and drone videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3569–3579.
[215] P. Parisot and C. De Vleeschouwer, “Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera,” Computer Vision and Image Understanding, vol. 159, pp. 74–88, 2017.
[216] C. Ma, J. Fan, J. Yao, and T. Zhang, “Npu rgb+d dataset and a feature-enhanced lstm-dgcn method for action recognition of basketball players,” Applied Sciences, vol. 11, no. 10, p. 4426, 2021.
[217] D. Wu, H. Zhao, X. Bao, and R. P. Wildes, “Sports video analysis on large-scale data,” in ECCV, Oct. 2022.
[218] W. Menapace, S. Lathuiliere, A. Siarohin, C. Theobalt, S. Tulyakov, V. Golyanik, and E. Ricci, “Playable environments: Video manipulation in space and time,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3584–3593.
[219] C. Xu, Y. Fu, B. Zhang, Z. Chen, Y.-G. Jiang, and X. Xue, “Learning to score figure skating sport videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4578–4590, 2019.
[220] S. Wang, D. Yang, P. Zhai, C. Chen, and L. Zhang, “Tsa-net: Tube self-attention network for action quality assessment,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4902–4910.
[221] P. Parmar and B. T. Morris, “What and how well you performed? a multitask learning approach to action quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 304–313.
[222] J. Xu, Y. Rao, X. Yu, G. Chen, J. Zhou, and J. Lu, “Finediving: A fine-grained dataset for procedure-aware action quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2949–2958.
[223] W. McNally, K. Vats, T. Pinto, C. Dulhanty, J. McPhee, and A. Wong, “Golfdb: A video database for golf swing sequencing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[224] A. Piergiovanni and M. S. Ryoo, “Fine-grained activity recognition in baseball videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1740–1748.
[225] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[226] H. Pirsiavash, C. Vondrick, and A. Torralba, “Assessing the quality of actions,” in Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI 13. Springer, 2014, pp. 556–571.
[227] S. M. Safdarnejad, X. Liu, L. Udpa, B. Andrus, J. Wood, and D. Craven, “Sports videos in the wild (svw): A video dataset for sports analysis,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1. IEEE, 2015, pp. 1–7.
[228] P. Parmar and B. Tran Morris, “Learning to score olympic events,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 20–28.
[229] W. Zhang, Z. Liu, L. Zhou, H. Leung, and A. B. Chan, “Martial arts, dancing and sports dataset: A challenging stereo and multi-view dataset for 3d human pose estimation,” Image and Vision Computing, vol. 61, pp. 22–39, 2017.
[230] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei, “Every moment counts: Dense detailed labeling of actions in complex videos,” International Journal of Computer Vision, vol. 126, pp. 375–389, 2018.
[231] P. Parmar and B. Morris, “Action quality assessment across multiple actions,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1468–1476.
[232] C. Zalluhoglu and N. Ikizler-Cinbis, “Collective sports: A multi-task dataset for collective activity recognition,” Image and Vision Computing, vol. 94, p. 103870, 2020.
[233] Y. Li, L. Chen, R. He, Z. Wang, G. Wu, and L. Wang, “Multisports: A multi-person video dataset of spatio-temporally localized sports actions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13536–13545.
[234] A. Nibali, J. Millward, Z. He, and S. Morgan, “Aspset: An outdoor sports pose video dataset with 3d keypoint annotations,” Image and Vision Computing, vol. 111, p. 104196, 2021.
[235] J. Chung, C.-h. Wuu, H.-r. Yang, Y.-W. Tai, and C.-K. Tang, “Haa500: Human-centric atomic action dataset with curated videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13465–13474.
[236] X. Chen, A. Pang, W. Yang, Y. Ma, L. Xu, and J. Yu, “Sportscap: Monocular 3d human motion capture and fine-grained understanding in challenging sports videos,” International Journal of Computer Vision, Aug. 2021. [Online]. Available: https://doi.org/10.1007/s11263-021-01486-4
[237] P. Parmar and B. Morris, “Win-fail action recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 161–171.
[238] C. K. Ingwersen, C. Mikkelstrup, J. N. Jensen, M. R. Hannemose, and A. B. Dahl, “Sportspose: A dynamic 3d sports pose dataset,” in Proceedings of the IEEE/CVF International Workshop on Computer Vision in Sports, 2023.
[239] Y. Cui, C. Zeng, X. Zhao, Y. Yang, G. Wu, and L. Wang, “Sportsmot: A large multi-object tracking dataset in multiple sports scenes,” arXiv preprint arXiv:2304.05170, 2023.
[240] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, and P. L. Mazzeo, “A semi-automatic system for ground truth generation of soccer video sequences,” in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009, pp. 559–564.
[241] S. A. Pettersen, D. Johansen, H. Johansen, V. Berg-Johansen, V. R. Gaddam, A. Mortensen, R. Langseth, C. Griwodz, H. K. Stensland, and P. Halvorsen, “Soccer video and player position dataset,” in Proceedings of the 5th ACM Multimedia Systems Conference, 2014, pp. 18–23.
[242] K. Lu, J. Chen, J. J. Little, and H. He, “Light cascaded convolutional neural networks for accurate player detection,” arXiv preprint arXiv:1709.10230, 2017.
[243] J. Yu, A. Lei, Z. Song, T. Wang, H. Cai, and N. Feng, “Comprehensive dataset of broadcast soccer videos,” in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2018, pp. 418–423.
[244] J. Qi, J. Yu, T. Tu, K. Gao, Y. Xu, X. Guan, X. Wang, Y. Dong, B. Xu, L. Hou et al., “Goal: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation,” arXiv preprint arXiv:2303.14655, 2023.
[245] C. De Vleeschouwer, F. Chen, D. Delannay, C. Parisot, C. Chaudy, E. Martrou, A. Cavallaro et al., “Distributed video acquisition and annotation for sport-event summarization,” NEM summit, vol. 8, no. 10.1016, 2008.
[246] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and L. Fei-Fei, “Detecting events and key actors in multi-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3043–3053.
[247] S. Francia, S. Calderara, and D. F. Lanzi, “Classificazione di azioni cestistiche mediante tecniche di deep learning [Classification of basketball actions via deep learning techniques],” 2018. [Online]. Available: https://www.researchgate.net/publication/330534530_Classificazione_di_Azioni_Cestistiche_mediante_Tecniche_di_Deep_Learning
[248] X. Gu, X. Xue, and F. Wang, “Fine-grained action recognition on a novel basketball dataset,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2563–2567.
[249] Y. Yan, N. Zhuang, B. Ni, J. Zhang, M. Xu, Q. Zhang, Z. Zhang, S. Cheng, Q. Tian, Y. Xu et al., “Fine-grained video captioning via graph-based multi-granularity interaction learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 666–683, 2019.
[250] L. Zhu, K. Rematas, B. Curless, S. Seitz, and I. Kemelmacher-Shlizerman, “Reconstructing nba players,” in Proceedings of the European Conference on Computer Vision (ECCV), Aug. 2020.
[251] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1971–1980.
[252] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “Hierarchical deep temporal models for group activity recognition,” CoRR, vol. abs/1607.02643, 2016. [Online]. Available: http://arxiv.org/abs/1607.02643
[253] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, and R. Sukthankar, “Violence detection in video using computer vision techniques,” in Computer Analysis of Images and Patterns: 14th International Conference, CAIP 2011, Seville, Spain, August 29–31, 2011, Proceedings, Part II 14. Springer, 2011, pp. 332–339.
[254] K. Vats, P. Walters, M. Fani, D. A. Clausi, and J. Zelek, “Player tracking and identification in ice hockey,” arXiv preprint arXiv:2110.03090, 2021.
[255] T. De Campos, M. Barnard, K. Mikolajczyk, J. Kittler, F. Yan, W. Christmas, and D. Windridge, “An evaluation of bags-of-words and spatio-temporal shapes for action recognition,” in 2011 IEEE Workshop on Applications of Computer Vision (WACV). IEEE, 2011, pp. 344–351.
[256] S. Gourgari, G. Goudelis, K. Karpouzis, and S. Kollias, “Thetis: Three dimensional tennis shots a human action dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 676–681.
[257] H. Faulkner and A. Dick, “Tenniset: a dataset for dense fine-grained event recognition, localisation and description,” in 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2017, pp. 1–8.
[258] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, and E. Ricci, “Playable video generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10061–10070.
[259] P.-E. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier, “Sport action recognition with siamese spatio-temporal cnns: Application to table tennis,” in 2018 International Conference on Content-Based Multimedia Indexing (CBMI). IEEE, 2018, pp. 1–6.
[260] K. M. Kulkarni and S. Shenoy, “Table tennis stroke recognition using two-dimensional human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4576–4584.
[261] J. Bian, Q. Wang, H. Xiong, J. Huang, C. Liu, X. Li, J. Cheng, J. Zhao, F. Lu, and D. Dou, “P2a: A dataset and benchmark for dense action detection from table tennis match broadcasting videos,” arXiv preprint arXiv:2207.12730, 2022.
[262] S. Zahan, G. M. Hassan, and A. Mian, “Learning sparse temporal video mapping for action quality assessment in floor gymnastics,” arXiv preprint arXiv:2301.06103, 2023.
[263] A. Ghosh, S. Singh, and C. Jawahar, “Towards structured analysis of broadcast badminton videos,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 296–304.
[264] Z. T. L. Shan, “Fineskating: A high-quality figure skating dataset and multi-task approach for sport action,” Peng Cheng Laboratory Communications, vol. 1, no. 3, p. 107, 2020.
[265] S. Liu, X. Liu, G. Huang, L. Feng, L. Hu, D. Jiang, A. Zhang, Y. Liu, and H. Qiao, “Fsd-10: a dataset for competitive sports content analysis,” arXiv preprint arXiv:2002.03312, 2020.
[266] S. Liu, A. Zhang, Y. Li, J. Zhou, L. Xu, Z. Dong, and R. Zhang, “Temporal segmentation of fine-grained semantic action: A motion-centered figure skating dataset,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2163–2171.
[267] Y. Li, Y. Li, and N. Vasconcelos, “Resound: Towards action recognition without representation bias,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 513–528.
[268] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action mach: a spatio-temporal maximum average correlation height filter for action recognition,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[269] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops. IEEE, 2010, pp. 9–14.
[270] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal structure of decomposable motion segments for activity classification,” in Computer Vision – ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part II 11. Springer, 2010, pp. 392–405.
[271] J. Pers, “Cvbase 06 dataset: a dataset for development and testing of computer vision based methods in sport environments,” SN, Ljubljana, 2005.
[272] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa, “Robocup: The robot world cup initiative,” in Proceedings of the First International Conference on Autonomous Agents, 1997, pp. 340–347.
[273] JiDi, “Jidi olympics football,” https://github.com/jidiai/ai_lib/blob/master/env/olympics_football.py, 2022.
[274] A. S. Azad, E. Kim, Q. Wu, K. Lee, I. Stoica, P. Abbeel, A. Sangiovanni-Vincentelli, and S. A. Seshia, “Programmatic modeling and generation of real-time strategic soccer environments for reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 6, 2022, pp. 6028–6036.
[275] J. Wang, J. Ma, K. Hu, Z. Zhou, H. Zhang, X. Xie, and Y. Wu, “Tac-trainer: A visual analytics system for iot-based racket sports training,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 1, pp. 951–961, 2022.
[276] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, “Summary of chatgpt/gpt-4 research and perspective towards the future of large language models,” 2023.
[277] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[278] R. Deng, C. Cui, Q. Liu, T. Yao, L. W. Remedios, S. Bao, B. A. Landman, L. E. Wheless, L. A. Coburn, K. T. Wilson et al., “Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging,” arXiv preprint arXiv:2304.04155, 2023.
[279] S. Roy, T. Wald, G. Koehler, M. R. Rokuss, N. Disch, J. Holzschuh, D. Zimmerer, and K. H. Maier-Hein, “Sam.md: Zero-shot medical image segmentation capabilities of the segment anything model,” 2023.
[280] Y. Liu, J. Zhang, Z. She, A. Kheradmand, and M. Armand, “Samm (segment any medical model): A 3d slicer integration to sam,” 2023.
[281] J. Z. Wu, Y. Ge, X. Wang, W. Lei, Y. Gu, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” arXiv preprint arXiv:2212.11565, 2022.
[282] J. Liu, N. Saquib, Z. Chen, R. H. Kazi, L.-Y. Wei, H. Fu, and C.-L. Tai, “Posecoach: A customizable analysis and visualization system for video-based running coaching,” IEEE Transactions on Visualization and Computer Graphics, 2022.
[283] Z. Zhao, S. Lan, and S. Zhang, “Human pose estimation based speed detection system for running on treadmill,” in 2020 International Conference on Culture-oriented Science & Technology (ICCST). IEEE, 2020, pp. 524–528.
[284] T. Perrett, A. Masullo, D. Damen, T. Burghardt, I. Craddock, M. Mirmehdi et al., “Personalized energy expenditure estimation: Visual sensing approach with deep learning,” JMIR Formative Research, vol. 6, no. 9, p. e33606, 2022.
[285] D. Radke and A. Orchard, “Presenting multiagent challenges in team sports analytics,” arXiv preprint arXiv:2303.13660, 2023.

Zhonghan Zhao received the BE degree from Communication University of China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include machine learning, reinforcement learning, and computer vision.

Wenhao Chai received the BE degree from Zhejiang University, China. He is currently working toward the Master's degree with the University of Washington. His research interests include 3D human pose estimation, generative models, and multi-modality learning.

Shengyu Hao received the MS degree from Beijing University of Posts and Telecommunications, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include machine learning and computer vision.

Wenhao Hu received the BS degree from Zhejiang University, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include generative models and 3D reconstruction.

Guanhong Wang received the MS degree from Huaqiao University, China. He is currently working toward the PhD degree with Zhejiang University - University of Illinois Urbana-Champaign Institute, Zhejiang University. His research interests include deep learning and computer vision.