
2 Ren and Chang, et al.
is a fascinating development. However, the classic AL algorithm also finds it difficult to handle high-dimensional data [160]. Therefore, the combination of DL and AL, referred to as DAL, is expected to achieve superior results. DAL has been widely utilized in various fields, including image recognition [35, 47, 53, 68], text classification [145, 180, 185], visual question answering [98] and object detection [3, 39, 121], etc. Although a rich variety of related work has been published, DAL still lacks a unified classification framework. In order to fill this gap, in this article, we will provide a comprehensive overview of the existing DAL-related work, along with a formal classification method. We will next briefly review the development status of DL and AL in their respective fields. Subsequently, in Section 2, the necessity and challenges of combining DL and AL are further explicated.
1.1 Deep Learning
DL attempts to build appropriate models by simulating the structure of the human brain. The McCulloch-Pitts (MCP) model proposed in 1943 by [40] is regarded as the beginning of modern DL. Subsequently, in 1986, [129] introduced backpropagation into the optimization of neural networks, which laid the foundation for the subsequent rapid development of DL. In the same year, Recurrent Neural Networks (RNNs) [75] were first proposed. In 1998, the LeNet [92] network made its first appearance, representing one of the earliest uses of deep neural networks (DNNs). However, these pioneering early works were limited by the computing resources available at the time and did not receive as much attention and investigation as they deserved [90]. In 2006, Deep Belief Networks (DBNs) [62] were proposed and used to explore deeper networks, which prompted the adoption of the name DL for research on neural networks. In 2012, the DL model AlexNet [87] won the ImageNet competition in one fell swoop. AlexNet uses the ReLU activation function to effectively suppress the vanishing gradient problem, while its use of multiple GPUs greatly improves training speed. Subsequently, DL began to win championships in various competitions and to constantly break records in various tasks. From the perspective of automation, the emergence of DL has transformed machine learning from the manual design of features [30, 102] to their automatic extraction [58, 149]. It is precisely because of this powerful automatic feature extraction ability that DL has demonstrated such unprecedented advantages in many fields. After decades of development, the body of DL-related research has become quite rich. In Fig. 1a, we present a standard deep learning model example: the convolutional neural network (CNN) [91, 130]. Based on this approach, similar CNNs are applied to various image processing tasks. In addition, RNNs and Generative Adversarial Networks (GANs) [132] are also widely utilized. Beginning in 2017, DL gradually shifted from the initial automation of feature extraction to the automation of model architecture design [11, 124, 189]; however, this still has a long way to go.
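To make the automatic-feature-extraction point concrete, the sketch below implements, in plain NumPy, the building blocks a CNN stacks: convolution, a ReLU nonlinearity, and max pooling. This is a minimal illustration under our own simplifying assumptions, not any network from the cited works; the edge-detecting kernel is written by hand here precisely to show the kind of filter a CNN would instead learn from data.

```python
import numpy as np

def relu(x):
    # ReLU keeps a gradient of 1 for positive inputs, which is why it
    # helps suppress the vanishing gradient problem noted for AlexNet.
    return np.maximum(0.0, x)

def conv2d(image, kernel):
    # Naive "valid" 2-D convolution (cross-correlation, as in most DL
    # frameworks): slide the kernel over the image and sum elementwise
    # products. Illustrative only; real CNNs use optimized batched ops.
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    # 2x2 max pooling: downsample by keeping the strongest response.
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A hand-designed vertical-edge kernel -- the kind of low-level feature
# that early CNN layers learn automatically from data.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

image = np.zeros((6, 6))
image[:, 3:] = 1.0  # right half bright: a single vertical edge
feature_map = max_pool(relu(conv2d(image, edge_kernel)))
```

On this toy image the resulting 2x2 feature map responds only along the vertical edge; stacking such layers, with kernels learned by backpropagation rather than hand-designed, is what gives CNNs their feature-extraction power.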
Thanks to the publication of a large number of annotated datasets [16, 87], in recent years DL has made breakthroughs in various fields, including machine translation [4, 13, 159, 168], speech recognition [110, 116, 120, 136], and image classification [60, 106, 115, 174]. However, this comes at the cost of a large number of manually labeled datasets, and DL is strongly greedy for data. While obtaining a large number of unlabeled samples in the real world is relatively simple, manually labeling datasets comes at a high cost; this is particularly true for fields where labeling requires a high degree of professional knowledge [64, 153]. For example, the labeling and description of lung lesion images of COVID-19 patients must be completed by experienced clinicians, and it is clearly impractical to demand that such professionals perform a large amount of medical image labeling. Similar fields include speech recognition [1, 188], medical imaging [64, 93, 109, 176], recommender systems [2, 26], information extraction [17], satellite remote sensing [99] and robotics [7, 22, 158, 186], etc. Therefore, a way of maximizing the model's performance gain while annotating only a small number of samples is urgently required.