Review
A Review of Machine Learning and Deep Learning for Object
Detection, Semantic Segmentation, and Human Action
Recognition in Machine and Robotic Vision
Nikoleta Manakitsa 1, George S. Maraslidis 1, Lazaros Moysis 2,3 and George F. Fragulis 1,*
1 Department of Electrical and Computer Engineering, University of Western Macedonia, 50100 Kozani, Greece;
[email protected] (N.M.); [email protected] (G.S.M.)
2 Laboratory of Nonlinear Systems-Circuits and Complexity, Physics Department, Aristotle University
of Thessaloniki, 54624 Thessaloniki, Greece; [email protected]
3 Department of Mechanical Engineering, University of Western Macedonia, ZEP Campus,
50100 Kozani, Greece
* Correspondence: [email protected]
Abstract: Machine vision, an interdisciplinary field that aims to replicate human visual perception in
computers, has experienced rapid progress and significant contributions. This paper traces the origins
of machine vision, from early image processing algorithms to its convergence with computer science,
mathematics, and robotics, resulting in a distinct branch of artificial intelligence. The integration
of machine learning techniques, particularly deep learning, has driven its growth and adoption
in everyday devices. This study focuses on the objectives of computer vision systems: replicating
human visual capabilities including recognition, comprehension, and interpretation. Notably, image
classification, object detection, and image segmentation are crucial tasks requiring robust mathemati-
cal foundations. Despite the advancements, challenges persist, such as clarifying terminology related
to artificial intelligence, machine learning, and deep learning. Precise definitions and interpretations
are vital for establishing a solid research foundation. The evolution of machine vision reflects an
ambitious journey to emulate human visual perception. Interdisciplinary collaboration and the integration of deep learning techniques have propelled remarkable advancements in emulating human behavior and perception. Through this research, the field of machine vision continues to shape the future of computer systems and artificial intelligence applications.

Keywords: machine vision; computer vision; image processing; object classification; object detection; object segmentation; pattern recognition; artificial intelligence; machine learning; deep learning; robotics; mechatronics
The field of computer vision has been greatly influenced by earlier research efforts. In
the 1980s, significant advancements were made in digital image processing and the analysis
of algorithms related to image understanding. Prior to these breakthroughs, researchers
worked on mathematical models to replicate human vision and explored the possibilities
of integrating vision into autonomous robots. Initially, the term “machine vision” was
primarily associated with electrical engineering and industrial robotics. However, over
time, it merged with computer vision, giving rise to a unified scientific discipline. This
convergence of machine vision and computer vision has led to remarkable growth, with
machine learning techniques playing a pivotal role in accelerating progress. Today, real-time
vision algorithms have become ubiquitous, seamlessly integrated into everyday devices
like mobile phones equipped with cameras. This integration has transformed how we
perceive and interact with technology [4].
Machine vision has revolutionized computer systems, empowering them with ad-
vanced artificial intelligence techniques that surpass human capabilities in various specific
tasks. Through computer vision systems, computers have gained the ability to perceive
and comprehend the visual world [3].
The overarching goals of computer vision are to enable computers to see, recognize,
and comprehend the visual world in a manner analogous to human vision. Researchers
in machine vision have dedicated their efforts to developing algorithms that facilitate
these visual perception functions. These functions include image classification, which
determines the presence of specific objects in image data; object detection, which identifies
instances of semantic objects within predefined categories; and image segmentation, which
breaks down images into distinct segments for analysis. The complexity of each computer
vision task, coupled with the diverse mathematical foundations involved, poses significant
challenges to their study. However, understanding and addressing these challenges holds
great theoretical and practical importance in the field of computer vision.
The contribution of this work is a presentation of the literature that showcases the
current state of research of machine learning and deep learning methods for object detection,
semantic segmentation, and human action recognition in machine and robotic vision. In
this paper, we present a comprehensive overview of the key elements that constitute
machine vision and the technologies that enhance its performance. We discuss innovative
scientific methods extensively utilized in the broad field of machine and deep learning
in recent years, along with their advantages and limitations. This review not only adds
new insights into machine learning and deep learning methods in machine/robotic vision
but also features real-world applications of object detection, semantic segmentation, and
human action recognition. Additionally, it includes a critical discussion aimed at advancing
the field.
This paper’s organizational structure is as follows. Section 2 offers an overview of
machine learning/deep learning algorithms and methods. Section 3 comprehensively
covers object detection, image, and semantic segmentation algorithms and methods, with a
specific focus on human action recognition methods. Section 4 introduces detailed notions
regarding robotic vision. Section 5 presents Hubel and Wiesel’s electrophysiological in-
sights, Van Essen’s map of the brain, and their impact on machine/robotic vision. Section 6
presents a discussion regarding the aforementioned topics. Lastly, Section 7 addresses the
current challenges and future trends in the field.
assess technologies based on metrics like accuracy and scalability, with the optimal choice
dependent on the specific problem and resources. Initially met with skepticism, public
perception of AI’s benefits has shifted positively over time. Artificial intelligence aims to
replicate human intelligence, with vision being a crucial aspect. Exploring the link between
computer vision and AI, the latter comprises machine learning and deep learning subsets,
essential for understanding machine vision’s progress (see Figure 1).
Figure 1. Relationship between artificial intelligence, machine learning, and deep learning.
The terms artificial intelligence, machine learning, and deep learning are often mis-
takenly used interchangeably. To grasp their relationship, it is helpful to envision them
as concentric circles. The outermost circle represents artificial intelligence, which was
the initial concept. Machine learning, which emerged later, forms a smaller circle that is
encompassed by artificial intelligence. Deep learning, the driving force behind the ongoing
evolution of artificial intelligence, is represented by the smallest circle nested within the
other two.
nature of deep learning, driven by data and minimal user involvement, whereas traditional
machine learning relies on human-crafted features and a more manual, stepwise process.
Figure 2. Comparison between a shallow neural network (top image) and a deep neural network (bottom image).
analyze images [10,11]. Originating from image processing, these algorithms have driven
progress in pattern recognition, object detection, and image classification, ushering in a
paradigm shift. Machine vision leverages intricate techniques and mathematical models,
bridging the gap between human visual systems and machine intelligence. By extracting
meaning from visual stimuli, computer vision has transformed our understanding of
artificial intelligence’s visual realms. Images convey diverse information, including colors,
shapes, and recognizable objects, analogous to how the human brain interprets emotions
and states. In machine vision, algorithms analyze digital images to extract information
based on user-defined criteria. Object detection, face detection, and color recognition are
some examples, illustrating the system’s dependence on specific patterns for information
extraction [12]. The process involves detecting patterns representing objects, with the
detailed steps outlined in Figure 4.
Figure 5. An RGB image and its red, green, and blue components [15].
The HSI and HSV models aim to approximate human perception by considering
characteristics such as hue (H), saturation (S), intensity (I), brightness (B), and value (V).
In the HSI model, the hue component ranges from 0° to 360°, determining the color's
hue, whereas saturation (S) expresses the mixing degree of a primary color with white
(Figure 7). The intensity (I) component denotes light intensity without conveying color
information [16]. The HSI model, depicted as a double cone, exhibits upper and lower
peaks corresponding to white (I = 1) and black (I = 0), with maximum purity (S = 1) at
I = 0.5 (Figure 8).
Figure 9 showcases the HSI model in a real photo, depicting HSV channels as grayscale
images, revealing color saturation and modified color intensity for a clearer representa-
tion. The HSV model calculates the brightness component differently from HSI, primarily
managing hue and chromatic clarity components for digital tasks like histogram balancing.
The HSV color space positions black at the apex of the inverted cone (V = 0) and white at the base (V = 1). The hue component for red is 0°, differing by 180° from its complementary color. Saturation (S) is determined by the distance from the cone's central axis, simplifying color representation and extraction in object detection compared to the RGB color space [14,17].
Figure 7. Color components of the HSI model on a face: hue, saturation, and intensity.
Image segmentation, crucial for analyzing objects through mathematical models, re-
sults in a binary image based on features like texture and color. The process can utilize
color or color intensities, and the histogram-based method, which constructs a histogram
from all pixels, aids in identifying common pixels in the image. In medical applications,
such as chest X-rays, histogram-based segmentation is prevalent [37]. When segmenting
based on color, one-dimensional histograms are obtained for monochrome images, whereas
color images require three histograms for each channel. Peaks or valleys in the histogram
assist in identifying objects and backgrounds. Multicolor images involve processing in-
dividual RGB histograms, combining results to select a robust segmentation hypothesis.
Segmentation based on pixel intensity is less complex, as evidenced in black-and-white
images where relatively large-sized objects create pixel distributions around their average
intensity values [38].
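As a concrete illustration of histogram-based segmentation, the following Python sketch builds an intensity histogram of a grayscale image and selects a threshold with Otsu's classical criterion; the synthetic image and names are illustrative, and the cited works may rely on different histogram analyses.

```python
import numpy as np

def otsu_threshold(image):
    """Histogram-based binary segmentation threshold (Otsu's method).

    Builds a 256-bin histogram of a uint8 grayscale image and picks the
    threshold that maximizes the between-class variance, separating
    foreground objects from the background.
    """
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    prob = hist.astype(float) / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic example: a bright square object on a dark, noisy background.
rng = np.random.default_rng(0)
img = np.zeros((100, 100), dtype=np.uint8)
img[30:70, 30:70] = 200
img = np.clip(img + rng.normal(20, 10, img.shape), 0, 255).astype(np.uint8)

t = otsu_threshold(img)
binary = (img >= t).astype(np.uint8)   # 1 = object pixels, 0 = background
print("threshold:", t, "object pixels:", int(binary.sum()))
```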
Figure 10. Original image (left); representation of the boundaries of the regions on the original image
(top-right image); segmentation result, where each uniform region is described by an integer and all
pixels in the region have this value (bottom-right image).
aspects. This technique enables the counting, tracking, and precise labeling of objects. The
process, termed ‘object detection’ or ‘recognition’, employs mathematical techniques like
convolution, spatial transformations, and machine learning algorithms. Specific instances,
such as ‘face detection’ or ‘car detection’, focus on extracting information related to faces
or cars. Mathematical concepts essential for object detection include convolutions, spa-
tial transformations, and machine learning algorithms like support vector machines and
decision trees (see Table 4 and Figure 11) [39,50].
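A minimal sketch of the pattern-detection idea described above, assuming a sliding-window scan scored by a support vector machine on raw pixel intensities; real detectors use engineered features (e.g., HOG) or learned features and far larger training sets, so every name and value below is purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative only: a tiny sliding-window detector with an SVM classifier.
rng = np.random.default_rng(0)
win = 8  # window size in pixels

# Synthetic training set: "object" patches are bright, "background" patches dark.
objects = rng.normal(0.8, 0.1, size=(50, win * win))
background = rng.normal(0.2, 0.1, size=(50, win * win))
X = np.vstack([objects, background])
y = np.array([1] * 50 + [0] * 50)
clf = LinearSVC(dual=False).fit(X, y)

# Scan a test image and report windows classified as containing the object.
image = rng.normal(0.2, 0.1, size=(32, 32))
image[12:20, 12:20] += 0.6                            # plant one bright "object"
detections = []
for r in range(0, image.shape[0] - win + 1, 4):       # stride of 4 pixels
    for c in range(0, image.shape[1] - win + 1, 4):
        patch = image[r:r + win, c:c + win].reshape(1, -1)
        if clf.predict(patch)[0] == 1:
            detections.append((r, c))
print(detections)   # window position(s) covering the bright square, e.g., [(12, 12)]
```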
leveraging the success of deep learning in various contexts. The authors of [75] proposed
a framework for action recognition, utilizing multiple models to capture both global and
local motion features. A 3D CNN captures overall body motion, whereas a 2D CNN focuses
on individual body parts, enhancing recognition by incorporating both global and local
motion information. Furthermore, [76] drew inspiration from deep learning achievements,
proposing a CNN-Bi-LSTM model for human activity recognition. Through end-to-end
training, the model refines pre-trained CNN features, demonstrating exceptional accuracy
in recognizing single- and multiple-person activities on RGB-D datasets. In [77], a novel
hybrid architecture for human action recognition was introduced, combining four pre-
trained network models through an optimized metaheuristic algorithm. The architecture
involves dataset creation, a deep neural network (DNN) design, training optimization, and
performance evaluation. The results demonstrate its superiority over existing architectures
in accurately predicting human actions. The authors of [78] presented a key contribution
with temporal-spatial mapping, capturing video frame evolution. The proposed temporal
attention model within a convolutional neural network achieved remarkable performance,
surpassing a competing baseline method by 4.2% in accuracy on the challenging HMDB51
dataset. In [79], the authors tackled still image-based human action recognition challenges using transfer learning, data augmentation, and fine-tuning of CNN architectures. The
proposed model outperformed prior benchmarks on the Stanford 40 and PPMI datasets,
showcasing its robustness. Finally, [80] introduced the cooperative genetic algorithm (CGA)
for feature selection, employing a cooperative approach that enhances accuracy, reduces
overfitting, and improves resilience to noise and outliers. CGA offers superior feature
selection outcomes across various domains.
The main human action recognition methods are presented in Table 5, while their
characteristics, including their advantages, disadvantages/limitations, and complexities,
are given in Appendix A.
Method | References
Deep learning (CNNs and RNNs) addresses the critical task of human action recognition in computer vision, enhancing accuracy and optimizing performance. | [9,61–72]
Attention-based LSTM for feature distinctions, incorporating a spatiotemporal saliency-based multi-stream network. | [73]
A hybrid deep learning model for human action recognition. | [74]
Utilizes multiple models to capture global and local motion features for action recognition. | [75]
Uses RGB frames, Bi-LSTM, and a CNN for action recognition. | [76]
A novel hybrid architecture combining four pre-trained network models, predicting human actions. | [77]
Uses a temporal-spatial mapping operation for action recognition. | [78]
Use of image-based HAR through transfer learning. | [79]
A cooperative approach for feature selection. | [80]
cations like autonomous driving and medical imaging. Convolutional neural networks,
especially in deep learning, have significantly advanced semantic segmentation, providing
high-resolution mapping for various applications, including YouTube stories and scene
understanding [86–92]. This technique finds applications in diverse areas, such as docu-
ment analysis, virtual makeup, self-driving cars, and background manipulation in images,
showcasing its versatility and importance. Semantic segmentation architectures typically
involve an encoder network, which utilizes pre-trained networks like VGG or ResNet, and
a decoder network, which projects learned features onto the pixel space, enabling dense
pixel-level classification [86–92].
The three main approaches are:
1. Region-Based Semantic Segmentation
Typically, region-based approaches use the “segmentation using recognition” pipeline.
In this method, free-form regions are extracted from an image and described before being
subjected to region-based classification. The region-based predictions are transformed into
pixel predictions during testing by giving each pixel a label based on the region with the
highest score to which it belongs [86,87,93–96].
2. Fully Convolutional Network-Based Semantic Segmentation
The original Fully Convolutional Network (FCN) does not require region proposals because it learns a direct mapping from pixels to pixels. Unlike traditional CNNs, which end in fixed fully connected layers, FCNs use only convolutional and pooling layers, which allows them to accept inputs of any size and extends the capabilities of a conventional CNN (a minimal sketch is given after this list) [92,97–100].
3. Weakly Supervised Semantic Segmentation
Many semantic segmentation methods depend on pixel-wise segmentation masks, which
are laborious and costly to annotate. To address this challenge, weakly supervised methods
have emerged. These approaches leverage annotated bounding boxes to achieve semantic
segmentation, providing a more efficient and cost-effective solution [50,63,90–92,101–107].
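The following PyTorch sketch illustrates the FCN idea from approach 2: an encoder–decoder built only from convolution, pooling, and up-sampling layers that maps pixels to per-pixel class scores. The layer sizes and class count are illustrative and do not correspond to any specific architecture reviewed here.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional encoder-decoder for semantic segmentation.

    Only convolution, pooling, and transposed-convolution layers are used, so
    the network accepts inputs of (almost) any spatial size and outputs a
    per-pixel class score map. Real FCNs use pre-trained backbones (VGG/ResNet)
    and skip connections; this is an illustrative sketch only.
    """
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),   # back to 1/2
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),     # full resolution
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))                  # (N, num_classes, H, W)

# Works on any input whose sides are divisible by 4 (two 2x poolings).
model = TinyFCN(num_classes=21)
scores = model(torch.randn(1, 3, 96, 128))
pred = scores.argmax(dim=1)                                    # per-pixel class labels
print(scores.shape, pred.shape)    # torch.Size([1, 21, 96, 128]) torch.Size([1, 96, 128])
```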
Some other approaches are discussed below.
In [108], the authors discussed the application of deep learning for the semantic
segmentation of medical images. They outlined crucial steps for constructing an effective
model and addressing challenges in medical image analysis. Deep convolutional neural
networks (DCNNs) in semantic segmentation were explored in [109], where models like
UNet, DeepUNet, ResUNet, DenseNet, and RefineNet were reviewed. DCNNs proved
effective in semantic segmentation, following a three-phase procedure: preprocessing,
processing, and output generation. Ref. [110] introduced CGBNet, a segmentation network
that enhanced performance through context encoding and multi-path decoding. The
network intelligently selects relevant score maps and introduces a boundary delineation
module for competitive scene segmentation results.
The main semantic segmentation methods are presented in Table 6, while their char-
acteristics, including their advantages, disadvantages/limitations, and complexities, are
given in Appendix B.
Summary | References
Identify fundamental computer vision problems: image classification, object detection, and segmentation. | [12,81–85]
Semantic segmentation assigns labels to every pixel, significantly enhanced by deep learning, particularly CNNs. | [86–92]
Describe components of a semantic segmentation architecture and three main approaches: region-based, FCN-based, and weakly supervised. | [50,63,86,87,90–92,97–107]
Semantic segmentation, focusing on medical image analysis and DCNNs. | [108–110]
to automatically classify images into predefined categories. For decades, researchers have
developed advanced techniques to improve the quality of classification. Traditionally,
classification models performed well only on small datasets, such as CIFAR-10 [116] and
MNIST [117]. The biggest leap forward in the development of image classification occurred
when the large-scale image dataset “ImageNet” was created by Fei-Fei Li in 2009 [106].
An equally important and challenging task in computer vision is object detection,
which involves identifying and localizing objects from either a large number of predefined
categories in natural images or for a specific object. Object detection and image classification
face a similar technological challenge: both need to handle a wide variety of objects.
However, object detection is more challenging compared to image classification because it
requires identifying the exact target object being searched for [19]. Most research efforts
have focused on detecting a single class of object data, such as pedestrians or faces, by
designing a set of suitable features. In these studies, objects are detected using a set
of predefined patterns, where the features correspond to a location in the image or a
feature pyramid.
Object classification identifies the objects present in the visual scene, whereas object
detection reveals their locations. Object segmentation is defined as pixel-level categorization, aiming to divide an image into significant regions by classifying each pixel into a specific class. In classical object segmentation, the method of uncontrolled
merging and region segmentation has been extensively investigated based on clustering,
general feature optimization, or user intervention. It is divided into two primary branches
based on object partitioning. In the first branch, semantic segmentation is employed, where
each pixel corresponds to a semantic object classification. In the second branch, instance
segmentation is utilized, providing different labels for different object instances as a further
improvement of semantic segmentation [19].
In [72], the authors presented a comprehensive survey of the literature on human ac-
tion recognition, with a specific focus on the fusion of vision and inertial sensing modalities.
The surveyed papers were categorized based on fusion approaches, features, classifiers,
and multimodality datasets. The authors also addressed challenges in real-world deploy-
ment and proposed future research directions. The work contributed a thorough overview,
categorization, and insightful discussions of the fusion-based approach for human ac-
tion recognition.
The authors of [118] evaluated some Kinect-based algorithms for human action recog-
nition using multiple benchmark datasets. Their findings revealed that most methods
excelled in cross-subject action recognition compared to cross-view action recognition.
Additionally, skeleton-based features exhibited greater resilience in cross-view recognition,
while deep learning features were well-suited for large datasets.
The authors of [119] offered a comprehensive review of recent advancements in
human action recognition systems. They introduced hand-crafted representation-based
methods, as well as deep learning-based approaches, for this task. A thorough analysis
and a comparison of these methods and datasets used in human action recognition were
presented. Furthermore, the authors suggested potential future research directions in
the field.
In [120], a comprehensive review of recent progress made in semantic segmentation
was presented. The authors specifically examined and compared three categories of meth-
ods: those relying on hand-engineered features, those leveraging learned features, and
those utilizing weakly supervised learning. The authors presented the descriptions, as
well as a comparison, of prominent datasets used in semantic segmentation. Furthermore,
they conducted a series of comparisons between various semantic segmentation models to
showcase their respective strengths and limitations.
In [121], a comprehensive examination of semantic segmentation techniques employ-
ing deep neural networks was presented. The authors thoroughly analyzed the leading
approaches in this field, highlighting their strengths, weaknesses, and key challenges.
They concluded that deep convolutional neural networks have demonstrated remarkable
(I ∗ K)(x, y) = Σ_i Σ_j I(x + i, y + j) · K(i, j)

where I is the input image, K is the convolutional kernel, (x, y) represents the spatial position in the output feature map, and (i, j) iterates over the kernel dimensions.
2. Activation Functions: Activation functions, such as the Rectified Linear Unit (ReLU),
introduce non-linearity into the network. The ReLU function is defined as:
f(x) = max(0, x)
and is applied to the output of the convolutional and fully connected layers.
3. Pooling: Pooling layers reduce the spatial dimensions of feature maps. Max pooling,
for example, retains the maximum value in a specified window. Mathematically, it
can be represented as:
P(x, y) = max_{(i, j) ∈ W(x, y)} I(i, j)

where P is the pooled output, I is the input feature map, and W(x, y) is the pooling window associated with output position (x, y).
4. Fully Connected Layers: In the final layers of a CNN, fully connected layers perform
traditional neural network operations. A fully connected layer computes the weighted
sum of all inputs and passes it through an activation function, often a softmax, for
classification tasks.
5. Backpropagation: The training of CNNs relies on backpropagation, a mathematical
process for adjusting network weights and biases to minimize a loss function. This
process involves the chain rule to compute gradients and update model parameters.
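A minimal NumPy sketch of steps 1–3 above (convolution, ReLU, max pooling); the kernel, image size, and function names are illustrative placeholders.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (correlation form), matching the sum in step 1."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            out[x, y] = np.sum(image[x:x + kh, y:y + kw] * kernel)
    return out

def relu(x):
    """Step 2: element-wise ReLU non-linearity, f(x) = max(0, x)."""
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Step 3: non-overlapping max pooling over size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# One forward pass of the three building blocks on a random 8x8 "image".
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)     # simple vertical-edge filter
feature_map = max_pool(relu(conv2d(image, edge_kernel)))
print(feature_map.shape)                           # (3, 3)
```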
h_t = f(W_h · h_{t−1} + W_x · x_t + b_h)

where h_t is the hidden state at time step t; f is the activation function, typically the hyperbolic tangent (tanh) or sigmoid; W_h and W_x are the weight matrices; and b_h is the bias term.
2. Output Calculation: The output at each time step can be computed based on the
current hidden state. For regression tasks, the output y_t is often calculated as:

y_t = W_y · h_t + b_y

where y_t is the output at time step t, W_y is the weight matrix for the output, and b_y is the bias term.
3. Backpropagation Through Time (BPTT): RNNs are trained using the backpropa-
gation through time (BPTT) algorithm, which is an extension of backpropagation. BPTT
calculates gradients for each time step and updates the network’s weights and biases
accordingly.
RNNs are well-suited for sequence data, time-series analysis, and natural language
processing tasks. They can capture dependencies and contexts in sequential information,
making them a valuable tool in machine learning and deep learning.
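The hidden-state and output equations above can be illustrated with a short NumPy forward pass; the dimensions and initialization below are arbitrary placeholders.

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, W_y, b_h, b_y):
    """Unrolled forward pass of a vanilla RNN, following the equations above.

    x_seq: sequence of input vectors, shape (T, input_dim).
    Returns the output y_t for every time step.
    """
    h = np.zeros(W_h.shape[0])                     # initial hidden state h_0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)     # hidden-state update
        outputs.append(W_y @ h + b_y)              # output at this time step
    return np.array(outputs)

# Toy dimensions: 5 time steps, 3 input features, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 1
x_seq = rng.normal(size=(T, d_in))
y = rnn_forward(
    x_seq,
    W_h=rng.normal(size=(d_h, d_h)) * 0.1,
    W_x=rng.normal(size=(d_h, d_in)) * 0.1,
    W_y=rng.normal(size=(d_out, d_h)) * 0.1,
    b_h=np.zeros(d_h),
    b_y=np.zeros(d_out),
)
print(y.shape)   # (5, 1): one output per time step
```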
min_G max_D V(D, G) = E_{x∼p_data(x)}[log(D(x))] + E_{z∼p(z)}[log(1 − D(G(z)))]

where p_data(x) is the real data distribution, p(z) is the prior distribution of the noise, and E denotes the expectation.
4. Optimal Generator: At optimality, the generator produces samples that are indistin-
guishable from real data, meaning D ( G (z)) = 0.5. This occurs when the objective function
V ( D, G ) reaches its global minimum.
5. Training: GANs are trained using techniques like stochastic gradient descent.
The generator updates its parameters to minimize the objective function, whereas the
discriminator updates its parameters to maximize it.
6. Generated Data: The generator produces synthetic data samples G (z) that closely
resemble real data.
GANs are widely used in various applications, including image generation, style
transfer, and data augmentation.
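A minimal PyTorch sketch of the adversarial training loop implied by the objective above, using toy 1-D data; note that the generator update below uses the common non-saturating variant (maximizing log D(G(z))) rather than directly minimizing log(1 − D(G(z))), and all architectures and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(64, data_dim) * 0.5 + 2.0        # toy "real" data distribution
    z = torch.randn(64, latent_dim)                     # noise drawn from the prior p(z)
    fake = G(z)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating variant): push D(G(z)) toward 1.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()

# Generated samples should drift toward the real distribution (mean ~[2.0, 2.0]).
print(G(torch.randn(256, latent_dim)).mean(dim=0).detach())
```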
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

where σ is the sigmoid activation function and ⊙ represents element-wise multiplication.
3. Hidden State (h_t): The hidden state is derived from the cell state and is updated using the output gate:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
4. Training: LSTMs are trained using the backpropagation through time (BPTT) and
gradient descent algorithms. The gradients are computed with respect to the cell state,
hidden state, and parameters.
LSTMs are known for their ability to capture long-term dependencies and are widely
used in natural language processing, speech recognition, and various sequential data tasks.
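The gate equations above translate directly into a single LSTM time step; the following NumPy sketch uses illustrative dimensions and randomly initialized parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step implementing the gate equations above.

    W and b hold the parameters for the forget (f), input (i), candidate (C),
    and output (o) computations, each acting on the concatenation [h_prev, x_t].
    """
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # new cell state (⊙ = elementwise *)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

# Toy example: 4 hidden units, 3 input features, one time step.
rng = np.random.default_rng(0)
d_h, d_in = 4, 3
W = {k: rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, C.shape)   # (4,) (4,)
```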
In Table 8, we can see a detailed comparison of deep learning algorithms and methods
and their integration in robotic vision.
halls, enhancing the visitor experience. In the context of category-level image classification,
the use of spatial pyramids based on 3D scene geometry has been proposed to improve
classification accuracy [144]. Data fusion techniques with redundant sensors have been used
to boost robotic navigation. Big data and AI have been used to optimize communication
and navigation within robotic swarms in complex environments. They have also been
applied in robotic platforms for navigation and object tracking using redundant sensors
and Bayesian fusion approaches [145]. Additionally, the combination of big data analysis
and robotic vision has been used to develop intelligent calculation methods and devices for
human health assessment and monitoring [146].
diverse users within a decentralized system. Additionally, they introduced the Context Optimization (CoOp) method for fine-tuning pre-trained vision-language models.
5. Hubel and Wiesel’s Electrophysiological Insights, Van Essen’s Map of the Brain, and
Their Impact on Robotic Vision
5.1. Hubel and Wiesel’s Contribution
Deep learning’s impact on robotic vision connects insights from neuroscience and com-
puter science. Hubel and Wiesel’s electrophysiological research revealed the fundamental
mechanisms of human visual perception, laying the foundation for understanding how
neural networks process visual information in deep learning. Similarly, Van Essen’s brain
map serves as a critical reference for comprehending neural pathways and functions, eluci-
dating connections within the visual cortex for developing deep learning algorithms. The
synergy between neuroscientific revelations and computer science has redefined robotic
vision. Deep learning algorithms, inspired by the neural architectures discovered by Hubel
and Wiesel and refined through insights from Van Essen’s map, have empowered robots
to decipher visual data with precision. This fusion of understanding and innovation has
accelerated the development of autonomous robots capable of perceiving, interpreting, and
reacting to their surroundings. By embracing the neural foundations of visual perception,
deep learning has surpassed human abilities in specific visual tasks, allowing robots to navigate, interact, and
make knowledgeable decisions.
Hubel and Wiesel’s groundbreaking contributions in their electrophysiological studies bridged neuroscience, artificial neural networks (ANNs), and computer vision, shaping the
very foundation of modern AI. Their exploration of the cat and monkey visual systems
unearthed fundamental insights into sensory processing, establishing vital connections be-
tween biological mechanisms and computational paradigms. Understanding the receptive
fields of cells in the cat’s striate cortex shed light on brain visual processing. The authors
of [156] enriched the comprehension of visual pathways from the retina to the cortex, influ-
encing perception. Notably, moving stimuli trigger robust responses, suggesting motion’s
key role in cortical activation. This insight has led to advances in fields like computer
vision and robotics, refining motion detection. Specific shapes, sizes, and orientations that
activate cortical cells have impacted experimental design. Moreover, intricate properties
within the striate cortex units hint at deeper complexities necessitating exploration. Such
insights contribute to a holistic understanding of the brain’s visual processing mechanisms.
Studies of the cat’s visual cortex revealed receptive fields more complex than those found at lower levels of the visual pathway, along with binocular interaction. By recording from individual cells with micro-electrodes and correlating responses with cell location, this approach overcame the limitations of slow-wave recording and enhanced the understanding of functional anatomy in smaller cortical areas [157–159].
Hubel and Wiesel’s pioneering revelation of “feature detectors” is another cornerstone
that resonates within ANNs and computer vision. These specialized neurons, responsive to
distinct visual attributes, resemble the artificial neurons that define the core architecture of
ANNs. Just as Hubel and Wiesel studied layers of neurons processing features like edges,
ANNs harness a similar hierarchy to progressively grasp more complex patterns, enriching
our understanding of both brain and machine vision. Moreover, Hubel and Wiesel’s dis-
covery of “ocular dominance columns” and “orientation columns” mirrors the hierarchical
arrangement of ANNs, creating structured systems for pattern recognition. The layer-wise
organization they elucidated forms the multi-layer architecture of ANNs, maximizing
their capacity to decipher complex data patterns. Hubel and Wiesel’s legacy also extends
to computer vision, infusing it with a deeper understanding of visual processing. Their
identification of critical periods in visual development aligns with the iterative “training”
stages of ANNs. By synthesizing their discoveries, ANNs can autonomously learn and
recognize complex patterns from images, revolutionizing fields like image classification,
object detection, and facial recognition.
Many research papers have built upon the contributions of Hubel and Wiesel. Here,
we examine a few of these papers. The VLSI binocular vision system simulates the primary
visual cortex disparity computation in robotics and computer vision [160]. It employs
silicon retinas, orientation chips, and an FPGA, enabling real-time disparity calculation
with minimal hardware. Complex cell responses and a disparity map assist in depth
perception and 3D reconstruction. This blend of analog and digital circuits ensures efficient
computation. However, the authors solely addressed the primary visual cortex disparity
emulation, overlooking other visual aspects. In [161], the authors introduced a practical
vergence eye control system for binocular robot vision. The system is rooted in the primary
visual cortex (V1) disparity computation and comprises silicon retinas, simple cell chips,
and an FPGA. Silicon retinas mimic vertebrate retinal fields, while simple cell chips emulate
orientation-selective fields like Hubel and Wiesel’s model. The system generates real-
time complex cell outputs for five disparities, enabling reliable vergence movement, even
in complex scenarios. This development has paved the way for accurate eye control
in binocular robot vision, with potential applications in robotics, computer vision, and
AI. In [162], the authors introduced a hierarchical machine vision system based on the
primate visual model, thereby enhancing pattern recognition in machines. It involves
invariance transforms and an adaptive resonance theory network, focusing on luminance,
not color, motion, or depth. The system mirrors network-level biological processes, without
biochemical simulation. This system can enhance machine vision algorithms, aiding tasks
like object recognition and image classification.
The authors of [163] studied visual mechanisms like automatic gain control and
nonuniform processing. They suggested that these biological processes, if applied to
machine vision, could reduce data and enhance computational efficiency, particularly
in wide-view systems. Implementing these mechanisms could boost machine vision’s
processing power and effectiveness. In [164], the growth of cognitive neuroscience and the
merging of psychology and neurobiology were explored. In addition, the authors examined
memory, perception, action, language, and awareness, bridging behavior and brain circuits.
Cognitive psychologists emphasized information flow and internal representations. The
authors also touched on the molecular aspects of memory, delving into storage and neural
processes, and underscored the progress in memory research within cognitive neuroscience
and the value of comprehending both behavioral and molecular memory facets. The
authors of [165] explored how the human visual cortex processes complex visual stimuli.
They discussed the event-related potentials (ERPs) generated when viewing faces, objects,
and letters. Specific ERPs revealed different stages of face processing. The study revealed
distinct regions used for the recognition of objects and letters, along with bilateral and right
hemisphere-specific face activity. These findings have enhanced our understanding of the
neural mechanisms involved in face perception and object recognition in the human brain.
Individuals with autism exhibit challenges in recognizing faces, often due to reduced
attention to eyes and unusual processing methods [166]. Impairments start early, at around
3 years old, affecting both structural encoding and recognition memory stages. Electro-
physiological studies have highlighted disruptions in the face-processing neural systems
from an early age that persist into adulthood. Slower face processing has been linked to
more severe social issues. Autism also impacts the brain’s specialization for face process-
ing. These insights have deepened our comprehension of social cognition impairments in
autism, aiding early identification and interventions. Other research papers on the use of
machine learning methods for classifying autism include [167–179].
Table 10 shows the main articles discussing the above methods, while their character-
istics, including their advantages, disadvantages/limitations, and complexities, are given
in Appendix C.
Summary | References
Discuss Hubel and Wiesel’s electrophysiological studies connecting neuroscience, ANNs, and computer vision. | [156–159]
Build upon Hubel and Wiesel’s work, exploring VLSI binocular vision systems, practical vergence eye control systems, hierarchical machine vision systems, and visual mechanisms. | [160–163]
Explores cognitive neuroscience by merging psychology and neurobiology, with a focus on memory, perception, action, language, and awareness. | [164]
Explore how the human visual cortex processes complex stimuli, revealing distinct regions for object and letter recognition and face processing. Individuals with autism face challenges in recognizing faces, with disruptions in neural systems linked to social issues. | [165,166]
Machine learning methods for classifying autism. | [167–179]
Table 11. Methods related to Van Essen’s functional mapping of the brain.
Summary | Reference
Outlines cortical areas tied to vision and other senses and presents a database of connectivity patterns. | [180]
Explores surface-based visualization for mapping the cerebral cortex’s functional specialization. | [181]
Reveals the brain’s activation–deactivation balance during tasks, showcasing ongoing brain organization and supporting an understanding of neural fluctuations’ impact on function. | [182]
Presents a comprehensive map of the human cerebral cortex’s divisions, identifies new areas, and develops a machine learning classifier for automated identification. | [183]
6. Discussion
In machine vision, there exist numerous contemporary technologies pertaining to
pattern recognition, each harboring its own merits and demerits. Presented below are
several recent technologies alongside their respective advantages and disadvantages.
Deep learning leverages neural networks comprising multiple layers to extract intricate
and high-level features from data. Remarkable achievements have been witnessed in
diverse pattern recognition tasks through deep learning, such as image classification, object
detection, face recognition, and semantic segmentation, among others. Deep learning
possesses certain advantages: it can autonomously learn from extensive datasets without
substantial human intervention or feature engineering; it can adeptly capture non-linear
and hierarchical relationships within the data; and it can reap the benefits of hardware and
software advancements like GPUs and frameworks. However, deep learning also entails
certain drawbacks: it demands substantial computational resources and time for training
and deployment; it may be susceptible to issues of overfitting or underfitting, hinging
upon network architecture selection, hyperparameter tuning, regularization techniques,
and more; it may lack interpretability and explainability concerning learned features and
decisions; and it may prove vulnerable to adversarial attacks or data poisoning.
lored to distinct pattern recognition tasks and datasets. A universal or optimal solution to
this quandary remains elusive, often necessitating trial-and-error or heuristic approaches.
Another challenge lies in ensuring the robustness and dependability of learned models,
particularly when deployed in real-world scenarios. Numerous factors can influence model
performance and behavior, such as data quality, distribution shifts, and adversarial ex-
amples. Lastly, deep learning confronts the task of enhancing the interpretability and
explainability of learned features and decisions, particularly when faced with intricate and
high-dimensional data. Striking a balance between model accuracy and interpretability
poses a challenge, as comprehending the reasoning behind model predictions or classifica-
tions is no easy feat.
Other challenges in using deep learning methods for robotic vision include the com-
plexity and entanglement of optical parameters in wide-angle systems, which require
data-driven prediction models to overcome. Another challenge is the need for robust 3D
object detection, which is crucial for decision making in autonomous intelligent systems.
Although deep learning has shown potential in this area, a lack of critical review and
comparison among various methods makes it challenging to select the most suitable ap-
proach for specific applications. Achieving non-adversarial robustness in deep learning
models is also challenging, as it is difficult to predict the types of distribution shifts that
may occur. Researchers have proposed various approaches to address this challenge, but
there is a need for further improvement and evaluation of model performance under data
distribution shifts. Additionally, applying visual algorithms developed from computer
vision datasets to robotic vision poses unique challenges due to the assumption of fixed
categories and time-invariant task distributions.
Although big data holds immense potential for pattern recognition in machine vision,
it possesses certain limitations that merit consideration. One limitation is the potential
unavailability or inaccessibility of big data for analysis. Legal or ethical regulations may
restrict access to certain data sources, such as personal or medical data. Additionally, data
obtained from user-generated or crowd-sourced platforms can be unreliable or incomplete.
Another limitation arises from the incompatibility or inconsistency of data derived from
diverse modalities or domains. Another facet of big data’s limitations lies in its varying
usefulness and informativeness for pattern recognition in machine vision. Redundant or ir-
relevant data, such as noisy or corrupted samples, may hinder effective analysis. Moreover,
biased or unrepresentative data, including imbalanced or skewed datasets, can undermine
the accuracy of pattern recognition models. Furthermore, misleading or deceptive data,
such as manipulated or fabricated information, can introduce additional challenges.
Federated learning presents several challenges that warrant attention. One challenge
revolves around effectively coordinating and synchronizing model updates from diverse
clients in a distributed and dynamic environment. Communication and computation
efficiency within federated learning are influenced by various factors such as network
latency, bandwidth, connectivity, and heterogeneity. Another challenge lies in striking
a balance between model privacy and accuracy. Different privacy protection levels and
methods, including differential privacy, secure aggregation, and encryption, exist in fed-
erated learning. However, these methods may introduce noise or distortion into model
updates, potentially degrading accuracy or convergence. A third challenge pertains to
addressing non-iidness and data imbalance among clients. Variations in data distributions
or characteristics stemming from client preferences, behaviors, or contexts may arise. This
imbalance can result in certain clients exerting greater influence or weight on the model,
leading to suboptimal generalization or fairness.
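As a concrete illustration of aggregating client updates, the following sketch shows a FedAvg-style weighted average of client parameters, where clients with more local data contribute proportionally more; the function and data are illustrative and do not address the privacy or non-IID issues discussed above.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model parameters into a global model (FedAvg-style).

    client_weights: list of parameter vectors, one per client.
    client_sizes:   number of local training samples per client, so that
                    clients with more data carry proportionally more weight.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    stacked = np.stack(client_weights)              # (num_clients, num_params)
    return (coeffs[:, None] * stacked).sum(axis=0)  # weighted average of parameters

# Three clients with unequal data: the largest client dominates the average.
updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
global_model = federated_average(updates, client_sizes=[10, 30, 60])
print(global_model)   # [0.4 0.6]
```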
7. Conclusions
Machine vision is arguably the most crucial pillar for supporting and creating func-
tional artificial intelligence. Vision, as one of the five senses, plays a key role in proper
sensory perception among humans and has significantly influenced our social and tech-
nological evolution. Considering this, the academic and research community is investing
tremendous efforts to pave new paths in machine vision development and optimize existing
algorithms and methods.
In this paper, we presented a comprehensive overview of the key elements that
constitute machine vision and the technologies that enhance its performance. We discussed
innovative scientific methods extensively utilized in the broad field of AI in recent years,
along with their advantages and limitations.
Specific attention and research focus must be directed toward understanding the as-
pects of how the human brain recognizes and categorizes objects. This knowledge can then
be transferred to robotic models. Robotic vision, coupled with robotic touch, presents a
significant challenge in robotics. Achieving a robotic hand that adapts to tasks in a manner
similar to the human hand’s behavior will greatly contribute to the evolution of the mecha-
tronics scientific field and bring us closer to achieving AI with human-like characteristics.
Such AI systems would be capable of successfully performing arduous, repetitive, and
hazardous tasks that pose challenges for humans. Moreover, they would have the ability to
tackle complex problems across various domains, ranging from astronomy to biomedicine.
It is crucial to note that careful attention should be given to the subsequent steps of
technological development. Establishing an appropriate regulatory framework is necessary
to ensure responsible management of these new findings and experiments by countries
worldwide, thereby mitigating any potential adverse effects on humans. We are currently
experiencing a period of significant change, often referred to as a new Technological and
Industrial Revolution, which may rival, if not surpass, the transformative impact of the
Internet. Therefore, the forthcoming steps are pivotal for human evolution, as they will
shape the trajectory of our species.
In conclusion, machine vision has made remarkable progress in replicating human
visual perception in computers. This survey provided a comprehensive overview of robot
vision, with a detailed review of papers published in the past 3–5 years. Because of its
interdisciplinary nature and integration with computer science, mathematics, and robotics,
machine vision has become widely used in daily gadgets. The subject has advanced
significantly thanks, in large part, to deep learning.
Future research directions for using deep learning methods in robotic vision include
addressing challenges such as insufficient and inaccurate annotations, recognizing pathol-
ogy images with different data distributions, and training AI models based on decentralized
data sources. Another important area of research is the development of self-supervised
learning methods and domain-adaptation techniques for medical image analysis, which
can help overcome the limitations of labeled data. Additionally, there is a need for compre-
hensive analysis and validation of 3D object detection methods using benchmark datasets
and validation matrices. Furthermore, exploring the applications of deep learning algo-
rithms and deep nets in various areas of robot vision, such as image segmentation and
drug detection, is an important research direction. Overall, the field of robotic vision is
constantly evolving, and future research should focus on improving the performance and
automation capabilities of deep learning-based systems.
Future pathways for computer vision could include:
• Improved object detection: Overcoming challenges with small or occluded objects.
• Real-time 3D reconstruction: Creating 3D models of environments in real time.
• Automated image labeling: Automatically tagging images with descriptive and accu-
rate labels.
• Visual reasoning and understanding: Developing algorithms that can reason and
make decisions based on visual input.
• Robustness to adversarial attacks: Creating computer vision models that are robust
to adversarial attacks, suitable for security applications, and capable of preventing
image manipulation.
• Integration with other technologies: Finding new ways to integrate computer vision
with other technologies, such as robotics, virtual reality, and augmented reality.
• Improved facial recognition: Developing more accurate and reliable methods for facial
recognition that can be used in security and identification applications.
• More efficient deep learning models: Developing deep learning models that require
less computation and can run faster on mobile devices.
• Enhanced video analysis: Improving the ability of computer vision to analyze video
data, including object tracking and activity recognition.
• Expanding applications: Finding new and innovative ways to apply computer vision
technology in fields such as healthcare, agriculture, and transportation.
• Detection of hidden/camouflaged objects: detecting objects that are intentionally designed to blend into their environments [184] will remain a challenging task, with applications in surveillance, the monitoring of natural environments, biology, and potentially military settings.
• Depth perception and 3D object detection are also very interesting, as they have
applications in depth perception, navigation, action recognition, and more [185,186].
This topic was also identified as a future challenge in [187].
• Finally, emergency rescue missions would be a highly impactful application to con-
sider [188].
Author Contributions: Conceptualization, N.M. and G.F.F.; methodology, N.M.; validation, L.M. and
G.S.M.; writing—original draft preparation, N.M.; writing—review and editing, L.M., G.S.M. and
G.F.F.; supervision, G.F.F. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: No data available.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
Table A1. Characteristics of methods presented in Table 5.
Method | Performance and Advantages | Disadvantages and Limitations | Complexities
Appendix B
Appendix C
Appendix D
References
1. Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends,
applications, and datasets. Vis. Comput. 2021, 38 , 2939–2970. [CrossRef] [PubMed]
2. Robinson, N.; Tidd, B.; Campbell, D.; Kulić, D.; Corke, P. Robotic Vision for Human-Robot Interaction and Collaboration: A
Survey and Systematic Review. ACM Trans. Hum.-Robot. Interact. 2023, 12, 1–66. [CrossRef]
3. Anthony, E.J.; Kusnadi, R.A. Computer Vision for Supporting Visually Impaired People: A Systematic Review. Eng. Math.
Comput. Sci. (Emacs) J. 2021, 3, 65–71. [CrossRef]
4. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput.
Intell. Neurosci. 2018, 2018, 7068349. [CrossRef] [PubMed]
5. Gupta, A.; Anpalagan, A.; Guan, L.; Khwaja, A.S. Deep Learning for Object Detection and Scene Perception in Self-Driving Cars:
Survey, Challenges, and Open Issues. Array 2021, 10, 100057. [CrossRef]
6. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Sánchez, C.I. A Survey on Deep Learning in Medical
Image Analysis. Med. Image Anal. 2017, 42, 60–88. [CrossRef]
7. Huang, H.; Yu, P.S.; Wang, C. An Introduction to Image Synthesis with Generative Adversarial Nets. arXiv 2018, arXiv:1803.04469.
8. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to Train Your Robot with Deep Reinforcement Learning:
Lessons We Have Learned. Int. J. Robot. Res. 2021, 40, 698–721. [CrossRef]
9. Ganesh, D.; Teja, R.R.; Reddy, C.D.; Swathi, D. Human Action Recognition based on Depth maps, Skeleton and Sensor Images
using Deep Learning. In Proceedings of the 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore,
India, 7–9 October 2022. [CrossRef]
10. Wu, D.; Sharma, N.; Blumenstein, M. Recent advances in video-based human action recognition using deep learning: A review.
In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017;
pp. 2865–2872.
11. Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep learning for sensor-based human activity recognition: Overview,
challenges, and opportunities. Acm Comput. Surv. (Csur) 2021, 54, 1–40. [CrossRef]
12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870.
13. Jin, L.; Zhu, Z.; Song, E.; Xu, X. An Effective Vector Filter for Impulse Noise Reduction Based on Adaptive Quaternion Color
Distance Mechanism. Signal Process. 2019, 155, 334–345. [CrossRef]
14. Chernov, V.; Alander, J.; Bochko, V. Integer-Based Accurate Conversion between RGB and HSV Color Spaces. Comput. Electr. Eng.
2015, 46, 328–337. [CrossRef]
15. Tsapatsoulis, N. Digital Image Processing Lecture Notes. 2023. Available online: https://www.studocu.com/in/document/
jawaharlal-nehru-technological-university-hyderabad/ece/digital-image-processing-lecture-notes-2022-2023/56139343 (ac-
cessed on 10 March 2023).
16. Arunpandian, M.; Arunprasath, T.; Vishnuvarthanan, G.; Rajasekaran, M.P. Thresholding Based Soil Feature Extraction from
Digital Image Samples–A Vision Towards Smarter Agrology. In Information and Communication Technology for Intelligent Systems
(ICTIS 2017)-Volume 1; Springer: Berlin/Heidelberg, Germany, 2018; pp. 458–465.
17. Kolkur, S.; Kalbande, D.; Shimpi, P.; Bapat, C.; Jatakia, J. Human Skin Detection Using RGB, HSV and YCbCr Color Models. arXiv
2017, arXiv:1708.02694.
18. BlackIce. HSI Color Conversion—Imaging Toolkit Feature. 2023. Available online: https://www.blackice.com/colorspaceHSI.
htm (accessed on 10 March 2023).
19. Feng, X.; Jiang, Y.; Yang, X.; Du, M.; Li, X. Computer vision algorithms and hardware implementations: A survey. Integration
2019, 69, 309–320. [CrossRef]
20. Boykov, Y.; Funka-Lea, G. Graph Cuts and Efficient ND Image Segmentation. Int. J. Comput. Vis. 2006, 70, 109–131. [CrossRef]
21. Pizurica, A.; Philips, W.; Lemahieu, I.; Acheroy, M. A Joint Inter-and Intrascale Statistical Model for Bayesian Wavelet Based
Image Denoising. IEEE Trans. Image Process. 2002, 11, 545–557. [CrossRef] [PubMed]
22. Shi, X.; Li, Y.; Zhao, Q. Flexible Hierarchical Gaussian Mixture Model for High-Resolution Remote Sensing Image Segmentation.
Remote Sens. 2020, 12, 1219. [CrossRef]
23. Wang, X.F.; Huang, D.S.; Xu, H. An Efficient Local Chan–Vese Model for Image Segmentation. Pattern Recognit. 2010, 43, 603–618.
[CrossRef]
24. Bresson, X.; Esedoḡlu, S.; Vandergheynst, P.; Thiran, J.P.; Osher, S. Fast Global Minimization of the Active Contour/Snake Model.
J. Math. Imaging Vis. 2007, 28, 151–167. [CrossRef]
25. Aytaç, E. Unsupervised Learning Approach in Defining the Similarity of Catchments: Hydrological Response Unit Based k-Means
Clustering, a Demonstration on Western Black Sea Region of Turkey. Int. Soil Water Conserv. Res. 2020, 8, 321–331. [CrossRef]
26. Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.K.; Ren, F. Learning in the Frequency Domain. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1740–1749.
27. Dubes, R.C.; Jain, A.K.; Nadabar, S.G.; Chen, C.C. MRF Model-Based Algorithms for Image Segmentation. In Proceedings of the
10th International Conference on Pattern Recognition, Atlantic City, NJ, USA, 16–21 June 1990; Volume 1, pp. 808–814.
28. Bleau, A.; Leon, L.J. Watershed-Based Segmentation and Region Merging. Comput. Vis. Image Underst. 2000, 77, 317–370.
[CrossRef]
29. Wu, Z.; Gao, Y.; Li, L.; Xue, J.; Li, Y. Semantic Segmentation of High-Resolution Remote Sensing Images Using Fully Convolutional
Network with Adaptive Threshold. Connect. Sci. 2019, 31, 169–184. [CrossRef]
30. Gout, C.; Le Guyader, C.; Vese, L. Segmentation under Geometrical Conditions Using Geodesic Active Contours and Interpolation
Using Level Set Methods. Numer. Algorithms 2005, 39, 155–173. [CrossRef]
31. Das, P.; Das, A. A Fast and Automated Segmentation Method for Detection of Masses Using Folded Kernel Based Fuzzy C-Means
Clustering Algorithm. Appl. Soft Comput. 2019, 85, 105775. [CrossRef]
32. Ziou, D.; Tabbone, S. Edge Detection Techniques: An Overview. Pattern Recognit. Image Anal. C/C Raspoznavaniye Obraz. Anal. Izobr. 1998, 8, 537–559.
33. Kurak, C.W., Jr.; McHugh, J. A Cautionary Note on Image Downgrading. In Proceedings of the Annual Computer Security
Applications Conference, San Antonio, TX, USA, 30 November–4 December 1992; pp. 153–159.
34. Hussin, R.; Juhari, M.R.; Kang, N.W.; Ismail, R.C.; Kamarudin, A. Digital Image Processing Techniques for Object Detection from
Complex Background Image. Procedia Eng. 2012, 41, 340–344. [CrossRef]
35. Cruz, D.J.; Amaral, R.L.; Santos, A.D.; Tavares, J.M.R. Application of Digital Image Processing Techniques to Detect Through-
Thickness Crack in Hole Expansion Test. Metals 2023, 13, 1197. [CrossRef]
36. Wang, R.; Lei, T.; Cui, R.; Zhang, B.; Meng, H.; Nandi, A.K. Medical Image Segmentation Using Deep Learning: A Survey. IET Image Process. 2022, 16, 1243–1267. [CrossRef]
37. Yadav, S.S.; Jadhav, S.M. Deep Convolutional Neural Network Based Medical Image Classification for Disease Diagnosis. J. Big
Data 2019, 6, 113. [CrossRef]
38. Giuliani, D. Metaheuristic Algorithms Applied to Color Image Segmentation on HSV Space. J. Imaging 2022, 8, 6. [CrossRef]
39. Mallick, S. Image Recognition and Object Detection: Part 1; Learn OpenCV. 2016. Available online: https://learnopencv.com/
image-recognition-and-object-detection-part1/ (accessed on 9 March 2023).
40. Xylourgos, N. Segmentation of Ultrasound Images for Finding Anatomical References. Bachelor’s Thesis, Technological
Educational Institute of Crete, Heraklion, Greece, 2009.
41. Nixon, M.; Aguado, A. Feature Extraction and Image Processing for Computer Vision; Academic Press: Cambridge, MA, USA, 2019.
42. Wang, P.; Fan, E.; Wang, P. Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and
Deep Learning. Pattern Recognit. Lett. 2021, 141, 61–67. [CrossRef]
43. Uddin, M.P.; Mamun, M.A.; Hossain, M.A. PCA-based Feature Reduction for Hyperspectral Remote Sensing Image Classification. IETE Tech. Rev. 2021, 38, 377–396. [CrossRef]
44. Wan, H.; Wang, H.; Scotney, B.; Liu, J. A Novel Gaussian Mixture Model for Classification. In Proceedings of the 2019 IEEE
International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3298–3303.
45. Haji, S.H.; Abdulazeez, A.M. Comparison of Optimization Techniques Based on Gradient Descent Algorithm: A Review. Palarch’s
J. Archaeol. Egypt/Egyptol. 2021, 18, 2715–2743.
46. Chandra, M.A.; Bedi, S.S. Survey on SVM and Their Application in Image Classification. Int. J. Inf. Technol. 2021, 13, 1–11.
[CrossRef]
47. Frenkel, C.; Lefebvre, M.; Bol, D. Learning without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of
Deep Neural Networks. Front. Neurosci. 2021, 15, 629892. [CrossRef] [PubMed]
48. Zhao, X.; Huang, P.; Shu, X. Wavelet-Attention CNN for Image Classification. Multimed. Syst. 2022, 28, 915–924. [CrossRef]
49. Venkataramanan, A.; Benbihi, A.; Laviale, M.; Pradalier, C. Gaussian Latent Representations for Uncertainty Estimation Using
Mahalanobis Distance in Deep Classifiers. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
Waikoloa, HI, USA, 2–7 January 2023; pp. 4488–4497.
50. Tong, K.; Wu, Y.; Zhou, F. Recent Advances in Small Object Detection Based on Deep Learning: A Review. Image Vis. Comput.
2020, 97, 103910. [CrossRef]
51. Barazida, N. YOLOv6: Next Generation Object Detection—Review and Comparison. 2022. Available online: https://www.linkedin.com/posts/dagshub_yolov6-next-generation-object-detection-activity-6947577684583456768-06KJ?trk=public_profile_like_view (accessed on 9 March 2023).
52. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning Data Augmentation Strategies for Object Detection.
In Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham,
Switzerland, 2020; Volume 12372, pp. 566–583. [CrossRef]
53. Masita, K.L.; Hasan, A.N.; Shongwe, T. Deep Learning in Object Detection: A Review. In Proceedings of the 2020 International
Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa,
6–7 August 2020; pp. 1–11.
54. Shepley, A.J.; Falzon, G.; Kwan, P.; Brankovic, L. Confluence: A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11561–11574. [CrossRef]
55. Zhang, L.; Zhang, Y.; Zhang, Z.; Shen, J.; Wang, H. Real-Time Water Surface Object Detection Based on Improved Faster R-CNN.
Sensors 2019, 19, 3523. [CrossRef]
56. Zhong, F.; Quan, C. Stereo-Rectification and Homography-Transform-Based Stereo Matching Methods for Stereo Digital Image
Correlation. Measurement 2021, 173, 108635. [CrossRef]
57. Jin, S.; Liu, W.; Xie, E.; Wang, W.; Qian, C.; Ouyang, W.; Luo, P. Differentiable Hierarchical Graph Grouping for Multi-person Pose
Estimation. In Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing:
Cham, Switzerland, 2020; Volume 12352, pp. 718–734. [CrossRef]
58. Min, K.; Lee, G.H.; Lee, S.W. Attentional Feature Pyramid Network for Small Object Detection. Neural Netw. 2022, 155, 439–450.
[CrossRef] [PubMed]
59. Ciaparrone, G.; Sánchez, F.L.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep Learning in Video Multi-Object Tracking: A
Survey. Neurocomputing 2020, 381, 61–88. [CrossRef]
60. Mehul. Object Tracking in Videos: Introduction and Common Techniques. 2020. Available online: https://aidetic.in/blog/2020/10/05/object-tracking-in-videos-introduction-and-common-techniques/ (accessed on 11 March 2023).
61. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the
2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21
June 2012. [CrossRef]
62. Fathi, A.; Mori, G. Action recognition by learning mid-level motion features. In Proceedings of the 2008 IEEE Conference on
Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008. [CrossRef]
63. Huang, C.P.; Hsieh, C.H.; Lai, K.T.; Huang, W.Y. Human Action Recognition Using Histogram of Oriented Gradient of Motion
History Image. In Proceedings of the 2011 First International Conference on Instrumentation, Measurement, Computer,
Communication and Control, Beijing, China, 21–23 October 2011. [CrossRef]
64. Chun, S.; Lee, C. Human action recognition using histogram of motion intensity and direction from multiple views. IET Comput. Vis. 2016, 10, 250–257. [CrossRef]
65. Hassan, M.; Ahmad, T.; Farooq, A.; Ali, S.A.; Hassan, S.R.; Liaqat, N. A Review on Human Actions Recognition Using Vision Based Techniques. J. Image Graph. 2014, 2, 28–32. [CrossRef]
66. Al-Ali, S.; Milanova, M.; Al-Rizzo, H.; Fox, V.L. Human Action Recognition: Contour-Based and Silhouette-Based Approaches. In
Computer Vision in Control Systems-2; Springer International Publishing: Cham, Switzerland, 2014; pp. 11–47. [CrossRef]
67. Chang, M.J.; Hsieh, J.T.; Fang, C.Y.; Chen, S.W. A Vision-based Human Action Recognition System for Moving Cameras Through
Deep Learning. In Proceedings of the 2019 2nd International Conference on Signal Processing and Machine Learning, New York,
NY, USA, 27–29 November 2019. [CrossRef]
68. Chiang, M.L.; Feng, J.K.; Zeng, W.L.; Fang, C.Y.; Chen, S.W. A Vision-Based Human Action Recognition System for Companion
Robots and Human Interaction. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications
(ICCC), Chengdu, China, 7–10 December 2018. [CrossRef]
69. Hoshino, S.; Niimura, K. Robot Vision System for Real-Time Human Detection and Action Recognition. In Intelligent Autonomous
Systems 15; Springer International Publishing: Cham, Switzerland, 2018; pp. 507–519. [CrossRef]
70. Hoshino, S.; Niimura, K. Robot Vision System for Human Detection and Action Recognition. J. Adv. Comput. Intell. Intell. Inform.
2020, 24, 346–356. [CrossRef]
71. Chen, Q.; Tang, H.; Cai, J. Human Action Recognition Based on Vision Transformer and L2 Regularization. In Proceedings of
the 2022 11th International Conference on Computing and Pattern Recognition, New York, NY, USA, 17–19 November 2022.
[CrossRef]
72. Majumder, S.; Kehtarnavaz, N. Vision and Inertial Sensing Fusion for Human Action Recognition: A Review. IEEE Sens. J. 2021,
21, 2454–2467. [CrossRef]
73. Dai, C.; Liu, X.; Lai, J. Human Action Recognition Using Two-Stream Attention Based LSTM Networks. Appl. Soft Comput. 2020,
86, 105820. [CrossRef]
74. Jaouedi, N.; Boujnah, N.; Bouhlel, M.S. A New Hybrid Deep Learning Model for Human Action Recognition. J. King Saud Univ. Comput. Inf. Sci. 2020, 32, 447–453. [CrossRef]
75. Gu, Y.; Ye, X.; Sheng, W.; Ou, Y.; Li, Y. Multiple Stream Deep Learning Model for Human Action Recognition. Image Vis. Comput.
2020, 93, 103818. [CrossRef]
76. Singh, T.; Vishwakarma, D.K. A Deeply Coupled ConvNet for Human Activity Recognition Using Dynamic and RGB Images.
Neural Comput. Appl. 2021, 33, 469–485. [CrossRef]
77. Yilmaz, A.A.; Guzel, M.S.; Bostanci, E.; Askerzade, I. A Novel Action Recognition Framework Based on Deep-Learning and
Genetic Algorithms. IEEE Access 2020, 8, 100631–100644. [CrossRef]
78. Song, X.; Lan, C.; Zeng, W.; Xing, J.; Sun, X.; Yang, J. Temporal–Spatial Mapping for Action Recognition. IEEE Trans. Circuits Syst.
Video Technol. 2019, 30, 748–759. [CrossRef]
79. Chakraborty, S.; Mondal, R.; Singh, P.K.; Sarkar, R.; Bhattacharjee, D. Transfer Learning with Fine Tuning for Human Action
Recognition from Still Images. Multimed. Tools Appl. 2021, 80, 20547–20578. [CrossRef]
80. Guha, R.; Khan, A.H.; Singh, P.K.; Sarkar, R.; Bhattacharjee, D. CGA: A New Feature Selection Model for Visual Human Action
Recognition. Neural Comput. Appl. 2021, 33, 5267–5286. [CrossRef]
81. Forch, V.; Hamker, F.H. Recurrent Spatial Attention for Facial Emotion Recognition. In Proceedings of the Workshop Localize IT,
Chemnitz Linux-Tage, Chemnitz, Germany, 16–17 March 2019.
82. Schröder, E.; Braun, S.; Mählisch, M.; Vitay, J.; Hamker, F. Feature Map Transformation for Multi-Sensor Fusion in Object Detection
Networks for Autonomous Driving. In Proceedings of the Advances in Computer Vision: Proceedings of the 2019 Computer
Vision Conference (CVC), Las Vegas, NV, USA, 25–26 April 2019; pp. 118–131.
83. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef] [PubMed]
84. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of
the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich,
Germany, 5–9 October 2015; pp. 234–241.
85. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.
In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
86. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
[CrossRef]
87. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and
Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef] [PubMed]
88. Lundgren, A.V.A.; dos Santos, M.A.O.; Bezerra, B.L.D.; Bastos-Filho, C.J.A. Systematic Review of Computer Vision Semantic
Analysis in Socially Assistive Robotics. AI 2022, 3, 229–249. [CrossRef]
89. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the State-of-the-Art Technologies of Semantic Segmentation Based on Deep
Learning. Neurocomputing 2022, 493, 626–646. [CrossRef]
90. Wei, Y.; Xiao, H.; Shi, H.; Jie, Z.; Feng, J.; Huang, T.S. Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7268–7277.
91. Zhang, M.; Zhou, Y.; Zhao, J.; Man, Y.; Liu, B.; Yao, R. A Survey of Semi- and Weakly Supervised Semantic Segmentation of Images. Artif. Intell. Rev. 2020, 53, 4259–4288. [CrossRef]
92. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets
and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062.
93. Arbelaez, P.; Hariharan, B.; Gu, C.; Gupta, S.; Bourdev, L.; Malik, J. Semantic Segmentation Using Regions and Parts. In
Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
[CrossRef]
94. Tighe, J.; Lazebnik, S. Finding Things: Image Parsing with Regions and Per-Exemplar Detectors. In Proceedings of the 2013 IEEE
Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013. [CrossRef]
95. He, Y.; Chiu, W.C.; Keuper, M.; Fritz, M. STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017. [CrossRef]
96. Adamek, T.; O’Connor, N.E.; Murphy, N. Region-Based Segmentation of Images Using Syntactic Visual Features. In Proceedings
of the WIAMIS 2005—6th International Workshop on Image Analysis for Multimedia Interactive Services, Montreux, Switzerland,
13–15 April 2005.
97. Ji, J.; Lu, X.; Luo, M.; Yin, M.; Miao, Q.; Liu, X. Parallel Fully Convolutional Network for Semantic Segmentation. IEEE Access 2021, 9, 673–682. [CrossRef]
98. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [CrossRef]
99. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the 2015 IEEE
International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [CrossRef]
100. Dai, J.; He, K.; Sun, J. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation.
In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
[CrossRef]
101. Chen, L.; Wu, W.; Fu, C.; Han, X.; Zhang, Y. Weakly Supervised Semantic Segmentation with Boundary Exploration. In
Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 347–362.
102. Kwak, S.; Hong, S.; Han, B. Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network. In Proceedings of
the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
103. Ouassit, Y.; Ardchir, S.; Yassine El Ghoumari, M.; Azouazi, M. A Brief Survey on Weakly Supervised Semantic Segmentation. Int.
J. Online Biomed. Eng. 2022, 18, 83–113. [CrossRef]
104. Schmitt, M.; Prexl, J.; Ebel, P.; Liebel, L.; Zhu, X.X. Weakly Supervised Semantic Segmentation of Satellite Images for Land Cover
Mapping–Challenges and Opportunities. arXiv 2020, arXiv:2002.08254.
105. Gama, P.H.T.; Oliveira, H.; dos Santos, J.A.; Cesar, R.M., Jr. An overview on Meta-learning approaches for Few-shot Weakly-
supervised Segmentation. Comput. Graph. 2023, 113, 77–88. [CrossRef]
106. Wang, J.; Ma, Y.; Zhang, L.; Gao, R.X.; Wu, D. Deep learning for smart manufacturing: Methods and applications. J. Manuf. Syst.
2018, 48, 144–156. [CrossRef]
107. Zhang, D.; Han, J.; Cheng, G.; Yang, M.H. Weakly Supervised Object Localization and Detection: A Survey. IEEE Trans. Pattern
Anal. Mach. Intell. 2021, 44, 5866–5885. [CrossRef] [PubMed]
108. Azzi, Y.; Moussaoui, A.; Kechadi, M.T. Semantic Segmentation of Medical Images with Deep Learning: Overview. Med. Technol. J.
2020, 4, 568–575. [CrossRef]
109. Singh, R.; Rani, R. Semantic Segmentation using Deep Convolutional Neural Network: A Review. SSRN Electron. J. 2020, 1, 1–8. [CrossRef]
110. Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Semantic Segmentation With Context Encoding and Multi-Path Decoding. IEEE
Trans. Image Process. 2020, 29, 3520–3533. [CrossRef]
111. Miled, M.; Messaoud, M.A.B.; Bouzid, A. Lip reading of words with lip segmentation and deep learning. Multimed. Tools Appl.
2023, 82, 551–571. [CrossRef]
112. Gianey, H.K.; Khandelwal, P.; Goel, P.; Maheshwari, R.; Galhotra, B.; Singh, D.P. Lip Reading Framework using Deep Learning
and Machine Learning. In Advances in Data Science and Analytics: Concepts and Paradigms; Scrivener Publishing LLC: Beverly, MA,
USA, 2023; pp. 67–87.
113. Wu, Y.; Wang, D.H.; Lu, X.T.; Yang, F.; Yao, M.; Dong, W.S.; Shi, J.B.; Li, G.Q. Efficient Visual Recognition: A Survey on Recent
Advances and Brain-Inspired Methodologies. Mach. Intell. Res. 2022, 19, 366–411. [CrossRef]
114. Santosh, K.; Hegadi, R. Recent Trends in Image Processing and Pattern Recognition. In Proceedings of the Second International Conference, RTIP2R 2018, Solapur, India, 21–22 December 2018; Revised Selected Papers, Part I; Communications in Computer and Information Science; Springer Nature: Singapore, 2019.
115. Liu, H.; Yin, J.; Luo, X.; Zhang, S. Foreword to the Special Issue on Recent Advances on Pattern Recognition and Artificial
Intelligence. Neural Comput. Appl. 2018, 29, 1–2. [CrossRef]
116. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto,
ON, Canada, 2009.
117. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998,
86, 2278–2324. [CrossRef]
118. Wang, L.; Huynh, D.Q.; Koniusz, P. A Comparative Review of Recent Kinect-Based Action Recognition Algorithms. IEEE Trans.
Image Process. 2019, 29, 15–28. [CrossRef] [PubMed]
119. Al-Faris, M.; Chiverton, J.; Ndzi, D.; Ahmed, A.I. A Review on Computer Vision-Based Methods for Human Action Recognition.
J. Imaging 2020, 6, 46. [CrossRef]
120. Yu, H.; Yang, Z.; Tan, L.; Wang, Y.; Sun, W.; Sun, M.; Tang, Y. Methods and datasets on semantic segmentation: A review.
Neurocomputing 2018, 304, 82–103. [CrossRef]
121. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf.
Retr. 2017, 7, 87–93. [CrossRef]
122. Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321.
[CrossRef]
123. Sanjaya, Y.C.; Gunawan, A.A.S.; Irwansyah, E. Semantic Segmentation for Aerial Images: A Literature Review. Eng. Math. Comput. Sci. (EMACS) J. 2020, 2, 133–139. [CrossRef]
124. Chen, Y.L.; Cai, Y.R.; Cheng, M.Y. Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach. Machines
2023, 11, 275. [CrossRef]
125. Yadav, S.P.; Nagar, R.; Shah, S.V. Learning Vision-based Robotic Manipulation Tasks Sequentially in Offline Reinforcement
Learning Settings. arXiv 2023, arXiv:2301.13450.
126. Vuletić, J.; Polić, M.; Orsag, M. Robotic Strawberry Flower Treatment Based on Deep-Learning Vision. In Human-Friendly
Robotics 2022; Borja, P., Della Santina, C., Peternel, L., Torta, E., Eds.; Springer International Publishing: Cham, Switzerland, 2023;
Volume 26, pp. 189–204. [CrossRef]
127. Brogan, D.P.; DiFilippo, N.M.; Jouaneh, M.K. Deep Learning Computer Vision for Robotic Disassembly and Servicing Applications.
Array 2021, 12, 100094. [CrossRef]
128. Keerthikeshwar, M.; Anto, S. Deep Learning for Robot Vision. In Intelligent Manufacturing and Energy Sustainability; Reddy, A.,
Marla, D., Favorskaya, M.N., Satapathy, S.C., Eds.; Springer: Singapore, 2021; Volume 213, pp. 357–365. [CrossRef]
129. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [CrossRef] [PubMed]
130. Sun, W.; Wang, R. Fully Convolutional Networks for Semantic Segmentation of Very High Resolution Remotely Sensed Images
Combined With DSM. IEEE Geosci. Remote Sens. Lett. 2018, 15, 474–478. [CrossRef]
131. Browne, M.; Ghidary, S.S. Convolutional Neural Networks for Image Processing: An Application in Robot Vision. In AI
2003: Advances in Artificial Intelligence; Goos, G., Hartmanis, J., Van Leeuwen, J., Gedeon, T.D., Fung, L.C.C., Eds.; Springer:
Berlin/Heidelberg, Germany, 2003; Volume 2903, pp. 641–652. [CrossRef]
132. Ruiz-del-Solar, J.; Loncomilla, P.; Soto, N. A Survey on Deep Learning Methods for Robot Vision. arXiv 2018, arXiv:1803.10862.
133. Bernstein, A.V.; Burnaev, E.V.; Kachan, O.N. Reinforcement Learning for Computer Vision and Robot Navigation. In Machine
Learning and Data Mining in Pattern Recognition; Perner, P., Ed.; Springer International Publishing: Cham, Switzerland, 2018;
Volume 10935, pp. 258–272. [CrossRef]
134. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [CrossRef]
135. Laraba, S.; Tilmanne, J.; Dutoit, T. Leveraging Pre-trained CNN Models for Skeleton-Based Action Recognition. In Computer
Vision Systems; Tzovaras, D., Giakoumis, D., Vincze, M., Argyros, A., Eds.; Springer International Publishing: Cham, Switzerland,
2019; Volume 11754, pp. 612–626. [CrossRef]
136. Zhang, J. Multi-Source Remote Sensing Data Fusion: Status and Trends. Int. J. Image Data Fusion 2010, 1, 5–24. [CrossRef]
137. Rajagopalan, S.S.; Morency, L.P.; Baltrušaitis, T.; Goecke, R. Extending Long Short-Term Memory for Multi-View Structured Learning. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9911, pp. 338–353. [CrossRef]
138. Li, T.; Hua, M.; Wu, X.U. A Hybrid CNN-LSTM Model for Forecasting Particulate Matter (PM2.5). IEEE Access 2020, 8, 26933–26940. [CrossRef]
139. Kollias, D.; Zafeiriou, S. A Multi-component CNN-RNN Approach for Dimensional Emotion Recognition in-the-Wild. arXiv 2019,
arXiv:1805.01452.
140. Rožanec, J.M.; Zajec, P.; Theodoropoulos, S.; Koehorst, E.; Fortuna, B.; Mladenić, D. Synthetic Data Augmentation Using GAN for Improved Automated Visual Inspection. IFAC-PapersOnLine 2023, 56, 11094–11099. [CrossRef]
141. Tasdelen, A.; Sen, B. A Hybrid CNN-LSTM Model for Pre-miRNA Classification. Sci. Rep. 2021, 11, 14125. [CrossRef] [PubMed]
142. Zieba, M.; Wang, L. Training Triplet Networks with GAN. arXiv 2017, arXiv:1704.02227.
143. Sergiyenko, O.Y.; Tyrsa, V.V. 3D Optical Machine Vision Sensors with Intelligent Data Management for Robotic Swarm Navigation
Improvement. IEEE Sens. J. 2020, 21, 11262–11274. [CrossRef]
144. Jiang, H.; Peng, L.; Wang, X. Machine Vision and Big Data-Driven Sports Athletes Action Training Intervention Model. Sci.
Program. 2021, 2021, 9956710. [CrossRef]
145. Elfiky, N. Application of Analytics in Machine Vision Using Big Data. Asian J. Appl. Sci. 2019, 7, 376–385. [CrossRef]
146. Popov, S.B. The Big Data Methodology in Computer Vision Systems. In Proceedings of the International Conference Information
Technology and Nanotechnology (ITNT-2015), Samara, Russia, 29 June–1 July 2015; Volume 1490, pp. 420–425.
147. Tuor, T.; Wang, S.; Ko, B.J.; Liu, C.; Leung, K.K. Overcoming Noisy and Irrelevant Data in Federated Learning. In Proceedings of
the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5020–5027.
148. Zhang, H.; Bosch, J.; Olsson, H.H. Real-Time End-to-End Federated Learning: An Automotive Case Study. In Proceedings of
the 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 12–16 July 2021;
pp. 459–468.
149. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings,
R. Advances and Open Problems in Federated Learning. In Foundations and Trends® in Machine Learning; Now Publishers Inc.:
Boston, MA, USA, 2021; Volume 14, pp. 1–210.
150. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečný, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. LEAF: A Benchmark for Federated
Settings. arXiv 2019, arXiv:1812.01097.
151. Tyagi, S.; Rajput, I.S.; Pandey, R. Federated Learning: Applications, Security Hazards and Defense Measures. In Proceedings of
the 2023 International Conference on Device Intelligence, Computing and Communication Technologies, (DICCT), Dehradun,
India, 17–18 March 2023; pp. 477–482.
152. Federated Learning: Collaborative Machine Learning without Centralized Training Data. 2017. Available online: https://blog.research.google/2017/04/federated-learning-collaborative.html (accessed on 9 March 2023).
153. Kant, S.; da Silva, J.M.B.; Fodor, G.; Göransson, B.; Bengtsson, M.; Fischione, C. Federated Learning Using Three-Operator
ADMM. IEEE J. Sel. Top. Signal Process. 2022, 17, 205–221. [CrossRef]
154. Tao, J.; Gao, Z.; Guo, Z. Training Vision Transformers in Federated Learning with Limited Edge-Device Resources. Electronics
2022, 11, 2638. [CrossRef]
155. Guo, T.; Guo, S.; Wang, J. pFedPrompt: Learning Personalized Prompt for Vision-Language Models in Federated Learning. In
Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1364–1374. [CrossRef]
156. Hubel, D.H.; Wiesel, T.N. Receptive Fields of Single Neurones in the Cat’s Striate Cortex. J. Physiol. 1959, 148, 574. [CrossRef]
157. Hubel, D.H.; Wiesel, T.N. Receptive Fields, Binocular Interaction and Functional Architecture in the Cat’s Visual Cortex. J. Physiol.
1962, 160, 106. [CrossRef]
158. Kandel, E.R. An Introduction to the Work of David Hubel and Torsten Wiesel. J. Physiol. 2009, 587, 2733. [CrossRef] [PubMed]
159. Wurtz, R.H. Recounting the Impact of Hubel and Wiesel. J. Physiol. 2009, 587, 2817–2823. [CrossRef] [PubMed]
160. Shimonomura, K.; Kushima, T.; Yagi, T. Binocular Robot Vision Emulating Disparity Computation in the Primary Visual Cortex.
Neural Netw. 2008, 21, 331–340. [CrossRef] [PubMed]
161. Shimonomura, K.; Yagi, T. Neuromorphic Vergence Eye Movement Control of Binocular Robot Vision. In Proceedings of the 2010
IEEE International Conference on Robotics and Biomimetics, Tianjin, China, 14–18 December 2010; pp. 1774–1779.
162. Gochin, P.M.; Lubin, J.M. A Hierarchical Machine Vision System Based on a Model of the Primate Visual System. In Proceedings
of the 5th IEEE International Symposium on Intelligent Control 1990, Philadelphia, PA, USA, 5–7 September 1990; pp. 61–65.
163. Zeevi, Y.Y. Adaptive Machine Vision: What Can Be Learned from Biological Systems. In Intelligent Robots and Computer Vision
VIII: Algorithms and Techniques; SPIE: Philadelphia, PA, USA, 1990; Volume 1192, pp. 560–568.
164. Milner, B.; Squire, L.R.; Kandel, E.R. Cognitive Neuroscience and the Study of Memory. Neuron 1998, 20, 445–468. [CrossRef]
[PubMed]
165. Allison, T. Electrophysiological Studies of Human Face Perception. I: Potentials Generated in Occipitotemporal Cortex by Face
and Non-face Stimuli. Cereb. Cortex 1999, 9, 415–430. [CrossRef] [PubMed]
166. Dawson, G.; Webb, S.J.; McPartland, J. Understanding the Nature of Face Processing Impairment in Autism: Insights From
Behavioral and Electrophysiological Studies. Dev. Neuropsychol. 2005, 27, 403–424. [CrossRef]
167. Di Nuovo, A.; Conti, D.; Trubia, G.; Buono, S.; Di Nuovo, S. Deep Learning Systems for Estimating Visual Attention in
Robot-Assisted Therapy of Children with Autism and Intellectual Disability. Robotics 2018, 7, 25. [CrossRef]
168. El Arbaoui, F.E.Z.; El Hari, K.; Saidi, R. A Survey on the Application of the Internet of Things in the Diagnosis of Autism Spectrum
Disorder. In Advanced Technologies for Humanity; Saidi, R., El Bhiri, B., Maleh, Y., Mosallam, A., Essaaidi, M., Eds.; Lecture Notes
on Data Engineering and Communications Technologies; Springer: Cham, Switzerland, 2022; pp. 29–41. [CrossRef]
169. Javed, H.; Park, C.H. Behavior-Based Risk Detection of Autism Spectrum Disorder Through Child-Robot Interaction. In Proceedings of the HRI'20: Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, New York, NY, USA, 23–26 March 2020; pp. 275–277. [CrossRef]
170. Kollias, K.F.; Syriopoulou-Delli, C.K.; Sarigiannidis, P.; Fragulis, G.F. The Contribution of Machine Learning and Eye-tracking
Technology in Autism Spectrum Disorder Research: A Review Study. In Proceedings of the 2021 10th International Conference
on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 5–7 July 2021; pp. 1–4.
171. Kollias, K.F.; Syriopoulou-Delli, C.K.; Sarigiannidis, P.; Fragulis, G.F. The Contribution of Machine Learning and Eye-Tracking
Technology in Autism Spectrum Disorder Research: A Systematic Review. Electronics 2021, 10, 2982. [CrossRef]
172. Kollias, K.F.; Syriopoulou-Delli, C.K.; Sarigiannidis, P.; Fragulis, G.F. Autism Detection in High-Functioning Adults with the
Application of Eye-Tracking Technology and Machine Learning. In Proceedings of the 2022 11th International Conference on
Modern Circuits and Systems Technologies (MOCAST), Bremen, Germany, 8–10 June 2022; pp. 1–4.
173. Kollias, K.F.; Maia Marques Torres E Silva, L.M.; Sarigiannidis, P.; Syriopoulou-Delli, C.K.; Fragulis, G.F. Implementation of
Robots in Autism Spectrum Disorder Research: Diagnosis and Emotion Recognition and Expression. In Proceedings of the
2023 12th International Conference on Modern Circuits and Systems Technologies (MOCAST), Athens, Greece, 28–30 June 2023;
pp. 1–4. [CrossRef]
174. Ramirez-Duque, A.A.; Frizera-Neto, A.; Bastos, T.F. Robot-Assisted Diagnosis for Children with Autism Spectrum Disorder
Based on Automated Analysis of Nonverbal Cues. In Proceedings of the 2018 7th IEEE International Conference on Biomedical
Robotics and Biomechatronics (Biorob), Enschede, The Netherlands, 26–29 August 2018; pp. 456–461. [CrossRef]
175. Ramírez-Duque, A.A.; Frizera-Neto, A.; Bastos, T.F. Robot-Assisted Autism Spectrum Disorder Diagnostic Based on Artificial
Reasoning. J. Intell. Robot. Syst. 2019, 96, 267–281. [CrossRef]
176. Riva, G.; Riva, E. CARERAID: Controlled Autonomous Robot for Early Detection and Rehabilitation of Autism and Intellectual Disability. Cyberpsychol. Behav. Soc. Netw. 2019, 22, 747–748. [CrossRef]
177. Romero-García, R.; Martínez-Tomás, R.; Pozo, P.; de la Paz, F.; Sarriá, E. Q-CHAT-NAO: A Robotic Approach to Autism Screening
in Toddlers. J. Biomed. Inform. 2021, 118, 103797. [CrossRef]
178. Shelke, N.A.; Rao, S.; Verma, A.K.; Kasana, S.S. Autism Spectrum Disorder Detection Using AI and IoT. In Proceedings of the
2022 Fourteenth International Conference on Contemporary Computing, Noida, India, 4–6 August 2022; pp. 213–219.
179. Shushma, G.; Jacob, I.J. Autism Spectrum Disorder Detection Using AI Algorithm. In Proceedings of the 2022 Second International
Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, 23–25 February 2022; pp. 1–5.
180. Felleman, D.J.; Van Essen, D.C. Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cereb. Cortex 1991, 1, 1.
[CrossRef] [PubMed]
181. Van Essen, D.C.; Drury, H.A.; Joshi, S.; Miller, M.I. Functional and Structural Mapping of Human Cerebral Cortex: Solutions Are
in the Surfaces. Proc. Natl. Acad. Sci. USA 1998, 95, 788–795. [CrossRef] [PubMed]
182. Fox, M.D.; Snyder, A.Z.; Vincent, J.L.; Corbetta, M.; Van Essen, D.C.; Raichle, M.E. The Human Brain Is Intrinsically Organized
into Dynamic, Anticorrelated Functional Networks. Proc. Natl. Acad. Sci. USA 2005, 102, 9673–9678. [CrossRef]
183. Glasser, M.F.; Coalson, T.S.; Robinson, E.C.; Hacker, C.D.; Harwell, J.; Yacoub, E.; Ugurbil, K.; Andersson, J.; Beckmann, C.F.;
Jenkinson, M.; et al. A Multi-Modal Parcellation of Human Cerebral Cortex. Nature 2016, 536, 171–178. [CrossRef] [PubMed]
184. Tang, L.; Xiao, H.; Li, B. Can SAM Segment Anything? When SAM Meets Camouflaged Object Detection. arXiv 2023, arXiv:2304.04709.
185. Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C.; et al. Multi-modal 3D Object Detection in
Autonomous Driving: A Survey and Taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798. [CrossRef]
186. Rajasegaran, J.; Pavlakos, G.; Kanazawa, A.; Feichtenhofer, C.; Malik, J. On the Benefits of 3D Pose and Tracking for Human
Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver,
Canada, 18–22 June 2023; pp. 640–649.
187. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [CrossRef]
188. Zhang, Y.; Guo, Q.; Du, Z.; Wu, A. Human Action Recognition for Dynamic Scenes of Emergency Rescue Based on Spatial-
Temporal Fusion Network. Electronics 2023, 12, 538. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.