Article
Deep Learning with Attention Mechanisms for Road
Weather Detection
Madiha Samo, Jimiama Mosima Mafeni Mase and Grazziela Figueredo *
Abstract: There is great interest in automatically detecting road weather and understanding its
impacts on the overall safety of the transport network. This can, for example, support road condition-
based maintenance or even serve as detection systems that assist safe driving during adverse climate
conditions. In computer vision, previous work has demonstrated the effectiveness of deep learning
in predicting weather conditions from outdoor images. However, training deep learning models
to accurately predict weather conditions using real-world road-facing images is difficult due to:
(1) the simultaneous occurrence of multiple weather conditions; (2) imbalanced occurrence of weather
conditions throughout the year; and (3) road idiosyncrasies, such as road layouts, illumination, and
road objects, etc. In this paper, we explore the use of a focal loss function to force the learning process
to focus on weather instances that are hard to learn, with the objective of helping address data
imbalances. In addition, we explore the attention mechanism for pixel-based dynamic weight adjustment
to handle road idiosyncrasies using state-of-the-art vision transformer models. Experiments with a
novel multi-label road weather dataset show that focal loss significantly increases the accuracy of
computer vision approaches for imbalanced weather conditions. Furthermore, vision transformers
outperform current state-of-the-art convolutional neural networks in predicting weather conditions
with a validation accuracy of 92% and an F1-score of 81.22%, which is impressive considering the
imbalanced nature of the dataset.
Keywords: computer vision; deep learning; image classification; loss functions; vision transformers; weather detection; autonomous vehicles
1. Introduction
Different types of weather severely affect traffic flow, driving performance, and vehicle and road
safety [1]. Statistics from the Federal Highway Administration show that increased numbers of
accidents and congestion are usually directly associated with hostile weather [2]. As a result, there
is a need for advanced intelligent systems that can accurately and automatically detect weather
conditions to support safe driving, traffic safety risk assessment, autonomous vehicles, and effective
management of the transport network. Deep learning has emerged as one of the main approaches used
for automatic weather recognition, as evidenced by the related work in Section 2. The state-of-the-art
literature mostly employs convolutional neural networks (CNNs), which are trained on outdoor weather
images and subsequently label new images with a single weather class. This type of classification,
however, produces less accurate results for roads, as multiple weather types are likely to occur
simultaneously. For example, Figure 1 shows multiple weather conditions (i.e., sunny and wet) present
in a single scenario. Another limitation found in the current related work is that deep learning models
are mostly trained on balanced, high-variance weather datasets. This oversimplifies road weather
conditions, which are characterised by highly imbalanced and more complex scenarios, such as road
layouts, interacting elements, vehicles, people, and different illumination conditions. The
representation learning therefore becomes compromised, as road elements that could potentially allow
for a more specific type of learning for the road problem are not included.
There is also currently no research study investigating intelligent strategies for multi-label,
highly imbalanced and complex road scenarios, such as dynamic pixel-based weighting.
This motivates our study, in which we propose a publicly available, realistic multi-label road weather
dataset and employ vision transformers trained with a focal loss to address class imbalance and road
idiosyncrasies.
Figure 1. Multiple weather conditions (sunny and wet) existing in a single image.
2. Background
2.1. Related Work
The rapid evolution and widespread application of sensors (e.g., onboard cameras)
has led to large volumes of data streams constantly being generated in transportation. Deep learning
has emerged as a suitable approach to address such big data problems, as it reduces the dependency on
human experts and learns high-level features from data in an incremental manner. For weather
recognition tasks specifically, convolutional neural networks have been extensively explored.
Kang et al. [3] introduced a weather classification framework based on GoogleNet
to recognise four weather conditions—hazy, snowy, rainy and others. Their framework
was trained using the general MWI weather dataset [4] and achieved 92% accuracy. The
model outperformed multiple kernel learning-based approaches [4] and AlexNet CNN [5].
Similarly, An et al. [6] explored ResNet and AlexNet coupled with support vector machines
for weather classification. The authors evaluated the models using several multi-class weather datasets.
The classification performance achieved in the above studies for weather recognition is
acceptable. However, the majority of the studies focused on multi-class classification, which
could be unrepresentative of real-world weather conditions where more than one weather
condition can occur simultaneously (as shown in the sample image in Figure 1). The few
studies that employ multi-label classification [10] are either implemented on a general
weather dataset or fail to make their datasets available for comparison and advancement.
In addition, the studies use carefully selected outdoor images, which create well-balanced
weather datasets. This oversimplifies the road weather detection problem, which is usually
imbalanced in nature, e.g., icy and snowy weather conditions rarely occur in the United
Kingdom (UK). The outdoor datasets also fail to include different lighting conditions and
road characteristics, which limits their generalisability to road weather images.
We address the above limitations by proposing a multi-label weather dataset for roads that captures
the problem of multiple weather conditions existing in a single frame. In addition, as the
weather data are inherently unbalanced, an attention mechanism needs to be provided to
address those categories that are harder to learn, as those are more likely to be extreme
(rare) conditions, and their misclassification by the intelligent systems should be minimised.
Hence, the systematic approach followed in this study allows the model to focus more on
less represented classes instead of data-dominated labels to prevent training a biased network.
We also focus on feeding the model information about hard instances to avoid the gradient
being outclassed by the accumulation of the losses of easy instances. Lastly, we focus on
dynamically assigning weights to the pixels allowing the model to focus more on relevant
features during classification, which can potentially increase the model’s efficiency for
highly complex data. Specifically, the study involves identifying the potential of adapting
weighted loss and focal loss functions to deal with class imbalance and hard-to-learn
instances in the dataset. The study also involves exploring vision transformer models
allowing the model to focus more on relevant pixels only. To the best of our knowledge,
this study is the first attempt to recognise the potential of weighted loss, focal loss and
pixel-based attention mechanism for multi-label road weather classification.
2.2. Loss Functions Explored in This Study to Deal with Data Predicaments
• Class Weighted Loss Function: The traditional cross-entropy loss does not take into
account the imbalanced nature of the dataset. The inherent assumption that the data are balanced often
leads to misleading results: since the learning becomes biased towards majority classes, the model
fails to learn meaningful features to identify the minority classes. To overcome this issue, the
loss function can be adjusted by assigning class weights such that more attention is given to minority
classes during training. Weights are assigned so that the smaller the number of instances in a class,
the greater the weight assigned to that class, i.e., for each class, weight = total images in the
dataset / total images in that class. The
weighted cross-entropy loss function is given by:

L = \sum_{i=1}^{N} \sum_{c=1}^{C} \omega_c \left[ -\left( y_c \log(p_c) + (1 - y_c) \log(1 - p_c) \right) \right]    (1)
where L is the total loss, c represents the class, i represents the training instance, while
C and N represent the total number of classes and instances, respectively. y_c indicates
the ground truth label for class c, p_c is the predicted probability that the given
image belongs to class c, and ω_c represents the weight of class c.
• Focal Loss Function: A focal loss function is a dynamically scaled cross-entropy loss
function. Focal loss forces the model to focus on the hard misclassified examples
during the training process [17]. For any given instance, the scaling factor of the focal
loss function decays to zero as the loss decreases, thus allowing the model to rapidly
focus on hard examples instead of assigning similar weights to all the instances. Focal
loss function is given by:

FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (2)

where p_t is the predicted probability for the ground-truth label (i.e., p_t = p_c when y_c = 1 and
p_t = 1 − p_c otherwise), and α and γ are hyperparameters. Setting γ greater than zero reduces the
relative loss for examples that are easily classified, so γ ≥ 0 controls the contribution of easy and
hard instances, while α lies in [0, 1] and addresses the class imbalance problem. A minimal PyTorch
sketch of both loss functions is given after this list.
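To make the two loss functions concrete, the following is a minimal PyTorch sketch of the class-weighted binary cross-entropy of Equation (1) and the focal loss of Equation (2) for the multi-label setting. The class counts, batch size and hyperparameter values are illustrative assumptions and do not reflect the exact configuration used in our experiments.

import torch
import torch.nn.functional as F

def class_weights_from_counts(counts):
    # Weight per class = total images in the dataset / images in that class (Section 2.2).
    counts = torch.tensor(counts, dtype=torch.float32)
    return counts.sum() / counts

def weighted_bce_loss(logits, targets, class_weights):
    # Class-weighted multi-label binary cross-entropy, Equation (1).
    probs = torch.sigmoid(logits)
    per_class = -(targets * torch.log(probs + 1e-8)
                  + (1 - targets) * torch.log(1 - probs + 1e-8))
    return (class_weights * per_class).sum(dim=1).mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Focal loss, Equation (2): the (1 - p_t)^gamma factor down-weights easy examples.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    p_t = targets * probs + (1 - targets) * (1 - probs)
    return (alpha * (1 - p_t) ** gamma * bce).sum(dim=1).mean()

# Illustrative usage with hypothetical class counts for the seven weather labels.
weights = class_weights_from_counts([900, 700, 120, 300, 400, 950, 60])
logits = torch.randn(4, 7)                     # batch of 4 images, 7 weather classes
targets = torch.randint(0, 2, (4, 7)).float()  # multi-label ground truth
print(weighted_bce_loss(logits, targets, weights).item(), focal_loss(logits, targets).item())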
3. Vision Transformers
Transformers were initially introduced for Natural Language Processing (NLP) tasks [23],
while image processing tasks usually relied on convolutional neural networks. Recently,
transformers have been adopted for computer vision tasks [15] and they are called vision
transformers. Vision transformers are similar to NLP transformers, where patches of
images are used instead of sentences. Images are broken down into a series of patches and
transformed into embeddings, which can be easily fed into NLP transformers, similar to
embeddings of words.
Conventional CNNs typically assign similar attention (weights) to all the pixels of an
image during classification. As already proven in the field of NLP, introducing attention
mechanisms such that higher weights are assigned to pixels of relevant information could
lead to potentially better results and efficient models. Therefore, Vision Transformers (ViT)
capture relationships between different parts of an image allowing the model to focus more
on relevant pixels in classification problems. ViT computes relationships among pixels in
small sections of the image (also known as patches) to reduce computation time instead
of computing the relationship between each individual pixel. Each image is considered
a sequence of patches of pixels. However, to retain the positional information, positional
embeddings are added to the patch embeddings, as shown in Figure 2. These positional
embeddings are important to represent the position of features in a flattened sequence;
otherwise, the transformer will lose information about the sequential relationships between
the patches. A positional embedding (PE) matrix is used to define the relative distance of
all possible pairs in the given sequence of patch embeddings and is given by the formula:

PE(pos, 2i) = \sin(pos / 10000^{2i/d}), \qquad PE(pos, 2i + 1) = \cos(pos / 10000^{2i/d})

where pos is the position of the feature in the input sequence, i maps the column
indices such that 0 ≤ i ≤ d/2, and d is the dimension of the embedding space.
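As an illustration of the formula above, the following minimal PyTorch sketch builds the sinusoidal positional-embedding matrix; the patch count and embedding dimension are illustrative. Note that, as described later in this section, the positional embeddings used by the ViT models themselves are trainable, so this closed form is shown only to clarify the notation.

import torch

def sinusoidal_positional_embedding(num_patches, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pos = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)  # (num_patches, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                     # even column indices
    angle = pos / torch.pow(torch.tensor(10000.0), i / d)
    pe = torch.zeros(num_patches, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added element-wise to the sequence of patch embeddings

# e.g., 196 patches (a 14 x 14 grid for a 224 x 224 image split into 16 x 16 patches), 768-dim embeddings
pe = sinusoidal_positional_embedding(196, 768)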
The results with the position embeddings are then fed to a transformer encoder for
classification, as shown in Figure 3. The transformer encoder module consists of a Multi-
Head Self Attention (MSA) layer and a Multi-Layer Perceptron (MLP) layer. The MSA
layer splits the given input into multiple heads such that each head learns different levels
of self-attention. The outputs are then further concatenated and passed through the MLP
layer. The concatenated outputs from the MSA layer are normalised in the Norm layer and
sent to the MLP layer for classification. The MLP layer consists of Gaussian Error Linear
Unit (GELU) activation functions.
Figure 2 shows an overview of ViT. This section concludes by explaining in more detail
the attention mechanism adopted by the MSA layer.
A typical attention mechanism is based on trainable vector pairs consisting of keys
and values. A set of k key vectors is packed into a matrix K (K ∈ R^{k×d}) such that the query
vector (q ∈ R^d) is matched against this set of k key vectors. The matching is based on inner (dot)
products, which are then scaled and normalised; a softmax function is applied to obtain k weights,
and the weighted sum of the k value vectors serves as the output of the attention, i.e.,
Attention(Q, K, V) = softmax(QK^⊤/√d)V. For self-attention, the query, key and value vectors are
calculated from a given set of N input vectors (i.e., patches of images) such that
Query = XW_q, Key = XW_k, Value = XW_v, where W_q, W_k, and W_v are the linear
transformations, with the constraint k = N indicating that the attention is computed
between the given N input vectors.
The MSA layer applies h self-attention functions (heads) to the input, as follows:
MultiHead(Q, K, V) = [head_1, ..., head_h] W_O, where W_O is a learnable parameter matrix. In the
MSA computation, the query, key and value vectors are split into h parts before applying
self-attention; the self-attention process is then applied to each split individually, and the
independent attention outputs are concatenated and linearly transformed.
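The computation described above can be summarised by the following compact PyTorch sketch of a multi-head self-attention layer (projection into queries, keys and values, per-head scaled dot-product attention, concatenation of the heads and the output projection); the embedding dimension and number of heads are illustrative assumptions rather than the configuration of a specific ViT variant.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Linear maps producing Query = XWq, Key = XWk and Value = XWv
        self.wq = nn.Linear(dim, dim)
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)
        self.wo = nn.Linear(dim, dim)  # output projection applied to the concatenated heads

    def forward(self, x):
        b, n, d = x.shape  # (batch, number of patches, embedding dimension)
        def split(t):
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # scaled dot products
        weights = scores.softmax(dim=-1)                           # normalised attention weights
        out = weights @ v                                          # weighted sum of value vectors
        out = out.transpose(1, 2).reshape(b, n, d)                 # concatenate the heads
        return self.wo(out)

# e.g., a batch of 2 images, 196 patches, 768-dimensional embeddings and 12 heads
attention = MultiHeadSelfAttention(768, 12)
output = attention(torch.randn(2, 196, 768))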
We conclude this section by summarising the image classification process of ViT using
the self-attention mechanism and encoder layer described above. Input images are split
into patches of fixed sizes and multiplied with embedding matrices. Each patch is assigned
a trainable positional embedding vector to remember the order of the input sequence before
feeding the input to the transformer. The transformer uses a constant vector size in all its
layers, so all the patches are flattened and mapped to this dimension using a trainable linear
projection. Each encoder comprises two sub-layers. The first sub-layer allows the input to
pass through the self-attention module while the outputs of the self-attention operation are
then passed to a feed-forward neural network in the second sub-layer with output neurons
for classifying the images. Skip connections and layer normalisation are also incorporated
into the architecture for each sublayer of the encoder.
4. Experiments
4.1. Proposed Dataset Description
Due to the lack of a publicly available multi-label road weather dataset, we have created
an open-source dataset consisting of road images depicting eight classes of weather and road
surface conditions, i.e., sunny, cloudy, foggy, rainy, wet, clear, snowy and icy. The images are
extracted from available online videos on YouTube captured and uploaded by ‘Alan Z1000sx’
(the YouTube account that owns the road-facing videos) using a video camera mounted on
the dashboard of a heavy goods vehicle completing journeys across the UK (a sample video
is available at [24]). The video clips captured different roads in the UK (i.e., motorways,
urban roads, rural roads, and undivided highways), different weather conditions (i.e., sunny,
cloudy, foggy, rainy, wet, clear, snowy and icy) and different lighting conditions (i.e., sunset,
sunrise, morning, afternoon, night, and evening). We downloaded 25 videos uploaded by
'Alan Z1000sx' with an average duration of 8 min. We developed a Python script to extract
an image from each video every 10 s. A total of 2,498 images were extracted.
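A minimal sketch of this frame-extraction step is shown below; it assumes OpenCV and a local folder of the downloaded clips, and the folder names and file format are hypothetical rather than the exact script used to build the dataset.

import cv2
from pathlib import Path

VIDEO_DIR = Path("videos")   # hypothetical folder containing the downloaded clips
OUTPUT_DIR = Path("frames")
OUTPUT_DIR.mkdir(exist_ok=True)
INTERVAL_S = 10              # extract one frame every 10 seconds

for video_path in VIDEO_DIR.glob("*.mp4"):
    capture = cv2.VideoCapture(str(video_path))
    fps = capture.get(cv2.CAP_PROP_FPS) or 25   # fall back to 25 fps if metadata is missing
    step = int(fps * INTERVAL_S)
    frame_idx = saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % step == 0:
            cv2.imwrite(str(OUTPUT_DIR / f"{video_path.stem}_{saved:04d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    capture.release()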
To annotate the images, we utilised an online annotation platform called Zooniverse [25].
In Zooniverse, volunteers assist researchers in data annotation and pattern
recognition tasks. We created a project in Zooniverse for annotating the images, uploaded
the images, specified the labels, and added volunteers to our project. Zooniverse provides
an easy-to-use interface for annotating the images, as shown in Figure 4. As shown in the
figure, each image could be assigned to more than one weather condition. The annotations
were carried out by two volunteers. After annotating the images, Zooniverse offers an
option to export the annotations to a comma-separated values file. Table 2 shows the
distribution of the images across the different weather conditions. The dataset is imbalanced, with
the majority of the images classified as clear and sunny, while icy is the least represented class,
as UK roads are rarely icy. Six sample images from the dataset are shown in Figure 5, and the
complete dataset is available online at [26].
Figure 5. Six samples of weather images from our multi-label road weather dataset.
dataset [28,29]. The images are first resized into the required image size for the CNN
architectures, e.g., 224 × 224 for most of the models except EfficientNet-B7 and Inception-v3,
which require input sizes of 600 × 600 and 299 × 299, respectively. The models are then loaded
with weights pre-trained on ImageNet by setting the 'pretrained' parameter to True (in PyTorch).
In stage 2, the pre-trained models are re-trained on our proposed road weather dataset
by replacing the number of outputs in the final fully connected layer of the CNN models
with the number of weather classes. It is important to note that ‘icy’ road weather images
were eliminated from our experiments as there were only three such instances. Therefore, seven
weather classes were utilised in our experiments to re-train the models. Only the last layers
of the CNN architectures are optimised during the training process using cross entropy loss.
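Stages 1 and 2 can be sketched as follows for ResNet-152 using torchvision; the optimiser, learning rate and freezing strategy below are assumptions for illustration rather than the exact training configuration.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # icy excluded, leaving seven weather and road surface labels

# Stage 1: load a ResNet-152 with weights pre-trained on ImageNet
model = models.resnet152(pretrained=True)

# Stage 2: replace the final fully connected layer with one output per weather class
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optimise only the new classification head, keeping the pre-trained layers frozen
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

criterion = nn.BCEWithLogitsLoss()  # per-class binary cross-entropy for multi-label outputs
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)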
In the third stage, we update the cross entropy loss to incorporate the number of
images in each class (i.e., class weighted loss function). This is important to reduce the bias
of the majority classes of imbalanced datasets by providing higher weights to images from
minority classes and lower weights to images from majority classes.
In the fourth stage, focal loss function is implemented to pay more attention to classes
that are harder to learn, e.g., extreme (rare) weather conditions.
Convolutional neural networks assign similar weight to all the pixels during classification,
which might lead to suboptimal results, especially in complex road images with a lot of
background noise. To tackle this, in the fifth stage, an attention mechanism is implemented
using Vision Transformers (ViT), which are pre-trained on the ImageNet dataset. In the last
stage, ViT models, namely ViT-B/16 and ViT-B/32, are re-trained on the proposed road
dataset for multi-label weather detection.
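A sketch of this last stage is given below, assuming the torchvision implementation of ViT-B/16 (ViT-B/32 is obtained analogously via vit_b_32); the attribute names of the classification head follow torchvision and may differ in other ViT implementations.

import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7

# Load ViT-B/16 with weights pre-trained on ImageNet
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head with seven outputs for multi-label weather detection
vit.heads.head = nn.Linear(vit.heads.head.in_features, NUM_CLASSES)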
in the training accuracy when introducing weighted loss function. However, the models
VGG19 and EfficientNet-B7 are still the second and third-best models. The validation
accuracy for all the models has also increased in Table 4, with GoogleNet attaining the
most significant difference of 2.11%, while ResNet-152 is still the best-performing model
with a validation accuracy of 88.48% for the weighted loss function.
Table 3. Multi-label classification results for road weather detection using simple binary cross entropy
loss function (best performance in bold).
Table 4. Multi-label classification results for road weather detection using class weighted loss function
to force models to handle rare weather conditions (best performance in bold).
To focus on images that are difficult to classify, the focal loss function is used
to optimise the models. Table 5 shows that by using the focal loss function, performance
improves further. ResNet-152 still outperforms the other models with a 74.4% F1-score.
However, the best overall improvement can be seen for GoogleNet, with a 17.74% increase from the
binary cross entropy loss to the focal loss function and a 4.45% increase from the
class weighted loss function to the focal loss function. GoogleNet and Inception-v3 are now
the second and third-best-performing models instead of VGG19 and EfficientNet-B7. The
other models, ResNet-152, Inception-v3, EfficientNet-B7, have improved their validation
accuracy for the focal loss function by 0.44%, 0.13%, 0.39% and 0.48% as compared to the
weighted loss function. The focal loss function produces an overall better-performing and more
generalised model, since the average F1-score is significantly improved for all the models when the
focal loss function is introduced. Compared to the class weighted loss function, the F1-score for
focal loss improves by 2.07%, 4.45%, 3.4%, 3.77% and 2.77% for VGG19, GoogleNet, ResNet-152,
Inception-v3 and EfficientNet-B7, respectively. The best-performing CNN model across all the loss
functions is ResNet-152, with F1-scores of 64.22%, 71% and 74.4% for the binary cross entropy, class
weighted and focal loss functions, respectively. The focal loss function outperforms the other loss
functions by focusing more on hard instances, while the class weighted loss function is the second
best, as it assigns more weight to classes with fewer instances and significantly improves the
F1-score for all the CNN models.
Table 5. Multi-label classification results for road weather detection using focal loss function to force
models to handle difficult-to-classify weather images (best performance in bold).
6. Conclusions
Intelligent weather detection is important to support safe driving and effective management of the
transport network. Previous computer vision studies perform multi-class
weather classification, which is not always appropriate and reliable for road safety, as
multiple weather conditions are likely to occur simultaneously. In addition, the majority of
them use balanced randomly selected outdoor images, which are unrepresentative of the
real-world frequency of weather types and the unbalanced nature of road weather data. In
this paper, we have introduced multi-label deep learning architectures for road weather
classification, i.e., VGG19, GoogleNet, ResNet-152, Inception-v3, and EfficientNet-B7. To
adequately evaluate their performance, we have created a multi-label road weather dataset
using naturalistic road clips captured by onboard cameras. The dataset consists of road
images captured at different road types, different lighting conditions and different weather
and road surface conditions. Due to the imbalanced nature of the dataset, we improved
model performance using class weighted and focal loss functions to handle rare weather
conditions and hard-to-classify images. Results show significant classification improvement
when higher weights are assigned to rare weather conditions (class weighted loss function),
e.g., snowy and icy weather, thereby reducing overfitting on frequently occurring weather
conditions, such as sunny and cloudy. Additionally, further improvement is observed when
the models are forced to focus more on hard-to-classify weather images (focal loss function).
Furthermore, we explore attention mechanisms for pixel-based dynamic weight adjustment
and segmentation to improve the models’ performance. This is essential in separating the
road layouts from the background and providing higher weights to pixels depending on
the weather conditions. For example, cloudy weather can be easily recognised by analysing
the background (clouds) and wet weather by analysing the road. This was achieved using the vision
transformer models ViT-B/16 and ViT-B/32, which outperformed all the CNN
architectures. For future work, Grad-CAM interpretation can be implemented to provide in-depth
visual explanations of the learning process of the ViT models under these scenarios.
References
1. Mase, J.M.; Pekaslan, D.; Agrawal, U.; Mesgarpour, M.; Chapman, P.; Torres, M.T.; Figueredo, G.P. Contextual Intelligent
Decisions: Expert Moderation of Machine Outputs for Fair Assessment of Commercial Driving. arXiv 2022, arXiv:2202.09816.
2. Perrels, A.; Votsis, A.; Nurmi, V.; Pilli-Sihvola, K. Weather conditions, weather information and car crashes. ISPRS Int. J. Geo-Inf.
2015, 4, 2681–2703. [CrossRef]
3. Kang, L.W.; Chou, K.L.; Fu, R.H. Deep Learning-based weather image recognition. In Proceedings of the 2018 International
Symposium on Computer, Consumer and Control (IS3C), Taichung, Taiwan, 6–8 December 2018; pp. 384–387.
4. Zhang, Z.; Ma, H. Multi-class weather classification on single images. In Proceedings of the 2015 IEEE International Conference
on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4396–4400.
5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
6. An, J.; Chen, Y.; Shin, H. Weather classification using convolutional neural networks. In Proceedings of the 2018 International
SoC Design Conference (ISOCC), Daegu, Korea, 12–15 November 2018; pp. 245–246.
7. Khan, M.N.; Ahmed, M.M. Weather and surface condition detection based on road-side webcams: Application of pre-trained
convolutional neural network. Int. J. Transp. Sci. Technol. 2021, 11, 468–483. [CrossRef]
8. Guerra, J.C.V.; Khanam, Z.; Ehsan, S.; Stolkin, R.; McDonald-Maier, K. Weather Classification: A new multi-class dataset,
data augmentation approach and comprehensive evaluations of Convolutional Neural Networks. In Proceedings of the 2018
NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Edinburgh, UK, 6–9 August 2018; pp. 305–310.
9. Jabeen, S.; Malkana, A.; Farooq, A.; Khan, U.G. Weather Classification on Roads for Drivers Assistance using Deep Transferred
Features. In Proceedings of the 2019 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan,
16–18 December 2019; pp. 221–2215.
10. Zhao, B.; Li, X.; Lu, X.; Wang, Z. A CNN-RNN architecture for multi-label weather recognition. Neurocomputing 2018, 322, 47–57.
[CrossRef]
11. Xia, J.; Xuan, D.; Tan, L.; Xing, L. ResNet15: Weather Recognition on Traffic Road with Deep Convolutional Neural Network. Adv.
Meteorol. 2020, 2020, 6972826. [CrossRef]
12. Toğaçar, M.; Ergen, B.; Cömert, Z. Detection of weather images by using spiking neural networks of deep learning models. Neural
Comput. Appl. 2021, 33, 6147–6159. [CrossRef]
13. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer.
IEEE Trans. Pattern Anal. Mach. Intell. 2022. [CrossRef] [PubMed]
14. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the
International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1691–1703.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Under-
standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019;
pp. 4171–4186. [CrossRef]
17. Rengasamy, D.; Jafari, M.; Rothwell, B.; Chen, X.; Figueredo, G.P. Deep learning with dynamically weighted loss function for
sensor-based prognostics and health management. Sensors 2020, 20, 723. [CrossRef] [PubMed]
18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015; pp. 1–9. [CrossRef]
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
21. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
22. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International
Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9
December 2017.
24. A Sample of HGV Dashcam Clips. 2021. Available online: https://youtu.be/-PfIjkiDozo (accessed on 1 October 2021).
25. Zooniverse Website. 2022. Available online: https://www.zooniverse.org/ (accessed on 28 March 2022).
26. Road Weather Dataset. 2022. Available online: https://drive.google.com/file/d/1e7NRaIVX6GNqHGC_aAqaib_DU_0eMVRz
(accessed on 9 January 2023).
27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami,
FL, USA, 20–25 June 2009; pp. 248–255.
28. Mafeni Mase, J.; Chapman, P.; Figueredo, G.P.; Torres Torres, M. Benchmarking deep learning models for driver distraction
detection. In Proceedings of the International Conference on Machine Learning, Optimization, and Data Science, Siena, Italy,
19–23 July 2020; pp. 103–117.
29. Pytorch. Models and Pre-Trained Weights. 2022. Available online: https://pytorch.org/vision/stable/models.html (accessed on
15 March 2022).