

Efficient pneumonia detection using Vision Transformers on chest X-rays

Sukhendra Singh 1, Manoj Kumar 1, Abhay Kumar 2, Birendra Kumar Verma 1, Kumar Abhishek 2 & Shitharth Selvarajan 3*

1JSS Academy of Technical Education, Noida, India. 2National Institute of Technology Patna, Patna, India. 3School of Built Environment, Engineering and Computing, Leeds Beckett University, LS1 3HE, Leeds, UK. *email: [email protected]

Pneumonia is a widespread and acute respiratory infection that impacts people of all ages. Early
detection and treatment of pneumonia are essential for avoiding complications and enhancing
clinical results. We can reduce mortality, improve healthcare efficiency, and contribute to the global
battle against a disease that has plagued humanity for centuries by devising and deploying effective
detection methods. Detecting pneumonia is not only a medical necessity but also a humanitarian
imperative and a technological frontier. Chest X-rays are a frequently used imaging modality for
diagnosing pneumonia. This paper examines in detail a cutting-edge method for detecting pneumonia
implemented on the Vision Transformer (ViT) architecture on a public dataset of chest X-rays
available on Kaggle. To acquire global context and spatial relationships from chest X-ray images,
the proposed framework deploys the ViT model, which integrates self-attention mechanisms and
transformer architecture. According to our experimentation with the proposed Vision Transformer-
based framework, it achieves a higher accuracy of 97.61%, sensitivity of 95%, and specificity of
98% in detecting pneumonia from chest X-rays. The ViT model is preferable for capturing global
context, comprehending spatial relationships, and processing images that have different resolutions.
The framework establishes its efficacy as a robust pneumonia detection solution by surpassing
convolutional neural network (CNN) based architectures.

Pneumonia is a common respiratory infection caused by multiple types of bacteria, viruses, and fungi. It is the
leading cause of morbidity and mortality worldwide, particularly among infants under the age of five and the
elderly. According to the WHO1, there were 1.4 million pneumonia-related fatalities among children under five in 2018. Chest
X-ray imaging is commonly used to diagnose pneumonia, as it can reveal important symptoms, such as increased
lung opacity and consolidation. However, it can be difficult to interpret a chest X-ray (CXR) because pneumonia
symptoms can be subtle and overlap with other lung diseases. Rapid and accurate diagnosis of pneumonia is
essential for expediting treatment and improving patient outcomes. Radiological images, such as chest X-rays
or CT scans, require specialized training and can be time-consuming to interpret for diagnosing pneumonia. In recent years, there has been significant interest in developing models using machine learning techniques that assist physicians
in diagnosing pneumonia using chest X-ray images. These techniques have shown promising results and may
improve the efficacy and accuracy of pneumonia diagnosis.
By training a CNN on a dataset of chest X-ray images, Deep Learning (DL)2–5 has been utilized to detect pneumonia6–10. As shown in Fig. 1, the CNN can learn to recognize patterns and features associated with pneumonia, such as clouded lung areas. The model can then be used to classify new X-ray images as normal or pneumonia. Multiple studies11–14 have demonstrated the efficacy of this method in detecting pneumonia with a high degree of accuracy. The attention mechanism in DL refers15–21 to a technique used in neural networks to selectively focus on certain portions of an input as opposed to processing the entire input equally. In image detection and classification, attention mechanisms can be utilized to concentrate the network's attention on specific regions of an image that are most important for making a classification decision. This can help the network to improve its accuracy and decrease its computation needs. ViT models are a variant of the Transformer architecture22–26, which was originally designed for NLP applications. These models have been adapted for image classification tasks by handling an image as a sequence of image segments that are then processed by the transformer's attention mechanism.



In addition, the ViT model outperformed state-of-the-art (SOTA) techniques on a broad variety of image classification tasks, making it an excellent candidate for the pneumonia diagnosis task.

Figure 1.  A sample CXR (normal and pneumonia) image.

Motivation
Vision Transformer architecture for pneumonia detection from CXR is motivated by the need for timely detection of this severe respiratory disease. Globally, pneumonia is one of the leading causes of mortality. Early diagnosis and
treatment are crucial for improved patient outcomes. Traditional methods of evaluating CXR to diagnose pneu-
monia are time-consuming and require specialized medical knowledge, which can lead to diagnostic errors and
treatment delays. In response to these challenges, DL techniques such as CNNs and RNNs have been developed
to automate the detection of pneumonia from CXR. However, these methods are inadequate to analyze complex
medical images. ViT architecture has demonstrated exceptional efficacy in a variety of vision tasks, including
image classification and object detection. It is a viable candidate for pneumonia detection from CXR because
it can extract global and local image features. Utilizing the power of self-attention mechanisms, ViT is able to
effectively capture complex patterns and relationships in X-ray images, resulting in improved pneumonia detec-
tion accuracy and reliability. Therefore, the goal of utilizing ViT architecture for pneumonia detection from CXR
is to surmount the limitations of conventional methods and improve the precision and efficacy of DL models
for medical imaging analysis. Vision Transformer architectures differ substantially from CNN architectures. Transformer-based architectures were initially designed for sequence-to-sequence tasks in natural language processing, such as machine translation, text summarization, language modeling, and sentiment analysis. These architectures have been customized into the Vision Transformer architecture so that they are suitable for image classification and analysis.
The contributions of this work are summarized as follows.

• In this investigation, we propose a ViT-based architecture for pneumonia detection in CXR. This architecture
will be designed to effectively manage the large and complex medical images that are typical in CXR and will
be capable of detecting pneumonia with precision.
• We will compare the accuracy of the proposed ViT architecture with that of existing DL techniques. This will
provide a thorough analysis of the benefits and drawbacks of our proposed approach compared to existing
methodologies.
• We will evaluate the efficacy of the proposed ViT architecture using a CXR dataset that is publicly available.
This will entail training and testing the model using a set of performance metrics, including accuracy, recall,
precision, and F1 score, to measure its performance.

We will present the proposed ViT architecture’s performance evaluation findings and analysis. This will
include a discussion of any limitations of the proposed model and recommendations for improving its efficacy
through future work.

Organization of the paper


The rest of the paper is structured as follows: Sect. 2 discusses the background and working principle of the proposed architecture and other variants of the Vision Transformer architecture. Section 3 presents recent applications and a review of related studies. Section 4 describes the dataset characteristics and proposed architecture. Section 5 discusses experiment specifications, results, and prospects of the Vision Transformer architecture, followed by Sect. 6, which presents the conclusion.

Background and methodology


In this section, the paper builds the foundation for the proposed architecture.

Transformer architecture
The transformer architecture is a neural network27 designed for natural language tasks, such as language translation, language modeling, and text summarization. The main concept of the transformer architecture is the self-attention mechanism, which assesses the relative relevance of various words or sub-phrases in a given input. This is


achieved by computing a "query," "key," and "value" for each word or sub-phrase, followed by a weighted sum based on the similarity between the query and the keys. Additionally, the transformer architecture utilizes a multi-head attention mechanism28–30 to attend to various input positions. In addition to the self-attention mechanism31, a feed-forward neural network processes the output of the self-attention layer to produce a better result in the Transformer model. The architecture also uses positional encoding to convey the positions of the inputs.

Vision transformer derived from generic transformer architecture


The Vision Transformer replaces the original transformer's self-attention mechanism with a spatial attention mechanism32, which is designed to handle the two-dimensional grid structure of images. This enables the model to analyze and comprehend the spatial relationships between different image regions. It is an effective architecture for image classification and computer vision tasks. Images are processed through the Transformer model, which consists of spatial attention and a feed-forward neural network. The spatial attention mechanism applies attention to the image pixels, and the feed-forward neural network is then applied to the output of the attention mechanism. In addition, this model uses a patch-based strategy where an image is divided into smaller segments and the model learns to focus separately on each patch. This allows the model to extract granular features and improve its accuracy.

Working principle of Vision Transformer


The fundamental concept of a ViT is the self-attention mechanism, which exploits both global and local features by focusing on distinct portions of the image. The self-attention mechanism is implemented by adding self-attention layers with multiple heads, known as transformer blocks. Each patch is converted into a corresponding 1-D vector and transmitted to the transformer. The transformer then uses self-attention to learn the relationships between the various regions, and the resulting representation is input into a feed-forward neural network to make a prediction. As the self-attention mechanism is not constrained by the spatial resolution of the input, one of the main advantages of ViT is its ability to handle images of arbitrary sizes. This model can be trained on large images, such as high-resolution medical images, without downsampling or cropping. Additionally, this model has been improved in recent variants such as DeiT33,34, Swin-T35,36, and ReViT37 to enhance their performance, reduce the number of parameters and computational costs, and make them more efficient and scalable for practical applications.

Self‑attention mechanism in Vision Transformer for image detection and classification


A Vision Transformer38,39 is a neural network that processes visual information using self-attention mechanisms. Similar to how the Transformer architecture is used in natural language processing (NLP), ViT employs attention mechanisms to evaluate specific parts of an image in order to make accurate predictions. These networks excel at image classification and object detection.

Self attention techniques


Self-attention15 is a technique that enables a model to selectively concentrate its processing on particular regions
of an image. Self-attention is typically applied to extracted feature maps generated by a CNN in the context of
images. Self-attention allows the model to determine the relative importance of various image regions by com-
puting a set of attention weights for each region. These attention weights can then be applied to the feature maps
before their transmission to the remainder of the network. There are numerous methods to incorporate self-
attention into images. A common technique is using a multi-head self-attention mechanism, in which the model
computes multiple sets of attention weights for various regions of the image and then combines them. This allows
the model to consider the entire image when making a prediction rather than just a specific region’s features. A
further method for image processing is to use a transformer-based model in which the self-attention mechanism focuses on various image regions when making a prediction. The transformer-based model is trained to
understand the relationships between multiple image regions and makes predictions based on this information.
Self-attention in DL for image processing can be categorized into two main modules: channel attention and
spatial attention.
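The multi-head variant described above can be sketched in a few lines of PyTorch. This is an illustrative example rather than the authors' implementation; the channel count and number of heads are arbitrary assumptions.

```python
# Minimal sketch (not the authors' code): applying multi-head self-attention
# to CNN feature maps, as described above. Shapes and layer sizes are illustrative.
import torch
import torch.nn as nn

class FeatureMapSelfAttention(nn.Module):
    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        # nn.MultiheadAttention with batch_first=True expects (batch, seq_len, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature maps produced by a CNN backbone
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)               # (B, H*W, C): one token per location
        attended, weights = self.attn(tokens, tokens, tokens)   # self-attention: Q = K = V = tokens
        return attended.transpose(1, 2).reshape(b, c, h, w)     # back to (B, C, H, W)

x = torch.randn(2, 256, 14, 14)               # e.g. feature maps from a CNN
print(FeatureMapSelfAttention()(x).shape)     # torch.Size([2, 256, 14, 14])
```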

Spatial attention networks. In contrast to conventional CNNs, which process entire images and extract features
from them, spatial attention ­networks32,40 process only particular regions of an image. This is accomplished by
incorporating an attention mechanism that learns to weigh various image regions based on their significance to
the current task. By selectively attending to the relevant areas of an image, spatial attention networks can achieve
greater accuracy and efficiency when performing tasks such as image captioning, object detection, and visual
question answering. In addition, the attention mechanism improves the interpretability of these networks by
highlighting the regions of the image that the network is concentrating on for a given task.

Channel attention. Channel ­attention41,42 pertains to a mechanism’s ability to focus on particular channels of
the feature maps selectively. Typically, this is carried out by computing a set of attention weights for each channel
of the feature maps. These attention weights can then be applied to channels before their transmission through
the remainder of the network. This allows the model to concentrate its prediction on the channels that are most
informative. The combination of channel and spatial attention empowers the model to predict using both spatial information (the location of the specified portion within the image) and channel information (the features extracted by the CNN). This results in more robust and generalizable models for images that have not been seen.
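As a rough illustration of these two modules, the sketch below implements a squeeze-and-excitation style channel attention block and a simple spatial attention block; the exact designs vary between papers, so the layer sizes and pooling choices here are assumptions.

```python
# Illustrative sketch of the two attention modules discussed above
# (channel attention followed by spatial attention); designs are generic.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style weighting of feature-map channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> per-channel weights
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """Weights each spatial location using pooled channel statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

feats = torch.randn(2, 64, 28, 28)
out = SpatialAttention()(ChannelAttention(64)(feats))   # channel attention, then spatial attention
print(out.shape)                                         # torch.Size([2, 64, 28, 28])
```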


Variants of Vision Transformer


Several customizations of ViT have been experimented with to improve its performance or fit certain applications. The main customization methods include the following.

Patch size. The ViT architecture linearly embeds fixed-size input image patches. Patch size affects model per-
formance. Larger patches capture global context but lose fine-grained details, while smaller patches may fail to
capture global context. To achieve better performance, an optimal patch size has been used.

Positional encoding. ViT incorporates spatial information into the model via learnable positional encodings.
These encodings assist the model in understanding image patch placements. ViT performance can be improved with
sine/cosine, spatial, or learned positional encodings.
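For concreteness, the snippet below sketches the fixed sine/cosine variant mentioned above for a sequence of N patch embeddings of dimension D; a learned positional encoding would simply replace this tensor with a trainable parameter of the same shape.

```python
# Small sketch of fixed sine/cosine positional encoding, one D-dimensional
# vector per patch position, added to the patch embeddings.
import torch

def sincos_positional_encoding(num_patches: int, dim: int) -> torch.Tensor:
    positions = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)      # (N, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-torch.log(torch.tensor(10000.0)) / dim))            # (D/2,)
    pe = torch.zeros(num_patches, dim)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions
    return pe                                        # (N, D)

print(sincos_positional_encoding(196, 768).shape)    # torch.Size([196, 768]) for 14x14 patches
```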

Architectural variations. To improve ViT, researchers have tried several architectural variations. A Pyramid
Vision Transformer (PVT) is a hierarchical modification that captures multi-scale information. The Convo-
luted Vision Transformer (ConvViT) combines self-attention and convolutional layers to use local and global
information.

Training methods. ViT performance and convergence have been improved using various training methods.
Data augmentation, regularization (dropout, weight decay), and advanced optimization algorithms (Adam,
RMSprop) are examples. Pretraining on ImageNet and transfer ­learning43,44 have also been used to initialize ViT
models.

Hybrid models. Hybrid designs integrate Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for tasks such as pneumonia detection in chest X-ray images. In such a design, a CNN is first used as the feature extractor, removing its fully connected layers while retaining its convolutional and pooling layers. The CNN-generated feature maps are then separated into non-overlapping patches, and each patch is converted into a high-dimensional embedding vector. These embeddings, which depict local characteristics, are then fed into the ViT model in order to capture global dependencies and contextual information across the entire image. For final predictions, a classification head is appended to the ViT output. The entire hybrid model, comprising the CNN feature extractor and the ViT model, is trained end to end using labeled data, with fine-tuning strategies tailored to the specific dataset and computational resources available. This approach maximizes the extraction of both local and global information, optimizing performance for complex image analysis tasks: the transformer processes CNN-extracted features, so CNNs provide local feature extraction while transformers provide global context modeling. The Pyramid Vision Transformer (PVT) captures multi-scale information hierarchically; multiple stages process features at varying resolutions, so the model effectively captures local and global information. A Convoluted Vision Transformer (ConvViT) combines the self-attention mechanism with convolutional layers: self-attention models global context, while convolutional layers catch local patterns. This combination improves the model's handling of local and global information.
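A minimal sketch of such a hybrid, assuming a ResNet-18 backbone and a small Transformer encoder (neither of which is specified in the text), is shown below.

```python
# Hedged sketch of the CNN + ViT hybrid described above: a truncated CNN extracts
# feature maps, which are tokenized and passed through transformer encoder layers
# before a classification head. Layer sizes are illustrative, not the paper's.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HybridCnnViT(nn.Module):
    def __init__(self, embed_dim: int = 512, depth: int = 4, heads: int = 8, num_classes: int = 2):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # keep conv/pool layers only
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        feats = self.backbone(x)                           # (B, 512, 7, 7) for resnet18
        tokens = feats.flatten(2).transpose(1, 2)          # (B, 49, 512): one token per location
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)           # prepend a classification token
        encoded = self.encoder(tokens)                     # global context via self-attention
        return self.head(encoded[:, 0])                    # classify from the [CLS] token

logits = HybridCnnViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)                                        # torch.Size([2, 2])
```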

Attention mechanism. ViT’s architecture relies on attention techniques. Attention mechanism customization
may include Long-Range Arena (LRA) attention, Axial attention, and Shifted attention. LRA attention efficiently
handles input image long-range dependencies. It helps the model capture global context even when patches are
far apart.
Axial attention captures dependencies along image axes (rows and columns). Self-attention is modified to
catch shifted or offset patch dependencies. This helps the model manage data spatial transformations.
To achieve state-of-the-art performance and improved convergence, researchers have experimented with the following pre-trained Vision Transformer architectures.

DeiT (data‑efficient image transformers). DeiT34 uses self-attention mechanisms and patch-based processing
to outperform CNNs in image tasks with less labeled training data. Self-attention computes attention weights on
smaller image patches to efficiently capture long-range relationships and grasp the global context. The models
are pre-trained on large, unlabeled datasets to learn general visual representations, then fine-tuned on smaller,
task-specific datasets. Visual characteristics and hierarchical representations help the model transfer pre-trained
knowledge to the target task. Dropout and data augmentation increase generalization. Data-efficient image transformers use self-attention, patch-based processing, pre-training, fine-tuning, transfer learning, and regularization to perform well in image tasks with little labeled data.

Swin‑T. Swin ­transformer36,45, a new image understanding architecture, blends Transformers with CNNs. It
converts the input image into non-overlapping patches using transformer layers. Swin Transformer’s hierarchi-
cal architecture organizes transformer layers into stages, making it unique. Lower stages process patch-level
information, whereas later stages capture broader contextual information. The hierarchical model efficiently
captures local and global image dependencies. Shift procedures help the Swin Transformer model capture spatial links across windows. Swin Transformer uses Transformers' self-attention mechanism and CNNs' efficient processing to achieve
state-of-the-art results on image classification, object detection, and semantic segmentation with fewer compu-
tational resources than other transformer-based models.


ReViT. The Vision Transformer (ViT) architecture can accommodate inputs of different resolutions with
Resizable-ViT37. Traditional ViT models require fixed-size inputs, which can limit their adaptability in real-
world applications with varied image sizes. Resizable-ViT solves this problem with "token shifting" and "layer
dropping." Token shifting requires scaling the input image and adapting position and token embeddings to the
new resolution. For lower-resolution inputs, layer dropping skips architectural layers of the model based on the input resolution, reducing computational complexity. Resizable-ViT efficiently processes images of varied resolutions while performing well on image recognition tasks by dynamically adapting to input sizes.
All of these variants have been shown to enhance the performance and efficiency of Vision Transformers
and have been applied to a variety of tasks, including image recognition, object detection, and medical imaging,
with SOTA results.

Recent applications of Vision Transformer architecture


Vision Transformer (ViT) has attracted great interest in computer vision tasks due to its capacity to process images with high precision and efficiency. Recent developments and applications have been made to the ViT architecture. The DeiT model, which enhances the training of ViT models using data augmentation and distillation techniques, is one of the most significant innovations. The Swin Transformer model, which employs hierarchical representations to enhance the performance of ViT models on large-scale image datasets, is another innovation. Recent Vision Transformer architecture research has centered on a variety of applications, including the following.

Object detection and instance segmentation


ViT architecture is promising for object detection and instance segmentation because it possesses several essen-
tial characteristics that make it suitable for these tasks. First, the self-attention mechanisms in ViT enable the
model to learn global relationships between various image components, which can be used to identify and
localize objects. ViT can be trained on large datasets with many labeled examples, which is essential for these
tasks because they require a large amount of data to learn the complex patterns involved. Finally, ViT can be fine-tuned for specific object detection or instance segmentation tasks46, allowing it to achieve high accuracy by
adapting to the requirements of these tasks.

Dense predictions
Dense prediction is the task of predicting a pixel-wise output for an input image, such as semantic segmenta-
tion, where each pixel is designated as a specific object or background. The input image is divided into a series
of non-overlapping segments for dense prediction, which is then flattened and fed into the ViT architecture.
Self-attention allows ViT to record spatial information across these regions, and the output is shaped into a grid
corresponding to the original image. One of the benefits of employing ViT for dense prediction is that it can
learn to distinguish between objects of varying sizes and shapes without explicit object proposals or region-
based attributes. ViT attends to all regions in the input image and learns to weigh their contributions based on
the significance of their contributions to the output. In addition, ViT can be trained end-to-end with large-scale
datasets like ImageNet to acquire general features that can be applied to subsequent tasks like dense prediction.
In situations with limited labeled data, this makes ViT an attractive design for dense prediction.

Self‑supervised learning
Even without human annotations, ViT can be used for self-supervised learning47,48. Self-supervised learning teaches the model meaningful representations of the input data for classification, detection, and segmentation. Training the model on a pretext task is one method to use ViT for self-supervised learning49. Pretext tasks allow the model to learn key characteristics from the input data. A common pretext task is to use data augmentation to generate multiple perspectives of the same image and train the ViT model to predict which views match. Contrastive learning teaches the ViT model to distinguish between similar and distinct images: two arbitrary images are supplied to the ViT model, which is then trained to predict whether or not the two images are identical. In both cases, the ViT model discovers features that are independent of viewpoint, illumination, and other factors that affect the appearance of the input data. These learned characteristics can be used to initialize supervised model weights or to fine-tune subsequent tasks.

Multi‑modal learning
Recent ­research50 has examined the use of transformer-based architectures for multimodal unsupervised learn-
ing from raw video, audio, and text. Using self-supervised learning techniques, the plan is to implement a
transformer-based architecture capable of handling multiple modalities and capable of predicting the next frame,
audio, or text given the current one.

Efficient ViT architectures


Recent efforts have been made to make Vision Transformer models more effective in terms of computation time
and memory consumption. Multiple architectures, such as Separable Vision Transformer (SepViT)51 and Revers-
ible Vision Transformer (RViT)37,52, have been proposed by researchers that are capable of achieving comparable
or superior performance than conventional ViT models while being more energy-efficient. SepViT blocks employ
separable convolutions rather than conventional convolutions. This update minimizes the self-attention mecha-
nism, the most computationally expensive component of ViT. Separable convolutions separate conventional
convolutions into depthwise and pointwise convolutions, requiring fewer parameters and computations. RViT
augments ViT design with reversible residual blocks. These blocks recreate input features from output features,


which increases the efficiency of gradient calculation during backpropagation. Reversible blocks enable models
with limited memory to be larger.

Explainable AI
ViT can be utilized in Explainable ­AI33 to provide insight into how an image classification decision is made. By
using attention maps generated by ViT, it is possible to visualize which aspects of an image are most crucial to
the classification decision. This information can be used to clarify the model’s decision when communicating
with humans.
In Table 1, the article summarizes recent contributions made for a range of tasks using Vision Transformer
architecture.

Material and methods


Dataset characteristics
In the investigation, we used a publicly available chest X-ray (CXR) dataset from Kaggle57,58. The same dataset has also been utilized in numerous other investigations. The dataset consists of three sections: train, test, and validation. Each section contains subfolders for Pneumonia and Normal CXRs. There are 5863 X-ray images in total, as shown in Table 2. The X-ray images in the dataset were acquired at the Women and Children's Medical Center in Guangzhou from children aged one to five. These images were taken as part of the children's routine medical examinations. To assure the quality of the X-ray images used in the analysis, they were screened by specialists for low-resolution or unreadable images. The remaining images were then evaluated by two physician specialists, with any discrepancies resolved by a third specialist. This procedure was performed to teach an AI system to make precise diagnoses. 80% of the dataset has been allocated to the training set, 10% to the test set, and 10% to the validation set, as shown in Table 3.

Proposed architecture
The proposed architecture uses patch embeddings, positional encodings, several Transformer encoder layers, self-attention, feed-forward neural networks, and a classification head to classify and analyze images, as shown in Fig. 2.

Input embedding
It requires reshaping the input image into patches, as shown in Fig. 3, and applying a linear transformation in order to obtain the embeddings. Let's denote the input image as X ∈ R^(H×W×C), where H, W, and C, respectively, represent the height, width, and number of channels. Each patch has a dimension of P × P, and there are N = HW/P² patches in total. The input embedding can then be represented as E ∈ R^(N×D), where D is the dimension of the embeddings.

Positional encoding
The input embeddings include positional information to capture the relative and absolute positions of the patches. The positional encoding matrix P ∈ R^(N×D) is added to the input embeddings E element-wise.
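A minimal PyTorch sketch of this patch embedding plus a learnable positional encoding, using the common ViT-Base settings (P = 16, D = 768) as an assumption, is given below; the strided convolution is equivalent to flattening each P × P patch and applying a shared linear projection.

```python
# Sketch of the patch-embedding step: an H x W x C image is cut into N = HW/P^2
# patches, each projected to a D-dimensional embedding; a learnable positional
# encoding P is then added element-wise. Sizes follow ViT-Base defaults.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A P x P convolution with stride P == "flatten each patch, then Linear"
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))  # learnable P

    def forward(self, x):                       # x: (B, C, H, W)
        e = self.proj(x)                        # (B, D, H/P, W/P)
        e = e.flatten(2).transpose(1, 2)        # (B, N, D) patch embeddings E
        return e + self.pos_embed               # element-wise addition of positional encodings

print(PatchEmbedding()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 196, 768])
```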

Transformer encoder
Each layer of the Transformer Encoder consists of a multi-head self-attention mechanism and a position-wise feed-forward network, as shown in Fig. 4.

(a) Multi-head self-attention: The attention weights between the input embeddings are computed by the multi-head self-attention mechanism. It entails three linear transformations: Query (Q), Key (K), and Value (V), with Q, K, and V ∈ R^(N×D). Using the attention weights, the output of the self-attention mechanism is the weighted sum of the values. The attention weights are calculated by Eq. (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_h}}\right)V \tag{1}$$

where $D_h$ represents the dimension of each attention head.


(b) Position-wise feed-forward network: The position-wise feed-forward network employs two linear transformations separated by a nonlinear activation function (such as ReLU). Let's designate the attention mechanism's output as A ∈ R^(N×D). The position-wise feed-forward network is then given by Eq. (2):

$$\mathrm{FFN}(A) = \max(0,\; A W_1 + b_1)\, W_2 + b_2 \tag{2}$$

where $W_1 \in \mathbb{R}^{D \times d_{FFN}}$, $b_1 \in \mathbb{R}^{1 \times d_{FFN}}$, $W_2 \in \mathbb{R}^{d_{FFN} \times D}$, and $b_2 \in \mathbb{R}^{1 \times D}$.

These two sub-layers are applied in parallel across all positions of the input sequence and then combined to generate the encoder layer's output. The process is repeated multiple times to form a stack of encoder layers, where each encoder layer builds upon the representation learned by the preceding encoder layer, enabling the model to learn increasingly complex and generalized representations of the input sequence.
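The sketch below implements one such encoder layer with Eq. (1) and Eq. (2) written out explicitly; the residual connections and layer normalization are standard ViT components assumed here rather than stated in the equations above, and the dimensions follow ViT-Base conventions.

```python
# Sketch of a single Transformer encoder layer implementing Eq. (1) and Eq. (2).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    def __init__(self, dim=768, num_heads=12, dim_ffn=3072):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)          # the three linear maps for Q, K, V
        self.out = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim_ffn), nn.ReLU(), nn.Linear(dim_ffn, dim))  # Eq. (2)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def attention(self, x):                          # Eq. (1), with one Q/K/V set per head
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        reshape = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = map(reshape, (q, k, v))            # (B, heads, N, D_h)
        weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(b, n, d)
        return self.out(out)

    def forward(self, x):                            # x: (B, N, D) patch embeddings
        x = x + self.attention(self.norm1(x))        # attention sub-layer with residual connection
        return x + self.ffn(self.norm2(x))           # feed-forward sub-layer with residual connection

print(EncoderLayer()(torch.randn(1, 197, 768)).shape)   # torch.Size([1, 197, 768])
```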


Table 1 summarizes related recent research. For each article, the approach, major findings, and identified gaps with future directions for enhancement are listed below.

"An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale"53
Approach: Each segmented patch is linearly projected into a high-dimensional embedding space, and the result is then input into the Transformer encoder. The traditional CNN backbone is replaced with a Transformer encoder-decoder framework, thereby enabling a more unified framework across modalities.
Major findings: Obtained cutting-edge performance on benchmark datasets with fewer computational resources than traditional CNN-based methods; results can be improved by adjusting the number of layers, the dimensionality of the embeddings, or the design of the attention mechanism, and by fine-tuning the architecture to strike a balance between model capacity and computational efficiency.
Gap and future direction: Transformers demand more processing power and memory than convolutional neural networks (CNNs), and the article does not elucidate how to address this. Transformers are less interpretable than CNNs, and interpretability strategies are not discussed. Patch size, computational efficiency, and performance compromises are not considered. Resolving these issues could facilitate the scalability of Transformer-based image recognition methods.

"Show, attend and tell: Neural image caption generation with visual attention"17
Approach: The authors demonstrate the effectiveness of incorporating a visual attention mechanism into the caption generation process. The attention mechanism allows the model to focus on various portions of the image while generating each word in the caption, thereby improving the alignment between the image content and the generated text.
Major findings: Superior caption quality in comparison to previous methods. By focusing on pertinent image regions, the model generates more accurate and descriptive captions that capture the image's most important objects, actions, and relationships.
Gap and future direction: The approach lacks fine-grained attention because it employs a soft attention mechanism that assigns weights to image regions rather than concentrating on particular objects or attributes, which hinders the model's ability to generate captions with precise details. The article does not discuss strategies for fine-tuning the interpretation and control of the attention mechanism, thereby limiting adaptability and interpretability.

"Deep MRI Reconstruction with Generative Vision Transformers"54
Approach: The deep generative network GVTrans translates noisy and latent variables onto high-quality MR images. A multi-layer architecture improves image resolution, and cross-attention transformer modules receive up-sampled feature maps in each layer. For test-data inference, MR images are masked using the same sampling pattern as the under-sampled acquisition, and optimized network parameters ensure that reconstructed and original k-space samples match.
Major findings: Better image quality than CNN-based reconstructions with and without self-attention processes, and the model can adjust to individual test subjects. GVTrans may improve the applicability and generalizability of deep MRI reconstruction.
Gap and future direction: Training the proposed GVTrans architecture is computationally intensive, and GVTrans may be unable to reconstruct images with high levels of noise or anomalies, as well as images with very low sampling rates. Using a larger dataset of fully-sampled MRI acquisitions for training, incorporating additional information such as patient demographics or clinical history into the training process, and developing a more efficient training algorithm could improve the performance of GVTrans.

"A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation"46
Approach: Universal Vision Transformer (UViT), an intuitive and efficient Vision Transformer architecture, was proposed for object detection and instance segmentation.
Major findings: UViT is a simple yet efficient model that achieves competitive performance on the COCO benchmarks for object detection and instance segmentation.
Gap and future direction: On some tasks, such as dense prediction, UViT may not attain the same level of performance as more complex Vision Transformer architectures, and it may not be as effective as more specialized models for object detection and instance segmentation.

"Training data-efficient image transformers & distillation through attention"34
Approach: A large, pre-trained convolutional neural network (CNN) is used as a teacher to train a smaller, more efficient transformer-based student model. The student model gains knowledge from the teacher by observing the teacher's output, which is represented by a distillation token. The distillation token is added to the input of the student model and is utilized to direct the attention mechanism.
Major findings: The DeiT-B model obtains 85.2% top-1 accuracy on ImageNet with 86 M parameters when trained with 100 epochs and 16 GPUs.
Gap and future direction: The distillation token can be computationally expensive to compute, and it can reduce the diversity of the attention weights. The distillation token could be enhanced by employing a more efficient method for computing it and modified to promote attention weights with greater diversity. The method could be applied to additional tasks, including object detection and segmentation.

"Analyzing Transfer Learning of Vision Transformers for Interpreting Chest Radiography"55
Approach: A standard Vision Transformer architecture is trained on a large collection of natural images and then, using a limited number of labeled examples, refined on the CheXpert or Pediatric Pneumonia dataset.
Major findings: A model's performance on a medical image classification task can be considerably enhanced by transfer learning from a previously trained Vision Transformer. Fine-tuning has no significant effect on the efficacy of the model.
Gap and future direction: Domain adaptation and other transfer learning methods may improve Vision Transformers' medical image classification performance in future research. The model's performance can be further improved using larger fine-tuning datasets.

"Introducing Convolutions to Vision Transformers"56
Approach: A novel design called the Convolutional Vision Transformer (CvT) increases the performance and efficiency of Vision Transformers (ViTs) by adding convolutions. A convolutional token embedding layer replaces the token embedding layer, enabling the CvT to discover spatial relationships between tokens and enhancing the model's capacity to represent complex visual patterns. A convolutional attention operation replaces the attention operation, enabling the CvT to efficiently compute attention weights across vast spatial regions and enhancing the model's capacity to capture global context.
Major findings: CvT outperforms ViTs on a variety of image classification tasks while requiring fewer parameters and FLOPs. For instance, CvT achieves a top-1 accuracy of 89.4% on the ImageNet-1k dataset, which is comparable to the state-of-the-art performance of ResNet-50 despite employing only 1/10th of the parameters and 1/100th of the FLOPs.
Gap and future direction: CvTs are harder to train and slower at inference when compared with ViTs. Using deeper and broader CvT models to further improve performance, adding residual connections between CvT layers to improve training stability, and employing dilated convolutions and group convolutions to improve the model's ability to represent long-range dependencies could further improve the proposed model.

Table 1.  Insight into related recent research.


Class No of images
Pneumonia (P) 4273
Normal (N) 1583

Table 2.  Class distribution of the dataset.

# of images from P class # of images from N class
Split | # of images | # of images from P class | # of images from N class
Training data | 4684 | 3205 | 1479
Validation data | 586 | 360 | 226
Test data | 586 | 330 | 256

Table 3.  Partitioning of training, testing, and validation datasets.

Figure 2.  The proposed system design architecture.

Classification layer. This layer utilizes the encoder layers’ output to predict pneumonia’s presence or absence.
This prediction may be made using a fully connected or convolutional layer.

Loss function. This component evaluates the model’s efficacy based on the predicted and actual labels. In this
endeavor, binary cross-entropy loss is a common loss function.

Ethical standards
No human participants were involved in the study. The dataset is publicly available on the Internet.

Results and discussion


Performance indicators
Various evaluation metrics are used to measure the effectiveness of machine learning models, and each has its benefits and drawbacks. The most prevalent metrics include the following.


Figure 3.  Dataset input image in the form of smaller patches.

Figure 4.  Internal design of a transformer encoder.

Accuracy
This is the most important metric for evaluating a model and is defined as the proportion of correct predictions
to the total number of predictions made by the model. It is evaluated using Eq. (3).


$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}} \tag{3}$$

Precision
Higher-precision classifiers produce fewer false positives. High precision reduces the likelihood of misclassifying negative instances as positive, which matters in the numerous applications where false positives have severe consequences. Precision is calculated by Eq. (4):

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{4}$$

Recall (sensitivity or true positive rate)


Classifiers with higher recall have fewer false negatives: the classifier captures more positive cases and reduces false negatives. A classifier with lower recall has more false negatives. Recall is determined by Eq. (5):

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{5}$$

F1 Score
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between them, and is calculated using Eq. (6):

$$F\text{-}score = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

ROC curve
ROC curves evaluate binary classification models. The model separates positive and negative events across clas-
sification thresholds. ROC curve form and position indicate model discrimination. The ROC curve shows the
trade-off between positive and negative identification when the classification threshold changes. A higher AUC indicates better discrimination and model performance.

Confusion matrix
The confusion matrix tabulates classification model performance. It compares predicted labels to real labels and
shows different classification outcomes. The confusion matrix reveals model performance. True positives (TP)
and true negatives (TN) are situations that were accurately predicted. False positives (FP) and false negatives (FN)
are cases of misclassification. These values allow us to generate model performance metrics including accuracy,
precision, recall, and F1 score.
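These definitions can be checked with a small helper that derives the metrics directly from the four confusion-matrix counts; the TP/TN/FP/FN values used in the example are those reported for the test set later in the paper.

```python
# Simple helper illustrating Eqs. (3)-(6) using confusion-matrix counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (3)
    precision = tp / (tp + fp)                           # Eq. (4)
    recall = tp / (tp + fn)                              # Eq. (5), sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (6)
    specificity = tn / (tn + fp)                         # true negative rate
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}

print(classification_metrics(tp=152, tn=420, fp=6, fn=8))
# accuracy ~0.976 and recall ~0.95, in line with the values reported in Table 5
```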

Model’s training
To demonstrate our proposed architecture, we experimented with a benchmark dataset of CXR images, one of the most frequently downloaded datasets on Kaggle, and used it for binary classification. Python 3.7, an Anaconda/3 distribution, and CUDA/10 are installed on a Windows server with an i5 CPU, a 2 GB GPU, and 8 GB RAM. In addition, the Python libraries PyTorch, OpenCV, matplotlib, os, math, and NumPy are used. During training, the data is partitioned into batches, and the model's parameters are updated based on each batch's average loss. The batch size dictates the number of samples utilized during each update step; a larger batch size can speed up training but may require additional memory. CrossEntropyLoss was chosen as the experiment's loss function, which the model minimizes during training; it computes the negative log-likelihood of the predicted class probabilities against the actual labels. The training algorithm modifies the parameters of the model. In the experiment, the Adam optimizer was used, which adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients. PyTorch was used for the implementation, and training was conducted in a GPU environment. The learning rate establishes how much the model parameters are updated with each optimizer iteration. The multiplicative factor of the learning rate is used to modify the learning rate at each epoch or phase, enabling more granular control of the learning rate during training. The learning rate's multiplicative factor can help the model converge
on a superior solution. Table 4 demonstrates the experiment’s hyperparameter settings. The novelty of our work
lies in the application of the Vision Transformer (ViT), specifically utilizing the DEIT_Base_Patch16_224 pre-
trained weights, to the domain of medical imaging for pneumonia detection. While ViT has shown promise in
various fields, its adaptation to medical imaging, especially chest X-ray analysis, is relatively unexplored. Our
approach capitalizes on ViT’s ability to capture intricate spatial relationships in images, offering advantages over
traditional methods. We demonstrate improved performance and potential for enhanced pneumonia detection
accuracy, marking a significant contribution to the field of medical image analysis.
A model’s performance depends on these hyperparameters and others. To enhance model performance,
selecting hyperparameter values requires careful analysis and experimentation. For optimal performance, hyper-
parameters must be explored and fine-tuned based on task, dataset, and model architecture.


Hyperparameter Value
Batch size 16
Criterion CrossEntropyLoss
Learning rate 1e − 05
Optimizer Adam
Device Cuda
Image resize 224 × 224
The multiplicative factor of the learning rate 0.995

Table 4.  Hyper-parameter setting used in the experiment.
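A hedged sketch of a training loop using the settings in Table 4 and a DeiT-Base backbone from the timm library is shown below; the dataset path, the grayscale-to-RGB conversion, and the overall wrapper are assumptions, since the paper does not list its exact code.

```python
# Sketch of the training configuration in Table 4 (batch size 16, Adam with lr 1e-5,
# CrossEntropyLoss, 224x224 inputs, multiplicative lr factor 0.995, 30 epochs) with a
# DeiT-Base model from timm. Dataset path and preprocessing are hypothetical.
import timm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.Grayscale(3),
                          transforms.ToTensor()])
train_ds = datasets.ImageFolder("chest_xray/train", transform=tfm)   # hypothetical dataset path
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)

model = timm.create_model("deit_base_patch16_224", pretrained=True, num_classes=2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)  # multiplicative factor

for epoch in range(30):
    model.train()
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)   # negative log-likelihood of predicted classes
        loss.backward()
        optimizer.step()
    scheduler.step()                              # lr *= 0.995 after each epoch
```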

Performance evaluation
The model’s train-validation accuracy against the epoch curve shows its learning and generalization. If training
accuracy increases but validation accuracy plateaus or falls, it indicates overfitting. Convergence and excellent
accuracy for both curves show learning and generalization efficacy. The train-validation loss versus epochs curve
shows model optimization. The model initially matches data better when training and validation loss decreases.
Overfitting occurs when training loss decreases with increasing validation loss. Convergence and low loss suggest
error minimization and good generalization for both curves.
Table 5 presents the performance delivered by the proposed approach, and Figs. 5 and 6 show the relationship between accuracy and epoch and between loss and epoch, respectively. Figures 5 and 6 show that during training, validation accuracy gradually improves along with test accuracy and reaches 97.61%, and the other performance indicators also show strong results.

Confidence intervals test


This is a statistical tool used to estimate the range within which a performance metric, such as accuracy, sensitivity, or specificity, is likely to lie. It provides a range of values that likely contains the true value of the parameter, along with a level of confidence.
The confidence interval (CI) for accuracy is calculated using Eq. (7):

$$\text{Accuracy CI} = \text{Accuracy} \pm Z \times \sqrt{\frac{\text{Accuracy} \times (1 - \text{Accuracy})}{\text{sample size}}} \tag{7}$$

Z is the z-score corresponding to the desired confidence level. For example, for a 95% confidence level, the
Z-score is approximately 1.96.
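Eq. (7) can be checked with a few lines of Python using the reported accuracy and the 586-image test set; the result is close to the interval quoted in the interpretation below.

```python
# Quick check of Eq. (7) with the reported accuracy (97.61%) and test-set size (586).
import math

def accuracy_confidence_interval(accuracy: float, n: int, z: float = 1.96):
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

low, high = accuracy_confidence_interval(0.9761, 586)
print(f"95% CI: {low:.3f} - {high:.3f}")   # roughly 0.964 - 0.988
```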

Epoch | Split ratio | Loss (train) | Accuracy (train) | Loss (test) | Accuracy (test) | Sensitivity | Specificity | F score | AUC
30 | 0.20 | 0.057 | 98.04 | 0.069 | 97.61 | 0.949 | 0.981 | 0.952 | 0.966

Table 5.  Performance delivered by the proposed model.

Figure 5.  Accuracy variation vs epoch curve.


Figure 6.  Loss vs epoch curve.

Interpretation
With the accuracy reported as 97.61% and a 95% confidence level, the confidence interval is between 96.2% and 98.9%. This means we can be 95% confident that the true accuracy of our proposed model lies within this range.

Matthews correlation coefficient (MCC)


The Matthews correlation coefficient (MCC) is a measure used in machine learning to evaluate the quality of
binary classification. The formula for MCC is described in Eq. (8).
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{8}$$

From the confusion matrix on the test data, TP = 152, TN = 420, FP = 6, and FN = 8, which gives MCC ≈ 0.9396.
The Matthews correlation coefficient (MCC) typically ranges from − 1 to + 1:

+ 1 indicates a perfect prediction,


0 suggests a random prediction,
− 1 indicates a total disagreement between prediction and observation.

In this case, an MCC of approximately 0.9396 indicates a very strong positive correlation between the pre-
dicted and actual classifications. This suggests an excellent classification performance for the model used.
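Eq. (8) and the reported value can be verified directly from the confusion-matrix counts:

```python
# Verifying Eq. (8) with the confusion-matrix counts above (TP=152, TN=420, FP=6, FN=8).
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

print(round(matthews_corrcoef(152, 420, 6, 8), 4))   # 0.9396
```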
The confusion matrix in Fig. 7 shows that out of 586 samples in the test data, our proposed model produced 152 cases of TP, 420 cases of TN, 6 cases of FP, and 8 cases of FN, which indicates a test accuracy of 97.61%. The variation of precision and recall is shown in Figs. 8 and 9, which indicate that recall converges after 15 epochs while precision converges after 35 epochs. The ROC curve of the suggested architecture, depicted in Fig. 10, indicates an AUC value of 0.96. It denotes the capability of our proposed model to identify the presence or absence of

Figure 7.  Confusion matrix based on test data for the proposed model.


Figure 8.  Model precision with epochs.

Figure 9.  Model recall with epochs.

Figure 10.  ROC curve with AUC 0.96 of proposed work.


pneumonia. A precision-recall value of 0.94, depicted in Fig. 11, suggests that the model demonstrates a notable
capacity to accurately predict positive instances while capturing a substantial proportion of the true positive
instances. The precise interpretation may differ depending on the domain of application and the particular
objectives of the classification endeavor.

Discussion
Table 6 presents the performance of pre-trained CNN architectures, keeping all hyper-parameter values the same to allow a comparison on the same dataset. It shows that the Vision Transformer architecture offers a great improvement over all other architectures. The proposed architecture achieves an accuracy of 97.61% and an AUC of 0.96, but this superior performance is obtained by compromising on training time, because training took longer compared with the other architectures.

Research prospects in Vision Transformer


Vision Transformer (ViT) architecture research prospects for image classification hold tremendous potential
for advancing the field. Future research can concentrate on enhancing the performance of ViT models by opti-
mizing their architecture, refining training strategies, and investigating novel techniques to improve precision,
robustness, and efficiency. In addition, efforts can be focused on developing interpretability methodologies for
ViT models, allowing for a better comprehension of their decision-making process. It is possible to investigate
efficient training and inference methods to reduce computational complexity and accelerate model deployment.

Figure 11.  Precision–recall curve of the proposed method.

Sr no. | Architecture | Refs. | Accuracy | F-score | # of trainable parameters | # of non-trainable parameters
1 | VGG16 | 59 | 92.14 | 0.9234 | 50,178 | 14,714,688
2 | VGG19 | 60 | 90.22 | 0.8999 | 50,178 | 20,024,384
3 | ResNet50 | 61 | 82.37 | 0.8281 | 200,706 | 23,587,712
4 | ResNet101 | 62 | 75.96 | 0.7593 | 200,706 | 42,658,176
5 | ResNet152 | 63 | 87.18 | 0.8734 | 200,706 | 58,370,944
6 | ResNet50V2 | 64 | 89.26 | 0.8937 | 200,706 | 23,564,800
7 | ResNet101V2 | 65 | 92.62 | 0.9250 | 200,706 | 42,626,560
8 | ResNet152V2 | 66 | 92.94 | 0.9312 | 200,706 | 58,331,648
9 | InceptionV3 | 67 | 89.42 | 0.8937 | 102,402 | 21,802,784
10 | InceptionResNetV2 | 68 | 90.70 | 0.8989 | 200,706 | 58,331,648
11 | DenseNet121 | 69 | 91.82 | 0.9171 | 100,354 | 7,037,504
12 | DenseNet169 | 70 | 88.78 | 0.8874 | 163,074 | 12,642,880
13 | DenseNet201 | 71 | 91.83 | 0.9171 | 188,162 | 18,321,984
14 | NASNetLarge | 72 | 88.14 | 0.8812 | 975,746 | 84,916,818
15 | Quaternion Residual Network | 73 | 93.75 | 0.9405 | 560,769 | 8,576
16 | Vision Transformer | Proposed in the paper | 97.61 | 0.9500 | 85,800,194 | 0

Table 6.  Performance evaluation relative to other architectures utilizing the same dataset.


Adapting ViT to scenarios with limited data using semi-supervised and few-shot learning techniques will increase
its applicability. In addition, domain-specific extensions, hybrid architectures that combine ViT with other
models, and real-world deployments will contribute to the advancement and practical application of ViT in
image classification tasks.

Conclusion
The article conducts a thorough analysis of a Vision Transformer (ViT) framework for pneumonia detection in
chest X-rays. ViTs’ ability to analyze complex image relationships is showcased, demonstrating superior per-
formance over traditional CNNs and other advanced techniques. ViTs excel in capturing global context, spatial
relations, and handling variable image resolutions, leading to accurate pneumonia detection. The study aims to
assess this method’s effectiveness by comparing it to state-of-the-art models on a diverse CXR dataset. The results
reveal ViT’s superiority with an accuracy of 97.61%, sensitivity of 95%, and specificity of 98%. In conclusion, the
ViT-based approach holds promise for early pneumonia detection in CXRs, offering substantial development
potential in this field. However, limitations include data scarcity and the need for real-world validation. Future
directions encompass enhancing interpretability, addressing model robustness, and conducting clinical trials
for practical deployment.

Data availability
In this work, a public dataset of CXR (https://data.mendeley.com/datasets/rscbjbr9sj/2) has been used.

Received: 20 June 2023; Accepted: 22 January 2024

References
1. Pneumonia in children. WHO (2019). https://www.who.int/news-room/fact-sheets/detail/pneumonia
2. Khan, S. H. et al. COVID-19 detection and analysis from lung CT images using novel channel boosted CNNs. Expert Syst. Appl.
229, 120477 (2022).
3. Khan, S. H. et al. COVID-19 detection in chest X-ray images using deep boosted hybrid learning. Comput. Biol. Med. 137, 104816
(2021).
4. Khan, S. H., Sohail, A., Zafar, M. M. & Khan, A. Coronavirus disease analysis using chest X-ray images and a novel deep convo-
lutional neural network. Photodiagnosis Photodyn. Ther. 35, 102473 (2021).
5. Singh, S., Tripathi, B. K. & Rawat, S. S. Deep quaternion convolutional neural networks for breast Cancer classification. Multimed.
Tools Appl. 82, 31285–31308 (2023).
6. Liang, G. & Zheng, L. A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput. Methods
Programs Biomed. 187, 104964 (2020).
7. Nishio, M., Noguchi, S., Matsuo, H. & Murakami, T. Automatic classification between COVID-19 pneumonia, non-COVID-19
pneumonia, and the healthy on chest X-ray image: Combination of data augmentation methods. Sci. Rep. 10, 1–6 (2020).
8. Asif, S., Zhao, M., Tang, F. & Zhu, Y. A deep learning-based framework for detecting COVID-19 patients using chest X-rays.
Multimed. Syst. https://doi.org/10.1007/s00530-022-00917-7 (2022).
9. Suryaa, V. S., Annie, A. X. & Aiswarya, M. S. Efficient DNN ensemble for pneumonia detection in chest X-ray images. Int. J. Adv.
Comput. Sci. Appl. 12, 759–767 (2021).
10. Singh, S., Kumar, M., Kumar, A., Verma, B. K. & Shitharth, S. Pneumonia detection with QCSA network on chest X-ray. Sci. Rep.
13, 9025 (2023).
11. Duong, L. T., Nguyen, P. T., Iovino, L. & Flammini, M. Automatic detection of COVID-19 from chest X-ray and lung computed
tomography images using deep neural networks and transfer learning. Appl. Soft Comput. 132, 109851 (2023).
12. Duong, L. T., Le, N. H., Tran, T. B., Ngo, V. M. & Nguyen, P. T. Detection of tuberculosis from chest X-ray images: Boosting the
performance with Vision Transformer and transfer learning. Expert Syst. Appl. 184, 115519 (2021).
13. Duong, L. T., Nguyen, P. T., Iovino, L. & Flammini, M. Deep learning for automated recognition of COVID-19 from chest X-ray images. medRxiv. https://doi.org/10.1101/2020.08.13.20173997 (2020).
14. Kazemzadeh, S. et al. Deep learning detection of active pulmonary tuberculosis at chest radiography matched the clinical performance of radiologists. Radiology 306, 124–137 (2023).
15. Ramachandran, P. et al. Stand-alone self-attention in vision models. Adv. Neural Inform. Process. Syst. 32 (2019).
16. Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M. & Hu, S.-M. Visual attention network. 14, 1–12 (2022).
17. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual attention. in 32nd Int. Conf. Mach. Learn. ICML
2015 3, 2048–2057 (2015).
18. Wang, F. et al. Residual attention network for image classification. in Proc.—30th IEEE Conf. Comput. Vis. Pattern Recognition,
CVPR 2017 2017-Janua, 6450–6458 (2017).
19. Singh, S. et al. Deep attention network for pneumonia detection using chest X-ray images. Comput. Mater. Contin. 74, 1673–1690
(2023).
20. Kumar, M. & Biswas, M. Human activity detection using attention-based deep network. Springer Proc. Math. Stat. 417, 305–315
(2023).
21. Kumar, M., Patel, A. K., Biswas, M. & Shitharth, S. Attention-based bidirectional-long short-term memory for abnormal human
activity detection. Sci. Rep. 13, 14442 (2023).
22. Carion, N. et al. End-to-end object detection with transformers. in Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 12346 LNCS 213–229 (2020).
23. Potamias, R. A., Siolas, G. & Stafylopatis, A. G. A transformer-based approach to irony and sarcasm detection. Neural Comput.
Appl. 32, 17309–17320 (2020).
24. Wolf, T. et al. Transformers: State-of-the-art natural language processing. 38–45 (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6.
25. Singh, S. & Mahmood, A. The NLP cookbook: Modern recipes for transformer based deep learning architectures. IEEE Access 9,
68675–68702 (2021).
26. Wolf, T. et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv Prepr. arXiv1910.03771 (2019).
27. Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Process. Syst. 2017-Decem, 5999–6009 (2017).
28. Al-Deen, H. S. S., Zeng, Z., Al-Sabri, R. & Hekmat, A. An improved model for analyzing textual sentiment based on a deep neural
network using multi-head attention mechanism. Appl. Syst. Innov. 4.4, 85 (2021).


29. Feng, Y. & Cheng, Y. Short text sentiment analysis based on multi-channel CNN with multi-head attention mechanism. IEEE
Access 9, 19854–19863 (2021).
30. Park, S. et al. Multi-task Vision Transformer using low-level chest X-ray feature corpus for COVID-19 diagnosis and severity
quantification. Med. Image Anal. 75, 102299 (2022).
31. Zhu, J. et al. Efficient self-attention mechanism and structural distilling model for Alzheimer’s disease diagnosis. Comput. Biol.
Med. 147, 105737 (2022).
32. Chen, C., Gong, D., Wang, H., Li, Z. & Wong, K. Y. K. Learning spatial attention for face super-resolution. IEEE Trans. Image
Process. 30, 1219–1231 (2020).
33. Mondal, A. K., Bhattacharjee, A., Singla, P. & Prathosh, A. P. XViTCOS: Explainable Vision Transformer based COVID-19 screening using radiography. IEEE J. Transl. Eng. Heal. Med. 10, 1–10 (2021).
34. Touvron, H. et al. Training data-efficient image transformers & distillation through attention. in International Conference on
Machine Learning 10347–10357 (2021).
35. Islam, M. N. et al. Vision Transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor
from CT-radiography. Sci. Rep. 12, 1–14 (2022).
36. Liu, Z. et al. Swin transformer: Hierarchical Vision Transformer using shifted windows. in Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
37. Zhu, Y. et al. Make a long image short: Adaptive token length for Vision Transformers. arXiv Prepr. arXiv2112.01686 (2021).
38. Han, K. et al. A survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2022.3152247 (2022).
39. Jiang, Z. et al. Computer-aided diagnosis of retinopathy based on Vision Transformer. J. Innov. Opt. Health Sci. 15.02, 2250009
(2022).
40. Chen, J. et al. Channel and spatial attention based deep object co-segmentation. Knowledge-Based Syst. 211, 106550 (2021).
41. Zhang, Y., Fang, M. & Wang, N. Channel-spatial attention network for fewshot classification. PLoS One 14, 1–16 (2019).
42. Bastidas, A. A. & Tang, H. Channel attention networks. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. 2019-June,
881–888 (2019).
43. Singh, S. et al. Hybrid models for breast cancer detection via transfer learning technique. Comput. Mater. Contin. 74, 3063–3083
(2022).
44. Seemendra, A., Singh, R. & Singh, S. Breast cancer classification using transfer learning. Lect. Notes Electr. Eng. 694, 425–436
(2021).
45. Jiang, J. COVID-19 detection in chest X-ray images using swin-transformer and transformer in transformer.
46. Chen, W. et al. A simple single-scale Vision Transformer for object detection and instance segmentation. in Lect. Notes Comput.
Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 13670 LNCS, 711–727 (2022).
47. Goldberg, X. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. https://doi.org/10.2200/S00196ED1V01Y200906AIM006 (2009).
48. Liu, X. et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 35.1, 857–876 (2021).
49. Caron, M. et al. Emerging properties in self-supervised Vision Transformers. in Proc. IEEE Int. Conf. Comput. Vis. 9630–9640
(2021). https://doi.org/10.1109/ICCV48922.2021.00951.
50. Akbari, H. et al. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. Adv. Neural Inf.
Process. Syst. 29, 24206–24221 (2021).
51. Li, W. et al. SepViT: Separable Vision Transformer. (2022).
52. Mangalam, K. et al. Reversible Vision Transformers. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2022-June,
10820–10830 (2022).
53. Dosovitskiy, A. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. (2020).
54. Korkmaz, Y., Yurt, M., Dar, S. U. H., Özbey, M. & Cukur, T. Deep MRI reconstruction with generative Vision Transformers. in
Machine Learning for Medical Image Reconstruction: 4th International Workshop, MLMIR 2021, Held in Conjunction with MICCAI
2021, Strasbourg, France, October 1, 2021, Proceedings 4 54–64 (2021).
55. Usman, M., Zia, T. & Tariq, A. Analyzing transfer learning of Vision Transformers for interpreting chest radiography. J. Digit.
Imaging. https://doi.org/10.1007/s10278-022-00666-z (2022).
56. Wu, H. et al. CvT: Introducing convolutions to Vision Transformers. Proc. IEEE Int. Conf. Comput. Vis. https://doi.org/10.1109/ICCV48922.2021.00009 (2021).
57. Kermany, D., Zhang, K. & Goldbaum, M. Chest X-ray images (pneumonia). https://data.mendeley.com/datasets/rscbjbr9sj/2.
58. Kermany, D. Large dataset of labeled optical coherence tomography (OCT) and chest X-ray images. Mendeley Data. 3.10.17632
(2018).
59. Hassan, M. VGG16—Convolutional network for classification and detection. Neurohive (2018). https://neurohive.io/en/popularnetworks/vgg16.
60. Dey, N., Zhang, Y. D., Rajinikanth, V., Pugalenthi, R. & Raja, N. S. M. Customized VGG19 architecture for pneumonia detection
in chest X-rays. Pattern Recognit. Lett. 143, 67–74 (2021).
61. Elpeltagy, M. & Sallam, H. Automatic prediction of COVID-19 from chest images using modified ResNet50. Multimed. Tools Appl.
80.17, 26451–26463 (2021).
62. Zhang, Q. A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 4, 1–13 (2022).
63. Prabhakaran, A. K., Nair, J. J. & Sarath, S. Thermal facial expression recognition using modified ResNet152. in Lecture Notes in
Electrical Engineering vol. 736 LNEE (2021).
64. Rahimzadeh, M. & Attar, A. A new modified deep convolutional neural network for detecting COVID-19 from X-ray images.
arXiv 19, 100360 (2020).
65. Lee, H. C. & Aqil, A. F. Combination of transfer learning methods for kidney glomeruli image classification. Appl. Sci. 12.3, 1040
(2022).
66. Albahli, S., Rauf, H. T., Algosaibi, A. & Balas, V. E. AI-driven deep CNN approach for multilabel pathology classification using
chest X-rays. PeerJ Comput. Sci. 7, 1–17 (2021).
67. Jignesh Chowdary, G., Punn, N. S., Sonbhadra, S. K. & Agarwal, S. Face mask detection using transfer learning of inceptionV3. in
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
vol. 12581 LNCS (2020).
68. Mondal, M. R. H., Bharati, S. & Podder, P. CO-IRv2: Optimized InceptionResNetV2 for COVID-19 detection from chest CT
images. PLoS One 16.10, e0259179 (2021).
69. Ezzat, D., Hassanien, A. E. & Ella, H. A. GSA-DenseNet121-COVID-19: A hybrid deep learning architecture for the diagnosis of COVID-19 disease based on gravitational search optimization algorithm. Arxiv.Org (2020).
70. Oktaviana, U. N. & Azhar, Y. Garbage classification using ensemble DenseNet169. J. RESTI (Rekayasa Sist. dan Teknol. Informasi) 5.6, 1207–1215 (2021).
71. Adhinata, F. D., Rakhmadani, D. P., Wibowo, M. & Jayadi, A. A deep learning using DenseNet201 to detect masked or non-masked
face. JUITA J. Inform. 9.1, 115–121 (2021).
72. Yang, G., He, Y., Yang, Y. & Xu, B. Fine-grained image classification for crop disease based on attention mechanism. Front. Plant
Sci. 11, 1–15 (2020).


73. Singh, S. & Tripathi, B. K. Pneumonia classification using quaternion deep learning. Multimed. Tools Appl. 81, 1743–1764 (2022).

Author contributions
All authors contributed equally to this work. The manuscript was reviewed by all authors.

Competing interests
The authors declare no competing interests.

Additional information
Correspondence and requests for materials should be addressed to S.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024
