Yashica Garg
Department of Computer Science and Engineering
Indira Gandhi Delhi Technical University for Women, New Delhi
[email protected]
Abstract— The objective of the current work is to evaluate the performance of self-supervised learning (SSL) models on multi-label classification using the NIH Chest X-ray dataset, a large-scale medical imaging dataset. The performance of the SSL models Dino_v_b16, Dino with ResNet-50, and MoCo (Momentum Contrast) has been evaluated against a supervised ResNet model. These models are trained to detect multiple pathologies from chest X-ray images. We find that the self-supervised embedding models match or outperform traditional supervised learning, achieving similar or higher performance on a number of quantitative metrics when compared to the softmax-based supervised ResNet model. This also draws attention to the possibilities of using SSL for biosignal-specific medical images. Furthermore, SSL approaches offer a promising solution in scenarios where annotated datasets for training models such as deep neural networks are scarce. The effectiveness of SSL approaches in addressing the problem of data annotation is also emphasized by this study.

Keywords—self-supervised learning, multi-label classification, medical imaging, NIH Chest X-ray dataset.

I. INTRODUCTION

The integration of deep learning into medical imaging has revolutionized the way medical data is analyzed, offering potential for more accurate and timely diagnoses [1], [2]. However, one of the significant challenges in deploying deep learning models in this field is the scarcity of labeled data, as annotating medical datasets is both resource-intensive and time-consuming. To address this issue, self-supervised learning (SSL) has emerged as a promising approach, enabling models to learn from large amounts of unlabeled data. This study focuses on the application of SSL techniques to the multi-label classification of chest X-ray images from the NIH Chest X-ray dataset, which contains over 100,000 images with various pathology labels. We employed three SSL models—Dino_v_b16, Dino with ResNet-50, and MoCo (Momentum Contrast)—and compared their performance against a traditional supervised ResNet model. The supervised model serves as a benchmark to evaluate the effectiveness of SSL in scenarios where labeled data is available [3], [4], [5]. The primary objective of this research is to assess whether SSL models can achieve comparable or superior performance to supervised models in medical imaging tasks. By analyzing the performance of these models on multi-label classification tasks, this study aims to demonstrate the potential of SSL to reduce the reliance on labeled data while maintaining high accuracy in diagnosing multiple pathologies. The inclusion of a supervised learning model allows for a comprehensive comparison, highlighting the strengths and limitations of both approaches in medical imaging [6], [7], [8].

II. RELATED WORK

Deep learning, particularly supervised techniques such as CNNs, has made medical imaging tasks such as disease categorization much simpler to complete. However, these methods rely on large, labelled datasets, which are typically expensive and difficult to obtain in clinical contexts. An effective alternative is self-supervised learning (SSL), which enables models to be trained on unlabeled images using auxiliary tasks such as solving jigsaw puzzles or predicting image rotation. SSL algorithms such as DINO and MoCo have demonstrated remarkable performance on a variety of vision tasks, including medical imaging applications, especially when labeled data is scarce. Nevertheless, SSL's use in multi-label classification, i.e., identifying several pathologies in a single image, is still in its infancy, even though SSL has shown encouraging results in both anomaly detection and classification of medical images.

In this study, the performance of SSL models on the NIH Chest X-ray dataset is compared with that of a traditional supervised ResNet model for multi-label classification. Numerous studies have investigated machine learning techniques for medical image classification and localization, particularly for chest X-ray analysis.
Notable early work by Wang et al. [9] introduced the ChestX-ray8 dataset, a hospital-scale chest X-ray database with benchmarks for weakly-supervised classification and localization of thoracic diseases. Despite its contribution to large-scale medical imaging datasets, the reliance on weak supervision limits its accuracy in identifying multiple pathologies, especially when complex co-occurrences are present. Building on this, Ouyang et al. [10] explored self-supervised learning (SSL) for medical image analysis by leveraging temporal correlations in chest X-rays. Their approach showcases SSL's potential but is restricted to a narrow focus on temporal dependencies, which may not capture the full breadth of pathological variation present in chest X-ray images. Yao et al. [11] proposed a cross-modal SSL framework for learning representations between chest X-rays and radiology reports, addressing the challenge of multimodal representation. However, this method's dependency on radiology report data makes it less suitable for scenarios where such annotations are limited.

Our approach addresses this by focusing solely on visual data, making it applicable to broader, less annotated datasets. Azizi et al. [12] demonstrated that SSL models can improve performance in medical image classification, yet their study is limited to single-label classification, whereas our work investigates the more complex task of multi-label classification, which is essential for accurate diagnosis in real-world applications where multiple conditions often coexist. Irvin et al. [13] contributed CheXpert, a large dataset with uncertainty labels, highlighting the importance of addressing ambiguous cases in medical imaging. Although beneficial, CheXpert's reliance on annotated data makes it less adaptable to settings with limited labeled samples. In contrast, our work utilizes SSL to reduce reliance on extensive annotation by generating high-quality representations from unannotated data. Zhang and Sabuncu [14] introduced a generalized cross-entropy loss for training deep networks with noisy labels, addressing the issue of label noise. However, this approach still necessitates labeled data, and the noisy-label handling does not directly improve performance on multi-label classification. Our SSL approach mitigates these limitations by learning from unlabeled data without relying on noisy annotations. Kolesnikov et al. [15] revisited self-supervised representation learning but primarily focused on general-purpose visual datasets rather than medical images; their study does not address the specific challenges of biosignal data, such as the need to capture subtle pathological differences in medical imaging. By focusing on a medical dataset, our study enhances SSL applicability to domain-specific data, emphasizing pathology-level classification.
Ghassemi et al. [16] reviewed machine learning challenges in healthcare, emphasizing the importance of interpretability and generalizability in medical applications. While informative, this review lacks an empirical exploration of SSL in healthcare. Our study fills this gap by empirically validating SSL's effectiveness for chest X-ray analysis, demonstrating both improved performance and the potential for reduced data annotation needs. Chen et al. [17] introduced a simple framework for contrastive learning, which serves as a foundation for modern SSL models; while this framework performs well on general datasets, its application to complex, multi-label medical datasets remains underexplored.

III. METHODOLOGY

A. Dataset

The NIH Chest X-ray dataset contains over 100,000 chest X-ray images, most of which are labeled with one or more of 14 different diseases. This dataset is particularly well-suited for multi-label classification, where each image may be associated with multiple disease labels. We first preprocessed the images by resizing them to a uniform size and normalizing the pixel values to ensure consistency across the dataset. Additionally, we applied various data augmentation techniques, including random cropping and flipping, to artificially expand the dataset and enhance the model's ability to generalize. These augmentations introduce variations in the images, helping the model become more robust to different orientations, scales, and positions of the lungs within the X-ray. It is important to note that instead of using all 100,000 images, we randomly sampled a subset of the data for training and validation, ensuring that the selected images still represent the diversity of the entire dataset.

Fig. 1. This table presents medical data with columns for diagnosis.
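To make this concrete, the following is a minimal sketch of the preprocessing and sampling pipeline described above (resizing, normalization, random cropping and flipping, and random subsetting), written with PyTorch/torchvision. The 224x224 input size, ImageNet normalization statistics, subset size, file layout, and CSV column names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of the preprocessing, augmentation, and random-subsetting pipeline.
# File paths, CSV column names, and sizes are illustrative assumptions.
import random

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset, Subset
from torchvision import transforms

DISEASES = ["Cardiomegaly", "Emphysema", "Effusion", "Hernia", "Nodule",
            "Pneumothorax", "Atelectasis", "Pleural_Thickening", "Mass",
            "Edema", "Consolidation", "Infiltration", "Fibrosis", "Pneumonia"]

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),                    # uniform size
    transforms.RandomCrop(224),                       # random cropping
    transforms.RandomHorizontalFlip(),                # random flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # pixel normalization
                         std=[0.229, 0.224, 0.225]),
])

class ChestXrayDataset(Dataset):
    """Multi-label dataset: one binary target per disease."""
    def __init__(self, csv_path, image_dir, transform):
        self.df = pd.read_csv(csv_path)               # assumed NIH metadata CSV
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.image_dir}/{row['Image Index']}").convert("RGB")
        present = row["Finding Labels"].split("|")
        target = torch.tensor([1.0 if d in present else 0.0 for d in DISEASES])
        return self.transform(img), target

# Randomly sample a subset instead of using all ~100,000 images.
full = ChestXrayDataset("Data_Entry_2017.csv", "images", train_tf)
indices = random.sample(range(len(full)), k=min(10_000, len(full)))
train_subset = Subset(full, indices)
```

Validation images would use the same resizing and normalization but without the random crop and flip.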
B. Exploratory Data Analysis (EDA)

a) Dataset Analysis by Gender for Chest X-ray Dataset:
This section analyzes the chest X-ray dataset with a focus on gender-based differences. It examines how gender influences the dataset characteristics and provides insights into any disparities or patterns observed in the data.

Fig. 2. Bar chart comparing patient counts by gender: Male (~3000) vs. Female (~2500).

b) Frequency Count of Diseases:
This section presents the frequency count of the various diseases found in the dataset. It includes a breakdown of disease occurrences and highlights any significant trends or imbalances in disease prevalence.

Fig. 3. Bar chart comparing patient counts by various diseases.
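The counts summarized in Fig. 2 and Fig. 3 can be reproduced directly from the dataset's metadata file. The sketch below assumes the standard NIH metadata CSV (Data_Entry_2017.csv) with "Patient Gender" and "Finding Labels" columns; the file and column names are assumptions if a different export of the metadata is used.

```python
# EDA sketch: gender counts (Fig. 2) and per-disease frequencies (Fig. 3).
# Assumes the NIH metadata file Data_Entry_2017.csv.
from collections import Counter

import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")

# a) Patient counts by gender.
print(df["Patient Gender"].value_counts())

# b) Frequency count of diseases; "Finding Labels" is pipe-separated,
# e.g. "Effusion|Infiltration" for an image with two findings.
disease_counts = Counter()
for labels in df["Finding Labels"]:
    for label in labels.split("|"):
        if label != "No Finding":
            disease_counts[label] += 1

for disease, count in disease_counts.most_common():
    print(f"{disease}: {count}")
```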
C. Self-Supervised Models

a) Vision Transformer (ViT) B/16

The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks, introduced by Google. Unlike traditional convolutional neural networks (CNNs), ViT treats images as sequences of patches, much like how transformers in natural language processing (NLP) treat sentences as sequences of words. A breakdown of ViT B/16:

• Model Architecture:
Patch Embedding: The input image is divided into fixed-size patches, each of 16x16 pixels (hence the "16" in B/16). These patches are then flattened and embedded into a lower-dimensional space.
Transformer Encoder: The embedded patches are passed through a series of transformer layers, where self-attention mechanisms capture the relationships between different patches in the image.
Classification Head: After processing through the transformer layers, the output is passed to a classification head for making predictions.

• B/16 Specification:
"B" refers to the "base" model, a configuration with a moderate number of transformer layers (typically 12); "16" denotes the patch size, meaning each patch is 16x16 pixels.

ViT models are known for their ability to outperform traditional CNNs on large datasets by leveraging the self-attention mechanism to capture global dependencies within the image. A minimal sketch of adapting such a backbone to multi-label prediction follows below.
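As an illustration of how a DINO-pretrained ViT-B/16 backbone can be adapted to this multi-label setting, the sketch below loads the publicly released DINO weights via torch.hub and trains a linear head with a per-label sigmoid (binary cross-entropy) loss. Freezing the backbone, the learning rate, and using only the 768-dimensional CLS embedding as features are illustrative choices, not necessarily our exact training configuration.

```python
# Sketch: frozen DINO ViT-B/16 backbone with a linear multi-label head.
# Assumes 14 disease labels and 3-channel 224x224 inputs.
import torch
import torch.nn as nn

NUM_LABELS = 14

# Publicly released DINO ViT-B/16 weights (facebookresearch/dino).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
for p in backbone.parameters():
    p.requires_grad = False              # linear evaluation of the SSL embedding

head = nn.Linear(768, NUM_LABELS)        # ViT-B/16 embedding dimension is 768
criterion = nn.BCEWithLogitsLoss()       # one sigmoid/BCE term per pathology
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def training_step(images, targets):
    """images: (B, 3, 224, 224); targets: (B, NUM_LABELS) with entries in {0, 1}."""
    with torch.no_grad():
        features = backbone(images)      # (B, 768) CLS-token embedding
    logits = head(features)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each image may carry several pathologies, the sigmoid/BCE formulation treats the 14 labels as independent binary decisions rather than a single softmax over mutually exclusive classes.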
b) Dino with ResNet-50

• ResNet-50 Backbone:
The model uses a ResNet-50 architecture to extract lower-level features from the image. ResNet-50 consists of 50 layers and uses residual connections, which help in training deep networks by alleviating the vanishing gradient problem.

• Transformer Head:
Once the convolutional ResNet-50 backbone has extracted these features, they are passed to a Vision Transformer head, whose self-attention mechanism learns the global context.

This hybrid model is effective because it combines the strong local feature-extraction capabilities of CNNs with the Transformer's global attention mechanism, which is why it excels in image recognition tasks. One possible wiring of such a hybrid is sketched below.
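The exact hybrid configuration can vary; the following is one plausible sketch in which a DINO-pretrained ResNet-50 (again loaded via torch.hub) provides a spatial feature map whose positions are treated as tokens by a small transformer encoder before the multi-label head. The transformer depth, number of attention heads, pooling choice, and head size are illustrative assumptions.

```python
# Sketch of the CNN + transformer hybrid: DINO ResNet-50 feature-map positions
# serve as tokens for a small transformer encoder, followed by a multi-label head.
import torch
import torch.nn as nn

NUM_LABELS = 14

class HybridDinoResNet(nn.Module):
    def __init__(self, num_labels=NUM_LABELS, d_model=2048, depth=2):
        super().__init__()
        resnet = torch.hub.load("facebookresearch/dino:main", "dino_resnet50")
        # Drop global pooling and the fc layer to keep the 7x7 spatial map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        fmap = self.backbone(x)                   # (B, 2048, 7, 7)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, 49, 2048) token sequence
        tokens = self.encoder(tokens)             # self-attention over positions
        pooled = tokens.mean(dim=1)               # average-pool the tokens
        return self.head(pooled)                  # multi-label logits

model = HybridDinoResNet()
logits = model(torch.randn(2, 3, 224, 224))       # shape (2, 14)
```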
c) Momentum Contrast (MoCo)

MoCo is a self-supervised learning framework developed by Facebook AI for learning visual representations without the need for labeled data. The key idea behind MoCo is to learn a good representation of the data by contrasting positive and negative pairs of images. It works as follows (a sketch of the core update appears after the list):

• Contrastive Learning:
In MoCo, an image is transformed into two different views using data augmentation techniques. One view is treated as a query and the other as a key. The model's goal is to bring the representation of the query closer to the key (positive pair) and push it away from the representations of other images (negative pairs).

• Memory Bank:
MoCo maintains a large set of negative samples using a memory bank (queue), which stores representations of images from previous batches. This gives the model a rich set of negatives, which is crucial for effective contrastive learning.

• Momentum Update:
The key encoder in MoCo is updated with a momentum (exponential moving average) rule: its parameters are a slowly moving, time-averaged copy of the query encoder's parameters. This not only ensures stability but also enables efficient learning of representations.

MoCo has demonstrated high flexibility in learning visual representations that can later be used for various tasks such as image classification or object detection, sometimes matching supervised learning methods in quality.
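To make the momentum update and the queue-based contrastive objective concrete, the following is a compact sketch in the spirit of MoCo. The encoder choice, embedding dimension, queue size, momentum coefficient, and temperature below are illustrative defaults, not the configuration used in our experiments.

```python
# Sketch of MoCo's core mechanics: a momentum-updated key encoder,
# a queue of negative keys, and the InfoNCE contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

dim, K, m, tau = 128, 4096, 0.999, 0.07   # embed dim, queue size, momentum, temperature

query_enc = resnet50(num_classes=dim)     # ResNet-50 with a 128-d output head
key_enc = resnet50(num_classes=dim)
key_enc.load_state_dict(query_enc.state_dict())
for p in key_enc.parameters():
    p.requires_grad = False               # key encoder is updated only by momentum

queue = F.normalize(torch.randn(K, dim), dim=1)   # memory bank of negative keys

@torch.no_grad()
def momentum_update():
    for pq, pk in zip(query_enc.parameters(), key_enc.parameters()):
        pk.data = m * pk.data + (1.0 - m) * pq.data

def moco_loss(view_q, view_k):
    """view_q, view_k: two augmented views of the same batch of images."""
    global queue
    q = F.normalize(query_enc(view_q), dim=1)          # queries   (B, dim)
    with torch.no_grad():
        momentum_update()
        k = F.normalize(key_enc(view_k), dim=1)        # keys      (B, dim)
    l_pos = (q * k).sum(dim=1, keepdim=True)           # positive logits (B, 1)
    l_neg = q @ queue.t()                              # negative logits (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive sits at index 0
    queue = torch.cat([k, queue])[:K]                  # enqueue new keys, drop oldest
    return F.cross_entropy(logits, labels)
```

After pretraining, the query encoder is kept and its features are reused for the downstream multi-label classification task.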
Labels                 Acc (%)
0  Cardiomegaly        83.498350
1  Emphysema           83.498350
2  Effusion            66.006601
3  Hernia              87.458746
4  Nodule              60.396040
5  Pneumothorax        77.227723
6  Atelectasis         69.306931
7  Pleural_Thickening  70.627063
8  Mass                71.617162
9  Edema               84.818482
10 Consolidation       70.297030
11 Infiltration        66.336634
12 Fibrosis            74.917492
13 Pneumonia           67.986799

Fig. 5. Confusion matrices for Dino_v_b16.
Fig. 6. Precision-Recall Curve for Dino_v_b16.
Fig. 7. Precision-Recall Curve for Dino with ResNet50.
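The per-label accuracies reported above and the precision-recall curves in Fig. 6 and Fig. 7 can be computed from each model's per-label sigmoid outputs. The sketch below uses scikit-learn and assumes a 0.5 probability threshold for the accuracy computation; the threshold and array names are illustrative.

```python
# Sketch: per-label accuracy and precision-recall curves from
# multi-label probabilities (one sigmoid output per disease).
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

LABELS = ["Cardiomegaly", "Emphysema", "Effusion", "Hernia", "Nodule",
          "Pneumothorax", "Atelectasis", "Pleural_Thickening", "Mass",
          "Edema", "Consolidation", "Infiltration", "Fibrosis", "Pneumonia"]

def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """y_true, y_prob: numpy arrays of shape (num_samples, 14)."""
    y_pred = (y_prob >= threshold).astype(int)
    return {name: 100.0 * (y_pred[:, i] == y_true[:, i]).mean()
            for i, name in enumerate(LABELS)}

def plot_pr_curves(y_true, y_prob, title):
    """One precision-recall curve per pathology, as in Fig. 6 and Fig. 7."""
    for i, name in enumerate(LABELS):
        precision, recall, _ = precision_recall_curve(y_true[:, i], y_prob[:, i])
        plt.plot(recall, precision, label=name)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title(title)
    plt.legend(fontsize=6)
    plt.show()
```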