Yashica Garg
Department of Computer Science and Engineering
Indira Gandhi Delhi Technical University for Women, New Delhi
[email protected]
Abstract— The objective of the current work is to evaluate the performance of self-supervised learning (SSL) models on multi-label classification using the NIH Chest X-ray dataset, a large-scale medical imaging dataset. The performance of the SSL models Dino_v_b16, Dino with ResNet-50, and MoCo (Momentum Contrast) has been evaluated against a supervised ResNet model. These models are trained to detect multiple pathologies from chest X-ray images. We find that the self-supervised embedding models match or outperform traditional supervised learning, achieving similar or higher performance on a number of quantitative metrics when compared to the softmax-based supervised ResNet model. This also draws attention to the possibilities of using SSL for biosignal-specific medical images. Furthermore, SSL approaches offer a promising solution in scenarios where annotated datasets for training models such as deep neural networks are scarce. The effectiveness of SSL approaches in addressing the problem of data annotation is also emphasized by this study.

Keywords—self-supervised learning, multi-label classification, medical imaging, NIH Chest X-ray dataset.

I. INTRODUCTION

The integration of deep learning into medical imaging has revolutionized the way medical data is analyzed, offering potential for more accurate and timely diagnoses [1], [2]. However, one of the significant challenges in deploying deep learning models in this field is the scarcity of labeled data, as annotating medical datasets is both resource-intensive and time-consuming. To address this issue, self-supervised learning (SSL) has emerged as a promising approach, enabling models to learn from large amounts of unlabeled data. This study focuses on the application of SSL techniques to the multi-label classification of chest X-ray images from the NIH Chest X-ray dataset, which contains over 100,000 images with various pathology labels. We employed three SSL models—Dino_v_b16, Dino with ResNet-50, and MoCo (Momentum Contrast)—and compared their performance against a traditional supervised ResNet model. The supervised model serves as a benchmark to evaluate the effectiveness of SSL in scenarios where labeled data is available [3], [4], [5]. The primary objective of this research is to assess whether SSL models can achieve comparable or superior performance to supervised models in medical imaging tasks. By analyzing the performance of these models on multi-label classification tasks, this study aims to demonstrate the potential of SSL to reduce the reliance on labeled data while maintaining high accuracy in diagnosing multiple pathologies. The inclusion of a supervised learning model allows for a comprehensive comparison, highlighting the strengths and limitations of both approaches in medical imaging [6], [7], [8].

II. RELATED WORK

Deep learning, particularly supervised techniques such as CNNs, has made medical imaging tasks such as disease categorization much simpler to complete. However, these methods rely on large, labelled datasets, which are typically expensive and difficult to obtain in clinical contexts. An effective alternative is self-supervised learning (SSL), which enables models to be trained on unlabeled images using auxiliary tasks such as solving jigsaw puzzles or predicting image rotation. SSL algorithms such as DINO and MoCo have demonstrated remarkable performance on a variety of vision tasks, including medical imaging applications, especially when labeled data is scarce. Nevertheless, SSL's use in multi-label classification, i.e., identifying several pathologies in a single image, is still in its infancy, even though SSL has shown encouraging results in both anomaly detection and classification of medical images.

In this study, the performance of SSL models on the NIH Chest X-ray dataset is compared with that of a traditional supervised ResNet model for multi-label classification. Numerous studies have investigated machine learning techniques for medical image classification and localization, particularly for chest X-ray analysis.
Notable early work by Wang et al. [9] introduced the ChestX-ray8 dataset, a hospital-scale chest X-ray database with benchmarks for weakly-supervised classification and localization of thoracic diseases. Despite its contribution to large-scale medical imaging datasets, the reliance on weak supervision limits its accuracy in identifying multiple pathologies, especially when complex co-occurrences are present. Building on this, Ouyang et al. [10] explored self-supervised learning (SSL) for medical image analysis by leveraging temporal correlations in chest X-rays. Their approach showcases SSL's potential but is restricted to a narrow focus on temporal dependencies, which may not capture the full breadth of pathological variation present in chest X-ray images. Yao et al. [11] proposed a cross-modal SSL framework for learning representations between chest X-rays and radiology reports, addressing the challenge of multimodal representation. However, this method's dependency on radiology report data makes it less suitable for scenarios where such annotations are limited.

Our approach addresses this by focusing solely on visual data, making it applicable to broader, less annotated datasets. Azizi et al. [12] demonstrated that SSL models can improve performance in medical image classification, yet their study is limited to single-label classification, whereas our work investigates the more complex task of multi-label classification, which is essential for accurate diagnosis in real-world applications where multiple conditions often coexist. Irvin et al. [13] contributed CheXpert, a large dataset with uncertainty labels, highlighting the importance of addressing ambiguous cases in medical imaging. Although beneficial, CheXpert's reliance on annotated data makes it less adaptable to settings with limited labeled samples. In contrast, our work utilizes SSL to reduce reliance on extensive annotation by generating high-quality representations from unannotated data. Zhang and Sabuncu [14] introduced a generalized cross-entropy loss for training deep networks with noisy labels, addressing the issue of label noise. However, this approach still necessitates labeled data, and the noisy-label handling does not directly improve performance on multi-label classification. Our SSL approach mitigates these limitations by learning from unlabeled data without relying on noisy annotations. Kolesnikov et al. [15] revisited self-supervised representation learning but primarily focused on general-purpose visual datasets rather than medical images; their study does not address the specific challenges of biosignal data, such as the need to capture subtle pathological differences in medical imaging. By focusing on a medical dataset, our study enhances SSL applicability to domain-specific data, emphasizing pathology-level classification.
Ghassemi et al. [16] reviewed machine learning challenges in healthcare, emphasizing the importance of interpretability and generalizability in medical applications. While informative, this review lacks an empirical exploration of SSL in healthcare. Our study fills this gap by empirically validating SSL's effectiveness for chest X-ray analysis, demonstrating both improved performance and the potential for reduced data annotation needs. Chen et al. [17] introduced a simple framework for contrastive learning, which serves as a foundation for modern SSL models; while this framework performs well on general datasets, its application to complex, multi-label medical datasets remains underexplored.

III. METHODOLOGY

A. Dataset

The NIH Chest X-ray dataset contains over 100,000 chest X-ray images, most of which are labeled with one or more of 14 different diseases. This dataset is particularly well-suited for multi-label classification, where each image may be associated with multiple disease labels. We first preprocessed the images by resizing them to a uniform size and normalizing the pixel values to ensure consistency across the dataset. Additionally, we applied various data augmentation techniques, including random cropping and flipping, to artificially expand the dataset and enhance the model's ability to generalize. These augmentations introduce variations in the images, helping the model become more robust to different orientations, scales, and positions of the lungs within the X-ray. It is important to note that instead of using all 100,000 images, we randomly sampled a subset of the data for training and validation, ensuring that the selected images still represent the diversity of the entire dataset.

Fig. 1. This table presents medical data with columns for diagnosis.
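To make this concrete, the following is a minimal sketch of the preprocessing and sampling pipeline described above (resizing, normalization, random cropping and flipping, and random subsetting), written with PyTorch/torchvision. The 224x224 input size, ImageNet normalization statistics, subset size, file layout, and CSV column names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of the preprocessing, augmentation, and random-subsetting pipeline.
# File paths, CSV column names, and sizes are illustrative assumptions.
import random

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset, Subset
from torchvision import transforms

DISEASES = ["Cardiomegaly", "Emphysema", "Effusion", "Hernia", "Nodule",
            "Pneumothorax", "Atelectasis", "Pleural_Thickening", "Mass",
            "Edema", "Consolidation", "Infiltration", "Fibrosis", "Pneumonia"]

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),                    # uniform size
    transforms.RandomCrop(224),                       # random cropping
    transforms.RandomHorizontalFlip(),                # random flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # pixel normalization
                         std=[0.229, 0.224, 0.225]),
])

class ChestXrayDataset(Dataset):
    """Multi-label dataset: one binary target per disease."""
    def __init__(self, csv_path, image_dir, transform):
        self.df = pd.read_csv(csv_path)               # assumed NIH metadata CSV
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.image_dir}/{row['Image Index']}").convert("RGB")
        present = row["Finding Labels"].split("|")
        target = torch.tensor([1.0 if d in present else 0.0 for d in DISEASES])
        return self.transform(img), target

# Randomly sample a subset instead of using all ~100,000 images.
full = ChestXrayDataset("Data_Entry_2017.csv", "images", train_tf)
indices = random.sample(range(len(full)), k=min(10_000, len(full)))
train_subset = Subset(full, indices)
```

Validation images would use the same resizing and normalization but without the random crop and flip.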
B. Exploratory Data Analysis (EDA)

a) Dataset Analysis by Gender for Chest X-ray Dataset:
This section analyzes the chest X-ray dataset with a focus on gender-based differences. It examines how gender influences the dataset characteristics and provides insights into any disparities or patterns observed in the data.

Fig. 2. Bar chart comparing patient counts by gender: Male (~3000) vs. Female (~2500).

b) Frequency Count of Diseases:
This section presents the frequency count of the various diseases found in the dataset. It includes a breakdown of disease occurrences and highlights any significant trends or imbalances in disease prevalence.

Fig. 3. Bar chart comparing patient counts by various diseases.
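The counts summarized in Fig. 2 and Fig. 3 can be reproduced directly from the dataset's metadata file. The sketch below assumes the standard NIH metadata CSV (Data_Entry_2017.csv) with "Patient Gender" and "Finding Labels" columns; the file and column names are assumptions if a different export of the metadata is used.

```python
# EDA sketch: gender counts (Fig. 2) and per-disease frequencies (Fig. 3).
# Assumes the NIH metadata file Data_Entry_2017.csv.
from collections import Counter

import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")

# a) Patient counts by gender.
print(df["Patient Gender"].value_counts())

# b) Frequency count of diseases; "Finding Labels" is pipe-separated,
# e.g. "Effusion|Infiltration" for an image with two findings.
disease_counts = Counter()
for labels in df["Finding Labels"]:
    for label in labels.split("|"):
        if label != "No Finding":
            disease_counts[label] += 1

for disease, count in disease_counts.most_common():
    print(f"{disease}: {count}")
```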
C. Self-Supervised Models

a) Vision Transformer (ViT) B/16

The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks, introduced by Google. Unlike traditional convolutional neural networks (CNNs), ViT treats images as sequences of patches, much like how transformers in natural language processing (NLP) treat sentences as sequences of words. A breakdown of ViT B/16:

• Model Architecture:
Patch Embedding: The input image is divided into fixed-size patches, each of 16x16 pixels (hence the "16" in B/16). These patches are then flattened and embedded into a lower-dimensional space.
Transformer Encoder: The embedded patches are passed through a series of transformer layers, where self-attention mechanisms capture the relationships between different patches in the image.
Classification Head: After processing through the transformer layers, the output is passed to a classification head for making predictions.

• B/16 Specification:
"B" refers to the "base" model, a configuration with a moderate number of transformer layers (typically 12); "16" denotes the patch size, meaning each patch is 16x16 pixels.

ViT models are known for their ability to outperform traditional CNNs on large datasets by leveraging the self-attention mechanism to capture global dependencies within the image. A minimal sketch of adapting such a backbone to multi-label prediction follows below.
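As an illustration of how a DINO-pretrained ViT-B/16 backbone can be adapted to this multi-label setting, the sketch below loads the publicly released DINO weights via torch.hub and trains a linear head with a per-label sigmoid (binary cross-entropy) loss. Freezing the backbone, the learning rate, and using only the 768-dimensional CLS embedding as features are illustrative choices, not necessarily our exact training configuration.

```python
# Sketch: frozen DINO ViT-B/16 backbone with a linear multi-label head.
# Assumes 14 disease labels and 3-channel 224x224 inputs.
import torch
import torch.nn as nn

NUM_LABELS = 14

# Publicly released DINO ViT-B/16 weights (facebookresearch/dino).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
for p in backbone.parameters():
    p.requires_grad = False              # linear evaluation of the SSL embedding

head = nn.Linear(768, NUM_LABELS)        # ViT-B/16 embedding dimension is 768
criterion = nn.BCEWithLogitsLoss()       # one sigmoid/BCE term per pathology
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def training_step(images, targets):
    """images: (B, 3, 224, 224); targets: (B, NUM_LABELS) with entries in {0, 1}."""
    with torch.no_grad():
        features = backbone(images)      # (B, 768) CLS-token embedding
    logits = head(features)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each image may carry several pathologies, the sigmoid/BCE formulation treats the 14 labels as independent binary decisions rather than a single softmax over mutually exclusive classes.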
b) Dino with ResNet-50

• ResNet-50 Backbone:
The model uses a ResNet-50 architecture to extract lower-level features from the image. ResNet-50 consists of 50 layers and uses residual connections, which help in training deep networks by alleviating the vanishing gradient problem.

• Transformer Head:
Once the convolutional ResNet-50 backbone has extracted these features, they are passed to a Vision Transformer head, whose self-attention mechanism learns the global context.

This hybrid model is effective because it combines the strong local feature-extraction capabilities of CNNs with the Transformer's global attention mechanism, which is why it excels in image recognition tasks. One possible wiring of such a hybrid is sketched below.
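The exact hybrid configuration can vary; the following is one plausible sketch in which a DINO-pretrained ResNet-50 (again loaded via torch.hub) provides a spatial feature map whose positions are treated as tokens by a small transformer encoder before the multi-label head. The transformer depth, number of attention heads, pooling choice, and head size are illustrative assumptions.

```python
# Sketch of the CNN + transformer hybrid: DINO ResNet-50 feature-map positions
# serve as tokens for a small transformer encoder, followed by a multi-label head.
import torch
import torch.nn as nn

NUM_LABELS = 14

class HybridDinoResNet(nn.Module):
    def __init__(self, num_labels=NUM_LABELS, d_model=2048, depth=2):
        super().__init__()
        resnet = torch.hub.load("facebookresearch/dino:main", "dino_resnet50")
        # Drop global pooling and the fc layer to keep the 7x7 spatial map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        fmap = self.backbone(x)                   # (B, 2048, 7, 7)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, 49, 2048) token sequence
        tokens = self.encoder(tokens)             # self-attention over positions
        pooled = tokens.mean(dim=1)               # average-pool the tokens
        return self.head(pooled)                  # multi-label logits

model = HybridDinoResNet()
logits = model(torch.randn(2, 3, 224, 224))       # shape (2, 14)
```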
c) Momentum Contrast (MoCo)

MoCo is a self-supervised learning framework developed by Facebook AI for learning visual representations without the need for labeled data. The key idea behind MoCo is to learn a good representation of the data by contrasting positive and negative pairs of images. It works as follows (a sketch of the core update appears after the list):

• Contrastive Learning:
In MoCo, an image is transformed into two different views using data augmentation techniques. One view is treated as a query and the other as a key. The model's goal is to bring the representation of the query closer to the key (positive pair) and push it away from the representations of other images (negative pairs).

• Memory Bank:
MoCo maintains a large set of negative samples using a memory bank (queue), which stores representations of images from previous batches. This gives the model a rich set of negatives, which is crucial for effective contrastive learning.

• Momentum Update:
The key encoder in MoCo is updated with a momentum (exponential moving average) rule: its parameters are a slowly moving, time-averaged copy of the query encoder's parameters. This not only ensures stability but also enables efficient learning of representations.

MoCo has demonstrated high flexibility in learning visual representations that can later be used for various tasks such as image classification or object detection, sometimes matching supervised learning methods in quality.
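To make the momentum update and the queue-based contrastive objective concrete, the following is a compact sketch in the spirit of MoCo. The encoder choice, embedding dimension, queue size, momentum coefficient, and temperature below are illustrative defaults, not the configuration used in our experiments.

```python
# Sketch of MoCo's core mechanics: a momentum-updated key encoder,
# a queue of negative keys, and the InfoNCE contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

dim, K, m, tau = 128, 4096, 0.999, 0.07   # embed dim, queue size, momentum, temperature

query_enc = resnet50(num_classes=dim)     # ResNet-50 with a 128-d output head
key_enc = resnet50(num_classes=dim)
key_enc.load_state_dict(query_enc.state_dict())
for p in key_enc.parameters():
    p.requires_grad = False               # key encoder is updated only by momentum

queue = F.normalize(torch.randn(K, dim), dim=1)   # memory bank of negative keys

@torch.no_grad()
def momentum_update():
    for pq, pk in zip(query_enc.parameters(), key_enc.parameters()):
        pk.data = m * pk.data + (1.0 - m) * pq.data

def moco_loss(view_q, view_k):
    """view_q, view_k: two augmented views of the same batch of images."""
    global queue
    q = F.normalize(query_enc(view_q), dim=1)          # queries   (B, dim)
    with torch.no_grad():
        momentum_update()
        k = F.normalize(key_enc(view_k), dim=1)        # keys      (B, dim)
    l_pos = (q * k).sum(dim=1, keepdim=True)           # positive logits (B, 1)
    l_neg = q @ queue.t()                              # negative logits (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive sits at index 0
    queue = torch.cat([k, queue])[:K]                  # enqueue new keys, drop oldest
    return F.cross_entropy(logits, labels)
```

After pretraining, the query encoder is kept and its features are reused for the downstream multi-label classification task.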
Labels                 Acc (%)
0  Cardiomegaly        83.498350
1  Emphysema           83.498350
2  Effusion            66.006601
3  Hernia              87.458746
4  Nodule              60.396040
5  Pneumothorax        77.227723
6  Atelectasis         69.306931
7  Pleural_Thickening  70.627063
8  Mass                71.617162
9  Edema               84.818482
10 Consolidation       70.297030
11 Infiltration        66.336634
12 Fibrosis            74.917492
13 Pneumonia           67.986799

Fig. 5. Confusion matrices for Dino_v_b16.
Fig. 6. Precision-Recall Curve for Dino_v_b16.
Fig. 7. Precision-Recall Curve for Dino with ResNet50.
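The per-label accuracies reported above and the precision-recall curves in Fig. 6 and Fig. 7 can be computed from each model's per-label sigmoid outputs. The sketch below uses scikit-learn and assumes a 0.5 probability threshold for the accuracy computation; the threshold and array names are illustrative.

```python
# Sketch: per-label accuracy and precision-recall curves from
# multi-label probabilities (one sigmoid output per disease).
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

LABELS = ["Cardiomegaly", "Emphysema", "Effusion", "Hernia", "Nodule",
          "Pneumothorax", "Atelectasis", "Pleural_Thickening", "Mass",
          "Edema", "Consolidation", "Infiltration", "Fibrosis", "Pneumonia"]

def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """y_true, y_prob: numpy arrays of shape (num_samples, 14)."""
    y_pred = (y_prob >= threshold).astype(int)
    return {name: 100.0 * (y_pred[:, i] == y_true[:, i]).mean()
            for i, name in enumerate(LABELS)}

def plot_pr_curves(y_true, y_prob, title):
    """One precision-recall curve per pathology, as in Fig. 6 and Fig. 7."""
    for i, name in enumerate(LABELS):
        precision, recall, _ = precision_recall_curve(y_true[:, i], y_prob[:, i])
        plt.plot(recall, precision, label=name)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title(title)
    plt.legend(fontsize=6)
    plt.show()
```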