
Article

COVID-19 Hierarchical Classification Using a Deep Learning Multi-Modal

Albatoul Althinyan 1,2, Shada A. AlSalamah 1,3,4, Sherin Aly 5, Thamer Nouh 6, Bassam Mahboub 7, Laila Salameh 8, Metab Alkubeyyer 9 and Abdulrahman Mirza 1

1 Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
2 Information Systems Department, College of Computer and Information Sciences, Imam Mohammed Bin Saud Islamic University, Riyadh, Saudi Arabia.
3 National Health Information Center, Saudi Health Council, Riyadh, Saudi Arabia.
4 Digital Health and Innovation Department, Science Division, World Health Organization, Geneva, Switzerland.
5 Institute of Graduate Studies and Research, Alexandria University, Egypt.
6 Trauma and Acute Care Surgery Unit, College of Medicine, King Saud University, Riyadh, Saudi Arabia.
7 Clinical Sciences Department, College of Medicine, University of Sharjah, Sharjah, United Arab Emirates.
8 Sharjah Institute for Medical Research, University of Sharjah, Sharjah, United Arab Emirates.
9 Department of Radiology and Medical Imaging, King Khalid University Hospital, King Saud University, Riyadh, Saudi Arabia.
* Correspondence: [email protected]; Tel.: 00966506061614

Abstract: The coronavirus disease 2019 (COVID-19), originating in China, has rapidly spread worldwide. Early detection is crucial for effective treatment, and physicians must examine infected patients and make timely decisions to isolate them. However, completing these processes is difficult given limited time, a shortage of expert radiologists, and the limitations of the reverse-transcription polymerase chain reaction (RT-PCR) method. Deep learning, an advanced machine learning technique, is highly effective in image classification and in diagnosing diseases from radiological imaging modalities. Previous COVID-19 classification research has limitations such as binary classification, flat structures, single feature modalities, small public datasets, and reliance on CT diagnostic processes. This study aims to overcome these limitations by identifying pneumonia caused by COVID-19, distinguishing it from other types of pneumonia and healthy lungs using chest X-ray (CXR) images and related tabular medical data, and showing the effectiveness of incorporating tabular medical data. Additionally, given that pneumonia naturally falls into a hierarchical structure, we leverage this structure within our approach to achieve improved classification outcomes. Pre-trained CNN models were employed to extract features, which were then combined using early fusion for the classification of eight distinct classes. Because datasets in this field are typically imbalanced, a variety of versions of Generative Adversarial Networks (GANs) were used to generate synthetic data. The proposed approach, tested on our private datasets of 4,523 patients, achieved a macro-avg F1-score of 95.9% and an F1-score of 87.5% for COVID-19 identification using a Resnet-based structure. In conclusion, this study created an accurate multi-modal deep learning model to diagnose COVID-19 and differentiate it from other kinds of pneumonia and normal lungs, which will enhance the radiological diagnostic process.

Keywords: Artificial intelligence; COVID-19; CXR; Hierarchical; Deep learning; Multi-modal; Diagnosis; Image classification; Multi-classes; Pneumonia

1. Introduction

The entire world's healthcare system is seriously endangered by a type of viral pneumonia known as COVID-19. Since the virus that causes COVID-19 has a high pathogenicity similar to SARS-CoV, it is known as severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) [1]. Respiratory insufficiency can arise from the virus's impact on the patient's respiratory system, specifically the lungs. As of 11 February 2024, the World Health Organization (WHO) has received reports of 774,631,444 confirmed cases of COVID-19, as well as 7,031,216 deaths from respiratory failure and injury to other major organs [2]. China reported its first instances of the virus to the WHO in December 2019. Hospitalization rates have increased globally due to the disease's rapid spread [3]. However, while the peak of the initial crisis may be over, the world is still dealing with the lasting effects of the COVID-19 pandemic. Continuous research in this field is important to improving the diagnosis of future respiratory disease pandemics.
Reverse-transcription polymerase chain reaction (RT-PCR) testing is the standard method for diagnosing COVID-19 [4]. The method's limitations, which include low sensitivity, an insufficient supply of examination kits, and long waiting times for results, can affect patients and increase COVID-19 transmission [5]. In emergency situations, radiologists can assist physicians in diagnosing lung diseases by using imaging modalities such as chest computed tomography (CT) scans or CXR. RT-PCR limitations can be overcome with accurate analysis of radiological imaging. These procedures are challenging because of time constraints and the large number of patients. Furthermore, erroneous diagnoses can result from radiologists with different levels of expertise [6].

Clearly, a radiographic examination of the chest is a significant diagnostic tool [7]. Although CT scans can identify and detect pneumonia accurately, CXR is more appropriate for this situation for several reasons: CXR is inexpensive, simple, portable, widely available, and the most widely used tool for examining the lungs [8]. In addition, the American College of Radiology (ACR) recommends CXR imaging as a certified modality to reduce the possibility of disease transmission [9].

It is important to know that all kinds of pneumonia that infect the lungs are classified in a hierarchical structure by the International Statistical Classification of Diseases (ICD). COVID-19, scientifically called SARSr-CoV-2, was added to the hierarchical viral family in the ICD 10th revision (ICD-10) [10]. Figure 1 presents the types of pneumonia for which datasets were obtained, organized in a hierarchical structure. There are eight classes in total (including the normal class), six of which are leaf nodes.

[Figure 1: a tree with root node Pneumonia branching into Viral and Bacterial; the Viral node branches into SARSr-CoV-2, Influenza, respiratory syncytial virus (RSV), and Adenoviruses.]
Figure 1. Proposed hierarchical class structure of pneumonia.

The hierarchical structure of COVID-19, as previously illustrated, indicates that this is a hierarchical classification problem. The classes in high-level nodes at hierarchical levels are known as coarse-grained nodes because they have unique features that will be transmitted to their child nodes along with all the features from their parent node. The final level of nodes in the structure, the leaf nodes, are referred to as fine-grained, since they lack descendants and inherit all of their parents' features.

Applying high-accuracy Artificial Intelligence (AI) models to diagnose medical imaging problems is a current trend in healthcare. Convolutional Neural Networks (CNNs) are able to detect and learn significant details that radiologists find difficult to recognize with the naked eye [11], and they produce promising results for learning complex problems in radiology [12]. Many of the previously reviewed studies [13] have employed deep learning models to diagnose and detect COVID-19 pneumonia utilizing medical imaging in a theoretical manner that cannot be implemented clinically.
In several fields, especially computer-assisted medical diagnosis, deep learning approaches have achieved significant advances in multi-modal structures by learning features from different sources of data [14]. This clearly explains the effectiveness of adding various medical data in addition to CXR images to the diagnostic process.

One variant of the conventional flat classification problem is called hierarchical classification (HC). In a flat classification approach, cases are categorized into classes without following any predefined structure.

Proving that hierarchical classification is more effective than flat classification in this domain is not the purpose of this work, because this has already been demonstrated in the literature (Pereira et al., 2020). In this work, we investigate how clinical data affect COVID-19 classification utilizing CXR images within a hierarchical classification framework, in order to detect different types of pneumonia caused by multiple pathogens and differentiate them from normal lungs. To achieve that, we collected a private, imbalanced dataset in which some types of pneumonia are much more common than others; we therefore applied variants of the GAN model to balance the class distribution. We applied multi-modal hierarchical classification utilizing a pure deep learning (end-to-end CNN) approach for predefined models in a hierarchy structure, using a hybrid approach on the CXR images first and then adding the medical tabular data using early fusion.

The paper is organized as follows: Section 2 covers the literature on related works. The details and characteristics of the dataset used in this paper and its analysis, as well as the techniques used to preprocess both the CXR images and the tabular dataset, are discussed in Section 3. Section 4 then details our proposed methodology and experimental setup. The obtained results and a discussion are summarized in Section 5. Finally, the conclusion of the current work and some possibilities for future work are described in the last section.

2. Related Work

Due to the COVID-19 epidemic, COVID-19 medical image classification has recently attracted a lot of scientific interest. Researchers from a variety of disciplines have developed deep learning detection and classification models to diagnose COVID-19 reliably and quickly by analyzing radiological images. We have published a review paper, "Detection and Classification of COVID-19 by Radiological Imaging Modalities Using Deep Learning Techniques: A Literature Review" [13], which explores the remarkable related works in the literature and studies and analyzes them to explain how most current key approaches to the COVID-19 classification challenge have gaps and untapped potential. In addition, that paper provides some recommendations addressing various aspects that may help researchers in this field.

3. Proposed Methods and Materials

The proposed approach applied in this study consists of five phases, as demonstrated in Figure 2. These phases are as follows: collecting the required dataset (CXR images and patient medical data in tabular format), preparing and preprocessing the collected dataset, generating a synthetic dataset to balance the data, feeding the preprocessed dataset into a hierarchical multi-modal network, and, at the end, evaluating the classification output. The details of each phase are described in the following sections.

Figure 2. Proposed framework for multi-modal classification of CXR images.

3.1. Dataset Description

Our study used two anonymized private datasets. (I) The first dataset was collected from the database at King Khalid University Hospital, Riyadh, KSA, under Institutional Review Board (IRB) approval (E-21-5939), in collaboration with Dr. Thamer Nouh and Dr. Metab Alkubeyyer. It contains 3326 patients with Normal, Bacterial, SARSr-CoV-2, Influenza, Respiratory syncytial virus (RSV), and Adenovirus cases, as shown in Table 1. (II) The second dataset was obtained from Rashid Hospital, Dubai, UAE, reviewed and approved by the Dubai Scientific Research Ethical Committee (DSREC), Dubai Health Authority (DHA), under IRB approval (DSREC-12/2021_01), in collaboration with Dr. Bassam Mahboub and Dr. Laila Salameh; it contains 1218 SARSr-CoV-2 patients, as shown in Table 2. Data were obtained from different sources in an attempt to improve the model's generalization capabilities.

To obtain the dataset from the hospital databases, the physician fetched the required viral patients by searching for the desired test name and range of years. Because there is no test for bacterial infection, patients whose diagnosis contains the words "bacterial pneumonia" were selected as cases for the bacterial class. It is important to note that the data in the normal category were collected from patients scheduled for surgery, for whom a CXR is taken to ensure the lungs are healthy. The CXR images from the first dataset were produced by an Optima XR240amx (GE Healthcare). From datasets (I) and (II), all CXR images in posterior-anterior (PA) and anterior-posterior (AP) views were included. CXR images in lateral views, images where the date gap between the diagnosis and the CXR exceeded 48 hours, and/or those without tabular data were excluded. Some samples of the dataset for different classes are shown in Figure 3.

Table 1. Patients' characteristics for dataset (I).

Category                  Value
Number of patients        1217
Gender: Male              1070
Gender: Female            147
Diagnosis: SARSr-COV-2    1217
Age range                 19-87

[Figure 3: panels (A) Normal, (B) Bacterial, (C) Adenovirus, (D) SARSr-CoV-2, (E) Influenza, (F) RSV.]

Figure 3. CXR samples of the dataset.

Table 2. Patients' characteristics for dataset (II).

Category                  Value
Number of patients        1217
Gender: Male              1070
Gender: Female            147
Diagnosis: SARSr-COV-2    1217
Age range                 19-87
The dataset consists of CXR images with the corresponding medical tabular data for each patient. It includes demographic, vital signs, clinical, and medication data; the attributes in the data record are described in Table 3. The tabular dataset includes 644 features with categorical and numerical data types. Some of the features were removed since they were not significant in diagnosing pneumonia. In addition, a total of 60 different nationalities are represented in the patient sample for the 4523 patients who make up the entire dataset.

The MEWS (Modified Early Warning Score) feature is a clinical tool used in healthcare settings to assess the patient's vital signs. More hospitals are currently using it to help track changes between each set of vitals [15]. The MEWS score typically consists of several physiological parameters, including blood pressure, body temperature, pulse rate, respiratory rate, and the AVPU (A = Awake, V = Verbal, P = Pain, U = Unresponsive) score that is used to determine a patient's level of consciousness. A score is given to each parameter based on specified standards, and the overall MEWS score is then determined by summing the scores for each parameter [16]. According to their MEWS score, patients may be classified into risk categories: Normal: 0-1; Low Risk: 2-3; Moderate Risk: 4-6; High Risk: 7-8; Critical: >8. The concern for clinical deterioration increases as the MEWS score rises.

With the assistance of a knowledgeable pharmacist, medication prescriptions were also limited to the most fundamental categories without doses, which helped decrease the enormous number of medications from 3500 to 614.

Table 3. Attributes description of the tabular data.

Feature type  Feature Name     Description                                                          Data Type
Demographic   Age              Patient's age at the time of diagnosis.                              Numerical
              Gender           Patient's gender.                                                    Categorical
              Nationality      Nation of origin of the patient.                                     Categorical
Vital signs   BMI              Patient's body mass index.                                           Numerical
              MEWS Score       A calculation done on a patient after checking their vital signs     Categorical
                               and AVPU score.
Lab test      WBC              Measures the number of white blood cells.                            Numerical
              tHb              Measures the total hemoglobin.                                       Numerical
              Plt              Measures the blood platelets.                                        Numerical
              Lymph Auto #     Measures the percentage of lymphocytes in the blood.                 Numerical
              Sodium Lvl       Measures the sodium level in the blood.                              Numerical
              Potassium Lvl    Measures the potassium level in the blood.                           Numerical
              BUN              Measures the blood urea nitrogen.                                    Numerical
              Creatinine       Measures the level of creatinine in the blood.                       Numerical
              Alk Phos         Measures the level of alkaline phosphatase in the blood.             Numerical
              AST              Measures the level of the aspartate aminotransferase enzyme          Numerical
                               in the blood.
              Albumin Lvl      Checks the amount of albumin in the blood.                           Numerical
              ALT              Measures the amount of alanine transaminase in the blood.            Numerical
              Bili Total       Measures the levels of bilirubin in the blood.                       Numerical
              INR              International Normalized Ratio; measures how long it takes for       Numerical
                               blood to clot.
              LDH              Measures the level of lactate dehydrogenase in the blood.            Numerical
              Procalcitonin    Measures the level of procalcitonin in the blood.                    Numerical
              CRP              Checks the C-reactive protein level in the blood.                    Numerical
              Ferritin Lvl     Measures the amount of ferritin in the blood.                        Numerical
              Hgb A1c          Measures the percentage of hemoglobin proteins in the blood that     Numerical
                               are coated with sugar.
              D-Dimer          Measures D-dimer, a protein fragment produced when a blood clot      Numerical
                               dissolves in the body.
              BNP              B-type natriuretic peptide test; measures the levels of a certain    Numerical
                               type of hormone in the blood.
              Total CK         Measures the amount of creatine kinase in the blood.                 Numerical
              Vitamin D 25 OH  Measures the level of active vitamin D in the blood.                 Numerical
Medication    614 Medications  All medications prescribed to the patient when visiting the          Categorical
                               hospital.

3.2. Data Analysis

Statistical analysis was conducted with Python libraries for statistical testing. The t-test, Kruskal-Wallis test, and chi-squared test from the "scipy.stats" module were applied for continuous, categorical, and binary categorical values, respectively. The reported significance levels were two-sided, and the statistical significance level was set to 0.05. Categorical features are expressed as frequency (%), and continuous features as mean (µ) and standard deviation (σ). Table 4 presents a comparison of the features between the two patient groups in the raw dataset, the first with COVID-19 and the second with all remaining classes.

As can be seen, there is a significant difference in age, gender, and BMI between the two groups (all p<0.001). The MEWS score shows a significant difference (p<0.001), and an abnormal MEWS score was observed more often in COVID-19 patients. Significant differences are not found in some lab tests between the two groups, including tHb (p=0.729), Potassium Lvl (p=0.606), Alk Phos (p=0.187), Bili Total (p=0.477), INR (p=0.680), LDH (p=0.553), Ferritin Lvl (p=0.419), BNP (p=0.569), and Vitamin D 25 OH (p=0.818). In contrast, WBC, Plt, Lymph Auto #, Sodium Lvl, BUN, Creatinine, Albumin Lvl, ALT, and Total CK are significantly different between the two groups (p<0.001). In addition, AST, Procalcitonin, CRP, Hgb A1c, and D-Dimer also show significant differences (p=0.021, p=0.036, p=0.009, p=0.001, and p=0.030, respectively). Although medications were observed in COVID-19 patients, their frequencies are not statistically different from those in the non-COVID-19 group.

Table 4. Patients' medical information characteristics (*: data with statistical significance; a: chi-square test; b: Student's t test; c: Kruskal-Wallis H test).

Characteristics               COVID-19 (n = 1847)   Non-COVID-19 (n = 2677)   Overall (n = 4524)    P value
Age (years)                   48.9 ± 16.3           37.4 ± 20.95              42.65 ± 19.81         <0.001*b
Gender
  Male                        1402, 75.9%           1252, 46.8%               2655, 58.7%           <0.001*a
  Female                      445, 24.1%            1424, 53.2%               1869, 41.3%
Vital signs
  BMI                         28.53 ± 10.86         39.03 ± 20.3              33.89 ± 17.19         <0.001*b
  MEWS score (Normal)         581, 31.5%            2057, 76.8%               2638, 58.3%           <0.001*c
  MEWS score (Low risk)       603, 32.6%            365, 13.6%                968, 21.4%
  MEWS score (Moderate risk)  491, 26.6%            185, 6.9%                 676, 14.9%
  MEWS score (High risk)      106, 5.7%             59, 2.2%                  165, 3.6%
  MEWS score (Critical)       27, 1.5%              11, 0.4%                  38, 0.8%
Lab test
  WBC                         9.69 ± 5.83           8.8 ± 4.46                8.98 ± 4.78           <0.001*b
  tHb                         12.97 ± 2.18          13.03 ± 2.55              12.99 ± 2.32          0.729 b
  Plt                         232.81 ± 99.41        nan ± nan                 232.81 ± 99.41        <0.001*b
  Lymph Auto #                1.3 ± 2.25            2.13 ± 1.44               1.92 ± 1.72           <0.001*b
  Sodium Lvl                  135.77 ± 6.11         138.02 ± 3.65             136.99 ± 5.06         <0.001*b
  Potassium Lvl               4.14 ± 0.59           4.15 ± 0.49               4.15 ± 0.54           0.606 b
  BUN                         21.81 ± 20.96         5.13 ± 4.32               12.77 ± 16.75         <0.001*b
  Creatinine                  36.81 ± 84.86         78.3 ± 104.39             59.27 ± 98.12         <0.001*b
  Alk Phos                    99.38 ± 68.16         103.83 ± 79.9             102.72 ± 77.15        0.187 b
  AST                         71.5 ± 388.91         42.73 ± 222.26            54.68 ± 303.08        0.021 b
  Albumin Lvl                 27.04 ± 8.74          34.54 ± 5.2               32.68 ± 7.05          <0.001*b
  ALT                         77.44 ± 114.05        44.16 ± 100.25            45.66 ± 101.99        <0.001*b
  Bili Total                  11.18 ± 18.35         10.57 ± 19.74             10.72 ± 19.42         0.477 b
  INR                         1.08 ± 0.28           1.08 ± 0.26               1.08 ± 0.27           0.680 b
  LDH                         376.81 ± 331.01       363.38 ± 298.75           375.34 ± 327.59       0.553 b
  Procalcitonin               1.67 ± 12.4           3.54 ± 12.36              1.91 ± 12.4           0.036 b
  CRP                         82.72 ± 84.94         68.91 ± 84.35             80.59 ± 84.97         0.009 b
  Ferritin Lvl                1007.16 ± 2892.55     709.05 ± 6046.88          960.46 ± 3574.55      0.419 b
  Hgb A1c                     8.19 ± 5.34           7.47 ± 2.16               7.96 ± 4.58           0.001 b
  D-Dimer                     1.61 ± 2.51           2.37 ± 3.58               1.66 ± 2.61           0.030 b
  BNP                         2872.61 ± 14459.7     2505.09 ± 4094.67         2757.2 ± 12191.43     0.569 b
  Total CK                    126.63 ± 610.22       282.06 ± 505.17           170.01 ± 586.74       <0.001*b
  Vitamin D 25 OH             43.57 ± 33.94         42.62 ± 32.02             42.7 ± 32.16          0.818 b
Medication
  614 Medications (No)        1231221, 98.5%        1623034, 98.6%            3269812, 98.7%        1.000 a
  614 Medications (Yes)       18623, 1.5%           23306, 1.4%               41976, 1.3%
The distribution of the data was also examined using a number of visualizations. Because there are many medical features, Figure 4 shows the distribution of only some continuous and categorical features.

Figure 4. Exploratory analysis of the numeric features.

Almost 59.5% of patients across all classes are Saudi citizens. The majority of patients are male, and the age feature values fall between 20 and 60. Most of the MEWS scores are normal, while there are a few critical cases from all classes. C-reactive protein (CRP) and Vitamin D 25 OH both show a right-skewed distribution, while white blood cells (WBC), blood platelets (Plt), and lymphocyte percentage (Lymph Auto #) show an almost normal distribution. The Pearson correlation coefficient was used to obtain the relationships between the continuous features, while Cramér's V was used to measure the association between the categorical features.

Figure 5. Correlation of the continuous features in the dataset.

Figure 5 shows the correlation of the continuous features; the heat map shows the correlation between the twenty-four continuous features. A correlation value of 0.75 was recorded between creatinine and total CK, and a correlation of 0.72 was observed between AST and ALT. Furthermore, Table 5 shows that there is no association between age and nationality, while there is a weak association between age and MEWS score and between nationality and MEWS score.

Table 5. Correlation of the categorical features in the dataset.

Features                     Correlation
(AGE, NATIONALITY)           0.000000
(AGE, MEWS Score)            0.196708
(NATIONALITY, MEWS Score)    0.187111

3.3. Data Pre-processing

Efficient pre-processing of data can have a major effect on the reliability and quality of deep learning model results. It helps guarantee that the data is accurate, in the right format, free of errors, and in line with the objectives of the modeling tasks [17].

To preserve data privacy, we anonymized the identity of the patients in the CXR images and the tabular data, because it is not included in the analysis. In the following subsections, we discuss each pre-processing step for both the tabular data and the CXR images in detail and illustrate the main methods.

3.3.1. Tabular Data

The raw dataset contains tabular data that was obtained in its original, unprocessed form; we cleaned and organized it into a unified tabular structure in line with the Dubai COVID-19 data format.

Before addressing the missing values in the tabular data, the data types were checked to make sure that the data in the dataset is correctly formatted and consistent. Since the presence of string values in the numeric data is minimal, the appropriate action is to convert these string values to null and then treat them as missing values.

As is well known, managing missing values in medical data is not straightforward [18]. This is due to the nature and sensitivity of this data, where replacing a missing value with 0 carries a different meaning in this field. For this reason, we predict the values for the features that have less than 75% missing values for each file separately. Due to the importance of all features and the inability to remove any of the existing ones, we combine all files and repeat the process for the features in which the percentage of missing values is higher than 75%.
Extreme Gradient Boosting (XGBoost) and Random Forest are the machine learning models used to impute continuous and categorical missing values. For each class, the data is split into two parts: a training dataset (non-missing values) and a test dataset (missing values to be imputed). The XGBoost Regressor and Random Forest Regressor are used for continuous features, while categorical features are imputed using the XGBoost Classifier and Random Forest Classifier. Thereafter, box and bar plots are used to visualize the distribution of continuous and categorical values, respectively, and to spot the outliers. The interquartile range (IQR) method is used to identify the outliers; values outside the range of the lower bound [Q1 - 1.5 × IQR] and upper bound [Q3 + 1.5 × IQR] are considered outliers. Considering the importance of each individual patient in the dataset, no patients were removed. The outlier values are replaced with "NaN" and then imputed using the XGBoost Regressor and Random Forest Regressor for continuous features. Subsequently, we return and impute the features where the percentage of missing data exceeds 75%; each column in this list is considered a target for imputation. The same approach is used to create an initial predictive model using the target column and common non-null features from the trainable datasets. Following each training cycle, a decision tree structure visualization of the model is created and saved; this visualization aids in understanding the model's decision pathways. Once these predictions are made, the missing values in the original datasets are updated.
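As a concrete illustration, the outlier masking and model-based imputation might look like the following minimal sketch; the column names are hypothetical, and the Random Forest regressor/classifier variants follow the same train-on-known/predict-missing pattern.

```python
# Minimal sketch of the IQR outlier masking and XGBoost-based imputation
# described above; column names are hypothetical examples.
import pandas as pd
from xgboost import XGBRegressor

def mask_iqr_outliers(s: pd.Series) -> pd.Series:
    """Replace values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] with NaN."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.where(s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))

def impute_continuous(df: pd.DataFrame, target: str, features: list) -> pd.DataFrame:
    """Train on rows where `target` is present, then predict it where missing."""
    known, missing = df[df[target].notna()], df[df[target].isna()]
    if not missing.empty:
        model = XGBRegressor(n_estimators=200, max_depth=4)
        model.fit(known[features], known[target])  # XGBoost tolerates NaNs in features
        df.loc[missing.index, target] = model.predict(missing[features])
    return df

# Usage with hypothetical columns:
# df["WBC"] = mask_iqr_outliers(df["WBC"])
# df = impute_continuous(df, target="WBC", features=["Age", "BMI", "CRP"])
```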
Then, a log transformation is applied to the continuous columns to make the data more normally distributed, and the numerical values are standardized to a mean of 0 and a standard deviation of 1, ensuring the data is appropriate for subsequent modeling steps. Principal component analysis (PCA) is used to extract features from the high-dimensional tabular data, with the Maximum Likelihood Estimation (MLE) method used to choose the number of principal components. Reducing them from 644 to 218 components significantly decreases the complexity of the data while retaining the essential variance, focusing on the most informative aspects of the data for modeling. The most important demographic and health-related features, in order, are: age, BMI, MEWS score, nationality, then gender. The top-ranked lab tests, in order, are: INR, Total CK, D-Dimer, CRP, LDH, Albumin Lvl, BUN, Vitamin D 25 OH, and WBC. The following medications were among the most important: Spironolactone, Granisetron, Colchicine, Sulfasalazine, Nifedipine, Gliclazide, Sodium Bicarbonate, Ferrous Sulfate, Metformin, and Pioglitazone, respectively.
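A minimal sketch of this transform-standardize-reduce step with scikit-learn follows; a random non-negative matrix stands in for the private 644-feature table.

```python
# Sketch of the log transform, standardization, and MLE-based PCA reduction;
# a random matrix stands in for the private tabular data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 2.0, size=(4523, 644))    # stand-in for the continuous features
X = np.log1p(X)                              # log transform toward a normal shape
X = StandardScaler().fit_transform(X)        # mean 0, standard deviation 1

pca = PCA(n_components="mle")                # Minka's MLE picks the component count
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # the paper reports 644 -> 218 components
```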

3.3.2. Images

All the CXR images were downloaded from picture archiving and communication systems (PACS). The radiology consultant was provided with files containing patient file numbers for each class and fetched all the CXR images in the relevant range of years (e.g., COVID-19 from 2020-2022) for the listed patients. Each image is selected and labeled with the represented class by matching the date when the patient visited the hospital and was diagnosed with the disease against the date of the CXR image.

The CXR images are obtained in the DICOM (Digital Imaging and Communications in Medicine) format. The MicroDicom DICOM viewer (version DM_PLATFORM_XRAY_GANAPATI_4.10.2_2020_FW41.1_158) was used to convert the images to the appropriate JPG file format [19].
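The study itself used the MicroDicom viewer for this conversion; a scripted equivalent with pydicom and Pillow, shown only as an assumed alternative route, could look like this:

```python
# Hedged sketch of DICOM-to-JPG conversion with pydicom/Pillow (the study used
# the MicroDicom viewer; this is just an equivalent scripted route).
import numpy as np
import pydicom
from PIL import Image

def dicom_to_jpg(dcm_path: str, jpg_path: str) -> None:
    ds = pydicom.dcmread(dcm_path)
    px = ds.pixel_array.astype(np.float32)
    px = (px - px.min()) / (px.max() - px.min() + 1e-8) * 255.0  # rescale to 8-bit
    Image.fromarray(px.astype(np.uint8)).convert("L").save(jpg_path, "JPEG")
```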
To help the deep learning model focus on the chest area (especially the lungs), the images are manually cropped, cutting out the chest region and removing any other parts of the body that appear in the image, while guaranteeing that no part of the lungs is removed. Applying image enhancement is also important to improve the classification result; a gamma correction-based technique's ability to detect COVID-19 from CXR images outperforms other methods (Rahman et al., 2021). After trying different thresholds, a gamma value of 0.9 was chosen with a contrast enhancement factor of 1.5 to enhance the contrast of the CXR images. Combining them enables a more thorough adjustment of the appearance and tonal range of the image. If P is the set of pixel values within the range [0, 255] and x is a pixel's grayscale value (x ∈ P), then the output of the gamma correction function g(x) is calculated with Eq. (1):

g(x) = 255 · (x / 255)^(1/γ)    (1)
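Eq. (1) with the reported thresholds can be sketched as below; the paper does not spell out its contrast formula, so the mid-gray stretch used here is one common convention, not necessarily the authors' exact implementation.

```python
# Gamma correction per Eq. (1) plus a simple contrast stretch; gamma 0.9 and
# contrast 1.5 are the thresholds reported above, but the contrast formula
# itself is an assumption, not taken from the paper.
import numpy as np

def enhance(img: np.ndarray, gamma: float = 0.9, contrast: float = 1.5) -> np.ndarray:
    """img: uint8 grayscale array with values in [0, 255]."""
    g = 255.0 * (img / 255.0) ** (1.0 / gamma)           # Eq. (1)
    c = np.clip((g - 128.0) * contrast + 128.0, 0, 255)  # stretch about mid-gray
    return c.astype(np.uint8)
```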
In addition, we apply image denoising using the total variation filter (TVF) method to remove noise from the images. Based on the literature, combining contrast enhancement (gamma correction) and image denoising (TVF) produces outstanding results on COVID-19 images (Sharma & Mishra, 2022). Moreover, transformations are used to preprocess the images before the training phase. Because the CXR images come from different sources at different sizes, they were standardized by resizing. The images were resized to 128×128 pixels, which gave better results in the experiments. They were then converted to grayscale, the pixel values were normalized by dividing by 255 (the maximum pixel value for 8-bit images), and finally each image was converted to a tensor for the deep learning framework.
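A sketch of this denoise-resize-normalize chain using scikit-image follows; the TV weight of 0.1 is an illustrative choice, not a value reported in the paper.

```python
# Sketch of the TVF denoising and resize/normalize pipeline with scikit-image;
# the TV weight 0.1 is illustrative.
import numpy as np
from skimage.restoration import denoise_tv_chambolle
from skimage.transform import resize

def preprocess(img: np.ndarray) -> np.ndarray:
    """img: uint8 grayscale array; returns a (128, 128, 1) float32 array in [0, 1]."""
    den = denoise_tv_chambolle(img / 255.0, weight=0.1)  # total variation filter
    out = resize(den, (128, 128), anti_aliasing=True)    # standardize the size
    return out.astype(np.float32)[..., np.newaxis]       # tensor-ready channel axis
```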

3.3.3. Eliminating Rib Shadows in CXR Images

A significant challenge in the study of chest radiographs is the invisibility of anomalies due to the superimposition of normal anatomical components, like ribs, over the primary tissue under examination [20]. Therefore, it would be helpful to eliminate the ribs without losing any information about the original tissue when trying to increase nodule visibility and identify nodules on a chest radiograph. For that reason, we tried to apply a method [21] to remove the rib shadows from the infected lungs to improve nodule detection and enhance the diagnostic process. Using a hybrid self-template approach, the algorithm first tries to identify the ribs; an unsupervised regression model is then used to suppress the identified ribs. We attempted to adapt the paper's approach to our private dataset in order to achieve the same good rib-elimination results that it reported on its CXR images.

After preprocessing the images, the lung area is defined using a Gaussian filter to extract the mask from each image. Rib detection is done with a bilateral filter, and the result is converted to grayscale using Extreme Level Eliminating Histogram Equalization (ELEHE) to improve the visibility of features in the images. After that, Sobel edge detection is applied to the equalized image to define the edge thickness. Then, dilation is applied to merge nearby bright regions and increase their size; an opening operation is also performed on the dilated images. Erosion is applied to the opened images to shrink the bright regions and refine the image by reducing the size of the remaining bright regions after the opening operation. To prepare the images for parabola fitting, connected component analysis is performed on the images to define the connected components. The first (background) and last (foreground) components are discarded as they are not useful, and the number of connected components is then fine-tuned. Thereafter, the parabola fitting is calculated by Eq. (2):

f(x) = ax² + bx + c    (2)


The best-fitting connected components are considered, and all the curves are plotted on the images using the polylines function. In the end, the rib region is acquired, the shadow estimation is clearly defined, and suppression is done by removing the shadows from the image, adjusting the pixel values in the shadow regions based on the average BGR color values.
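A heavily condensed sketch of these rib-detection steps follows; ELEHE is approximated here with standard histogram equalization, and all filter parameters are illustrative guesses rather than the cited method's settings.

```python
# Condensed sketch of the rib-detection steps above (bilateral filter, edge
# detection, morphology, connected components, parabola fit). ELEHE is
# approximated by cv2.equalizeHist; all parameters are illustrative.
import cv2
import numpy as np

img = cv2.imread("cxr.jpg", cv2.IMREAD_GRAYSCALE)
smooth = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)
eq = cv2.equalizeHist(smooth)                              # stand-in for ELEHE
edges = cv2.Sobel(eq, cv2.CV_8U, 0, 1, ksize=3)            # mostly horizontal rib edges
_, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

kernel = np.ones((3, 9), np.uint8)
morph = cv2.dilate(binary, kernel)                         # merge nearby bright regions
morph = cv2.morphologyEx(morph, cv2.MORPH_OPEN, kernel)    # opening
morph = cv2.erode(morph, kernel)                           # refine remaining regions

n, labels = cv2.connectedComponents(morph)
for i in range(1, n - 1):                                  # drop background/foreground extremes
    ys, xs = np.nonzero(labels == i)
    if xs.size < 50:                                       # skip tiny components
        continue
    a, b, c = np.polyfit(xs, ys, 2)                        # Eq. (2): f(x) = ax^2 + bx + c
    curve = np.array([[x, int(a * x * x + b * x + c)]
                      for x in range(xs.min(), xs.max())], dtype=np.int32)
    cv2.polylines(img, [curve], isClosed=False, color=255, thickness=1)
```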

[Figure 6: panels (A) Original image, (B) Lung mask, (C) Sobel edge detection, (D) Dilation, (E) Calculated connected components, (F) Shadow estimation and suppression, (G) Rib suppression, (H) Shadow subtraction.]

Figure 6. An example of a rib shadow elimination result.

Applying this approach to our private CXR images, as shown in Figure 6, delivers unacceptable results on most images: it removes many nodules from the lung area, which affects the prediction results. It is important to note that the effectiveness of shadow suppression may depend on the characteristics of the images being processed; we attribute the failure to a lack of clarity of vision, and sometimes of the lung boundary, in most of our images, despite preprocessing them. Testing on a variety of images is often necessary for a robust shadow removal model. For that reason, we decided not to apply rib elimination in this experiment.

3.4. Generating Synthetic Dataset

To avoid bias in model training, we need to balance the dataset, ensuring that the number of instances for each class is roughly the same; balanced datasets often lead to better model performance [22]. Since the number of cases for the classes at each level of the hierarchy structure is not balanced, we need to generate synthetic data to balance the dataset. For that, we chose GANs as the base model. The model has been developed in a variety of versions, each with a particular purpose [23].

A Conditional Tabular Generative Adversarial Network (CTGAN) was used for generating synthetic tabular data. The CTGAN synthesizer is the generator, initialized for each data file with the number of epochs that yields the optimal loss values for both the generator and the discriminator. The synthesizer is fitted to the data and then used to generate synthetic data; the desired number of records to generate is specified to achieve balance within each level of the dataset. Evaluating the quality of the synthetic data compared to the real data yields superior results.
To generate synthetic CXR images corresponding to each synthetic tabular record, we used a combination of tabular data and an image as inputs to a customized Conditional Generative Adversarial Network (CGAN); the generator is designed to produce 500×500 images, and the discriminator processes both tabular and image data. It is important to note that numerical-to-image synthesis with a CGAN is a challenging task. The generator model has 10 layers and the discriminator model has 8 layers; the count includes various types of layers, such as Dense, Conv2D, Conv2DTranspose, Batch Normalization, Dropout, ReLU, LeakyReLU, and Flatten layers. Both the generator and discriminator use the Adam optimizer with a learning rate of 0.0001 and a beta_1 of 0.5, and a dropout rate of 0.2 is used in the discriminator.

The model is trained for a number of epochs for each class to generate synthetic data. The model generates acceptable results; some samples are shown in Figure 7. In the next section, we investigate how synthetic data improves classification accuracy, especially when only a small amount of data is available.
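A condensed Keras-style sketch of such a conditional generator is given below; it does not reproduce the exact 10-layer configuration, and it emits 128×128 images rather than 500×500 for brevity, but it uses the reported optimizer settings (Adam, learning rate 0.0001, beta_1 0.5).

```python
# Condensed Keras sketch of a conditional generator mapping (noise, tabular
# vector) to an image; layer sizes are illustrative, not the paper's exact model.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(noise_dim=100, tab_dim=218):
    noise = layers.Input(shape=(noise_dim,))
    tab = layers.Input(shape=(tab_dim,))
    x = layers.Concatenate()([noise, tab])          # condition on the tabular record
    x = layers.Dense(16 * 16 * 256)(x)
    x = layers.ReLU()(layers.BatchNormalization()(x))
    x = layers.Reshape((16, 16, 256))(x)
    for filters in (128, 64, 32):                   # upsample 16 -> 32 -> 64 -> 128
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same")(x)
        x = layers.ReLU()(layers.BatchNormalization()(x))
    img = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model([noise, tab], img)

gen = build_generator()
opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5)  # settings reported above
```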
[Figure 7: panels (A) Bacterial, (B) Adenovirus, (C) Influenza, (D) RSV.]

Figure 7. Samples of synthetic CXR images.

4. Hierarchical Model Architecture

To apply deep learning models for hierarchical classification, we adapted four pre-trained models to tackle the hierarchical classification process. It has been observed that Visual Geometry Group (VGG)-based and Residual Network (ResNet)-based models are widely utilized in this field and provide outstanding results (Andrade-Girón et al., 2023; Saini & Devi, 2023). We adopt VGG11 and ResNet18 as the basic models for this challenge because both are well suited to the moderate size of our dataset. The dataset consists of eight hierarchical paths, shown in Table 7. The details of all models are explained in the following subsections.

Table 7. Dataset distribution for the hierarchical classification.

Label Path                       #Samples
Level #1
  Normal                         1273
  Pneumonia                      3270
Level #2
  Pneumonia\Bacterial            248
  Pneumonia\Viral                3165
Level #3
  Pneumonia\Viral\SARSr-CoV-2    1848
  Pneumonia\Viral\Influenza      1281
  Pneumonia\Viral\RSV            21
  Pneumonia\Viral\Adenoviruses   15

4.1. Hierarchical Convolutional Neural Network Based on the VGG Architecture

A VGG-based network was mainly used as the deep learning multi-modal for the proposed method, with two architectures. The first architecture, called the VGG-Like multi-modal, adapts the architectural principles of the VGG neural network, which utilizes repetitive blocks of convolutional layers followed by max-pooling layers to effectively extract features from CXR images. Our VGG-Like multi-modal simplifies and tailors the original design for hierarchical decision-making in pneumonia classification from CXR images, as shown in Figure 8. The network was adapted to process single-channel (grayscale) CXR images by modifying the first convolutional layer to accept a single input channel. The depth of the network is also adjusted: the model includes fewer convolutional layers than some VGG models (e.g., VGG16, VGG19), making it effective for the targeted dataset. The model introduces branching points to make hierarchical decisions at different levels of pneumonia classification: normal vs. pneumonia, bacterial vs. viral pneumonia, and further subclassification of viral pneumonia. This hierarchical approach is novel and not present in the standard VGG architecture. After the initial shared convolutional layers, the network branches out to make specific decisions, with each branch having its own set of convolutional and fully connected layers tailored to its classification task. Considering the dataset size, the fully connected (ANN) layers in the branches are simplified compared to the dense layers in the original VGG models, reducing the model's complexity and the risk of overfitting on medical imaging datasets, which are typically smaller than ImageNet.

Figure 8. VGG-Like multi-modal architecture.

The input image size was set to 128×128 pixels. The initial CNN layer for the first-level decision consists of a 2D convolution with a 3×3 kernel using ReLU activation, followed by max pooling with a 2×2 kernel. The input to the initial layer has 1 channel, and the output has 32 channels. The flattened output from the initial CNN layer passes to a hidden layer of 128 neurons with ReLU activation, and the output (decision #1) represents the probabilities of the normal/pneumonia classes. The pneumonia CNN block for the second-level decision consists of two convolutional layers with the same kernel size, activation, and max pooling layer; its output (decision #2) represents the probabilities of the viral/bacterial classes. The viral CNN block for the third-level decision has the same description as the previous level, and its output (decision #3) represents the probabilities of the SARSr-CoV-2/Influenza/RSV/Adenovirus classes. The second architecture is the VGG-Backbone multi-modal, which adapts the original VGG architecture by utilizing the pre-trained network as a feature extractor, followed by three branches of fully connected layers (ANNs) for the hierarchical decision-making task. The first convolutional layer was modified to accept a single input channel, and the fully connected layers of the original architecture were replaced with custom layers designed to make hierarchical decisions specific to pneumonia classification, as shown in Figure 9. This adaptation allows the models to focus on the most relevant features for each decision level.
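A condensed PyTorch sketch of the VGG-Like image branch described above follows (the shared stem plus three decision heads); layer counts and widths follow the description only loosely, and the tabular fusion path is omitted for brevity.

```python
# Condensed PyTorch sketch of the VGG-Like hierarchy: a shared 1-channel stem
# and three decision heads (normal/pneumonia, bacterial/viral, viral subtype).
import torch
import torch.nn as nn

class VGGLikeHier(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                    # 1x128x128 input -> 32x64x64
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(                  # -> 64x32x32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(                  # -> 128x16x16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head1 = self._branch(32 * 64 * 64, 2)    # normal vs pneumonia
        self.head2 = self._branch(64 * 32 * 32, 2)    # bacterial vs viral
        self.head3 = self._branch(128 * 16 * 16, 4)   # SARSr-CoV-2/Influenza/RSV/Adenovirus

    @staticmethod
    def _branch(in_feats, n_classes):
        return nn.Sequential(nn.Flatten(), nn.Linear(in_feats, 128),
                             nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, x):
        f1 = self.stem(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        return self.head1(f1), self.head2(f2), self.head3(f3)

# d1, d2, d3 = VGGLikeHier()(torch.randn(2, 1, 128, 128))  # three decision logits
```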

Figure 9. VGG-Backbone multi-modal architecture.

4.2. Hierarchical Convolutional Neural Network Based on the ResNet Architecture

In addition to the VGG-based multi-modal, a ResNet-based network was also used as a deep learning multi-modal, with two architectures. The first architecture, called the ResNet-Like multi-modal, was inspired by the ResNet architecture and adapts it for the same hierarchical decision-making task in pneumonia classification, as shown in Figure 10. The key adaptations are a modified first convolutional layer that accepts grayscale images, reflecting the single-channel nature of CXR images, and customized residual blocks that match the task's complexity and data characteristics. Each block consists of convolutional layers with batch normalization and ReLU activation, similar to ResNet's design, but the number and configuration of blocks are tailored to the pneumonia classification task. Similar to the VGG-Like model, the ResNet-Like model incorporates branching points for hierarchical classification decisions. This structure leverages the deep feature representation capability of ResNet while providing specialized decision paths for different classification levels. The skip connections in each block ensure effective training and feature propagation, even with the model's depth. The model concludes with simplified, fully connected layers in each branch for decision-specific classification.

Figure 10. ResNet-Like multi-modal architecture.

The ResNet-Backbone multi-modal is the second architecture, which is an adaptation of the original ResNet architecture. As shown in Figure 11, after utilizing the pre-trained network as a feature extractor, it employs three branches of fully connected layers (ANNs) for the hierarchical decision-making process.

Figure 11. ResNet-Backbone multi-modal architecture.

By adapting ResNet's residual learning principle, the model efficiently learns features from CXR images, which is crucial for medical imaging tasks where interpretability and accuracy are paramount.
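A sketch of the backbone variant under these assumptions: torchvision's resnet18 with the first convolution adapted to one channel, the classifier removed, and three ANN heads fed by an early fusion of the 512-d image features with the tabular vector (the 218-d size matches the PCA step above but is an assumption here).

```python
# Sketch of the ResNet-Backbone multi-modal: pre-trained resnet18 features with
# a 1-channel first conv and three ANN decision heads; the early fusion with a
# 218-d tabular vector is an assumption consistent with the PCA step above.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class ResNetBackboneHier(nn.Module):
    def __init__(self, tab_dim=218):
        super().__init__()
        base = resnet18(weights=ResNet18_Weights.DEFAULT)
        base.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        base.fc = nn.Identity()                 # keep the 512-d feature vector
        self.backbone = base
        in_dim = 512 + tab_dim                  # early fusion of image + tabular features
        self.head1 = nn.Linear(in_dim, 2)       # normal vs pneumonia
        self.head2 = nn.Linear(in_dim, 2)       # bacterial vs viral
        self.head3 = nn.Linear(in_dim, 4)       # viral subtypes

    def forward(self, img, tab):
        feats = torch.cat([self.backbone(img), tab], dim=1)
        return self.head1(feats), self.head2(feats), self.head3(feats)
```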

4.3. Training the Hierarchical Multi-modal Methodology

With the four architectures trained in hierarchical multi-modal fashion, the model first determines if the image shows signs of pneumonia. If pneumonia is detected, it then classifies the pneumonia as either viral or bacterial. If viral pneumonia is detected, the model further classifies the type of viral pneumonia. The hierarchical inference function returns a tuple of decisions, each corresponding to a level in the decision hierarchy. Algorithm 1 details the proposed method in the form of pseudocode.
Algorithm 1. Pneumonia Hierarchical Classification
1: Input: CXR images, Tabular data, trained model
2: Output: Normal OR Pneumonia/Bacterial OR Pneumonia/Viral/(SARSr-COV-2/Influenza/RSV/Adenovirus)
3: Function Hierarchical_Inference(image, tabular_data, model)
4:   decision_1_probs = model.forward_pass(image, tabular_data, decision_point='decision_1')
5:   decision_1 = ArgMax(decision_1_probs)  # Normal vs Pneumonia classification
6:   If decision_1 is 'Pneumonia' Then
7:     decision_2_probs = model.forward_pass(image, tabular_data, decision_point='decision_2')
8:     decision_2 = ArgMax(decision_2_probs)  # Viral vs Bacterial classification
9:     If decision_2 is 'Viral' Then
10:      decision_3_probs = model.forward_pass(image, tabular_data, decision_point='decision_3')
11:      decision_3 = ArgMax(decision_3_probs)  # Subtypes of Viral Pneumonia classification
12:      Return ('Pneumonia', 'Viral', decision_3)  # SARSr-COV-2/Influenza/RSV/Adenovirus
13:    Else
14:      Return ('Pneumonia', 'Bacterial', 'N/A')  # No further subclassification
15:    End If
16:  Else
17:    Return ('Normal', 'N/A', 'N/A')  # No pneumonia detected, no further classification
18:  End If
19: End Function
The training methodology adopted for VGG-Like and ResNet-Like involves a sequential and focused approach, targeting one decision point at a time within the hierarchical structure of the problem. This approach ensures that the models learn to classify accurately at each level of decision-making, from distinguishing between normal and pneumonia cases to identifying specific types of pneumonia. The training process begins with the first decision point, which distinguishes between normal and pneumonia cases. During this phase, the loss weight for the first decision is set to 1, while the loss weights for subsequent decisions are set to 0. This concentrates the model's learning on accurately classifying the initial coarse categories without being influenced by the more detailed classifications that follow. After the model achieves satisfactory performance on the first decision, the training proceeds to the next decision point (bacterial vs. viral). For this phase, the model's weights from the previous training step are retained, ensuring continuity and leveraging learned features. The loss weight for the current decision is now set to a higher value (e.g., 0.9 for decision 2), while the loss weight for the first decision is reduced (e.g., to 0.1) to maintain its knowledge, and the loss weight for the third decision is set to 0. This process is repeated for each subsequent decision point, gradually shifting the model's focus down the hierarchy.
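This staging can be expressed as a weighted sum of per-decision cross-entropy losses, as in the sketch below; the phase-3 weights are illustrative (the text only gives the 0.9/0.1 example for phase 2), and in practice decisions 2 and 3 would be masked to the samples they apply to.

```python
# Sketch of the stage-wise loss weighting: each training phase emphasizes one
# decision while earlier ones keep a small weight. The phase-3 split is
# illustrative; decisions 2/3 should be computed only on applicable samples.
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
stages = [(1.0, 0.0, 0.0),   # phase 1: normal vs pneumonia only
          (0.1, 0.9, 0.0),   # phase 2: focus on bacterial vs viral (values from the text)
          (0.1, 0.1, 0.8)]   # phase 3: focus on viral subtypes (illustrative)

def hierarchical_loss(outputs, targets, weights):
    """outputs/targets: tuples of logits and labels for decisions 1..3."""
    return sum(w * criterion(o, t)
               for w, o, t in zip(weights, outputs, targets) if w > 0)
```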
Although the VGG-Backbone and ResNet-Backbone models are not hierarchical in architecture, the training process incorporates hierarchical principles to align with the structured decision-making process of the problem. For these models, the training methodology mimics the sequential focus used for the hierarchical models, adapting the learning process to emphasize one level of classification at a time. This structured approach ensures that the backbone models, which are powerful feature extractors due to their pre-trained weights, are finely tuned to the specific requirements of each decision point in the classification task.

While cases in hierarchical classification have to follow a predefined hierarchy structure, local (top-down) and global (big-bang) are the two main approaches that can be used for addressing hierarchical classification (Silla & Freitas, 2011). Both are implemented in the hybrid approach used in this work. From the local perspective, the inference function proceeds in a top-down manner, starting with a broad classification (decision_1) and refining the classification based on subsequent decisions (decision_2 and decision_3). From the global perspective, the function considers the entire hierarchy of classifications, as it defines the possible outcomes at each level and makes decisions based on the entire set of possibilities. Therefore, both local and global approaches are implemented in this hybrid approach.

4.4. Evaluation Metrics

To analyze the general classification performance, we chose macro-averaged evaluation to calculate the mean of the evaluation metrics over the classes. The proposed models were assessed using common evaluation metrics such as accuracy, precision, sensitivity, and F1-score. The research targets hierarchical multiclass classification, using a 6×6 confusion matrix to output four values: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), which are used to calculate the following measures [24]:

Macro-average accuracy measures the average proportion of correctly predicted instances for each class, relative to the total number of instances evaluated, as shown in Equation (3).
Macro-average Accuracy = (1/C) × Σᵢ (TPᵢ + TNᵢ) / (TPᵢ + TNᵢ + FPᵢ + FNᵢ)    (3)

Macro-average precision measures the average ratio of true positives among all predicted positives across all classes, as shown in Equation (4).

Macro-average Precision = (1/C) × Σᵢ TPᵢ / (TPᵢ + FPᵢ)    (4)

Macro-average sensitivity measures the average ability of the model to correctly identify true positives across all classes, as shown in Equation (5).

Macro-average Sensitivity = (1/C) × Σᵢ TPᵢ / (TPᵢ + FNᵢ)    (5)

Macro-average F1-score combines both precision and recall into a single metric, providing an overall measure of the model's performance in terms of correctly identifying true positives and minimizing false positives and negatives, as shown in Equation (6).

Macro-average F1-score = (1/C) × Σᵢ 2TPᵢ / (2TPᵢ + FPᵢ + FNᵢ)    (6)

where:
C is the number of classes in the classification task;
TPᵢ is the number of True Positives for class i;
TNᵢ is the number of True Negatives for class i;
FPᵢ is the number of False Positives for class i;
FNᵢ is the number of False Negatives for class i.
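These macro-averaged quantities can be checked against scikit-learn, which averages the per-class scores with equal weight; Eq. (3)'s per-class accuracy (which includes TN) is derived from the confusion matrix directly. The tiny label lists below are placeholders for the six leaf classes.

```python
# Reproducing the macro-averaged metrics with scikit-learn; the label lists
# are placeholders for the leaf classes.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = ["COVID", "Influenza", "Normal", "COVID"]
y_pred = ["COVID", "Normal", "Normal", "COVID"]

prec, sens, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # Eqs. (4)-(6)

cm = confusion_matrix(y_true, y_pred)
TP = np.diag(cm)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
TN = cm.sum() - TP - FP - FN
macro_acc = np.mean((TP + TN) / (TP + TN + FP + FN))   # Eq. (3)
```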

5. Results and Discussion

The experiments are implemented in Python 3.10.12 with PyTorch 2.1.0+cu121 for the deep learning models. Training was performed on a PC running the 64-bit Windows 11 Pro operating system with an Intel® Core™ i7-10700 CPU @ 2.90 GHz and 32 GB of RAM. Due to the limited size of some classes in the dataset (4,523 patients in total) and to obtain a more robust model, the data was split 70/30 for training and testing, respectively, using k-fold cross-validation with k set to 5. Iterative search was used for adjusting the hyperparameter values of the models. Each model uses the Adam optimizer with a learning rate of 0.001, a learning-rate scheduler patience of 3, an early stopping patience of 5 for the model without tabular data and 15 for the model with tabular data, and cross-entropy as the loss function. The models were trained for 20 to 40 epochs with a batch size of 32. The hierarchical multi-modals performed consistently well, given that the cross-validation score was consistent across all folds.

As mentioned in Section 4, four hierarchical deep learning multi-modals were created to diagnose COVID-19 using CXR images and clinical tabular data, and they were applied in several experiments. The first and second experiments were conducted to measure the performance of the hierarchical deep learning models with and without the second dataset (using only CXR images), which was described in Section 3, and before integrating the synthetic CXR images. As shown in Figure 12, integrating the second dataset with the first one clearly improves the macro-average accuracy for all models.
[Figure 12: bar chart of macro-average accuracy. Resnet backbone model (First + Second) dataset: 82.06; Resnet backbone model (First) dataset: 75.85; VGG backbone model (First + Second) dataset: 82.2; VGG backbone model (First) dataset: 78.53; Resnet like model (First + Second) dataset: 85.77; Resnet like model (First) dataset: 83.48; VGG like model (First + Second) dataset: 84.25; VGG like model (First) dataset: 81.45.]

Figure 12. Comparison of macro-average accuracy for all models with and without a second dataset.

The third experiment was run for all models after integrating the synthetic CXR images with the original CXR images from all datasets. Table 8 shows the results of all decisions at each level for each hierarchical classification schema. We observed that the best results for decision #1 and decision #2 (which are binary classifications) were obtained by the Resnet-Like model, while the best results for decision #3 (multi-class classification) were obtained by the Resnet-Backbone model. In addition, the results of the comparison of the COVID-19 class for each hierarchical classification schema are shown in Table 9. The COVID-19 classification using the Resnet-Backbone model is higher than the other models, with an F1-score of 85.82% and an accuracy of 92.88%. Table 10 presents the macro-avg results achieved by all hierarchical classification models. Compared to the first and second experiments, the results clearly improved for all models after integrating the synthetic CXR images to balance the dataset. The Resnet-Like model achieved the best results among all models, with an F1-score of 92.65% and an accuracy of 92.61% for classifying all the classes.

Table 8. Results of decisions at each level for each hierarchical classification schema using only CXR images.

Models            Decision#     Accuracy   Sensitivity   Precision   F1-score
VGG-Like          Decision#1    94.92      94.92         95.12       94.98
                  Decision#2    90.21      90.21         90.80       90.37
                  Decision#3    88.95      88.95         89.06       88.98
Resnet-Like       Decision#1    95.05      95.05         95.26       95.12
                  Decision#2    92.82      92.82         92.90       92.85
                  Decision#3    89.95      89.95         90.01       89.98
VGG-backbone      Decision#1    93.92      93.92         94.40       94.04
                  Decision#2    88.41      88.41         89.51       88.67
                  Decision#3    89.36      89.36         89.43       89.39
Resnet-backbone   Decision#1    93.90      93.90         94.52       94.04
                  Decision#2    89.02      89.02         90.23       89.28
                  Decision#3    90.32      90.32         90.49       90.33
Table 9. Results comparison of COVID-19 class for each hierarchical classification schema using only CXR images.

Models            Accuracy   Sensitivity   Precision   F1-score
VGG-Like          90.82      80.80         83.57       82.17
Resnet-Like       91.55      84.99         83.11       84.04
VGG-backbone      91.42      84.29         83.13       83.71
Resnet-backbone   92.88      82.37         89.56       85.82

Table 10. Macro-average results comparison for each hierarchical classification schema using only CXR images.

Models            Loss of test set   Accuracy   Sensitivity   Precision   F1-score
VGG-Like          0.43               91.36      91.36         91.66       91.45
Resnet-Like       0.37               92.61      92.61         92.72       92.65
VGG-backbone      0.37               90.56      90.56         91.11       90.70
Resnet-backbone   0.30               91.08      91.07         91.75       91.22
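For reference, macro-averaged metrics of this kind can be computed with scikit-learn as sketched below, with sensitivity taken as macro-averaged recall under the usual definitions; this is a generic recipe, not the authors' evaluation code, and the paper's exact metric implementation may differ.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def macro_avg_metrics(y_true, y_pred):
    """Generic macro-averaged metrics (sensitivity = macro recall)."""
    precision, sensitivity, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": sensitivity,
        "precision": precision,
        "f1_score": f1,
    }

# Example usage with integer class labels:
# metrics = macro_avg_metrics([0, 1, 2, 2], [0, 1, 1, 2])
```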
In the last experiment, we applied the multi-modal approach by combining the CXR images and the tabular data. Based on the training accuracy and F1-score shown in Figure 13 (A) and (B), the Resnet-Backbone hierarchical multi-modal model appears to perform the best of the four in this experiment; the VGG-based hierarchical multi-modal models also show good performance. Table 11 reports the different evaluation metrics for all decisions at each level of each hierarchical classification schema. We observed a clear enhancement across all decisions. The best results for decision #1 were obtained by the Resnet-Backbone multi-modal model, while the best results for decision #2 were obtained by the VGG-Backbone multi-modal model; for decision #3, the Resnet-based multi-modal models performed best. COVID-19 classification also improved, as shown in Table 12: the Resnet-Like multi-modal model achieved an accuracy of 93.72% and an F1-score of 88.24%. Table 13 summarizes the macro-average results of this last experiment, in which the integration of the CXR images and the tabular data produced superior diagnostic performance compared with the previous experiment. The Resnet-Backbone model outperforms the other multi-modal models with an accuracy of 95.97% and an F1-score of 95.98%.
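As a hedged illustration of how such early fusion can be wired up, the sketch below concatenates pooled CNN features from a pre-trained Resnet with the clinical tabular vector before a shared classification head. The layer sizes, dropout rate, and `num_tabular_features` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class EarlyFusionNet(nn.Module):
    """Illustrative early-fusion multi-modal network; layer sizes and
    dropout rate are assumptions, not the paper's exact architecture."""

    def __init__(self, num_tabular_features: int, num_classes: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()  # expose the 2048-d pooled image features
        self.image_encoder = backbone
        self.head = nn.Sequential(
            nn.Linear(2048 + num_tabular_features, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, image: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)           # (batch, 2048)
        fused = torch.cat([img_feat, tabular], dim=1)  # early fusion by concatenation
        return self.head(fused)
```

Concatenation at the feature level lets a single classification head learn interactions between the image and clinical modalities before any decision is made, which is the defining property of early fusion.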
[Figure 13 appears here: training curves over epochs, panels (A) and (B).]

Figure 13. Accuracy and F1-score results of training against the number of epochs across all decisions for each multi-modal model in the last experiment: (A) Accuracy; (B) F1-score.

Table 11. Results of decisions at each level for each hierarchical classification schema using CXR images and tabular data.

Models            Decision#     Accuracy   Sensitivity   Precision   F1-score
VGG-Like          Decision#1    93.63      93.63         94.67       93.83
                  Decision#2    96.06      96.06         96.15       96.09
                  Decision#3    91.62      91.62         91.70       91.65
Resnet-Like       Decision#1    95.90      95.90         96.25       95.98
                  Decision#2    93.29      93.29         93.89       93.41
                  Decision#3    93.38      93.38         93.44       93.41
VGG-backbone      Decision#1    97.66      97.66         97.68       97.67
                  Decision#2    96.68      96.68         96.80       96.71
                  Decision#3    91.40      91.40         91.49       91.45
Resnet-backbone   Decision#1    98.13      98.13         98.17       98.15
                  Decision#2    96.42      96.42         96.49       96.44
                  Decision#3    93.35      93.35         93.38       93.36

Table 12. Results comparison of COVID-19 class for each hierarchical classification schema using CXR images and tabular data.

Models            Accuracy   Sensitivity   Precision   F1-score
VGG-Like          92.30      85.71         82.64       84.15
Resnet-Like       93.72      88.69         87.79       88.24
VGG-backbone      91.88      83.51         84.62       84.06
Resnet-backbone   93.89      87.96         86.98       87.47
Table 13. Macro-average results comparison for each hierarchical classification schema using CXR images and tabular data.

Models            Loss of test set   Accuracy   Sensitivity   Precision   F1-score
VGG-Like          0.23               93.77      93.77         94.18       93.86
Resnet-Like       0.20               94.19      94.19         94.53       94.26
VGG-backbone      0.24               95.25      95.25         95.32       95.27
Resnet-backbone   0.33               95.97      95.97         96.01       95.98

[Figure 14 appears here: four confusion-matrix panels, (A)-(D).]

Figure 14. Confusion matrices for: (A) Resnet-Backbone; (B) Resnet-Like; (C) VGG-Backbone; (D) VGG-Like.
Figure 15. Macro-average ROC curve across all decisions for each multi-modal model in the last experiment.

The confusion-matrix plots for the four multi-modal models are depicted above in Figure 14. The horizontal axes correspond to the predicted classes, and the vertical axes correspond to the true classes, i.e., the actual classifications. The diagonal cells of each confusion matrix represent correct predictions (TP and TN), and the off-diagonal cells represent incorrect predictions (FP and FN). Judging from the counts in the false-prediction cells, all the multi-modal models perform fairly well: most of the higher values are concentrated along the diagonal (correct predictions), and the misclassification rate is very low, especially for the Resnet-Backbone multi-modal model. The macro-average ROC curves in Figure 15 show that the VGG-Like multi-modal model (AUC = 0.95) has the best overall performance across all classes, followed closely by the VGG-Backbone and Resnet-Backbone multi-modal models (AUC = 0.93 and 0.92, respectively). Taking the other performance metrics into consideration, the Resnet-Backbone multi-modal model achieved superior performance in the classification process.
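The macro-average ROC curve of Figure 15 is conventionally obtained by averaging the per-class one-vs-rest ROC curves over a common false-positive-rate grid. The sketch below shows that standard recipe with scikit-learn, assuming `y_score` holds per-class predicted probabilities; it is not necessarily the authors' plotting code.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def macro_average_roc(y_true, y_score, classes):
    """Standard macro-average ROC: per-class one-vs-rest curves are
    interpolated onto a common FPR grid and averaged. `y_score` is an
    (n_samples, n_classes) array of predicted probabilities."""
    y_bin = label_binarize(y_true, classes=classes)
    curves = [roc_curve(y_bin[:, i], y_score[:, i]) for i in range(len(classes))]

    # Union of all per-class FPR grid points.
    all_fpr = np.unique(np.concatenate([fpr for fpr, _, _ in curves]))

    # Interpolate each class's TPR onto the common grid and average.
    mean_tpr = np.zeros_like(all_fpr)
    for fpr, tpr, _ in curves:
        mean_tpr += np.interp(all_fpr, fpr, tpr)
    mean_tpr /= len(classes)

    return all_fpr, mean_tpr, auc(all_fpr, mean_tpr)
```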

6. Conclusions
This paper proposes a novel approach for classifying COVID-19 and distinguishing it from other types of pneumonia and normal lungs using CXR images and medical tabular data in four different hierarchical architectures based on Resnet and VGG pre-trained models. The study used a private dataset obtained from King Khalid University Hospital and Rashid Hospital, containing a total of 4544 cases. The study aims to enhance the process of diagnosing COVID-19 and to demonstrate that combining CXR images with clinical data achieves significant improvements in the hierarchical classification process. Overall, the performance metrics of all the hierarchical deep learning models improved after combining the medical data with the CXR images. The Resnet-Backbone model achieved the highest performance, with an accuracy of 95.97%, a precision of 96.01%, and an F1-score of 95.98%. The proposed approach showed promising results, especially the hierarchical deep learning multi-modal models. Our findings could aid in the development of better diagnostic tools for upcoming respiratory disease outbreaks. However, the study suffers from a data imbalance due to the lack of patient medical data for some classes. In future work, we plan to explore more datasets from different sources, including different classes of pneumonia and lung diseases.

Acknowledgments: The authors would like to thank the Deanship of Scientific Research (DSR), King Saud University, Riyadh, Saudi Arabia, and the Dubai Scientific Research Ethics Committee (DSREC), Dubai Health Authority, and Rashid Hospital for their support in this study. In addition, special thanks to the editor and reviewers for spending their valuable time reviewing and polishing this article.
Data Availability: The data are not available due to ethical reasons.

Conflicts of Interest: On behalf of all authors, the corresponding author declares no conflicts of interest.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Institutional Review Board Statement: The study was conducted according to the declaration and guidelines of the Dubai Scientific Research Ethics Committee, DHA (DSREC-12/2021_01), and the King Saud University Institutional Review Board Committee (E-251-5939).


Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
