ResearchProposalFinalVer1 4 33

Uploaded by

Prajakta Rane

1. Introduction
Dengue fever is a mosquito-borne viral infection that poses a significant public health
challenge in India. The disease is caused by four closely related viruses transmitted by the
Aedes mosquito, primarily Aedes aegypti. Dengue fever is prevalent in tropical and
subtropical regions, and India, with its diverse climate and large population, is particularly
susceptible to outbreaks.
Dengue has been a growing concern in India, with cases rising steadily. According to statistics
from the Ministry of Health and Family Welfare, India has experienced a surge in dengue cases
in recent years. In 2019, over 157,000 dengue cases were reported across the country [1]. The
numbers have fluctuated annually, but the trend indicates a significant burden on the
healthcare system.
1.1. Dengue
Dengue (“break-bone fever”) is a viral infection that spreads to people from mosquitoes, primarily
the Aedes aegypti mosquito. It is more prevalent in tropical and subtropical regions. Dengue
patients may have symptoms such as rash, nausea, body aches, headaches, and high fever [2].
Some individuals develop severe dengue and require hospital admission and care; in severe
cases, dengue can be life-threatening. The dengue virus, a member of the family Flaviviridae,
has four serotypes: DEN-1, DEN-2, DEN-3, and DEN-4. The severity of the disease ranges
from mild or absent symptoms to serious and often deadly complications. Symptoms usually
appear four to ten days after the mosquito bite and include sudden onset of high fever, severe
headaches, joint and muscle pain, nausea, vomiting, rash, and mild bleeding (such as
nosebleeds or gum bleeding).
Sometimes, dengue fever can progress to a more severe form known as dengue hemorrhagic
fever (DHF) or dengue shock syndrome (DSS). DHF is characterized by severe bleeding,
organ impairment, and plasma leakage, while DSS involves circulatory failure, shock, and
potentially death if not promptly treated.
The World Health Organization (WHO) has issued dengue case classification criteria for
diagnosis and management twice, in 1997 and 2009. Fig 1 shows the 1997 classification of
dengue virus infection, and Fig 2 shows the revised 2009 classification.

Fig 1 The 1997 WHO classification of dengue virus infection [2]

Fig 2 The 2009 revised WHO dengue classification [2]


Avoiding mosquito bites, especially during the day, is one way to reduce the risk of dengue.
Prevention of dengue fever primarily revolves around controlling mosquito populations and
reducing exposure to mosquito bites. This entails eliminating mosquito breeding grounds
by draining containers of standing water, applying insect repellents, wearing protective
clothing, and using mosquito nets, particularly during periods of high mosquito activity.
Overall, dengue fever continues to be a serious public health concern, especially in regions
where the Aedes mosquito is prevalent. To lessen the effects of outbreaks and lower the
morbidity and mortality linked to this illness, comprehensive vector control measures,
surveillance, and prompt clinical management are crucial.
1.2. Machine Learning
Machine learning is a subset of artificial intelligence. It involves training algorithms to
make predictions or decisions based on data. Machine learning is used in various fields, such
as healthcare for diagnostics, finance for fraud detection, and retail for recommendation
systems. High-quality, relevant data is crucial for training effective models, as the model
learns patterns and insights from this data. Machine learning can be classified as “Supervised
Learning” (with labeled data), “Unsupervised Learning” (with unlabeled data), and
“Reinforcement Learning” (learning through rewards and penalties), as shown in Figure 3.

Figure 3: Types of Machine Learning [3]


1.3 Machine learning life cycle
Building an ML-based application involves seven steps, as shown in Figure 4:
• Collection of data
• Data Preparation
• Data Wrangling
• Data Analysis
• Model Training
• Model Testing
• Model Deployment

Understanding the given problem is very important to obtain good results in ML.
• Collection of data:
Gathering data is the first and most important phase. The model's success depends on the
accuracy and variety of the data gathered. Raw data is collected from sources such as
databases, sensors, web scraping, or public datasets. This step ensures that the data is relevant,
sufficient, and adequately represents the problem domain. The quality and quantity of the data
directly impact the effectiveness of the model. It is important to understand the data's origin and
any potential biases it may contain. The result of this step is a coherent collection of data termed
a “dataset”.
• Data Preparation:
After collection, the data must be prepared for the training phase. The collected data is cleaned
and organized to remove inconsistencies, such as missing values or duplicates. This step
includes formatting the data into a structured form, such as tables, and ensuring that all data
points are accurate and complete. Proper data preparation ensures that the dataset is ready for
analysis and modeling, and it is crucial for avoiding garbage-in, garbage-out scenarios in model
training. The data is then forwarded for processing.
• Data Wrangling:
Data wrangling transforms and maps the cleaned data into a format more suitable for analysis.
This process includes normalization, scaling, encoding categorical variables, and feature
engineering. Data wrangling aims to make the data more accessible and usable for model
training. It often involves iterating over multiple transformations to achieve the best dataset
quality.
• Data Analysis:
In this step the data is explored to understand its underlying patterns, distributions, and
relationships. Techniques such as statistical analysis, visualization, and hypothesis testing are
used. Data analysis helps identify trends, anomalies, and correlations, providing insights that
guide the choice of modeling techniques. It also helps in understanding the context and scope of
the problem being addressed.
• Model Training:
The prepared dataset is used to train a machine learning algorithm. The model learns patterns and
relationships within the data during this process, which involves adjusting model parameters to
minimize prediction errors. Training is typically performed on a training dataset, with
hyperparameters tuned for optimal performance. This step may involve selecting different
algorithms and comparing their performance.
• Model Testing:
The trained model's performance is evaluated on a separate testing dataset to assess its accuracy,
generalization, and robustness. Metrics such as accuracy, precision, recall, and F1 score can be
calculated here. Model testing helps identify overfitting or underfitting and ensures that the
model performs well on unseen data. It is essential for validating the model before
deployment.
• Model Deployment:
Deployment involves setting up the necessary infrastructure, such as APIs, and monitoring the
model's performance in the real world. It is crucial to build a model that can adapt as new data
becomes available. Continuous monitoring and maintenance are necessary to ensure the model
remains accurate and relevant over time.
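
The preparation, wrangling, and testing steps described above can be illustrated with a minimal pure-Python sketch. It drops incomplete or duplicate records, min-max scales a numeric feature, and computes accuracy, precision, recall, and F1 score for a binary task. The function names and the record layout (tuples, with None marking a missing value) are assumptions made for illustration only; they are not the pipeline used in this proposal.

```python
def prepare(records):
    """Data preparation: drop rows with missing values (None) and exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


def min_max_scale(values):
    """Data wrangling: rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classification_metrics(y_true, y_pred, positive=1):
    """Model testing: accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In practice these steps would normally be delegated to an established data-analysis library; the sketch only makes the arithmetic behind each step explicit.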

Figure 4: Machine learning life cycle [4] (stages shown: 1. Gathering Data, 2. Data Preparation, 3. Data Wrangling, 4. Analyse Data, 5. Train Model, 6. Test Model, 7. Deployment)


1.4 Supervised Learning Algorithms
“Supervised Learning” algorithms are a class of machine learning techniques in which the model is
trained on labeled data, meaning each training example is an input-output pair. The algorithm
learns a mapping from inputs to outputs by minimizing the error between its predictions and the
actual values. Common examples include linear regression, decision trees, and support vector machines.
• Regression:
Regression is a “Supervised Learning” technique used to predict a continuous target variable
based on one or more input features. By fitting a line or curve that minimizes the gap between
predicted and actual values, it models the relationship between the dependent variable
(output) and the independent variables (inputs). Linear regression and polynomial regression
are common forms of regression; logistic regression, despite its name, is used for binary
classification. The objective is to find patterns and produce precise forecasts for new datasets.
Regression is widely used in fields such as finance, economics, and biology for tasks like
forecasting and trend analysis.

Fig. 5 Regression analysis, where the x-axis is the input and the y-axis is the output [5]

The subcategories of regression are given below:


• Linear Regression:
Linear regression is a basic approach to predictive analysis. It models a linear relationship
between the independent and dependent variables. The model is called “Simple Linear
Regression” if there is only one input and “Multiple Linear Regression” if there are several
inputs [5].
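
As a minimal sketch of simple linear regression with a single input, the slope and intercept can be obtained from the ordinary least-squares closed form. The function name is an assumption made for illustration, and the inputs are assumed to contain at least two distinct x values.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points that lie exactly on a line, such as (1, 3), (2, 5), (3, 7), and (4, 9), the fit recovers that line's slope and intercept.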
• Polynomial Regression:
Polynomial regression can be used to model non-linear datasets. A non-linear curve is fitted
to describe the relationship between the independent and dependent variables. Polynomial features
of a chosen degree are first derived from the original features, and linear regression is then
applied to these features to create the model [6].
• Support Vector Regression:
Support vector machines can be used for both regression and classification. In support vector
regression, a hyperplane is found together with boundary lines on either side of it, chosen so
that the maximum number of data points lie within the boundary lines [7].
• Ridge Regression:
Ridge regression is a linear regression technique that includes a regularization term, specifically
the L2 norm of the coefficients, to prevent overfitting. It shrinks the coefficients towards zero,
but not exactly zero, by penalizing large coefficients. This method is especially useful when
dealing with multicollinearity in the data. The regularization parameter controls the extent of
shrinkage. It is also a “Regularization Approach” since it lessens the complexity of models [8].
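
The shrinkage effect can be seen in a minimal sketch for the simplest case: one feature and no intercept. Minimizing the sum of squared errors plus the L2 penalty `lam * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + lam)`. The function name and the symbol `lam` for the regularization parameter are assumptions made for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Single-feature ridge coefficient (no intercept).

    Minimizes sum((y - w * x) ** 2) + lam * w ** 2 over w.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

With `lam = 0` this reduces to ordinary least squares; increasing `lam` moves the coefficient towards zero without making it exactly zero.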
• Classification:
Classification predicts categories. Based on what the model learned in training, labels are
assigned to test data. Unlike regression, classification uses a categorical target variable.
A classification problem with only two labels is called binary classification; one with more
than two possible outputs is called multi-class classification [9].
• Logistic Regression:
“Logistic regression” is used to predict a categorical dependent variable based on one or more
independent variables. It employs a sigmoid (logistic) function to model the probability of the
dependent variable belonging to a particular category, typically classifying data into 0 or 1
based on a threshold. This technique can handle binomial, multinomial, and ordinal
classification tasks, making it versatile for various categorical prediction problems [10].
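
A minimal sketch of the prediction step for the binary case follows: a weighted sum of the features is passed through the sigmoid function and the resulting probability is compared with a threshold. The weights and bias are assumed to be already trained; the function names and the default threshold of 0.5 are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number to the interval (0, 1)."""
    # Note: very large negative z can overflow math.exp; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability that the sample belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Classify as 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0
```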
• K-nearest neighbours:
K-nearest neighbors (KNN) is a simple, non-parametric algorithm used for classification and
regression tasks. It classifies data points based on the majority label of their closest neighbors,
determined by a distance metric like Euclidean distance. The number of neighbors, denoted as
'k,' influences the decision boundary's smoothness. KNN is intuitive and effective for small
datasets but can be computationally expensive with large datasets. It is sensitive to the choice
of 'k' and the distance metric used.
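
The majority-vote rule can be sketched in a few lines of pure Python using Euclidean distance. The function name, the default `k = 3`, and the toy labels in the usage note are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

For example, with training points (0, 0) and (0, 1) labeled "neg" and (5, 5) and (6, 5) labeled "pos", a query at (0.5, 0.5) with k = 3 has two "neg" neighbours and one "pos" neighbour, so it is classified as "neg".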
• Support Vector Machine:
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression. It works by finding the hyperplane that best separates the data
into different classes, maximizing the margin between the closest points of the classes, known
as support vectors. SVM can handle linear and non-linear data by using kernel functions to
transform the input space. It is effective in high-dimensional spaces and is robust to overfitting,
especially in high-dimensional datasets. The choice of kernel and regularization parameter are
critical to its performance.
• Decision Tree:
In a decision tree, internal nodes represent features, branches represent decisions, and leaf
nodes give the predicted result. Entropy or the Gini index is used as the metric for splitting the data.
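
Both splitting metrics can be computed directly from the class labels that reach a node. The sketch below uses the standard definitions (Gini index as one minus the sum of squared class proportions, entropy in bits); the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of a node in bits: -sum(p * log2(p)) over classes."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )
```

A node containing only one class scores zero on both metrics, while an evenly split two-class node scores the two-class maximum (0.5 for Gini, 1 bit for entropy). A tree-building algorithm prefers splits that reduce these values the most.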
• Random Forest:
Decision trees may overfit when there is a lot of data to classify. Random forests address
this limitation by training many decision trees and combining their predictions, which
increases accuracy. This process is known as "bagging".
• Naïve Bayes Classifier:
The Naïve Bayes classifier categorizes data using Bayes' theorem. The term "naïve" refers to
the assumption that every feature is entirely independent of every other feature. The data is
divided into two parts: the feature matrix and the response vector. The feature matrix holds
the feature values, one row per record, and the response vector holds the outcome class of each row.
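
A minimal sketch of a Naïve Bayes classifier over a feature matrix and a response vector is given below. It multiplies the class prior by one conditional probability per feature, which is exactly where the independence assumption enters. The function names, the restriction to binary (0/1) features, and the add-one smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(feature_matrix, response_vector):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(response_vector)
    # value_counts[(label, feature_index)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(feature_matrix, response_vector):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
    return class_counts, value_counts


def predict_naive_bayes(model, row):
    """Return the class with the highest prior times product of likelihoods."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # class prior
        for index, value in enumerate(row):
            # Add-one smoothing; the "+ 2" assumes each feature is binary (0/1)
            score *= (value_counts[(label, index)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Smoothing keeps a feature value that was never seen with a class from forcing that class's score to zero.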
1.5 Unsupervised Learning Algorithms
“Unsupervised learning” uses unlabeled data, so the patterns in the data are not known in
advance. The algorithms examine the structure of the data using various methods. Clustering is
the most commonly used unsupervised learning technique.
• Artificial Neural Networks (ANN):
Artificial Neural Networks (ANNs) are computational models inspired by the human brain,
consisting of interconnected layers of nodes (neurons). They learn to recognize patterns by
adjusting the weights of connections through training with data. ANNs are capable of handling
complex, non-linear relationships and are widely used in tasks like image and speech recognition.
They are the foundation of deep learning techniques, which involve multiple hidden layers for
enhanced learning.
• Deep Neural Network (DNN):
A DNN is a flexible model that uses a large number of nodes to examine the data. It is made up
of three kinds of layers: an input layer, an output layer, and numerous intermediate hidden
layers. DNNs are used for two reasons: progressive learning and the flexibility to adjust outcomes.
• Convolutional Neural Network (CNN):
CNNs are well suited to image recognition because of their high accuracy and their ability to
capture orientation. They have several types of layers, including convolutional, pooling, and
fully connected layers.
1.6 Clustering techniques
• K-means clustering:
K-means clustering is an unsupervised learning algorithm that partitions data into K distinct
clusters based on feature similarity. It iteratively assigns data points to the nearest cluster centroid
and recalculates centroids until convergence. The goal is to minimize the variance within each
cluster.
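
The assign-and-recompute loop can be sketched as follows. To keep the example deterministic, the initial centroids are passed in explicitly rather than chosen at random; that choice, the function name, and the iteration cap are assumptions made for illustration.

```python
import math


def k_means(points, centroids, max_iters=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

The result depends on the initial centroids, so practical implementations usually run the algorithm from several random initializations and keep the clustering with the lowest within-cluster variance.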
• K-medoids:
K-medoids is used when the data contains a large number of outliers. Data points are selected
at random as the initial medoids, and the remaining data points are assigned to the medoid at
minimum distance. After a number of iterations, we obtain medoids that correctly cluster the
dataset.
• Fuzzy C-means:
Fuzzy C-means starts from randomly initialized cluster centroids. The membership parameter αjk
gives the degree to which a data point is associated with a specific cluster. The fuzziness
parameter, denoted by F, is always greater than 1.
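
The text does not give a formula for αjk. The sketch below assumes the standard fuzzy C-means membership rule, in which the membership of a point in a cluster is the reciprocal of the sum, over all clusters c, of (d_k / d_c) raised to the power 2 / (F - 1), with d_k the point's distance to cluster k. The function name and the default F = 2 are also assumptions made for illustration.

```python
def fuzzy_memberships(distances, fuzziness=2.0):
    """Memberships of one data point in every cluster.

    distances: distance from the point to each cluster centroid (all > 0).
    fuzziness: the parameter F from the text; must be greater than 1.
    """
    if fuzziness <= 1:
        raise ValueError("fuzziness parameter F must be greater than 1")
    exponent = 2.0 / (fuzziness - 1.0)
    memberships = []
    for d_k in distances:
        denominator = sum((d_k / d_c) ** exponent for d_c in distances)
        memberships.append(1.0 / denominator)
    return memberships
```

The memberships of a point sum to 1, and the closer a centroid is, the larger its share. As F approaches 1 the assignment approaches the hard assignment used by K-means.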
1.7 Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions
by interacting with an environment. The agent receives rewards or penalties based on its actions
and aims to maximize cumulative rewards over time. It uses exploration and exploitation
strategies to discover the best actions. Reinforcement learning is particularly effective for
problems with a long-term goal, such as game playing and robotics. Key components include the
state, action, reward, and policy.
• Q-learning: Q-learning addresses sequential decision-making. The best action is computed
for each input (state), and each input depends on the output that came before it. The goal
function Fπ can be minimized or maximized depending on the problem, where π denotes the
policy and P denotes a state. The agent's interactions with the environment determine its
optimal course of action.
• Markov decision process: This framework centres on learning from interaction to achieve a
particular objective. The reinforcement agent and the environment are in constant
communication: the agent chooses an action, and the environment responds. This allows the
environment to continuously present the agent with new situations to explore.
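
As a minimal sketch of how an agent can learn from such interaction, the standard tabular Q-learning update moves the value of a state-action pair towards the received reward plus the discounted best value of the next state. The learning rate `alpha`, the discount factor `gamma`, the function name, and the dictionary layout of the Q-table are assumptions made for illustration; they do not appear in the text above.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value.

    q_table maps state -> {action: value}.
    """
    next_values = q_table.get(next_state)
    # Value of the best action available in the next state (0 if unknown/terminal)
    best_next = max(next_values.values()) if next_values else 0.0
    old_value = q_table[state][action]
    q_table[state][action] = old_value + alpha * (
        reward + gamma * best_next - old_value
    )
    return q_table[state][action]
```

Repeating this update over many interactions, while balancing exploration and exploitation when choosing actions, lets the table converge towards the values of the best policy.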
1.8 Dengue Prediction using clinical markers
Clinical markers are measurable characteristics or indicators used by healthcare professionals
to assess a patient's health status, diagnose diseases, monitor disease progression, or evaluate
the effectiveness of treatment. These markers can include physical symptoms, laboratory test
results, imaging findings, and other observable signs of disease or physiological changes in the
body. In the context of Dengue fever, clinical markers are the symptoms and
laboratory findings that help healthcare providers diagnose and manage the illness. For
example, symptoms such as fever, headache, muscle and joint pain, rash, and nausea are
important clinical markers commonly associated with Dengue fever. Additionally, laboratory
markers such as platelet count, white blood cell count, liver enzyme levels, and the presence of
the NS1 antigen and Dengue-specific antibodies (IgM, IgG) serve as clinical markers for
diagnosing and monitoring Dengue infection. The blood parameters from which Dengue can be
detected are listed in Table 1.
Table 1: Blood Parameters

| Sr. No. | Blood Parameter | Role of the Blood Parameter in the Human Body |
|---|---|---|
| 1 | Haemoglobin Levels | Measure of the oxygen-carrying protein in red blood cells. May decrease due to haemorrhage associated with severe cases. Normal range: 13.8 to 17.2 grams per decilitre (g/dL) for men, 12.1 to 15.1 g/dL for women. |
| 2 | Platelet Count | Evaluates blood clotting ability. Typically decreases significantly during Dengue infection. Normal range: 150,000 to 450,000 platelets per microlitre (mcL) of blood. |
| 3 | White Blood Cell Count | Reflects immune response and the presence of infection. May decrease in the early stage but could increase later due to secondary infections. Normal range: 4,500 to 10,000 white blood cells per microlitre of blood. |
| 4 | Red Blood Cell Count | Counts red blood cells per microlitre of blood; it typically stays within the normal range unless there is significant bleeding. Normal range: 4.32 to 5.72 million cells per microlitre for men, 3.90 to 5.03 million cells per microlitre for women. |
| 5 | Haematocrit Levels | Shows the proportion of red blood cells in the blood volume. May rise because of haemoconcentration during plasma leakage. Normal range: 40.7% to 50.3% for men, 36.1% to 44.3% for women. |
| 6 | Presence of NS1 Antigen | Early indicator of Dengue virus infection. Detectable within the first week of infection. |
| 7 | IgM and IgG Antibodies | IgM antibodies appear during the acute phase, while IgG antibodies indicate past infection or immunity. |
| 8 | Decrease in Platelet Count | A significant decrease is a hallmark of Dengue infection, leading to thrombocytopenia. |
| 9 | Presence of Dengue Virus RNA through PCR Testing | Detects the presence of dengue virus genetic material in the blood, allowing for early diagnosis. |
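
As a minimal sketch of how the normal ranges in Table 1 could be turned into model features, the function below flags a measured value as low, normal, or high. The numeric ranges are copied from Table 1; the dictionary keys, the function name, and the three-level output are assumptions made for illustration. This is a feature-engineering sketch, not a diagnostic rule.

```python
# Normal ranges copied from Table 1; key names are hypothetical.
# A sex of None means the range in Table 1 is not sex-specific.
NORMAL_RANGES = {
    ("haemoglobin_g_dl", "male"): (13.8, 17.2),
    ("haemoglobin_g_dl", "female"): (12.1, 15.1),
    ("platelets_per_mcl", None): (150_000, 450_000),
    ("wbc_per_mcl", None): (4_500, 10_000),
    ("rbc_million_per_mcl", "male"): (4.32, 5.72),
    ("rbc_million_per_mcl", "female"): (3.90, 5.03),
    ("haematocrit_pct", "male"): (40.7, 50.3),
    ("haematocrit_pct", "female"): (36.1, 44.3),
}


def flag_parameter(name, value, sex=None):
    """Return 'low', 'normal', or 'high' relative to the Table 1 range."""
    # Fall back to the sex-independent range when no sex-specific one exists
    key = (name, sex) if (name, sex) in NORMAL_RANGES else (name, None)
    low, high = NORMAL_RANGES[key]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"
```

For example, `flag_parameter("platelets_per_mcl", 90_000)` returns "low", which matches the decrease in platelet count described in Table 1.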
1.9 Data Collection
Dengue data will be drawn from publicly available datasets [23] [24]. The Dengue data will be
collected along with the details of healthy individuals.
2. Literature Survey
Developing accurate Dengue prediction models is essential for timely intervention and resource
allocation. Literature on Dengue prediction encompasses various approaches, including
epidemiological studies, mathematical modeling, and machine learning techniques. Researchers
have explored various factors influencing Dengue transmission, such as climate variables,
socioeconomic factors, and population mobility patterns. Despite progress, challenges persist,
including data availability, model generalizability, and the dynamic nature of Dengue
transmission. Addressing these challenges is crucial for the development of robust prediction
frameworks.
Mohammad et al. [11] presented the paper “Detection of Dengue Disease Empowered with
Fused Machine Learning”, which aims to predict the illness early in order to prevent it. With
96.19% accuracy, their fused model identified whether the dengue diagnosis was positive or
negative and whether the dengue was DHF (Dengue Hemorrhagic Fever). The Prediction Fused
Dengue Model (PFDM) is used to predict whether dengue identification is positive or negative.
Fuzzy logic is employed together with SVM and ANN classifiers. The authors plan to extend the
model to make it physician-friendly.

Hamzat et al. [12] presented the paper “Machine Learning Diagnosis of Dengue Fever: A Cost-
Effective Approach for Early Detection and Treatment”, which uses machine learning classifiers
to identify dengue and evaluates how cost-effective they are in comparison with traditional
techniques. The classifiers obtained remarkable accuracy rates: 100% for Random Forest, SVM,
and Naïve Bayes, and 97.3% for K-Nearest Neighbor. Future research will examine the financial
advantages of using machine learning techniques in treating diseases.

Nittaya et al. [13] presented the paper "A Multi-criteria Scheme to Build Model Ensemble for
Dengue Infection Case Estimation" and determined that, to create a strong ensemble, model
selection should consider only the top two and bottom two models. Seven models, including
ANN, SVR, and MLR, are employed. Future work will focus on enhancing the ensemble approach
to take into account more factors that can ensure the best possible prediction results, which
the current combination of models does not guarantee.

Bilal et al. [14] presented the paper "Early Diagnosis for Dengue Disease Prediction Using
Efficient Machine Learning Techniques Based on Clinical Data". The authors employed machine
learning techniques such as KNN, GBC, XGB, LightGBM, and the Extra Trees classifier (ETC) to
detect dengue accurately, and validated the outcomes of the proposed diagnostic model using
hold-out and K-fold cross-validation. The ETC model performed best, with 99.12% and 99.03%
accuracy in hold-out and 10-fold cross-validation, respectively. The authors plan to test it
independently using real-time data and to extend the algorithm's optimizations.

Lathesparan et al. [15] presented the paper "Artificial Neural Networks Based Dengue
Diagnosis Prediction Model". Principal Component Analysis (PCA) and wrapper feature
selection techniques were used to determine the important feature combinations. The wrapper
approaches employed Naïve Bayes, K-Nearest Neighbor (KNN), and J48 as their classifier
algorithms. Of the four feature selection techniques, PCA produced the ANN with the highest
accuracy. The two most important features, selected by all wrapper feature selection approaches,
are myalgia and retro-ocular pain. Additionally, PCA reduced the original 22-dimensional system
to an 8-dimensional system, a 59% variance reduction. The authors plan to improve the model's
accuracy in the future by adding more test cases.

Sheng-Wen et al. [16] presented the paper "Assessing the risk of dengue severity using
demographic information and laboratory test results with machine learning", in which the authors
develop precise models for quick prognosis of dengue severity using laboratory test results
and demographic data. The classifiers used are ANN, LR, random forests (RF), GBM, and SVM.
The ANN showed good discriminative capacity for the prognosis of severe dengue, with the
highest balanced accuracy (0.7523 ± 0.0273) and average area under the receiver operating
characteristic curve (0.8324 ± 0.0268). However, future research must validate the approach
further with external populations.

Archana T et al. [17] presented a paper "Forecasting Machine Learning Based Feature Selection
for Dengue Prediction in the Early Stage" to investigate dengue symptoms using statistical and
analytical methods. Classification and Regression Tree (CART) with an accuracy of 97%,
precision of 98.3%, recall of 96.8%, and F-measure metrics of 97.5% is a better fit than Random
Forest and SVM for the suggested Dengue prediction model. Primary symptoms were determined
by ranking and grouping the common aspects. In the future, the test reports will be analysed to
classify Dengue as DF, DHF, or DSS depending on the severity of the symptoms.
Kiran Deep Singh [18] presented the paper "Particle Swarm Optimization assisted Support
Vector Machine based Diagnostic System for Dengue prediction at the early stage", which aims to
identify Dengue early in order to enhance both the diagnostic procedure and the likelihood of
successful treatment. The paper also evaluates how well SVM and Particle Swarm Optimization
(PSO) mine the Dengue dataset and improve the accuracy of machine learning algorithms. The
PSO strategy is validated on traditional Dengue classification datasets and shown to
outperform other methods.

Md. Sanzidul et al. [19] presented the paper "A Study on Dengue Fever in Bangladesh:
Predicting the Probability of Dengue Infection with External Behavior with Machine Learning",
which proposes a model that estimates the likelihood of contracting dengue fever before a
pathology test is done. The study forecasts this likelihood from external behaviors such as
fever, pain, headaches, and sitophobia. By providing their physical symptoms as input, a
suspected patient may receive a preliminary diagnosis, which also lessens the reliance on the
pathology test for obtaining primary treatment. Classification models such as Naïve Bayes,
AdaBoost, SVM, KNN, Decision Tree, Random Forest, and Logistic Regression were employed.
The accuracy of the NN model is higher. The model may be integrated in the future.

Amiq et al. [20] presented the paper "Performance Evaluation of Classifiers for Predicting
Infection Cases of Dengue Virus Based on Clinical Diagnosis Criteria", which investigates, tests,
and assesses eight classification algorithms in order to determine which is the most
effective and efficient. The classification models used are NN, SVM, KNN, Decision Tree,
Random Forest, Naïve Bayes, AdaBoost, and Logistic Regression; the NN model is the most
accurate. In the future, accuracy, precision, and recall on the DBD-DKK dataset may be
improved by combining the model with other classification models or by including
features from different clinical diagnosis criteria.
William et al. [21] presented the paper "An autonomous cycle of data analysis tasks for the
clinical management of dengue", which creates an autonomous cycle of data analysis tasks to
aid decision-making in the clinical management of dengue. This is the first study to provide a
prescriptive model for the clinical management of dengue using an autonomic approach. SVM
and ANN classifiers are used to determine the type of dengue, and a Genetic Algorithm is used
to determine the optimal course of action. In the future, the models' performance could be
enhanced by including co-morbidities such as diabetes and arterial hypertension, since these
conditions affect the severity of dengue.
Jun et al. [22] presented the paper "A predictive analytics model using machine learning
algorithms to estimate the risk of shock development among dengue patients". The research
examines the significance of individual features for predicting dengue shock in dengue
patients. It compares the effectiveness of traditional machine learning algorithms with the
performance of ensemble learning techniques including bagging, AdaBoost, Gentle AdaBoost,
adaptive logistic regression, and robust boosting. The bagging algorithm improves upon the
individual decision tree by 14.5%, outperforming the other competing techniques. Day 2
haemoglobin (Hb) levels in the full blood count (FBC) are shown to be a strong predictor
of the likelihood of severe dengue. Future effort should concentrate on developing
integrated models that can automate prescribing decisions for diagnosis, epidemics, and
interventions, rather than concentrating solely on the dengue diagnostic model.

Table 2: Summary of Literature Survey

| Author | Title | Key Findings |
|---|---|---|
| Mohammad Rustom Al Nasar et al. [11] | “Detection of Dengue Disease Empowered with Fused Machine Learning” | The fused model reached 96.19% accuracy in determining whether the dengue diagnosis is positive or negative and whether the dengue is DHF. |
| Hamzat Salami et al. [12] | “Machine Learning Diagnosis of Dengue Fever: A Cost-Effective Approach for Early Detection and Treatment” | High accuracy rates: K-Nearest Neighbor has a 97.3% accuracy rate, whereas Naïve Bayes, SVM, and Random Forest have 100% accuracy. The most economical method is the one that is employed. |
| Nittaya Kerdprasop et al. [13] | "A Multi-criteria Scheme to Build Model Ensemble for Dengue Infection Case Estimation" | Using the top 2 and bottom 2 models together is an effective tactic. The ensemble model performs better than alternative ensemble tactics. |
| Bilal Abdualgalil et al. [14] | "Early Diagnosis for Dengue Disease Prediction Using Efficient Machine Learning Techniques Based on Clinical Data" | Created a diagnostic model for early dengue detection. Used machine learning methods such as KNN, GBC, XGB, LightGBM, and Extra Tree for precise detection. Validated the model's outcomes with K-fold and hold-out cross-validation. The ETC model was the most accurate, with 99.12% and 99.03% accuracy in hold-out and 10-fold cross-validation, respectively. |
| Lathesparan Ramachandran et al. [15] | "Artificial Neural Networks Based Dengue Diagnosis Prediction Model" | ANN with PCA and a learning rate of 0.01 demonstrates the highest accuracy, 73.41%. |
| Sheng-Wen Huang et al. [16] | "Assessing the risk of dengue severity using demographic information and laboratory test results with machine learning" | Conceptualized the research approach and methodology, curating and analyzing the data. Investigated the predictive models for severe dengue outcomes. |
| Archana T et al. [17] | "Forecasting Machine Learning Based Feature Selection for Dengue Prediction in the Early Stage" | Classification and Regression Tree (CART) has an accuracy of 97%, precision of 98.3%, recall of 96.8%, and F-measure of 97.5%, making it a better fit for the suggested Dengue prediction model. |
| Kiran Deep Singh [18] | "Particle Swarm Optimization assisted Support Vector Machine based Diagnostic System for Dengue prediction at the early stage" | Particle swarm optimization (PSO) and support vector machines (SVM) are combined to detect Dengue at an early stage: this is a one-size-fits-all approach to early outbreak identification. |
| Md. Sanzidul Islam et al. [19] | "A Study on Dengue Fever in Bangladesh: Predicting the Probability of Dengue Infection with External Behavior with Machine Learning" | Made predictions about the likelihood of contracting dengue fever based on external behaviors such as fever, pain, headaches, and sitophobia. Presents a model that can be used to estimate the likelihood of contracting dengue fever prior to doing a pathology test. |
| Amiq Fahmi et al. [20] | "Performance Evaluation of Classifiers for Predicting Infection Cases of Dengue Virus Based on Clinical Diagnosis Criteria" | Three categories (DF, DHF, and DSS) were predicted using eight different classification models; the NN model performed better across the board in all testing procedures. |
| William Hoyosa et al. [21] | "An autonomous cycle of data analysis tasks for the clinical management of dengue" | This is the first study to suggest a prescriptive model for the clinical management of dengue and also to support the clinical management of the disease using an autonomic approach. |
| Jun et al. [22] | "A predictive analytics model using machine learning algorithms to estimate the risk of shock development among dengue patients" | Examined the significance of certain features in predicting dengue shock in dengue patients. Examined how well ensemble learning techniques including robust boosting, adaptive logistic regression, bagging, AdaBoost, and gentle AdaBoost performed |
in comparison to traditional machine
learning algorithms.

3. Research Gaps
The following major research gaps were identified from the literature studied:
• There is scope for exploring the hematological parameters of Dengue subjects and their importance in building machine learning models.
• Very little research has been carried out on the detection and analysis of Dengue from hematological parameters using machine learning, ensemble models, and explainable AI tools.
• Very few studies have been reported on understanding the progression of the disease using machine learning or regression methods.
• Very little work has been carried out on differential diagnosis between dengue and dengue-like viral infections.
• Further, no multi-modal approaches have been reported for the detection of Dengue and its severity analysis.
4. Objectives of the Study
The key objectives of the research work are:
• To identify the important clinical markers of dengue viral subjects with feature engineering.
• To implement machine learning algorithms for the prediction of dengue and similar infections.
• To optimize and compare the developed machine learning models with explainability.
5. Detailed Methodology
All objectives will be implemented in Python using the Anaconda distribution. Two datasets will be used: one containing the complete blood count (CBC) results of various patients [24], and the other detailing information specific to Dengue patients [23]. The CBC dataset includes vital parameters such as white blood cell count, red blood cell count, hemoglobin level, and platelet count, offering a comprehensive snapshot of each patient's hematological health. The Dengue dataset focuses on patients diagnosed with Dengue and features data points such as the day of diagnosis and symptoms. By integrating these datasets, the module aims to detect potential Dengue cases from daily test samples by analyzing trends and improvements in the patient's blood parameters. This integration not only aids timely diagnosis but also monitors the progression and recovery of Dengue patients, providing critical insights into their health status over time.

The data will be loaded into appropriate CSV files for analysis and then wrangled: null values will be replaced by the mean value, and outliers will be removed to improve the accuracy of the model. Data visualization will be performed next; histograms, frequency polygons, stem-and-leaf plots, violin plots, and scatter plots will be used to examine the relationships between variables. Feature extraction and selection will follow. Attributes that are not related to the performance measures will be removed, and only external attributes will be used to train the classification models.

The data will then be divided into training and testing sets in an 80:20 ratio. Various classification models, such as logistic regression, decision trees, random forests, Naïve Bayes, and KNN, will be trained and then used to predict the test dataset. The predictions on the test data will be evaluated using accuracy, precision, recall, F1-score, the ROC curve, and the AUC. Once a model satisfies the evaluation criteria, it can be used to detect Dengue and to predict a severity score based on the hematological parameters.
The overall methodology is shown in Figure 6.

Figure 6: Overall flowchart of the proposed methodology
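
The data wrangling step of the methodology (mean imputation followed by outlier removal) could be sketched as follows. This is a minimal illustration and not the proposal's actual pipeline: the column names, the toy values, and the 1.5 x IQR rule for flagging outliers are all assumptions, since the methodology does not state how outliers are identified.

```python
import pandas as pd

# Hypothetical CBC records; column names and values are illustrative only.
cbc = pd.DataFrame({
    "wbc": [4.2, 5.1, None, 3.8, 4.9, 5.4, 4.6, 25.0],
    "platelets": [150.0, 100.0, 120.0, None, 110.0, 140.0, 130.0, 125.0],
})


def impute_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace null values in each numeric column with that column's mean."""
    return df.fillna(df.mean(numeric_only=True))


def drop_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values all lie within [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed choice; other outlier criteria are possible.
    """
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    within = (df >= q1 - k * iqr) & (df <= q3 + k * iqr)
    return df[within.all(axis=1)]


imputed = impute_mean(cbc)
cleaned = drop_iqr_outliers(imputed)
print(cleaned)
```

Imputing before removing outliers means an extreme value still influences the column mean used for filling; reversing the order is an equally defensible design choice.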

Algorithms used for Machine Learning


Supervised learning algorithms will be used for training and testing on the data: Logistic Regression, Decision Trees, Random Forest, K-Nearest Neighbour, and Support Vector Machines. The algorithms will be assessed by accuracy, sensitivity, specificity, the AUROC curve, and other classification metrics.
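
As a small worked illustration of the evaluation measures named above, the sketch below computes accuracy, sensitivity, specificity, and AUROC for a handful of made-up labels and scores. The data and the 0.5 decision threshold are assumptions, and the AUROC is computed with the rank-based (Mann-Whitney) formulation rather than by tracing the curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth (1 = Dengue positive) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
area = auroc(y_true, scores)
print(accuracy, sensitivity, specificity, area)
```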
5.1 Objective 1:
To identify the important clinical markers of dengue viral subjects with feature engineering

In the first objective, statistical analysis is used to understand the relationships between the various blood parameters and the values that could indicate Dengue. This allows both obvious and hidden trends in the data to be spotted. Statistics is mainly divided into two parts:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive statistics describe the properties or attributes of a dataset. The term can refer both to the individual quantitative summaries of the data (also called "summary statistics") and to the process of producing them.
The measures of descriptive statistics are:
• Distribution
• Central tendency
• Variability
Distribution: In a population or sample, the distribution indicates the frequency of the various outcomes (or data points). It can be displayed graphically or as numbers in a list or table.
Central tendency: Measures that describe a dataset's typical or central values are referred to as measures of central tendency. The term covers more than the median, or middle value of a dataset; it refers to a range of central measures and could, for example, include central values from the various quartiles of the dataset. Typical measures of central tendency are:
• The mean: the sum of all data points divided by the number of data points.
• The median: the middle value of the ordered dataset.
• The mode: the value that appears most frequently in the dataset.
Variability: A dataset's variability refers to the way its values are dispersed or spread out. Understanding the dataset's central tendency is necessary to interpret its variability. Like central tendency, variability is a broad term that covers a variety of metrics. Typical variability metrics are:
• Standard deviation: This indicates how much variation or dispersion there is. A low standard deviation implies that most values are near the mean; a high standard deviation indicates that the values are spread more widely.
• Minimum and maximum values: These are the lowest and highest values of a dataset or quartile.
• Range: This quantifies how widely the values are spread. It is found by subtracting the smallest value from the largest.
• Kurtosis: This quantifies the presence of extreme values, sometimes referred to as outliers, in the tails of a distribution. A distribution with few outliers in its tails has low kurtosis; one with many outliers has high kurtosis.
• Skewness: This quantifies the asymmetry of a dataset. A distribution whose right-hand tail is longer and fatter has positive skewness; one whose left-hand tail is longer and fatter has negative skewness.
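
The descriptive measures listed above can be computed with Python's standard library alone. The sketch below is illustrative: the platelet counts are invented, and skewness and kurtosis are computed as population standardized moments, which is one of several conventions in use.

```python
import statistics

# Hypothetical platelet counts (in 10^3 per microlitre); values are illustrative.
platelets = [150, 95, 120, 110, 140, 130, 120, 60]

mean = statistics.mean(platelets)
median = statistics.median(platelets)
mode = statistics.mode(platelets)
std_dev = statistics.pstdev(platelets)  # population standard deviation
minimum, maximum = min(platelets), max(platelets)
value_range = maximum - minimum


def standardized_moment(values, order):
    """Average of ((x - mean) / std) ** order over the dataset."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** order for x in values) / len(values)


skewness = standardized_moment(platelets, 3)             # < 0: longer left tail
excess_kurtosis = standardized_moment(platelets, 4) - 3  # 0 for a normal distribution

print(mean, median, mode, value_range)
print(round(std_dev, 2), round(skewness, 2), round(excess_kurtosis, 2))
```

In this toy sample the single low count of 60 pulls the left tail out, so the skewness comes out negative.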
Inferential Statistics:
Inferential statistics draw conclusions about a population from sample data, so the results are expressed as probabilities. Because the analysis works on a sample rather than on the entire population, random sampling must first be applied to the data.
Hypothesis testing:
A hypothesis test checks whether a result observed in the samples is likely to hold in general. Results that could plausibly have occurred by chance are rejected using this technique.
Confidence Interval:
Sample data can be used to estimate certain population measurement parameters. Rather than
giving a single mean value, the confidence interval gives a range of values.
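
A confidence interval for a mean could be computed as in the sketch below. The hemoglobin values are hypothetical, and the interval uses a normal approximation; for samples this small a t-distribution would usually be preferred, so the result should be read as illustrative only.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the population mean, using a normal approximation."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Hypothetical hemoglobin measurements (g/dL) from a random sample of patients.
hemoglobin = [13.1, 12.4, 14.0, 11.8, 13.5, 12.9, 13.3, 12.2, 13.8, 12.7]

low, high = mean_confidence_interval(hemoglobin)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Raising the confidence level widens the interval: a 99% interval from the same sample is broader than the 95% one.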
Regression and Correlational Analysis:
To understand the relationships between variables, regression and correlation analysis can be applied. Regression looks at the potential effects of several input variables on a single dependent variable.
Correlation measures the degree of association between variables. Unlike regression, correlation is not used to infer cause and effect.
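
To make the distinction concrete, the sketch below computes a Pearson correlation coefficient and fits a least-squares line for a single predictor. The day-of-illness and platelet values are invented for illustration, and the single-predictor setting is a simplification of the multi-variable regression described above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: day of illness vs. platelet count (10^3 per microlitre).
day = [1, 2, 3, 4, 5]
platelets = [200, 180, 150, 100, 70]

r = pearson_r(day, platelets)
slope, intercept = fit_line(day, platelets)
print(f"r = {r:.3f}, platelets ~ {slope:.1f} * day + {intercept:.1f}")
```

The coefficient summarizes only how strongly the two variables move together; the fitted line is what would be used to estimate a platelet count for a given day.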

Figure 7: Types of Statistics


The steps followed in the first objective would be as follows.
Data Cleaning:
In the first step, the data is cleaned and organized. Null values are replaced, and outliers and redundant values are removed.
Quality Analysis:
Descriptive statistics are used here. The mean, median, mode, standard deviation, etc. are calculated to analyze data trends. Histograms and frequency polygons are also used.
Analysis:
After the descriptive analysis, we determine whether each analysis is univariate or bivariate and whether each variable is nominal or ordinal. Scatter plots, stem-and-leaf plots, and box plots are used. A hypothesis is generated at the end of this step.
Stability of results:
Various hypothesis tests, such as t-tests, z-tests, and chi-square tests, are used to validate the hypothesis. After validation, we extract the attributes for ML analysis.
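
As one example of the validation step, a chi-square test of independence on a 2x2 table could look like the sketch below. The counts are invented, the pairing of low platelet count with Dengue status is only an assumed example of a hypothesis, and no continuity correction is applied.

```python
import math


def chi_square_2x2(table):
    """Chi-square statistic and p-value for a 2x2 contingency table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts. Rows: low platelet count (yes, no).
# Columns: Dengue status (positive, negative).
observed = [[30, 10],
            [10, 30]]

chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

A small p-value here would suggest that the attribute is associated with the outcome and is worth keeping for the ML analysis.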

Figure 8: Flowchart of statistical analysis
5.2 Objective 2:
To implement machine learning algorithms for the prediction of dengue and similar
infections.
Machine learning tools include frameworks and libraries such as TensorFlow, PyTorch, and scikit-learn, which provide interfaces for building and training models. These tools support tasks such as classification, regression, clustering, and neural network development.
The following steps will be followed for the design and implementation of the second objective:
1. Use available datasets containing the blood test details, with various parameters, of Dengue-positive and Dengue-negative patients.
2. Transform the data into a standard format and perform data wrangling (replace null values; remove outliers and redundant values).
3. Perform data visualization of the various parameters and understand the relationships between the variables and Dengue.
4. Perform feature extraction and selection, reduce dimensionality, and remove the non-essential parameters.
5. Split the dataset into training and testing sets in an 80:20 ratio.
6. Perform this split with the train_test_split function of the scikit-learn library.
7. Train classification models, such as logistic regression, decision trees, random forests, KNN, and Naïve Bayes, to learn the relationships between the features and the Dengue outcome.
8. After training, assess each model using the test data. The model should show little variance. Performance can be measured using metrics such as accuracy, precision, recall, F1-score, AUC, and ROC curves.
9. If the model fails the evaluation, perform parameter tuning.
10. Once the model is ready, it can predict whether a patient is Dengue positive or not, based on what it learned from the training set.
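
Steps 5 to 8 above could be sketched with scikit-learn as follows. This is a hedged illustration rather than the study's implementation: the data are synthetic (generated with make_classification in place of the real blood test records), and the stratified split, the random seeds, and the max_iter setting are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the blood test features and the Dengue label.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)

# Steps 5 and 6: 80:20 split (stratification is an assumed refinement).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 7: the classifiers named in the list above.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
}

# Step 8: evaluate every model on the held-out test data.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The scores printed here say nothing about Dengue data; they only show how the metrics would be collected side by side for comparison.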

Figure 9: Flowchart of Dengue diagnosis using Machine Learning

5.3 Objective 3:
To optimize and compare the developed machine learning models with explainability
The following steps will be followed for the design and implementation of the third objective:

1. Collect the blood test details, with various parameters, of Dengue-positive and Dengue-negative patients at KMC Manipal and T.M.A Pai Hospital Udupi.
2. Transform the data into a standard format and perform data wrangling (replace null values; remove outliers and redundant values).
3. Perform data visualization of the various parameters and understand the relationships between the variables and Dengue.
4. Perform feature extraction and selection, reduce dimensionality, and remove the non-essential parameters.
5. Split the dataset into training and testing sets in an 80:20 ratio.
6. Perform this split with the train_test_split function of the scikit-learn library.
7. Apply classification algorithms such as logistic regression, decision tree, random forest, KNN, and Naïve Bayes to learn the relationships between the attributes and the Dengue outcome.
8. Compare model performance on the validation set to identify the best-performing models.
9. Explainability Techniques:
a. Feature Importance: Use methods like SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), or permutation importance to identify important features.
b. Model-Specific Explanations: For tree-based models, visualize decision trees or
use feature importance scores; for neural networks, apply techniques like integrated
gradients or DeepLIFT.
c. Global vs. Local Explanations: Provide both global model interpretability (overall
feature importance) and local explanations (individual prediction reasons).
10. Optimization:
a. Algorithm Optimization: Further tune the hyperparameters of the best-performing
models using more refined search spaces or advanced optimization techniques.
b. Ensemble Methods: Combine multiple models to create an ensemble model (e.g.,
stacking, bagging, boosting) for improved performance and robustness.

c. Regularization: Apply techniques like L1/L2 regularization or dropout to prevent
overfitting and improve model generalizability.
11. Compare the optimized models based on performance metrics and explainability and select
the final model based on a balance of high performance and satisfactory explainability.
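
The tuning and explainability steps above (items 9 and 10) might be combined as in the sketch below, which tunes a random forest with a small grid search and then ranks features by permutation importance. The synthetic data, feature names, parameter grid, and F1 scoring are assumptions; SHAP and LIME are separate packages and are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the hospital data; feature names are hypothetical.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Item 10a: hyperparameter tuning over a deliberately small, assumed grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1")
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Item 9a: global explanation via permutation importance on the test data.
importance = permutation_importance(best_model, X_test, y_test,
                                    n_repeats=10, random_state=0)
ranking = sorted(zip(feature_names, importance.importances_mean),
                 key=lambda pair: pair[1], reverse=True)

print("best parameters:", search.best_params_)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Permutation importance is model-agnostic, so the same ranking step could be applied unchanged to any of the other classifiers being compared.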

Figure 10: Flowchart of machine learning models with explainability

6. Importance of proposed research
Dengue fever is a serious public health risk, especially in tropical and subtropical areas. Accurate and early prediction of dengue can help contain epidemics and lower mortality rates. To improve early diagnosis and treatment, this research focuses on refining machine learning models that predict dengue virus infections from clinical markers.

Compared with more conventional techniques, optimized machine learning techniques can greatly increase the accuracy and reliability of dengue prediction. This improvement may lead to better resource allocation and preparedness for dengue outbreaks.
Explainability helps medical professionals understand and trust the models' predictions, which makes it easier to integrate these models into clinical practice. The adoption of AI tools in healthcare depends on this.
7. Research Schedule

Figure 11: Research Schedule

8. Preliminary Work Done
As preliminary work, I have carried out the following implementation and enhanced my research skills:
8.1 Dengue Detection
Machine learning was used to diagnose Dengue, using a dataset available on the Kaggle website. The test dataset has details of 501 patients with 18 attributes, and the training dataset has details of 1456 patients with 4 attributes.
The pandas library was used to extract and manipulate the features. The data was loaded into a pandas DataFrame named df, from which the features and the target were selected. The selected features are RBC and WBC, and the target is Dengue_ns1. The DataFrame was then used to train the models and calculate their accuracy.
The logistic regression classifier achieved an accuracy of 0.59, and the decision tree classifier achieved an accuracy of 0.673.
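
The workflow described above could be reconstructed roughly as follows. Only the names df, RBC, WBC, and Dengue_ns1 come from the description; the data here are synthetic, a single 80:20 split stands in for the separate Kaggle training and test files, and the printed accuracies will therefore not match the reported 0.59 and 0.673.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data; values are illustrative only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "RBC": rng.normal(4.7, 0.5, n),
    "WBC": rng.normal(6.0, 2.0, n),
})
# Made-up label loosely tied to a low WBC count, purely for demonstration.
df["Dengue_ns1"] = (df["WBC"] + rng.normal(0, 1.5, n) < 5.5).astype(int)

# Select the features and the target from the DataFrame.
X = df[["RBC", "WBC"]]
y = df["Dengue_ns1"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

print(accuracies)
```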

8.2 Courses Completed


I have completed the following courses to strengthen my coding and research skills, which will help me with the implementation.
• Machine Learning with Python, IBM (Coursera)
• Research and Publication Ethics (2 Credits): an online course completed in 4 weeks. The course covers scientific conduct in research, publishing best practices, open-access publishing, research metrics and databases, scientific misconduct, and philosophy and ethics in research and publications. It comprises ten quizzes and four discussion boards. I completed the course with an A+ grade.
9. Expenses and Funding
The research work will be managed using departmental resources.
References
[1] Dengue and severe dengue. https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue
[2] The revised WHO dengue case classification: does the system need to be modified? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3381438/

[3] Java T point, https://www.javatpoint.com/machine-learning-life-cycle
[4] Data-Driven investor, https://medium.datadriveninvestor.com/machine-learning-101-part-1-24835333d38a
[5] Kamil, M., Rahardja, U., Sunarya, P. A., Aini, Q., & Santoso, N. P. L. (2020,
November). Socio-economic perspective: Mitigate covid-19 impact on education.
In 2020 Fifth International Conference on Informatics and Computing (ICIC) (pp. 1-
7). IEEE.
[6] Nicola, M., Alsafi, Z., Sohrabi, C., Kerwan, A., Al-Jabir, A., Iosifidis, C., ... & Agha,
R. (2020). The socio-economic implications of the coronavirus pandemic (COVID-
19): A review. International journal of surgery, 78, 185-193.

[7] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Machine learning basics. Deep
learning, 1, 98-164.
[8] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and
prospects. Science, 349(6245), 255-260.
[9] Sra, S., Nowozin, S., & Wright, S. J. (Eds.). (2012). Optimization for machine
learning. MIT Press
[10] CIT, 2012. [Online]. http://cit.mak.ac.ug/index.php/projects-a-collaborations/257
[11] Al Nasar, M. R., Nasir, I., Mohamed, T., Elmitwally, N. S., Al-Sakhnini, M. M., &
Asgher, T. (2022, October). Detection of Dengue Disease Empowered with Fused
Machine Learning. In 2022 International Conference on Cyber Resilience
(ICCR) (pp. 01-10). IEEE.
[12] Salami, H., Ebeh, J. E., & Aminu, Y. O. (2023). Machine Learning Diagnosis of Dengue
Fever: A Cost-Effective Approach for Early Detection and Treatment. Ovidius University
Annals, Economic Sciences Series, 23(1), 229-238.
[13] Kerdprasop, N., Kerdprasop, K., & Chuaybamroong, P. (2020, November). A multi-criteria
scheme to build model ensemble for dengue infection case estimation. In 2020 International
Conference on Decision Aid Sciences and Application (DASA) (pp. 214-218). IEEE.
[14] Abdualgalil, B., Abraham, S., & Ismael, W. M. (2022). Early diagnosis for dengue disease
prediction using efficient machine learning techniques based on clinical data. Journal of
Robotics and Control (JRC), 3(3), 257-268.
[15] Ramachandran, L., Rathnayaka, R. K. T., & Wickramaarachchi, W. U. (2021, December).
Artificial Neural Networks Based Dengue Diagnosis Prediction Model. In 2021 IEEE 16th
International Conference on Industrial and Information Systems (ICIIS) (pp. 265-270).
IEEE.
[16] Huang, S. W., Tsai, H. P., Hung, S. J., Ko, W. C., & Wang, J. R. (2020). Assessing the risk
of dengue severity using demographic information and laboratory test results with machine
learning. PLoS neglected tropical diseases, 14(12), e0008960.
[17] Archana, T. (2023, December). Forecasting Machine Learning Based Feature Selection for
Dengue Prediction in the Early Stage. In 2023 3rd International Conference on Mobile
Networks and Wireless Communications (ICMNWC) (pp. 1-6). IEEE.
[18] Singh, K. D. (2021, December). Particle swarm optimization assisted support vector
machine based diagnostic system for dengue prediction at the early stage. In 2021 3rd
International Conference on Advances in Computing, Communication Control and
Networking (ICAC3N) (pp. 844-848). IEEE.
[19] Islam, M. S., Khushbu, S. A., Rabby, A. S. A., & Bhuiyan, T. (2021, May). A study on
dengue fever in bangladesh: Predicting the probability of dengue infection with external
behavior with machine learning. In 2021 5th International Conference on Intelligent
Computing and Control Systems (ICICCS) (pp. 1717-1721). IEEE.
[20] Fahmi, A., Purwitasari, D., Sumpeno, S., & Purnomo, M. H. (2020, September).
Performance evaluation of classifiers for predicting infection cases of dengue virus based
on clinical diagnosis criteria. In 2020 International Electronics Symposium (IES) (pp. 456-
462). IEEE.
[21] Hoyos, W., Aguilar, J., & Toro, M. (2022). An autonomous cycle of data analysis tasks for
the clinical management of dengue. Heliyon, 8(10).
[22] Chaw, J. K., Chaw, S. H., Quah, C. H., Sahrani, S., Ang, M. C., Zhao, Y., & Ting, T. T.
(2024). A predictive analytics model using machine learning algorithms to estimate the risk
of shock development among dengue patients. Healthcare Analytics, 5, 100290.
[23] Data. Statistics on day-to-day records of Dengue-Patients. 2022; GitHub. Available online:
https://github.com/MananKr/Dengue-Patient-Detection.git (accessed on 06 May 2024).
[24] Data. Dengue Detection Patient-Detection from Complete Blood Count. 2023; GitHub.
Available online: https://github.com/Sohan2087/A-Stacking-Ensemble-Approach-for-Robust-Dengue-Patient-Detection-from-Complete-Blood-Count-Data.git

Course Work Details
Sr. No   Course Name                                                             Credits   Status
1        Research Methodology                                                    4         Registered (September-October 2024)
2        Research Publication Ethics                                             2         Course Completed
3        (BME 7007) Machine learning (customized)                                4         Approved
4        (BME7008) Infectious disease and clinical data analytics (customized)   2         Approved
