Using Sentiment Analysis and Machine Learning Algorithms To Determine Citizens' Perceptions
Abstract—More than 400 million people in the world have diabetes. High-risk factors of diabetic individuals vary dramatically, and many patients suffer complications and avoidable harm. Improving the identification level of high-risk factors would help to reduce the rate of complications. To do this, it is essential to analyze a person's medical record, detailed health information whose review currently requires doctors and is manual, time-consuming, and subjective. In this work, we introduce an approach to automatically predict type 2 diabetes mellitus (T2DM) by applying a neural network. The objective of this paper is to find which type of model works best for predicting diabetes. We used the Pima Indian Diabetes data-set in this analysis. The analysis was carried out on this database using two methods. The first method includes data recovery followed by feature selection; we input these features to the MLP neural network classifier, which achieved an accuracy of 85.15%. In our second approach, we applied a k-means-based noise reduction method followed by feature selection. The features thus obtained are used with Random Forest, Logistic Regression, and MLP neural network classifiers. The maximum accuracy obtained among these classifiers is 77.08%. The comparison shows why data recovery with the MLP is far better than k-means-based noise reduction with the different types of classifiers.

I. INTRODUCTION

Diabetes Mellitus (DM) is a significant public health problem that is approaching epidemic proportions globally [1]. It has notably increased in the 21st century. Diabetes is caused by several factors, including obesity, consumption of unhealthy food, heredity, etc. As of 2015, about 415 million people had diabetes worldwide, and the trend suggests that the rate will continue to rise. Diabetes has some serious long-term complications, including cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and damage to the eyes. For these reasons, researchers need to put more focus on this problem.

There are three kinds of diabetes. First, Type 1 DM, which is caused by a failure of the pancreas to produce sufficient insulin. Second, Type 2 DM, the most common form, whose causes are identified as excessive body weight and insufficient exercise. Third, gestational diabetes, which occurs in pregnant women with no prior history of diabetes. Type 2 DM makes up about 90% of the cases.

Data analysis has been successfully applied to various fields of human society, such as weather prognosis, market analysis, engineering diagnosis, and customer relationship management. However, the utilization of disease prediction and medical data analysis still has room for improvement. Every hospital possesses different kinds of necessary medical information, and it is essential to extract useful information from these data to support future medical analysis and diagnosis [2, 3]. It is rational to believe that several valuable patterns are waiting for researchers to examine them.

As the number of diabetes patients is increasing, it is necessary to build a model that can classify patients with a high risk of diabetes. In the future, the identified high-risk factors could potentially prevent more cases of diabetes.

II. PIMA INDIAN DIABETES DATABASE

The Pima Indian data-set [2] was originally obtained from the database of the National Institute of Diabetes and Digestive and Kidney Diseases of the United States. The objective of the data-set is to diagnostically predict whether or not a patient has diabetes, based on specific diagnostic measurements included in the data-set. Several constraints were placed on the selection of these instances from a more extensive database. In particular, all patients here are females of Pima Indian heritage who are at least 21 years old. The information consists of 768 patients (268 instances of class 1 and 500 instances of class 0) coming from a population near Phoenix, Arizona, USA; 1 and 0 indicate whether or not the patient has diabetes, respectively. Each instance comprises 8 attributes, which are all numeric.

The data-set consists of several medical predictor variables and one target variable, the outcome. The predictor variables available in the database include the number of times pregnant (preg), plasma glucose concentration at 2 h in an oral glucose tolerance test (plas), diastolic blood pressure (pres), triceps skinfold thickness (skin), 2-h serum insulin (insu), body mass index (bmi), diabetes pedigree function (pedi), age, and the class variable (class).
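To make the attribute layout concrete, the following is a minimal sketch of loading the data-set with pandas. The file name diabetes.csv is an assumption and may differ depending on where the data-set is obtained; the short column labels are those listed above.

import pandas as pd

# Assumed local copy of the Pima Indian Diabetes data-set.
COLUMNS = ["preg", "plas", "pres", "skin", "insu", "bmi", "pedi", "age", "class"]

# Many distributions ship a header row; header=0 skips it and
# names=COLUMNS applies the short labels used in this paper.
df = pd.read_csv("diabetes.csv", header=0, names=COLUMNS)

print(df.shape)                    # expected: (768, 9)
print(df["class"].value_counts())  # expected: 500 negative (0), 268 positive (1)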
III. RELATED WORKS

Artificial intelligence (AI) techniques are used today to enhance and improve our regular lifestyle. Uses of AI techniques span modeling and analysis for hemoglobin level identification [3]–[7], activity detection [8], [9], and pain level detection [10], [11], as well as prediction models to identify the high-risk factors of diabetes, including [12]–[15].
Kamer Kayaer et al. [12] used the PID dataset to evaluate a perceptron-like general regression neural network (GRNN). This study had 576 cases in the training set and 192 cases in the test set. Using the 576 training instances, the sensitivity and specificity of their algorithm were 80.21% on the remaining 192 instances. The same number of random training and test sets was used to compare the simulation results. Dilip Kumar et al. [13] used naive Bayes with a genetic algorithm to evaluate the prediction; the accuracy and specificity of their algorithm were 78.69% on the test set. Manjeevan S. et al. [14] used a fuzzy min-max (FMM) neural network to evaluate the model, and the accuracy of the algorithm was 78.39%. Hayashi, Y., & Yukita, S. (2016) [15] used the Recursive-Rule extraction algorithm with J48graft combined with sampling selection techniques and achieved 83.83% accuracy.
IV. METHODOLOGY

We introduce a rich analysis of the features available in the PIMA database to identify the risk factors of diabetes. We develop two approaches leveraging machine learning algorithms, which to our knowledge have not been previously studied. We show that our method is able both to efficiently detect the risk factors and to significantly outperform previous work on risk factor detection for diabetic diseases. Our first method is accomplished by applying a neural network model. The second process is developed based on the K-means algorithm.
A. Neural network-based method

We use a neural network model in this process. There are three steps in this section: data recovery, feature selection, and the M.L.P. classifier. As a first step, data recovery techniques are applied by replacing the missing data with the mean value to make the dataset complete for building a model. Then, we perform feature selection, which is done to find the features that have the most impact on risk factor identification. Lastly, a suitable set of hyper-parameters is selected that works well for this data-set.
B. K-means-based method

We apply the K-means algorithm in this section after selecting the features. We also use different machine learning classifiers to compare the outcomes. The k-means algorithm effectively reduces noise from the data, and the output of the k-means algorithm is used as a feature for the model. The classification methods are applied to the selected features to see the result.
V. APPLYING DATA RECOVERY WITH NEURAL NETWORK MODEL

In this process, we apply data processing, feature selection, and a machine learning algorithm sequentially.

A. Data recovery

The Pima Indian dataset contains missing data in several features, including blood pressure, insulin level, skin thickness, BMI, and glucose levels. We observe zero entries in 374 insulin, 227 skin thickness, 35 blood pressure, 11 BMI, and five glucose records. Since the missing data affect the data analysis process and mislead the prediction results, a model built with these data would be misleading. There are different methods to preprocess the data: for example, we can delete the affected records from the data set, replace the missing data with the mean value, or replace the missing data with the most likely value of the feature.

The Pima Indian dataset is minimal, only 768 samples. Therefore, the model can end up being highly biased if the incomplete training data are deleted, so removing observations from the training set is not a good idea. Different people have different insulin levels, and if we replace a large number of values with the most likely value, we may face high variance in the data. So the best option is replacing the missing values with the mean value of that particular attribute.

As a part of preprocessing, we first replace the missing data with a NaN value. Here, NaN is used for replacing the numerical missing value with a string. Afterward, we iterate over the column and find the sum of all the numerical values (NaN is a string value, so it does not add up here). Then, we calculate the mean by dividing the total summation by the number of entries in the column. Finally, we replace the NaN string with the numeric mean value.
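The following is a minimal sketch of this imputation step, assuming the pandas DataFrame df from the earlier loading sketch; it uses pandas' NaN-aware mean rather than a literal "NaN" string, which is equivalent to the procedure described above.

import numpy as np
import pandas as pd

# Columns where a zero entry actually means "missing".
MISSING_IF_ZERO = ["plas", "pres", "skin", "insu", "bmi"]

def recover_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace zero placeholders with NaN, then fill each column with its mean."""
    recovered = df.copy()
    for col in MISSING_IF_ZERO:
        recovered[col] = recovered[col].replace(0, np.nan)
        # .mean() skips NaN values, so only the observed entries contribute.
        recovered[col] = recovered[col].fillna(recovered[col].mean())
    return recovered

df_recovered = recover_missing(df)   # df as loaded in Section II
print(df_recovered.isna().sum())     # every column should now report zero missing values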
B. Feature selection

Not all of the extracted features carry significant weight, and some of them do not have any impact on the prediction model. Using such features to train the model only adds computational cost. Fig. 1 contains the skin thickness histogram for all the patients.

Fig. 1. Skin thickness graph for the Pima Indian data set. The X-axis contains skin thickness; the Y-axis contains the number of patients that are diabetes positive (red) and diabetes negative (blue).

The first block of the histogram contains roughly the same number of elements from both classes (diabetes positive and diabetes negative), and it maintains this ratio as the skin thickness increases. So this cannot be an essential feature for the model.

The data in Fig. 2 and Fig. 3 show that when the glucose and B.M.I. levels increase, the risk of diabetes rises significantly. So these features provide a useful linear relationship between the outcome and the B.M.I. or glucose level.

Fig. 2. B.M.I. feature graph from the Pima Indian data set. The X-axis contains B.M.I.; the Y-axis contains the number of patients that are diabetes positive (red) and diabetes negative (blue).

Fig. 3. Glucose level graph for the Pima Indian data set. The X-axis contains glucose level; the Y-axis contains the number of patients that are diabetes positive (red) and diabetes negative (blue).
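Per-class histograms of this kind can be reproduced with matplotlib. This is a minimal sketch assuming the df_recovered DataFrame from the data recovery step; it is not the exact script behind Figs. 1–3.

import matplotlib.pyplot as plt

def plot_feature_by_class(df, feature, bins=20):
    """Overlay histograms of one feature for the positive and negative classes."""
    negative = df.loc[df["class"] == 0, feature]
    positive = df.loc[df["class"] == 1, feature]
    plt.hist(negative, bins=bins, alpha=0.6, color="blue", label="diabetes negative")
    plt.hist(positive, bins=bins, alpha=0.6, color="red", label="diabetes positive")
    plt.xlabel(feature)
    plt.ylabel("Number of patients")
    plt.legend()
    plt.show()

for feature in ["skin", "bmi", "plas"]:   # skin thickness, B.M.I., glucose
    plot_feature_by_class(df_recovered, feature)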
We use the Greedy Stepwise Search Algorithm to select the critical attributes. The algorithm iterates through each attribute, or set of attributes, to calculate which one gives the minimum error. The steps of the feature selection algorithm are as follows (a sketch of this procedure is given after the steps):

1) Pick a dictionary of features h0(x), ..., hD(x)
   • e.g., polynomials for linear regression
2) Greedy heuristic:
   i. Start with an empty set of features F0 = Ø (or a simple set, like just h0(x) → yi + ε)
   ii. Fit the model using the current feature set Fi to obtain the weights wi
   iii. Select the next best feature hj(x)
      • e.g., the hj(x) resulting in the lowest training error when learning with Fi + hj(x)
   iv. Set Fi+1 ⇐ Fi + hj(x)
   v. Recurse
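The sketch below illustrates this greedy forward-selection loop under stated assumptions: scikit-learn provides the scoring model, the candidate "dictionary" is simply the raw columns of df_recovered, and cross-validated accuracy stands in for the training-error criterion described above.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_stepwise_select(df, candidates, target="class", max_features=4):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, remaining = [], list(candidates)
    y = df[target]
    while remaining and len(selected) < max_features:
        scores = {}
        for feature in remaining:
            X = df[selected + [feature]]
            model = LogisticRegression(max_iter=1000)
            scores[feature] = cross_val_score(model, X, y, cv=5).mean()
        best = max(scores, key=scores.get)   # feature giving the lowest error / highest accuracy
        selected.append(best)
        remaining.remove(best)
    return selected

print(greedy_stepwise_select(df_recovered,
                             ["preg", "plas", "pres", "skin", "insu", "bmi", "pedi", "age"]))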
By analyzing the different graphs and investigating which features affect the outcome the most, four features were selected. Those features are given below:
• Glucose
• B.M.I.
• Diabetes Pedigree Function
• Age

TABLE I
SAMPLE DATA RESULTING FROM APPLYING THE GREEDY STEP-WISE ALGORITHM

Glucose   BMI    Diabetes Pedigree Function   Age   Outcome
148       33.6   0.62                         50    1
85        26.6   0.35                         31    0
183       23.3   0.67                         32    1
89        28.1   0.16                         21    0
137       43.1   2.28                         33    1
C. Multilayer Perceptron Classifier

A multilayer perceptron (MLP) is a class of feed-forward artificial neural network. We use this algorithm because MLPs are used in research for their ability to solve problems stochastically, which often allows approximate solutions for extremely complex problems such as fitness approximation.

Learning occurs in the perceptron by changing the connection weights after each portion of data is processed, based on the amount of error in the output compared with the expected result. This is an example of supervised learning and is carried out through back-propagation, a generalization of the least-mean-squares algorithm for the linear perceptron. We represent the error at output node j for the nth data point (training example) by

e_j(n) = Y_j(n) − a(n)                                   (1)

where Y_j(n) is the expected value and a(n) is the value produced by the network, and the total error for that data point by

E(n) = (1/2) Σ_j e_j^2(n)                                (2)
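As a small illustration of how an error of this form drives a weight update, the following sketch implements the least-mean-squares (delta) rule for a single linear output node; back-propagation generalizes this idea to the hidden layers of an MLP. The learning rate and the toy data are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # toy inputs (100 samples, 4 features)
true_w = np.array([0.5, -1.0, 2.0, 0.3])
y = X @ true_w + 0.1 * rng.normal(size=100)   # toy targets

w = np.zeros(4)
eta = 0.01                                    # assumed learning rate
for n in range(len(X)):
    e_n = y[n] - X[n] @ w                     # error e(n), cf. Eq. (1)
    w += eta * e_n * X[n]                     # LMS update: reduces (1/2) e(n)^2, cf. Eq. (2)

print(w)  # should approach true_w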
There are many hyper-parameters for the MLP classifier, such as alpha, the hidden-layer sizes, the solver, the learning-rate decay, etc. To find the best model, different combinations of these hyper-parameters are tried randomly and iteratively. At first the model gets lower accuracy due to a high bias problem, since it gives pretty much the same test and training accuracy. Some solutions for high bias are given below:
• Build a bigger network
• Train for a longer period
• Search for a different NN (neural network) architecture

The appropriate choices are options one and three, because the training set is minimal, so training for a longer time is not a very effective way to remove the high bias problem. Analyzing the Pima dataset, we found values for the following parameters: we selected LBFGS as the optimizer (an optimizer in the family of quasi-Newton methods), alpha = 1e-5, and hidden layer sizes = (15, 7, 7, 3). In the hidden layers, the first layer has 15 nodes (neurons), the second layer has seven neurons, the third layer has seven neurons, and the fourth layer has three neurons.
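These hyper-parameters map onto scikit-learn's MLPClassifier. The sketch below is one plausible configuration under the assumptions that the four selected features come from df_recovered and that a simple held-out split is used; the exact split behind the reported accuracies is not specified here.

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

FEATURES = ["plas", "bmi", "pedi", "age"]   # Glucose, B.M.I., pedigree function, Age
X = df_recovered[FEATURES]
y = df_recovered["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

mlp = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=(15, 7, 7, 3),
                    max_iter=2000, random_state=42)
mlp.fit(X_train, y_train)

print("Training accuracy:", mlp.score(X_train, y_train))
print("Test accuracy:", mlp.score(X_test, y_test))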
We apply the M.L.P. classifier to these features and observe the accuracy levels shown in Table II.

TABLE II
M.L.P. CLASSIFIER ACCURACY

Algorithm                         Accuracy
Training Set: M.L.P. Classifier   86.73%
Test Set: M.L.P. Classifier       85.15%
VI. APPLYING K-MEANS WITH DIFFERENT MACHINE LEARNING MODELS

A. K-means Algorithm

Cluster analysis aims at partitioning the observations into disparate clusters so that observations within the same cluster are more closely related to each other than those assigned to different clusters [14]. Fig. 4 shows the procedure of the K-means clustering algorithm on the Pima Indian data-set.

Fig. 4. Visualizing the k-means algorithm for the Pima Indian data-set.
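A minimal sketch of using k-means in this role is given below, assuming scikit-learn and the df_recovered DataFrame from Section V; the cluster label assigned to each patient is appended as an extra feature, and the choice of two clusters is an assumption rather than a value stated in this paper.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_cols = [c for c in df_recovered.columns if c != "class"]
X_scaled = StandardScaler().fit_transform(df_recovered[feature_cols])

# Assumed k=2; the cluster index itself becomes a new "cluster" feature.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df_recovered["cluster"] = kmeans.fit_predict(X_scaled)

print(df_recovered[["plas", "bmi", "cluster", "class"]].head())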
B. Feature selection

After clustering, we again apply the greedy stepwise search algorithm to find the most useful features; it takes the set of features that gives the minimum error rate. The selected features are:
• Pregnancies
• Glucose
• B.M.I.
• Age
• Diabetes Pedigree Function
• Cluster (output of the k-means algorithm)

C. Classifier

We apply several different kinds of classifiers to examine which method works well. Different classifiers, such as the decision tree, MLP, and logistic regression, take different approaches to evaluating the model. Table III contains the results of the different classifiers.

TABLE III
THE RESULTS OF THE DIFFERENT CLASSIFIERS

Algorithm             Accuracy
Logistic Regression   77.08%
M.L.P. Classifier     75.39%
Random Forest         75.00%
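The comparison in Table III can be reproduced in outline with the sketch below. It assumes the cluster-augmented features from the previous step and scikit-learn's cross_val_score; the exact evaluation protocol (split sizes, folds) behind the reported numbers is not specified in this paper.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

SELECTED = ["preg", "plas", "bmi", "age", "pedi", "cluster"]
X = df_recovered[SELECTED]
y = df_recovered["class"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "M.L.P. Classifier": MLPClassifier(solver="lbfgs", alpha=1e-5,
                                       hidden_layer_sizes=(15, 7, 7, 3),
                                       max_iter=2000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Report mean cross-validated accuracy for each classifier.
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.4f}")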