Health Insurance Amount Prediction: Nidhi Bhardwaj, Rishabh Anand
Abstract: In this thesis, we analyse personal health data to predict
the insurance amount for individuals. Three regression models, namely
Multiple Linear Regression, Decision Tree Regression and Gradient
Boosting Decision Tree Regression, have been used to compare and
contrast the performance of these algorithms. The dataset was used to
train the models, and the trained models were used to make predictions.
The predicted amount was then compared with the actual data to test and
verify each model, and the accuracies of the models were compared. It
was found that multiple linear regression and gradient boosting
performed better than decision tree regression. Gradient boosting is
best suited in this case because it takes much less computational time
to achieve the same performance metric, though its performance is
comparable to multiple regression.
Keywords: Regression, Premium, Machine Learning.
I. INTRODUCTION
The prediction is preliminary and is not tied to any particular company,
so it must not be the only criterion in selecting a health insurance
plan. Early health insurance amount prediction can help in better
estimating the amount needed, so that a person can ensure that the
amount he/she is going to opt for is justified. It can also give an idea
about gaining extra benefits from the health insurance.
II. DATASET USED
The primary source of data for this project was Kaggle user Dmarco. The
dataset comprises 1338 records with 6 attributes, as shown in Fig. 1.
The data was in structured format and was stored in a csv file.
The dataset is not suited for regression directly, so cleaning the
dataset is important before using the data with the various regression
algorithms.
In a dataset, not every attribute has an impact on the prediction, and
some attributes even reduce the accuracy, so it becomes necessary to
remove these attributes from the features. Removing such attributes
helps improve not only accuracy but also overall performance and speed.
Machine learning can be defined as the process of teaching a computer
system to make accurate predictions when data is fed to it. However,
training has to be done first with the associated data. Accuracy can be
improved by filtering the data and by trying various machine learning
models. Fig. 2 shows various machine learning types along with their
properties.
IV. REGRESSION
Fig. 4 shows the graphs of every single attribute taken as input to the
gradient boosting regression model.
for the project. The data was in structured format and was stored in a
csv file. The data was imported using the pandas library.
The presence of missing, incomplete, or corrupted data leads
to wrong results when computing statistics such as a count
or a mean. These inconsistencies must be removed before
doing any analysis on the data. The data included some
ambiguous values, which had to be removed.
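The import and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the column names, the sample rows and the helper `load_and_clean` are all assumptions made for the example, and an in-memory string stands in for the csv file.

```python
import io

import pandas as pd

# Illustrative rows only; column names and values are assumptions,
# not taken from the paper's dataset.
RAW_CSV = """age,sex,bmi,children,smoker,charges
19,female,27.9,0,yes,16800.0
18,male,33.8,1,no,1700.0
28,male,,3,no,4400.0
33,male,22.7,0,unknown,21900.0
32,male,28.9,0,no,3900.0
"""


def load_and_clean(source):
    """Read a csv source and drop rows with missing or ambiguous values."""
    df = pd.read_csv(source)
    # Remove rows that contain any missing value (here: the empty bmi).
    df = df.dropna()
    # Keep only rows whose smoker flag is an unambiguous yes/no.
    df = df[df["smoker"].isin(["yes", "no"])].copy()
    # Encode the categorical columns as integers for the regressors.
    df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
    df["sex"] = df["sex"].map({"female": 0, "male": 1})
    return df.reset_index(drop=True)


clean = load_and_clean(io.StringIO(RAW_CSV))
print(clean)
```

With a real file, the same function could be called with a file path instead of the `io.StringIO` object.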
B. Training
Once training data is in a suitable form to feed to the model,
the training and testing phase of the model can proceed.
During the training phase, the primary concern is the model
selection. This involves choosing the best modelling
approach for the task, or the best parameter settings for a
given model. In fact, the term model selection often refers to
both of these processes, since in many cases various models
are tried first and the best-performing model (with the
best-performing parameter settings for each model) is selected.
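A minimal sketch of this kind of model selection is given below, assuming scikit-learn is used. The data is synthetic and the feature/label relation is made up for the example, so the scores say nothing about the results reported in this paper. The `score` method returns the coefficient of determination (R^2); whether that is the accuracy measure used in the paper is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): age, bmi, children, smoker.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(18, 65, n),   # age
    rng.normal(30, 6, n),      # bmi
    rng.integers(0, 5, n),     # children
    rng.integers(0, 2, n),     # smoker flag
])
# Made-up relation between the features and the charges, plus noise.
y = (250 * X[:, 0] + 300 * X[:, 1] + 500 * X[:, 2]
     + 20000 * X[:, 3] + rng.normal(0, 2000, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "multiple_linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every candidate and record its R^2 on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Select the candidate with the highest held-out score.
best = max(scores, key=scores.get)
print(scores, best)
```

Parameter settings could be compared in the same loop by adding several differently configured instances of one model to `candidates`.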
C. Prediction
The model was used to predict the insurance amount that
individuals would spend on their health. The model used the
relation between the features and the label to predict the amount.
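For the multiple linear regression model, this relation takes the standard linear form sketched below. The symbols are introduced here only for illustration and do not appear in the original text.

```latex
\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i
```

Here \hat{y} is the predicted amount, x_1, ..., x_p are the feature values of one individual, and \beta_0, ..., \beta_p are the coefficients fitted on the training data. The tree-based models learn a piecewise-constant relation instead of a single linear formula.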
Accuracy defines the degree of correctness of the predicted value of
the insurance amount. The accuracy of the model was evaluated using
different algorithms, different features and different train-test split
sizes. The size of the training set has a large impact on accuracy: the
larger the training set, the better the accuracy.
The model predicts the premium amount using multiple algorithms and
shows the effect of each attribute on the predicted value.
VI. RESULT
The accuracy of the predicted amount was best, i.e. 99.5%, for gradient
boosting decision tree regression. The other two regression models also
gave good accuracies of about 80% in their predictions. Fig. 3 shows the
accuracy percentage of various attributes, separately and combined, over
all three models.
The model giving the highest percentage of accuracy when taking all four
attributes as input was selected as the best model, which turned out to
be Gradient Boosting Regression.
Figure 4: Attributes vs Prediction Graphs - Gradient Boosting Regression
VII. CONCLUSION & FUTURE SCOPE
In this project, three regression models are evaluated for individual
health insurance data. The health insurance data was used to develop the
three regression models, and the predicted premiums from these models
were compared with actual premiums to compare the accuracies of these
models. It has been found that the Gradient Boosting Regression model,
which is built upon decision trees, is the best performing model.
Various factors were used and their effect on the predicted amount was
examined. It was [...] and smoking status affects the prediction most in
every algorithm applied. Attributes which had no effect on the
prediction were removed from the features.
The effect of various independent variables on the premium amount was
also checked. The attributes were also checked in combination for better
accuracy results.
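One way such a check of attribute combinations could be carried out is sketched below. This is an assumption-laden illustration, not the procedure used in this project: the feature names, the synthetic data, the helper `score_subset` and the choice of a linear model are all hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

FEATURES = ["age", "bmi", "children", "smoker"]  # assumed attribute names

# Synthetic stand-in data (an assumption), one array per attribute.
rng = np.random.default_rng(1)
n = 400
data = {
    "age": rng.integers(18, 65, n).astype(float),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n).astype(float),
    "smoker": rng.integers(0, 2, n).astype(float),
}
# Made-up target; 'children' deliberately has no effect in this example.
y = (250 * data["age"] + 300 * data["bmi"]
     + 20000 * data["smoker"] + rng.normal(0, 2000, n))


def score_subset(names):
    """Fit on the given attributes only and return the held-out R^2."""
    X = np.column_stack([data[name] for name in names])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


# Score every non-empty combination of attributes.
results = {
    subset: score_subset(subset)
    for size in range(1, len(FEATURES) + 1)
    for subset in combinations(FEATURES, size)
}
best_subset = max(results, key=results.get)
print(best_subset, results[best_subset])
```

An attribute whose removal does not lower the score, such as `children` in this made-up data, is a candidate for being dropped from the features.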
Premium amou
insurance terms and conditions.
The models can be applied to the data collected in coming
years to predict the premium. This can help not only people
but also insurance companies to work in tandem toward better
and more health-centric insurance amounts.
Figure 3: Accuracy in percentage (%)
REFERENCES
[1] https://www.moneycrashers.com/factors-health-insurance-premium-costs/
[2] https://en.wikipedia.org/wiki/Healthcare_in_India
[3] https://www.kaggle.com/mirichoi0218/insurance
[4] https://economictimes.indiatimes.com/wealth/insure/what-you-need-to-know-before-buying-health-insurance/articleshow/47983447.cms?from=mdr
[5] https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
[6] https://www.zdnet.com/article/the-true-costs-and-roi-of-implementing-ai-in-the-enterprise/
[7] https://www.saedsayad.com/decision_tree_reg.htm
[8] http://www.statsoft.com/Textbook/Boosting-Trees-Regression-Classification