
Health Insurance Amount Prediction: Nidhi Bhardwaj, Rishabh Anand

This document summarizes a research paper that used three machine learning regression models to predict health insurance premium amounts from personal health data. The models tested were multiple linear regression, decision tree regression, and gradient boosting regression. The dataset contained 1338 records with attributes such as age, body mass index, and smoking status. Data cleaning removed attributes that had little effect on the prediction. All three models produced predictions, and gradient boosting was found to achieve accuracy similar to multiple regression while taking less computational time.


Published by: International Journal of Engineering Research & Technology (IJERT)
http://www.ijert.org ISSN: 2278-0181
Vol. 9 Issue 05, May-2020

Health Insurance Amount Prediction


Nidhi Bhardwaj, Rishabh Anand
Dr. Akhilesh Das Gupta Institute of Technology & Management
Delhi, India

Abstract: In this thesis, we analyse personal health data to predict the insurance amount for individuals. Three regression models, namely Multiple Linear Regression, Decision Tree Regression and Gradient Boosting Decision Tree Regression, have been used to compare and contrast the performance of these algorithms. The dataset was used for training the models, and that training helped to come up with some predictions. The predicted amount was then compared with the actual data to test and verify the model. Later, the accuracies of these models were compared. It was gathered that the multiple linear regression and gradient boosting algorithms performed better than linear regression and the decision tree. Gradient boosting is best suited in this case because it takes much less computational time to achieve the same performance metric, though its performance is comparable to multiple regression.

Keywords: Regression, Premium, Machine Learning.

I. INTRODUCTION

The goal of this project is to allow a person to get an idea of the amount required according to their own health status. They can then go with any health insurance company and its schemes and benefits, keeping in mind the predicted amount from our project. This can help a person focus more on the health aspect of an insurance policy rather than the futile part.

Health insurance is a necessity nowadays, and almost every individual is linked with a government or private health insurance company. The factors determining the amount of insurance vary from company to company. Also, people in rural areas are unaware of the fact that the government of India provides free health insurance to those below the poverty line. It is a very complex method, and some rural people either buy private health insurance or do not invest money in health insurance at all. Apart from this, people can easily be fooled about the amount of the insurance and may unnecessarily buy expensive health insurance.

Our project does not give the exact amount required by any health insurance company, but it gives enough of an idea about the amount associated with an individual for his/her own health insurance.

The prediction is premature and does not comply with any particular company, so it must not be the only criterion in the selection of a health insurance. Early health insurance amount prediction can help in better contemplation of the amount needed, so that a person can ensure that the amount he/she is going to opt for is justified. It can also provide an idea about gaining extra benefits from the health insurance.

II. DATASET USED

The primary source of data for this project was Kaggle user Dmarco. The dataset comprises 1338 records with 6 attributes, as shown in Fig. 1. The data was in a structured format and was stored in a CSV file.

The dataset is not suited for regression to take place directly, so cleaning the dataset becomes important before using the data with the various regression algorithms.

In a dataset, not every attribute has an impact on the prediction, and some attributes even reduce the accuracy, so it becomes necessary to remove these attributes from the features of the code. Removing such attributes not only helps in improving accuracy but also improves the overall performance and speed.

Figure 1: Sample of Health Insurance Dataset

In health insurance, many factors such as pre-existing body condition, family medical history, Body Mass Index (BMI), marital status, location, past insurances, etc. affect the amount. According to our dataset, age and smoking status have the maximum impact on the amount prediction, with smoker being the one attribute with maximum effect. The children attribute had almost no effect on the prediction; therefore, this attribute was removed from the input to the regression model to support better computation in less time.

III. MACHINE LEARNING

Machine learning can be defined as the process of teaching a computer system to make accurate predictions after data is fed to it. However, training has to be done first with the associated data. Accuracy can be improved by filtering and by using various machine learning models. Fig. 2 shows various machine learning types along with their properties.
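The attribute screening described in Section II (keeping age and smoking status, dropping the children attribute) could be sketched as follows. The paper does not state how the effect of each attribute was measured, so the use of the Pearson correlation, the 0.1 cut-off, and all record values below are assumptions made purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical records: invented values, NOT rows from the actual dataset.
# smoker is encoded as 1 (yes) or 0 (no).
records = [
    {"age": 20, "children": 0, "smoker": 0, "charges": 5000.0},
    {"age": 30, "children": 2, "smoker": 0, "charges": 7500.0},
    {"age": 40, "children": 1, "smoker": 0, "charges": 10000.0},
    {"age": 50, "children": 3, "smoker": 0, "charges": 12500.0},
    {"age": 25, "children": 3, "smoker": 1, "charges": 26250.0},
    {"age": 35, "children": 1, "smoker": 1, "charges": 28750.0},
    {"age": 45, "children": 2, "smoker": 1, "charges": 31250.0},
    {"age": 55, "children": 0, "smoker": 1, "charges": 33750.0},
]

THRESHOLD = 0.1  # assumed cut-off below which an attribute is dropped

charges = [r["charges"] for r in records]
correlations = {
    name: pearson([r[name] for r in records], charges)
    for name in ("age", "children", "smoker")
}

# Keep only attributes whose absolute correlation with charges is large enough.
kept = [name for name, c in correlations.items() if abs(c) >= THRESHOLD]

print(correlations)
print(kept)
```

On this invented sample the smoker flag correlates most strongly with the charges, age comes second, and children falls below the cut-off, which mirrors the ordering the section reports for the real data.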

IJERTV9IS050700 www.ijert.org 1008


(This work is licensed under a Creative Commons Attribution 4.0 International License.)

Figure 2: Types of Machine Learning

A. Supervised Learning
Supervised learning algorithms create a mathematical model from a set of data that contains both the inputs and the desired outputs. Usually a random part of the complete dataset is selected as training data, in other words as a set of training examples. Each training example has one or more inputs and a desired output, called a supervisory signal. Each training example is represented by an array or vector, known as a feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output for new inputs. An optimal function allows the algorithm to correctly determine the output for inputs that were not part of the training data.

B. Unsupervised Learning
In this type of learning, algorithms take a set of data that contains only inputs and find structure in the data, such as grouping or clustering of data points. The algorithms therefore learn from data that has not been labeled, classified or categorized. Unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. The main application of unsupervised learning is density estimation in statistics, though unsupervised learning also encompasses other domains involving summarizing and explaining data features.

C. Reinforcement Learning
Reinforcement learning is a class of machine learning concerned with how software agents ought to take actions in an environment. These actions must be chosen so that they maximize some notion of cumulative reward. Reinforcement learning is becoming very common nowadays, and the field is therefore studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms.

IV. REGRESSION

Regression analysis allows us to quantify the relationship between an outcome and its associated variables. Many techniques for performing statistical predictions have been developed, but in this project three models were tested and compared: Multiple Linear Regression (MLR), Decision Tree Regression and Gradient Boosting Regression.

A. Multiple Linear Regression
Multiple linear regression can be defined as an extension of simple linear regression. It is used when we want to predict a single output that depends on multiple inputs; in other words, the predicted value of a variable is based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target or criterion variable), and the variables used to predict the value of the dependent variable are called the independent variables (or sometimes the predictor, explanatory or regressor variables).

B. Decision Tree Regression
Decision tree regression builds regression or classification models in the form of a tree structure. The dataset is divided into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a decision tree with decision nodes and leaf nodes. Each decision node has two or more branches, each representing values of the attribute tested. A leaf node represents a decision on the numerical target. The topmost decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both numerical and categorical data.

C. Gradient Boosting Regression
This algorithm for boosting trees came from the application of boosting methods to regression trees. The basic idea is to compute a sequence of simple trees, where each successive tree is built for the prediction residuals of the preceding tree. For predictive models, gradient boosting is considered one of the most powerful techniques.
Gradient boosting involves three elements:
1. A loss function to be optimized.
2. An additive model to add weak learners to minimize the loss function.
3. A weak learner to make predictions.

V. DESIGNING AND IMPLEMENTATION

A. Data Preparation & Cleaning
The data has been imported from the Kaggle website. The website provides a variety of data, and the data used for the project is insurance amount data. The data included various attributes such as age, gender, body mass index and smoker, and the charges attribute, which will work as the label
for the project. The data was in a structured format and was stored in a CSV file. The data was imported using the pandas library.

Fig. 4 shows the graphs of every single attribute taken as input to the gradient boosting regression model.
The presence of missing, incomplete, or corrupted data leads to wrong results when performing operations such as count or mean. These inconsistencies must be removed before doing any analysis on the data. The data included some ambiguous values which needed to be removed.
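As a minimal sketch of this cleaning step, the snippet below drops rows with missing or ambiguous values and removes the children attribute. The authors used pandas; this example uses only Python's built-in csv module so that it runs without extra dependencies. The column names, the rule for what counts as "ambiguous", and every value in the sample are assumptions for illustration, not the authors' actual data or code.

```python
import csv
import io

# Hypothetical CSV excerpt: column names are assumed, values are invented.
RAW = """age,sex,bmi,children,smoker,charges
25,female,22.5,0,no,3200.50
41,male,,2,yes,21000.00
33,male,30.1,1,maybe,4800.75
,female,27.3,3,no,5100.00
52,female,31.0,1,yes,27500.25
"""

def clean(raw_text):
    """Return only complete, unambiguous rows with numeric fields converted."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Drop rows with any missing or empty field.
        if any(not isinstance(v, str) or not v.strip() for v in row.values()):
            continue
        # Drop rows whose smoker flag is ambiguous (assumed rule: yes/no only).
        if row["smoker"] not in ("yes", "no"):
            continue
        try:
            cleaned.append({
                "age": int(row["age"]),
                "sex": row["sex"],
                "bmi": float(row["bmi"]),
                "smoker": 1 if row["smoker"] == "yes" else 0,
                "charges": float(row["charges"]),
                # "children" is left out on purpose, following Section II.
            })
        except ValueError:
            # Drop rows whose numeric fields cannot be parsed.
            continue
    return cleaned

rows = clean(RAW)
print(len(rows), "rows kept")
```

Of the five sample rows, only the first and the last survive: one row lacks a BMI value, one lacks an age, and one has a smoker flag that is neither yes nor no.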

B. Training
Once the training data is in a suitable form to feed to the model, the training and testing phase of the model can proceed. During the training phase, the primary concern is model selection. This involves choosing the best modelling approach for the task, or the best parameter settings for a given model. In fact, the term model selection often refers to both of these processes, since in many cases various models are tried first and the best-performing model (with the best-performing parameter settings for each model) is selected.
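The selection procedure described above can be illustrated with a small, self-contained sketch: several candidate models are fitted on a training split and the one with the best score on held-out data is kept. The two candidates here (a mean baseline and a one-feature least-squares line) are simple stand-ins, not the three models used in the paper. The synthetic data, the 80/20 split, and the use of the R² score as the "accuracy" measure are likewise assumptions.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Simple least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Synthetic data (invented): a linear trend plus bounded noise.
rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [250.0 * x + 1000.0 + rng.uniform(-200.0, 200.0) for x in xs]

# Shuffle the indices and hold out 20% of the records for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_test = [xs[i] for i in test_idx]
y_test = [ys[i] for i in test_idx]

# Fit every candidate on the training split and score it on the test split.
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {}
for name, fit in candidates.items():
    model = fit(x_train, y_train)
    scores[name] = r2_score(y_test, [model(x) for x in x_test])

best = max(scores, key=scores.get)
print(scores)
print("selected:", best)
```

Scoring on data the model has not seen during fitting is what makes the comparison meaningful; the same loop extends naturally to more candidates or to different parameter settings of one model.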

C. Prediction
The model was used to predict the insurance amount which would be spent on an individual's health. The model used the relation between the features and the label to predict the amount.
Accuracy defines the degree of correctness of the predicted value of the insurance amount. The accuracy of the model was evaluated using different algorithms, different features and different train-test split sizes. The size of the data used for training has a large impact on the accuracy: the larger the training set, the better the accuracy.
The model predicts the premium amount using multiple algorithms and shows the effect of each attribute on the predicted value.

VI. RESULT

The accuracy of the predicted amount was best, at 99.5%, for gradient boosting decision tree regression. The other two regression models also gave good accuracies of about 80% in their predictions. Fig. 3 shows the accuracy percentage of various attributes, separately and combined, over all three models.
The model giving the highest percentage of accuracy when taking all four attributes as input was selected as the best model, which turned out to be Gradient Boosting Regression.

Figure 4: Attributes vs Prediction Graphs - Gradient Boosting Regression

VII. CONCLUSION & FUTURE SCOPE

In this project, three regression models were evaluated on individual health insurance data. The health insurance data was used to develop the three regression models, and the predicted premiums from these models were compared with the actual premiums to compare the accuracies of the models. It was found that the Gradient Boosting Regression model, which is built upon decision trees, is the best-performing model.

Various factors were used and their effect on the predicted amount was examined. It was observed that age and smoking status affect the prediction most in every algorithm applied. Attributes which had no effect on the prediction were removed from the features.

The effect of various independent variables on the premium amount was also checked. The attributes were also checked in combination for better accuracy results.

Premium amou
insurance terms and conditions.
The models can be applied to the data collected in coming years to predict the premium. This can help not only people but also insurance companies to work in tandem towards a better and more health-centric insurance amount.
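Gradient boosting regression builds each successive tree on the prediction residuals of the preceding ones (Section IV.C). The sketch below shows that idea in a stripped-down form, under several assumptions: a single input feature, one-split trees (stumps) as the weak learner, squared-error loss, and a fixed learning rate of 0.1. The data is invented, and the code illustrates the technique in general rather than reproducing the authors' implementation.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: a constant plus a sequence of shrunken stumps."""
    base = sum(ys) / len(ys)          # initial prediction: the mean
    predictions = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        history.append(mse(ys, predictions))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Invented data: charges grow non-linearly with age.
ages = [float(a) for a in range(18, 65, 2)]
charges = [2000.0 + 3.0 * (a - 18.0) ** 2 for a in ages]

predict, history = gradient_boost(ages, charges)
print("training MSE after first round:", history[0])
print("training MSE after last round:", history[-1])
```

Each round fits a stump to what the current ensemble still gets wrong and adds a small fraction of it to the prediction, so the training error shrinks as rounds accumulate. The three elements listed in Section IV.C appear here as the squared-error loss, the additive update, and the stump as weak learner.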

Figure 3: Accuracy in percentage (%)