Health Insurance Amount Prediction: Nidhi Bhardwaj, Rishabh Anand

This document summarizes a research paper that used three machine learning regression models to predict health insurance premium amounts based on personal health data. The models tested were multiple linear regression, decision tree regression, and gradient boosting regression. The dataset contained 1338 records with attributes such as age, body mass index, and smoking status. Data cleaning removed attributes that did not help the prediction. All three models performed the prediction, but gradient boosting was found to achieve accuracy similar to multiple regression while taking less computational time.

Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181

Vol. 9 Issue 05, May-2020

Health Insurance Amount Prediction

Nidhi Bhardwaj, Rishabh Anand
Dr. Akhilesh Das Gupta Institute of Technology & Management
Delhi, India

Abstract: In this thesis, we analyse personal health data to predict the insurance amount for individuals. Three regression models, namely Multiple Linear Regression, Decision Tree Regression and Gradient Boosting Decision Tree Regression, have been used to compare and contrast the performance of these algorithms. The dataset was used to train the models, and the trained models were used to make predictions. The predicted amounts were then compared with the actual data to test and verify each model, and the accuracies of the models were compared. It was found that the multiple linear regression and gradient boosting algorithms performed better than the linear regression and decision tree. Gradient boosting is best suited in this case because it takes much less computational time to achieve the same performance metric, though its performance is comparable to multiple regression.

Keywords: Regression, Premium, Machine Learning.

I. INTRODUCTION

The goal of this project is to allow a person to get an idea of the amount required according to their own health status. They can then consider any health insurance company and its schemes and benefits while keeping in mind the predicted amount from our project. This can help a person focus more on the health aspect of an insurance policy rather than the futile part.

Health insurance is a necessity nowadays, and almost every individual is linked with a government or private health insurance company. The factors determining the insurance amount vary from company to company. Also, people in rural areas are unaware of the fact that the government of India provides free health insurance to those below the poverty line. The process is very complex, and some rural people either buy private health insurance or do not invest money in health insurance at all. Apart from this, people can easily be misled about the amount of the insurance and may unnecessarily buy expensive health insurance.

Our project does not give the exact amount required by any health insurance company, but it gives a good enough idea of the amount associated with an individual for his/her own health insurance.

The prediction is preliminary and is not tied to any particular company, so it must not be the only criterion in selecting a health insurance policy. Early health insurance amount prediction can help in better contemplation of the amount needed, so that a person can ensure that the amount he/she is going to opt for is justified. It can also provide an idea about gaining extra benefits from the health insurance.

II. DATASET USED

The primary source of data for this project was Kaggle user Dmarco. The dataset comprises 1338 records with 6 attributes, as shown in Fig. 1. The data was in a structured format and was stored in a CSV file.

The dataset is not suited for regression directly, so cleaning the dataset becomes important before the data can be used with the various regression algorithms.

In a dataset, not every attribute has an impact on the prediction, and some attributes even reduce the accuracy, so it becomes necessary to remove these attributes from the features used in the code. Removing such attributes helps improve not only accuracy but also overall performance and speed.

Figure 1: Sample of Health Insurance Dataset

In health insurance, many factors such as pre-existing body condition, family medical history, Body Mass Index (BMI), marital status, location, past insurances, etc. affect the amount. According to our dataset, age and smoking status have the maximum impact on the amount prediction, with smoker being the one attribute with maximum effect. The children attribute had almost no effect on the prediction; therefore this attribute was removed from the input to the regression model to support better computation in less time.

III. MACHINE LEARNING

Machine learning can be defined as the process of teaching a computer system to make accurate predictions when it is fed data. However, training has to be done first with the associated data. Accuracy can be improved by filtering the data and by trying various machine learning models. Fig. 2 shows various machine learning types along with their properties.

IJERTV9IS050700 www.ijert.org 1008

(This work is licensed under a Creative Commons Attribution 4.0 International License.)

Figure 2: Types of Machine Learning

A. Supervised Learning
Supervised learning algorithms create a mathematical model from a set of data that contains both the inputs and the desired outputs. Usually a random part of the complete dataset is selected as training data, in other words a set of training examples. Each training example has one or more inputs and a desired output, called a supervisory signal. Each example in the training dataset is represented by an array or vector, known as a feature vector, and the training data as a whole is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output for new inputs. With the help of an optimal function, the algorithm correctly determines the output for inputs that were not a part of the training data.

B. Unsupervised Learning
In this type of learning, algorithms take a set of data that contains only inputs and find structure in the data, such as grouping or clustering of data points. The algorithm learns from data that has not been labeled, classified or categorized. Unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. The main application of unsupervised learning is density estimation in statistics, though unsupervised learning also encompasses other domains that involve summarizing and explaining data features.

C. Reinforcement Learning
Reinforcement learning is a class of machine learning concerned with how software agents ought to take actions in an environment. These actions must be chosen so that they maximize some notion of cumulative reward. Reinforcement learning is becoming very common nowadays, and the field is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms.

IV. REGRESSION

Regression analysis allows us to quantify the relationship between an outcome and its associated variables. Many techniques for performing statistical predictions have been developed, but in this project three models were tested and compared: Multiple Linear Regression (MLR), Decision Tree Regression and Gradient Boosting Regression.

A. Multiple Linear Regression
Multiple linear regression can be defined as an extension of simple linear regression. It is used when we want to predict a single output that depends on multiple inputs; in other words, the predicted value of a variable is based on the values of two or more different variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target or criterion variable), and the variables used to predict the value of the dependent variable are called the independent variables (or sometimes the predictor, explanatory or regressor variables).

B. Decision tree regression
Decision tree regression builds regression or classification models in the form of a tree structure. The dataset is divided into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a decision tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing values of the attribute tested. A leaf node represents a decision on the numerical target. The topmost decision node, called the root node, corresponds to the best predictor. Decision trees can handle numerical as well as categorical data.

C. Gradient Boosting Regression
This algorithm for boosting trees came from the application of boosting methods to regression trees. The basic idea is to compute a sequence of simple trees, where each successive tree is built for the prediction residuals of the preceding tree. Gradient boosting is considered one of the most powerful techniques for building predictive models.
Gradient boosting involves three elements:

1. A loss function to be optimized.
2. An additive model to add weak learners to minimize the loss function.
3. A weak learner to make predictions.

V. DESIGNING AND IMPLEMENTATION

A. Data Preparation & Cleaning
The data has been imported from the Kaggle website. The website provides a variety of data, and the data used for the project is insurance amount data. The data included various attributes such as age, gender, body mass index and smoker, as well as the charges attribute, which will work as the label for the project. Fig. 4 shows the graphs of every single attribute taken as input to the gradient boosting regression model. The data was in a structured format and was stored in a CSV file. The data was imported using the pandas library.
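The import step can be sketched roughly as follows. This is only an illustration: the column names are assumed from the attributes named in the text, the three rows are invented placeholders rather than records from the Kaggle file, and `load_insurance` is a hypothetical helper, not a function from the paper.

```python
import io

import pandas as pd

# Invented placeholder rows in the assumed column layout (not real records).
SAMPLE_CSV = """age,gender,bmi,children,smoker,charges
25,female,22.0,0,no,3000.0
40,male,31.5,1,yes,21000.0
52,female,28.0,2,no,9000.0
"""


def load_insurance(source) -> pd.DataFrame:
    """Read the insurance CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real data this would be a path to the downloaded CSV file.
df = load_insurance(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(df.dtypes)
```

Reading through a small helper keeps the rest of the pipeline independent of where the CSV actually lives.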
The presence of missing, incomplete, or corrupted data leads to wrong results when computing functions such as count or average. These inconsistencies must be removed before doing any analysis on the data. The data included some ambiguous values, which needed to be removed.
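A minimal cleaning sketch along these lines is shown below. The paper does not say which values were ambiguous or how categorical attributes were encoded, so the rules here (dropping rows with missing values, exact duplicates, and smoker values other than yes/no, then encoding categories as 0/1) are assumptions; only the removal of the children attribute comes from the text. The rows are invented for illustration.

```python
import io

import pandas as pd

# Invented rows with deliberate problems: a missing bmi, a duplicate row,
# and an ambiguous smoker value.
RAW_CSV = """age,gender,bmi,children,smoker,charges
25,female,22.0,0,no,3000.0
40,male,,1,yes,21000.0
40,male,31.5,1,yes,21000.0
40,male,31.5,1,yes,21000.0
52,female,28.0,2,maybe,9000.0
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the assumed cleaning rules and return a model-ready frame."""
    df = df.dropna()                              # rows with missing values
    df = df.drop_duplicates()                     # exact duplicate rows
    df = df[df["smoker"].isin(["yes", "no"])]     # ambiguous smoker values
    df = df.drop(columns=["children"])            # attribute the paper removed
    df = df.assign(
        smoker=(df["smoker"] == "yes").astype(int),   # yes/no -> 1/0
        gender=(df["gender"] == "male").astype(int),  # male/female -> 1/0
    )
    return df.reset_index(drop=True)


cleaned = clean(pd.read_csv(io.StringIO(RAW_CSV)))
print(cleaned)
```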

B. Training
Once the training data is in a suitable form to feed to the model, the training and testing phase of the model can proceed. During the training phase, the primary concern is model selection. This involves choosing the best modelling approach for the task, or the best parameter settings for a given model. In fact, the term model selection often refers to both of these processes, since in many cases various models are tried first and the best-performing model (with the best-performing parameter settings for each model) is selected.
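The selection step might look like the sketch below. The paper does not name its modelling library, so the use of scikit-learn, the hyperparameters, and the R² score as the measure of accuracy are all assumptions. The data is synthetic, generated only so that the example runs; it is not the Kaggle dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: charges depend on age, BMI and smoking status.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 5, n)
smoker = rng.integers(0, 2, n)
charges = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, n)

X = np.column_stack([age, bmi, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.25, random_state=0
)

# The three model families compared in the paper, with assumed settings.
candidates = {
    "multiple_linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each candidate and score it on held-out rows (score() returns R^2).
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Which model wins here says nothing about the paper's results, because the synthetic charges are linear by construction.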

C. Prediction
The model was used to predict the insurance amount that an individual would spend on their health. The model used the relation between the features and the label to predict the amount.
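As a small illustration of learning a feature-label relation and then predicting from it, the sketch below fits a straight line from age to charges by least squares, predicts held-out cases, and scores the predictions with R². It is a single-feature simplification of the multiple regression described earlier, all numbers are invented, and R² is an assumed stand-in for the paper's accuracy figure.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented training rows (age -> charges) and two held-out rows.
train_age = [20, 30, 40, 50, 60]
train_charges = [7000, 9500, 12000, 14500, 17000]
test_age = [25, 45]
test_charges = [8000, 13500]

slope, intercept = fit_line(train_age, train_charges)
predicted = [slope * a + intercept for a in test_age]
score = r2_score(test_charges, predicted)
print(slope, intercept, predicted, round(score, 4))
```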
Accuracy defines the degree of correctness of the predicted value of the insurance amount. The accuracy of the model was evaluated using different algorithms, different features and different train-test split sizes. The size of the data used for training has a huge impact on the accuracy: the larger the training set, the better the accuracy. The model predicts the premium amount using multiple algorithms and shows the effect of each attribute on the predicted value.

VI. RESULT

The accuracy of the predicted amount was best, at 99.5%, for gradient boosting decision tree regression. The other two regression models also gave good accuracies of about 80% in their predictions. Fig. 3 shows the accuracy percentage of various attributes, separately and combined, over all three models. The model giving the highest percentage of accuracy when taking all four attributes as input was selected as the best model, which eventually came out to be Gradient Boosting Regression.

Figure 4: Attributes vs Prediction Graphs - Gradient Boosting Regression

VII. CONCLUSION & FUTURE SCOPE

In this project, three regression models are evaluated for individual health insurance data. The health insurance data was used to develop the three regression models, and the predicted premiums from these models were compared with actual premiums to compare the accuracies of these models. It has been found that the Gradient Boosting Regression model, which is built upon decision trees, is the best-performing model.

Various factors were used, and their effect on the predicted amount was examined. It was observed that age and smoking status affect the prediction most in every algorithm applied. Attributes which had no effect on the prediction were removed from the features.

The effect of various independent variables on the premium amount was also checked. The attributes were also checked in combination for better accuracy results.

Premium amount [...] insurance terms and conditions.
The models can be applied to the data collected in coming years to predict the premium. This can help not only people but also insurance companies to work in tandem for a better and more health-centric insurance amount.
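To make the gradient boosting idea from Section IV.C concrete, here is a from-scratch sketch covering its three elements: a squared-error loss, a weak learner (a one-split regression stump), and an additive model in which each new stump is fitted to the residuals of the current prediction. It is not the authors' implementation. The single age feature, the learning rate, the number of trees and the data values are assumptions chosen for illustration.

```python
def fit_stump(xs, targets):
    """Find the single split on x that minimises squared error."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_trees=50, rate=0.1):
    """Additive model: start from the mean, then fit stumps to residuals."""
    base = sum(ys) / len(ys)
    current = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - c for y, c in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        current = [c + rate * stump_predict(stump, x) for c, x in zip(current, xs)]
    return base, rate, stumps


def predict(model, x):
    base, rate, stumps = model
    return base + rate * sum(stump_predict(s, x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Invented illustration data: charges rising with age.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60]
charges = [3000, 3500, 4200, 5000, 6100, 7000, 8200, 9500, 11000]

model = fit_boosting(ages, charges)
baseline_mse = mse(charges, [sum(charges) / len(charges)] * len(charges))
boosted_mse = mse(charges, [predict(model, a) for a in ages])
print(baseline_mse, boosted_mse)
```

Each round shrinks the training error a little, which is why the boosted error ends up below that of the constant mean prediction.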

Figure 3: Accuracy in percentage (%)