SSRN Id4366801

Health Insurance Cost Prediction by Using Machine Learning
Ajay Kumar Sahu 1*, Gopal Sharma 2, Janvi Kaushik 3, Kajal Agrawal 4, Devendra Singh 5
[email protected] 1, [email protected], [email protected], [email protected], [email protected]
1*, 2, 3, 4, 5 Greater Noida Institute of Technology, Greater Noida 201310, India
ABSTRACT:
A large portion of the economy is devoted to paying for health care: spending on healthcare accounts for around 30% of the GDP. Health spending in developed countries is the greatest, both in absolute terms and as a percentage of the economy. Through its Medicare programme, the government foots a sizable percentage of the older population's medical costs. The rising cost of health care, paired with the baby boomer generation's impending retirement and subsequent eligibility for Medicare, places a significant load on the exchequer. It is therefore imperative to use every available tool to limit health-related costs. In this study, we create a method for predicting medical costs using machine learning algorithms, which will help direct patients toward affordable options. The technology can also help policymakers identify which providers are often more expensive and, if required, take punitive action. The Random Forest Regression algorithm is used to forecast the cost of medical care. We also run experiments with other machine learning models, such as Gradient Boosted Trees and Linear Regression, on the same data and compare the outcomes. Early estimation of health insurance costs can help individuals consider the required amount more thoughtfully. Additionally, people may be vulnerable to being misled into paying for expensive health insurance that they don't need. Our research does not provide an exact amount required by any specific health insurance provider, but it does give a general sense of the cost a person may incur for their own health insurance.

INTRODUCTION:
The goal of this research is to help individuals understand the amount of money they may need for health insurance based on their personal health status. This can help them focus on the health-related aspects of insurance rather than the unnecessary ones. In the modern world it is essential to have health insurance, and most people have a relationship with a public or private health insurance provider. The factors that influence insurance costs vary from company to company. Some people in rural areas may not be aware that the Indian government offers free health insurance to those who are below the poverty line. However, the process can be complex, and some rural residents either buy private health insurance or make no investment at all. People may also be vulnerable to being misled into paying for expensive health insurance that they don't need. Our research does not provide an exact amount required by any specific health insurance provider, but it does give a general sense of the cost a person may incur for their own health insurance. This is a preliminary estimate that is not tied to any particular company, so it should not be the only factor considered when choosing health insurance. Early estimation of health insurance costs can help individuals consider the required amount more thoughtfully. When a person can determine the ...
KEYWORDS:
Economy, Policy Makers, Machine Learning, Regression
Fig 1: Data Sets [1]
Electronic copy available at: https://ssrn.com/abstract=4366801

INPUT DATA USED:
The dataset used for training and testing can be accessed on the Kaggle website. It is saved in a CSV file and is well organized, and it is available at the link given in the references for those interested in using it.
No. of columns = 7
No. of rows = 1338
Total number of values = 9366 (1338 rows × 7 columns)
To predict the cost of health insurance accurately, it is necessary to clean the dataset before applying regression algorithms. The data shows that age and smoking status have the most significant impact on the amount of insurance, with smoking having the greatest effect. The outcomes of these insights can be seen in Fig 1. However, factors such as family medical history, BMI, marital status, and geography also play a role. The children attribute was found to have little impact on the prediction, so it was removed from the input for the regression model to improve efficiency and accuracy. This is a preliminary estimate and does not adhere to any company. With such an estimate, individuals will be able to make a more informed decision.

CONCEPT USED:

Machine Learning:
Machine Learning is a subset of computer science and AI that uses data and algorithms to replicate the way that humans learn. These algorithms make classifications or predictions using statistical techniques, which can uncover key insights in data mining processes. If used correctly, these insights can have a significant impact on key growth indicators in businesses and applications (S. Ramakrishnan, 2016).

Linear Regression Algorithm:
Linear regression is a supervised machine learning algorithm. It predicts the value of a dependent variable (y) from the value of an independent variable (x). In other words, it estimates how closely the dependent variable is related to the independent variable and then makes predictions based on that relationship (H. Goldstein, 2012). This makes it a useful tool for data analysis: it allows analysts to understand patterns and relationships in data and to make more accurate predictions about future outcomes.

Support Vector Machine Algorithm:
SVM, or Support Vector Machine, is a widely used supervised learning algorithm for classification and regression problems, though it is used mainly for classification (T. Han, 2020). The goal of SVM is to determine the best line or decision boundary that separates a multi-dimensional space into different classes, so that new data points can be classified efficiently. This optimal decision boundary is known as a hyperplane, and it is constructed from the extreme points of the data, called support vectors. SVM is a popular choice because it classifies data effectively and handles high-dimensional spaces.

Random Forest Regression:
The Random Forest approach builds multiple decision trees from bootstrap samples of the data and combines them through ensemble learning. Averaging the results of the randomly built trees often leads to accurate predictions and classifications (X. Zhu, C. Ying, J. Wang, 2021).

Gradient Boosting Algorithm:
Gradient boosting is a highly popular machine learning technique for analyzing tabular data sets. It is well known for its ability to handle missing values, outliers, and high-cardinality categorical features, and to detect nonlinear relationships between the target and the features. This makes it a powerful tool for data analysis and prediction (Douglas C Montgomery, 2012).
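To make the idea behind gradient boosting concrete, the following is a minimal sketch in plain Python of boosting for squared error: each round fits a one-split tree (a "stump") to the current residuals and adds a damped copy of it to the prediction. It is an illustration only and not the implementation used in this study, which the text does not give. The function names, the single `age` feature, and the charge values are all hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Pick the threshold on x that best fits the residuals with two constants.
    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_gradient_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def gb_predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)

# Invented toy data (charges in thousands), NOT taken from the Kaggle dataset.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60]
charges = [2.0, 2.2, 2.9, 3.5, 4.8, 6.5, 9.0, 12.0, 16.0]

model = fit_gradient_boosting(ages, charges)
baseline_mse = mean([(y - mean(charges)) ** 2 for y in charges])
boosted_mse = mean([(y - gb_predict(model, x)) ** 2 for x, y in zip(ages, charges)])
print(f"mean-only MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Because every stump is fitted to what the current ensemble still gets wrong, the training error cannot increase from round to round, which is how the method picks up the nonlinear shape of the toy curve without any feature engineering. A library implementation would add multi-feature trees and regularisation.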
TRAINING AND PREDICTION

Training:
After the necessary data has been formatted and prepared, the model can begin its training and testing phases. A key focus during the training phase is choosing the appropriate model for the task at hand. This may involve deciding on the optimal modelling strategy or determining the best parameter values for a particular model (V. Roth, 2014). This process is sometimes referred to as model selection, because various models are tested and the one that performs best is ultimately chosen.

Prediction:
The model used for predicting the health insurance amount is based on the relationship between certain features and the label. The accuracy of the prediction was determined by the extent to which the predicted value matched the actual amount. To improve the accuracy, various features, methods, and train-test split sizes were tried. The amount of data used for training had a significant impact on the accuracy, with a larger train size leading to better results. Multiple algorithms were used to forecast the premium amount, which also showed how each attribute affected the outcome (Kaggle, Regression data).

RESULTS:
The following results were obtained in prediction:

Linear Regression Algorithm:
The accuracy of the Linear Regression Algorithm is 78.334% (Bertsimas, M. V. Bjarnadóttir, 2018).

Support Vector Machine:
The accuracy of the Support Vector Machine Algorithm is 7.229%.

Random Forest Regression:
The accuracy of Random Forest Regression is 87.006% (Stucki, Finland, 2019). As noted above, a larger train size led to better results (Kenward, J.A., 2019).

Gradient Boosting Algorithm:
The accuracy of the Gradient Boosting Algorithm is 87.776%.

From Fig 2 we can see that the best algorithm for predicting the amount is the Gradient Boosting Algorithm, with the highest accuracy of 87.776% (H. Demirtas, J. Stat Soft.).

Fig 2: Actual and Predicted Price

CONCLUSION AND FUTURE SCOPE:

Conclusion:
Gradient Boosting Decision Tree Regression had the highest accuracy for predicting the amount, with a score of 87.776% (Tian Jinyu, 2019). Linear regression and random forest also performed well, with accuracies of 78.334% and 87.006% respectively, while Support Vector Machine, at 7.229%, did not perform well and was not considered a reliable predictor in this case (G. Reddy, S. Bhattacharya, 2020). When all four attributes were considered, Gradient Boosting Regression was determined to be the best model due to its high accuracy rate.

Future Scope:
The Random Forest algorithm introduces randomness into the feature selection process, which can improve prediction accuracy (Ostertagova, 2012). Currently, the algorithm is trained and tested on a dataset of only about a thousand records (Donald W. Marquardt, 2012). To assess the scalability of the system, it would be beneficial to test it in the future on a dataset with at least a million records. Distributed frameworks like Spark and Hadoop can be used to handle large amounts of data and enhance the scalability of the system.
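The accuracy percentages reported in the RESULTS are not defined in the text. For regression, a common choice is the coefficient of determination (R²) computed on a held-out test split, and the sketch below assumes that reading; it is an assumption, not a statement of how the figures above were obtained. The helper names and the (age, charge) pairs are hypothetical and the data is synthetic.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Synthetic (age, charge) pairs, NOT taken from the Kaggle dataset.
rng = random.Random(42)
data = [(age, 250.0 * age + rng.gauss(0, 1500)) for age in range(18, 65)]

train, test = train_test_split(data, test_fraction=0.2)
a, b = fit_line(train)
score = r2_score([y for _, y in test], [a + b * x for x, _ in test])
print(f"R^2 on held-out data: {score:.3f} ({score * 100:.1f}%)")
```

Under this reading, a score near 100% means the predictions track the actual amounts closely, a score near 0% means the model does no better than always predicting the mean, and the score can be negative for a model that does worse than the mean. Varying `test_fraction` is one way to examine how the train size affects the score.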

REFERENCES:
Kaggle Medical Cost Prediction datasets, Kaggle Inclusion, Kaggle.com.
“An emerging trend of big data analytics with health insurance in our country,” IEEE, February 2016.
H. Goldstein, W. Browne and J. Rasbash, “Multilevel modelling of medical data,” Statistics in Medicine, John Wiley and Sons, vol. 21, no. 21, pp. 3291–3315, 2012.
T. Han, A. Siddique, K. Khayat, J. Huang and A. Kumar, “An ensemble machine learning approach for prediction and optimization of modulus of elasticity of recycled aggregate concrete,” Construction and Building Materials, vol. 244, pp. 118–271, 2020.
X. Zhu, C. Ying, J. Wang, J. Li, X. Lai et al., “Ensemble of ML-kNN for classification algorithm recommendation,” Knowledge-Based Systems, vol. 106, pp. 933, 2021.
Douglas C Montgomery, Elizabeth A Peck and G Geoffrey Vining, “Introduction to linear regression analysis,” John Wiley & Sons, vol. 821, 2012.
V. Roth, “The generalised LASSO,” IEEE Transactions on Neural Networks, vol. 15, pp. 16–28, 2014.
Medical Cost Prediction Dataset, [Online]. Available: https://www.kaggle.com/hely333/eda-regression/data
Bertsimas, M. V. Bjarnadóttir, M. A. Kane, J. C. Kryder, R. Pandey, S. Vempala, and G. Wang, Operations Research, vol. 56, no. 6, pp. 1382–1392, 2018.
O. Stucki, “Predicting the customer churn with machine learning methods: case: private insurance customer data,” Master’s dissertation, LUT University, Lappeenranta, Finland, 2019.
J. A. Sterne, I. R. White, J. B. Carlin, M. Spratt, P. Royston, Kenward, J. R. Carpenter, “Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls,” BMJ, vol. 338, 2019.
H. Demirtas, “Flexible Imputation of Missing Data,” J. Stat. Soft., vol. 85, no. 4, pp. 1–5, Jul. 2018. DOI: 10.18637/jss.v085.b04.
G. Reddy, S. Bhattacharya, S. Ramakrishnan, C. L. Chowdhary, S. Hakak et al., “An ensemble-based machine learning model for diabetic retinopathy classification,” in 2020 Int. Conf. on Emerging Trends in Information Technology and Engineering, IC-ETITE, VIT Vellore, IEEE, pp. 1–6, 2020.
Tian Jinyu, Zhao Xin et al., “Apply multiple linear regression model to predict the audit opinion,” in 2009 ISECS International Colloquium on Computing, Communication, Control, and Management, IEEE, pp. 1–6, 2019.
Ostertagova et al., “Modelling using Polynomial Regression,” vol. 48, pp. 500–506, 2012.
Donald W. Marquardt, Ronald D. Snee et al., “Ridge Regression in Practice,” The American Statistician, vol. 29, pp. 3–20, 2012.
