SSRN Id4366801
Ajay Kumar Sahu 1*, Gopal Sharma 2, Janvi Kaushik 3, Kajal Agrawal 4, Devendra Singh 5
[email protected] 1, [email protected], [email protected], [email protected], [email protected]
1*, 2, 3, 4, 5 Greater Noida Institute of Technology, Greater Noida 201310, India
ABSTRACT:
A large portion of the economy is devoted to paying for health care.
Spending on healthcare accounts for around 30% of the GDP. In
terms of both absolute spending and as a percentage of the economy,
health spending in developed countries is the greatest. Through its
Medicare programme, the government foots a sizable percentage
of the older population's medical costs. A significant load is placed
on the exchequer by the rising cost of health care paired with the
baby boomer generation's impending retirement and subsequent
eligibility for Medicare. Therefore, it is imperative to use every
available tool to limit health-related costs. In this study, we create
a method for predicting medical costs using machine learning
algorithms, which will help direct patients toward affordable care. The
technology can also help policymakers identify which providers are
often more expensive and, if required, take punitive action. The
Random Forest Regression algorithm will be used to forecast the cost
of medical care. We also intend to run experiments using different
machine learning models, like Gradient Boosted Trees and Linear
Regression, on the same data and compare the outcomes. Early
estimation of health insurance costs can help individuals consider the
required amount more thoughtfully. Additionally, people may be
vulnerable to being misled into paying for expensive health insurance
that they don't need. Our research does not provide an exact amount
required by any specific health insurance provider, but it does give a
general sense of the cost a person may incur for their own health
insurance.

INTRODUCTION:
The goal of this research is to help individuals understand the amount
of money they may need for health insurance based on their personal
health status. This can assist individuals in focusing more on the
health-related aspects of insurance rather than the unnecessary ones.
In the modern world, it is essential to have health insurance, and most
people have a relationship with a public or private health insurance
provider. The factors that influence insurance costs vary from company
to company. Additionally, some people in rural areas may not be aware
that the Indian government offers free health insurance to those who are
below the poverty line. However, the process can be complex, and some
rural residents either get private health insurance or make no investment
at all. Additionally, people may be vulnerable to being misled into
paying for expensive health insurance that they don't need. Our research
does not provide an exact amount required by any specific health
insurance provider, but it does give a general sense of the cost a person
may incur for their own health insurance. This is a preliminary estimate
and does not adhere to any particular company, so it should not be the
only factor considered when choosing health insurance. Early
estimation of health insurance costs can help individuals consider the
required amount more thoughtfully. When a person can determine the
KEYWORDS:
Economy,
Policy Makers,
Machine Learning,
Regression
No. of columns=7
1338 rows total. Total number = 9366

Essentially, this means that linear regression is used to determine how
closely a dependent variable is related to an independent variable, and
then make predictions based on that relationship (H. Goldstein, 2012).
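As a minimal sketch of the idea described above, the following fits a one-variable least-squares line and reports how closely the dependent variable follows the independent one. The (age, charge) pairs and the helper names fit_line and r_squared are invented for illustration; they are not taken from the dataset or from the authors' implementation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical (age, annual charge) pairs, invented for illustration only.
ages = [19, 25, 33, 41, 52, 60]
charges = [1800.0, 2900.0, 4100.0, 6000.0, 9800.0, 12500.0]

slope, intercept = fit_line(ages, charges)
print(f"charge ~= {slope:.1f} * age + {intercept:.1f}")
print(f"R^2 = {r_squared(ages, charges, slope, intercept):.3f}")
print(f"predicted charge at age 45: {slope * 45 + intercept:.1f}")
```

The R^2 value plays the role of "how closely related" in the text: it is 1 when the points lie exactly on the line and approaches 0 when the line explains almost none of the variation.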
CONCEPT USED:
Machine Learning:
Machine Learning is a subset of computer science and AI that
involves using data and algorithms to replicate the way that humans
learn. These algorithms are designed to make classifications or
predictions using statistical techniques, which can uncover key
insights in data mining processes. The outcomes from these insights
can have a significant impact on key growth indicators in
businesses and applications, if used correctly (S. Ramakrishnan,
2016). The data shows that age and smoking status have the most
significant impact on the amount of insurance, with smoking
having the greatest effect. However, factors such as family medical
history, BMI, marital status, and geography also play a role.

Random Forest Regression:
The Random Forest approach utilizing bootstrapping involves the use of
multiple decision trees generated from the data and combined through
ensemble learning techniques. This method often leads to accurate
predictions and classifications by averaging the results of the randomly
selected trees (X. Zhu, C. Ying, J. Wang, 2021).

Gradient Boosting Algorithm:
Gradient boosting is a highly popular machine learning technique
for analyzing tabular data sets. It is well-known for its ability to
handle missing values, outliers, and large categorical values in the
features, as well as its ability to detect nonlinear relationships
between the target and the features. This makes it a powerful tool
for data analysis and prediction (Douglas C Montgomery, 2012).

A key focus during the training phase is choosing the appropriate model
for the task at hand. This may involve deciding on the optimal modelling
strategy or determining the best parameter values for a particular model
(V. Roth, 2014).
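The two ensemble ideas above can be contrasted in a small sketch. It is illustrative only and is not the authors' implementation: depth-one trees (stumps) stand in for full decision trees to keep the code short, the data are synthetic, and every name, coefficient, and parameter value below is an assumption. The forest averages stumps fitted on bootstrap samples with random feature subsets, while boosting fits each new stump to the residuals left by the previous ones.

```python
import random

def avg(values):
    return sum(values) / len(values)

def fit_stump(X, y, features):
    """Depth-one regression tree: best (feature, threshold) split by squared error."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X})[:-1]:  # drop max so both sides are non-empty
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            lm, rm = avg(left), avg(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # every candidate feature is constant: predict the mean
        return (0, float("inf"), avg(y), avg(y))
    return best[1:]

def predict_stump(stump, row):
    f, t, left_mean, right_mean = stump
    return left_mean if row[f] <= t else right_mean

def fit_forest(X, y, n_trees=40, n_features=2, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        feats = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    return avg([predict_stump(s, row) for s in forest])

def fit_boosting(X, y, n_rounds=40, learning_rate=0.1):
    """Boosting for squared error: each stump is fitted to the current residuals."""
    base = avg(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]
        stump = fit_stump(X, residuals, range(len(X[0])))
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict_boosting(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

def r_squared(y_true, y_pred):
    y_bar = avg(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

# Synthetic records [age, bmi, smoker]; the cost formula is invented for illustration.
rng = random.Random(0)
X, y = [], []
for _ in range(150):
    age, bmi, smoker = rng.randint(18, 64), rng.randint(18, 40), int(rng.random() < 0.2)
    X.append([age, bmi, smoker])
    y.append(250 * age + 300 * bmi + 20000 * smoker + rng.gauss(0, 1500))

X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]
forest = fit_forest(X_train, y_train)
boosted = fit_boosting(X_train, y_train)
r2_forest = r_squared(y_test, [predict_forest(forest, row) for row in X_test])
r2_boosted = r_squared(y_test, [predict_boosting(boosted, row) for row in X_test])
print(f"forest R^2 = {r2_forest:.3f}, boosting R^2 = {r2_boosted:.3f}")
```

Choosing n_trees, n_rounds, or learning_rate by comparing held-out scores in this way is one simple form of the parameter selection mentioned above.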
Prediction:
The model used for predicting the insurance sum for health was
based on the relationship between certain features and the label.
The accuracy of this prediction was determined by the extent to
which the expected value matched the actual amount.
The amount of data used for training had a significant impact on the
accuracy, with a larger train size leading to better results.
The model also employed multiple algorithms in order to forecast
the premium amount, and showed how each attribute affected the
outcome (Kaggle, Regression data).

RESULTS:
The following results can be seen in Prediction:
Linear Regression Algorithm:
The accuracy of the Linear Regression Algorithm is
78.334% (Bertsimas, M.V. otter, 2018).
Support Vector Machine:
The accuracy of the Support Vector Machine Algorithm is 7.229%.
Random Forest Regression:
The accuracy of Random Forest Regression is 87.006% (Stucki,
Finland, 2019). It was found that the amount of data used for
training had a significant impact on the accuracy, with a larger train
size leading to better results (Kenward, J.A., 2019).
Gradient Boosting Algorithm:

CONCLUSION AND FUTURE SCOPE:
Conclusion:
It was found that Gradient Boosting Decision Tree Regression had the
highest accuracy rate for predicting the amount, with a score of 87.776%
(Tian Jinyu, 2019). While linear regression and random forest were able
to make correct predictions about 80% of the time, Support Vector
Machine did not perform well and was not considered a reliable predictor
in this case (G. Reddy, S. Bhattacharya, 2020).
When all four attributes were considered, Gradient Boosting Regression
was determined to be the best model due to its high accuracy rate. The
accuracy of this prediction was determined by the extent to which the
expected value matched the actual amount.
Future Scope:
The use of the Random Forest algorithm allows for the introduction
of unpredictability in the feature selection process, which can
improve prediction accuracy (Ostertagova, 2012).
In order to assess the scalability of the system, it would be
beneficial to test it on a dataset with at least a million records in the
future. Distributed frameworks like Spark and Hadoop can be
utilized to handle large amounts of data and enhance the scalability
of the system.
Currently, the algorithm is being trained and tested using thousands
of records (Donald W. Marquardt, 2012).
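
To make the scalability point concrete, the following sketch shows the map-and-reduce pattern that frameworks such as Spark and Hadoop apply across machines, here run in plain Python on one machine. It is an illustration under stated assumptions, not a Spark or Hadoop program: the function names, the three-way partitioning, and the records are hypothetical. Each partition is summarised by a few sums, the summaries are added together, and a one-variable least-squares line is solved from the combined totals, so no single step needs all records in memory.

```python
from functools import reduce

def partial_stats(chunk):
    """Map step: sufficient statistics for one partition of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Reduce step: statistics from two partitions simply add."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Least-squares slope and intercept from the combined statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical records split into three partitions, as a cluster might hold them.
records = [(float(x), 3.0 * x + 10.0) for x in range(30)]
partitions = [records[0:10], records[10:20], records[20:30]]

slope, intercept = solve(reduce(merge, map(partial_stats, partitions)))
print(slope, intercept)
```

Because the per-partition summaries have a fixed size, the cost of combining them does not grow with the number of records, which is what makes this pattern suitable for datasets far larger than the current one.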