A Project Report


A PROJECT REPORT

ON
MONTHLY INSURANCE PREDICTOR
SUBMITTED FOR THE PARTIAL FULFILLMENT OF AWARD OF

BACHELOR OF TECHNOLOGY
DEGREE IN

COMPUTER SCIENCE AND ENGG.


UNDER THE CORE GUIDANCE OF
MR. SRI NATH DWIVEDI SIR

MR. SARVESH SIR

DR. A. P. J ABDUL KALAM TECHNICAL UNIVERSITY, LUCKNOW


(SESSION 2023 – 24)

SUBMITTED BY:
ANJALI VERMA (2101660100014)
PRIYANSHI (2101660100045)
SANVI SINGH (2101660100050)
Project Abstract

This project aims to develop an automated system for estimating annual medical expenditures
for new customers of ACME Insurance Inc., a leading health insurance provider in the United
States. The primary objective is to utilize machine learning techniques, specifically focusing
on linear regression, to predict annual medical charges based on relevant customer
information, including age, sex, BMI, number of children, smoking habits, and region of
residence.

The project will follow a practical and coding-focused approach, beginning with the
definition and application of machine learning and linear regression in the context of the
given problem. The goal is to create a robust predictive model that can accurately estimate
medical expenditures, allowing ACME Insurance Inc. to determine appropriate annual
insurance premiums for individual customers. The model's transparency is crucial to meet
regulatory requirements, ensuring that explanations for each prediction can be provided.

The dataset, comprising verified historical data with information on customer attributes and
corresponding medical charges for over 1300 individuals, will serve as the foundation for
model training and evaluation. The project will involve data exploration, preprocessing,
feature engineering, and model training, with a focus on achieving both accuracy and
interpretability.

Throughout the development process, emphasis will be placed on creating a well-documented
and reproducible workflow. The final model will be validated using appropriate metrics, and
its performance will be assessed against real-world data. The project's outcomes will not only
provide ACME Insurance Inc. with a valuable tool for estimating medical expenditures but
will also contribute to the broader understanding of machine learning applications in the
health insurance domain.
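
The transparency requirement can be illustrated with a minimal sketch. A linear model's prediction is an intercept plus one additive term per feature, so each estimate can be broken down term by term. The coefficient values, the `smoker_yes` feature name, and the `explain_prediction` helper below are hypothetical placeholders chosen for illustration; they are not values fitted to the dataset.

```python
# Hypothetical coefficients for illustration only; a real model would learn
# these from the training data.
INTERCEPT = -2000.0
COEFFICIENTS = {
    "age": 250.0,
    "bmi": 300.0,
    "children": 400.0,
    "smoker_yes": 20000.0,  # 1.0 if the customer smokes, else 0.0
}


def explain_prediction(customer):
    """Return (prediction, contributions) for one customer.

    `contributions` maps each feature to coefficient * value, so the
    prediction equals the intercept plus the sum of the contributions.
    """
    contributions = {
        name: coef * customer[name] for name, coef in COEFFICIENTS.items()
    }
    prediction = INTERCEPT + sum(contributions.values())
    return prediction, contributions


# A made-up customer record, used only to show the breakdown.
customer = {"age": 40, "bmi": 30.0, "children": 2, "smoker_yes": 0.0}
prediction, contributions = explain_prediction(customer)
for name, value in contributions.items():
    print(f"{name:>12}: {value:10.2f}")
print(f"{'intercept':>12}: {INTERCEPT:10.2f}")
print(f"{'prediction':>12}: {prediction:10.2f}")
```

Because every term is visible, the share of an estimate attributable to age, BMI, or smoking status can be reported alongside the prediction itself.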
DATASET UTILIZED BY OUR MODEL:

The dataset utilized for the aforementioned problem statement is sourced from
https://github.com/stedy/Machine-Learning-with-R-datasets and is available in CSV format.
This dataset comprises verified historical data related to health insurance, containing
information on various customer attributes and the actual medical charges incurred by over
1300 individuals. The dataset is instrumental in training and evaluating the machine learning
model for estimating annual medical expenditures.

The dataset includes the following key features:

• Age: The age of the customer in years, providing insight into how age correlates with
medical charges.
• Sex: Gender information, which could influence healthcare needs and expenses.
• BMI (Body Mass Index): A numerical value derived from the customer's weight and
height, serving as an indicator of body fat and potential health risks.
• Children: The number of children or dependents covered by the insurance plan,
impacting overall family medical expenses.
• Smoker: A binary indicator specifying whether the customer is a smoker or
non-smoker, as smoking habits significantly affect health and medical costs.
• Region: The geographic region of residence of the customer, allowing exploration of
regional variations in healthcare costs.
• Charges: The target variable, representing the actual medical charges incurred by each
customer. This variable serves as the ground truth for training the machine learning
model.

The dataset is structured in tabular form, with each row representing an individual customer
and each column corresponding to a specific attribute. This tabular structure facilitates data
manipulation, exploration, and analysis. The historical nature of the data provides a diverse
range of scenarios for model training, enabling the development of a robust and generalizable
predictive model.

Throughout the project, the dataset will undergo preprocessing steps such as handling
missing values, scaling numerical features, and encoding categorical variables. Exploratory
data analysis will be conducted to gain insights into the distribution and relationships
between different features. The model will then be trained on a subset of the data, and its
performance will be evaluated using appropriate evaluation metrics to ensure its effectiveness
in estimating annual medical expenditures for new customers.
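
The preprocessing, training, and evaluation steps described above can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the project's final code. The data is synthetic and generated by a made-up formula, the lowercase column names are assumed to mirror the feature list, and the median imputation, 80/20 split, and RMSE and R^2 metrics are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real dataset): the values and the formula
# used to generate `charges` are made up for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.normal(30.0, 6.0, size=n),
    "children": rng.integers(0, 6, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})
df["charges"] = (
    250.0 * df["age"]
    + 300.0 * df["bmi"]
    + 20000.0 * (df["smoker"] == "yes")
    + rng.normal(0.0, 2000.0, size=n)
)

numeric_cols = ["age", "bmi", "children"]
categorical_cols = ["sex", "smoker", "region"]

# Numeric features: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical features: one-hot encode each category.
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols],
    df["charges"],
    test_size=0.2,
    random_state=42,
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.2f}  R^2: {r2:.3f}")
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking information from the held-out rows into the evaluation.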
TECH STACK USED:

The development of the automated system for estimating annual medical expenditures
involves the use of various technologies and tools to address different stages of the machine
learning model formation process. The following tech stack is employed for this project:

Programming Language: Python

Utilization:
Python is a versatile and widely used programming language in the field of machine learning
and data science. It offers an extensive ecosystem of libraries and frameworks that simplify
tasks related to data manipulation, analysis, and machine learning model development.

Data Manipulation and Analysis: Pandas

Utilization:
Pandas is a powerful data manipulation library in Python. It is utilized for loading the dataset,
handling missing values, conducting exploratory data analysis, and preparing the data for
model training. Pandas provides convenient data structures, such as DataFrames, for efficient
manipulation and analysis of tabular data.
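
As a rough illustration of these Pandas tasks, the sketch below loads a few rows into a DataFrame, counts missing values, fills a gap, and compares groups. The four rows are made up for illustration and are not taken from the dataset; the lowercase column names are assumed to follow the feature list given in the dataset section.

```python
import io

import pandas as pd

# Illustrative rows only (made up, not copied from the dataset).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northeast,3000.0
40,male,31.0,2,yes,southeast,30000.0
33,male,,1,no,southwest,5000.0
58,female,28.4,3,no,northwest,12000.0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count missing values per column, then fill the numeric gap with the median.
missing_per_column = df.isna().sum()
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Summary statistics and a simple group comparison.
print(df.describe())
mean_charges_by_smoker = df.groupby("smoker")["charges"].mean()
print(mean_charges_by_smoker)
```

Median imputation is only one possible way to handle a missing BMI; dropping the row or using a model-based estimate are alternatives.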

Data Visualization: Matplotlib and Seaborn

Utilization:
Matplotlib and Seaborn are Python libraries for creating visualizations. They are employed to
generate insightful plots and graphs during the exploratory data analysis phase, helping to
understand the distribution of features, relationships between variables, and potential patterns
in the data.

Machine Learning Library: Scikit-learn

Utilization:
Scikit-learn is a comprehensive machine learning library in Python. It is utilized for
implementing the linear regression model, handling model training, and conducting model
evaluation. Scikit-learn provides easy-to-use functions for splitting data, preprocessing, and
implementing various machine learning algorithms.

Jupyter Notebooks

Utilization:
Jupyter Notebooks provide an interactive and collaborative environment for developing and
documenting code. They are used to create a step-by-step workflow, allowing for easy
iteration and documentation of the machine learning model development process. Jupyter
Notebooks support both code execution and the inclusion of explanatory text, making them
ideal for sharing insights and results.

Version Control: Git and GitHub

Utilization:
Git is employed for version control, enabling the tracking of changes to the codebase
throughout the project. GitHub, as a web-based platform, facilitates collaboration and serves
as a centralized repository for storing and sharing code, datasets, and project documentation.

Online Data Source: GitHub Raw Data URL

Utilization:
The dataset is hosted on GitHub, and its raw data URL is used for easy and direct access to
the dataset within the Python code. This approach allows seamless integration of the dataset
into the analysis and model development pipeline without the need for manual downloads.
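
Reading the file straight from a raw URL might look like the sketch below. The exact raw path in `DATA_URL` (branch and file name) is an assumption based on the repository cited in this report and should be verified before use. The `load_insurance` helper and the `EXPECTED_COLUMNS` list are hypothetical names introduced here for illustration.

```python
import io

import pandas as pd

# Assumed raw-file URL for the cited repository; the branch and file name
# are assumptions and should be checked before use.
DATA_URL = (
    "https://raw.githubusercontent.com/stedy/"
    "Machine-Learning-with-R-datasets/master/insurance.csv"
)
# Assumed column names, mirroring the feature list in the dataset section.
EXPECTED_COLUMNS = [
    "age", "sex", "bmi", "children", "smoker", "region", "charges",
]


def load_insurance(source=DATA_URL):
    """Read the CSV from a URL, path, or file-like object and check columns."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


# Offline usage with a one-row, made-up sample instead of the network call.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,northwest,4000.0\n"
)
df = load_insurance(sample)
print(df.shape)
```

Calling `load_insurance()` with no argument would attempt the download. Checking the columns up front makes a changed or mistyped URL fail early rather than partway through the analysis.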

By leveraging this tech stack, the project ensures a streamlined and efficient workflow for
developing, training, and evaluating a machine learning model to estimate annual medical
expenditures for new customers based on the provided dataset and problem statement.
