Handwriting Recognition: Chappidi Aswarta Reddy (Urk18Cs080)
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
MARCH-2021
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that the project report entitled, “Diabetes Prediction using Machine
Learning” is a bonafide record of Mini Project work done during the even semester of the
academic year 2020-2021 by
in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science and Engineering of Karunya Institute of Technology and Sciences.
ACKNOWLEDGEMENT
First and foremost, I praise and thank ALMIGHTY GOD whose blessings have bestowed
I am grateful to our beloved founders Late Dr. D.G.S. Dhinakaran, C.A.I.I.B., Ph.D., and
Dr. Paul Dhinakaran, M.B.A., Ph.D., for their love and for always remembering us in their prayers.
I extend my thanks to our Vice Chancellor Dr. P. Mannar Jawahar, Ph.D., and our
Registrar Dr. Elijah Blessing, M.E., Ph.D., for giving me this opportunity to do the project.
I would like to thank Dr. Prince Arulraj, M.E., Ph.D., Dean, School of Engineering and
Technology, for his direction and invaluable support to complete the same.
I would like to place my heart-felt thanks and gratitude to Dr. J. Immanuel John Raja,
Department of Computer Science and Engineering, and Dr. E. Bijolin Edwin, M.E., Ph.D., for their
guidance.
I also thank all the staff members of the Department for extending their helping hands to
I would also like to thank all my friends and my parents who have prayed and helped me
ABSTRACT
Diabetes is a chronic disease with the potential to cause a worldwide health care crisis. According
to the International Diabetes Federation, 382 million people are living with diabetes worldwide,
and this number is projected to rise to 592 million by 2035. Diabetes mellitus, or simply diabetes, is a
disease caused by an elevated level of blood glucose. Various traditional methods, based on
physical and chemical tests, are available for diagnosing diabetes. However, early prediction of
diabetes is a challenging task for medical practitioners because of the complex interdependence of
various factors: diabetes affects organs such as the kidneys, eyes, heart, nerves and feet.
Patient data is queried in real time: the current practice in hospitals is to collect the information required for
diabetes diagnosis through various tests and to provide appropriate treatment. Healthcare
industries hold large databases. Using big data analytics, one can study huge datasets,
find hidden information and patterns, discover knowledge from the data and predict
outcomes accordingly. In existing methods, the classification and prediction accuracy is not very high.
In this paper, we propose a diabetes prediction model for better classification of diabetes
that includes a few external factors responsible for diabetes along with regular factors such as
glucose, BMI, age and insulin.
Machine learning is an emerging scientific field in data science dealing with the ways in which
machines learn from experience. The aim of this project is to develop a system which can perform
early prediction of diabetes for a patient with higher accuracy by combining the results of
different machine learning techniques. The project aims to predict diabetes via three different
supervised machine learning methods: support vector machines (SVM), logistic regression and
artificial neural networks (ANN). It also aims to propose an effective technique for earlier
detection of diabetes.
CONTENTS
Acknowledgement
Abstract
1. Introduction
1.1 Introduction
1.2 Objectives
1.3 Motivation
1.4 Overview of the Project
1.5 Chapter wise Summary
2. Analysis and Design
2.1 Functional Requirements
2.2 Non-Functional Requirements
2.3 Architecture
2.4 Use case diagram
2.5 Sequence Diagram
3. Implementation
3.1 Modules Description
3.2 Implementation Details
3.3 Tools used
4. Test results/experiments/verification
4.1 Testing
4.2 Results
4.3 Verification
References
1. INTRODUCTION
1.1 INTRODUCTION
Diabetes is one of the deadliest diseases in the world. It is not only a disease in itself but also a cause of
other conditions such as heart attack, blindness and kidney disease. The usual
diagnostic process requires patients to visit a diagnostic center, consult their doctor, and wait
a day or more for their reports. Moreover, they have to pay every time they want a diagnosis
report.
Diabetes Mellitus (DM) is defined as a group of metabolic disorders mainly caused by abnormal
insulin secretion and/or action. Insulin deficiency results in elevated blood glucose levels
(hyperglycemia) and impaired metabolism of carbohydrates, fat and proteins. DM is one of the
most common endocrine disorders, affecting more than 200 million people worldwide. The onset
of diabetes is estimated to rise dramatically in the upcoming years. DM can be divided into
several distinct types. However, there are two major clinical types, type 1 diabetes (T1D) and
type 2 diabetes (T2D), according to the etiopathology of the disorder.
T2D appears to be the most common form of diabetes (90% of all diabetic patients), mainly
characterized by insulin resistance. The main causes of T2D include lifestyle, physical activity,
dietary habits and heredity, whereas T1D is thought to be due to autoimmunological destruction
of the Langerhans islets hosting pancreatic-β cells. T1D affects almost 10% of all diabetic
patients worldwide, with 10% of them ultimately developing idiopathic diabetes.
Other forms of DM, classified on the basis of insulin secretion profile and/or onset, include
Gestational Diabetes, endocrinopathies, MODY (Maturity Onset Diabetes of the Young),
neonatal, mitochondrial, and pregnancy diabetes. The symptoms of DM include polyuria,
polydipsia, and significant weight loss among others. Diagnosis depends on blood glucose levels
(fasting plasma glucose ≥ 7.0 mmol/L).
1.2 OBJECTIVES
The objectives of the diabetes prediction project are:
• To develop a system which can perform early prediction of diabetes for a
patient with higher accuracy by combining the results of different machine
learning techniques.
• To display the data in a pictorial and efficient manner so that the condition
is easy to understand.
• To visualize the diabetes dataset using bar graphs and other plots to show the
percentage of cases in particular states.
• To build a classifier prediction model to predict the status of recovered
cases and deaths.
1.3 Motivation
There has been a drastic increase in the rate of people suffering from diabetes over the
past decade. The current human lifestyle is the main reason behind the growth in diabetes. In
the current medical diagnosis method, there can be three different types of errors:
1. The false-negative type, in which a patient is in reality diabetic but the
test results say that the person does not have diabetes.
2. The false-positive type, in which the patient is in reality not diabetic but the
test reports say that he/she is.
3. The unclassifiable type, in which a system cannot diagnose a given case. This
happens when insufficient knowledge has been extracted from past data, so a given
patient may be predicted as unclassified.
However, in reality, every patient must be predicted to be either in the diabetic category
or in the non-diabetic category. Such errors in diagnosis may lead to unnecessary treatments or to no
treatment at all when it is required. In order to avoid or reduce the severity of such an impact,
there is a need to create a system using machine learning algorithms and data mining
techniques which will provide accurate results and reduce human effort.
1.4 OVERVIEW OF THE PROJECT
In this project we use a dataset containing state-wise diabetes data with
dates, total confirmed cases (national), total confirmed cases (foreign
national), deaths and cured cases. A second dataset, organized by country, is used to visualize the data by
plotting graphs and maps of the active cases. The second dataset is then
processed using libraries such as NumPy and seaborn and used for training. Finally, decision tree and
random forest regression models are fitted to the data, and we measure the accuracy
of their predictions. When a date is given as input, the model gives
the predicted number of cases as output. This is the overview of the project.
In the current chapter we gave a short introduction to the whole project, covering
the introduction, motivation and overview, which gives a basic idea of what
this project is, the concepts used in it and how those concepts are
implemented by the algorithms to achieve more efficient results.
The first chapter revolves around the basic introduction to the topic of diabetes
prediction: what diabetes is, where it comes from, where most of the cases in
the country were recorded and how it compares with other countries. It also covers the different machine learning techniques
used to address the problem and the challenges that motivated me to take up this topic.
The following chapter deals with the technical background involved
in building the project: the functional and non-functional requirements and the architecture of
the model.
The next chapter covers the modules and tools used to make this project successful and how
they are put together and implemented.
The chapter after that relates to the execution of the code and the details needed to make the project run.
The last part contains the references and discusses the current state of the project, how it may
develop in the future and what further improvements can be made.
2. ANALYSIS AND DESIGN
Python libraries
• NumPy
• Pandas
• Matplotlib (pyplot)
• LinearRegression
• scikit-learn (sklearn)
• RandomForestRegressor
• DecisionTreeRegressor
2.3 ARCHITECTURE
The architecture is a basic block-level outline of the whole system, intended for easy understanding of
the project. It is used as a guideline for further development of the project and as the basis for
the use case diagram and the sequence diagrams that follow. The first box represents
the data received from the user and the second box denotes the digitization stages.
The third box is the preprocessing stage, where the images are converted into gray scale and then
into a NumPy array for easy processing. Next, training on the datasets takes place. After
the image data has been converted into an array and the models have been trained on the datasets,
the system predicts the output for the ordinal date input. Finally, in the output
phase, all the predicted values are displayed.
This diagram shows the number of diabetes cases: how many people are infected, how many have
recovered from diabetes and, of those infected, how many died due to covid.
The diagram above uses data sources from WHO or kaggle.com. The data is
processed using Python, the machine learning diabetes prediction model is
trained on it, and the model gives the result and its accuracy.
Figure-2(Architecture diagram.2)
This is the architecture diagram for diabetes prediction. The system visualizes the data, preprocesses
it (renaming and separating it) and plots the graphs. It then trains machine learning
algorithms on the dataset, classifies it, takes a date as input to predict the number of cases, and
checks the accuracy of the algorithms.
2.4 USE CASE DIAGRAM
The use case diagram captures the system's functionality and requirements by using simple
figures of actors and use cases. Use cases model the tasks, services and functions that the program needs
to perform, and represent how a user will interact with the program. In this use case
diagram there are two actors: the user and the developer. The user
provides the input from the datasets and the image input, and sees the predicted output
value as well as the graph. The developer, on the other hand, has to get the dataset input and the
image input, convert them to a NumPy array, plot the graph, and view the predicted value as
well as the graph.
2.5 SEQUENCE DIAGRAM
A sequence diagram is an interaction diagram that shows the order in which objects
cooperate. It is a message sequence chart: it shows the interaction of objects
arranged in time sequence. It depicts the objects, classes and users involved in the scenario and
the sequence of messages exchanged between the objects to carry out the functionality of
the scenario.
Figure-5(Sequence diagram)
This diagram helps us to understand the sequence in which the activities are performed.
In the sequence diagram above, four lines come in from the user, two
of which are the datasets themselves; these go directly to the testing phase, where the
algorithm is run to train the program to predict. The input from the dataset is a date, from which we
can predict the number of cases.
3. IMPLEMENTATION
3.1 NORMALIZATION
Although the terms normalization and standardization are often used interchangeably, they are described separately here.
Normalization makes training less sensitive to the scale of the features, allowing us to solve for coefficients
more effectively. Since the numerical conditioning of the optimization problem is improved,
standardizing tends to produce a well-behaved training procedure.
3.2 STANDARDIZATION
The standard deviation is the square root of the variance. It is one way to measure the spread of the data.
Finding the mean is the first step in computing the variance. After that, the mean
is subtracted from each number and the differences are squared. The variance is the mean of these squared differences.
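The steps described above can be sketched in a few lines of NumPy. This is only an illustration: the function name standardize and the sample glucose values are hypothetical, and the sketch assumes the population variance (dividing by the number of values) and a feature that is not constant.

```python
import numpy as np

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()                       # step 1: find the mean
    squared_diffs = (x - mean) ** 2       # step 2: subtract the mean and square
    variance = squared_diffs.mean()       # step 3: average the squared differences
    std = np.sqrt(variance)               # standard deviation = sqrt(variance)
    return (x - mean) / std               # assumes std is not zero

# Hypothetical glucose readings, used only to illustrate the calculation.
glucose = [85.0, 90.0, 100.0, 125.0]
z = standardize(glucose)
print(z)
```

After standardization the feature has mean 0 and standard deviation 1, which is what makes features with different units comparable during training.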
NumPy
NumPy is a Python library that provides a simple yet powerful array data structure. In this project
we use it to perform basic numerical operations such as converting the image
data into an array.
Pandas
Pandas is a fast, powerful, flexible and easy-to-use open-source library for loading data
from datasets and manipulating the data in files. In this project we use this
library to import the datasets, which are in CSV format.
Matplotlib
Matplotlib is used for data visualization in Python, for plotting 2D graphs of arrays. It is a
multi-platform data visualization library built on NumPy arrays. In this project we use
this library to plot the final graph.
PyPlot
Pyplot is the plotting interface of Matplotlib and is used here for bar graphs. In this project we use it to
plot graphs of the number of deaths, cured cases and confirmed cases.
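A bar graph of this kind could be drawn roughly as follows. The counts below are made-up illustrative numbers, not values from the project's dataset, and the non-interactive "Agg" backend is chosen only so that the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical counts, for illustration only.
counts = {"Confirmed": 120, "Cured": 95, "Deaths": 5}

fig, ax = plt.subplots()
bars = ax.bar(list(counts.keys()), list(counts.values()))
ax.set_xlabel("Category")
ax.set_ylabel("Number of cases")
ax.set_title("Cases by category (illustrative data)")

# Render the figure to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In a Jupyter Notebook the figure would normally be shown inline with plt.show() instead of being written to a buffer.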
datetime
The datetime module supplies classes for manipulating dates and times. While date and
time arithmetic is supported, the focus of the implementation is on efficient attribute
extraction for output formatting and manipulation.
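Since the models in this report take a date as a numeric input, a minimal sketch of converting a date to an ordinal number with the standard datetime module is given here. The helper name to_ordinal and the example dates are assumptions made for illustration.

```python
from datetime import date

def to_ordinal(iso_date):
    """Convert a 'YYYY-MM-DD' string to its proleptic Gregorian ordinal."""
    return date.fromisoformat(iso_date).toordinal()

first = to_ordinal("2021-03-01")
second = to_ordinal("2021-03-02")

# Consecutive days differ by exactly 1, so the ordinal is a usable numeric feature.
print(second - first)
# The conversion is reversible.
print(date.fromordinal(first))
```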
Seaborn library
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
Decision Tree Regression:
Decision tree regression observes the features of an object and trains a model in the structure of a
tree to predict future data and produce meaningful continuous output. Continuous output
means that the output/result is not discrete, i.e., it is not restricted to a discrete, known set
of numbers or values.
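As a minimal sketch of the idea, the following fits scikit-learn's DecisionTreeRegressor to a small synthetic step-shaped dataset. The data, max_depth=2 and random_state=0 are assumptions chosen for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is 10 for inputs below 5 and 20 otherwise.
X = np.arange(10).reshape(-1, 1)
y = np.where(X.ravel() < 5, 10.0, 20.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Each prediction is the mean target value of the leaf the input falls into.
pred = tree.predict([[2], [7]])
print(pred)
```

A single split between 4 and 5 separates the two groups, so the tree reproduces the step exactly; real data would need deeper trees and would fit less cleanly.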
Linear Regression:
Linear regression is a kind of statistical analysis that attempts to show a relationship between two
variables. Linear regression looks at various data points and plots a trend line.
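For a single input variable, the trend line mentioned above is usually obtained by ordinary least squares. The notation below (x_i for inputs, y_i for observed outputs, bars for sample means) is introduced here for illustration and does not appear elsewhere in the report.

```latex
\[
\hat{y} = \beta_0 + \beta_1 x, \qquad
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
\]
```

Here \beta_1 is the slope of the trend line and \beta_0 is its intercept.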
Random Forest:
A random forest regressor is a meta estimator that fits a number of decision tree regressors
on various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control over-fitting.
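The averaging step can be made visible with a small sketch. The synthetic data, n_estimators=50 and the random seeds are assumptions for illustration; the point is only that the forest's prediction equals the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, roughly linear data with noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Compare the forest's prediction with the average over its trees.
x_new = np.array([[5.0]])
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
forest_pred = forest.predict(x_new)[0]
print(per_tree.mean(), forest_pred)
```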
Importing the csv dataset
Having installed and imported the necessary libraries, we are ready to move on to the
next stage of implementation, which is importing the datasets, stored in CSV format, from the directory.
We import three datasets for this project: one with country data for
visualization, one for India covid cases and one for state-wise testing details. The CSV files
contain comma-separated values, as the name suggests, and there are 315 rows × 5 columns with
an index column at the beginning and a varying number of cases. After importing the
dataset, we convert it into a NumPy array using the NumPy library so that it can be read by the
program for manipulation.
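A minimal sketch of this import step is shown here. The column names and the three rows of data are hypothetical stand-ins for the real files, and the CSV is read from an in-memory string so that the example is self-contained; in the project the argument would be the path to the CSV file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for one of the project's files.
csv_text = """Date,Confirmed,Cured,Deaths
2021-03-01,10,2,0
2021-03-02,15,4,1
2021-03-03,23,9,1
"""

# Read the CSV into a DataFrame and parse the date column.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Convert the numeric columns to a NumPy array for further processing.
values = df[["Confirmed", "Cured", "Deaths"]].to_numpy()
print(df.shape, values.shape)
```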
Training the data set
After importing the datasets and converting them into arrays, the next step is to
use the decision tree classifier from the scikit-learn library to train the program using one half of
the dataset (say). The other half is left for the testing phase.
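The half-and-half split described above could be done with scikit-learn's train_test_split, as in this sketch. The toy arrays and random_state=0 are assumptions; only test_size=0.5 reflects the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with one feature each (illustrative only).
X = np.arange(20).reshape(-1, 1)
y = 2 * X.ravel()

# Keep one half for training and the other half for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(len(X_train), len(X_test))
```

The rows are shuffled before splitting, so both halves contain a mix of early and late samples; fixing random_state makes the split reproducible.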
• As we are done with the training of the program from the
• testing data
• printing output for image
• printing output for data set input
3.2 IMPLEMENTATION DETAILS
• importing libraries with description
• Training the data set
• Testing data
3.3 TOOLS USED
1. Paint – for drawing the architecture
2. Jupyter Notebook – for running the libraries
3. Python – for training on the dataset
4. Kaggle – for the dataset
4. TEST RESULT/EXPERIMENTATION/VERIFICATION
4.1 TESTING
1. EXPLORATORY ANALYSIS
In statistics, exploratory analysis is the process of evaluating data sets in order to summarize their
key characteristics, which is mostly achieved using visual methods. Exploratory analysis can be used with
or without a statistical model, but it is mainly used to see what the data can teach us beyond
modelling or hypothesis testing. The graphic representation of data is known as data visualization.
Import the LinearRegression model from the scikit-learn linear_model module.
Use X_train and Y_train to train the linear regression model.
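These two steps might look as follows. The X_train and Y_train values are small made-up arrays lying exactly on the line y = 2x + 1, used only so that the fitted coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data on the line y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
Y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Fit the linear regression model to the training data.
model = LinearRegression()
model.fit(X_train, Y_train)

print(model.coef_[0], model.intercept_)
```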
4.2 RESULTS
Linear regression, random forest and decision tree models are used to compute accuracy results for the given
input from the diabetes database, to check which gives the best accuracy.
After training, x_test is the test data set and Y_test is the set of labels for all the data in
x_test. We check the accuracy of linear regression, RandomForestRegressor and the decision tree.
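A comparison of this kind can be sketched as below. The data is synthetic and the hyperparameters are assumptions, so the numbers it prints are not the report's results. Note that for scikit-learn regressors the score method returns the coefficient of determination (R²) on the held-out data, which is presumably what is reported as accuracy here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, roughly linear data with noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(300, 1))
y = 0.5 * X.ravel() + rng.normal(0, 1.0, size=300)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training half and score it on the test half.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)  # R^2 on held-out data

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```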
The decision tree model has the highest accuracy (99.7%) and the linear regression has the lowest
accuracy (99.3%). The decision tree performed well because we had a simple model and one critical
aspect to consider, which was the number of installations. The linear regression model, on the
other hand, had the lowest accuracy due to its strong feature independence assumptions. This data
set includes a lot of information that can be used for a number of purposes. Currently, the decision tree
model created with this data set can be used by potential developers.
4.3 VERIFICATION
After completing the accuracy testing, we try different inputs: give a date, convert it to an
ordinal number, train the model, check the predicted output, and check the accuracy using linear
regression, decision tree and random forest.
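One way such a check could look is sketched here with linear regression. The dates and case counts are invented (cases grow by exactly 5 per day) so that the expected prediction can be worked out by hand; they are not taken from the project's data.

```python
from datetime import date

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: 10 consecutive days, cases growing by 5 per day.
start = date(2021, 3, 1)
days = [start.toordinal() + i for i in range(10)]
cases = [100 + 5 * i for i in range(10)]

X = np.array(days, dtype=float).reshape(-1, 1)
y = np.array(cases, dtype=float)
model = LinearRegression().fit(X, y)

# Query a new date: convert it to an ordinal number, then predict.
query = date(2021, 3, 15)  # 14 days after the start date
predicted = model.predict(np.array([[query.toordinal()]], dtype=float))[0]
print(round(predicted, 2))
```

Because the invented data is perfectly linear, the prediction for 14 days after the start should be close to 100 + 5 × 14 = 170; with real data the three models would generally disagree and their scores would need to be compared.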
In future, we can build a diabetes chatbot that is easy for people to use and
understand, answering questions regarding diabetes and updates, and giving the cured cases,
death cases and confirmed cases.
There is a lot of scope for machine learning. For future work, it is recommended to work on
calibrated and ensemble methods that could resolve quirky problems faster and with better outcomes
than the existing algorithms. An AI-based application could also be developed using various sensors
and features to identify and help diagnose diseases.
Prediction is an essential field for the future: a prediction system could be developed to estimate the possibility of
outbreaks of novel diseases that could harm mankind, by taking socio-economic and cultural factors
into consideration.
This is the conclusion and the further scope of the diabetes prediction using machine learning
project.
References
1. Balkau B, Lange C, Fezeu L, et al. Predicting diabetes: clinical, biological, and genetic
approaches: data from the epidemiological study on the insulin resistance syndrome (DESIR).
Diabetes Care. 2008;31:2056–61.
2. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, et al. mlr: machine learning in R. J Mach
Learn Res. 2016;17(170):1–5.
3. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more
correlated receiver operating characteristic curves: a nonparametric approach. Biometrics.
1988;44:837–45.
4. Griffin SJ, Little PS, Hales CN, Kinmonth AL, Wareham NJ. Diabetes risk score: towards
earlier detection of type 2 diabetes in general practice. Diabetes Metab Res Rev. 2000;16:164–71.
5. Habibi S, Ahmadi M, Alizadeh S. Type 2 diabetes mellitus screening and risk factors using
decision tree: results of data mining. Global J Health Sci. 2015;7(5):304–10.
7. Ioannis K, Olga T, Athanasios S, Nicos M, et al. Machine learning and data mining methods in
diabetes research. Comput Struct Biotechnol J. 2017;15:104–16.
9. Kahn HS, Cheng YJ, Thompson TJ, Imperatore G, Gregg EW. Two risk-scoring systems for
predicting incident diabetes mellitus in U.S. adults age 45 to 64 years. Ann Intern Med.
2009;150:741–51.