Internship Report File


Disease Prediction by Symptoms

Project report submitted to


Government Engineering College, Bharatpur
in partial fulfilment of the requirement for the award of
the degree

Bachelor of Technology
In
Computer Science & Engineering
by
Mohit Agrawal (16EELCS020)

Under the guidance of


Mr. Hemant Saxena

Department of Computer Science and Engineering


Government Engineering College
Bharatpur 321001(India)
2019-2020
© Government Engineering College(GEC) 2019
ABSTRACT

I carried out my internship at Dzone Software Solution & Service Provider, Jaipur. Dzone
Software Solution & Service Provider represents the connected world, offering innovative
and customer-centric information technology experiences and enabling Enterprises,
Associates and the Society to Rise. The company provides internship opportunities to
students in various emerging technologies.

The purpose of the program is to fulfil a core requirement for the award of the degree of
Bachelor of Technology in Computer Science and Engineering, to see the practical side of
the theory studied at the university, to understand how the corporate sector operates, and
to gain experience in a range of tasks.

During my internship, I was assigned to the Machine Learning department, where I was
asked to build disease prediction software. There I interacted with many working
professionals and learned how things actually work in an organization, including the
complete procedure for implementing a project: understanding the problem, estimating
the cost of the project, choosing the methodology, and carrying out the final
implementation.

In conclusion, this was an opportunity to develop and enhance skills and competencies in
my career field, which I achieved.

ACKNOWLEDGEMENT

I would like to take the opportunity to thank and express my deep sense of
gratitude to my corporate mentor Mr. Hemant Saxena and my faculty
mentor Prof. Arvind Singh Chaudhary. I am greatly indebted to both of
them for their valuable guidance at all stages of the study, and for their
advice, constructive suggestions, positive and supportive attitude and
continuous encouragement, without which it would not have been possible to
complete the project.

I am thankful to Mr. Hemant Saxena for giving me the opportunity to work
with Dzone Software Solution & Service Provider.

I would also like to thank my supervisor Ms. Surbhi Saxena, who helped me
a lot during my internship period in completing my machine learning project.

I owe my wholehearted thanks and appreciation to the entire staff of the
company for their cooperation and assistance during the course of my project.

I would also like to thank my parents, who helped me a lot with my project
during my internship period.

I hope that I can build upon the experience and knowledge that I have gained
and make a valuable contribution to this industry in the future.

Mohit Agrawal

CERTIFICATE

Table of Content

S. No  Content

1. Abstract
2. Acknowledgement
3. Certificate
4. Table of Content
5. Introduction
6. Solution and Services
7. Machine Learning (Introduction)
8. Machine Learning Architecture
9. Machine Learning Algorithms
10. ML Development Lifecycle
11. Setup ML Codebase
12. ML Testing & Modelling
13. ML Project Structure
14. TKINTER in Python
15. Disease Prediction Prototype
16. Conclusion
17. Bibliography
INTRODUCTION

The Company
Dzone Software Solution & Service Provider represents the connected world, offering
innovative and customer-centric information technology experiences and enabling
Enterprises, Associates and the Society to Rise.
Dzone Software Solution & Service Provider provides services in the field of software
solutions for the application development sector and ERP design, with accelerated growth
over the last 10 years. Our mission is to provide our customers with cost-effective,
state-of-the-art products and services that enable them to implement straight-through
processes and to better serve and retain their clients. We employ highly trained,
specialized and motivated people to deliver outstanding consulting, implementation and
training services.
We believe in “Innovate from inside”, i.e. we offer innovative solutions to our valued
customers that enable them to realize their full potential; we anticipate future trends and
demand by engaging in active dialogue with our customers. Our commitment to
customer satisfaction is matched only by a relentless quest to form strategic alliances
with world-class software vendors and business consultants that help us expand and
improve our value proposition to the benefit of our customers.

Vision
We will Rise™ to be among the top three leaders in each of our chosen market segments
while fostering innovation and inclusion.
We will consistently achieve top quartile growth by contributing to our customers' success,
by enabling our employees to realize their potential and by creating value for all our
stakeholders.

History
Dzone Software Solution & Service Provider started in 2010 as a technology outsourcing
company.

SOLUTIONS AND SERVICES

• Next Gen Solutions
• Big Data
• Content Delivery Network
• Device Testing and Certification
• Digital Enterprise Services
• Green and Sustainability Solutions
• Internet of Things (IoT)
• Industrial Internet of Things (IIoT)
• Long Term Evolution (LTE)
• Smart Grid

• Python Programming
• Swift Programming
• JavaScript
• Java
• Infrastructure and Cloud Services
• Mobile App Development
• Customer Experience
• DevOps
• Enterprise Architecture
• Machine Learning

Machine Learning

Introduction

Machine learning (ML) is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without using explicit
instructions, relying on patterns and inference instead.
Machine learning algorithms build a mathematical model based on sample data, known
as "training data", in order to make predictions or decisions without being explicitly
programmed to perform the task.
The goal of machine learning is to discover patterns in data, often complex ones, and
then use them to make predictions that answer business questions, detect and analyze
trends, and help solve problems.

Learning Algorithms

The types of machine learning algorithms differ in their approach, the type of data they
input and output, and the type of task or problem that they are intended to solve.
o Supervised Learning
o Semi-Supervised Learning
o Unsupervised Learning
o Reinforcement Learning
o Feature Learning

Features of Machine Learning

The important features of machine learning are:

1) Develop computational models of the human learning process.

2) Explore new learning methods and develop general learning algorithms independent of
applications.

3) Make computers smarter and more intelligent.

4) Machine learning is inherently a multi-disciplinary subject area.

5) ML aims to produce computers capable of the intelligent behavior described above.

ML Applications
There are many machine learning applications in the market. The top categories are:

o Banking
o Financial Market Analysis
o Medical Diagnosis
o Natural Language Processing
o Sentiment Analysis
o Recommendation Systems
o Time Series Forecasting etc.

History
Arthur Samuel, an American pioneer in the field of computer gaming and artificial
intelligence, coined the term "Machine Learning" in 1959 while at IBM.
A representative book of machine learning research during the 1960s was Nilsson's
book Learning Machines, which dealt mostly with machine learning for pattern
classification.
However, an increasing emphasis on the logical, knowledge-based approach caused a rift
between AI and machine learning. Probabilistic systems were plagued by theoretical and
practical problems of data acquisition and representation.

HL vs ML

Dimension             Human Learning       Machine Learning
Speed                 Slow                 Fast
Ability to Transfer   No copy mechanism    Easy to copy
Required Repetition   Yes                  Yes/No
Error-prone           Yes                  Yes
Noise-tolerant        Yes                  No

Machine Learning Architecture

Machine Learning Algorithms

Supervised Learning
Supervised learning is a machine learning technique for learning a function from training
data. The training data consist of pairs of input objects (typically vectors), and desired
outputs. The output of the function can be a continuous value (called regression), or can
predict a class label of the input object (called classification).
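
The two kinds of output can be illustrated with a minimal sketch in plain Python. The data, the function names `fit_line` and `nearest_neighbour`, and the choice of least squares and nearest neighbour are assumptions made for illustration; they are not the methods used in the project.

```python
# Supervised learning sketch: learn a function from (input, output) pairs.
# All data below is made up for illustration.

def fit_line(xs, ys):
    """Regression: least-squares fit of y = a*x + b to the training pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def nearest_neighbour(train, query):
    """Classification: return the label of the closest training vector."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    _, label = min(train, key=lambda pair: sq_dist(pair[0], query))
    return label

# Regression: the output is a continuous value.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # pairs generated by y = 2x + 1
predicted_value = a * 5 + b

# Classification: the output is a class label.
labelled = [((0.0, 0.0), "healthy"), ((1.0, 1.0), "ill"), ((0.9, 0.8), "ill")]
predicted_label = nearest_neighbour(labelled, (0.2, 0.1))
```

In both cases the training pairs fix the function; only the type of the output differs.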

Unsupervised Learning -
Unsupervised learning is a type of machine learning where manual labels of inputs are not
used. It is distinguished from supervised learning approaches which learn how to perform
a task, such as classification or regression, using a set of human prepared examples.

Semi-supervised Learning -
Semi-supervised learning combines both labeled and unlabeled examples to generate an
appropriate function or classifier.

Reinforcement Learning -
In reinforcement learning, the algorithm learns a policy of how to act given an
observation of the world. Every action has some impact on the environment, and the
environment provides feedback that guides the learning algorithm.

Transduction -
Transduction is similar to supervised learning, but does not explicitly construct a function.

Learning to Learn -
In learning to learn, the algorithm learns its own inductive bias based on previous
experience.

Algorithm Types

Linear Classifiers -
In machine learning, the goal of classification is to group items that have similar
features.
1. Fisher’s Linear Discriminant
2. Naïve Bayes Classifier
3. Perceptron
4. Support Vector Machine

Decision Tree -
A decision tree is a hierarchical data structure implementing the divide-and-conquer
strategy. It is an efficient nonparametric method, which can be used for both classification
and regression. A decision tree is a hierarchical model for supervised learning whereby the
local region is identified in a sequence of recursive splits in a smaller number of steps. A
decision tree is composed of internal decision nodes and terminal leaves (see figure).
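
One step of the recursive splitting can be sketched as follows. The report does not say which split criterion was used, so Gini impurity is an assumption here, and the feature columns and labels are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, weighted_impurity) of the best binary split.

    Each row is a list of 0/1 feature values; a split sends rows with
    value 1 to one child and rows with value 0 to the other.
    """
    best = None
    for feature in range(len(rows[0])):
        left = [lab for row, lab in zip(rows, labels) if row[feature] == 1]
        right = [lab for row, lab in zip(rows, labels) if row[feature] == 0]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (feature, weighted)
    return best

# Made-up data: columns are [symptom A, symptom B]; labels are invented.
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = ["disease X", "disease X", "disease Y", "disease Y"]
feature, impurity = best_split(rows, labels)
```

A full tree builder would apply `best_split` again to each child until the nodes are pure or a depth limit is reached.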

Machine Learning Development Lifecycle

Machine learning projects are highly iterative; as you progress through the ML lifecycle,
you’ll find yourself iterating on a section until reaching a satisfactory level of performance,
then proceeding forward to the next task (which may be circling back to an even earlier
step).

Planning and project Setup-


• Define the task and scope out requirements
• Determine project feasibility
• Discuss general model trade-offs (accuracy vs. speed)
• Set up project codebase
Data Collection and labelling-

• Define ground truth (create labeling documentation)
• Build data ingestion pipeline
• Validate quality of data
• Revisit Step 1 and ensure data is sufficient for the task

Model Exploration-

• Establish baselines for model performance
• Start with a simple model using initial data pipeline
• Overfit simple model to training data
• Stay nimble and try many parallel (isolated) ideas during early stages

Model Refinement-

• Perform model-specific optimizations (e.g. hyperparameter tuning)
• Iteratively debug model as complexity is added
• Perform error analysis to uncover common failure modes
• Revisit Step 2 for targeted data collection of observed failures

Testing and Evaluation-

• Evaluate model on test distribution; understand differences between train and test
  set distributions (how is “data in the wild” different from what you trained on)
• Revisit model evaluation metric; ensure that this metric drives desirable
  downstream user behavior

Model Deployment-

• Expose model via a REST API
• Deploy new model to small subset of users to ensure everything goes smoothly,
  then roll out to all users
• Maintain the ability to roll back model to previous versions
• Monitor live data and model prediction distributions
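
The first deployment step, exposing a model via a REST API, can be sketched with Python's standard `http.server` module. The `/predict` path, the JSON field names and the stand-in `predict` rule are all assumptions for illustration; the report does not describe the project's actual API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(symptoms):
    """Stand-in for a trained model (a hypothetical rule, not a real model)."""
    return "disease X" if "symptom A" in symptoms else "unknown"

def handle_predict(body):
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload.get("symptoms", []))})

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        response = handle_predict(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

def serve(port=8000):
    """Start the server; call this manually, it blocks until interrupted."""
    HTTPServer(("localhost", port), PredictHandler).serve_forever()
```

Keeping `handle_predict` separate from the HTTP handler means the prediction logic can be tested without starting a server.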

Setting up a ML Codebase

data/ provides a place to store raw and processed data for your project.

docker/ is a place to specify one or many Dockerfiles for the project.

api/app.py exposes the model through a REST API for predictions.

models/ defines a collection of machine learning models for the task, unified by a common
API defined in base.py.

datasets.py manages construction of the dataset. It handles data pipelining/staging areas,
shuffling, and reading from disk.

experiment.py manages the experiment process of evaluating multiple models/ideas.

train.py defines the actual training loop for the model. This code interacts with the
optimizer and handles logging during training.
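
The idea of a common API in base.py can be sketched as an abstract base class. The class names `BaseModel` and `MajorityClassModel` and the `fit`/`predict` method names are assumptions chosen for illustration, not the contents of an actual base.py.

```python
from abc import ABC, abstractmethod
from collections import Counter

class BaseModel(ABC):
    """Common interface every model in models/ would implement (hypothetical)."""

    @abstractmethod
    def fit(self, features, labels):
        """Learn from training rows and their labels."""

    @abstractmethod
    def predict(self, features):
        """Return one predicted label per input row."""

class MajorityClassModel(BaseModel):
    """Trivial baseline: always predicts the most common training label."""

    def fit(self, features, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]

def evaluate(model, features, labels):
    """Accuracy of any model that follows the BaseModel interface."""
    predictions = model.predict(features)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

model = MajorityClassModel().fit([[0], [1], [1]], ["cold", "flu", "flu"])
accuracy = evaluate(model, [[0], [1]], ["cold", "flu"])
```

Because `evaluate` only relies on the shared interface, experiment.py could compare several models without knowing how each one works internally.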

ML-based System Testing and Monitoring-

The training system processes raw data, runs experiments, manages results, and stores
weights.
• Test the full training pipeline (from raw data to trained model) to ensure that
  changes haven't been made upstream with respect to how data from our application
  is stored. These tests should be run nightly/weekly.

The prediction system constructs the network, loads the stored weights, and makes
predictions.
• Run inference on the validation data (already processed) and ensure model score
  does not degrade with new model/weights. This should be triggered on every code
  push.

The serving system is exposed to accept "real world" input and perform inference on
production data. This system must be able to scale to demand.
Required monitoring:
• Alerts for downtime and errors
• Check for distribution shift in data
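
A very simple check for distribution shift is to compare the mean of a live feature with the mean seen at training time. The sketch below assumes a single numeric feature and an arbitrary threshold of two standard deviations; the function names and the sample values are invented, and production systems usually use more robust statistical tests.

```python
from statistics import mean, stdev

def mean_shift(reference, live):
    """How far the live mean has moved, in reference standard deviations."""
    return abs(mean(live) - mean(reference)) / stdev(reference)

def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return mean_shift(reference, live) > threshold

training_ages = [30, 32, 34, 36, 38]   # made-up reference sample
similar_live = [31, 33, 35, 37]        # live data close to the reference
shifted_live = [60, 62, 64, 66]        # live data from a different population

drift_flags = (has_drifted(training_ages, similar_live),
               has_drifted(training_ages, shifted_live))
```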

Machine Learning Project Structure-

Various businesses use machine learning to manage and improve operations. While ML
projects vary in scale and complexity requiring different data science teams, their general
structure is the same.

1. Strategy: Matching the problem with the solution-

In the first phase of an ML project, company representatives mostly outline
strategic goals. They assume a solution to a problem, define a scope of work, and plan the
development.

Disease Prediction:
When a patient wants to consult a doctor, it may take a long time, or the patient may be
unable to reach a doctor at that moment. In that situation, the patient can use disease
prediction software as a first, primary-level step.
The user or patient enters their symptoms into the software, and a machine learning
model then predicts the disease using machine learning algorithms.

2. Dataset Preparation and Pre-processing –

Data is the foundation for any machine learning project. The second stage of project
implementation is complex and involves data collection, selection, preprocessing, and
transformation. Each of these phases can be split into several steps.

Data Collection
It’s time for a data analyst to pick up the baton and lead the way to machine learning
implementation. The job of a data analyst is to find ways and sources of collecting relevant
and comprehensive data, interpreting it, and analyzing results with the help of statistical
techniques.

Data Visualization
A large amount of information represented in graphic form is easier to understand and
analyze. Some companies specify that a data analyst must know how to create slides,
diagrams, charts, and templates.

Data Cleaning
This set of procedures allows for removing noise and fixing inconsistencies in data. A data
scientist can fill in missing data using imputation techniques, e.g. substituting missing
values with the mean of the attribute.
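
Mean imputation can be sketched in a few lines. The column values below are made up, and `impute_with_mean` is a hypothetical helper; in practice the same result is often obtained with a data-frame library.

```python
def impute_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    column_mean = sum(observed) / len(observed)
    return [column_mean if value is None else value for value in column]

temperatures = [98.6, None, 101.0, 99.4, None]   # made-up readings
cleaned = impute_with_mean(temperatures)
```

The observed values are left unchanged; only the gaps are filled.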

Fig. Clean Dataset

3. Dataset Splitting
A dataset used for machine learning should be partitioned into three subsets - training, test,
and validation sets.

Training Set:
A data scientist uses a training set to train a model and define its optimal parameters -
parameters it has to learn from data.

Fig. Training Dataset

Testing Set:
A test set is needed to evaluate the trained model and its capability for
generalization. The latter means a model’s ability to identify patterns in new, unseen data
after having been trained on the training data. It’s crucial to use different subsets for
training and testing so that overfitting, i.e. a failure to generalize beyond the training
data, can be detected.

Fig. Testing Dataset

Validation Set:
The purpose of a validation set is to tweak a model’s hyperparameters, which are
higher-level structural settings that can’t be directly learned from data. These settings can
express, for instance, how complex a model is and how fast it finds patterns in data.
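
The three-way partition described in this section can be sketched as below. The 70/15/15 proportions, the fixed random seed and the function name `split_dataset` are assumptions; the report does not state the proportions used in the project.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and cut them into training, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains
    return train, validation, test

rows = list(range(100))                        # stand-in for 100 data records
train, validation, test = split_dataset(rows)
```

Shuffling before cutting keeps any ordering in the original file from leaking into one subset, and every record lands in exactly one of the three sets.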

4. Modelling
During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.

Model training
After a data scientist has preprocessed the collected data and split it into three subsets, he
or she can proceed with model training. This process entails “feeding” the algorithm with
training data. The algorithm processes the data and outputs a model that is able to find a
target value (attribute) in new data, which is the answer you want to get with predictive
analysis. The purpose of model training is to develop a model.

Supervised learning: Supervised learning allows for processing data with target attributes
or labeled data. These attributes are mapped in historical data before the training begins.
With supervised learning, a data scientist can solve classification and regression problems.

Unsupervised learning: During this training style, an algorithm analyzes unlabeled data.
The goal of model training is to find hidden interconnections between data objects and
structure objects by similarities or differences. Unsupervised learning aims at solving such
problems as clustering, association rule learning, and dimensionality reduction. For
instance, it can be applied at the data preprocessing stage to reduce data complexity.
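
Clustering, one of the unsupervised problems named above, can be sketched with a one-dimensional k-means. The measurements, the starting centroids and the function name `k_means_1d` are assumptions for illustration; the disease prediction project itself uses supervised algorithms.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to its cluster mean (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]      # made-up, unlabeled measurements
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
```

No labels are supplied; the grouping emerges only from how close the values are to each other.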

Decision Tree Algorithm
• The decision tree algorithm falls under the category of supervised learning. It can
  be used to solve both regression and classification problems.
• A decision tree uses a tree representation to solve the problem: each leaf
  node corresponds to a class label, and attributes are tested at the internal nodes
  of the tree.
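
The tree representation described above can be sketched as nested dictionaries. The tree, the attribute names and the class labels below are invented for illustration and carry no medical meaning.

```python
# Internal nodes test one attribute; leaves carry a class label.
tree = {
    "attribute": "symptom A",
    "yes": {
        "attribute": "symptom B",
        "yes": {"label": "disease X"},
        "no": {"label": "disease Y"},
    },
    "no": {"label": "disease Z"},
}

def classify(node, symptoms):
    """Walk from the root to a leaf and return that leaf's class label."""
    while "label" not in node:
        branch = "yes" if node["attribute"] in symptoms else "no"
        node = node[branch]
    return node["label"]

result = classify(tree, {"symptom A", "symptom B"})
```

Prediction only needs one attribute test per level, so a deep tree still answers in a handful of comparisons.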

Fig. Decision Tree Diagram

Random Forest Algorithm
• A Random Forest is an ensemble technique capable of performing both regression
  and classification tasks with the use of multiple decision trees and a technique
  called Bootstrap Aggregation, commonly known as bagging.
• The basic idea behind this is to combine multiple decision trees in determining the
  final output rather than relying on individual decision trees.
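
Bagging can be sketched with very small stand-in "trees". This is a simplified illustration under stated assumptions: the one-feature stump, the data and the number of models are invented, and a real random forest additionally picks a random subset of features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def train_stump(sample):
    """A deliberately tiny 'tree': predict the majority label for each value
    of the first feature, falling back to the sample's overall majority."""
    overall = Counter(label for _, label in sample).most_common(1)[0][0]
    by_value = {}
    for features, label in sample:
        by_value.setdefault(features[0], Counter())[label] += 1
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda features: table.get(features[0], overall)

def bagged_predict(models, features):
    """Aggregate: each model votes and the most common label wins."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [((1,), "ill"), ((1,), "ill"), ((1,), "ill"),
        ((0,), "well"), ((0,), "well"), ((0,), "well")]   # made-up records
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Each model sees a slightly different sample, so an unlucky sample affects one vote rather than the final output.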

Fig. Random Forest Algorithm

Naïve Bayes Algorithm
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. It is not a single algorithm but a family of algorithms that share a common
principle, i.e. every pair of features is assumed to be independent of each other given the
class.
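
The independence assumption is what lets the classifier multiply one probability per feature. The sketch below assumes binary (0/1) features and Laplace smoothing; the training rows, labels and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each binary feature is 1 within each class."""
    class_counts = Counter(labels)
    ones = defaultdict(lambda: [0] * len(rows[0]))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            ones[label][i] += value
    return class_counts, ones

def predict_naive_bayes(model, row):
    """Pick the class maximizing P(class) * product of P(feature_i | class)."""
    class_counts, ones = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                            # prior P(class)
        for i, value in enumerate(row):
            p_one = (ones[label][i] + 1) / (count + 2)   # Laplace smoothing
            score *= p_one if value == 1 else 1 - p_one  # independence assumption
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data: columns are [symptom A, symptom B].
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]]
labels = ["disease X", "disease X", "disease X",
          "disease Y", "disease Y", "disease Y"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, [1, 0])
```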

Fig. Naïve Bayes Algorithm

Model Evaluation and Testing
The goal of this step is to develop the simplest model able to formulate a target value fast
and well enough. A data scientist can achieve this goal through model tuning. That’s the
optimization of model parameters to achieve an algorithm’s best performance.

Cross-validation:
Cross-validation is the most commonly used tuning method. In ten-fold cross-validation,
the training dataset is split into ten equal parts (folds). A given model is trained on nine
folds and then tested on the tenth (the one left out). Training is repeated until every fold
has been left aside once and used for testing. From these performance measurements, a
specialist calculates a cross-validated score for each set of hyperparameters. A data
scientist trains models with different sets of hyperparameters to determine which model
has the highest prediction accuracy. The cross-validated score indicates average model
performance across the ten hold-out folds.
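
The fold rotation can be sketched as an index generator. `score_fold` is a hypothetical callback that would train a model on the training indices and return its score on the held-out indices; a real pipeline would usually shuffle the data before splitting.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) so each fold is held out once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validated_score(n_samples, score_fold, k=10):
    """Average the per-fold scores returned by score_fold(train, test)."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(20, k=10))
```

Repeating `cross_validated_score` for each candidate set of hyperparameters gives one comparable number per candidate.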

5. Model Deployment

The model deployment stage covers putting a model into production use.

Once a data scientist has chosen a reliable model and specified its performance
requirements, he or she delegates its deployment to a data engineer or database
administrator. The distribution of roles depends on your organization’s structure and the
amount of data you store.

TKINTER in Python

Create GUI Window

# gui_stuff ------------------------------------------------------------
from tkinter import *  # Tk, Label, Entry, OptionMenu, Button, Text, LEFT, W, RIDGE

root = Tk()
root.title("My Doctor")
root.configure(background='white')

Heading in Window

# Heading
w2 = Label(root, justify=LEFT, text="My Doctor : Disease Predictor", fg="Black", bg="white")
w2.config(font=("Aharoni", 25))
w2.grid(row=1, column=1, columnspan=2, padx=100)
w2 = Label(root, justify=LEFT, text="A Project by Mohit Agrawal", fg="Green", bg="white")
w2.config(font=("Aharoni", 15))
w2.grid(row=2, column=1, columnspan=2, padx=100)

Create Labels for Symptoms

# labels
NameLb = Label(root, text="Patient Name", fg="black", bg="white")
NameLb.grid(row=6, column=0, pady=25,sticky=W)
NameLb.config(font=("Aharoni", 15))

S1Lb = Label(root, text="Symptom 1", fg="black", bg="white")
S1Lb.grid(row=7, column=0, pady=20, sticky=W)
S1Lb.config(font=("Aharoni", 15))

S2Lb = Label(root, text="Symptom 2", fg="black", bg="white")
S2Lb.grid(row=8, column=0, pady=20, sticky=W)
S2Lb.config(font=("Aharoni", 15))

S3Lb = Label(root, text="Symptom 3", fg="black", bg="white")
S3Lb.grid(row=9, column=0, pady=20, sticky=W)
S3Lb.config(font=("Aharoni", 15))

S4Lb = Label(root, text="Symptom 4", fg="black", bg="white")
S4Lb.grid(row=10, column=0, pady=20, sticky=W)
S4Lb.config(font=("Aharoni", 15))

S5Lb = Label(root, text="Symptom 5", fg="black", bg="white")
S5Lb.grid(row=11, column=0, pady=20, sticky=W)
S5Lb.config(font=("Aharoni", 15))

List View

# entries
# Name and Symptom1..Symptom5 are tkinter variables, and l1 is the list of
# symptom names; all are defined elsewhere in the program (not shown here).
OPTIONS = sorted(l1)

NameEn = Entry(root, textvariable=Name, width=50, bg="black", fg="white")
NameEn.grid(row=6, column=1, padx=10)

S1En = OptionMenu(root, Symptom1, *OPTIONS)
S1En.grid(row=7, column=1, padx=10)

S2En = OptionMenu(root, Symptom2, *OPTIONS)
S2En.grid(row=8, column=1, padx=10)

S3En = OptionMenu(root, Symptom3, *OPTIONS)
S3En.grid(row=9, column=1, padx=10)

S4En = OptionMenu(root, Symptom4, *OPTIONS)
S4En.grid(row=10, column=1, padx=10)

S5En = OptionMenu(root, Symptom5, *OPTIONS)
S5En.grid(row=11, column=1, padx=10)

Button

# Each button runs the prediction callback for one algorithm; the callbacks
# DecisionTree, randomforest and NaiveBayes are defined elsewhere (not shown).
dst = Button(root, text="Decision Tree", command=DecisionTree, bg="orange",
             fg="white", padx=10, pady=5, relief=RIDGE)
dst.grid(row=8, column=2, padx=10)

rnf = Button(root, text="Random Forest", command=randomforest, bg="red",
             fg="white", padx=10, pady=5, relief=RIDGE)
rnf.grid(row=9, column=2, padx=10)

lr = Button(root, text="Naive Bayes", command=NaiveBayes, bg="blue",
            fg="white", padx=10, pady=5, relief=RIDGE)
lr.grid(row=10, column=2, padx=10)

Text Fields

# text fields
t1 = Text(root, height=1, width=40, bg="black", fg="white", pady=5)
t1.grid(row=15, column=1, padx=10, pady=5)

t2 = Text(root, height=1, width=40, bg="black", fg="white", pady=5)
t2.grid(row=17, column=1, padx=10, pady=5)

t3 = Text(root, height=1, width=40, bg="black", fg="white", pady=5)
t3.grid(row=19, column=1, padx=10, pady=5)

Disease Predictor Prototype
• This machine learning project predicts a disease from the symptoms given by the
  user, for use at the primary level before a doctor can be consulted.
• The patient can enter up to 5 symptoms, and based on these symptoms the machine
  learning model predicts a disease.
• It predicts the disease using three different machine learning algorithms.
• It uses tkinter for the GUI and NumPy and pandas for data mining.
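
One way the selected symptoms could be turned into model input is a 0/1 vector over the full symptom list. This is a sketch under assumptions: the symptom names, the function name `encode_symptoms` and the handling of empty menu slots are hypothetical, since the report does not show the project's symptom list or its prediction callbacks.

```python
# Hypothetical symptom list; the report's own list (l1) is not shown.
SYMPTOMS = ["cough", "fatigue", "headache", "high_fever", "nausea"]

def encode_symptoms(selected, all_symptoms=SYMPTOMS, max_symptoms=5):
    """Return a 0/1 vector with a 1 for every selected symptom."""
    chosen = [s for s in selected if s]            # ignore empty menu slots
    if len(chosen) > max_symptoms:
        raise ValueError("at most %d symptoms can be entered" % max_symptoms)
    unknown = [s for s in chosen if s not in all_symptoms]
    if unknown:
        raise ValueError("unknown symptoms: %s" % ", ".join(unknown))
    return [1 if symptom in chosen else 0 for symptom in all_symptoms]

vector = encode_symptoms(["high_fever", "cough", "", "", ""])
```

The resulting vector has one position per known symptom, so the same encoding can be passed to each of the three algorithms.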

Conclusion

During my two-month summer internship at Dzone Software Solution & Service
Provider, I was exposed to the real working environment of a company and learned how
things work in practice.

Because my summer internship was in Machine Learning with Python, I learned a lot
about this technology, and there is a lot more still to learn and much that can be built
with it. During the training period, I went through an intermediate level of developing an
Android app, and there is a lot more to be explored.

Bibliography

The information in this report is taken from the following sources:

https://www.javatpoint.com/

https://www.dzone.co.in

https://www.kaggle.com/datasets

https://www.wikipedia.org/

https://www.youtube.com/
