Summer Internship Report


ACKNOWLEDGEMENT

I would like to acknowledge the contributions of the following people without whose help and guidance this

report would not have been completed.

I acknowledge, with respect and gratitude, the counsel and support of our training coordinators, Kunal Jain, Parnav Dar
and Aishwarya Singh from Internshala, whose expertise, guidance, support, encouragement, and
enthusiasm have made this report possible. Their feedback vastly improved the quality of this report and made the
work an enthralling experience. I am indeed proud and fortunate to be supported by them.

I am also thankful to Prof. M.K. Razak, H.O.D of Electrical Engineering Department, Purnea College
Of Engineering, Purnea for his constant encouragement, valuable suggestions and moral support and
blessings.

Although it is not possible to name individually, I shall ever remain indebted to the faculty members of
Purnea College Of Engineering, Purnea for their persistent support and cooperation extended during this
work.

This acknowledgement will remain incomplete if I fail to express our deep sense of obligation to my parents
and God for their consistent blessings and encouragement.

Name:- Syed Shad Jami

Registration no:-20103131012

Roll no:- 20312

Branch:- Electrical Engineering


Chapter 1 Introduction
Machine Learning is the science of getting computers to learn without being explicitly programmed. It is
closely related to computational statistics, which focuses on making predictions using computers. In its
application across business problems, machine learning is also referred to as predictive analytics. Machine
Learning focuses on the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the examples
that we provide. The primary aim is to allow computers to learn automatically without human intervention or
assistance and to adjust their actions accordingly.

History of Machine Learning

The name machine learning was coined in 1959 by Arthur Samuel. Tom M. Mitchell provided a widely quoted,
more formal definition of the algorithms studied in the machine learning field: "A computer program is said
to learn from experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E." This follows Alan Turing's
proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?"
is replaced with the question "Can machines do what we (as thinking entities) can do?". In Turing's proposal,
the characteristics that could be possessed by a thinking machine and the various implications of constructing
one are explored.

Types of Machine Learning

The types of machine learning algorithms differ in their approach, the type of data they input and output, and
the type of task or problem that they are intended to solve. Broadly Machine Learning can be categorized into
four categories.

I. Supervised Learning

II. Unsupervised Learning

III. Reinforcement Learning

IV. Semi-supervised Learning


Machine learning enables analysis of massive quantities of data. While it generally delivers faster, more
accurate results in order to identify profitable opportunities or dangerous risks, it may also require additional
time and resources to train it properly.

Supervised Learning

Supervised Learning is a type of learning in which we are given a data set and we already know what our
correct output should look like, having the idea that there is a relationship between the input and the output.
Basically, it is the task of learning a function that maps an input to an output based on example input-
output pairs. It infers a function from labelled training data consisting of a set of training examples. Supervised
learning problems are categorized into regression and classification problems.
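As a minimal illustration of this idea (not part of the training material; scikit-learn and its built-in Iris dataset
are assumed, and a k-nearest-neighbours classifier is used here purely for brevity), a supervised model is fitted
on labelled examples and checked on held-out data:

# Minimal supervised-learning sketch: learn a mapping from labelled (X, y) pairs.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features and their known (labelled) outputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # hold out 20% of the examples for testing

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                  # infer a function from the labelled training data
print("Test accuracy:", model.score(X_test, y_test))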

Unsupervised Learning

Unsupervised Learning is a type of learning that allows us to approach problems with little or no idea of what
our results should look like. We can derive structure by clustering the data based on relationships among
the variables in the data. With unsupervised learning there is no feedback based on the prediction results. Basically, it
is a type of self-organized learning that helps in finding previously unknown patterns in a data set without pre-
existing labels.

Reinforcement Learning

Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions
and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of
reinforcement learning. This method allows machines and software agents to automatically determine the ideal
behaviour within a specific context in order to maximize its performance. Simple reward feedback is required
for the agent to learn which action is best.

Semi-Supervised Learning

Semi-supervised learning falls somewhere in between supervised and unsupervised learning, since it uses
both labelled and unlabelled data for training – typically a small amount of labelled data and a large amount
of unlabelled data. The systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when acquiring labelled data requires skilled and relevant
resources in order to train on it / learn from it, whereas acquiring unlabelled data generally doesn't require
additional resources.

Literature Survey

Theory
A core objective of a learner is to generalize from its experience. The computational analysis of machine
learning algorithms and their performance is a branch of theoretical computer science known as computational
learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not
yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite
common. The bias–variance decomposition is one way to quantify generalization
error.
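As a brief illustration of that decomposition (a standard result, not taken from the training material): for a squared-error loss, the expected generalization error at a point can be written as Expected error = Bias² + Variance + Irreducible error (noise), where the bias term dominates for overly simple (underfit) hypotheses and the variance term dominates for overly complex (overfit) ones.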
For the best performance in the context of generalization, the complexity of the hypothesis should match the
complexity of the function underlying the data. If the hypothesis is less complex than the function, then the
model has underfit the data. If the complexity of the model is increased in response, then the training error
decreases. But if the hypothesis is too complex, then the model is subject to overfitting and generalization will
be poorer.
In addition to performance bounds, learning theorists study the time complexity and feasibility of learning.
In computational learning theory, a computation is considered feasible if it can be done in polynomial time.
There are two kinds of time complexity results. Positive results show that a certain class of functions can be
learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

The Challenges Facing Machine Learning

While there has been much progress in machine learning, there are also challenges. For example, the
mainstream machine learning technologies are black-box approaches, making us concerned about their
potential risks. To tackle this challenge, we may want to make machine learning more explainable and
controllable. As another example, the computational complexity of machine learning algorithms is usually
very high and we may want to invent lightweight algorithms or implementations. Furthermore, in many
domains such as physics, chemistry, biology, and social sciences, people usually seek elegantly simple
equations (e.g., the Schrödinger equation) to uncover the underlying laws behind various phenomena. Machine
learning models, by contrast, are far less transparent, and building them takes much more time: we have to gather
and prepare data and then train the algorithm, and there are many more uncertainties. That is why, while in
traditional website or application development an experienced team can estimate the time quite precisely, a
machine learning project used, for example, to provide product recommendations can take much less or much
more time than expected. Even the best machine learning engineers don't know exactly how deep learning
networks will behave when analysing different sets of data. It also means that machine learning engineers and
data scientists cannot guarantee that the training process of a model can be replicated.

Applications of Machine Learning


Machine learning is one of the most exciting technologies that one would have ever come across. As it is
evident from the name, it gives the computer that which makes it more similar to humans: The ability to learn.
Machine learning is actively being used today, perhaps in many more places than one would expect. We
probably use a learning algorithm dozens of times without even knowing it. Applications of Machine Learning
include:

• Web Search Engines: One of the reasons why search engines like Google, Bing, etc. work so well is
that the system has learnt how to rank pages through a complex learning algorithm.

• Photo Tagging Applications: Be it Facebook or any other photo tagging application, the ability to tag
friends makes the experience even more engaging. It is all possible because of a face recognition algorithm that
runs behind the application.

• Spam Detector: Mail services like Gmail or Hotmail do a lot of hard work for us in classifying
mail and moving spam mail to the spam folder. This is again achieved by a spam classifier running
in the back end of the mail application.

• Database Mining for growth of automation: Typical applications include Web-click data for better
UX, Medical records for better automation in healthcare, biological data and many more.

• Applications that cannot be programmed explicitly: There are some tasks that cannot be programmed directly,
as the computers we use are not modelled that way. Examples include autonomous driving, recognition
tasks from unordered data (face recognition / handwriting recognition), natural language
processing, computer vision, etc.
• Understanding Human Learning: This is the closest we have come to understanding and mimicking the human
brain. It is the start of a new revolution, the real AI. Now, after this brief insight, let us come to a more
formal definition of Machine Learning.

Future Scope
Future of Machine Learning is as vast as the limits of human mind. We can always keep learning, and teaching
the computers how to learn. And at the same time, wondering how some of the most complex machine learning
algorithms have been running in the back of our own mind so effortlessly all the time. There is a bright future
for machine learning. Companies like Google, Quora, and Facebook hire people with machine learning skills. There
is intense research in machine learning at the top universities in the world. The global machine-learning-as-a-
service market is rising expeditiously, mainly due to the Internet revolution. The process of connecting the
world virtually has generated a vast amount of data, which is boosting the adoption of machine learning solutions.
Considering all these applications and dramatic improvements that ML has brought us, it doesn't take a genius
to realize that in coming future we will definitely see more advanced applications of ML, applications that will
stretch the capabilities of machine learning to an unimaginable level.

Organization of Training Workshop


Company Profile
Internshala is an internship and online training platform, based in Gurgaon, India. Founded by Sarvesh
Agrawal, an IIT Madras alumnus, in 2011, the website helps students find internships with organisations in
India. It provides internships, internship trainings, and job offers from many companies from all over India.

Objectives
Main objectives of training were to learn:

• How to determine and measure program complexity,


• Python Programming
• ML libraries: Scikit-learn, NumPy, Matplotlib, Pandas, TensorFlow
• Statistical math for the algorithms.
• Learning the statistical and mathematical concepts behind the algorithms.
• Supervised and Unsupervised Learning
• Classification and Regression
• ML Algorithms
• Machine Learning Programming and Use Cases.

Methodologies
There were several facilitation techniques used by the trainer which included question and answer,
brainstorming, group discussions, case study discussions and practical implementation of some of the topics
by trainees on flip charts and paper sheets. A variety of training methodologies was used to make sure all the
participants grasped the concepts and practised what they learned, because what is merely heard from the trainers
is easily forgotten, while what the trainees do by themselves is never forgotten. After the post-
tests were administered and the final course evaluation forms were filled in by the participants, the trainer
expressed his closing remarks and reiterated the importance of the training for the trainees in their daily
activities and their readiness for applying the learnt concepts in their assigned tasks. Certificates of completion
were distributed among the participants at the end.

Chapter 2 Technology Implemented

Python – The New Generation Language


Python is a widely used general-purpose, high-level programming language. It was initially designed by Guido
van Rossum in 1991 and is developed by the Python Software Foundation. It was designed with an emphasis
on code readability, and its syntax allows programmers to express concepts in fewer lines of code. Python is
dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural,
object-oriented, and functional programming. Python is often described as a "batteries included" language due
to its comprehensive standard library.

Features
• Interpreted
In Python there are no separate compilation and execution steps as in C/C++. It directly runs the program
from the source code. Internally, Python converts the source code into an intermediate form called
bytecode, which is then translated into the native language of the specific computer in order to run it.

• Platform Independent
Python programs can be developed and executed on multiple operating system platforms. Python can
be used on Linux, Windows, Macintosh, Solaris and many more.

• Multi- Paradigm
Python is a multi-paradigm programming language. Object-oriented programming and structured
programming are fully supported, and many of its features support functional programming and
aspect-oriented programming.

• Simple
Python is a very simple language. It is very easy to learn, as it is closer to the English language. In Python,
more emphasis is placed on the solution to the problem rather than on the syntax.

• Rich Library Support


The Python standard library is very vast. It can help to do various things involving regular expressions,
documentation generation, unit testing, threading, databases, web browsers, CGI, email, XML, HTML,
WAV files, cryptography, GUI and many more.

• Free and Open Source


Firstly, Python is freely available. Secondly, it is open-source. This means that its source code is available
to the public. We can download it, change it, use it, and distribute it. This is called FLOSS (Free/Libre and
Open Source Software). As the Python community, we're all headed toward one goal: an ever-improving
Python.

Why Python Is a Perfect Language for Machine Learning?

1. A great library ecosystem -


A great choice of libraries is one of the main reasons Python is the most popular programming language
used for AI. A library is a module or a group of modules published by different sources which include
a pre-written piece of code that allows users to reach some functionality or perform different actions.
Python libraries provide base level items so developers don’t have to code them from the very
beginning every time. ML requires continuous data processing, and Python’s libraries let us access,
handle and transform data. These are some of the most widespread libraries you can use for ML and
AI:
o Scikit-learn for handling basic ML algorithms like clustering, linear and logistic regression,
classification, and others.
o Pandas for high-level data structures and analysis. It allows merging and filtering of data, as
well as gathering it from other external sources like Excel, for instance.
o Keras for deep learning. It allows fast calculations and prototyping, as it uses the GPU in
addition to the CPU of the computer.
o TensorFlow for working with deep learning by setting up, training, and utilizing artificial
neural networks with massive datasets.
o Matplotlib for creating 2D plots, histograms, charts, and other forms of visualization.

o NLTK for working with computational linguistics, natural language recognition, and
processing.

o Scikit-image for image processing.

o PyBrain for neural networks, unsupervised and reinforcement learning.

o Caffe for deep learning, which allows switching between the CPU and the GPU and processing
60+ million images a day using a single NVIDIA K40 GPU.

o StatsModels for statistical algorithms and data exploration.


In the PyPI repository, we can discover and compare more Python libraries.

2. A low entry barrier -


Working in the ML and AI industry means dealing with lots of data that we need to process in the
most convenient and effective way. The low entry barrier allows more data scientists to quickly pick
up Python and start using it for AI development without investing too much effort in learning the
language. In addition to this, there's a lot of documentation available, and Python's community is
always there to help out and give advice.

3. Flexibility-
Python for machine learning is a great choice, as this language is very flexible:

o It offers the option to use either OOP or scripting.

o There's also no need to recompile the source code; developers can implement any
changes and quickly see the results.

o Programmers can combine Python and other languages to reach their goals.
4. Good Visualization Options-
For AI developers, it’s important to highlight that in artificial intelligence, deep learning, and
machine learning, it’s vital to be able to represent data in a human-readable format. Libraries like
Matplotlib allow data scientists to build charts, histograms, and plots for better data comprehension,
effective presentation, and visualization. Different application programming interfaces also simplify
the visualization process and make it easier to create clear reports.

5. Community Support-
It's always very helpful when there's strong community support built around a programming
language. Python is an open-source language, which means that there are plenty of resources open to
programmers, ranging from beginners to pros. A lot of Python documentation is available
online as well as in Python communities and forums, where programmers and machine learning
developers discuss errors, solve problems, and help each other out. The Python programming language is
absolutely free, as is the variety of useful libraries and tools.

6. Growing Popularity-
As a result of the advantages discussed above, Python is becoming more and more popular among data
scientists. According to StackOverflow, the popularity of Python is predicted to grow until 2020, at
least. This means it's easier to search for developers and replace team players if required. Also, the
cost of their work may not be as high as when using a less popular programming language.

IDE Used

For machine learning using Python we can use different IDEs, but in this training we mainly
used two: 1. Jupyter Notebook and 2. Google Colaboratory.
• IDE :- An IDE (Integrated Development Environment) is a software suite that consolidates
the basic tools that developers need to write and test software.
The different IDEs used during the internship training were:-
o Jupyter Notebook :- Project Jupyter is a community-run project with the goal to "develop open-
source software, open-standards, and services for interactive computing across dozens
of programming languages". It was spun off from IPython in 2014 by Fernando Pérez
and Brian Granger. Project Jupyter's name is a reference to the three core programming
languages supported by Jupyter, which are Julia, Python and R, and also a homage to
Galileo's notebooks recording the discovery of the moons of Jupiter. Project Jupyter has
developed and supported the interactive computing products Jupyter Notebook,
JupyterHub, and JupyterLab. Jupyter is financially sponsored by NumFOCUS.
We can use Jupyter by downloading Anaconda on the device and launching it via Anaconda.

o Google Colaboratory:- Colaboratory, or “Colab” for short, is a product from Google


Research. Colab allows anybody to write and execute arbitrary python code through the
browser, and is especially well suited to machine learning, data analysis and education.
More technically, Colab is a hosted Jupyter notebook service that requires no setup to
use, while providing access free of charge to computing resources including GPUs.
Colab is free of charge to use.

o Difference between Jupyter and Colab


Jupyter is the open source project on which Colab is based. Colab allows you to use and
share Jupyter notebooks with others without having to download, install, or run
anything.

Data Preprocessing, Analysis & Visualization


Machine Learning algorithms don't work well with raw data. Before we can feed such data
to an ML algorithm, we must pre-process it, i.e., apply some transformations to it. With data
preprocessing, we convert raw data into a clean data set. To perform this, there are seven common techniques
(a combined code sketch follows the list) -

1. Rescaling Data -
For data with attributes of varying scales, we can rescale attributes to possess the same scale. We rescale
attributes into the range 0 to 1 and call it normalization. We use the MinMaxScaler class from scikit-learn.
This gives us values between 0 and 1.

2. Standardizing Data -
With standardizing, we can take attributes with a Gaussian distribution and different means and standard
deviations and transform them into a standard Gaussian distribution with a mean of 0 and a standard
deviation of 1.

3. Normalizing Data -
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use the Normalizer
class.
4. Binarizing Data -
Using a binary threshold, it is possible to transform our data by marking the values above it 1 and those
equal to or below it, 0. For this purpose, we use the Binarizer class.

5. Mean Removal-
We can remove the mean from each feature to centre it on zero.

6. One Hot Encoding -


When dealing with a few scattered values, we may not need to store them as they are. Then, we can
perform One Hot Encoding. For k distinct values, we can transform the feature into a k-dimensional vector
with a single value of 1 and 0s for the rest.

7. Label Encoding -
Some labels can be words or numbers. Usually, training data is labelled with words to make it readable.
Label encoding converts word labels into numbers to let algorithms work on them.
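The following sketch combines several of the techniques above in one place (illustrative only; it assumes
scikit-learn and NumPy are installed, and the toy values are made up):

# Combined preprocessing sketch with scikit-learn transformers.
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler, Normalizer,
                                   Binarizer, OneHotEncoder, LabelEncoder)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])      # toy numeric features

print(MinMaxScaler().fit_transform(X))                        # 1. rescaling into the 0-1 range
print(StandardScaler().fit_transform(X))                      # 2. standardizing (mean 0, std 1); also covers 5. mean removal
print(Normalizer().fit_transform(X))                          # 3. normalizing each observation to unit norm
print(Binarizer(threshold=250.0).fit_transform(X))            # 4. binarizing around a threshold

colours = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colours).toarray())       # 6. one-hot encoding into k-dimensional vectors
print(LabelEncoder().fit_transform(["spam", "ham", "spam"]))  # 7. label encoding word labels into numbers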

Machine Learning Algorithms

There are many types of Machine Learning algorithms for different use cases. As we work with
datasets, a machine learning algorithm works in two stages. We usually split the data around 80%-20%
between the training and testing stages. Under supervised learning, we split a dataset into training data and test
data in Python ML. The following are some of the algorithms of Python Machine Learning -

1. Linear Regression-
Linear regression is one of the supervised Machine learning algorithms in Python that observes continuous
features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we
can call it simple linear regression or multiple linear regression.
This is one of the most popular Python ML algorithms and often under-appreciated. It assigns optimal weights
to variables to create a line, ax+b, to predict the output. We often use linear regression to estimate real values
like the number of calls or the cost of houses based on continuous variables. The regression line is the best line
that fits Y=a*X+b to denote the relationship between the independent and dependent variables.
To implement linear regression, we first import the libraries and the data set, then check for
multicollinearity and, if it is present, try to remove it. Next, we scale the data and create the train and test
partitions. After fitting the linear regression using scikit-learn, we generate predictions over the
test set and evaluate the model by plotting the residuals and verifying the assumptions of linear regression;
finally, we can visualise the coefficients to interpret the model result.
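A sketch of that workflow is shown below (illustrative only; the file name house_data.csv and the column name
selling_price are hypothetical placeholders, and the multicollinearity check and residual plots are omitted for
brevity):

# Linear regression workflow sketch with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("house_data.csv")                 # hypothetical dataset
X = df.drop(columns=["selling_price"])             # hypothetical target column
y = df["selling_price"]

X_scaled = StandardScaler().fit_transform(X)       # scale the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # fit Y = a*X + b
pred = model.predict(X_test)                       # predictions over the test set

residuals = y_test - pred                          # inspect the residuals to verify the assumptions
print("R2 on the test set:", r2_score(y_test, pred))
print("Coefficients:", dict(zip(X.columns, model.coef_)))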

Feature Engineering :- A feature is an independent piece of data we use to make the prediction better or more
intelligent.
Feature engineering is the science of extracting more information from existing data: no new data is added,
the data we already have is simply made more useful with respect to the problem at hand. We can apply feature
engineering in a program using two methods: 1. Feature Preprocessing and 2. Feature Generation.
Feature preprocessing means changing, updating, or transforming the existing data features, while feature
generation means creating a new feature from the existing features; the difference between the two is that feature
generation refers to the creation of new features from existing data, not simply transforming the values of the
existing features.

High Dimensionality :- Dimensionality relates the independent variables to the target variable. The relation
between a single independent variable and the target variable can be shown on a 2D graph; to show the relation
between two independent variables and the target variable, a 3D graph is required. So as the number of independent
variables increases, the dimensionality of the data also increases and visualization becomes tougher and tougher.
To reduce this we use Dimensionality Reduction: dimensionality reduction is the process of reducing the
number of variables in the data while conveying the maximum information.
We can use many techniques, mainly: 1. Missing value ratio, 2. Low variance, 3. High correlation, 4. Backward
feature elimination, 5. Forward feature selection.

1. Missing Value Ratio:- The missing value ratio is the ratio of missing values in a variable to the total number
of observations. Suppose that in our data set we have 28 variables, and we check each of them for missing
values. Since we cannot afford to lose a very high number of observations, we set a threshold value.
Let's say the threshold value is 0.7: if the missing value ratio of a variable is less than 0.7, the missing values can be
estimated, but if the missing value ratio is greater than 0.7, the variable is dropped to reduce the
dimensionality.
Missing Value Ratio = Number of missing values / Total number of observations.
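A small sketch of this technique (illustrative; the file name train.csv is a hypothetical placeholder and the 0.7
threshold follows the discussion above):

# Missing value ratio per variable, and dropping variables above the threshold.
import pandas as pd

df = pd.read_csv("train.csv")                           # hypothetical dataset
missing_ratio = df.isnull().sum() / len(df)             # missing values / total observations
to_drop = missing_ratio[missing_ratio > 0.7].index      # variables above the threshold are dropped
df_reduced = df.drop(columns=to_drop)
print(missing_ratio.sort_values(ascending=False).head())
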
2. Low Variance :- In this method we try to eliminate categorical variables by looking at the value counts of the
categories. Look at the frequency of the distinct categories: if the frequency of one category is very high (more than
95%), then we eliminate the variable. In predictive modelling, the independent variables with a low variance are
eliminated, that is, those whose value is (almost) the same throughout the data set. In this method we use the
normalizer rather than the standard scaler, because the standard scaler changes the variance of every variable to 1,
which defeats our aim: we would not be able to compare the variances of the independent variables if they were all equal.

3. High Correlation:- Variables with high correlation are eliminated. Pairwise correlation on its own is not fully
reliable, since it only looks at one correlation at a time; the Variance Inflation Factor (VIF) is a more reliable
technique, as it looks at correlation at an aggregated level.

4. Adjusted R²:- Backward feature elimination and forward feature selection are fairly advanced dimensionality
reduction methods, and both rely on Adjusted R². R² does not consider the number
of input variables in the model: its value keeps marginally increasing even if newly added features add
minimal importance. Adjusted R² does consider the number of input features fed to the predictive model, since adding
an extra feature with no significant effect on the dependent variable can reduce the model performance. Here,
Adjusted R²=1-{(n-1)/[n-(k+1)]}(1-R²), where n is the sample size and k is the number of variables in our regression model.
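The formula above can be sketched as a small helper function (illustrative only):

# Adjusted R² = 1 - [(n-1)/(n-(k+1))] * (1 - R²), with n samples and k input variables.
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - ((n - 1) / (n - (k + 1))) * (1 - r2)

# e.g. adjusted_r2(y_test, model.predict(X_test), k=X_test.shape[1])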

5. Forward Feature Selection:- In this method, for every input variable we build a linear regression model
individually, choose the independent variable with the highest adjusted R² score, and call it variable
one. Then we repeat the above process using variable one as the base independent variable, combining
it with every other independent variable to build regression models and choosing the combination with the highest
adjusted R² score. We repeat the above process until the maximum number of features has been reached or the
adjusted R² is no longer increasing.

6. Backward Feature Elimination :- This removes redundant variables from the model, striking a balance
between model performance and model simplicity. To do backward elimination, we first build a model
with all the independent variables and record the R² and adjusted R² values as a baseline. We then drop
one input variable at a time, rebuild the model, calculate the corresponding adjusted R², subtract the baseline
adjusted R² from the newly calculated adjusted R², and record the difference. The drop giving the maximum
difference becomes the new baseline adjusted R², and the corresponding independent variable is permanently
dropped. The process is repeated the required number of times.

2. Logistic Regression:-
Logistic regression is a supervised classification Machine Learning algorithm in Python that finds
its use in estimating discrete values like 0/1, yes/no, and true/false, based on a given set of independent
variables. We use a logistic function to predict the probability of an event, and this gives us an output between
0 and 1. Although it says ‘regression’, this is actually a classification algorithm. Logistic regression fits data
into a logit function and is also called logit regression. In this method the target variable is categorical. We use
a function whose range lies between zero and one: irrespective of the input, it always gives an output between
zero and one. It is given as g(x) = 1/(1+e^-x). Now, if we input the value of the linear regression line z = mx + c
into this function, we get Y = g(z); on substituting and expanding, we get Y = 1/(1+e^-(mx+c)). Hence this
sigmoid function restricts the value between zero and one, and the resulting curve is S-shaped. The method also
performs well in the presence of outliers. In this method we predict a value between zero and one. Logistic
regression in scikit-learn automatically computes the class probabilities by itself, using a 0.50 threshold: if the
probability is greater than 0.5 the observation is put in class one, and if the probability is less than 0.5 it is put
in class zero. The best fit in logistic regression is found by maximum likelihood estimation.
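A short sketch of logistic regression in scikit-learn (illustrative; the data here is synthetic):

# Logistic regression: probabilities via the sigmoid, classes via the default 0.50 threshold.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # parameters found by maximum likelihood
proba = clf.predict_proba(X_test)[:, 1]            # probability of class one, between 0 and 1
labels = clf.predict(X_test)                       # applies the default 0.50 threshold
print("Test accuracy:", clf.score(X_test, y_test))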

Confusion Matrix:- It is used to interpret the model predictions systematically. A confusion matrix is an n*n
matrix, where n is the number of distinct classes in the target variable; in binary notation we have class one as
the positive class and class zero as the negative class. It is the basis for most of the classification metrics. It can
be used to derive :-
1. Accuracy:- Accuracy = (TP + TN)/(TP+TN+FP+FN); the more correct the predictions, the greater the
accuracy of the model.
2. Precision:- It handles imbalanced data sets efficiently. It is used when avoiding false positives is more
essential than encountering false negatives. Precision = TP/(TP+FP).
3. Recall:- It is used to minimise false negatives; recall is used when avoiding false negatives is prioritised
over encountering false positives. Recall = TP/(TP+FN).
Where TP = TRUE POSITIVE; TN = TRUE NEGATIVE; FP = FALSE POSITIVE; FN = FALSE
NEGATIVE.
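These metrics can be derived directly from a confusion matrix, as in the following sketch (illustrative; the true
and predicted labels are made up):

# Accuracy, precision and recall from a 2*2 confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                      # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                      # model predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()    # TN, FP, FN, TP

print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))
print("Precision:", tp / (tp + fp))
print("Recall   :", tp / (tp + fn))
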
Log Loss :- Log loss measures the error of a classification model: the smaller the log loss, the better the
model performance. It measures how far the predicted probability of each observation is from its true class
(0 or 1); for binary classification, Log loss = -(1/N) Σ [y·log(p) + (1-y)·log(1-p)]. If a model C1 predicts
probabilities that are further from the true classes than a model C2, then C1 will have the higher log loss; model
C2 is the more confident model because its predictions are closer to the true classes.

AUC-ROC Curve:- The AUC-ROC curve is a performance measurement for classification problems at
various threshold settings. Here AUC stands for Area Under the Curve and ROC stands for Receiver Operating
Characteristic. The higher the AUC, the better the model is at predicting 0 as 0 and 1 as 1.
AUC-ROC works well only for nearly balanced data sets; it is not suitable for cases of imbalanced class
frequency.
The ROC curve is plotted as TPR against FPR, with TPR on the y-axis and FPR on the x-axis, where FPR is the
false positive rate and TPR is the true positive rate. The curve is prepared by calculating different values
of TPR and FPR at different threshold values.
The area under the curve can be used to determine the performance of the model: the higher the AUC-ROC, the better
the performance of the model. If the AUC-ROC is greater than 0.95, there could be something very wrong with the
model or the data set.
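A sketch of the computation (illustrative; the labels and predicted probabilities are made up):

# ROC curve points and the AUC-ROC score.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # predicted probability of class one
fpr, tpr, thresholds = roc_curve(y_true, proba)     # TPR and FPR at different threshold values
print("AUC-ROC:", roc_auc_score(y_true, proba))     # the higher, the better 0s and 1s are separated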

Data Dictionary:- A data dictionary is a centralized repository of information about the data, such as its
relationship to other data, its origin, usage and format.
In our data set, the data dictionary covers the multiple variables in the data set, which can be divided into three
categories:- 1. Demographic information about the customer, that is, Customer ID, Vintage, Age, etc.,
2. Customer bank relationship, and 3. Transaction information.

3. Decision Tree -
A decision tree falls under supervised Machine Learning Algorithms in Python and comes of use for both
classification and regression- although mostly for classification. This model takes an instance, traverses the
tree, and compares important features with a determined conditional statement. Whether it descends to the left
child branch or the right depends on the result. Usually, more important features are closer to the root.
Decision Tree, a Machine Learning algorithm in Python can work on both categorical and continuous
dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
Parametric Models :- A parametric model is a learning model that summarizes the data with a set of parameters of
fixed size. Parametric models make strong assumptions about the form of the mapping function. They are very simple
and interpretable. The set of parameters does not depend upon the amount of data.
We can implement one by selecting the form of the function and learning the coefficients of that function from the
training data. The number of parameters does not change irrespective of the size of the data. The assumed function may
or may not be a linear function like a line; if the true function is nothing like a line, then our assumption is wrong.
The benefit of using a parametric model is that it is much simpler to understand and to interpret the results. It
is very fast to learn from the data, it does not require much training data, and it can work even if the fit to the data is not
perfect.
But still there are some limitations: such methods are constrained to the chosen functional form, they have limited
complexity and are more suited to simpler problems, and in practice the methods are unlikely to match the exact
underlying mapping function.

Non-Parametric Models :- Non-parametric methods are good when you have a lot of data and no prior knowledge,
and when you don't want to worry too much about choosing exactly the right features. They do not make strong
assumptions about the form of the mapping function and are free to learn any functional form from the data. These
methods seek to best fit the training data when constructing the mapping function and are able to fit a large number
of functional forms.
Benefits of using non-parametric methods:- The algorithms are much more flexible and capable of fitting a large
number of functional forms; they make no assumptions (or only weak assumptions) about the underlying function; and
their predictive performance can be very high.
But there are some drawbacks of these methods:- They require more data to train and estimate the mapping function,
they are much slower to train as they have more parameters to fit, and there is more risk of overfitting the training data,
while it is harder to explain why a specific prediction is made.

Pure Node:- A pure node is a node in which all the data points exhibit the desired behaviour, i.e., have the same
class, so that all the points of a specific kind end up in the same branch.
The objective of a decision tree is to have nodes that are as close to pure as possible. A decision tree with pure nodes
is good at segregating the data into the respective classes.

Key terms used in Decision tree:-


1. Root Node:- The root node refers to the entire population of data.
2. Splitting:- The process of dividing a node into two or more sub-nodes.
3. Decision Node:- A sub-node that is further divided into more sub-nodes.
4. Leaf or Terminal Node:- A node which does not split further into other sub-nodes.
5. Branch or Subtree:- A sub-section of the entire tree is called a branch or sub-tree.
6. Parent Node:- A node which is further divided into sub-nodes.
7. Child Node :- A sub-node of a parent node.
8. Depth of Tree :- The length of the longest path from the root node to a leaf node. The root node by itself has a depth
of zero.

Type of Decision Tree:- There are two types of decision tree:- 1. Classification decision tree 2. Regression
decision tree.
1. Classification Decision Tree:- It is used when the target variable is categorical in nature.
2. Regression Decision Tree:- It is used when the target variable is continuous in nature.
Criteria to split the Decision Node:- There are two ways to split the data in a decision node :- 1. Gini impurity
2. Information gain using entropy.
1. Gini impurity:- It measures the impurity of a node; Gini impurity = 1 – Gini.
Gini is the probability that two random points chosen from the population of a node belong to the same class. So the
probability that points randomly picked from a pure node belong to the same class is 100%. Gini ranges from 0
to 1, and the higher the Gini, the higher the purity of the node.
So, Gini impurity = 1 - Gini, where
Gini = (p1²+p2²+p3²+p4²+…+pn²), where pi is the probability that a data point belongs to class i (so p1² is the
probability that any two random data points both belong to class 1).
Gini impurity for a split:- We use the weighted impurity of both sub-nodes, where
Weight of node = Number of samples in that node / Total number of samples in the parent node.
On comparing candidate splits, we choose the one whose weighted Gini impurity is lowest, because the lower the
Gini impurity, the higher the purity of the node.
2. Information gain by entropy:- Information gain is the difference between the information needed to
describe a parent node and the information needed to describe the child nodes. The more homogeneous
the node, the more information is gained. So IG = Entropy(parent) - weighted Entropy(sub-nodes); when the parent
node is maximally impure (entropy 1), this reduces to IG = 1 - Entropy(sub-node),
where Entropy = -P1 log(P1) - P2 log(P2) - P3 log(P3) …… - Pn log(Pn).
So the lower the entropy, the higher the purity of the node and the higher the information gain. It can also be used
with categorical variables.
To calculate the entropy of a split, we first calculate the entropy of the parent node, then calculate the entropy of
each of the child nodes, and then calculate the weighted average entropy of the split; if the
weighted entropy of the child nodes is greater than that of the parent node, we ignore the split.
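The two splitting criteria above can be sketched for a single node as follows (illustrative only):

# Gini impurity and entropy of one node, from the class proportions p_i.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)          # Gini impurity = 1 - sum(p_i^2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))     # Entropy = -sum(p_i * log2(p_i))

mixed = [1, 1, 1, 0, 0]                # an impure node
pure = [1, 1, 1, 1]                    # a pure node
print(gini_impurity(mixed), gini_impurity(pure))   # the pure node has impurity 0
print(entropy(mixed), entropy(pure))               # the pure node has entropy 0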

Ensemble Model:- In this method, instead of using a single model we use a group of models, take a prediction
from each model, and combine them to make a final prediction. To combine the predictions from each model into one
prediction we can take a majority vote, i.e., if prediction 1 = prediction 2 = prediction 3 then there is no issue;
otherwise, the majority class becomes the prediction. This technique is called maximum voting, and it works in
classification problems where the target variable is discrete. For a continuous target variable, the predicted values
are also continuous, so a vote cannot be taken and the mean is taken instead; in that case the method is called
averaging.
Pros of the Ensemble Model:- It captures most of the diverse signals or patterns, it makes fewer incorrect predictions,
and it reduces overfitting, as it is the collective result of diverse models. In the case of regression models the
mean, median, mode or weighted average are all used to combine the predictions of the models.
Cons of the Ensemble Model:- Increased complexity; not interpretable as a whole, leading to a black box; increased
time complexity and computational requirements.

Bagging:- It focuses on bringing diversity into the model by using diverse data. In bagging, each individual
tree is independent of the others because each considers a different subset of samples.
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used to
reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with
replacement—meaning that the individual data points can be chosen more than once. After several data
samples are generated, these weak models are then trained independently, and depending on the type of task—
regression or classification, for example—the average or majority of those predictions yield a more accurate
estimate.

As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging
and feature randomness to create an uncorrelated forest of decision trees.
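A sketch of bagging in scikit-learn (illustrative; the data is synthetic, and the default base estimator of
BaggingClassifier is a decision tree, matching the description above):

# Bagging: many trees trained on bootstrap samples, combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

bag = BaggingClassifier(n_estimators=50,   # number of weak models trained independently
                        bootstrap=True,    # each model sees a random sample drawn with replacement
                        random_state=1)
bag.fit(X_train, y_train)
print("Bagging accuracy:", bag.score(X_test, y_test))
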
4. K-Means Algorithm -
k-Means is an unsupervised algorithm that solves the problem of clustering. It classifies data using a number
of clusters. The data points inside a cluster are homogeneous and heterogeneous with respect to peer groups. k-means
clustering is a method of vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. k-means
clustering is rather easy to apply to even large data sets, particularly when using heuristics such as Lloyd's
algorithm. It often is used as a preprocessing step for other algorithms, for example to find a starting
configuration. The problem is computationally difficult (NP-hard). k-means originates from signal processing,
and still finds use in this domain. In cluster analysis, the k-means algorithm can be used to partition the input
data set into k partitions (clusters). k-means clustering has been used as a feature learning (or dictionary
learning) step, in either (semi-)supervised learning or unsupervised learning.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
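A sketch of these steps using scikit-learn (illustrative; the points are synthetic blobs):

# K-Means: choose K, then iteratively assign points and update centroids.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabelled data points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)      # Step 1: choose K = 3
labels = kmeans.fit_predict(X)                                # Steps 2-6: assign, recompute, repeat
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Within-cluster variance (inertia):", kmeans.inertia_)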

5. Random Forest -
A random forest is an ensemble of decision trees. In order to classify every new object based on its attributes,
trees vote for class- each tree provides a classification. The classification with the most votes wins in the forest.
Random forests or random decision forests are an ensemble learning method for classification, regression and
other tasks that operates by constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
Random forest algorithms have three main hyperparameters, which need to be set before training. These
include node size, the number of trees, and the number of features sampled. From there, the random forest
classifier can be used to solve for regression or classification problems.

The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is
comprised of a data sample drawn from a training set with replacement, called the bootstrap sample. Of that
training sample, one-third of it is set aside as test data, known as the out-of-bag (oob) sample, which we’ll
come back to later. Another instance of randomness is then injected through feature bagging, adding more
diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem,
the determination of the prediction will vary. For a regression task, the individual decision trees will be
averaged, and for a classification task, a majority vote—i.e. the most frequent categorical variable—will yield
the predicted class. Finally, the oob sample is then used for cross-validation, finalizing that prediction.
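A sketch of a random forest with the three hyperparameters named above made explicit (illustrative; the data is
synthetic):

# Random forest: bagging of decision trees plus feature randomness, with an OOB estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100,     # number of trees
                            max_features="sqrt",  # number of features sampled per split (feature bagging)
                            min_samples_leaf=1,   # node size
                            oob_score=True,       # evaluate on the out-of-bag sample
                            random_state=0)
rf.fit(X_train, y_train)                          # each tree sees a bootstrap sample of the training data
print("OOB score :", rf.oob_score_)
print("Test score:", rf.score(X_test, y_test))
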
Chapter 3 Result Discussion

Result

This training has introduced us to Machine Learning. Now, we know that Machine Learning is a technique of
training machines to perform the activities a human brain can do, albeit a bit faster and better than an average
human being. Today we have seen that machines can beat human champions in games such as Chess and
Mahjong, which are considered very complex. We have seen that machines can be trained to perform human
activities in several areas and can aid humans in living better lives. Machine learning is a quickly growing field
in computer science. It has applications in nearly every other field of study and is already being implemented
commercially, because machine learning can solve problems too difficult or time-consuming for humans to
solve. To describe machine learning in general terms, a variety of models are used to learn patterns in data and
make accurate predictions based on the patterns observed.

Machine Learning can be Supervised or Unsupervised. If we have a smaller amount of data that is clearly labelled
for training, we opt for Supervised Learning. Unsupervised Learning would generally give better
performance and results for large data sets. If we have a huge data set easily available, we go for deep learning
techniques. We have also learned about Reinforcement Learning and Deep Reinforcement Learning. We now know
what Neural Networks are, and their applications and limitations. Specifically, we have developed a thought
process for approaching problems that machine learning works so well at solving. We have learnt how machine
learning is different from descriptive statistics.

Finally, when it comes to the development of machine learning models of our own, we looked at the choices
of various development languages, IDEs and platforms. The next thing that we need to do is start learning and
practising each machine learning technique. The subject is vast, meaning there is great breadth, but if we consider
the depth, each topic can be learned in a few hours. Each topic is independent of the others. We need to take
one topic at a time, learn it, practise it and implement the algorithm(s) in it using a language of our
choice. This is the best way to start studying Machine Learning. Practising one topic at a time, very
soon we can acquire the breadth that is eventually required of a Machine Learning expert.
Chapter 4

Project Report

Overview-
A dataset related to houses sold. The data set contains various details about the houses that were
sold, including the selling price, the date of sale, the condition of the house and many more criteria, from which
we have to extract the similarities in order to predict the selling price of a house.

Dataset Description-
The data set was given by the Internshala team to work upon; it can be found by taking the machine
learning training in the Internshala app or website. The data set can be downloaded from the link provided by the
Internshala team in the third chapter of the training.

Result-
Our project successfully predicts the selling price of a house with 73.43%
accuracy.

Chapter 5

Advantages of Machine Learning


Every coin has two faces, and each face has its own properties and features. It's time to uncover the faces of ML,
a very powerful tool that holds the potential to revolutionize the way things work.

1. Easily identifies trends and patterns -


Machine Learning can review large volumes of data and discover specific trends and patterns that would not
be apparent to humans. For instance, for an e-commerce website like Amazon, it serves to understand the
browsing behaviour and purchase histories of its users to help cater to the right products, deals, and reminders
relevant to them. It uses the results to reveal relevant advertisements to them.
2. No human intervention needed (automation) -
With ML, we don't need to babysit our project every step of the way. Since it means giving machines the
ability to learn, it lets them make predictions and also improve the algorithms on their own. A common
example of this is anti-virus software; it learns to filter new threats as they are recognized. ML is also good
at recognizing spam.

3. Continuous Improvement -
As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets them make better
decisions. Say we need to make a weather forecast model: as the amount of data we have keeps growing, our
algorithms learn to make more accurate predictions faster.

4. Handling multi-dimensional and multi-variety data -


Machine Learning algorithms are good at handling data that are multi-dimensional and multi-variety, and they
can do this in dynamic or uncertain environments.

5. Wide Applications -
We could be an e-seller or a healthcare provider and make ML work for us. Where it does apply, it holds the
capability to help deliver a much more personal experience to customers while also targeting the right
customers.

Disadvantages of Machine Learning


With all those advantages to its powerfulness and popularity, Machine Learning isn’t perfect. The following
factors serve to limit it:

1. Data Acquisition -
Machine Learning requires massive data sets to train on, and these should be inclusive/unbiased, and of good
quality. There can also be times where they must wait for new data to be generated.

2. Time and Resources -


ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a
considerable amount of accuracy and relevancy. It also needs massive resources to function. This can mean
additional requirements of computer power for us.
3. Interpretation of Results -
Another major challenge is the ability to accurately interpret results generated by the algorithms. We must also
carefully choose the algorithms for our purpose.

4. High error-susceptibility -
Machine Learning is autonomous but highly susceptible to errors. Suppose you train an algorithm with data
sets small enough to not be inclusive. You end up with biased predictions coming from a biased training set.
This leads to irrelevant advertisements being displayed to customers. In the case of ML, such blunders can set
off a chain of errors that can go undetected for long periods of time. And when they do get noticed, it takes
quite some time to recognize the source of the issue, and even longer to correct it.
