Machine Learning Document
Submitted to JNTUA in partial fulfillment of the requirement
For the award of the degree of
Submitted By
(Accredited By: NAAC|Approved By:AICTE|Affiliated to JNTUA)
First and foremost, I would like to thank my beloved parents for their blessings
and grace in making this skill oriented programming a success. I avail myself of
this opportunity to express my profound sense of sincere and deep gratitude to
those who constantly guided, supported and encouraged me during the course of my
skill oriented programming.
I wish to express my heartfelt thanks and deep sense of gratitude to the
honorable chairman Dr. V. PENCHALAIAH for his encouragement and inspiration
throughout the process.
I would like to thank my beloved Director of “AUDISANKARA INSTITUTE OF
TECHNOLOGY”, Dr. A. MOHAN, for creating a competitive environment in our
college and encouraging me throughout this course.
I would like to thank my college management for having allowed me to do the
project work. Lastly, I would like to pay my regards and thank our principal
Dr. T. VENU MADHAV, whose ideas proved to be really worthwhile in my work.
I wish to express my deep sense of gratitude to my beloved and esteemed Head
of the Department of CSE, Dr. A. SWARUPA RANI, Assoc. Professor, for her
support, encouragement and valuable suggestions, which went a long way in the
successful completion of this skill oriented programming.
I hereby declare that the skill oriented programming entitled “MACHINE
LEARNING” has been successfully completed. This skill oriented programming work
is submitted in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology. I also declare that this skill oriented
programming report has not been submitted at any time to another institute or
university for the award of any degree.
1. INTRODUCTION TO 12/08/2022
2. CLASSIFICATION OF MACHINE 13/08/2022-15/08/2022
3. HISTORY OF 16/08/2022
4. LIFE CYCLE OF 17/08/2022
5. STRUCTURE OF 18/08/2022
6. CLASSIFICATION 19/08/2022
7. LOGISTIC 20/08/2022-
8. CLUSTERING IN ML 23/08/2022-
9. CLUSTERING ALGORITHMS 26/08/2022-27/08/2022
10. DATA PROCESSING 28/08/2022-
11. REINFORCEMENT LEARNING 30/08/2022-31/08/2022
12. INTRODUCTION TO 1/09/2022-
13. STEP BY STEP 3/09/2022-
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as “the field of study that gives computers the
ability to learn without being explicitly programmed”.
➢ The field of study known as machine learning is concerned with the
question of how to construct computer programs that automatically
improve with experience.
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks T,
as measured by P, improves with experience E.
1. A handwriting recognition learning problem
> Task T : Recognizing and classifying handwritten words within images
> Performance P : Percent of words correctly classified
> Training experience E : A dataset of handwritten words with given classifications
2. A robot driving learning problem
> Task T : Driving on highways using vision sensors
> Performance P : Average distance traveled before an error
> Training experience E : A sequence of images and steering commands recorded
while observing a human driver
➢ A computer program which learns from experience is called a machine
learning program or simply a learning program.
A. Supervised learning:
Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. The given data is
labeled. Both classification and regression problems are supervised learning
problems.
Example: Consider the following data regarding patients entering a clinic. The
data consists of the gender and age of the patients, and each patient is labeled as
“healthy” or “sick”.
Gender age label
M 49 sick
M 67 sick
F 53 healthy
M 49 sick
F 32 healthy
M 34 healthy
M 21 healthy
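As a small illustration of learning from labeled examples, the sketch below applies a 1-nearest-neighbour rule to the table above. The distance function, the 10-year penalty for a gender mismatch, and the name predict_label are assumptions made only for this example; they are not part of the course material.

```python
# Labeled training examples taken from the table above: (gender, age, label).
patients = [
    ("M", 49, "sick"),
    ("M", 67, "sick"),
    ("F", 53, "healthy"),
    ("M", 49, "sick"),
    ("F", 32, "healthy"),
    ("M", 34, "healthy"),
    ("M", 21, "healthy"),
]

def distance(gender_a, age_a, gender_b, age_b):
    # Assumption: a gender mismatch counts like 10 years of age difference.
    gender_penalty = 0 if gender_a == gender_b else 10
    return abs(age_a - age_b) + gender_penalty

def predict_label(gender, age):
    # Return the label of the closest training example (1-nearest neighbour).
    nearest = min(patients, key=lambda p: distance(gender, age, p[0], p[1]))
    return nearest[2]

print(predict_label("M", 60))
print(predict_label("F", 30))
```

With so few examples the prediction is only as good as the nearest labeled patient, which is why larger labeled datasets are normally needed.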
B. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses. In
unsupervised learning algorithms, classification or categorization is not included in
the observations.
Example: Consider the following data regarding patients entering a clinic. The
data consists of the gender and age of the patients.
Gender age
M 48
M 67
F 53
M 49
F 34
M 21
C. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as
to maximize its rewards.
A learner is not told which actions to take, as in most forms of machine learning, but
instead must discover which actions yield the most reward by trying them. For
example, consider teaching a dog a new trick: we cannot tell it what to do,
but we can reward or punish it if it does the right or wrong thing.
D. Semi-supervised learning:
Semi-supervised learning is an approach to machine learning that combines a small
amount of labeled data with a large amount of unlabeled data during training. The
training signal is incomplete: the training set has some (often many) of the
target outputs missing. Semi-supervised learning falls between unsupervised
learning and supervised learning. A special case of this principle is known as
transduction, where the entire set of problem instances is known at learning time,
except that part of the targets are missing.
➔ 1985 – Terry Sejnowski invents NetTalk, which learns to pronounce words the
same way a baby does.
➔ 1990s – Work on machine learning shifts from a knowledge-driven
approach to a data-driven approach. Scientists begin creating programs for
computers to analyze large amounts of data and draw conclusions, or
“learn”, from the results.
➔ 1997 – IBM’s Deep Blue beats the world champion at chess.
➔ 2006 – Geoffrey Hinton coins the term “deep learning” to explain new
algorithms that let computers “see” and distinguish objects and text in
images and videos.
➔ 2010 – The Microsoft Kinect can track 20 human features at a rate of 30
times per second, allowing people to interact with the computer via
movements and gestures.
➔ 2011 – IBM’s Watson beats its human competitors at Jeopardy.
➔ 2011 – Google Brain is developed, and its deep neural network can learn to
discover and categorize objects much the way a cat does.
➔ 2012 – Google’s X Lab develops a machine learning algorithm that is able to
autonomously browse YouTube videos to identify the videos that contain
➔ 2014 – Facebook develops DeepFace, a software algorithm that is able to
recognize or verify individuals on photos to the same level as humans can.
➔ 2015 – Amazon launches its own machine learning platform.
➔ 2015 – Microsoft creates the Distributed Machine Learning Toolkit, which
enables the efficient distribution of machine learning problems across
multiple computers.
➔ 2015 – Over 3,000 AI and Robotics researchers, endorsed by Stephen
Hawking, Elon Musk and Steve Wozniak (among many others), sign an open
letter warning of the danger of autonomous weapons which select and
engage targets without human intervention.
➔ 2016 – Google’s artificial intelligence algorithm beats a professional player
at the Chinese board game Go, which is considered the world’s most complex
board game and is many times harder than chess. The AlphaGo algorithm
developed by Google DeepMind managed to win five games out of five in the
Go competition.
➢ Finance
➢ Health
➢ Government
➢ Stores
➢ Oil and gas
➢ Transport
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies each new observation into one of a number of classes or groups, such as
Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called
targets, labels or categories. Unlike regression, the output variable of
Classification is a category, not a value, such as “Green or Blue”, “fruit or
animal”, etc. Since the Classification algorithm is a Supervised Learning technique,
it takes labeled input data, which means it contains inputs with the corresponding
outputs.
In a classification algorithm, a discrete output function (y) is mapped to the input
variable (x):
y = f(x), where y = categorical output
The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the
categorical data.
Classification algorithms can be better understood using the diagram below. In the
diagram, there are two classes, Class A and Class B. The members of each class have
features that are similar to each other and dissimilar to those of the other class.
The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of classification:
Binary Classifier: If the classification problem has only two possible outcomes,
then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.
Classification algorithms can be further divided into two main categories:
➢ Linear Models
Logistic Regression
Support Vector Machines
➢ Non-linear Models
K-Nearest Neighbours
Kernel SVM
Naïve Bayes
Decision Tree Classification
Random Forest Classification
Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an “S”-shaped
logistic function, whose output is bounded by two extreme values (0 and 1).
The curve from the logistic function indicates the likelihood of something, such as
whether the cells are cancerous or not, or whether a mouse is obese or not based on
its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
data.
Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the
classification. The image below shows the logistic function:
• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value into another value within the range of 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot
go beyond this limit, so it forms an “S”-shaped curve. This S-shaped curve
is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value to decide
between the classes 0 and 1: values above the threshold are assigned to 1,
and values below the threshold are assigned to 0.
• The dependent variable must be categorical in nature.
• The independent variables should not have multi-collinearity.
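The sigmoid mapping and the threshold rule described above can be sketched in a few lines. The sigmoid formula 1 / (1 + e^(-z)) is standard; the threshold of 0.5 and the function names are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Map any real value z to a probability between 0 and 1.
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(z, threshold=0.5):
    # Assign class 1 when the probability reaches the threshold, else class 0.
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Here z stands for the linear combination of the input features that a trained logistic regression model would compute.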
On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: In binomial Logistic Regression, there can be only two possible types of
the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered
types of the dependent variable, such as “Low”, “Medium”, or “High”.
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general usages, clustering is used by Amazon in its recommendation
system to provide recommendations based on past product searches. Netflix
also uses this technique to recommend movies and web series to its users based on
their watch history.
The below diagram explains the working of the clustering algorithm. We can see
the different fruits are divided into several groups with similar properties.
The clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (a data point can also belong to
other groups). Various other approaches to clustering also exist. Below
are the main clustering methods used in machine learning:
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K is the
pre-defined number of groups. The cluster centers are chosen so that each data
point is closer to the centroid of its own cluster than to the centroid of any
other cluster.
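The centroid-based idea can be sketched as a minimal K-Means loop. The sample points, the starting centroids and the function name k_means are assumptions for illustration; library implementations choose the starting centroids more carefully.

```python
import math

def k_means(points, centroids, iterations=10):
    # points and centroids are lists of (x, y) tuples.
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data with two visible groups, and K = 2 starting centroids.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 9)])
print(centroids)
print(clusters)
```

The loop alternates between assigning points and moving centroids until the groups stop changing; a fixed number of iterations is used here only to keep the sketch short.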
Distribution Model-Based Clustering:
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which
uses Gaussian Mixture Models (GMM).
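A rough sketch of Expectation-Maximization for a mixture of two one-dimensional Gaussians is given below. The data, the starting guesses, the variance floor and the function names are all assumptions for illustration, not a production implementation.

```python
import math

def gaussian_pdf(x, mean, variance):
    # Density of a 1-D Gaussian distribution at x.
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def fit_two_gaussians(points, iterations=30):
    # Assumed starting guesses: one component at each end of the data.
    means = [min(points), max(points)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        responsibilities = []
        for x in points:
            scores = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(scores)
            responsibilities.append([s / total for s in scores])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            n_k = sum(r[k] for r in responsibilities)
            means[k] = sum(r[k] * x for r, x in zip(responsibilities, points)) / n_k
            spread = sum(r[k] * (x - means[k]) ** 2
                         for r, x in zip(responsibilities, points)) / n_k
            variances[k] = max(spread, 1e-6)  # floor keeps the variance positive
            weights[k] = n_k / len(points)
    return means, variances, weights, responsibilities

# Hypothetical data drawn around two centers.
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
means, variances, weights, responsibilities = fit_two_gaussians(points)
print([round(m, 2) for m in means])
```

Unlike K-Means, each point receives a probability of belonging to every component, which is what makes this a distribution-based method.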
Hierarchical Clustering:
Fuzzy Clustering:
Fuzzy clustering is a type of soft method in which a data object may belong
to more than one group or cluster. Each data object has a set of membership
coefficients, which indicate its degree of membership in each cluster.
The Fuzzy C-means algorithm is an example of this type of clustering; it is
sometimes also known as the Fuzzy k-means algorithm.
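The membership coefficients can be illustrated with the membership step of Fuzzy C-means for one-dimensional data. This sketch covers only that single step, with assumed cluster centers and the commonly used fuzzifier m = 2; the full algorithm also re-estimates the centers and repeats.

```python
def fuzzy_memberships(point, centers, m=2.0):
    # Distances from the point to every (distinct) cluster center.
    distances = [abs(point - c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    # Membership in cluster j: 1 / sum over k of (d_j / d_k) ** exponent.
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]

# Hypothetical cluster centers at 2.0 and 8.0.
print(fuzzy_memberships(4.0, [2.0, 8.0]))
print(fuzzy_memberships(5.0, [2.0, 8.0]))
```

The memberships of a point always add up to 1, and a point closer to a center receives a larger share for that cluster.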
9. Clustering Algorithms:
The clustering algorithms can be divided based on the models
explained above. Many different clustering algorithms have been published,
but only a few are commonly used. The choice of clustering algorithm depends
on the kind of data that we are using. For example, some algorithms need the
number of clusters in the given dataset to be guessed in advance, whereas
others need to find the minimum distance between the observations of the dataset.
Here we mainly discuss popular clustering algorithms that are widely
used in machine learning:
Affinity Propagation: It differs from other clustering algorithms in that it
does not require the number of clusters to be specified. In this algorithm,
messages are exchanged between pairs of data points until convergence. It has
O(N²T) time complexity, which is the main drawback of this algorithm.
Collection:
The most crucial step when starting with ML is to have data of good quality and
accuracy. Data can be collected from any authenticated source
like Kaggle or the UCI dataset repository. For example, while preparing for a
competitive exam, students study from the best study material that they can
access so that they learn the best to obtain the best results. In the same way,
high-quality and accurate data will make the learning process of the model
easier and better, and at the time of testing the model would yield
state-of-the-art results. A huge amount of capital, time and resources is
consumed in collecting data. Organizations or researchers have to decide what
kind of data they need to execute their tasks or research.
Example: Working on a Facial Expression Recognizer needs numerous
images having a variety of human expressions. Good data ensures that the
results of the model are valid and can be trusted.
The collected data can be in a raw form which can’t be directly fed to the
machine. So, this is a process of collecting datasets from different sources,
analyzing these datasets and then constructing a new dataset for further
processing and exploration. This preparation can be performed either
manually or automatically. Data can also be prepared in
numeric form, which speeds up the model’s learning.
Example: An image can be converted to a matrix of N x N dimensions, where the
value of each cell indicates the corresponding image pixel.
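The image-to-matrix example can be sketched with a tiny hand-written image. The 3 x 3 pixel values, the scaling to the range 0 to 1 and the helper names are assumptions for illustration; real images would be loaded with an imaging library.

```python
# A hypothetical 3 x 3 grayscale image: each cell holds one pixel intensity (0-255).
image = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]

def normalize(matrix):
    # Scale every pixel to the range [0, 1].
    return [[pixel / 255 for pixel in row] for row in matrix]

def flatten(matrix):
    # Turn the N x N matrix into one feature vector of length N * N.
    return [pixel for row in matrix for pixel in row]

features = flatten(normalize(image))
print(len(features))
print(features[:3])
```

A numeric vector of this kind is the form in which image data is typically handed to a learning algorithm.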
The prepared data may still be in a form that is not machine-readable,
so some conversion algorithms are needed to convert it to a readable form.
For this task to be executed, high computation and accuracy are needed.
Example: Data can be collected from sources like MNIST digit
data (images), Twitter comments, audio files and video clips.
This is the stage where algorithms and ML techniques are required to perform
the instructions provided over a large volume of data with accuracy and
optimal computation.
In this stage, results are produced by the machine in a meaningful manner
which can be interpreted easily by the user. Output can be in the form of reports,
graphs, videos, etc.
This is the final step, in which the obtained output, the data model
and all the useful information are saved for future use.
The above image shows the robot, the diamond, and the fire. The goal of the robot
is to get the reward, which is the diamond, and to avoid the hurdles, which are
the fire. The robot learns by trying all the possible paths and then choosing the
path which gives it the reward with the fewest hurdles. Each right step gives
the robot a reward and each wrong step subtracts from the reward of the
robot. The total reward is calculated when it reaches the final reward,
which is the diamond.
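The robot example can be sketched with a tabular Q-learning update on a small grid. The 3 x 3 layout, the reward values, the discount factor and the exhaustive sweep over every state and action (used instead of random exploration so that the result is repeatable) are all assumptions made for this illustration.

```python
# Hypothetical 3 x 3 grid: the robot starts at (0, 0), the diamond is the goal,
# and the fire is a hurdle.
SIZE = 3
START, DIAMOND, FIRE = (0, 0), (2, 2), (1, 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9  # discount factor for future rewards

def step(state, action):
    # Move one cell; bumping into a wall leaves the robot where it is.
    row = min(max(state[0] + ACTIONS[action][0], 0), SIZE - 1)
    col = min(max(state[1] + ACTIONS[action][1], 0), SIZE - 1)
    next_state = (row, col)
    if next_state == DIAMOND:
        return next_state, 10.0, True   # reward for reaching the diamond
    if next_state == FIRE:
        return next_state, -10.0, True  # penalty for touching the fire
    return next_state, -1.0, False      # small cost for every other step

states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
q = {(s, a): 0.0 for s in states for a in ACTIONS}

# Try every action in every non-terminal state repeatedly and apply the
# Q-learning update with learning rate 1 (the environment is deterministic).
for _ in range(20):
    for s in states:
        if s in (DIAMOND, FIRE):
            continue
        for a in ACTIONS:
            nxt, reward, done = step(s, a)
            future = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
            q[(s, a)] = reward + GAMMA * future

def greedy_path(start, max_steps=10):
    # Follow the highest-valued action from each state.
    path, state = [start], start
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

print(greedy_path(START))
```

After learning, following the best-valued action from each cell leads the robot to the diamond while keeping away from the fire. A practical agent would usually explore by sampling actions rather than sweeping over all of them.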
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will
start.
• Output: There are many possible outputs, as there are a variety of
solutions to a particular problem.
• Training: The training is based upon the input. The model will return
a state, and the user will decide to reward or punish the model based
on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
Types of Reinforcement:
There are two types of Reinforcement:
Positive –
• Maximizes performance
• Sustains change for a long period of time
• Too much reinforcement can lead to an overload of states,
which can diminish the results
Negative –
Advantages of reinforcement learning:
• Increases behavior
• Provides defiance to a minimum standard of performance
• It only provides enough to meet the minimum behavior
Why is Dimensionality Reduction important in Machine Learning
and Predictive Modeling?
• Filter
• Wrapper
• Embedded
Feature extraction: This reduces the data in a high-dimensional
space to a lower-dimensional space, i.e. a space with a lesser number of
dimensions.
Advantages of Dimensionality Reduction:
statistics, and much more. Scipy is a functional library for
scientific and high-performance computations.
b. Read the CSV file:
We check the first five rows of our dataset. In this case, we are using a vehicle
model dataset; please check out the dataset on Softlayer IBM.
Here our goal is to predict the value of “CO2 emissions” from the value of
“engine size” in our dataset.
e. Divide the data into training and testing data:
To check the accuracy of a model, we are going to divide our data into
training and testing datasets. We will use training data to train our model,
and then we will check the accuracy of our model using the testing dataset.
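The split can be sketched without any library, using the same 80% proportion as the full program in this report. The sample rows and the function name are assumptions for illustration; in practice the rows are often shuffled before splitting.

```python
def train_test_split_ordered(rows, train_fraction=0.8):
    # Keep the first 80% of the rows for training and the remainder for testing.
    split_index = int(len(rows) * train_fraction)
    return rows[:split_index], rows[split_index:]

# Hypothetical (engine size, CO2 emissions) pairs, not taken from the real dataset.
rows = [(1.0, 120), (1.6, 150), (2.0, 180), (2.4, 200), (3.0, 240),
        (3.5, 260), (4.0, 300), (4.6, 330), (5.0, 350), (5.7, 400)]
train, test = train_test_split_ordered(rows)
print(len(train), len(test))
```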
f. Training our model:
Here is how we can train our model and find the coefficients for our best-fit
regression line.
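To show what finding the coefficients means, the sketch below computes the slope and intercept of a best-fit line with the ordinary least squares formulas. The sample values and the function name fit_line are assumptions for illustration; a library regression model performs an equivalent calculation.

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical points that lie exactly on y = 50 + 40 * x.
engine_sizes = [1.0, 2.0, 3.0, 4.0]
emissions = [90.0, 130.0, 170.0, 210.0]
intercept, slope = fit_line(engine_sizes, emissions)
print(intercept, slope)
```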
Based on the coefficients, we can plot the best fit line for our dataset.
h. Prediction function:
i. Predicting CO2 emissions:
We can check the accuracy of a model by comparing the actual values with
the predicted values in our dataset.
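Two common ways to compare actual and predicted values are the mean absolute error and the R2 score. The sketch below computes both by hand; the sample values are hypothetical, and the function names are chosen to mirror the usual metric names rather than taken from this report.

```python
def mean_absolute_error(actual, predicted):
    # Average size of the prediction errors, in the units of the target.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    # 1.0 means a perfect fit; 0.0 means no better than predicting the mean.
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted CO2 emission values.
actual = [200.0, 250.0, 300.0]
predicted = [210.0, 240.0, 310.0]
print(mean_absolute_error(actual, predicted))
print(r2_score(actual, predicted))
```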
Put it all together
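# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
# Read the CSV file:
data = pd.read_csv("Fuel.csv")
# We are using 80% of the data for training.
split = int(len(data) * 0.8)
train = data[:split]
test = data[split:]
# Modeling:
regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)
# The coefficients:
print("coefficients:", regr.coef_)
print("intercept:", regr.intercept_)
# Plotting the best-fit line over the training data:
plt.scatter(train["ENGINESIZE"], train["CO2EMISSIONS"], color="blue")
plt.plot(train_x, regr.coef_[0][0] * train_x + regr.intercept_[0], "-r")
plt.xlabel("Engine size")
plt.ylabel("CO2 emissions")
plt.show()
# Predicting values:
def get_regression_predictions(input_features, intercept, slope):
    # Straight-line prediction: y = intercept + slope * x
    predicted_values = input_features * slope + intercept
    return predicted_values
my_engine_size = 3.5
estimated_emission = get_regression_predictions(
    my_engine_size, regr.intercept_[0], regr.coef_[0][0])
print("Estimated emission:", estimated_emission)
# Predicting on the test data:
test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
test_y_ = regr.predict(test_x)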