Machine Learning Document
MACHINE LEARNING
Submitted to JNTUA in partial fulfillment of the requirement
For the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted By
B.Srinivasulu
(192H1A0514)
AUDISANKARA
INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(Accredited By: NAAC | Approved By: AICTE | Affiliated to JNTUA)
NH-5, BYPASS ROAD, GUDUR-524101, SRI BALAJI (DT.), ANDHRA PRADESH.
2022-2023
ACKNOWLEDGEMENT
First and foremost, I would like to thank my beloved parents for their blessings
and grace in making this skill oriented programming a success. I avail this
opportunity to express my profound sense of sincere and deep gratitude to
those who constantly guided, supported and encouraged me during the course of my
skill oriented programming.
I wish to express my heartfelt thanks and deep sense of gratitude to the
honorable chairman Dr.V.PENCHALAIAH for his encouragement and inspiration
throughout the process.
I would like to thank my beloved Director of “AUDISANKARA INSTITUTE OF
TECHNOLOGY”, Dr. A. MOHAN, for creating a competitive environment in our
college and encouraging me throughout this course.
I would like to thank my college management for having allowed me to do the
project work. Lastly, I would like to pay my regards and thanks to our principal
Dr.T.VENU MADHAV, whose ideas proved to be really worthwhile in our work.
I wish to express my deep sense of gratitude to my beloved and esteemed Head
of the Department of CSE, Dr.A.SWARUPA RANI, Assoc. Professor, for her
support, encouragement and valuable suggestions, which went a long way in the
successful completion of this skill oriented programming.
DECLARATION
I hereby declare that the skill oriented programming entitled “MACHINE
LEARNING” has been successfully completed. This skill oriented programming work
has been submitted to “AUDISANKARA INSTITUTE OF TECHNOLOGY”, GUDUR
in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology. I also declare that this skill oriented programming
report has not been submitted at any time to another institute or university for
the award of any degree.
B.Srinivasulu
(192H1A0514)
PLACE: GUDUR,
DATE:
INDEX
SL.NO  NAME OF THE CHAPTER                        DATE OF WORK
1      INTRODUCTION TO MACHINE LEARNING           12/08/2022
2      CLASSIFICATION OF MACHINE LEARNING         13/08/2022-15/08/2022
3      HISTORY OF MACHINE LEARNING                16/08/2022
4      LIFE CYCLE OF MACHINE LEARNING             17/08/2022
5      STRUCTURE OF MACHINE LEARNING              18/08/2022
6      CLASSIFICATION ALGORITHM IN ML             19/08/2022
7      LOGISTIC REGRESSION IN ML                  20/08/2022-22/08/2022
8      CLUSTERING IN ML                           23/08/2022-25/08/2022
9      CLUSTERING ALGORITHMS                      26/08/2022-27/08/2022
10     DATA PROCESSING                            28/08/2022-29/08/2022
11     REINFORCEMENT LEARNING                     30/08/2022-31/08/2022
12     INTRODUCTION TO DIMENSIONALITY REDUCTION   1/09/2022-2/09/2022
13     STEP BY STEP IMPLEMENTATION IN PYTHON      3/09/2022-5/09/2022
1. INTRODUCTION TO MACHINE LEARNING:
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as “the field of study that gives computers the
ability to learn without being explicitly programmed”.
➢ The field of study known as machine learning is concerned with the
question of how to construct computer programs that automatically
improve with experience.
DEFINITION OF LEARNING:
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
EXAMPLES:
1. A handwriting recognition learning problem
> Task T : Recognizing and classifying handwritten words within images
> Performance measure P : Percent of words correctly classified
> Training experience E : A dataset of handwritten words with given classifications
2. A robot driving learning problem
> Task T : Driving on highways using vision sensors
> Performance measure P : Average distance traveled before an error
> Training experience E : A sequence of images and steering commands recorded
while observing a human driver
DEFINITION:
➢ A computer program which learns from experience is called a machine
learning program or simply a learning program.
A. Supervised learning:
Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. The given data is
labeled. Both classification and regression problems are supervised learning
problems.
Example: Consider the following data regarding patients entering a clinic. The
data consists of the gender and age of the patients, and each patient is labeled as
“healthy” or “sick”.
Gender  Age  Label
M       49   sick
M       67   sick
F       53   healthy
M       49   sick
F       32   healthy
M       34   healthy
M       21   healthy
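As a minimal sketch of learning a function from labeled pairs, the Python snippet below fits a single age cut-off to the patient table above. The rule form (an age at or above the cut-off means "sick"), the function names `fit_age_threshold` and `predict`, and the choice to ignore gender are assumptions made only for illustration; they are not a method prescribed by this report.

```python
# Labeled examples from the table above: (gender, age, label).
patients = [
    ("M", 49, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
    ("M", 49, "sick"), ("F", 32, "healthy"), ("M", 34, "healthy"),
    ("M", 21, "healthy"),
]

def fit_age_threshold(rows):
    """Pick the age cut-off that classifies the most training rows correctly.

    Assumed rule for this sketch: age >= cut-off means "sick".
    """
    ages = sorted({age for _, age, _ in rows})
    # Candidate cut-offs are the midpoints between consecutive distinct ages.
    candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]

    def correct(cut):
        return sum((age >= cut) == (label == "sick") for _, age, label in rows)

    return max(candidates, key=correct)

def predict(age, cut):
    """Apply the learned rule to a new, unlabeled patient."""
    return "sick" if age >= cut else "healthy"

cut = fit_age_threshold(patients)
print(cut, predict(60, cut), predict(25, cut))
```

On this table the learned cut-off is 41.5, which classifies six of the seven patients correctly; the 53-year-old healthy patient is the one exception.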
B. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses. In
unsupervised learning algorithms, classification or categorization is not included in
the observations. Example: Consider the following data regarding patients entering
a clinic. The data consists of the gender and age of the patients.
Gender  Age
M       48
M       67
F       53
M       49
F       34
M       21
C. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as
to maximize its rewards.
A learner is not told what actions to take, as in most forms of machine learning, but
instead must discover which actions yield the most reward by trying them. For
example, consider teaching a dog a new trick: we cannot tell it
what to do, but we can reward or punish it if it does the right or wrong thing.
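The trial-and-error idea can be sketched in Python as an epsilon-greedy learner. Everything specific here is an assumption for illustration: the three actions, their fixed reward values, the exploration rate `epsilon`, and the function name `learn_by_trial`. The learner is never told which action is best; it only sees the reward after acting.

```python
import random

def learn_by_trial(rewards, episodes=2000, epsilon=0.1, seed=0):
    """Estimate the value of each action purely from the rewards received."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    n_actions = len(rewards)
    value = [0.0] * n_actions          # running average reward per action
    count = [0] * n_actions            # how often each action was tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                        # explore
        else:
            action = max(range(n_actions), key=lambda a: value[a])   # exploit
        reward = rewards[action]       # feedback given after the action
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return value

# Hypothetical rewards: action 1 earns the largest treat.
values = learn_by_trial([0.2, 0.8, 0.5])
best_action = max(range(len(values)), key=lambda a: values[a])
print(values, best_action)
```

The small amount of random exploration is what lets the learner discover that an action other than the first one it tried pays more.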
D. Semi-supervised learning:
In semi-supervised learning, an incomplete training signal is given: a training set with some (often many)
of the target outputs missing. A special case of this principle is known as
transduction, where the entire set of problem instances is known at
learning time, except that part of the targets are missing. Semi-supervised learning
is an approach to machine learning that combines a small amount of labeled data with a large
amount of unlabeled data during training. It falls between
unsupervised learning and supervised learning.
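One simple way to combine a few labeled points with many unlabeled ones is pseudo-labeling (self-training), sketched below in Python. This is only one of several semi-supervised techniques and is chosen here as an assumption; the one-dimensional data, the label names and the helper functions `centroids` and `nearest` are hypothetical.

```python
def centroids(points):
    """Average the values of each label: returns {label: mean value}."""
    sums = {}
    for x, label in points:
        total, n = sums.get(label, (0.0, 0))
        sums[label] = (total + x, n + 1)
    return {label: total / n for label, (total, n) in sums.items()}

def nearest(x, cents):
    """Return the label whose centroid is closest to x."""
    return min(cents, key=lambda label: abs(x - cents[label]))

labeled = [(1.0, "low"), (9.0, "high")]                  # small labeled set
unlabeled = [1.5, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0, 8.5]     # larger unlabeled set

cents = centroids(labeled)                               # step 1: fit on labeled data
pseudo = [(x, nearest(x, cents)) for x in unlabeled]     # step 2: label the rest
cents = centroids(labeled + pseudo)                      # step 3: refit on everything
print(cents)
```

After the refit, the centroids move from the two labeled points towards the bulk of the unlabeled data, which is the benefit the unlabeled examples provide.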
➔ 1985 – Terry Sejnowski invents NETtalk, which learns to pronounce words the
same way a baby does.
➔ 1990s — Work on machine learning shifts from a knowledge-driven
approach to a data-driven approach. Scientists begin creating programs for
computers to analyze large amounts of data and draw conclusions — or
“learn” — from the results.
➔ 1997 — IBM’s Deep Blue beats the world champion at chess.
➔ 2006 — Geoffrey Hinton coins the term “deep learning” to explain new
algorithms that let computers “see” and distinguish objects and text in
images and videos.
➔ 2010 — The Microsoft Kinect can track 20 human features at a rate of 30
times per second, allowing people to interact with the computer via
movements and gestures.
➔ 2011 — IBM’s Watson beats its human competitors at Jeopardy.
➔ 2011 — Google Brain is developed, and its deep neural network can learn to
discover and categorize objects much the way a cat does.
➔ 2012 – Google’s X Lab develops a machine learning algorithm that is able to
autonomously browse YouTube videos to identify the videos that contain
cats.
➔ 2014 – Facebook develops DeepFace, a software algorithm that is able to
recognize or verify individuals on photos to the same level as humans can.
➔ 2015 – Amazon launches its own machine learning platform.
➔ 2015 – Microsoft creates the Distributed Machine Learning Toolkit, which
enables the efficient distribution of machine learning problems across
multiple computers.
➔ 2015 – Over 3,000 AI and Robotics researchers, endorsed by Stephen
Hawking, Elon Musk and Steve Wozniak (among many others), sign an open
letter warning of the danger of autonomous weapons which select and
engage targets without human intervention.
➔ 2016 – Google’s artificial intelligence algorithm beats a professional player
at the Chinese board game Go, which is considered the world’s most complex
board game and is many times harder than chess. The AlphaGo algorithm
developed by Google DeepMind managed to win five games out of five in the
Go competition.
USES OF MACHINE LEARNING:
➢ Finance
➢ Health
➢ Government
➢ Stores
➢ Oil and gas
➢ Transport
6. CLASSIFICATION ALGORITHM IN ML:
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups, such as Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets, labels or
categories. Unlike regression, the output variable of Classification is a category, not
a value, such as “Green or Blue”, “fruit or animal”, etc. Since the Classification
algorithm is a Supervised learning technique, it takes labeled input data,
which means it contains input with the corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable
(x):
y = f(x), where y = categorical output
The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for
categorical data.
Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, class A and Class B. These classes have
features that are similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of classification:
Binary Classifier: If the classification problem has only two possible outcomes,
then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.
TYPES OF ML CLASSIFICATIONS:
Classification algorithms can be further divided into two main categories:
➢ Linear Models
   Logistic Regression
   Support Vector Machines
➢ Non-linear Models
   K-Nearest Neighbours
   Kernel SVM
   Naïve Bayes
   Decision Tree Classification
   Random Forest Classification
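To make one of the listed models concrete, the sketch below implements a small K-Nearest Neighbours classifier in plain Python. The two-feature training points, the class names "A" and "B", the default `k=3` and the function name `knn_predict` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k closest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (features, class).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.5), "B"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.6)))
```

The classifier stores the labeled data and defers all work to prediction time, which is why it needs no separate training step.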
Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, whose output is bounded by two extreme values (0 and 1).
The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using continuous and discrete
datasets.
Logistic Regression can be used to classify observations using different types of
data and can easily determine the most effective variables for the
classification. The below image shows the logistic function:
LOGISTIC FUNCTION:
• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value to a value between 0 and 1.
• The output of logistic regression must lie between 0 and 1 and cannot
go beyond this limit, so it forms an “S” shaped curve. This S-shaped curve
is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which
decides whether the predicted class is 0 or 1: values above the threshold
are assigned to 1, and values below the threshold are assigned to 0.
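The points above can be sketched in a few lines of Python. The function names `sigmoid` and `classify` and the default threshold of 0.5 are assumptions for illustration; the input `z` stands for the raw value that the fitted model produces before it is mapped to a probability.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z this equivalent form avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

def classify(z, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising or lowering the threshold trades one kind of misclassification for the other, so 0.5 is a common default rather than a fixed rule.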
ASSUMPTIONS:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity.
TYPES OF LOGISTIC REGRESSION:
On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: In binomial Logistic Regression, there can be only two possible types of
the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered
types of the dependent variable, such as “Low”, “Medium”, or “High”.
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on past product searches. Netflix
also uses this technique to recommend movies and web series to its users based on
their watch history.
The below diagram explains the working of the clustering algorithm. We can see
that the different fruits are divided into several groups with similar properties.
TYPES OF CLUSTERING:
The clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (a data point can also belong to another
group). Various other approaches to clustering also exist. Below
are the main clustering methods used in Machine learning:
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K is the
pre-defined number of groups. The cluster centers are chosen in such a way that
each data point is closer to the centroid of its own cluster than to the
centroid of any other cluster.
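The K-Means idea described above can be sketched in plain Python as follows. The sample points, the hand-picked starting centroids (real implementations usually pick them at random), the fixed number of iterations and the function name `k_means` are assumptions for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Here K is 2 because two starting centroids are supplied; choosing K is left to the user, which is the main practical difficulty of the method.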
Distribution Model-Based Clustering:
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which
uses Gaussian Mixture Models (GMM).
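A rough sketch of Expectation-Maximization for a one-dimensional Gaussian mixture is given below in Python. The sample data, the starting means, variances and weights, the variance floor of 1e-6 and the function names `gaussian_pdf` and `em_gmm_1d` are assumptions for illustration, not values taken from this report.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture by alternating E and M steps."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: probability that each point belongs to each Gaussian.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each Gaussian from its softly assigned points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    return means, variances, weights, resp

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, weights)
```

Unlike K-Means, each point keeps a probability of belonging to every Gaussian (the rows of `resp`), so the assignment is soft rather than hard.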
Hierarchical Clustering:
Fuzzy Clustering:
Fuzzy clustering is a type of soft method in which a data object may belong
to more than one group or cluster. Each data object has a set of membership
coefficients, which express its degree of membership in each cluster.
The Fuzzy C-means algorithm is an example of this type of clustering; it is
sometimes also known as the Fuzzy k-means algorithm.
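The membership coefficients can be illustrated with the standard Fuzzy C-means membership formula, sketched below in Python for one-dimensional data. Only the membership step is shown, for cluster centers that are assumed to be already known and distinct; the function name `fuzzy_memberships`, the example centers and the fuzzifier value `m=2.0` are assumptions.

```python
def fuzzy_memberships(x, centers, m=2.0):
    """Degree of membership of point x in each cluster (values sum to 1)."""
    distances = [abs(x - c) for c in centers]
    if 0.0 in distances:
        # A point lying exactly on a center belongs fully to that cluster.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** power for d_k in distances)
        for d_j in distances
    ]

u = fuzzy_memberships(2.0, centers=[0.0, 10.0])
print(u)
```

A point near one center receives a membership close to 1 for that cluster and a small but non-zero membership for the others.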
9. Clustering Algorithms:
The clustering algorithms can be divided based on the models that are
explained above. Many clustering algorithms have been published,
but only a few are commonly used. The choice of clustering algorithm depends on the
kind of data that we are using. For example, some algorithms need the
number of clusters in the given dataset to be guessed in advance, whereas others need to find
the minimum distance between the observations of the dataset.
Here we mainly discuss the popular clustering algorithms that are widely
used in machine learning:
Affinity Propagation: It is different from other clustering algorithms as it
does not require the number of clusters to be specified. In this algorithm, messages are
exchanged between pairs of data points until convergence. It has
O(N²T) time complexity, where N is the number of data points and T is the
number of iterations, which is the main drawback of this algorithm.
Collection
The most crucial step when starting with ML is to have data of good quality and
accuracy. Data can be collected from any authenticated source
like Kaggle or the UCI dataset repository. For example, while preparing for a
competitive exam, students study from the best study material that they can
access so that they learn well and obtain the best results. In the same way,
high-quality and accurate data makes the learning process of the model
easier and better, and at the time of testing the model can yield better
results. A huge amount of capital, time and resources is consumed in
collecting data. Organizations or researchers have to decide what kind of data
they need to execute their tasks or research.
Example: Working on a Facial Expression Recognizer needs numerous
images having a variety of human expressions. Good data ensures that the
results of the model are valid and can be trusted.
Preparation
The collected data can be in a raw form which can’t be directly fed to the
machine. So, this is a process of collecting datasets from different sources,
analyzing these datasets and then constructing a new dataset for further
processing and exploration. This preparation can be performed either
manually or by an automatic approach. Data can also be prepared in
numeric form, which speeds up the model’s learning.
Example: An image can be converted to a matrix of N x N dimensions, where the value
of each cell indicates an image pixel.
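As a small illustration of the image example above, the Python sketch below represents a tiny grayscale image as an N x N matrix and prepares it in numeric form. The 3 x 3 pixel values, the scaling to the range 0 to 1 and the flattening step are assumptions chosen for illustration.

```python
# A hypothetical 3 x 3 grayscale image: each cell is one pixel (0 = black, 255 = white).
image = [
    [0,   255, 0],
    [255, 255, 255],
    [0,   255, 0],
]

n = len(image)                                               # N, the side length
normalized = [[pixel / 255 for pixel in row] for row in image]   # scale to 0..1
flat = [pixel for row in normalized for pixel in row]        # N*N feature vector

print(n, flat)
```

Many learning algorithms expect each example as a single numeric vector, which is why the matrix is flattened in the last step.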
Input
The prepared data may still be in a form that is not machine-readable,
so some conversion algorithms are needed to convert this data to a readable form.
Executing this task requires high computation and accuracy.
Example: Data can be collected through sources like MNIST digit
data (images), Twitter comments, audio files and video clips.
Processing
This is the stage where algorithms and ML techniques are required to perform
the instructions provided over a large volume of data with accuracy and
optimal computation.
Output
In this stage, results are produced by the machine in a meaningful manner
that can be interpreted easily by the user. Output can be in the form of reports,
graphs, videos, etc.
Storage
This is the final step, in which the obtained output, the data model
and all the useful information are saved for future use.
The above image shows the robot, diamond, and fire. The goal of the robot
is to get the reward, which is the diamond, and avoid the hurdles, which are the fires.
The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the fewest hurdles. Each right step gives
the robot a reward and each wrong step subtracts from the reward of the
robot. The total reward is calculated when it reaches the final reward,
which is the diamond.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will
start.
• Output: There are many possible outputs, as there are a variety of
solutions to a particular problem.
• Training: The training is based upon the input. The model will return
a state, and the user will decide to reward or punish the model based
on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
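The robot, diamond and fire scenario can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the report does not name a specific one, so this choice is an assumption). The 3 x 3 grid, the positions of the start, diamond and fire, the reward values (+10, -10 and -1 per step) and the learning parameters are all hypothetical.

```python
import random

ROWS, COLS = 3, 3
START, DIAMOND, FIRE = (0, 0), (2, 2), (1, 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Move inside the grid and return (next_state, reward, episode_over)."""
    r = min(max(state[0] + action[0], 0), ROWS - 1)
    c = min(max(state[1] + action[1], 0), COLS - 1)
    nxt = (r, c)
    if nxt == DIAMOND:
        return nxt, 10.0, True     # final reward
    if nxt == FIRE:
        return nxt, -10.0, True    # hurdle: punished, episode ends
    return nxt, -1.0, False        # small cost for every ordinary step

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by repeated trial and error."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in range(4)}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(4)                              # explore
            else:
                a = max(range(4), key=lambda i: q[(state, i)])    # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q[(nxt, i)] for i in range(4))
            q[(state, a)] += alpha * (reward + gamma * best_next - q[(state, a)])
            state = nxt
    return q

def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start state."""
    state, path = START, [START]
    for _ in range(max_steps):
        a = max(range(4), key=lambda i: q[(state, i)])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path

q = train()
path = greedy_path(q)
print(path)
```

The input is the initial state, training rewards or punishes each move, and the final path is the one with the maximum learned reward, matching the main points listed above.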
Types of Reinforcement:
There are two types of Reinforcement:
Positive –
• Maximizes performance
• Sustains change for a long period of time
• Too much reinforcement can lead to an overload of states,
which can diminish the results
Advantages of reinforcement learning:
• Increases behavior
• Provides defiance to a minimum standard of performance
• It only provides enough to meet the minimum behavior
Why is Dimensionality Reduction important in Machine Learning
and Predictive Modeling?
• Filter
• Wrapper
• Embedded
Feature extraction: This reduces the data in a high dimensional
space to a lower dimensional space, i.e. a space with a smaller number of
dimensions.
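Principal Component Analysis (PCA) is one widely used feature extraction technique; it is named here as an assumption, since this section does not specify a method. The Python sketch below reduces two-dimensional points to one dimension by projecting them onto the direction of greatest variance. The sample points and the function name `pca_2d_to_1d` are hypothetical.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2 x 2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # One coordinate per point instead of two.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projected = pca_2d_to_1d(points)
print(direction, projected)
```

Because these sample points lie on a straight line, a single coordinate along that line keeps all of the information in the original two features.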
Advantages of Dimensionality Reduction:
statistics, and much more. SciPy is a functional library for
scientific and high-performance computations.
b. Read the CSV file:
We check the first five rows of our dataset. In this case, we are using a vehicle
model dataset. Please check out the dataset on Softlayer IBM.
Here our goal is to predict the value of “CO2 emissions” from the value of
“engine size” in our dataset.
e. Divide the data into training and testing data:
To check the accuracy of a model, we are going to divide our data into
training and testing datasets. We will use training data to train our model,
and then we will check the accuracy of our model using the testing dataset.
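The split described above can be sketched without any library as a simple slice. The helper name `split_rows` and the 80/20 fraction are assumptions for illustration; the rows are taken in their existing order, so shuffling beforehand may be advisable if the file is sorted.

```python
def split_rows(rows, train_fraction=0.8):
    """Return (training rows, testing rows) using the first part for training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(10))            # stand-in for ten rows of a dataset
train_rows, test_rows = split_rows(rows)
print(train_rows, test_rows)
```

Keeping the testing rows out of training is what makes the later accuracy check meaningful.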
f. Training our model:
Here is how we can train our model and find the coefficients for our best-fit
regression line.
Based on the coefficients, we can plot the best fit line for our dataset.
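For a single input feature, the coefficients of the best-fit line can also be computed directly with the least-squares formulas, as in the plain-Python sketch below. The function name `fit_line` and the four sample values are invented for illustration and are not taken from the dataset used in this report.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical engine sizes and emissions that lie exactly on a line.
engine_size = [1.0, 2.0, 3.0, 4.0]
emissions = [130.0, 170.0, 210.0, 250.0]

slope, intercept = fit_line(engine_size, emissions)
print(slope, intercept)
```

The slope says how much the predicted emission changes per unit of engine size, and the intercept is the predicted value at an engine size of zero.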
h. Prediction function:
i. Predicting CO2 emissions:
We can check the accuracy of a model by comparing the actual values with
the predicted values in our dataset.
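Two common ways to summarize this comparison are the mean absolute error and the R² score, sketched below in plain Python. The helper names and the sample actual and predicted values are assumptions for illustration only.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """Share of the variation in the actual values explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted emissions for four test rows.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(mae, r2)
```

A lower mean absolute error and an R² closer to 1 both indicate predictions that track the actual values more closely.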
Put it all together
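# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Read the CSV file and check the first five rows:
data = pd.read_csv("Fuel.csv")
data.head()

# Keep only the two columns used in this model:
data = data[["ENGINESIZE", "CO2EMISSIONS"]]

# ENGINESIZE vs CO2EMISSIONS:
plt.scatter(data["ENGINESIZE"], data["CO2EMISSIONS"], color="blue")
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

# We are using 80% data for training.
split = int(len(data) * 0.8)
train = data[:split]
test = data[split:]

# Modeling:
regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)

# The coefficients:
print("coefficients:", regr.coef_)    # slope
print("intercept:", regr.intercept_)  # intercept

# Plotting the regression line:
plt.scatter(train["ENGINESIZE"], train["CO2EMISSIONS"], color="blue")
plt.plot(train_x, regr.coef_[0][0] * train_x + regr.intercept_[0], "-r")
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

# Predicting values:
def get_regression_predictions(input_features, intercept, slope):
    # Equation of the fitted line: y = slope * x + intercept
    predicted_values = input_features * slope + intercept
    return predicted_values

# Predicting the emission for a single engine size:
my_engine_size = 3.5
estimated_emission = get_regression_predictions(
    my_engine_size, regr.intercept_[0], regr.coef_[0][0])
print("Estimated emission:", estimated_emission)

# Predicting emissions for the test data:
test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
test_y_ = regr.predict(test_x)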