
DAYANANDA SAGAR UNIVERSITY

A MINI PROJECT REPORT

ON
“IMPLEMENTATION OF KNN CLASSIFIER ALGORITHM AND
REGRESSION ANALYSIS ON IRIS DATASET”

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted by

SEEMA S [ ENG16CS0144 ]
SHAMANTH B M [ ENG16CS0146 ]
SRINISHA S [ ENG16CS0164 ]

VI Semester, 2018
Under the supervision of
Shivakumar C
Professor, Department of CSE,
Dayananda Sagar University.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


DAYANANDA SAGAR UNIVERSITY
SCHOOL OF ENGINEERING
KUDLU GATE, BANGALORE-560068

DAYANANDA SAGAR UNIVERSITY
School of Engineering, Kudlu Gate, Bangalore-560068

CERTIFICATE

This is to certify that the Project Report entitled “IMPLEMENTATION OF KNN CLASSIFIER ALGORITHM AND REGRESSION ANALYSIS ON IRIS DATASET”, submitted by SEEMA S [ENG16CS0144], SHAMANTH B M [ENG16CS0146], and SRINISHA S [ENG16CS0164] in partial fulfilment of the requirements for the award of the degree of B. Tech. in the Department of Computer Science & Engineering of Dayananda Sagar University, is a record of the candidates’ own work carried out by them under my supervision. The matter embodied in this report is original and has not been submitted for the award of any other degree.

Date: ______________ _____________________


Supervisor(s)

_____________________
Chairman
Department of Computer Science and Engineering

ACKNOWLEDGEMENT

It gives us a great sense of pleasure to present the report of the B. Tech. project undertaken during our B. Tech. fourth year. We owe a special debt of gratitude to Prof. Shivakumar, Professor, Department of Computer Science & Engineering, DSU, Karnataka, for his constant support and guidance throughout the course of our work. His sincerity, thoroughness, and perseverance have been a constant source of inspiration for us. It is only because of his cognisant efforts that our endeavours have seen the light of day.

We also take the opportunity to acknowledge the contribution of Dr. M K Banga, Chairman, Department of Computer Science & Engineering, DSU, Karnataka, for his full support and assistance during the development of the project.

We would also like to acknowledge the contribution of all the faculty members of the department for their kind assistance and cooperation during the development of our project. Last but not least, we acknowledge our family and friends for their contribution to the completion of the project.

DECLARATION

We hereby declare that this submission is our own work and that, to the best of our knowledge and belief, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or any other institute of higher learning, except where due acknowledgement has been made in the text.

Name:


SEEMA S [ ENG16CS0144 ]

SHAMANTH B M [ ENG16CS0146 ]

SRINISHA S [ ENG16CS0164 ]

TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

1.10 Problem Statement

LITERATURE REVIEW

DESIGN

3.1 Software requirements

3.2 Algorithm

3.3 Flow Chart

IMPLEMENTATION

4.1 Output screenshots

REFERENCES

ABSTRACT

The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). KNN (K-Nearest Neighbor) is a simple supervised classification algorithm we can use to assign a class to a new data point. It can be used for regression as well. KNN does not make any assumptions about the data distribution; hence it is non-parametric. It keeps all the training data and makes future predictions by computing the similarity between an input sample and each training instance. Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression, and genetics. For example, KNN was leveraged in a 2006 study of functional genomics for the assignment of genes based on their expression profiles. In this project, we work with the well-known supervised machine learning algorithm called k-NN or k-Nearest Neighbors. For this exercise, we use the Iris data set for classification. The attribute Species of the data set will be the variable that we want to predict.
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. In this project we also perform regression analysis on the Iris dataset.

CHAPTER 1
INTRODUCTION

1.1 Introduction to kNN Algorithm

Statistical learning refers to a collection of mathematical and computational tools for understanding data. In what is often called supervised learning, the goal is to estimate or predict an output based on one or more inputs. The inputs go by many names, such as predictors, independent variables, features, or simply variables. The output or outputs are often called response variables or dependent variables. If the response is quantitative, say a number that measures weight or height, we call these problems regression problems. If the response is qualitative, say yes or no, or blue or green, we call these problems classification problems. This case study deals with one specific approach to classification. The goal is to set up a classifier such that, when it is presented with a new observation whose category is not known, it will attempt to assign that observation to a category, or a class, based on the observations for which it does know the true category. This specific method is known as the k-Nearest Neighbors classifier, or kNN for short. Given a positive integer k, say 5, and a new data point, it first identifies the k points in the data that are nearest to the new point and classifies the new point as belonging to the most common class among those k neighbors.

1.2 Objective

Build our own k-Nearest Neighbor classifier to classify data from the Iris dataset provided by scikit-learn.

1.3 Distance between two points

We are going to write a function that finds the distance between two given 2-D points in the x-y plane. We will import NumPy and use NumPy arrays for storing the coordinates. Finding the distance between two points will help in finding the nearest neighbors of the input point.

import numpy as np

def distance(p1, p2):
    # Euclidean distance between two points
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1, 1])   # coordinates x = 1, y = 1
p2 = np.array([4, 4])   # coordinates x = 4, y = 4
distance(p1, p2)        # sqrt(18), approximately 4.243

1.4 Majority vote counter

We will create a 3 x 3 matrix of points with the help of a NumPy array to build an environment of dispersed points in the plane. We will also create a function called majority_vote() to find the highest count/vote in a list of votes, e.g. (1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 3, 2, 3). This is indirectly the mode of the given data, so it can also be calculated with the help of the scipy.stats module. We will create another function called majority_vote_short() which performs the same functionality as majority_vote() but makes use of mode() from scipy.stats. Both these functions will be necessary in predicting the points later.
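
A minimal sketch of the two vote counters described above is given below; the random tie-breaking in majority_vote() is an assumption, and majority_vote_short() uses mode() from scipy.stats as mentioned.

import random
from scipy import stats

def majority_vote(votes):
    # Count each vote and return the most common one;
    # ties are broken at random (an assumption).
    vote_counts = {}
    for vote in votes:
        vote_counts[vote] = vote_counts.get(vote, 0) + 1
    max_count = max(vote_counts.values())
    winners = [vote for vote, count in vote_counts.items() if count == max_count]
    return random.choice(winners)

def majority_vote_short(votes):
    # Same result via the mode from scipy.stats.
    mode, count = stats.mstats.mode(votes)
    return mode

votes = [1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 3, 2, 3]
majority_vote(votes)   # 2, the most frequent vote in this list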

Our aim is to build a kNN classifier, so we need to develop an algorithm to find the nearest neighbours of a given set of points. Suppose we need to insert a point into the x-y plane within an environment of a given set of existing points. We will have to classify the point we wish to insert into one of the categories of the existing points and then insert it accordingly. So, we will build a function find_nearest_neighbours() to find the nearest neighbours of the given point. It will take as parameters (i) the point we wish to insert, (ii) the set of existing points, and (iii) k, the number of neighbour indices to return. We will visualize the situation by plotting the x-y plane filled with points with the help of matplotlib.
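
A sketch of find_nearest_neighbours() under these assumptions; it reuses the distance() helper from Section 1.3 and returns the indices of the k closest points:

import numpy as np

def distance(p1, p2):
    # Euclidean distance, as defined in Section 1.3
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

def find_nearest_neighbours(p, points, k=5):
    # Compute the distance from p to every existing point,
    # then return the indices of the k nearest ones.
    distances = np.zeros(points.shape[0])
    for i in range(len(distances)):
        distances[i] = distance(p, points[i])
    ind = np.argsort(distances)   # indices sorted by increasing distance
    return ind[:k]

# The 3 x 3 grid of points mentioned in Section 1.4.
points = np.array([[1, 1], [1, 2], [1, 3],
                   [2, 1], [2, 2], [2, 3],
                   [3, 1], [3, 2], [3, 3]])
p = np.array([2.5, 2])
find_nearest_neighbours(p, points, k=2)   # the two grid points closest to p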

1.5 kNN Predict around Synthetic Data

After finding the nearest neighbors, we will have to predict the category of the input point. We will build a function called knn_predict() which will predict the category of the point we wish to insert. We can build another function called generate_synth_data() to generate synthetic points in the x-y plane.
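
A sketch of both functions, assuming find_nearest_neighbours() and majority_vote() from the previous sections are already defined; the two bivariate normal distributions used for the synthetic data are an assumption:

import numpy as np
from scipy import stats

def generate_synth_data(n=50):
    # n points drawn around (0, 0) labelled 0, and
    # n points drawn around (1, 1) labelled 1.
    points = np.concatenate((stats.norm(0, 1).rvs((n, 2)),
                             stats.norm(1, 1).rvs((n, 2))), axis=0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return points, outcomes

def knn_predict(p, points, outcomes, k=5):
    # Find the k nearest neighbours of p and take the majority class.
    ind = find_nearest_neighbours(p, points, k)
    return majority_vote(outcomes[ind])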

1.6 kNN Prediction GRID

We will build a function called make_prediction_grid() which will make a grid and allot the different classes of points in the grid. Another function, plot_prediction_grid(), must be created to plot the outputs of make_prediction_grid() using matplotlib.
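
A sketch of make_prediction_grid(), assuming knn_predict() from Section 1.5 is defined; plot_prediction_grid() would then colour this grid, for example with matplotlib's pcolormesh:

import numpy as np

def make_prediction_grid(predictors, outcomes, limits, h, k):
    # Classify every point of a mesh over the plane with step size h.
    x_min, x_max, y_min, y_max = limits
    xs = np.arange(x_min, x_max, h)
    ys = np.arange(y_min, y_max, h)
    xx, yy = np.meshgrid(xs, ys)
    prediction_grid = np.zeros(xx.shape, dtype=int)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p = np.array([x, y])
            prediction_grid[j, i] = knn_predict(p, predictors, outcomes, k)
    return xx, yy, prediction_grid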

1.7 Introduction to Iris data set

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and
Iris versicolor). Four features were measured from each sample: the length and the width of the
sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed
a linear discriminant model to distinguish the species from each other.

1.8 Classifying the IRIS Dataset

We will test our classifier on a scikit-learn dataset called “IRIS”. To import it, we need to import datasets from sklearn and call the function datasets.load_iris(). The “IRIS” dataset holds information on sepal length, sepal width, petal length, and petal width for three different classes of Iris flower: Iris-Setosa, Iris-Versicolour, and Iris-Virginica. Based on the data from the dataset, we need to classify and visualize them using our classifier. The scikit-learn (sklearn) library already holds a pre-built classifier. We will compare both classifiers [scikit-learn vs. the one that we built] and compare the prediction accuracy of both.
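
A minimal sketch of the comparison, using the first two Iris features (an assumption) and scikit-learn's KNeighborsClassifier against the knn_predict() built in Section 1.5:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
predictors = iris.data[:, 0:2]   # sepal length and sepal width (assumed features)
outcomes = iris.target

# scikit-learn's pre-built classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, outcomes)
sk_predictions = knn.predict(predictors)

# our home-made classifier from Section 1.5
my_predictions = np.array([knn_predict(p, predictors, outcomes, 5)
                           for p in predictors])

print(100 * np.mean(sk_predictions == outcomes))   # sklearn training accuracy
print(100 * np.mean(my_predictions == outcomes))   # our classifier's accuracy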

1.9 Introduction to regression analysis

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. One trick you can use to adapt linear regression to nonlinear relationships between variables is to transform the data according to basis functions. We have seen one version of this before, in the PolynomialRegression pipeline used in Hyperparameters and Model Validation and Feature Engineering. The idea is to take our multidimensional linear model, y = a0 + a1 x1 + a2 x2 + a3 x3 + …, and build x1, x2, x3, and so on, from our single-dimensional input x. This polynomial projection is useful enough that it is built into Scikit-Learn, via the PolynomialFeatures transformer.
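
A minimal sketch of this polynomial projection with made-up one-dimensional data; the degree and the sine-shaped data are assumptions for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)                 # single-dimensional input x
y = np.sin(x) + 0.1 * rng.randn(50)   # a nonlinear relationship with noise

# Expand x into polynomial basis functions x, x^2, ..., x^7,
# then fit an ordinary linear model on those features.
poly_model = make_pipeline(PolynomialFeatures(degree=7), LinearRegression())
poly_model.fit(x[:, np.newaxis], y)
yfit = poly_model.predict(np.linspace(0, 10, 100)[:, np.newaxis])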

1.10 Problem Statement


Implementation of the KNN classifier algorithm and regression analysis on the Iris dataset.

CHAPTER 2
LITERATURE REVIEW

[1] Harshit Dubey and Vikram Pudi, “Class Based Weighted K-Nearest Neighbor Over Imbalance Dataset”. In this paper, a modified version of the kNN algorithm is proposed that takes into account the class distribution in a wider region around the query instance. The authors' empirical experiments with several real-world datasets show that their algorithm outperforms current state-of-the-art approaches. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013).

[2] Harshit Dubey and Vikram Pudi, “CLUEKR: CLUstering based Efficient kNN Regression”. In this paper, the authors propose a novel, efficient, and accurate clustering-based kNN regression algorithm, CLUEKR, which has the advantage of low computational complexity. Instead of searching for nearest neighbors directly in the entire dataset, they first hierarchically cluster the data and then find the cluster in which the query point should lie. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013).

[3] Saket Bharambe, Harshit Dubey and Vikram Pudi, “BINER: BINary search based Efficient Regression”. Regression is the study of the functional dependency of one numeric variable with respect to another. In this paper, the authors present a novel, efficient, binary search based regression algorithm having the advantage of low computational complexity. In Proceedings of the 8th International Conference on Machine Learning and Data Mining (MLDM 2012).

[4] Harshit Dubey, Saket Bharambe and Vikram Pudi, “BINGR: BINary search based Gaussian Regression”. In this paper, the authors present a new regression algorithm and evaluate it against existing standard algorithms. The algorithm focuses on minimizing the range in which the response attribute has the maximum likelihood. In Proceedings of the 4th International Conference on Knowledge Discovery in Information Retrieval (KDIR 2012).

CHAPTER 3
DESIGN

3.1 Software requirements


1. Python 3

2. Anaconda-Navigator

3. Jupyter notebook

3.2 Algorithm
1. Pseudo Code for Regression Analysis
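
A minimal sketch of simple linear regression on the Iris dataset, assuming petal width is regressed on petal length (the choice of variables is an assumption for illustration):

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

iris = datasets.load_iris()
x = iris.data[:, 2].reshape(-1, 1)   # petal length (assumed predictor)
y = iris.data[:, 3]                  # petal width (assumed response)

model = LinearRegression()
model.fit(x, y)
print(model.coef_[0], model.intercept_)   # fitted slope and intercept
print(model.score(x, y))                  # coefficient of determination R^2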

2. Pseudo Code of KNN

We can implement a KNN model by following the steps below; a Python sketch of these steps appears after the list:


1. Load the data

2. Initialise the value of k

3. To get the predicted class, iterate from 1 to the total number of training data points

4. Calculate the distance between the test data and each row of the training data. Here we will use
Euclidean distance as our distance metric since it is the most popular method. Other metrics
that can be used are Chebyshev, cosine, etc.
5. Sort the calculated distances in ascending order based on distance values

6. Get top k rows from the sorted array

7. Get the most frequent class of these rows

8. Return the predicted class
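
A minimal sketch of these eight steps in Python; the function name and the use of collections.Counter are assumptions:

import numpy as np
from collections import Counter

def knn_classify(test_point, train_data, train_labels, k):
    # Step 4: Euclidean distance from the test point to each training row.
    distances = np.sqrt(np.sum((train_data - test_point) ** 2, axis=1))
    # Steps 5 and 6: sort in ascending order and take the top k rows.
    nearest = np.argsort(distances)[:k]
    # Steps 7 and 8: return the most frequent class among those rows.
    return Counter(train_labels[nearest]).most_common(1)[0][0]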


Output:

It seems from the output that our classifier is actually performing better than the sklearn classifier.

3.3 Flow Chart

Figure 1

Figure 2

CHAPTER 4

IMPLEMENTATION

4.1 Output screenshots


Importing Iris Dataset

Scatter Plot with Iris Dataset (Relationship between Petal Length and Petal Width)

Scatter Plot with Iris Dataset (Relationship between Sepal Length and Sepal Width)
kNN Classifier Algorithm on Iris Dataset

Linear regression on Iris Dataset

REFERENCES

[1] Harshit Dubey and Vikram Pudi, “Class Based Weighted K-Nearest Neighbor Over Imbalance Dataset”. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013).

[2] Harshit Dubey and Vikram Pudi, “CLUEKR: CLUstering based Efficient kNN Regression”. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013).

[3] Saket Bharambe, Harshit Dubey and Vikram Pudi, “BINER: BINary search based Efficient Regression”. In Proceedings of the 8th International Conference on Machine Learning and Data Mining (MLDM 2012).

[4] Harshit Dubey, Saket Bharambe and Vikram Pudi, “BINGR: BINary search based Gaussian Regression”. In Proceedings of the 4th International Conference on Knowledge Discovery in Information Retrieval (KDIR 2012).
