9.0 KNN Nearest Neighbours Algorithm


KNN, which stands for K Nearest Neighbours, is a Supervised Machine Learning algorithm that classifies a
new data point into a target class based on the features of its neighbouring data points.

Features Of KNN Algorithm:

• KNN is a Supervised Learning algorithm that uses a labelled input data set to predict the output for
new data points.

• It is one of the simplest machine learning algorithms and can be easily implemented for a wide
variety of problems.

• It is mainly based on feature similarity: KNN checks how similar a data point is to its neighbours and
assigns the data point to the class it is most similar to.

• Unlike many algorithms, KNN is a non-parametric model, which means that it makes no assumptions
about the underlying distribution of the data set. This makes the algorithm effective on realistic
data.

• KNN is a lazy algorithm: it memorizes the training data set instead of learning a discriminative
function from it.

• KNN can be used for solving both classification and regression problems.

Algorithm:

• Step 1: Given the point P, determine the subset of data that lies in the ball of radius r centered at
P:

B_r(P) = { X_i ∈ X | dist(P, X_i) ≤ r }

• Step 2: If Br (P) is empty, then output the majority class of the entire data set.

• Step 3: If Br (P) is not empty, output the majority class of the data points in it.
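
A minimal sketch of these three steps in Python, using NumPy and a hypothetical two-dimensional toy data set (the points, labels, query point and radius below are illustrative, not taken from the report):

# Radius-based nearest-neighbour classification following Steps 1-3 above.
from collections import Counter
import numpy as np

def radius_neighbours_classify(P, X, y, r):
    # Classify P using the majority class inside the ball B_r(P).
    dists = np.linalg.norm(X - P, axis=1)       # dist(P, X_i) for every training point
    in_ball = y[dists <= r]                     # labels of the points with dist <= r
    votes = in_ball if len(in_ball) > 0 else y  # empty ball -> majority class of the whole data set
    return Counter(votes.tolist()).most_common(1)[0][0]

# Toy example: two labelled clusters in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["low", "low", "high", "high"])
print(radius_neighbours_classify(np.array([1.1, 0.9]), X, y, r=1.0))  # -> "low"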

CGPA PREDICTION BY NEAREST NEIGHBOUR ALGORITHM

(Considering: Credit Hours, Time Spent outside College in Academic Activities, Usage Of Library,
Seeking A Librarian Or Staff Member’s Help In Academics, Read Assigned Materials Other Than
Textbooks, Used An Index Or Database, Developed A Bibliography Or Reference List, Gone Back To
Read A Basic Reference Or Document)
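
One way such a prediction could be set up is sketched below with scikit-learn's KNeighborsClassifier; the feature columns, numbers and CGPA categories are placeholders standing in for the survey attributes listed above, not the report's actual data:

# Hedged sketch: predicting a CGPA category with k-nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each row: [credit_hours, hours_outside_college, library_visits_per_week] (hypothetical values).
X_train = [[18, 10, 3], [15, 4, 1], [20, 12, 5], [12, 2, 0]]
y_train = ["8-10", "5-7", "8-10", "5-7"]          # CGPA category of each student

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[16, 8, 2]]))                # predicted CGPA category for a new student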
11.0 DECISION TREE

The decision tree is a data mining technique for solving classification and prediction problems.
Decision trees are a simple recursive structure for expressing a sequential classification process in
which a case, described by a set of attributes, is assigned to one of a disjoint set of classes. Decision
trees consist of nodes and leaves. Each node in the tree involves testing a particular attribute and
each leaf of the tree denotes a class. Usually, the test compares an attribute value with a constant.
Leaf nodes give a classification that applies to all instances that reach the leaf, or a set of
classifications, or a probability distribution over all possible classifications. To classify an unknown
instance, it is routed down the tree according to the values of the attributes tested in successive
nodes, and when a leaf is reached, the instance is classified according to the class assigned to the
leaf. If the attribute that is tested at a node is a nominal one, the number of children is usually the
number of possible values of the attribute. The complexity of a tree is measured by one of the following
metrics: the total number of nodes, the total number of leaves, the tree depth, or the number of
attributes used. Example:

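A minimal illustration, assuming a hypothetical two-level tree for the CGPA setting (the attributes, thresholds and classes are invented for the example, not learned from the report's data):

# A hand-written decision tree: each inner node tests one attribute against a constant,
# and each leaf assigns a class. An instance is routed down the tree by these tests.
def classify(student):
    if student["credit_hours"] <= 14:              # root node: numeric test
        return "CGPA 5-7"                          # leaf
    else:
        if student["library_use"] == "often":      # second-level node: nominal test
            return "CGPA 8-10"                     # leaf
        return "CGPA 7-8"                          # leaf

print(classify({"credit_hours": 18, "library_use": "often"}))   # -> "CGPA 8-10"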


According to Mitchell, the central choice in the algorithm is selecting which attribute to test at each
node in the tree. There is a good quantitative measure for this problem, called information gain. But
in order to define information gain precisely, it is necessary to define a measure commonly used in
information theory, called entropy, that characterises the (im)purity of an arbitrary collection of
examples. If the target attribute can take on m different values, then the entropy of S relative to this
m-wise classification is defined as:

Entropy(S) = − Σ (i = 1 to m) p_i log2(p_i)

where S is the given collection and p_i is the proportion of S belonging to class i.

Given entropy as a measure of the impurity of a collection of training examples, a measure of the
effectiveness of an attribute in classifying the training data can now be defined. This measure is
called information gain. It is the expected reduction in entropy caused by partitioning the examples
according to this attribute. The information gain, Gain(S, A), of an attribute A relative to a collection
of examples S is defined as:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) ( |S_v| / |S| ) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which
attribute A has value v.
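
Both formulas can be computed directly; the sketch below does so for a small hypothetical sample (the target classes and the attribute values are made up for illustration):

# Entropy and information gain for a tiny hypothetical sample S.
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        S_v = [lab for lab, a in zip(labels, attribute_values) if a == v]
        gain -= (len(S_v) / n) * entropy(S_v)
    return gain

target = ["high", "high", "low", "low", "high", "low"]   # target class of six examples
A      = ["yes",  "yes",  "no",  "no",  "yes",  "no"]    # values of one nominal attribute
print(entropy(target), information_gain(target, A))      # 1.0 and 1.0 (A splits S perfectly)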

The C4.5 algorithm was proposed in 1992 by Ross Quinlan to overcome the limitations of the ID3
algorithm (unavailable values, continuous attribute value ranges, pruning of decision trees, etc.). It
uses a divide-and-conquer approach to growing decision trees. Its default splitting criterion is the
gain ratio, an information-based measure that takes into account the number of different test
outcomes.
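
As a rough sketch, the gain ratio divides the information gain by the "split information" of the attribute (the entropy of the partition the attribute induces); the example below reuses the hypothetical attribute A from the previous sketch, where the gain was 1.0:

# Gain ratio = information gain / split information (hypothetical values).
import math
from collections import Counter

def split_information(attribute_values):
    n = len(attribute_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(attribute_values).values())

def gain_ratio(gain, attribute_values):
    si = split_information(attribute_values)
    return gain / si if si > 0 else 0.0

A = ["yes", "yes", "no", "no", "yes", "no"]
print(gain_ratio(1.0, A))   # split information is 1.0 here, so the gain ratio is also 1.0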

14.0 CHI-SQUARE TEST

A Chi-Square test is a test of statistical significance for categorical variables. In hypothesis testing,
the chi-square test is used to test a hypothesis about the distribution of observations (frequencies)
across different categories.

NULL HYPOTHESIS (H0): The data follow a specified distribution

ALTERNATIVE HYPOTHESIS (HA): The data do not follow the specified distribution

CHI-SQUARE TEST MANUALLY

χ² EQUATION:

χ² = Σ_i (O_i − E_i)² / E_i

Here, O_i is the observed frequency for bin i and E_i is the expected frequency for bin i.

TO CALCULATE THE EXPECTED FREQUENCY FOR BIN i:

E_i = N × p_i

Here, N is the total sample size, and p_i is the hypothesized proportion of observations in bin i.
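
Both quantities can be obtained with SciPy's chi-square goodness-of-fit routine; the counts below are placeholder values, not the report's data:

# Chi-square goodness-of-fit test with SciPy (hypothetical observed counts).
from scipy.stats import chisquare

observed = [50, 30, 20]                 # O_i: observed frequency in each bin
N = sum(observed)
p = [1/3, 1/3, 1/3]                     # hypothesized proportions p_i
expected = [N * pi for pi in p]         # E_i = N * p_i

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)                    # chi-square statistic and its p-value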

HYPOTHESIS: STUDENTS ARE EQUALLY DISTRIBUTED ACROSS ALL THREE CGPA CATEGORIES

We have 3 categories of CGPA (5-7, 7-8, 8-10), and our null hypothesis is that students are equally
distributed across all three CGPA categories, so the hypothesized proportion for each category is p_i = 1/3.
CALCULATION OF χ²

χ² = [ (58 − 41)² / 41 ] + [ (24 − 41)² / 41 ] + [ (21 − 41)² / 41 ]

= 14.097

CALCULATE df

df = k − 1, where k is the number of categories. So, df = 3 − 1 = 2.

Thus, we calculated χ² (df = 2) = 14.097, p = 0.00087.
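
The quoted p-value can be checked from the statistic and its degrees of freedom, for example with SciPy:

# p-value of a chi-square statistic with 2 degrees of freedom.
from scipy.stats import chi2

chi_sq, df = 14.097, 2
p_value = chi2.sf(chi_sq, df)    # survival function: P(X >= chi_sq)
print(round(p_value, 5))         # ~0.00087, matching the value above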

15.0 ANOVA

Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test the equality of two or
more population (or treatment) means by examining the variances of the samples that are taken.
ANOVA allows one to determine whether the differences between the samples are simply due to
random error (sampling error) or whether there are systematic treatment effects that cause the
mean in one group to differ from the mean in another.

Most of the time, ANOVA is used to compare the equality of three or more means; however, when
the means from two samples are compared using ANOVA, it is equivalent to using a t-test to compare
the means of independent samples.

ANOVA is based on comparing the variance (or variation) between the data samples to the variation
within each particular sample. If the between-sample variation is much larger than the within-sample
variation, the means of the different samples are unlikely to be equal. If the between and within
variations are of approximately the same size, no significant difference between the sample means will be detected.
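
A one-way ANOVA of this kind can be run with SciPy's f_oneway; the three groups below are hypothetical samples, not the report's data:

# One-way ANOVA comparing three hypothetical groups of scores.
from scipy.stats import f_oneway

group_a = [72, 75, 78, 71, 74]
group_b = [80, 82, 79, 85, 81]
group_c = [73, 76, 74, 77, 75]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a large F (between >> within variation) gives a small p-value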

Assumptions of ANOVA:

(i) All populations involved follow a normal distribution.

(ii) All populations have the same variance (or standard deviation).

(iii) The samples are randomly selected and independent of one another.

Since ANOVA assumes the populations involved follow a normal distribution, ANOVA falls into a
category of hypothesis tests known as parametric tests. If the populations involved did not follow a
normal distribution, an ANOVA test could not be used to examine the equality of the sample means.
Instead, one would have to use a non-parametric test (or distribution-free test), which is a more
general form of hypothesis testing that does not rely on distributional assumptions.
