DWM Microproject Report GRP No.24

Data Warehousing and Mining Techniques (70-72)
Thakur Polytechnic
Department of Computer Engineering
TYCO-B
Semester-6
Academic year 2022-2023
GROUP-24(70-72)
SUBJECT: Data Warehousing and Mining Techniques (22621)
Sr. No.   Name            Roll No.
1         Prem Raval      70
2         Purva Rokade    71
3         Ronit Mehta     72
Guided by Mr. Dhrupesh Savdiya
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION
This is to certify that the following group of students, Roll No. 70-72 of the
6th semester of the Diploma in Computer Engineering of the institute Thakur
Polytechnic (Code: 0522), have completed the Micro Project
satisfactorily in the subject Data Warehousing and Mining Techniques
(22621) for the academic year 2022–2023 as prescribed in the
curriculum.
Place: Mumbai Date:
Roll No.   Enrollment No.   Name
70         2005220333       Prem Raval
71         2005220335       Purva Rokade
72         2005220336       Ronit Mehta
Subject Teacher          Head of Institution     Principal
Mr. Dhrupesh Savdiya     Ms. Vaishali Rane       Dr. S.M. Ganechari
Seal of
institution
ACKNOWLEDGEMENT
We feel immense pleasure in submitting this report on the “KNN (K-Nearest
Neighbours) Algorithm of Data Mining”.
While submitting this report, we take this opportunity to express our gratitude to
all those who helped us in completing this task.
Heading the list is our honourable Principal, Dr. S.M. Ganechari, who is the
source of our inspiration. We owe our deep gratitude to our HOD, Ms. Vaishali
Rane, and to our guide, Mr. Dhrupesh Savdiya, who have proved to be more than
just guides to us. The successful completion of this project was possible only
because of their guidance and co-operation, without which this work would never
have been completed.
Finally, we wish to express our deep sense of respect and gratitude to every
staff member who has helped us in many ways, to our parents, who have always
borne with us in every critical situation, and to all others who spared their
time and helped us complete this project in whatever way they could.
And lastly, we are grateful to each other, the members of our group.
THANK YOU.
PROPOSAL
Topic: KNN (K-Nearest Neighbours) Algorithm of Data Mining
1. Aim/benefits of the Micro-project:
• Gain confidence and boost your bottom line.
• To develop cognitive domain and affective domain of learning outcomes.
• Improving productivity and quality of work.
• Students will develop industry-oriented course outcomes.
2. Course outcomes Addressed (COs):

a) Establish scope and necessity of Data Mining for various applications.
b) Establish scope and necessity of Data warehouse for various applications.
c) Use concept of data mining components and techniques in designing data
mining system.
d) Use data mining tools for different applications.
e) Apply basic Statistical calculations on Data.
3. Proposed Methodology:
1. Getting the overview of the project and understanding the concept thoroughly.
2. Making of the proposal.
3. Collecting information about the “KNN (K-Nearest Neighbours) algorithm of
data mining”.
4. Making of the Report.
4. Action Plan:
Sr. No.   Details of activity      Planned start date   Planned finish date   Name of responsible team member
1         Information search       02/02/2023           01/03/2023            Ronit, Purva,
2         Group discussion         02/03/2023           16/03/2023            Ronit, Purva, Prem
3         Making of the proposal   17/03/2023           31/03/2023            Ronit, Prem
4         Compilation of report    01/04/2023           20/04/2023            Ronit, Prem
5         Presentation             21/04/2023           21/04/2023            Prem, Ronit, Purva
5. Resources Required:
Sr. No.   Name of the resources   Specifications             Quantity   Remarks
1         Laptop                  Intel i5, 8GB RAM, 512GB   1          Available
2         Microsoft Word          Office 365                 1          Available
3         Internet                Minimum 32 Mbps            1          Available
4         Software                Windows 10                 1          Available
TEAM MEMBERS ALONG WITH ROLL NUMBERS:

1 Prem Raval 70
2 Purva Rokade 71
3 Ronit Mehta 72
_______________
Mr. Dhrupesh Savdiya
(SUBJECT TEACHER)
REPORT
Topic: KNN (K-Nearest Neighbours) Algorithm of Data Mining
1. Rationale:
Data mining and data warehousing are essential components of decision support
systems in modern industry and business. These techniques enable
students to make better and faster decisions. The objective of this course is to
introduce students to various Data Mining and Data Warehousing concepts and
techniques. This course introduces the principles, algorithms, architecture, design and
implementation of data mining and data warehousing techniques. Learning this
course would improve the employment potential of students in the information
management sector.
2. Aim/Benefits of the Micro-project:
• Gain confidence and boost your bottom line.
• To develop cognitive domain and affective domain of learning outcomes.
• Improving productivity and quality of work.
• Students will develop industry-oriented course outcomes.
3. Course outcomes Addressed (COs):

A. Establish scope and necessity of Data Mining for various applications.
B. Use concept of data mining components and techniques in designing
data mining system.
C. Use data mining tools for different applications.
D. Apply basic Statistical calculations on Data.
4. Literature Review: -
Data Mining (Introduction):-
Data mining is the process of discovering hidden patterns and knowledge
from large datasets. It involves using statistical and computational techniques
to analyze and extract useful information from data, which can then be used
for making informed decisions, identifying trends, and predicting future
outcomes.
Data mining is used in a wide range of industries, including finance,
healthcare, retail, and telecommunications, among others. It can be used to
perform tasks such as customer segmentation, fraud detection, market basket
analysis, and churn prediction.
The data mining process typically involves several steps, including data
cleaning and preprocessing, data exploration, model building, and model
evaluation. The first step involves cleaning and transforming the data to
remove any noise or inconsistencies and make it suitable for analysis. The
next step involves exploring the data to identify any patterns or relationships
that may exist.
Model building involves selecting and applying appropriate algorithms to the
data to create a model that can predict or classify new data points. The final
step involves evaluating the performance of the model to ensure that it is
accurate and reliable.
Data mining requires expertise in several fields, including statistics, computer
science, and machine learning. It also requires access to large amounts of
high-quality data and powerful computing resources. However, the insights
gained from data mining can be invaluable for businesses and organizations
looking to make data-driven decisions and gain a competitive edge in their
respective industries.
Fig 1.1 Data Mining Procedure
➢ The list of data mining algorithms:-
The list below shows many of the algorithms related to data mining:-
1. Apriori algorithm
2. K-means clustering
3. Decision tree
4. Random forest
5. Support vector machine (SVM)
6. Naive Bayes classifier
7. Linear regression
8. Logistic regression
9. Artificial neural networks (ANN)
10. Association rule mining
11. Gradient boosting
12. Principal component analysis (PCA)
13. Singular value decomposition (SVD)
14. Collaborative filtering
15. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
16. Hidden Markov Models (HMM)
17. Natural Language Processing (NLP) algorithms such as stemming and
sentiment analysis
18. Bayesian networks
19. Reinforcement learning
20. Genetic algorithms
Fig 1.2 Map of DW algorithm
Table with a brief description of each algorithm related to data mining:-
Algorithm                                      Task                         Description
Apriori                                        Association rule mining      Finds frequent itemsets and association rules in a dataset
K-means clustering                             Clustering                   Divides a dataset into K clusters based on similarity
Decision tree                                  Classification, regression   Builds a tree of decisions based on data features to predict outcomes
Random forest                                  Classification, regression   Ensemble of decision trees that improves accuracy and reduces overfitting
Support vector machine (SVM)                   Classification, regression   Constructs a hyperplane that maximizes the margin between classes
Naive Bayes classifier                         Classification               Probability-based classifier that assumes features are independent
Linear regression                              Regression                   Models a linear relationship between a dependent variable and one or more independent variables
Logistic regression                            Classification               Models the probability of a binary outcome based on independent variables
Artificial neural networks (ANN)               Classification, regression   Mimics the functioning of a biological neural network to learn and predict
Association rule mining                        Association rule mining      Finds interesting relationships between variables in a dataset
Gradient boosting                              Classification, regression   Ensemble method that combines weak models to create a stronger model
Principal component analysis (PCA)             Dimensionality reduction     Reduces the dimensionality of a dataset while retaining the most important information
Singular value decomposition (SVD)             Dimensionality reduction     Factorizes a matrix into singular values and orthogonal vectors
Collaborative filtering                        Recommendation               Predicts user preferences based on similar users or items
DBSCAN                                         Clustering                   Identifies clusters based on density and distance
Hidden Markov Models (HMM)                     Sequential data analysis     Models the probability distribution of sequential data
Natural Language Processing (NLP) algorithms   Text analysis                Analyzes and processes natural language data
Bayesian networks                              Probabilistic modeling       Models the probability distribution of a set of random variables
Reinforcement learning                         Machine learning             Trains an agent to learn through interactions with an environment
Genetic algorithms                             Optimization                 Uses natural selection and genetic operators to find the optimal solution to a problem
Note: "DBSCAN" stands for Density-Based Spatial Clustering of Applications with Noise.
Cluster analysis (a related distance-based technique):-
Cluster analysis is a technique in data mining that involves identifying groups
of similar data points in a dataset and grouping them together into clusters. The
goal of cluster analysis is to find structure in a dataset by grouping similar data
points together and separating dissimilar ones. Unlike KNN, which is a supervised
method that needs labelled training data, cluster analysis is unsupervised; the
two are related because both rely on a distance or similarity measure.
There are different methods of cluster analysis, including hierarchical
clustering and k-means clustering. In hierarchical clustering, the data points are
successively merged together to form a tree-like structure, where the leaves
represent individual data points and the branches represent clusters of data
points. In k-means clustering, the data points are divided into a predetermined
number of clusters based on their similarity, where the number of clusters is
specified by the user.
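As a rough illustration of the k-means idea described above, the following Python sketch clusters a few 2-D points. The sample points, the function name `k_means`, the fixed number of iterations, and the choice of the first k points as starting centroids are all assumptions made for this example, not part of any particular tool.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters (illustrative sketch, not optimized)."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical data: three points near (1, 1) and three near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters k is supplied by the user, which matches the description of k-means given above.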
Cluster analysis is used in various fields such as biology, marketing, and social
sciences to identify patterns in data and gain insights into the relationships
between different data points. For example, in biology, cluster analysis can be
used to group genes with similar expression patterns, which can help in
identifying the function of unknown genes. In marketing, cluster analysis can
be used to group customers with similar preferences, which can help in
developing targeted marketing campaigns.
Fig 1.3 Cluster Analysis demonstration
Requirements for Clustering analysis :-
Clustering analysis is a technique in data mining that involves grouping
similar data points together to form clusters. The following are the key
requirements for clustering analysis:
1. Data: Clustering analysis requires a dataset that contains the data points to be
clustered. The dataset can be of any type, such as numeric, categorical, or
mixed.
2. Similarity measure: A similarity measure is required to compare the
similarity between data points. The similarity measure should be chosen
based on the type of data and the problem domain.
3. Distance metric: A distance metric is required to measure the distance
between data points. The distance metric should be chosen based on the type
of data and the problem domain.
4. Clustering algorithm: A clustering algorithm is required to group the data
points into clusters. There are several clustering algorithms available,
including k-means, hierarchical clustering, and density-based clustering.
5. Cluster evaluation: Cluster evaluation is required to assess the quality of the
clustering results. This can be done using various metrics, such as the
silhouette coefficient, Dunn index, and Davies-Bouldin index.
6. Visualization: Visualization is required to understand the clustering results
and to identify any patterns or trends in the data. This can be done using
various visualization techniques, such as scatter plots, heat maps, and
dendrograms.
Overall, clustering analysis requires a combination of data, similarity
measure, distance metric, clustering algorithm, cluster evaluation, and
visualization techniques to effectively group similar data points into clusters.
Applications of clustering:-
Clustering analysis has a wide range of applications in various fields, including:
1. Marketing: Clustering analysis can be used to group customers with similar
preferences or buying behavior, which can help in developing targeted
marketing campaigns and improving customer satisfaction.
2. Biology: Clustering analysis can be used to group genes with similar
expression patterns, which can help in identifying the function of unknown
genes and understanding the genetic basis of diseases.
3. Image analysis: Clustering analysis can be used to group similar images
together, which can help in image retrieval, classification, and segmentation.
4. Social network analysis: Clustering analysis can be used to group individuals
with similar characteristics or interests, which can help in identifying
communities or social groups.
5. Anomaly detection: Clustering analysis can be used to identify unusual
patterns or outliers in a dataset, which can help in detecting fraud, cyber
attacks, or other abnormal behavior.
6. Recommendation systems: Clustering analysis can be used to group similar
products or items together, which can help in developing recommendation
systems for e-commerce or content platforms.
Overall, clustering analysis has a wide range of applications in various domains
where grouping similar data points together can provide valuable insights and
help in making informed decisions.
Fig 1.4 Clustering technique example
KNN (K-Nearest Neighbours) Algorithm of Data Mining:-
K-Nearest Neighbors (KNN) is a popular algorithm used in data
mining and machine learning for classification and regression
problems. It is a simple and effective algorithm that works by finding
the K nearest data points to a given query point, and then using the
labels or values of these nearest neighbors to predict the label or
value of the query point.
The basic idea behind the KNN algorithm is that similar data points
tend to be clustered together in the feature space. Therefore, if we
want to predict the label or value of a new data point, we can look at
its K nearest neighbors and use their labels or values to predict the
label or value of the new data point. The distance between two data
points is typically measured using a distance metric, such as
Euclidean distance, Manhattan distance, or cosine distance.
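The three distance measures named above can be sketched in a few lines of Python. The function names and the sample points `p` and `q` are assumptions chosen for illustration; the cosine version assumes that neither vector is all zeros.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences along each feature.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the two vectors.
    # Assumption: neither vector is the zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

Euclidean and Manhattan distance depend on how far apart the values are, while cosine distance depends only on the direction of the two vectors, so the right choice depends on the data.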
In the case of classification, the KNN algorithm assigns the label that
appears most frequently among the K nearest neighbors to the query
point. For example, if K=3 and the nearest neighbors of a query point
are labeled as A, A, and B, the KNN algorithm would predict the
label of the query point as A. In the case of regression, the KNN
algorithm computes the average or weighted average of the values
of the K nearest neighbors and uses this as the predicted value of the
query point.
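The two prediction rules can be illustrated with a short Python sketch. It assumes the K nearest neighbors have already been found, and the function names `predict_label` and `predict_value` are hypothetical.

```python
from collections import Counter

def predict_label(neighbor_labels):
    # Classification: return the label that occurs most often.
    return Counter(neighbor_labels).most_common(1)[0][0]

def predict_value(neighbor_values):
    # Regression: return the plain average of the neighbors' values.
    return sum(neighbor_values) / len(neighbor_values)

print(predict_label(["A", "A", "B"]))      # A
print(predict_value([10.0, 12.0, 14.0]))   # 12.0
```

The first call reproduces the K=3 example from the text, where the neighbors A, A, and B give the prediction A.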
One of the main advantages of the KNN algorithm is its simplicity
and flexibility. It can work with numerical, categorical, and mixed
data, provided that a suitable distance measure is defined for the
features. It is also a non-parametric
algorithm, which means that it does not make any assumptions about
the underlying distribution of the data. This makes it particularly
useful in situations where the data is complex or the underlying
distribution is unknown.
However, the KNN algorithm also has some limitations and
challenges. One of the main challenges is choosing the value of K,
which can have a significant impact on the performance of the
algorithm. If K is too small, the algorithm may be sensitive to noise
or outliers in the data, while if K is too large, the neighborhood
starts to include points from other classes, which blurs the class
boundaries, lowers accuracy, and can add computational cost. Another
challenge is dealing with high-dimensional data, where the distance
between data points can become less meaningful.
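One common way to compare values of K is to test each candidate on held-out data. The Python sketch below uses leave-one-out testing on a small made-up 1-D dataset; the data, the helper names, and the candidate values 1, 3, and 7 are assumptions for illustration only.

```python
from collections import Counter

def knn_predict(train, x, k):
    # Sort the training pairs by distance to x and vote among the first k.
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    # Predict each point from all the others and count correct predictions.
    correct = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) == label:
            correct += 1
    return correct / len(data)

# Hypothetical 1-D data: class A near 1-2, class B near 5-6,
# plus one noisy "B" point (1.9) sitting inside the A region.
data = [(1.0, "A"), (1.3, "A"), (1.7, "A"), (1.9, "B"), (2.2, "A"),
        (5.0, "B"), (5.4, "B"), (5.9, "B"), (6.5, "B")]

for k in (1, 3, 7):
    print(k, round(leave_one_out_accuracy(data, k), 2))
```

On this toy data, K=1 is misled by the noisy point (6 of 9 correct), K=3 does best (8 of 9), and K=7 lets the larger class outvote the smaller one (4 of 9).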
To overcome these challenges, several extensions and variations of
the KNN algorithm have been proposed, including weighted KNN,
locally weighted KNN, and KNN with feature selection or
dimensionality reduction. These extensions aim to improve the
accuracy and efficiency of the KNN algorithm by taking into account
the local structure of the data and reducing the dimensionality of the
feature space.
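A minimal sketch of the weighted KNN idea is shown below in Python. It assumes the K nearest neighbors are already known as (distance, label) pairs and uses 1/distance as the weight, which is one common choice among several; the function name `weighted_vote` is hypothetical.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        # Assumption: weight = 1 / distance; a small epsilon avoids
        # division by zero when a neighbor coincides with the query.
        scores[label] += 1.0 / (distance + 1e-9)
    # The label with the largest total weight wins.
    return max(scores, key=scores.get)

# One very close "B" neighbor against two distant "A" neighbors.
print(weighted_vote([(0.1, "B"), (2.0, "A"), (2.5, "A")]))  # B
```

A plain majority vote would return A here, while the weighted vote returns B because the single B neighbor is much closer to the query point.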
In summary, the KNN algorithm is a simple and effective algorithm
for classification and regression problems in data mining and
machine learning. It works by finding the K nearest neighbors to a
query point and using their labels or values to predict the label or
value of the query point. While the KNN algorithm has some
limitations and challenges, it remains a popular and useful algorithm
in many applications.
Fig 1.5 KNN logic
Algorithmic Steps for KNN (K-Nearest Neighbours) algorithm:-
Here are the algorithmic steps for the K-Nearest Neighbors (KNN) algorithm:
Input:
Training set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a feature
vector and y_i is the corresponding class label
Test instance x to be classified
Number of neighbors k
Output:
Class label y for the test instance x
Steps:
1. Compute the distance between the test instance x and each training instance x_i
using a distance metric, such as Euclidean distance or Manhattan distance.
2. Select the k training instances with the shortest distances to the test instance x.
3. Assign the test instance x to the class that is most frequent among the k selected
training instances.
4. Return the class label y for the test instance x.
Note: If k=1, then the test instance x is assigned to the class of its nearest neighbor.
This is called the nearest neighbor classifier.
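The four steps above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name `knn_classify`, the use of Euclidean distance, and the small training set `T` are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(training_set, x, k):
    """training_set: list of (feature_vector, label) pairs."""
    # Step 1: distance from x to every training instance (Euclidean here).
    distances = [(math.dist(x_i, x), y_i) for x_i, y_i in training_set]
    # Step 2: keep the k instances with the shortest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: pick the most frequent class among them.
    votes = Counter(label for _, label in nearest)
    # Step 4: return that class label.
    return votes.most_common(1)[0][0]

# Hypothetical training data: two features per instance.
T = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
     ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_classify(T, (2.0, 2.0), k=3))  # A
print(knn_classify(T, (6.0, 6.5), k=3))  # B
```

Calling the function with k=1 gives the nearest neighbor classifier mentioned in the note.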
In summary, the KNN algorithm is a simple yet powerful classification algorithm
that works by finding the k nearest neighbors to a new instance and assigning it to
the class that is most common among those neighbors. It is a popular algorithm for
its simplicity, ease of implementation, and good performance on many
classification tasks. However, it can be sensitive to the choice of k and may not
work well on high-dimensional or imbalanced data.
Fig 1.6 Algorithmic steps for KNN logic
Method for KNN (K-Nearest Neighbours) algorithm:-
The K-Nearest Neighbors (KNN) algorithm is a non-parametric method used for
classification and regression. It works by measuring the distance between a new
data point and the points in the training dataset, and then using the nearest
ones to predict the class or value of the new data point.
The basic method of KNN can be summarized as follows:
1. Load the training data set into memory.
2. Normalize the data set to avoid any bias due to the difference in the scale
of the data features.
3. Load the new data point that needs to be classified.
4. Calculate the distance between the new data point and each data point in
the training set. This can be done using various distance measures, such as
Euclidean distance or Manhattan distance.
5. Select the k nearest neighbors to the new data point based on the calculated
distances.
6. Determine the class of the new data point based on the majority class of its
k nearest neighbors. This is done for classification tasks.
7. For regression tasks, determine the value of the new data point based on
the average of the values of its k nearest neighbors.
8. Return the predicted class or value for the new data point.
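The normalization step and the regression variant of the method can be sketched in Python as shown below. Min-max scaling is only one possible way to normalize, and the function names, the feature columns, and the target values are hypothetical.

```python
import math

def min_max_normalize(rows):
    # Rescale every feature column to the range [0, 1].
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        # A constant column (hi == lo) is mapped to 0.0.
        return tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )

    # Return the scaled rows and the scaler, so that a new data
    # point can be scaled in exactly the same way.
    return [scale(r) for r in rows], scale

def knn_regress(features, targets, query, k):
    # Average the target values of the k nearest training points.
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], query))
    return sum(targets[i] for i in order[:k]) / k

# Hypothetical data: two features on very different scales.
features = [(1, 500), (2, 800), (3, 1200), (4, 2000)]
targets = [10.0, 16.0, 24.0, 40.0]

scaled, scale = min_max_normalize(features)
print(knn_regress(scaled, targets, scale((2, 900)), k=2))  # 20.0
```

Without normalization, the second feature would dominate the distance simply because its values are larger, which is the bias that step 2 is meant to avoid.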
Some variations of the KNN method include weighting the vote of each neighbor
based on its proximity to the new data point, using different distance measures for
different features, and optimizing the value of k for the given dataset.
In summary, the KNN method is a simple and effective way to classify new data
points based on their proximity to the nearest neighbors in the training dataset. It is
widely used in various fields such as image recognition, text classification, and
bioinformatics.
Fig 1.7 Method of KNN / implementation of KNN
Advantages and Disadvantages of KNN:-
• Advantages:
1. Simple and easy to implement: KNN has no explicit training phase; it simply
stores the training data, and only k and the distance metric need to be chosen.
It is easy to understand and implement, making it suitable for beginners in
data mining.
2. Non-parametric: KNN is a non-parametric algorithm, meaning it does not
make any assumptions about the underlying distribution of the data. This
makes it suitable for a wide range of applications and data types.
3. High accuracy: KNN can achieve high accuracy on many classification tasks,
especially when the training data set is large and representative.
4. Versatile: KNN can be used for both classification and regression tasks,
making it a versatile algorithm.
• Disadvantages:
1. Computationally expensive: KNN requires calculating the distance between
each new data point and all training data points, which can be computationally
expensive and time-consuming for large data sets.
2. Sensitive to outliers: KNN is sensitive to outliers in the data set, which can
affect its performance.
3. Sensitive to the choice of k: The choice of the number of neighbors k can have
a significant impact on the performance of KNN. Choosing the optimal k value
can be challenging and may require trial and error.
4. Curse of dimensionality: KNN can suffer from the curse of dimensionality
when dealing with high-dimensional data, where the distance between data
points becomes less meaningful in higher dimensions.
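The effect described in point 4 can be observed with a small experiment. The Python sketch below compares the farthest and nearest distances from a random query to random points as the number of dimensions grows; the function name, the number of points, the seed, and the use of uniformly random data are assumptions for illustration.

```python
import math
import random

def distance_contrast(dimensions, n_points=200, seed=0):
    # Ratio of the farthest to the nearest distance from a random query
    # to n_points random points in the unit cube.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return max(distances) / min(distances)

for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 2))
```

As the number of dimensions increases, the ratio moves towards 1, meaning the nearest and farthest points are almost equally far away, so the idea of a "nearest" neighbor carries less information.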
6. Actual Methodology followed:
• Getting the overview of the project and understanding the concept
thoroughly.
• Making of the proposal.
• Collecting information about the KNN (K-Nearest Neighbours) algorithm of
data mining.
• Making of the Report.
7. Actual Resources Used:
Sr. No.   Name of the resources   Specifications             Quantity   Remarks
1         Laptop                  Intel i5, 8GB RAM, 512GB   1          Available
2         Microsoft Word          Office 365                 1          Available
3         Internet                Minimum 32 Mbps            1          Available
4         Software                Windows 10                 1          Available
8. Skills developed/Outcome of the Micro-Project:
We learned how the KNN algorithm works and how to implement it, the logic
behind it, the type of technique it relies on, its advantages and
disadvantages, and the method and algorithmic steps related to it.
9. Applications of the Micro-Project:
With the help of this micro-project we learned how to implement the KNN
method and how it relates to clustering techniques, so that it can be used
for efficient data mining.
TEAM MEMBERS ALONG WITH ROLL NUMBERS:

1 Prem Raval 70
2 Purva Rokade 71
3 Ronit Mehta 72
___________
Mr. Dhrupesh Savdiya
(SUBJECT TEACHER)