DWM Microproject Report GRP No.24


Data Warehousing and Mining Techniques (70-72)

Thakur Polytechnic

Department of Computer Engineering

TYCO-B
Semester-6
Academic year 2022-2023
GROUP-24(70-72)

SUBJECT: Data Warehousing and Mining Techniques (22621)

Sr. No. Name Roll. No.


1 Prem Raval 70
2 Purva Rokade 71
3 Ronit Mehta 72

Guided by Mr. Dhrupesh Savdiya


MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION

This is to certify that the following group of students, Roll No. 70-72 of the
6th semester of the Diploma in Computer Engineering at the institute Thakur
Polytechnic (Code: 0522), have completed the Micro Project
satisfactorily in the subject Data Warehousing and Mining Techniques
(22621) for the academic year 2022–2023 as prescribed in the
curriculum.
Place: Mumbai Date:
Roll. No Enrollment. No: Name:
70 2005220333 Prem Raval
71 2005220335 Purva Rokade
72 2005220336 Ronit Mehta

Subject Teacher Head of Institution Principal


Mr. Dhrupesh Savdiya Ms. Vaishali Rane Dr.S.M.Ganechari

Seal of
institution


ACKNOWLEDGEMENT

We feel immense pleasure in submitting this report on the “KNN (K-Nearest
Neighbours) Algorithm of Data Mining”.
While submitting this report, we take this opportunity to express our gratitude to
all those who helped us in completing this task.
Heading the list is our honourable Principal, Dr. S.M.Ganechari, who has been
our foremost source of inspiration. We owe our deep gratitude to our guides,
HOD Ms. Vaishali Rane and Mr. Dhrupesh Savdiya, who have
proved to be more than just guides to us. The successful completion of this
project was possible only because of their
guidance and co-operation, without which this work would never have been
completed.
Finally, we wish to express our deep sense of respect and gratitude to each and
every staff member who has helped us in many ways, to our parents, who
have always borne with us in every critical situation, and to all others who spared their
time and helped us complete this project in whatever way they could.
Lastly, we are grateful to each other, the members of our group.

THANK YOU.


PROPOSAL


Topic: KNN (K-Nearest Neighbours) Algorithm of Data Mining

1. Aim/benefits of the Micro-project:

• Gain confidence and boost your bottom line.
• To develop cognitive domain and affective domain of learning outcomes.
• Improving productivity and quality of work.
• Students will develop industry-oriented course outcomes.

2. Course outcomes Addressed (COs):


a) Establish scope and necessity of Data Mining for various applications.
b) Establish scope and necessity of Data warehouse for various applications.
c) Use concept of data mining components and techniques in designing data
mining system.
d) Use data mining tools for different applications.
e) Apply basic Statistical calculations on Data.

3. Proposed Methodology:
1. Getting the overview of the project and understanding the concept thoroughly.
2. Making of the proposal.
3. Collecting information about the “KNN (K-Nearest Neighbours) algorithm of data
mining”.
4. Making of the Report.


4. Action Plan:
Sr. No.  Details of activity      Planned start date  Planned finish date  Name of responsible team member
1        Information search       02/02/2023          01/03/2023           Ronit, Purva
2        Group discussion         02/03/2023          16/03/2023           Ronit, Purva, Prem
3        Making of the proposal   17/03/2023          31/03/2023           Ronit, Prem
4        Compilation of report    01/04/2023          20/04/2023           Ronit, Prem
5        Presentation             21/04/2023          21/04/2023           Prem, Ronit, Purva

5. Resources Required:
Sr. No.  Name of the resources  Specifications             Quantity  Remarks
1        Laptop                 Intel i5, 8GB RAM, 512GB   1         Available
2        Microsoft Word         Office 365                 1         Available
3        Internet               Minimum 32 Mbps            1         Available
4        Software               Windows 10                 1         Available


TEAM MEMBERS ALONG WITH ROLL NUMBERS:

Sr. No. Name Roll. No.


1 Prem Raval 70
2 Purva Rokade 71
3 Ronit Mehta 72

_______________
Mr. Dhrupesh savdiya
(SUBJECT TEACHER)


REPORT


Topic: KNN (K-Nearest Neighbours) Algorithm of Data Mining

1. Rationale:
Data mining and data warehousing are essential components of the decision support
systems used in modern industry and business. These techniques enable
their users to take better and faster decisions. The objective of this course is to
introduce students to various Data Mining and Data Warehousing concepts and
techniques. This course introduces the principles, algorithms, architecture, design and
implementation of data mining and data warehousing techniques. Learning this
course would improve the employment potential of students in the information
management sector.

2. Aim/Benefits of the Micro-project:

• Gain confidence and boost your bottom line.
• To develop cognitive domain and affective domain of learning outcomes.
• Improving productivity and quality of work.
• Students will develop industry-oriented course outcomes.

3. Course outcomes Addressed (COs):


A. Establish scope and necessity of Data Mining for various applications.
B. Use concept of data mining components and techniques in designing
data mining system.
C. Use data mining tools for different applications.
D. Apply basic Statistical calculations on Data.
4. Literature Review: -
Data Mining (Introduction):-
Data mining is the process of discovering hidden patterns and knowledge
from large datasets. It involves using statistical and computational techniques
to analyze and extract useful information from data, which can then be used
for making informed decisions, identifying trends, and predicting future
outcomes.

Data mining is used in a wide range of industries, including finance,
healthcare, retail, and telecommunications, among others. It can be used to
perform tasks such as customer segmentation, fraud detection, market basket
analysis, and churn prediction.


The data mining process typically involves several steps, including data
cleaning and preprocessing, data exploration, model building, and model
evaluation. The first step involves cleaning and transforming the data to
remove any noise or inconsistencies and make it suitable for analysis. The
next step involves exploring the data to identify any patterns or relationships
that may exist.

Model building involves selecting and applying appropriate algorithms to the
data to create a model that can predict or classify new data points. The final
step involves evaluating the performance of the model to ensure that it is
accurate and reliable.

Data mining requires expertise in several fields, including statistics, computer
science, and machine learning. It also requires access to large amounts of
high-quality data and powerful computing resources. However, the insights
gained from data mining can be invaluable for businesses and organizations
looking to make data-driven decisions and gain a competitive edge in their
respective industries.

Fig 1.1 Data Mining Procedure


➢ List of data mining algorithms:-

The list below shows many of the algorithms commonly associated with data
mining:-
1. Apriori algorithm
2. K-means clustering
3. Decision tree
4. Random forest
5. Support vector machine (SVM)
6. Naive Bayes classifier
7. Linear regression
8. Logistic regression
9. Artificial neural networks (ANN)
10. Association rule mining
11. Gradient boosting
12. Principal component analysis (PCA)
13. Singular value decomposition (SVD)
14. Collaborative filtering
15. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
16. Hidden Markov Models (HMM)
17. Natural Language Processing (NLP) algorithms such as
stemming and sentiment analysis
18. Bayesian networks
19. Reinforcement learning
20. Genetic algorithms

Fig 1.2 Map of DW algorithm


Table with a brief description of each algorithm related to data mining:-

Algorithm                                     Task                        Description
Apriori                                       Association rule mining     Finds frequent itemsets and association rules in a dataset
K-means clustering                            Clustering                  Divides a dataset into K clusters based on similarity
Decision tree                                 Classification, regression  Builds a tree of decisions based on data features to predict outcomes
Random forest                                 Classification, regression  Ensemble of decision trees that improves accuracy and reduces overfitting
Support vector machine (SVM)                  Classification, regression  Constructs a hyperplane that maximizes the margin between classes
Naive Bayes classifier                        Classification              Probability-based classifier that assumes features are independent
Linear regression                             Regression                  Models a linear relationship between a dependent variable and one or more independent variables
Logistic regression                           Classification              Models the probability of a binary outcome based on independent variables
Artificial neural networks (ANN)              Classification, regression  Mimics the functioning of a biological neural network to learn and predict
Association rule mining                       Association rule mining     Finds interesting relationships between variables in a dataset
Gradient boosting                             Classification, regression  Ensemble method that combines weak models to create a stronger model
Principal component analysis (PCA)            Dimensionality reduction    Reduces the dimensionality of a dataset while retaining the most important information
Singular value decomposition (SVD)            Dimensionality reduction    Factorizes a matrix into singular values and orthogonal vectors
Collaborative filtering                       Recommendation              Predicts user preferences based on similar users or items
DBSCAN                                        Clustering                  Identifies clusters based on density and distance
Hidden Markov Models (HMM)                    Sequential data analysis    Models the probability distribution of sequential data
Natural Language Processing (NLP) algorithms  Text analysis               Analyzes and processes natural language data
Bayesian networks                             Probabilistic modeling      Models the probability distribution of a set of random variables
Reinforcement learning                        Machine learning            Trains an agent to learn through interactions with an environment
Genetic algorithms                            Optimization                Uses natural selection and genetic operators to find the optimal solution to a problem

Note: "DBSCAN" stands for Density-Based Spatial Clustering of Applications with Noise.


Cluster analysis (a related distance-based technique):-

Cluster analysis is a technique in data mining that identifies groups
of similar data points in a dataset and collects them into clusters. The
goal of cluster analysis is to find structure in a dataset by grouping similar data
points together and separating dissimilar ones. Strictly speaking, KNN is a
supervised method that needs labelled training data, whereas clustering is
unsupervised; the two are related because both rely on a distance measure to
decide which data points are near each other.

There are different methods of cluster analysis, including hierarchical
clustering and k-means clustering. In hierarchical clustering, the data points are
successively merged together to form a tree-like structure, where the leaves
represent individual data points and the branches represent clusters of data
points. In k-means clustering, the data points are divided into a predetermined
number of clusters based on their similarity, where the number of clusters is
specified by the user.
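
The k-means idea described above can be illustrated with a minimal Python sketch. It works on one-dimensional data only, and the function name k_means_1d, the sample points and the starting centroids are all assumptions made for illustration; a real project would normally use a library implementation.

```python
def k_means_1d(points, centroids, iterations=10):
    """Very small k-means for 1-D data with user-supplied starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data with two obvious groups; the user specifies 2 clusters
# by supplying 2 starting centroids.
centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Note that no class labels are used anywhere in this sketch, which is the key difference from KNN.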

Cluster analysis is used in various fields such as biology, marketing, and social
sciences to identify patterns in data and gain insights into the relationships
between different data points. For example, in biology, cluster analysis can be
used to group genes with similar expression patterns, which can help in
identifying the function of unknown genes. In marketing, cluster analysis can
be used to group customers with similar preferences, which can help in
developing targeted marketing campaigns.

Fig 1.3 Cluster Analysis demonstration


Requirements for clustering analysis:-

Clustering analysis is a technique in data mining that involves grouping
similar data points together to form clusters. The following are the key
requirements for clustering analysis:

1. Data: Clustering analysis requires a dataset that contains the data points to be
clustered. The dataset can be of any type, such as numeric, categorical, or
mixed.

2. Similarity measure: A similarity measure is required to compare the
similarity between data points. The similarity measure should be chosen
based on the type of data and the problem domain.

3. Distance metric: A distance metric is required to measure the distance
between data points. The distance metric should be chosen based on the type
of data and the problem domain.

4. Clustering algorithm: A clustering algorithm is required to group the data
points into clusters. There are several clustering algorithms available,
including k-means, hierarchical clustering, and density-based clustering.

5. Cluster evaluation: Cluster evaluation is required to assess the quality of the
clustering results. This can be done using various metrics, such as the
silhouette coefficient, Dunn index, and Davies-Bouldin index.

6. Visualization: Visualization is required to understand the clustering results
and to identify any patterns or trends in the data. This can be done using
various visualization techniques, such as scatter plots, heat maps, and
dendrograms.

Overall, clustering analysis requires a combination of data, similarity
measure, distance metric, clustering algorithm, cluster evaluation, and
visualization techniques to effectively group similar data points into clusters.


Applications of clustering:-
Clustering analysis has a wide range of applications in various fields, including:

1. Marketing: Clustering analysis can be used to group customers with similar
preferences or buying behavior, which can help in developing targeted
marketing campaigns and improving customer satisfaction.

2. Biology: Clustering analysis can be used to group genes with similar
expression patterns, which can help in identifying the function of unknown
genes and understanding the genetic basis of diseases.

3. Image analysis: Clustering analysis can be used to group similar images
together, which can help in image retrieval, classification, and segmentation.

4. Social network analysis: Clustering analysis can be used to group individuals
with similar characteristics or interests, which can help in identifying
communities or social groups.

5. Anomaly detection: Clustering analysis can be used to identify unusual
patterns or outliers in a dataset, which can help in detecting fraud, cyber
attacks, or other abnormal behavior.

6. Recommendation systems: Clustering analysis can be used to group similar
products or items together, which can help in developing recommendation
systems for e-commerce or content platforms.

Overall, clustering analysis has a wide range of applications in various domains
where grouping similar data points together can provide valuable insights and
help in making informed decisions.

Fig1.4 Clustering technique example


KNN (K-Nearest Neighbours) Algorithm of Data Mining:-


K-Nearest Neighbors (KNN) is a popular algorithm used in data
mining and machine learning for classification and regression
problems. It is a simple and effective algorithm that works by finding
the K nearest data points to a given query point, and then using the
labels or values of these nearest neighbors to predict the label or
value of the query point.

The basic idea behind the KNN algorithm is that similar data points
tend to be clustered together in the feature space. Therefore, if we
want to predict the label or value of a new data point, we can look at
its K nearest neighbors and use their labels or values to predict the
label or value of the new data point. The distance between two data
points is typically measured using a distance metric, such as
Euclidean distance, Manhattan distance, or cosine distance.
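
To make the three distance metrics named above concrete, here is a minimal Python sketch. The function names and sample points are illustrative assumptions, and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors (non-zero vectors only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)   # hypothetical feature vectors
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0 (perpendicular vectors)
```

Euclidean and Manhattan distance depend on the size of the differences, while cosine distance depends only on direction, so the choice changes which points count as "nearest".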

In the case of classification, the KNN algorithm assigns the label that
appears most frequently among the K nearest neighbors to the query
point. For example, if K=3 and the nearest neighbors of a query point
are labeled as A, A, and B, the KNN algorithm would predict the
label of the query point as A. In the case of regression, the KNN
algorithm computes the average or weighted average of the values
of the K nearest neighbors and uses this as the predicted value of the
query point.
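
The two prediction rules described above (majority vote for classification, averaging for regression) can be sketched in a few lines of Python. The helper names majority_label and average_value are assumptions for illustration, and the inputs are taken to be the labels or values of the K neighbours that have already been found.

```python
from collections import Counter
from statistics import mean

def majority_label(neighbour_labels):
    """Classification: return the label that occurs most often."""
    return Counter(neighbour_labels).most_common(1)[0][0]

def average_value(neighbour_values):
    """Regression: return the plain (unweighted) average of the values."""
    return mean(neighbour_values)

# The K=3 example from the text: neighbours labelled A, A and B.
print(majority_label(["A", "A", "B"]))    # A
# Hypothetical numeric targets of three neighbours.
print(average_value([10.0, 12.0, 14.0]))  # 12.0
```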

One of the main advantages of the KNN algorithm is its simplicity
and flexibility. It can work with numerical, categorical, and mixed data,
provided that a distance measure suited to the data type is chosen. It is also
a non-parametric
algorithm, which means that it does not make any assumptions about
the underlying distribution of the data. This makes it particularly
useful in situations where the data is complex or the underlying
distribution is unknown.


However, the KNN algorithm also has some limitations and
challenges. One of the main challenges is choosing the value of K,
which can have a significant impact on the performance of the
algorithm. If K is too small, the algorithm may be sensitive to noise
or outliers in the data, while if K is too large, the neighbourhood can take in
points from other classes, making predictions less accurate and somewhat more
expensive to compute. Another
challenge is dealing with high-dimensional data, where the distance
between data points can become less meaningful.
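
One common way to compare values of K is leave-one-out testing: each point is held out in turn and predicted from the rest. The sketch below is only an illustration under assumed conditions; the one-dimensional data set, including the deliberately mislabelled point (15, "B"), and the function names are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (1-D data)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Hold out each point in turn, predict it from the rest, return the hit rate."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) == label:
            hits += 1
    return hits / len(data)

# Hypothetical data: class A around 10-18, class B around 50-58,
# plus one noisy point (15, "B") sitting inside the A region.
data = [(10, "A"), (12, "A"), (14, "A"), (16, "A"), (18, "A"),
        (15, "B"),
        (50, "B"), (52, "B"), (54, "B"), (56, "B"), (58, "B")]

for k in (1, 3, 9):
    print(k, round(leave_one_out_accuracy(data, k), 2))
# 1 0.73  (the noisy point misleads its immediate neighbours)
# 3 0.91  (the vote of three neighbours outweighs the noisy point)
# 9 0.45  (the neighbourhood is so wide that it swallows the other class)
```

On this toy data a moderate K gives the best result, which matches the trade-off described above; on real data the best K has to be found by this kind of experiment.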

To overcome these challenges, several extensions and variations of
the KNN algorithm have been proposed, including weighted KNN,
locally weighted KNN, and KNN with feature selection or
dimensionality reduction. These extensions aim to improve the
accuracy and efficiency of the KNN algorithm by taking into account
the local structure of the data and reducing the dimensionality of the
feature space.
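
As an illustration of the weighted KNN idea, the sketch below lets each neighbour vote with weight 1/distance, so closer neighbours count for more. This inverse-distance weighting is one common choice and is assumed here; the function names, the small eps constant (used to avoid division by zero) and the data are hypothetical.

```python
from collections import Counter, defaultdict

def plain_vote(train, query, k):
    """Ordinary KNN: every one of the k nearest neighbours gets one vote."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def weighted_vote(train, query, k, eps=1e-9):
    """Weighted KNN: each neighbour votes with weight 1 / (distance + eps)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = defaultdict(float)
    for x, label in nearest:
        votes[label] += 1.0 / (abs(x - query) + eps)
    return max(votes, key=votes.get)

# Hypothetical 1-D data: one B point very close to the query, two A points further away.
train = [(1.0, "B"), (4.0, "A"), (5.0, "A"), (9.0, "B")]
print(plain_vote(train, 0.0, k=3))     # A  (two votes against one)
print(weighted_vote(train, 0.0, k=3))  # B  (weight about 1.0 against 0.25 + 0.2)
```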

In summary, the KNN algorithm is a simple and effective algorithm
for classification and regression problems in data mining and
machine learning. It works by finding the K nearest neighbors to a
query point and using their labels or values to predict the label or
value of the query point. While the KNN algorithm has some
limitations and challenges, it remains a popular and useful algorithm
in many applications.

Fig1.5 KNN logic


Algorithmic Steps for KNN (K-Nearest Neighbours):-

Here are the algorithmic steps for the K-Nearest Neighbors (KNN) algorithm:
Input:
Training set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a feature
vector and y_i is the corresponding class label
Test instance x to be classified
Number of neighbors k
Output:
Class label y for the test instance x
Steps:
1. Compute the distance between the test instance x and each training instance x_i
using a distance metric, such as Euclidean distance or Manhattan distance.
2. Select the k training instances with the shortest distances to the test instance x.
3. Assign the test instance x to the class that is most frequent among the k selected
training instances.
4. Return the class label y for the test instance x.
Note: If k=1, then the test instance x is assigned to the class of its nearest neighbor.
This is called the nearest neighbor classifier.
In summary, the KNN algorithm is a simple yet powerful classification algorithm
that works by finding the k nearest neighbors to a new instance and assigning it to
the class that is most common among those neighbors. It is a popular algorithm for
its simplicity, ease of implementation, and good performance on many
classification tasks. However, it can be sensitive to the choice of k and may not
work well on high-dimensional or imbalanced data.
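
The four steps listed above translate almost line by line into Python. The following is a minimal sketch, not a tuned implementation: it assumes Euclidean distance (via math.dist, available in Python 3.8 and later), and the function name knn_classify and the training set are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(training_set, x, k):
    """Return the class label predicted for test instance x."""
    # Step 1: distance from x to every training instance x_i.
    distances = [(math.dist(x_i, x), y_i) for x_i, y_i in training_set]
    # Step 2: keep the k training instances with the shortest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count how often each class occurs among those k instances.
    votes = Counter(y_i for _, y_i in nearest)
    # Step 4: return the most frequent class label.
    return votes.most_common(1)[0][0]

# Hypothetical training set T of (feature vector, class label) pairs.
T = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
     ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_classify(T, (2, 2), k=3))  # A
print(knn_classify(T, (6, 5), k=3))  # B
```

With k=1 the same function behaves as the nearest neighbor classifier mentioned in the note.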

Fig1.6 Algorithmic steps for KNN logic



Method for KNN (K-Nearest Neighbours) algorithm:-

The K-Nearest Neighbors (KNN) algorithm is a non-parametric method used for
classification and regression. It works by measuring the distance between a new
data point and the points in the training dataset, and using the nearest of them
to predict the class or value of the new data point.

The basic method of KNN can be summarized as follows:

1. Load the training data set into memory.
2. Normalize the data set to avoid any bias due to the difference in the scale
of the data features.
3. Load the new data point that needs to be classified, and normalize it in the
same way as the training data.
4. Calculate the distance between the new data point and each data point in
the training set. This can be done using various distance measures, such as
Euclidean distance or Manhattan distance.
5. Select the k nearest neighbors to the new data point based on the calculated
distances.
6. Determine the class of the new data point based on the majority class of its
k nearest neighbors. This is done for classification tasks.
7. For regression tasks, determine the value of the new data point based on
the average of the values of its k nearest neighbors.
8. Return the predicted class or value for the new data point.

Some variations of the KNN method include weighting the vote of each neighbor
according to its proximity to the new data point, using different distance measures for
different features, and optimizing the value of k for the given dataset.
In summary, the KNN method is a simple and effective way to classify new data
points based on their proximity to the nearest neighbors in the training dataset. It is
widely used in various fields such as image recognition, text classification, and
bioinformatics.
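
The method above can be sketched for the regression case, including the normalization step. This is a minimal illustration under assumptions: min-max scaling is used as the normalization (other schemes exist), the query is scaled with the bounds learned from the training data, and all names and numbers are hypothetical.

```python
import math
from statistics import mean

def min_max_fit(rows):
    """Find the (min, max) of every feature column in the training data."""
    return [(min(col), max(col)) for col in zip(*rows)]

def min_max_apply(row, bounds):
    """Scale one row to the 0-1 range using bounds from the training data."""
    return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, (lo, hi) in zip(row, bounds))

def knn_regress(features, targets, query, k):
    """Predict a value as the average target of the k nearest scaled neighbours."""
    bounds = min_max_fit(features)                       # normalization parameters
    scaled = [min_max_apply(r, bounds) for r in features]
    q = min_max_apply(query, bounds)                     # same scaling for the new point
    ranked = sorted(zip(scaled, targets), key=lambda pair: math.dist(pair[0], q))
    return mean(t for _, t in ranked[:k])                # average of the k nearest

# Hypothetical data: two features on very different scales, one numeric target.
features = [(1, 1000), (2, 2000), (3, 3000), (4, 4000)]
targets = [10.0, 20.0, 30.0, 40.0]
print(knn_regress(features, targets, query=(2, 2100), k=2))  # 25.0
```

Without the scaling step the second feature, whose values are in the thousands, would dominate the distance calculation.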

Fig 1.7 method of KNN /implementation of KNN


Advantages and Disadvantages of KNN:-


• Advantages:
1. Simple and easy to implement: KNN is a simple algorithm with no explicit
training phase and only one main parameter, k, to choose. It is easy to
understand and implement, making it
suitable for beginners in data mining.
2. Non-parametric: KNN is a non-parametric algorithm, meaning it does not
make any assumptions about the underlying distribution of the data. This
makes it suitable for a wide range of applications and data types.
3. High accuracy: KNN can achieve high accuracy on many classification tasks,
especially when the training data set is large and representative.
4. Versatile: KNN can be used for both classification and regression tasks,
making it a versatile algorithm.

• Disadvantages:
1. Computationally expensive: KNN requires calculating the distance between
each new data point and all training data points, which can be computationally
expensive and time-consuming for large data sets.
2. Sensitive to outliers: KNN is sensitive to outliers in the data set, which can
affect its performance.
3. Sensitive to the choice of k: The choice of the number of neighbors k can have
a significant impact on the performance of KNN. Choosing the optimal k value
can be challenging and may require trial and error.
4. Curse of dimensionality: KNN can suffer from the curse of dimensionality
when dealing with high-dimensional data, where the distance between data
points becomes less meaningful in higher dimensions.
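
The curse of dimensionality can be observed with a small experiment: as the number of dimensions grows, the nearest and the farthest point end up at almost the same distance from a query. The sketch below assumes uniformly random points in the unit cube and uses a made-up measure, relative_contrast, defined as (farthest - nearest) / nearest; exact numbers depend on the random seed, so none are quoted.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows, so "nearest" means less and less.
for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 2))
```

This is why feature selection or dimensionality reduction is often applied before KNN on high-dimensional data.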


6. Actual Methodology followed:

• Getting the overview of the project and understanding the concept
thoroughly.
• Making of the proposal.
• Collecting information about the KNN (K-Nearest Neighbours) algorithm of
data mining.
• Making of the Report.

7. Actual Resources Used:

Sr. No.  Name of the resources  Specifications             Quantity  Remarks
1        Laptop                 Intel i5, 8GB RAM, 512GB   1         Available
2        Microsoft Word         Office 365                 1         Available
3        Internet               Minimum 32 Mbps            1         Available
4        Software               Windows 10                 1         Available

8. Skills developed/Outcome of the Micro-Project:

We learned the logic behind the KNN algorithm, the distance-based technique it
relies on, its algorithmic steps and method of implementation, and its
advantages and disadvantages.

9. Applications of the Micro-Project:

With the help of this micro-project we learned how to apply the KNN algorithm,
together with related distance-based techniques such as clustering, for
efficient data mining.


TEAM MEMBERS ALONG WITH ROLL NUMBERS:

Sr. No. Name Roll. No.


1 Prem Raval 70
2 Purva Rokade 71
3 Ronit Mehta 72

___________
Mr. Dhrupesh Savdiya
(SUBJECT TEACHER)
