Experiments 9 and 10


EXPERIMENT - 9

Implement grid-based clustering using the WEKA tool


Aim
To perform grid-based clustering on a dataset using WEKA.
Background Theory:
Grid-Based Method in Data Mining:

In grid-based methods, the instance space is divided into a grid structure. Clustering
techniques are then applied using the cells of the grid, rather than individual data points, as
the base units. The biggest advantage of this approach is improved processing time, since the
number of cells is usually much smaller than the number of points.
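
As a minimal illustration of this cell-based view (a plain-Python sketch, not part of the WEKA workflow; the cell size and sample points below are chosen only for the example), the snippet buckets 2-D points into fixed-size grid cells, so that the cells rather than the individual points become the units of further processing:

from collections import defaultdict

def assign_to_cells(points, cell_size):
    # Map each point to the index of the grid cell that contains it.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

# Each cell, not each point, is what a grid-based method works with afterwards.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 8.5)]
for cell, members in assign_to_cells(points, cell_size=2.0).items():
    print(cell, len(members), members)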

Statistical Information Grid (STING):

STING is a grid-based clustering technique. It uses a multidimensional grid data structure
that quantizes the space into a finite number of cells. Instead of focusing on the data points
themselves, it focuses on the value space surrounding the data points.

In STING, the spatial area is divided into rectangular cells organized at several levels of
resolution: each high-level cell is divided into several lower-level cells.

Statistical information about the attributes in each cell, such as the mean, maximum, and
minimum values, is precomputed and stored as statistical parameters. These statistical
parameters are useful for query processing and other data analysis tasks.

The statistical parameters of a higher-level cell can easily be computed from the parameters
of its lower-level cells.
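
As a small sketch of this bottom-up aggregation (plain Python; the dictionary layout of a cell's statistics is an assumption of the example, not STING's actual data structure), the snippet below combines the count, mean, minimum, and maximum of four child cells into the statistics of their parent cell:

def merge_cells(children):
    # Combine (count, mean, min, max) statistics of non-empty child cells into the parent cell.
    children = [c for c in children if c["count"] > 0]
    count = sum(c["count"] for c in children)
    return {
        "count": count,
        "mean": sum(c["count"] * c["mean"] for c in children) / count,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# Four low-level cells roll up into one higher-level cell.
children = [
    {"count": 3, "mean": 2.0, "min": 1.0, "max": 3.0},
    {"count": 2, "mean": 5.0, "min": 4.5, "max": 5.5},
    {"count": 0, "mean": 0.0, "min": float("inf"), "max": float("-inf")},
    {"count": 1, "mean": 9.0, "min": 9.0, "max": 9.0},
]
print(merge_cells(children))  # count 6, mean about 4.17, min 1.0, max 9.0
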
How STING Works:

Step 1: Determine a layer to begin with.

Step 2: For each cell of this layer, calculate the confidence interval or estimated range of
probability that the cell is relevant to the query.

Step 3: From the interval calculated above, label the cell as relevant or not relevant.

Step 4: If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.

Step 5: Go down the hierarchy by one level. Go to Step 2 for those cells that form the
relevant cells of the higher-level layer.

Step 6: If the specification of the query is met, go to Step 8; otherwise, go to Step 7.

Step 7: Retrieve the data that fall into the relevant cells and do further processing. Return
the result that meets the requirement of the query. Go to Step 9.

Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of
the query. Go to Step 9.

Step 9: Stop.
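
The outline below is an illustrative plain-Python sketch of this top-down drill-down, not the STING implementation: hierarchy (a list of layers, each a list of cells), children_of (which maps a cell to its sub-cells on the next layer), and is_relevant (which stands in for the confidence-interval test of Steps 2-3) are all hypothetical helpers, and the query-specific post-processing of Steps 6-9 is left out; the sketch simply returns the relevant cells of the bottom layer.

def drill_down(start_layer, query, hierarchy, is_relevant, children_of):
    # Step 1: start at a chosen layer of the cell hierarchy.
    layer = start_layer
    cells = hierarchy[layer]
    while True:
        # Steps 2-3: label each cell of the current layer as relevant or not.
        relevant = [c for c in cells if is_relevant(c, query)]
        # Step 4: at the bottom layer, hand the relevant cells back for further processing.
        if layer == len(hierarchy) - 1:
            return relevant
        # Step 5: go one level down, but only into children of relevant cells.
        layer += 1
        cells = [child for c in relevant for child in children_of(c)]

Because irrelevant cells are discarded at every level, only a small fraction of the grid is ever examined, which is where the speed-up of the method comes from.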

Advantages:

 The grid structure is query-independent, because the statistics stored in each cell
summarize the data in that cell and do not depend on any particular query.

 The grid structure facilitates parallel processing and incremental updates.

Disadvantage:

 The main disadvantage of STING is that all cluster boundaries are either horizontal
or vertical, so no diagonal boundaries are detected.

Procedure
1. Create or Obtain Dataset:
o Create a new ARFF file or obtain an existing dataset suitable for clustering.
Ensure the dataset contains numerical attributes.
2. Load the Dataset into WEKA:
o Open the created or obtained ARFF file in WEKA Explorer.
3. Apply Grid-Based Clustering Algorithm:
o Use WEKA to perform grid-based clustering and analyze the results.
Creating the Dataset
1. Create a New ARFF File:
o Open a text editor (e.g., Notepad).
o Define the attributes and data for the dataset as shown below. For example,
let's create a dataset of 2D points:
@relation grid_clustering_example
@attribute x numeric
@attribute y numeric
@data
1.0,2.0
1.5,1.8
5.0,8.0
8.0,8.0
1.0,0.6
9.0,11.0
8.0,2.0
10.0,2.0
9.0,3.0
2. Save the ARFF File:
o Save the file with a .arff extension, e.g., grid_clustering_example.arff.
3. Load the Dataset into WEKA:
o Open WEKA Explorer.
o Click on "Open file..." and load the grid_clustering_example.arff file.
4. Apply Grid-Based Clustering Algorithm:
Grid-Based Clustering Algorithm:
o Go to the "Cluster" tab.
o Click on the "Choose" button under "Clusterer".
o Select grid -> GridBasedClusterer.
o Configure the parameters:
 gridSize: The size of the grid cells. Larger values create larger grid
cells and fewer clusters.
 minPoints: The minimum number of points required to form a cluster
in a grid cell.
o Example Configuration:
 gridSize: 2.0
 minPoints: 2
o Click "Start" to apply the Grid-Based Clustering algorithm.
5. Analyze the Results:
o Review the clustering output to analyze the clusters formed based on the grid
size and minimum points.
Example Configuration
 Parameters:
o gridSize: 2.0 (determines the size of each grid cell)
o minPoints: 2 (minimum number of points required in a grid cell to form a
cluster)
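
Because the clusterers available in WEKA vary with the version and the installed packages, a plain-Python sketch of the same idea is given below as a stand-in (the names grid_cluster, grid_size, and min_points are the sketch's own; they mirror the gridSize and minPoints parameters above). Each point is hashed into a grid cell, cells with at least min_points points are kept as dense, and touching dense cells are merged into one cluster. Its cluster memberships need not match the sample output below exactly, since that depends on the particular clusterer used in WEKA.

from collections import defaultdict

def grid_cluster(points, grid_size=2.0, min_points=2):
    # Bucket each point into its grid cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // grid_size), int(y // grid_size))].append((x, y))
    # Keep only dense cells, i.e. cells with at least min_points points.
    dense = {c for c, pts in cells.items() if len(pts) >= min_points}
    # Merge neighbouring dense cells into clusters with a flood fill.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, members = [cell], []
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            members.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(members)
    return clusters

# The nine points from grid_clustering_example.arff.
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6),
          (9.0, 11.0), (8.0, 2.0), (10.0, 2.0), (9.0, 3.0)]
for i, cluster in enumerate(grid_cluster(points), start=1):
    print("Cluster", i, ":", cluster)
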
Sample Output
 Clusters Identified:
o Example clusters formed by Grid-Based Clustering:
 Cluster 1: {(1.0,2.0), (1.5,1.8), (1.0,0.6)}
 Cluster 2: {(5.0,8.0), (8.0,8.0), (9.0,11.0)}
 Cluster 3: {(8.0,2.0), (10.0,2.0), (9.0,3.0)}
 Cluster Summary:
o Number of Clusters: X
o Cluster Size and Composition
Result
Thus, the Grid-Based Clustering algorithm was successfully applied to the dataset, and
meaningful clusters were identified and analyzed based on grid size and minimum points.
EXPERIMENT - 10
Create multi-dimensional data as inputs, cluster the data according to the model
parameters, and determine outliers using a density-based outlier detection method

Aim
To create a multi-dimensional dataset, apply clustering using a density-based method, and
determine outliers using density-based outlier detection in WEKA.
Background Theory:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is a popular unsupervised learning method. It is a clustering technique used for separating
high-density clusters from low-density regions. It divides the data points into groups so that
points lying in the same group have similar properties. It was proposed by Martin Ester,
Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996.
DBSCAN is designed for use with databases that can accelerate region queries. It cannot
properly cluster data sets with large differences in their densities.
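
The core idea can be sketched in a few lines of plain Python (the point values, eps, and min_pts below are arbitrary example choices, not DBSCAN's implementation): a point is a core point when at least MinPts points, itself included, lie within distance epsilon of it; clusters grow out of core points, and points reachable from no core point are treated as noise.

import math

def is_core_point(p, points, eps, min_pts):
    # A point is a core point if its eps-neighbourhood (itself included) holds at least min_pts points.
    neighbours = [q for q in points if math.dist(p, q) <= eps]
    return len(neighbours) >= min_pts

points = [(1.0, 1.0), (1.2, 1.1), (1.1, 0.9), (8.0, 8.0)]
for p in points:
    print(p, "core" if is_core_point(p, points, eps=0.5, min_pts=3) else "not core")
# (8.0, 8.0) has no neighbour within eps, so it is not a core point and ends up as noise.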

Characteristics
 It can identify clusters of any shape in a data set, i.e. it can detect arbitrarily shaped
clusters.
 It is based on intuitive notions of clusters and noise.
 It is very robust in detecting outliers in a data set.
 It requires only two parameters and is largely insensitive to the order in which the
points occur in the data set.
Advantages
 The number of clusters in the data set does not need to be specified in advance.
 It can find a cluster of any shape, even if the cluster is surrounded by another cluster.
 It can easily find outliers in the data set.
 It is not very sensitive to noise, i.e. it is noise tolerant.
 It is the second most used clustering method after k-means.
Disadvantages
 The quality of the result depends on the distance measure used in the regionQuery
function.
 Border points may end up in either of two neighbouring clusters depending on the
processing order, so the algorithm is not completely deterministic.
 It can be expensive when the cost of computing nearest neighbours is high.
 It can be slow for high-dimensional data.
 It adapts poorly to variations in local density.
Procedure
1. Install and Open WEKA:
o Download and install WEKA from the official website: WEKA Downloads.
o Open WEKA Explorer.
2. Create Multi-Dimensional Dataset:
o Create a new ARFF file with multi-dimensional attributes.
3. Load the Dataset into WEKA:
o Open the ARFF file in WEKA Explorer.
4. Apply Density-Based Clustering (DBSCAN) and Outlier Detection:
o Use WEKA to perform density-based clustering and detect outliers.
Creating the Multi-Dimensional Dataset
1. Create a New ARFF File:
o Open a text editor (e.g., Notepad).
o Define the attributes and data for the multi-dimensional dataset. For example,
let's create a dataset with three features:
@relation multi_dimensional_clustering

@attribute x numeric
@attribute y numeric
@attribute z numeric

@data
1.0,2.0,3.0
1.5,1.8,2.5
2.0,2.1,3.1
10.0,10.0,10.0
10.5,10.5,10.5
11.0,11.0,11.0
100.0,100.0,100.0
100.5,100.5,100.5
105.0,105.0,105.0
2. Save the ARFF File:
o Save the file with a .arff extension, e.g., multi_dimensional_clustering.arff.
3. Load the Dataset into WEKA:
o Open WEKA Explorer.
o Click on "Open file..." and load the multi_dimensional_clustering.arff file.
4. Apply Density-Based Clustering and Outlier Detection:
Density-Based Clustering (DBSCAN):
o Go to the "Cluster" tab.
o Click on the "Choose" button under "Clusterer".
o Select dbscan -> DBSCAN.
o Configure the parameters:
 epsilon (ε): The maximum distance between two points to be
considered neighbors.
 minPoints (MinPts): The minimum number of points required to form
a dense region (core point).
o Example Configuration:
 epsilon (ε): 5.0
 minPoints (MinPts): 2
o Click "Start" to apply the DBSCAN algorithm.
Outlier Detection:
o After clustering, outliers are typically identified as points that do not belong to
any cluster or are in low-density regions.
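
If the DBSCAN package is not available in the local WEKA installation, the same clustering and outlier detection can be sketched outside WEKA; the snippet below is such a stand-in, using scikit-learn's DBSCAN (an assumption of the sketch, not part of the WEKA workflow) with the same epsilon of 5.0 and minPoints of 2 on the dataset above. Points labelled -1 are noise points, i.e. the outliers.

import numpy as np
from sklearn.cluster import DBSCAN

# The nine points from multi_dimensional_clustering.arff.
X = np.array([
    [1.0, 2.0, 3.0], [1.5, 1.8, 2.5], [2.0, 2.1, 3.1],
    [10.0, 10.0, 10.0], [10.5, 10.5, 10.5], [11.0, 11.0, 11.0],
    [100.0, 100.0, 100.0], [100.5, 100.5, 100.5], [105.0, 105.0, 105.0],
])

labels = DBSCAN(eps=5.0, min_samples=2).fit_predict(X)

for point, label in zip(X, labels):
    print(point, "-> outlier" if label == -1 else "-> cluster %d" % label)

Whether the three distant points come out as three separate outliers or as a small two-point cluster plus one outlier depends on how epsilon compares with their spacing, so it is worth checking the sketch's labels against the WEKA output.
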
Analyzing the Results
 Clusters Identified:
o Example clusters formed by DBSCAN:
 Cluster 1: {(1.0,2.0,3.0), (1.5,1.8,2.5), (2.0,2.1,3.1)}
 Cluster 2: {(10.0,10.0,10.0), (10.5,10.5,10.5), (11.0,11.0,11.0)}
 Outliers: {(100.0,100.0,100.0), (100.5,100.5,100.5),
(105.0,105.0,105.0)}
 Outliers Detected:
o Points that do not belong to any cluster or are far from dense regions.

Result
Thus, the density-based clustering algorithm was successfully applied to the multi-
dimensional dataset. Clusters were formed based on the density criteria, and outliers were
identified as points not belonging to any cluster or located in sparse regions.
