

Improved Visual Clustering through Unsupervised Dimensionality Reduction

K. Thangavel¹, P. Alagambigai², D. Devakumari³

¹Department of Computer Science, Periyar University, Salem - 11, Tamilnadu, India.
²Department of Computer Applications, Easwari Engineering College, Chennai - 89, Tamilnadu, India.
³Department of Computer Science, Government Arts College, Dharmapuri, Tamilnadu, India.
[email protected], [email protected], [email protected]

Abstract. Interactive visual clustering allows the user to be involved in the clustering process through interactive visualization. To perform effective interaction in the visual clustering process, efficient feature selection methods are required to identify the most dominating features. Hence, in this paper an improved visual clustering system is proposed using an efficient feature selection method. The relevant features for visual clustering are identified based on their contribution to the entropy. Experimental results show that the proposed method works well in finding the best clusters.

Keywords: Data Mining, Contribution to Entropy, Feature Selection, Visual Clustering.

1 Introduction

Clustering is a common technique for unsupervised learning, used for understanding and manipulating datasets [6]. It is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Cluster analysis is based on a mathematical formulation of a measure of similarity. Clustering algorithms generally deal with the following issues:
(i) the definition of similarity between data items,
(ii) the characteristics of clusters, including size, shape and statistical properties,
(iii) the computation cost and error rate of the result.
Many clustering algorithms have been proposed to address these issues [3]. Most earlier clustering research has focused on automatic clustering processes and statistical validity indices. Clustering algorithms such as K-Means almost completely exclude human interaction during the clustering process: human experts do not monitor the clustering process, and incorporating domain knowledge is highly complicated. In reality, no clustering result is complete until it is evaluated, validated and accepted by the user.
Visualization is known to be the most intuitive method for validating clusters, especially clusters of irregular shape, and it can improve the understanding of the clustering structure. Visual representations can be very powerful in revealing trends, highlighting outliers, showing clusters and exposing gaps [6, 7]. To incorporate visualization techniques, existing approaches typically use the result of a clustering algorithm as the input to a visualization system. A better solution is to combine the two processes, that is, to use the same model for both clustering and visualization. This leads to the necessity of interactive visual clustering.
Interactive visual clustering has demonstrated great advantages in data mining, since it allows the user to participate in the clustering process by providing domain knowledge and making better decisions based on visual perception. This makes the “clustering – analysis/evaluation” process efficient. VISTA, an interactive visual cluster rendering system, is known to be an effective model that involves the human in the clustering process. It allows the user to observe potential clusters interactively in a series of continuous visual adjustments driven by a parameter called α. Changing α on the dominating dimensions results in a good cluster distribution. As the dimensionality of the dataset increases, identifying the dominating dimensions for visual tuning and computing visual distances become tedious. One common approach to this problem is dimensionality reduction.
Dimensionality reduction, or attribute selection, aims at choosing a small subset of attributes that is sufficient to describe the data set. It is the process of identifying and removing as many irrelevant and redundant attributes as possible. Sophisticated attribute selection methods have been developed to tackle three problems: (i) reducing classifier cost and complexity, (ii) improving model accuracy (attribute selection), and (iii) improving the visualization and comprehensibility of induced concepts [10].
In the pre-processing and post-processing phases, feature selection/extraction (as well as standardization and normalization) and cluster validation are as important as the clustering algorithm itself. The benefits of feature selection are twofold: it considerably decreases the computation time of the induction algorithm and increases the accuracy of the resulting model [2]. Feature selection algorithms fall into two categories: (i) the filter approach and (ii) the wrapper approach [2, 11]. The filter approach pre-selects the dimensions and then applies the selected feature subset to the clustering algorithm. The wrapper approach incorporates the clustering algorithm into the feature search and selection, dividing the task into three components: (i) feature search, (ii) clustering algorithm and (iii) feature subset evaluation. A rough structural sketch of the filter approach is given below.
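As a rough structural illustration (not taken from this paper), the following Python sketch pre-ranks features with a placeholder variance score and then runs K-Means on the selected subset, which is the shape of a filter approach; a wrapper approach would instead invoke the clustering algorithm inside the feature-search loop. The scoring function, K-Means settings and synthetic data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_style_clustering(X, n_selected=5, n_clusters=2, random_state=0):
    """Filter approach sketch: score features independently of the
    clustering algorithm, keep the top-ranked ones, then cluster."""
    scores = X.var(axis=0)                         # placeholder relevance score
    selected = np.argsort(scores)[::-1][:n_selected]
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[:, selected])
    return selected, labels

# Example usage on synthetic data
X = np.random.RandomState(0).rand(100, 10)
features, labels = filter_style_clustering(X)
```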
In this paper, we propose a framework to improve visual clustering by applying "filter"-based dimensionality reduction. In this approach, the most relevant features for visual tuning are identified according to their contribution to the entropy (CE), which is calculated on a leave-one-out basis. VISTA is then applied for visual clustering on the reduced data.
The rest of the paper is organized as follows. Related work is explored in Section 2. An overview of the visual cluster rendering system is given in Section 3. The proposed work is discussed in Section 4. The experimental analysis is presented in Section 5. Section 6 concludes the paper with directions for future research.

2 Related Work

Clustering of large databases is an important research area with a large variety of applications in the database context. Missing from most research efforts are means for guiding the clustering process and understanding the results. Visualization technology may help to solve this problem, since it allows effective support of different clustering paradigms and provides means for visual inspection of the results [4]. Since a wide range of users in different environments utilize visualization models for clustering, it is essential to ease human-computer interaction. One way to do so is to provide a minimum number of features for clustering and analysis. A large variety of visualization models have been proposed during the past decade, but very few deal with exploring the dataset with a minimum number of features.
The goal of feature selection for clustering is to find the smallest feature subset that best uncovers the "interesting, natural" groupings (clusters) in the data set. Feature selection has been extensively studied in the past two decades. Even though feature selection methods are applied in traditional automatic clustering, visualization models do not utilize them much. This motivates the proposed framework.
The issues related to feature selection for unsupervised learning can be found in [2, 11, 13]. Dy and Brodley [2] proposed a wrapper-based feature selection for unsupervised learning, which wraps the search around the Expectation-Maximization clustering algorithm. Roy Varshavsky et al. [12] proposed a novel unsupervised feature filtering of biological data based on maximization of Singular Value Decomposition (SVD) entropy. The features are selected based on (i) simple ranking according to Contribution to the Entropy (CE) values (SR), (ii) forward selection by accumulating features according to which set produces the highest entropy (FS1), (iii) forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2), and (iv) backward elimination (BE) of the features with the lowest CE. The present work uses feature selection based on simple ranking.
Interactive clustering differs from traditional automatic clustering in that it incorporates the user's domain knowledge into the clustering process. A wide variety of interactive clustering methods have been proposed in recent years [4, 6, 8], while very few of them concentrate on feature selection. Keke Chen and Liu [6, 7] proposed the VISTA model, an intuitive way to visualize clusters. This model provides a mapping similar to star coordinates [5], where a dense point cloud is considered a real cluster or several overlapped clusters.
3 VISTA – Visual Cluster Rendering System

Chen and Liu [6], [7] proposed a dynamic visualization model, VISTA, which provides an intuitive way to visualize clusters with interactive feedback and encourages domain experts to participate in cluster revision and cluster validation.
The VISTA model adopts star coordinates [5]. A k-axis 2D star coordinate system is defined by an origin o(x_o, y_o) and k coordinates S_1, S_2, S_3, . . . , S_k, which represent the k dimensions in 2D space. The k coordinates are equidistantly distributed on the circumference of a circle C, with unit vectors

    S_i = (cos(2πi/k), sin(2πi/k)),  i = 1, 2, 3, . . . , k        (1)
The radius c of the circle C is a scaling factor for the entire visualization; changing c changes the effective size and level of detail of the visualization. Let a 2D point Q(x, y) represent the mapping of a k-dimensional max-min normalized (with normalization bounds [-1, 1]) data point P(x_1, x_2, x_3, . . . , x_k) onto the 2D star coordinates. Q(x, y) is determined by the average of the vector sum of the k vectors α_i x_i S_i (i = 1, 2, . . . , k), where the α_i are k adjustable parameters, and this sum is scaled by the radius c. The VISTA mapping is therefore adjustable through the α_i. By tuning α_i continuously, we can see the influence of the ith dimension on the cluster distribution through a series of smoothly changing visualizations, which usually provides important clustering clues. The dimensions that are important for clustering cause significant changes in the visualization as the corresponding α values are continuously changed [7]. Even though the visual rendering can be completed within a few minutes, the sequential rendering becomes tedious when the number of dimensions is large. In many cases, the continuous change of α leads to different patterns, possibly resulting in incorrect clusters. A minimal sketch of this mapping is given below.
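The sketch below is a minimal illustration of the mapping, assuming only the definitions above (Eq. (1), max-min normalized data, and the α-weighted vector sum averaged over k and scaled by c); it is not the actual VISTA implementation.

```python
import numpy as np

def star_coordinates(P, alpha, c=1.0):
    """Map max-min normalized k-dimensional points P (values in [-1, 1])
    to 2D star coordinates, weighted by the adjustable parameters alpha."""
    P = np.asarray(P, dtype=float)          # shape (n_points, k)
    alpha = np.asarray(alpha, dtype=float)  # shape (k,)
    k = P.shape[1]
    i = np.arange(1, k + 1)
    S = np.stack([np.cos(2 * np.pi * i / k),
                  np.sin(2 * np.pi * i / k)], axis=1)   # Eq. (1), shape (k, 2)
    # Q = (c / k) * sum_i alpha_i * x_i * S_i
    return (c / k) * (P * alpha) @ S

# Re-running the mapping while one alpha_i is varied and re-plotting Q
# reproduces the kind of smoothly changing visualization described above.
Q = star_coordinates(np.random.uniform(-1, 1, size=(200, 6)),
                     alpha=np.full(6, 0.5))
```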

4 Improved Visual Clustering Framework

Fig. 1. Improved Visual Clustering Framework (block diagram: raw data → feature selection → visual cluster rendering system → clustered data, with cluster visualization for evaluation and interaction by the user)

The block diagram of the proposed framework is shown in Fig. 1. The basic idea of the proposed method is to identify important dimensions according to their contribution to the entropy (CE), computed on a leave-one-out basis. Features with high CE lead to an entropy increase; hence they are assumed to be highly relevant in our proposed method. The features of the second group are neutral: their presence or absence does not change the entropy of the dataset, and hence they can be filtered out without much information loss. The third group includes features that reduce the total Singular Value Decomposition (SVD) entropy (usually CE < 0). Such features may be expected to contribute uniformly to the different instances and may just as well be filtered out from the analysis. The relevant features are then passed to the VISTA model for the clustering process. The step-by-step process of the proposed algorithm is given in Fig. 2 and sketched in code below.

Input: Data set of n instances with an underlying distribution
Output: K partitions of the data set
---------------------------------------------------------------------------------------------------------------
Step 1: Find the contribution to the entropy [12, 14] of the ith feature as

    CE_i = E(A_[n×m]) − E(A_[n×(m−1)])                                  (2)

where the entropy is

    E = −(1 / log(N)) Σ_{j=1..N} V_j log(V_j)                          (3)

and the normalized relative values are

    V_j = s_j² / Σ_k s_k²                                              (4)

where the s_j² are the eigenvalues of the n×n matrix AA^T.

Step 2: Sort the features based on their contribution to the entropy.

Step 3: Group the features as
    (i)   CE_i > C: features with high contribution,
    (ii)  CE_i ≈ C: features with average contribution,
    (iii) CE_i < C: features with low (usually negative) contribution,
    where C = Average(CE).

Step 4: Eliminate the dimensions with average and negative contribution, since they are irrelevant.

Step 5: Explore only the selected dimensions in VISTA.

Step 6: Perform interactive visual clustering with α-tuning until satisfactory results are obtained.

Fig. 2. Improved Visual Clustering method
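The following is a minimal sketch of Steps 1-4, assuming the data matrix A holds instances in rows and features in columns: the SVD-based entropy of Eqs. (3)-(4) is computed for the full matrix and for each leave-one-out matrix, the difference giving CE_i of Eq. (2), after which features at or below the average CE are dropped. Function names and the single average threshold are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def svd_entropy(A):
    """SVD-based entropy of a data matrix A (Eqs. 3-4): normalized
    squared singular values treated as a probability distribution."""
    s = np.linalg.svd(A, compute_uv=False)
    v = s**2 / np.sum(s**2)                  # Eq. (4)
    v = v[v > 0]                             # avoid log(0)
    return -np.sum(v * np.log(v)) / np.log(len(s))   # Eq. (3)

def contribution_to_entropy(A):
    """CE_i = E(full matrix) - E(matrix with feature i removed)  (Eq. 2)."""
    full = svd_entropy(A)
    return np.array([full - svd_entropy(np.delete(A, i, axis=1))
                     for i in range(A.shape[1])])

def select_features(A):
    """Steps 2-4: rank features by CE and keep only those above the average CE."""
    ce = contribution_to_entropy(A)
    threshold = ce.mean()                    # C = Average(CE)
    return np.where(ce > threshold)[0], ce

# Example: keep the high-CE columns and pass only those to the visual rendering
A = np.random.RandomState(1).rand(50, 8)
selected, ce_values = select_features(A)
```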


5 Experimental Results and Discussion

Five well-known UCI machine learning datasets are used to show the effectiveness of the proposed work (http://www.ics.uci.edu/~mlearn/). The quality of clusters is assessed by the Jaccard coefficient proposed in [1]. Jaccard coefficient validation is based on the agreement between the clustering results and the "ground truth". The experiments are performed using the domain knowledge obtained from automatic clustering results. Domain knowledge plays a critical role in the clustering process; it is the semantic explanation of the data groups and often indicates a high-level cluster distribution, which may differ from the structural clustering results. Initially the dataset is explored in VISTA. The initial α values are set to -0.5, and for the experiments the α variation step is set to 0.01. A sketch of a pair-counting Jaccard computation is given below.
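This is a minimal sketch assuming the standard pair-counting definition of the Jaccard coefficient (pairs grouped together in both the clustering result and the ground truth, divided by pairs grouped together in at least one of them); the exact variant used in [1] is assumed rather than reproduced.

```python
import numpy as np
from itertools import combinations

def jaccard_coefficient(labels_true, labels_pred):
    """Pair-counting Jaccard coefficient: SS / (SS + SD + DS), where SS counts
    pairs placed together in both the ground truth and the clustering result."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
    return ss / (ss + sd + ds) if (ss + sd + ds) else 0.0

# Example: agreement between a clustering result and the ground truth
truth = np.array([0, 0, 1, 1, 1])
result = np.array([1, 1, 0, 0, 1])
print(jaccard_coefficient(truth, result))
```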

Fig. 3. Visualization of Breast Cancer Data Set: (a) before feature selection, (b) after feature selection

Fig. 4. Visualization of Hepatitis Data Set: (a) before feature selection, (b) after feature selection

The visualization of the Breast Cancer data with the entire set of dimensions is shown in Fig. 3(a), and the visualization of the dataset with the features selected based on CE is shown in Fig. 3(b), with α_i = 0.5. From the visualization results, it is observed that the distribution of points obtained by the proposed method is quite different from the distribution obtained with the original sample. The distribution of points after feature selection shows the cluster structure more effectively than the original visualization. Since the number of selected features is much smaller, this eases the visual distance computation and makes the human-computer interaction more effective. Fig. 4(a) and (b) show the visualization of the Hepatitis data set before and after feature selection. From the visualization results, it is observed that feature selection makes the data visualization more effective than using the entire set of dimensions, and the cluster distribution is also clearly identified. Similarly, Fig. 5(a) and (b) and Fig. 6(a) and (b) show the visualization of the Australian and Ionosphere datasets before and after feature selection, respectively.
The visual clustering is performed by a user with domain knowledge. Even though the visual rendering is performed sequentially, the user may vary α, which leads to different point distributions; moreover, every rendering may result in a different point distribution. Hence, the visual clustering on each data set is performed several times, and the best and average results are considered for analysis. The results of visual clustering with the entire set of dimensions and with the selected features are reported in Table 1. From the results, it is observed that the proposed method yields better cluster quality than clustering with the entire set of dimensions for the Breast Cancer data set. For the other data sets, the results before and after feature selection are similar.

Fig. 5. Visualization of Australian Data Set: (a) before feature selection, (b) after feature selection

Fig. 6. Visualization of Ionosphere Data Set: (a) before feature selection, (b) after feature selection

Table 1. Comparison of number of features and cluster quality

                 No. of    No. of     No. of Features             VISTA with entire          VISTA with selected
Dataset          Records   Clusters   Before Sel.   After Sel.    dimensions (Best / Avg)    features (Best / Avg)
Hepatitis        155       2          19            4             57.24 / 54.30              65.97 / 63.24
Australian       690       2          14            13            60.50 / 60.50              60.50 / 60.50
Breast Cancer    569       2          32            5             53.21 / 53.21              53.23 / 53.18
Dermatology      366       6          34            33            25.65 / 23.65              31.28 / 23.65
Ionosphere       351       2          34            16            45.10 / 43.39              46.15 / 44.06
6 Conclusion

Most feature selection methods for clustering are based on the wrapper approach. In the proposed feature selection method, which follows the filter model, individual attributes are selected according to their CE values. Features with low contribution are considered irrelevant and are eliminated. The interactive clustering is performed only with the relevant features, which reduces the number of iterations in the visual distance computation and eases the interactive process. The experimental results show that the proposed feature selection method improves the performance of the interactive visual cluster rendering system. The identification of relevant features using different criteria needs further research.

Acknowledgments. The first author acknowledges the financial assistance received from the UGC, New Delhi, under grant No. F-No. 34-105/2008 (SR).

References

1. Daxin Jiang, Chun Tang, Aidong Zhang: Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11 (2004).
2. Dy, J.G., Brodley, C.E.: Feature selection for unsupervised learning. Journal of Machine Learning Research, Vol. 5, pp. 845--889 (2004).
3. Guha, S., Rastogi, R., Shim, K.: CURE: An efficient clustering algorithm for large databases. Proc. of ACM SIGMOD, pp. 73--84 (1998).
4. Hinneburg, A., Keim, D., Wawryniuk, M.: HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics and Applications, Vol. 19, No. 5, pp. 22--31 (1999).
5. Kandogan, E.: Visualizing multi-dimensional clusters, trends and outliers using star coordinates. Proc. of ACM KDD, pp. 107--116 (2001).
6. Keke Chen, Liu, L.: VISTA: Validating and refining clusters via visualization. Information Visualization, Vol. 4, pp. 257--270 (2004).
7. Keke Chen, Liu, L.: iVIBRATE: Interactive visualization-based framework for clustering large datasets. ACM Transactions on Information Systems, Vol. 24, pp. 245--294 (April 2006).
8. Marie desJardins, James MacGlashan, Julia Ferraioli: Interactive visual clustering. Intelligent User Interfaces, pp. 361--364 (2007).
9. Melanie Tory, Torsten Moller: Human factors in visualization research. IEEE Transactions on Visualization and Computer Graphics, 10(1) (2004).
10. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3, pp. 301--312 (2002).
11. Pierre-Emmanuel Jouve, Nicolas Nicoloyannis: A filter feature selection method for clustering. Springer, ISMIS, pp. 583--593 (2005).
12. Roy Varshavsky, Assaf Gottlieb, Michal Linial, David Horn: Novel unsupervised feature filtering of biological data. Text Mining and Information Extraction, Oxford University Press, pp. 1--7 (2004).
13. Thangavel, K., Pethalakshmi, A.: Feature selection for medical database using rough system. International Journal on Artificial Intelligence and Machine Learning, Vol. 6, Issue 1, pp. 11--17 (2006).
14. Wall, M., Rechtsteiner, A., Rocha, L.: Singular value decomposition and principal component analysis. In: Berrar, D., Dubitzky, W. (eds.), A Practical Approach to Microarray Data Analysis, Kluwer, pp. 91--109.
