Improved Visual Final Version New
Dimensionality Reduction
Dharmapuri,Tamilnadu, India.
[email protected], [email protected], [email protected]
Abstract. Interactive visual clustering involves the user in the clustering
process through interactive visualization. Effective interaction in the visual
clustering process requires efficient feature selection methods that identify
the most dominant features. Hence, this paper proposes an improved visual
clustering system that uses an efficient feature selection method. The relevant
features for visual clustering are identified based on their contribution to the
entropy. Experimental results show that the proposed method works well in
finding the best cluster.
1 Introduction
Clustering of large databases is an important research area with a wide variety of
applications in the database context. Most research efforts lack means for guiding
the clustering process and understanding the results. Visualization technology may
help to solve this problem, since it supports different clustering paradigms
effectively and provides means for visual inspection of the results [4]. Since a
wide range of users in different environments use visualization models for
clustering, it is essential to ease human-computer interaction. One way to do so is
to provide a minimum number of features for clustering and analysis. A large variety
of visualization models have been proposed during the past decade, but very few deal
with exploring the dataset with a minimum number of features.
The goal of feature selection for clustering is to find the smallest feature subset that
best uncovers “interesting natural” groupings (clusters) in the data set. Feature selection
has been extensively studied in the past two decades. Even though feature selection
methods are applied in traditional automatic clustering, visualization models make
little use of them. This motivates the proposed framework.
2 Related Work
The issues related to feature selection for unsupervised learning are discussed in [2,
11, 13]. Jennifer G. Dy and Brodley [2] proposed a wrapper-based feature selection
method for unsupervised learning, which wraps the search around the
Expectation-Maximization clustering algorithm. Roy Varshavsky et al. [12] proposed a
novel unsupervised feature filtering of biological data based on maximization of
Singular Value Decomposition (SVD) entropy. The features are selected by (i) simple
ranking (SR) according to Contribution to the Entropy (CE) values, (ii) forward
selection by accumulating features according to which set produces the highest entropy
(FS1), (iii) forward selection by accumulating features through the choice of the best
CE out of the remaining ones (FS2), or (iv) backward elimination (BE) of the features
with the lowest CE. The proposed work selects features based on simple ranking.
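The text uses SVD entropy and CE without restating their definitions. As a reading aid, one common formulation, assumed here to correspond to the one in [12], is given below. The symbols are introduced only for this sketch: s_j are the singular values of the data matrix A, N is the number of singular values, and A_{\setminus i} denotes A with feature i left out.

```latex
V_j = \frac{s_j^2}{\sum_{l=1}^{N} s_l^2}, \qquad
E(A) = -\frac{1}{\log N} \sum_{j=1}^{N} V_j \log V_j, \qquad
CE_i = E(A) - E\left(A_{\setminus i}\right)
```

Under this reading, E lies between 0 (a single dominant singular value) and 1 (a uniform spectrum), and simple ranking orders the features by decreasing CE_i.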
Interactive clustering differs from traditional automatic clustering in that it
incorporates the user’s domain knowledge into the clustering process. A wide variety
of interactive clustering methods have been proposed in recent years [4, 6, 8], but
very few of them concentrate on feature selection. Keke Chen and Liu, L. [6, 7]
proposed the VISTA model, an intuitive way to visualize clusters. This model provides
a mapping similar to star coordinates [5], where a dense point cloud is
considered a real cluster or several overlapping clusters.
3 VISTA – Visual Cluster Rendering System
Chen and L. Liu [6], [7] proposed a dynamic visualization model, VISTA, which provides
an intuitive way to visualize clusters with interactive feedback to encourage domain
experts to participate in the cluster revision and cluster validation process.
The VISTA model adopts star coordinates [5]. A k-axis 2D star coordinate system is
defined by an origin o(xo, yo) and k coordinates S1, S2, S3, . . . , Sk, which represent
the k dimensions in 2D space. The k coordinates are equidistantly distributed on the
circumference of the circle C, where the unit vectors are
$$ S_i = \left( \cos\frac{2\pi i}{k},\ \sin\frac{2\pi i}{k} \right), \quad i = 1, 2, 3, \ldots, k \qquad (1) $$
The radius c of the circle C is the scaling factor for the entire visualization.
Changing c changes the effective size and the level of detail of the visualization.
Let a 2D point Q(x, y) represent the mapping of a k-dimensional max-min normalized
(with normalization bounds [-1, 1]) data point P(x1, x2, x3, . . . , xk) onto the 2D
star coordinates. Q(x, y) is determined by the average of the vector sum of the k
vectors αi xi Si (i = 1, 2, . . . , k), where αi are the k adjustable parameters. This
sum can be scaled by the radius c. The VISTA mapping is adjustable through αi. By
tuning αi continuously, we can see the influence of the ith dimension on the cluster
distribution through a series of smoothly changing visualizations, which usually
provides important clustering clues. The dimensions that are important for clustering
cause significant changes to the visualization as the corresponding α values are
continuously changed [7]. Even though the visual rendering completes within a few
minutes, sequential rendering becomes tedious when the number of dimensions is
large. In most cases, the continuous change of α leads to different patterns, which
may result in incorrect clusters.
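The mapping described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the VISTA implementation: the function names `normalize_minus1_1` and `vista_map`, the default origin, and the handling of constant columns are assumptions made for the example. It computes Q = (c/k) Σ αi xi Si + o for every row of a data matrix.

```python
import numpy as np

def normalize_minus1_1(data):
    """Max-min normalize each column (dimension) to the bounds [-1, 1]."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    # Assumption: a constant column has no spread, so it is mapped to -1.
    span = np.where(hi > lo, hi - lo, 1.0)
    return 2.0 * (data - lo) / span - 1.0

def vista_map(data, alpha, c=1.0, origin=(0.0, 0.0)):
    """Map n points with k dimensions to 2D star coordinates.

    Each point becomes the average of the k vectors alpha_i * x_i * S_i,
    scaled by the radius c and shifted by the origin o.
    """
    x = normalize_minus1_1(data)
    k = x.shape[1]
    angles = 2.0 * np.pi * np.arange(1, k + 1) / k
    # Unit vectors S_i of Eq. (1), one row per dimension.
    axes = np.column_stack((np.cos(angles), np.sin(angles)))
    alpha = np.asarray(alpha, dtype=float)
    return (c / k) * (x * alpha) @ axes + np.asarray(origin, dtype=float)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=(5, 4))
    print(vista_map(sample, alpha=[0.5, 0.5, 0.5, 0.5]))
```

Re-running `vista_map` while stepping a single entry of `alpha` reproduces the α-tuning idea: dimensions that matter for the cluster structure move the projected points noticeably.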
[Fig. 1. Block diagram with the blocks: Raw data, Feature Selection, Visual Cluster Rendering System, Clustered Data, Cluster Visualization for Evaluation, Interaction, The User]
The block diagram of the proposed framework is shown in Fig. 1. The basic idea of
the proposed method is to identify important dimensions according to their
contribution to the entropy (CE), computed on a leave-one-out basis. The features fall
into three groups. Features with high CE lead to an entropy increase; hence they are
assumed to be highly relevant to the proposed method. The features of the second
group are neutral: their presence or absence does not change the entropy of the
dataset, and hence they can be filtered out without much information loss. The third
group includes features that reduce the total Singular Value Decomposition (SVD)
entropy (usually CE < 0). Such features may be expected to contribute uniformly to
the different instances, and may just as well be filtered out from the analysis. The
relevant features are then applied to the VISTA model for the clustering process. The
steps of the proposed algorithm are as follows:
Step 4: Eliminate the dimensions with average and negative contribution, since they are irrelevant.
Step 6: Perform interactive visual clustering with α-tuning until satisfactory results are obtained.
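The CE-based filtering behind these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the SVD-entropy formulation attributed to [12] (normalized squared singular values, entropy divided by log N), a data matrix with instances as rows and features as columns, and a hypothetical cut-off of mean plus one standard deviation for "high" CE. All function names are invented for the example.

```python
import numpy as np

def svd_entropy(a):
    """Normalized entropy of the squared singular-value spectrum of a."""
    s = np.linalg.svd(np.asarray(a, dtype=float), compute_uv=False)
    total = np.sum(s ** 2)
    if s.size < 2 or total == 0:
        return 0.0
    v = s ** 2 / total
    v = v[v > 0]  # skip zero terms, since 0 * log(0) is taken as 0
    return float(-np.sum(v * np.log(v)) / np.log(s.size))

def contribution_to_entropy(data):
    """Leave-one-out CE: entropy of the full data minus entropy without feature j."""
    data = np.asarray(data, dtype=float)
    full = svd_entropy(data)
    return np.array([
        full - svd_entropy(np.delete(data, j, axis=1))
        for j in range(data.shape[1])
    ])

def rank_features(data):
    """Simple ranking: feature indices sorted by decreasing CE."""
    ce = contribution_to_entropy(data)
    return np.argsort(-ce), ce

def select_relevant(ce, margin=1.0):
    """Keep features whose CE exceeds mean + margin * std (hypothetical cut-off)."""
    ce = np.asarray(ce, dtype=float)
    return np.flatnonzero(ce > ce.mean() + margin * ce.std())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.normal(size=(30, 6))
    order, ce = rank_features(sample)
    print("ranking:", order)
    print("kept:", select_relevant(ce))
```

The columns returned by `select_relevant` are the ones that would be passed on to the visual rendering step; features with average or negative CE are dropped.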
Five well-known UCI machine learning datasets are used to show the effectiveness of
the proposed work (http://www.ics.uci.edu/~mlearn/). The quality of the clusters is
assessed by the Jaccard coefficient described in [1]. Jaccard coefficient validation
is based on the agreement between the clustering results and the “ground truth”. The
experiments are performed based on the domain knowledge obtained from automatic
clustering results. Domain knowledge, which is the semantic explanation of the data
groups, plays a critical role in the clustering process. It often indicates a
high-level cluster distribution, which may differ from the structural clustering results.
Initially the dataset is explored in VISTA. The initial alpha values are set to -0.5. For
experimental purposes the alpha variation is set to 0.01.
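For readers unfamiliar with the measure, the pair-counting form of the Jaccard coefficient can be sketched as below. It is assumed here that this is the variant meant in [1]; the function name and the convention of returning 1.0 when no pair is grouped together in either labeling are choices made for this example.

```python
from itertools import combinations

def jaccard_coefficient(predicted, truth):
    """Pair-counting Jaccard agreement between a clustering and the ground truth.

    a: pairs placed together in both labelings
    b: pairs together only in the predicted clustering
    c: pairs together only in the ground truth
    The coefficient is a / (a + b + c).
    """
    if len(predicted) != len(truth):
        raise ValueError("label sequences must have the same length")
    a = b = c = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
    denom = a + b + c
    # Assumed convention: two all-singleton labelings agree perfectly.
    return a / denom if denom else 1.0

if __name__ == "__main__":
    print(jaccard_coefficient([0, 0, 0, 1], [0, 0, 1, 1]))
```

The measure depends only on which points share a cluster, so renaming cluster labels does not change the score.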
The visualization of the Breast Cancer data with the entire set of dimensions is shown
in Fig. 3 a), and the visualization of the dataset with features selected based on CE is
shown in Fig. 3 b), with αi = 0.5. The visualization results show that the
distribution of points obtained by the proposed method is quite different from the
distribution obtained with the original data. The distribution of points with feature
selection shows the cluster distribution more effectively than the original
visualization. Since the number of selected features is very small, this eases the
visual distance computation and makes the human-computer interaction more
effective. Fig. 4 a) and b) show the visualization of the Hepatitis data set before
and after feature selection. The visualization results show that feature selection
makes the data visualization more effective than using the entire set of dimensions,
and the cluster distribution is also clearly identified. Similarly, Fig. 5 a) and b) and
Fig. 6 a) and b) show the visualization of the Australian data set and the Ionosphere
data set before and after feature selection, respectively.
The visual clustering is performed by the user with domain knowledge. Even
though the visual rendering is performed sequentially, the user may vary α, which
leads to a different point distribution. Another important issue is that every
rendering may result in a different point distribution. Hence, the visual clustering
on each data set is performed several times, and the best and average results are
considered for analysis. The results of visual clustering with the entire set of
dimensions and with selected features are presented in Table 1. The results show that
the proposed method yields better cluster quality than clustering with the entire set
of dimensions for the Breast Cancer data set. For the other two data sets it shows
similar results before and after feature selection.
Fig. 6. Visualization of Ionosphere Data Set
Most feature selection processes for clustering focus on the wrapper method. In
the proposed feature selection method, which is based on the filter model, individual
attributes are selected based on their CE value. The features with low contribution
are considered irrelevant, and they are eliminated. The interactive clustering is
performed only with the relevant features, which reduces the number of iterations in
the process of computing visual distance and eases the interactive process. The
experimental results show that the proposed feature selection method for the
interactive visual cluster rendering system improves the clustering results.
The identification of relevant features with different criteria needs further research.
References
1. Jiang, D., Tang, C., Zhang, A.: Cluster analysis for gene expression data: a survey.
IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11 (2004).
2. Dy, J.G., Brodley, C.E.: Feature Selection for Unsupervised Learning. J. Machine Learning
Research, Vol. 5, pp. 845--889 (2004).
3. Guha, S., Rastogi, R., Shim, K.: CURE: An efficient clustering algorithm for large
databases. Proc. of the ACM SIGMOD, pp. 73--84 (1998).
4. Hinneburg, A., Keim, D., Wawryniuk, M.: HD-Eye: Visual Mining of High-Dimensional
Data. IEEE Computer Graphics and Applications, Vol. 19, No. 5, pp. 22--31 (1999).
5. Kandogan, E.: Visualizing Multi-dimensional Clusters, Trends, and Outliers using Star
Coordinates. Proc. of ACM KDD, pp. 107--116 (2001).
6. Chen, K., Liu, L.: VISTA: Validating and Refining Clusters via Visualization.
Information Visualization, Vol. 4, pp. 257--270 (2004).
7. Chen, K., Liu, L.: iVIBRATE: Interactive Visualization-Based Framework for
Clustering Large Datasets. ACM Transactions on Information Systems, Vol. 24,
pp. 245--294 (April 2006).
8. desJardins, M., MacGlashan, J., Ferraioli, J.: Interactive visual clustering.
Intelligent User Interfaces, pp. 361--364 (2007).
9. Tory, M., Moller, T.: Human Factors in Visualization Research. IEEE
Transactions on Visualization and Computer Graphics, 10(1) (2004).
10. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised Feature Selection using Feature
Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24,
No. 3, pp. 301--312 (2002).
11. Jouve, P.-E., Nicoloyannis, N.: A filter feature selection method for
clustering. Springer, ISMIS, pp. 583--593 (2005).
12. Varshavsky, R., Gottlieb, A., Linial, M., Horn, D.: Novel Unsupervised Feature
Filtering of Biological Data. Text Mining and Information Extraction, Oxford University
Press, pp. 1--7 (2004).
13. Thangavel, A. Pethalakshmi.: Feature selection for Medical database using Rough System.
International Journal on Artificial Intelligence and Machine Learning, Vol. 6, Issue 1,
pp. 11--17 (2006).
14. Wall, M., Rechtsteiner, A., Rocha: Singular Value Decomposition and Principal Component
Analysis. In: Berrar, D., Dubitzky (eds.) A Practical Approach to Microarray Data Analysis,
Kluwer, pp. 91--109.