Research Article

Hindawi Publishing Corporation
Mathematical Problems in Engineering

Volume 2013, Article ID 323750, 8 pages
http://dx.doi.org/10.1155/2013/323750
Research Article
A Multidimensional and Multimembership Clustering Method
for Social Networks and Its Application in Customer
Relationship Management
Peixin Zhao,1 Cun-Quan Zhang,2 Di Wan,3 and Xin Zhang4

1
School of Management, Shandong University, Jinan, Shandong 250100, China
2
Department of Mathematics, West Virginia University, Morgantown, WV 26506, USA
3
Department of Physics and Astronomy, University of Victoria, Victoria, BC, Canada V8W 2Y2
4
Foundation Department, Shandong College of Electronic Technology, Jinan, Shandong 250200, China
Correspondence should be addressed to Peixin Zhao; pxzhao@126.com
Received 15 July 2013; Accepted 7 August 2013
Academic Editor: Yoshinori Hayafuji
Copyright © 2013 Peixin Zhao et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Community detection in social networks plays an important role in cluster analysis. Many traditional techniques for one-
dimensional problems have been proven inadequate for high-dimensional or mixed type datasets due to the data sparseness and
attribute redundancy. In this paper we propose a graph-based clustering method for multidimensional datasets. This novel method
has two distinguished features: nonbinary hierarchical tree and the multi-membership clusters. The nonbinary hierarchical tree
clearly highlights meaningful clusters, while the multimembership feature may provide more useful service strategies. Experimental
results on the customer relationship management confirm the effectiveness of the new method.
1. Introduction Clustering analysis is a data mining technique developed

for the purpose of identifying groups of entities that are
A social network is a set of people or groups each of similar to each other with respect to certain similarity
which has connections of some kind to some or all of the measures. Many different clustering methods have been
others. Although the general concept of social networks proposed and used in a variety of fields. Jain [7] broadly
seems simple, the underlying structure of a network implies divided these methods into two groups: hierarchical clus-
a set of characteristics which are typical to all complex tering and partitioned clustering. Hierarchical clustering
systems. Social network plays an extremely important role is the grouping of objects of interest according to their
in many systems and processes and has been intensively similarity into a hierarchy, with different levels reflecting the
studied over the past few years in order to understand degree of inter-object resemblance. The most well-known
both local phenomena, such as clique formation and their hierarchical methods are singlelink and completelink. In
dynamics, and network-wide processes, for example, flow of singlelink hierarchical methods, the two clusters whose two
data in computer networks [1], energy flow in food webs closest members have the smallest distance are merged in
[2], customer relation management [3–6], and so forth. each step; in completelink cases, the two clusters whose
Modern information and communication technology has merger has the smallest diameter are merged in each step.
offered new interaction modes between individuals, like Compared to hierarchical clustering methods, partitioned
mobile phone communications and online interactions. Such clustering methods find all the clusters simultaneously as
new social exchanges can be accurately monitored for very a partition of the data lK-means, which is widely used for
large systems, including millions of individuals, representing the ease of implementation, simplicity, and efficiency where
a huge opportunity for the study of social science. a certain data point cannot be simultaneously included in
2 Mathematical Problems in Engineering
more than one cluster [8]. Based on the difference of their As an application in customer relationship management, this
capabilities, applicability, and computational requirements, approach is used to analyze mobile customer segmentation
clustering methods can be categorized into several different problem in Section 4. Finally, a summary and conclusions are
approaches: partitioning, hierarchical, density-based, grid- given in Section 5.
based, and model-based. No particular clustering method has
been shown to be superior to all its competitors in all aspects
[9]. 2. Related Works
In recent years, community detection based on clustering The detection for communities has brought about significant
has become a growing research field partly as a result of advances to the understanding of many real-world complex
the increasing availability of a huge number of networks in networks. Plenty of detection algorithms and techniques
the real world. The most intuitive and common definition have been proposed drawing on methods and principles
of community structure is that such network seems to from many different areas, including physics, artificial intel-
have communities in them: subsets of vertices within which ligence, graph theory, and even electrical circuits [11]. The
vertex-vertex connections are dense, but between which spectral bisection methods [18] and the Kernighan-Lin [19]
connections are relatively sparse. Yang and Luo [10] show algorithm are early solutions to this problem in computer
that community structure has close relationship with some society. The spectral approach bisects graph iteratively, which
functionality such as robustness and fast diffusion. It is an is unsuitable to general networks. For the Kernighan-Lin
important network property and is able to reveal many algorithm, it requires a priori knowledge about the sizes
hidden features of the given network [11]. The detection and of the initial divisions. In 2002, Girvan and Newman [20]
analysis of communities in social networks have played an proposed a divisive hierarchical clustering algorithm referred
important role in the mining of different kinds of networks, to as GN, which can generate optimizion of the division of
including the World Wide Web [12, 13], communication a network by iteratively cutting the edge with the greatest
networks [14], and biological networks [15]. betweenness value. However, a disadvantage of GN is that
Most traditional community detection algorithms based its time complexity is 𝑂(𝑚2 𝑛) on a network of 𝑛 nodes and
on clustering are limited to handling one-dimensional
𝑚 edges or 𝑂(𝑚3 ) on a sparse network; then Newman [21]
datasets [16, 17]. However, the datasets to be mined in real
proposed a faster algorithm, referred to as NM, with time
life often contain millions of objects described by many
various types of attributes or variables. For example, in complexity 𝑂(𝑛2 ) or 𝑂((𝑚+𝑛)𝑛) on a sparse network. A lot of
customer relation management, a customer can be depicted works have been done to improve GN and NM; for example,
by multidimensional data or mixed type data such as gender, Radicchi et al. [22] proposed a similar algorithm with GN by
age, income, education level, and so forth. In such cases, data using the edge-clustering coefficient as a new metric with a
mining operations and methods are required to be scalable smaller time complexity 𝑂(𝑚2 ); Clauset et al. [23] have also
as well as capable of dealing datasets’ complex structures proposed a fast clustering algorithm with 𝑂(𝑛 log2 𝑛) time
and dimensions. Previous researches were mainly focused on complexity on sparse graph. Especially in 2007, Ou and Zhang
the representation of a set of items with a single attribute, [24] proposed a new clustering method with the feature of
which is apparently unsuitable for the scenarios described hierarchical tree and overlapping clusters, the complexity of
above: (i) a single attribute can not accurately represent this method is 𝑂(ℎ𝑛2 log 𝑛) where ℎ denotes the height of the
all the dimensions of items; (ii) clustering according to a hierarchical structure. This method was, respectively, used
single attribute often fails to capture the inherent dependency to cluster extremist web pages [25] and some classic social
among multiple attributes and leads to meaningless cluster. networks [26] with single weighted edges.
Under such considerations, in this paper we firstly intro- Random walk has also been successfully used in finding
duce two pretreatment methods for multi-dimensional and network communities [27, 28]. The idea of this method
mixed type data, followed by a new clustering approach for is that the walk tends to be trapped in dense parts of a
community detection in social networks. In this approach, network corresponding to communities. Pons and Latapy
individuals and their relationships are denoted by weighted [27] proposed a measure of similarity between vertices based
graphs, and the graph density we defined gives a better on random walks which has several important advantages: it
quantity depict of the overall correlation among individuals captures well the community structure in a network, it can be
in a community, so that a reasonable clustering output can computed efficiently, and it can be used in an agglomerative
be presented. In particular, our method produces “trees” of algorithm to compute efficiently the community structure of a
simple hierarchy and allows for fuzzy (overlapping) clusters, network. The algorithm called Walktrap runs in time 𝑂(𝑚𝑛2 )
which distinguishes it from other methods. In order to verify and space 𝑂(𝑛2 ) in the worst case and in time 𝑂(𝑛2 log 𝑛) and
the utility/effectiveness of our method, we did a (preliminary) space 𝑂(𝑛2 ) in most real-world cases; Hu et al. [29] proposed
evaluation against a mobile customer segmentation use case. a method for the identification of community structure based
The numerical output of which shows supporting evidence on a signaling process of complex networks. Each node is
for further (improvement) application. taken as the initial signal source to excite the whole network
The rest of the paper is organized as follows. In Section 2 one time, and the source node is associated with an 𝑛-
we summarize the related works of community detections dimensional vector which records the effects of the signaling
in social networks. In Section 3, we introduce the details of process. By this process, the topological relationship of nodes
the novel clustering approach for multiattribute data sets. on the network could be transferred into a geometrical
Mathematical Problems in Engineering 3
structure of vectors in 𝑛-dimensional Euclidean space. Then results shows that Euclidean distance has better performance
the best partition of groups is determined by F statistics, in vector models, while some other numerical examples in
and the final community structure is given by the K-means high dimensional spaces show that the farthest and nearest
clustering method. distance are almost equal, although Euclidean distance is
Spectral clustering techniques have seen an explosive used to measure the similarity between data points. That is
development and proliferation over the past few years [30– in high-dimensional data, traditional similarity measures as
32]. Previous work indicated that a robust approach to used in conventional clustering algorithms are usually not
community detection is the maximization of the benefit meaningful. This problem and related phenomena require
function known as “modularity” over possible divisions of a adaptations of clustering approaches to the nature of high-
network, but Newman and Girvan [30] showed that the max- dimensional data. This area of research has been a highly
imization process can be written in terms of the modularity active one in recent years. Common approaches are known
matrix, which plays a role in community detection similar as, for example, subspace clustering, projected clustering,
to that played by the graph Laplacian in graph partitioning pattern-based clustering, or correlation clustering. Subspace
calculations, and the time complexity of this algorithm is clustering is the task of detecting all clusters in all subspaces,
𝑂(𝑛2 ). They also proposed an objective function for graph which means that a point might be a member of multiple
clustering called the 𝑄 function, which allows for automatic clusters, each existing in a different subspace. Subspaces can
selection of the number of clusters, and then higher values of either be axis parallel or affine. Projected clustering seeks
the 𝑄 function were proven to correlate well with good graph to assign each point to a unique cluster, but clusters may
clustering. White and Smyth [31] showed how the 𝑄 function exist in different subspaces. The general approach is to use
can be reformulated as a spectral relaxation problem and a special distance function together with a regular clustering
proposed two new spectral clustering algorithms that seek algorithm. Correlation clustering provides a method for
to maximize 𝑄. Capocci et al. [32] developed some spectral- clustering a set of objects into the optimum number of
based algorithm to reveal the structure of a complex network, clusters without specifying that number in advance.
which could be blurred by the bias artificially overimposed by In 2011, A new function “Close()” is presented based
the iterative bisection constraint. Such a method should be on the improvement of traditional algorithm to compensate
able to conjugate the power of spectral analysis to the caution their inadequacy for high-dimensional space [35]. Let
needed to reveal an underlying structure when there is no
clear cut partitioning, as is often the case in real networks. 𝑋 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ,
Lots of other community detection algorithms have also (1)
𝑌 = (𝑦1 , 𝑦2 , . . . , 𝑦𝑛 )
been proposed in the recent literatures. For example, Wu
and Huberman [33] proposed a method which partitions a denote two points in 𝑛-dimensional space. The function
network into two communities, where the network is viewed “Close()” is defined as
as an electric circuit, and a battery is attached to two random
nodes that are supposed to be within two communities. Shi ∑𝑛𝑖=1 𝑒−|𝑥𝑖 −𝑦𝑖 |
et al. [11] proposed a new genetic algorithm for community Close (𝑋, 𝑌) = . (2)
𝑛
detection, using the fundamental measure criterion modular-
ity 𝑄 as the fitness function. A special locus-based adjacency It depicts the similarity degree between two data points
encoding scheme is applied to represent the community and has the following properties.
partition; Shi et al. [34] proposed a novel method based on
(a) The minimum value of the function is 0, which means
particle swarm optimization to detect community structures
that the similarity degree between 𝑋 and 𝑌 is smallest
by optimizing network modularity.
since the difference comes closest to infinity in each
dimension.
3. Multidimensional and Multimembership (b) The maximum value of the function is 1, which
Clustering Method for Social Networks means that the similarity degree between 𝑋 and 𝑌 is
largest since they come closest to coinciding in each
3.1. Similarity of Multidimensional Data. Traditional dis- dimension.
tance functions include Euclidean distance, Chebyshev dis-
Similar to the weighted operator in traditional distance
tance, Manhattan distance, Mahalanobis distance, Weighted
functions, the close function can be corrected as
Minkowski distance, and Cosine distance. Among these
distance functions, Mahalanobis distance is based on corre- ∑𝑛𝑖=1 𝜔𝑖 𝑒−|𝑥𝑖 −𝑦𝑖 |
lations between variables by which different patterns can be Close (𝑋, 𝑌) = , (3)
𝑛
identified and analyzed. It gauges similarity of an unknown
sample set to a known one. It differs from Euclidean distance where 𝜔𝑖 ∈ [0, 1] denotes the importance degree of data in the
in that it takes into account the correlations of the data set 𝑖th dimension. Advantages of the new function are obvious in
and is scale-invariant. In other words, it is a multivariate effect high-dimensional similarity measurement according to the
size. comparison in [35]. Quantitative analysis also proved that this
All these distance functions have their own advantages function can avoid the effects of noise and the curse of high-
and disadvantages in practical applications. Some research dimension.
3.2. Similarity of Mixed Type Data. For clustering mul- For a subgraph 𝐶(|𝑉(𝐶)| > 1) of 𝐺, we define the 𝑘th
tiattributes datasets, we first introduce a method for the density of 𝐶 by
measurement of similarity between items as follows [36]. The
multiattribute datasets can be separated into two parts: the 2 ∑𝑒∈𝐸(𝐶) 𝜔𝑘 (𝑒)
pure numeric datasets and pure categorical datasets. Some 𝑑𝑘 (𝐶) = . (9)
|𝑉(𝐶)| (|𝑉(𝐶)| − 1)
existing efficiency clustering methods designed for these two
types of data sets are employed to produce corresponding In single weighted graph 𝐶, if 𝜔(𝑒) = 1 and 𝑑(𝐶) = 1
clusters. For the similarity matrix, we define 𝑆(𝑖, 𝑗) as the for every edge 𝑒 in 𝐶, the subgraph 𝐶 induces a clique. For
number of times the given sample pair 𝑥𝑖 and 𝑥𝑗 has co- a multiweighted graph (𝐺; 𝜔1 , 𝜔2 , . . . , 𝜔𝑟 ), a subgraph 𝐶 is
occurred in a cluster [37]. called a Δ-quasiclique if 𝑑𝑘 (𝐶) ≥ Δ for some positive real
Consider number Δ and for every 𝑘 ∈ {1, 2, . . . , 𝑟} (𝑟 is the number
of weights on the edge).
1 𝐻 Clustering is a process that detects all dense subgraphs in
𝑆 (𝑖, 𝑗) = 𝑆 (𝑥𝑖 , 𝑥𝑗 ) = ∑ 𝛿 (𝜋𝑘 (𝑥𝑖 ) , 𝜋𝑘 (𝑥𝑗 )) ,
𝐻 𝑘=1 𝐺 and constructs a hierarchically nested system to illustrate
(4) their inclusion relation.
1, 𝑎 = 𝑏, A heuristic process is applied here for finding all qua-
𝛿 (𝑎, 𝑏) ≡ {
0, 𝑎 ≠ 𝑏, sicliques with density of various levels. The core of the
algorithm is deciding whether or not to add a vertex to an
where 𝐻 denotes the number of the clustering. 𝜋𝑘 (𝑥𝑖 ) and already selected dense subgraph 𝐶. For a vertex V ∉ 𝑉(𝐶), we
𝜋𝑘 (𝑥𝑗 ) denote the cluster label of items 𝑥𝑖 and 𝑥𝑗 , respectively. define the contribution of V to 𝐶 by
Then for the pure numerical datasets, the similarity can be
defined as ∑𝑢∈𝑉(𝐶) 𝜔𝑘 (𝑢V)
𝑐𝑘 (V, 𝐶) = . (10)
∑𝑁 𝐶 (𝑖, 𝑗) |𝑉(𝐶)|
𝑛
𝑆1 (𝑖, 𝑗) = = 𝑖=1 , (5)
𝑁 𝑁 A vertex V is added into 𝐶 if 𝑐𝑘 (V, 𝐶) > 𝛼𝑑(𝐶) where 𝛼 is a
where 𝑁 is the number of clustering and 𝑛 is the number of user specified parameter.
times the pattern pair (𝑥𝑖 , 𝑥𝑗 ) is assigned to the same cluster In short, the main steps of our algorithm can be described
among the 𝑁 clustering. If (𝑥𝑖 , 𝑥𝑗 ) is assigned to the same as shown in Algorithm 1.
cluster, 𝐶(𝑖, 𝑗) = 1, otherwise 𝐶(𝑖, 𝑗) = 0. Trace the process of each vertex, and obtain the hierarchic
For the pure categorical datasets, the similarity can be tree.
defined as Our detailed community detection algorithm that can
find Δ-quasicliques in 𝐺 with various levels of Δ is as follows.
𝑛 ∑𝑚 𝐶 (𝑖, 𝑗) A hierarchically nested system is constructed to illustrate
𝑆2 (𝑖, 𝑗) = = 𝑖=1 , (6)
𝑚 𝑚 their inclusion relation.
where 𝑚 denotes the number of attributes. Then the similarity Step 0. 𝑙 ← 1 where 𝑙 is the indicator of the levels in the
of multiattribute datasets can be denoted by hierarchical system:
𝑆 = 𝑆1 + 𝛼𝑆2 , (7) 𝑀0 ←󳨀 𝛾 max {𝜔𝑘 (𝑒) : ∀𝑒 ∈ 𝐸 (𝐺) , ∀𝑘} , (11)
where 𝛼 is a user-defined parameter. If 𝛼 > 1, the categorical where 𝛾 (0 < 𝛾 < 1) is a user specified parameter (𝛾 is a cut-
datasets is more important than the numerical datasets; if 𝛼 < off threshold).
1, numerical datasets is more important. 𝑆1 (𝑖, 𝑗) and 𝑆2 (𝑖, 𝑗)
can also be used as two-dimensional (or multidimensional) Step 1 (the initial step). Let 𝐹 be the set of all edges 𝑒 of 𝐺 with
datasets to represent the similarities between items 𝑥𝑖 and 𝑥𝑗 .
min {𝜔𝑘 (𝑒) : 𝑘 = 1, 2, . . . , 𝑟} ≥ 𝑀0 . (12)
3.3. Multidimensional and Multimembership Clustering Meth-
od for Social Networks. A graph or network is one of the most Let 𝑚 = |𝐹|. Sort the edges of the set 𝐹 as a sequence 𝑆 =
commonly used models to represent real-valued relationships 𝑒1 , . . . , 𝑒𝑚 such that
of a set of input items. Since many traditional techniques 𝑟 𝑟 𝑟
for one-dimensional problems have been proven inadequate ∑ 𝜔𝑘 (𝑒1 ) ≥ ∑ 𝜔𝑘 (𝑒2 ) ≥ ⋅ ⋅ ⋅ ≥ ∑ 𝜔𝑘 (𝑒𝑚 ) , (13)
for high-dimensional or mixed type datasets due to the 𝑘=1 𝑘=1 𝑘=1
data sparseness and attribute redundancy, the graph-based
clustering method for single dimensional datasets proposed 𝜇 ← 1, 𝑝 ← 0, and 𝐿 𝑙 ← 0 where 𝐿 𝑙 is the community sets
in [24–26] can be extended as follows to directly cluster in the 𝑙th hierarchical level.
multidimensional datasets.
Step 2 (One has starting a new search).
Let 𝐺 = (𝑉, 𝐸) be a graph with the vertex set 𝑉 and
associated with 𝑟 weights:
𝑝 ←󳨀 𝑝 + 1, 𝐶𝑝 ←󳨀 𝑉 (𝑒𝜇 ) . 𝐿 𝑙 ←󳨀 𝐿 𝑙 ∪ {𝐶𝑝 } .
𝜔𝑘 : 𝐸 (𝐺) 󳨃󳨀→ [0, 1] , 𝑘 = 1, 2, . . . , 𝑟. (8) (14)
Input: A graph 𝐺 = (𝑉; 𝜔1 , 𝜔2 , . . . , 𝜔𝑟 ) is a multi-weighted graph with 𝜔𝑘 : 𝐸(𝐺) 󳨃→ [0, 1].

Output: Meaningful community sets in 𝐺.
Algorithm: Detect Δ-quasi-cliques in 𝐺 with various levels of Δ, and construct a hierarchically nested system to illustrate their
inclusion relation.
While 𝐸 (𝐺) ≠ 0
begin
determine the value of 𝑀0
Decompose(𝐺, 𝑀0 )
𝐸0 = {𝑒 ∈ 𝐸(𝐺): 𝜔𝑘 (𝑒) ≥ 𝑀0 , 𝑘 = 1, 2, . . . , 𝑟}
for each edge in 𝐸0 in decreasing order of weights, if the two vertexes of edge are not in any community, create a new empty
community 𝐶 Choose V in the rest vertex sets that have maximum contribution to 𝐶 and add V in it.
Merging (𝐺)
Merge two communities according to their common vertexes;
Contract each community to a vertex and redefine the weight of the corresponding edges.
Store the resulted graph to 𝐺.
End.
Algorithm 1
Step 3 (growing) Substep 4.3. 𝑗 ← 𝑗 + 1. If 𝑗 < ℎ, go to Substep 4.2.

Substep 3.1. 𝑈 ← 𝑉(𝐺) − 𝑉(𝐶𝑝 ); if 𝑈 = 0, go to Step 4; Substep 4.4. ℎ ← ℎ + 1, 𝑗 ← 1. If ℎ ≤ 𝑠, go to Substep 4.2.
otherwise continue.
Step 5. Contract each 𝐶𝑝 ∈ 𝐿 𝑙 as a vertex:
(∗) Pick V ∈ 𝑈 such that Π𝑟𝑘=1 𝑐𝑘 (V, 𝐶𝑝 ) is a maximum.
𝑠
If, for every 𝑘, 𝑉 (𝐺) ←󳨀 [𝑉 (𝐺) − ⋃ 𝑉 (𝐶𝑝 )] ∪ {𝐶1 , . . . , 𝐶𝑠 } ,
𝑝=1
𝑐𝑘 (V, 𝐶𝑝 ) ≥ 𝛼𝑛 𝑑𝑘 (𝐶𝑝 ) , (15)
∑𝑒∈𝐸𝑖,𝑗 𝜔𝑘 (𝑒)
where 𝑛 = |𝑉(𝐶𝑝 )| and 𝛼𝑛 = 1 − (1/2)𝜆(𝑛 + 𝑡) with 𝜆 ≥ 1, 𝜔𝑘 (𝑢V) ←󳨀 𝜔𝑘 (𝐶𝑖 , 𝐶𝑗 ) = , 𝑘 = 1, 2, . . . , 𝑟.
𝐸𝑖,𝑗
𝑡 ≥ 1 as user specified parameters, then 𝐶𝑝 ← 𝐶𝑝 ∪ {V}, and
(19)
go back to Substep 3.1.
If Inequality (15) is not satisfied, then
The vertex 𝑢 is obtained by contracting 𝐶𝑖 and V is obtained
𝑈 ←󳨀 𝑈 − {V} . (16) by contracting 𝐶𝑗 where 𝐸𝑖𝑗 is the set of crossing edges which
is defined as
If 𝑈 ≠ 0, repeat (∗). If 𝑈 = 0, go to Substep 3.2.
𝐸𝑖,𝑗 = {𝑥𝑦 : 𝑥 ∈ 𝐶𝑖 , 𝑦 ∈ 𝐶𝑗 , 𝑥 = 𝑦} . (20)
Substep 3.2. 𝜇 ← 𝜇 + 1. If 𝜇 > 𝑚 go to Step 4; otherwise
continue. For
𝑝−1
Substep 3.3. Suppose 𝑒𝜇 = 𝑥𝑦. If at least one of 𝑥, 𝑦 ∉ ∪𝑖=1
𝑞 ∈ 𝑉 (𝐺) − {𝐶1 , . . . , 𝐶𝑠 } , (21)
𝑉(𝐶𝑖 ), then go to Step 2; otherwise go to Substep 3.2.
Step 4 (merging). define 𝜔𝑘 (𝑞, 𝐶𝑖 ) = 𝜔𝑘 ({𝑞}, 𝐶𝑖 ). Other cases are defined
similarly.
Substep 4.1. List all members of 𝐿 𝑙 as a sequence 𝐶1 , . . . , 𝐶𝑠 If |𝑉(𝐺)| ≥ 2, then go to Step 6; otherwise go to End.
such that
Step 6. One has
󵄨󵄨 󵄨 󵄨 󵄨 󵄨 󵄨
󵄨󵄨𝑉 (𝐶1 )󵄨󵄨󵄨 ≥ 󵄨󵄨󵄨𝑉 (𝐶2 )󵄨󵄨󵄨 ≥ ⋅ ⋅ ⋅ ≥ 󵄨󵄨󵄨𝑉 (𝐶𝑠 )󵄨󵄨󵄨 , (17)
𝑙 ←󳨀 𝑙 + 1, 𝐿 𝑙 ←󳨀 0,
where 𝑠 = |𝐿 𝑙 |. ℎ ← 2, 𝑗 ← 1. (22)
𝜔0 ←󳨀 𝛾 max {𝜔 (𝑒) : ∀𝑒 ∈ 𝐸 (𝐺) , ∀𝑘} ,
Substep 4.2. If
󵄨󵄨 󵄨 󵄨 󵄨 where 𝛾 (0 < 𝛾 < 1) is a user specified parameter, and go to
󵄨󵄨𝐶𝑗 ∩ 𝐶ℎ 󵄨󵄨󵄨 > 𝛽 min (󵄨󵄨󵄨𝐶𝑗 󵄨󵄨󵄨 , 󵄨󵄨󵄨󵄨𝐶ℎ 󵄨󵄨󵄨󵄨) , (18)
󵄨 󵄨 󵄨 󵄨 Step 1 (to start a new search in a higher level of the hierarchical
system).
(where 0 < 𝛽 < 1 is a user specified parameter), then 𝐶𝑠+1 ←
𝐶𝑗 ∪ 𝐶ℎ , and the sequence 𝐿 𝑙 is rearranged as follows: End.
𝐶1 , . . . , 𝐶𝑠−1 ← deleting 𝐶𝑗 , 𝐶ℎ from 𝐶1 , . . . , 𝐶𝑠+1 . Trace the movement of each vertex and generate the
𝑠 ← 𝑠 − 1, ℎ ← max{ℎ − 2, 1}, and go to Substep 4.4. hierarchic tree.
Table 1: Some information of 3000 mobile customers. Table 2: The customer segmentation of mobile network.
Long Text Average
Local call Roaming Average Average Average
Customer distance message Package text Number
fee fee Cluster local call long roaming
number call fee and WAP type message of
(Yuan) (Yuan) number fee distance call fee
(Yuan) fee (Yuan) and WAP customer
(Yuan) fee (Yuan) (Yuan)
1 55.3 13.7 120.6 14.2 D fee (Yuan)
2 132 44.8 36.2 5.6 B 1 156.9 172.8 39.8 58.5 121
3 47.1 233.6 79.4 6.2 B 2 299.1 43.2 38.7 46.9 64
4 173 19.3 87.5 19.3 C 3 42.6 32.9 174.7 36.2 168
5 23.7 80.5 21 9 A 4 212.8 103.3 574.3 39.7 13
6 62.3 62.9 77.8 10.6 E 5 187.9 871.5 35.3 28.7 9
7 242.5 21.8 23.5 24.2 A 6 162.1 262.3 354.8 21.2 12
8 166.2 34.5 8 19.5 C 7 43.0 25.8 13.7 21.2 2077
⋅⋅⋅ ⋅⋅⋅ ⋅⋅⋅ ⋅⋅⋅ ⋅⋅⋅ ⋅⋅⋅ 8 19.2 7.5 4.8 13.5 792
3000 77.6 67 21.2 24.7 D
1000
If the input data is an unweighted graph 𝐺, the adjacency 900

information is used for establishing the similarity matrix of
800
𝐺. Let 𝐴 = (𝑎𝑖𝑗 ) be the adjacency matrix of 𝐺 where
700
Average consumption
1, there is an edge between 𝑖 and 𝑗

𝑎𝑖𝑗 = { (23) 600
0, otherwise
500
and the inner product of the 𝑖th and the 𝑗th row of 𝐴 is used
to describe the similarity between nodes 𝑖 and 𝑗 and stored as 400
𝐺(𝑖, 𝑗) in the similarity matrix 𝐺.
300
4. Simulation Examples 200
100
In order to validate the feasibility of the proposed novel
approach to cluster multi-dimensional data sets, we randomly 0
took 3000 customers’ consumption lists of August 2012 from 1 2 3 4 5 6 7 8
Shandong Mobile Corporation and use our new approach to
Local call fee Roaming fee
divide these customers into distinguishing clusters according Long distance call fee Text message and WAP fee
to 4 evaluation indices: local call fee, long distance call fee,
roaming fee and text message and WAP fee. The original data Figure 1: Average consumption list of 8 Groups.
of 3000 customers are listed in Table 1.
We have applied our approach to this problem, and the
results of segmentation and their average consumption are
listed in Table 2 and Figure 1. customer respectively, and some special policies should be
As we can see from the clustering result, the long distance recommended for the 39 customers, who belong to either
fee of group 1 has a high proportion of their total expenses; Group 1 or 8 to help them become loyal higher value
Groups 3 and 4 have high roaming fees; Group 8 has lower customers.
cost in each index; Groups 2, 3, and 4 have higher text
message and WAP fees. Mobile corporations can initiate 5. Conclusions
corresponding policies according to the clustering results. For
example, for the customers in Groups 2, 3, and 4, mobile In this paper, a graph-based new clustering method for multi-
corporation should provide them with some discount text dimensional datasets is proposed. Due to the inherent spar-
message package; for the customers in Groups 3, 4, and 6, sity of data points, most existing clustering algorithms do not
some discount package of roaming will also help to increase work efficiently for multi-dimensional datasets, and it is not
customer loyalty and stability. feasible to find interesting clusters in the original full space of
On the other hand, we noticed that the sum of the all dimensions. These researches were mainly focused on the
last column of Table 2 is larger than 3000. This is because representation of a set of items with a single attribute, which
our method allows multimembership clustering; thus some cannot accurately represent all the attributes and capture the
customers can belong to more than one group. For instance, inherent dependency among multiple attributes. The new
Groups 8 and 1 are low value customer and high value clustering method we proposed in this paper overcomes
this problem by directly clustering items according to the [9] F. Cao, “A weighting K-modes algorithm for subspace clustering
multidimensional information. Since it does not need data of categorical data,” Neurocomputing, vol. 108, pp. 23–30, 2012.
preprocessing, this new method may significantly improves [10] S. Yang and S. Luo, “A local quantitative measure for commu-
clustering efficiency. It also has two-distinguished features: nity detection in networks,” International Journal of Intelligent
nonbinary hierarchical tree and multimembership clusters. Engineering Informatics, vol. 1, no. 1, pp. 38–52, 2010.
The application in customer relationship management has [11] C. Shi, Y. Wang, B. Wu, and C. Zhong, “A new genetic algorithm
proved the efficiency and feasibility of the new clustering for community detection,” in Complex Sciences, Part II, vol. 5
method. of Lecture Notes of the Institute for Computer Sciences, Social
Informatics and Telecommunications Engineering, pp. 1298–
1309, 2009.
Conflict of Interests [12] M. Hoerdt and U. Louis, “Completeness of the internet core
topology collected by a fast mapping software,” in Proceedings
Peixin Zhao, Cun-Quan Zhang, Di Wan, and Xin Zhang of the 11th International Conference on Software, Telecommuni-
certify that there is no actual or potential conflict of interests cations and Computer Networks, pp. 257–261, 2003.
in relation to this paper. [13] A. Broder, P. Kumar, F. Maghoul et al., “Graph structure in the
web,” in Proceedings of the 9th International Conference on the
Acknowledgments World Wide Web, pp. 15–19, 2003.
[14] J. Scott, Social Network Analysis: A Handbook, Sage, 2000.
The first author is partially supported by the China Post- [15] D. A. Fell and A. Wagner, “The small world of metabolism,”
doctoral Science Foundation funded Project (2011M501149), Nature Biotechnology, vol. 18, no. 11, pp. 1121–1122, 2000.
the Humanity and Social Science Foundation of Min- [16] T. Chen, N. L. Zhang, T. Liu, K. M. Poon, and Y. Wang, “Model-
istry of Education of China (12YJCZH303), the Spe- based multidimensional clustering of categorical data,” Artificial
cial Fund Project for Postdoctoral Innovation of Shan- Intelligence, vol. 176, pp. 2246–2269, 2012.
dong Province (201103061), the Informationization Research [17] L. Poon, N. L. Zhang, T. Liu, and A. H. Liu, “Model-based
Project of Shandong Province (2013EI153), and Independent clustering of high-dimensional data: variable selection versus
Innovation Foundation of Shandong University, IIFSDU facet determination,” International Journal of Approximate Rea-
(IFW12109). The second is author partially supported by soning, vol. 54, no. 1, pp. 196–215, 2012.
an NSA Grant H98230-12-1-0233 and an NSF Grant DMS- [18] A. Pothen, H. D. Simon, and K.-P. Liou, “Partitioning sparse
1264800. matrices with eigenvectors of graphs,” SIAM Journal on Matrix
Analysis and Applications, vol. 11, no. 3, pp. 430–452, 1990,
Sparse matrices (Gleneden Beach, OR, 1989).
References [19] B. W. Kernighan and S. Lin, “A efficient heuristic procedure for
partitioning graphs,” Bell System Technical Journal, vol. 49, pp.
[1] X. Jin, C. M. K. Cheung, M. K. O. Lee, and H. Chen, “How to 291–307, 1970.
keep members using the information in a computer-supported
[20] M. Girvan and M. E. J. Newman, “Community structure in
social network,” Computers in Human Behavior, vol. 25, no. 5,
social and biological networks,” Proceedings of the National
pp. 1172–1181, 2009.
Academy of Sciences of the United States of America, vol. 99, no.
[2] A. Bodini, “The qualitative analysis of community food webs: 12, pp. 7821–7826, 2002.
implications for wildlife management and conservation,” Jour-
[21] M. E. J. Newman, “Fast algorithm for detecting community
nal of Environmental Management, vol. 41, no. 1, pp. 49–65, 1994.
structure in networks,” Physical Review E, vol. 69, no. 6, Article
[3] P. C. Verhoef and K. N. Lemon, “Successful customer value ID 066133, pp. 1–66133, 2004.
management: key lessons and emerging trends,” European [22] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D.
Management Journal, vol. 31, no. 1, pp. 1–15, 2013. Paris, “Defining and identifying communities in networks,”
[4] C. Kiss and M. Bichler, “Identification of influencers— Proceedings of the National Academy of Sciences of the United
measuring influence in customer networks,” Decision Support States of America, vol. 101, no. 9, pp. 2658–2663, 2004.
Systems, vol. 46, no. 1, pp. 233–253, 2008. [23] A. Clauset, M. E. J. Newman, and C. Moore, “Finding commu-
[5] E. S. Bernardes and G. A. Zsidisin, “An examination of strategic nity structure in very large networks,” Physical Review E, vol. 70,
supply management benefits and performance implications,” no. 6, Article ID 066111, 6 pages, 2004.
Journal of Purchasing and Supply Management, vol. 14, no. 4, pp. [24] Y. Ou and C.-Q. Zhang, “A new multimembership clustering
209–219, 2008. method,” Journal of Industrial and Management Optimization,
[6] D. Li, W. Dai, and W. Tseng, “A two-stage clustering method vol. 3, no. 4, pp. 619–624, 2007.
to analyze customer characteristics to build discriminative cus- [25] X. Qi, K. Christensen, R. Duval et al., “A hierarchical algorithm
tomer management: a case of textile manufacturing business,” for clustering extremist web pages,” in Proceedings of the
Expert Systems with Applications, vol. 38, no. 6, pp. 7186–7191, International Conference on Advances in Social Network Analysis
2011. and Mining (ASONAM ’10), pp. 458–463, August 2010.
[7] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern [26] P. Zhao and C. Zhang, “A new clustering method and its
Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010. application in social networks,” Pattern Recognition Letters, vol.
[8] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. 32, no. 15, pp. 2109–2118, 2011.
O’Callaghan, “Clustering data streams: theory and practice,” [27] P. Pons and M. Latapy, “Computing communities in large
IEEE Transactions on Knowledge and Data Engineering, vol. 15, networks using random walks,” Journal of Graph Algorithms and
no. 3, pp. 515–528, 2003. Applications, vol. 10, no. 2, pp. 191–218, 2006.
[28] S. V. Dongen, Graph clustering by flow simulation [Ph.D.

dissertation], University of Utrecht, 2000.
[29] Y. Hu, M. Li, P. Zhang, Y. Fan, and Z. Di, “Community detection
by signaling on complex networks,” Physical Review E, vol. 78,
no. 1, Article ID 016115, 2008.
[30] M. E. J. Newman and M. Girvan, “Finding and evaluating
community structure in networks,” Physical Review E, vol. 69,
no. 2, Article ID 026113, 2004.
[31] S. White and P. Smyth, “A spectral clustering approach to
finding communities in graphs,” in Proceedings of SIAM Inter-
national Conference on Data Mining, pp. 76–84, 2005.
[32] A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori,
“Detecting communities in large networks,” Physica A, vol. 352,
no. 2-4, pp. 669–676, 2005.
[33] F. Wu and B. A. Huberman, “Finding communities in linear
time: a physics approach,” European Physical Journal B, vol. 38,
no. 2, pp. 331–338, 2004.
[34] Z. Shi, Y. Liu, and J. Liang, “PSO-based community detection
in complex networks,” in Proceedings of the 2nd International
Symposium on Knowledge Acquisition and Modeling (KAM ’09),
pp. 114–119, December 2009.
[35] C. Shao, W. Lou, and L. Yan, “Optimization of algorithm of
similarity measurement in high dimensional data,” Computer
Technology and Development, vol. 20, no. 2, pp. 1–4, 2011.
[36] H. Luo and H. Wei, “Clustering algorithm for mixed data based
on clustering ensemble technique,” Computer Science, vol. 37,
no. 11, pp. 234–238, 2010.
[37] A. Fred, “Finding consistent clusters in data partitions,” in Mul-
tiple Classifier Systems, vol. 2096 of Lecture Notes in Computer
Science, pp. 309–318, 2001.
Advances in Advances in Journal of Journal of
Operations Research
Decision Sciences
Applied Mathematics
Algebra
Probability and Statistics
http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014
The Scientific International Journal of

World Journal
Differential Equations
http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014
Submit your manuscripts at

http://www.hindawi.com
International Journal of Advances in

Combinatorics
Mathematical Physics
http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014
Journal of Journal of Mathematical Problems Abstract and Discrete Dynamics in

Complex Analysis
Mathematics
in Engineering
Applied Analysis
Nature and Society
http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014
International
Journal of Journal of
Mathematics and
Mathematical
Discrete Mathematics
Sciences
Journal of International Journal of Journal of
Hindawi Publishing Corporation Hindawi Publishing Corporation Volume 2014

Function Spaces
Stochastic Analysis
Optimization
http://www.hindawi.com Volume 2014 http://www.hindawi.com http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014

Research Article

Uploaded by

Copyright:

Available Formats

Research Article

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Research Article

Uploaded by

Copyright:

Available Formats

Hindawi Publishing Corporation

Mathematical Problems in Engineering

Peixin Zhao,1 Cun-Quan Zhang,2 Di Wan,3 and Xin Zhang4

Correspondence should be addressed to Peixin Zhao; pxzhao@126.com

Received 15 July 2013; Accepted 7 August 2013

Academic Editor: Yoshinori Hayafuji

1. Introduction Clustering analysis is a data mining technique developed

Input: A graph 𝐺 = (𝑉; 𝜔1 , 𝜔2 , . . . , 𝜔𝑟 ) is a multi-weighted graph with 𝜔𝑘 : 𝐸(𝐺) 󳨃→ [0, 1].

Step 3 (growing) Substep 4.3. 𝑗 ← 𝑗 + 1. If 𝑗 < ℎ, go to Substep 4.2.

If the input data is an unweighted graph 𝐺, the adjacency 900

1, there is an edge between 𝑖 and 𝑗

4. Simulation Examples 200

[28] S. V. Dongen, Graph clustering by flow simulation [Ph.D.

The Scientific International Journal of

Submit your manuscripts at

International Journal of Advances in

Journal of Journal of Mathematical Problems Abstract and Discrete Dynamics in

Journal of International Journal of Journal of

Hindawi Publishing Corporation Hindawi Publishing Corporation Volume 2014

You might also like