Concept-Relational Text Clustering
Antoon Bronselaer,∗ Guy De Tré†
Department of Telecommunications and Information Processing,
Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
The ongoing exponential growth of online information sources has led to a need for reliable
and efficient algorithms for text clustering. In this paper, we propose a novel text model called
the relational text model that represents each sentence as a binary multirelation over a concept
space C. Through usage of the smart indexing engine (SIE), a patented technology of the Belgian
company i.Know, the concept space adopted by the text model can be constructed dynamically. This
means that there is no need for an a priori knowledge base such as an ontology, which makes
our approach context independent. The concepts resulting from the SIE possess the property that
the frequency of a concept is a measure of its relevance. We exploit this property in the development
of the CR-algorithm. Our approach relies on the representation of a data set D as a multirelation,
of which k-cuts can be taken. These cuts can be seen as sets of relevant patterns with respect to
the topics that are described by the documents. Analysis of dependencies between patterns allows
us to produce clusters such that precision is sufficiently high. The best k-cut is the one that best
approximates the estimated number of clusters, to ensure recall. Experimental results on Dutch
news fragments show that our approach outperforms both basic and advanced methods. © 2012
Wiley Periodicals, Inc.
1. INTRODUCTION
In the past decades, text clustering has been a challenging problem for researchers in both computational linguistics and data mining. Several applications
such as automated text summarization and automated deduplication of textual data
sources rely on text clustering approaches. The key problem—deciding whether two
textual documents are dealing with the same topic—is far from trivial. In addition,
the ongoing exponential growth of online information sources has created a need for
efficient algorithms. To illustrate this, Figure 1 shows the number of really simple
syndication (RSS) feeds on news sites, generated on a daily basis in Belgium and
the Netherlands.
Taking into account that the number of RSS feeds produced by news sites is
a lower bound for the number of articles produced on these news sites, Figure 1
shows that manual tracking of all news published on a daily basis is infeasible.
∗ Author to whom all correspondence should be addressed: e-mail: antoon.bronselaer@ugent.be
† e-mail: [email protected]
Figure 1. Accumulated number of RSS feeds produced on a daily basis by 10 Belgian and Dutch
news sites, measured during one month.
This is partially due to the fact that, when multiple sources of information on the Web
are tracked, much of the information provided by these sources is repeated. If such
duplicate information could be removed, a manual overview of the information would
become more feasible. A solution to this problem is offered
by text clustering methods. The need for good clustering algorithms is emphasized
in Ref. 14 in a context of automated text summarization.
In this paper, we propose a new model for the representation of textual documents as an alternative to the vector space model.9 Whereas the vector space
model uses a naive strategy for the decomposition of text into a vector of words, the
relational model relies on a decomposition into concepts. Moreover, our model takes
into account the relations between concepts. The new model offers several benefits.
First, a semantically rich representation of textual documents is obtained that can be
processed efficiently by using the standard operators of multirelations.18 Second,
relevant information can be identified by thresholding the frequency of (couples of)
concepts. Moreover, the results of such thresholding are extremely sparse structures.
This is in contrast with the issues of high dimensionality that are present in vector
spaces. Third, the identification of concepts and relations does not rely on ontologies
or other a priori knowledge, which makes our approach context independent.
The remainder of the paper is structured as follows. In Section 2, some preliminary definitions regarding multirelations, possibility theory, and evaluators are
given. Section 3 introduces the new text model, called the relational text model,
which relies on the theory of multirelations. Section 4 discusses the evaluation of
documents and introduces a set of assumptions. As a result, the problem of text
clustering can be rewritten in terms of an unknown binary relation. Section 5 adopts
this problem formulation and deals with the estimation of the unknown relation.
Moreover, it is studied how clusters of high quality can be deduced from a candidate relation. From these considerations, the concept-relational (CR) algorithm is
formulated. To test our approach, some experimental results are reported in Section 6. Finally, Section 7 summarizes the most important contributions of this paper.
2. PRELIMINARIES
2.1. Multisets
In what follows, the concept of a multiset will be used intensively. Informally, a
multiset M is an extension of a (Cantorian) set in which elements can occur multiple
times. Formally, a multiset M derived from a universe U is defined by a counting
function M : U → N.18 For u ∈ U , M(u) represents the number of times that u
appears in M. The set of all multisets drawn from a universe U is denoted M(U ).
The concept of subset is extended to multisets as M1 ⊆ M2 ⇔ ∀u ∈ U : M1 (u) ≤ M2 (u),
and the cardinality of a multiset M is given by $|M| = \sum_{u \in U} M(u)$. Yager18
defines the following operators for multisets:
∀u ∈ U : (M1 ∪ M2 )(u) = max(M1 (u), M2 (u)),
∀u ∈ U : (M1 ∩ M2 )(u) = min(M1 (u), M2 (u)),
∀u ∈ U : (M1 ⊕ M2 )(u) = M1 (u) + M2 (u).
The ∈-operator applies for multisets as follows: u ∈ M ⇔ M(u) > 0. The k-cut of
a multiset M is a regular set Mk = {u|u ∈ U ∧ M(u) ≥ k}.
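For illustration, these operators can be mimicked in Python with collections.Counter, which already acts as a counting function (an illustrative sketch of ours, not an implementation used in the paper):

    from collections import Counter

    def m_union(m1, m2):
        # (M1 ∪ M2)(u) = max(M1(u), M2(u))
        return Counter({u: max(m1[u], m2[u]) for u in set(m1) | set(m2)})

    def m_intersection(m1, m2):
        # (M1 ∩ M2)(u) = min(M1(u), M2(u))
        return Counter({u: min(m1[u], m2[u]) for u in set(m1) & set(m2)})

    def m_sum(m1, m2):
        # (M1 ⊕ M2)(u) = M1(u) + M2(u)
        return m1 + m2

    def k_cut(m, k):
        # The k-cut of M: the regular set of elements with multiplicity at least k.
        return {u for u, count in m.items() if count >= k}

For example, k_cut(Counter({'rain': 3, 'snow': 1}), 2) yields {'rain'}.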
2.2. The Coreference Problem
In the remainder of this paper, let D = {d1 , ..., dn } denote a set of documents
and let T = {t1 , ..., tm } denote a set of topics. It is assumed that each document can
be assigned exactly one topic, given by the surjective function ρ : D → T . Two
arbitrary documents di and dj are called coreferent if and only if ρ(di ) = ρ(dj ),
i.e., if they refer to the same topic.
In a more general setting, where two objects are not bound to be textual
documents, the uncertainty about coreference for two objects can be expressed as a
possibility distribution6–8, 10, 15 over the Boolean domain B = {T , F }. A possibility
distribution π over an outcome space X can be derived from a possibility measure,
much like a probability distribution Pr can be derived from a probability measure.
It satisfies the following normalization constraint:
$$\max_{x \in X} \pi(x) = 1. \qquad (1)$$
This implies that a possibility distribution is modeled by a fuzzy set, which is why
we denote the set of all possibility distributions over X as F (X ). A possibility distribution for which X = B is called a possibilistic truth value, and a function that
generates such a truth value in the case of coreference is called an evaluator.1
DEFINITION 1 (Evaluator). Given an object space O, an evaluator is defined as a
function:
EO : O 2 → F (B) : (o1 , o2 ) → {(T , μp (T )), (F, μp (F ))}.
(2)
Hereby, μp (T ) represents the possibility that o1 and o2 are coreferent and μp (F )
represents the possibility that o1 and o2 are not coreferent. F (B) represents the set
of all possibility distributions over B.
In the remainder of this paper, we shall adopt the couple notation for possibilistic truth values. This means that {(T , μp (T )), (F, μp (F ))} is abbreviated to
(μp (T ), μp (F )). We will assume that an evaluator is both reflexive:
∀(o1 , o2 ) ∈ O 2 : o1 = o2 ⇒ EO (o1 , o2 ) = (1, 0)   (3)

and symmetric:

∀(o1 , o2 ) ∈ O 2 : EO (o1 , o2 ) = EO (o2 , o1 ).   (4)
On the basis of the possibilities that two objects are (not) coreferent, a Boolean
decision can be taken by use of a binary decision model.
DEFINITION 2 (Binary decision model). Assume z ∈ [0, 1]. A binary z-decision
model is a function B : F (B) → B such that:
$$B(p) = \begin{cases} F & \text{if } 1 - \mu_p(T) > z, \\ T & \text{else.} \end{cases} \qquad (5)$$
The value z is called the certainty threshold.
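As a minimal sketch (ours), a possibilistic truth value can be represented by the couple (μp(T), μp(F)) and the z-decision model by a threshold test; the equality-based evaluator below is only a placeholder and not the string evaluator of Refs. 2 and 3:

    from typing import Tuple

    PTV = Tuple[float, float]  # couple notation (mu_p(T), mu_p(F)) of a possibilistic truth value

    def evaluate_equality(o1, o2) -> PTV:
        # Placeholder evaluator: reflexive and symmetric, (1, 0) for equal objects.
        return (1.0, 0.0) if o1 == o2 else (0.0, 1.0)

    def decide(p: PTV, z: float = 0.0) -> bool:
        # Binary z-decision model (Equation 5): answer F only when the certainty that
        # the objects are not coreferent, 1 - mu_p(T), exceeds the threshold z.
        mu_T, _mu_F = p
        return not (1.0 - mu_T > z)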
3. A RELATIONAL MODEL FOR TEXT DOCUMENTS
In this section, we introduce a relational model for textual documents that is
based on the concept of multirelations. We define operators for the relational model.
These operators will be used to define a new approach to text clustering. To define
our model, we first define binary multirelations. Within the scope of this paper, it is
assumed that relations are homogeneous.
DEFINITION 3 (Binary multirelation). Consider a universe U . A binary multirelation
R over U is defined as a subset of M(U × U ).
A binary multirelation is thus a multiset of couples. The number of times that
a couple occurs in the multirelation is called the multiplicity of the couple. A binary
multirelation R over U is symmetric if and only if
∀(u1 , u2 ) ∈ U 2 : R(u1 , u2 ) = R(u2 , u1 ).
(6)
A binary multirelation R is called k-transitive if and only if the regular relation Rk
is transitive, with Rk the k-cut of R. A binary multirelation R is called k-reflexive if
and only if Rk is reflexive. A binary multirelation R is irreflexive if and only if
∀u ∈ U : R(u, u) = 0.
(7)
With the definition of multirelations at hand, it is possible to define the relational
model for documents.
DEFINITION 4 (Relational model for documents). Assume a concept space C . A
document d is defined as a set of l sentences S = {s1 , ..., sl }, where each sentence
si ⊂ M(C × C ) is a binary, irreflexive, symmetrical, and 1-transitive multirelation
over C . The set of all documents is denoted as D.
In what follows, we shall always assume that multirelations are binary. For
that reason, we omit the adjective “binary” in the following. The relational model
for documents considers a document as a set of sentences, where each sentence is
a multirelation over concepts. Each concept is a collection of words that together
constitute a meaningful, semantical unit. An important observation is that we can
represent the whole document as a multirelation by taking the relational transformation of the document.
DEFINITION 5 (Relational transformation). For a universe of documents D, the
relational transformation is defined as a function:
$$\psi : D \to M(C \times C) : d \mapsto \bigoplus_{i=1}^{|S|} s_i, \qquad (8)$$
where ⊕ is the sum operator for multisets, used in prefix notation. The relational
transformation ψ(d) is equal to the sum of all sentences.
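Under the Counter-based representation sketched earlier (again an illustration of ours), the relational transformation of a document is simply the multiset sum of its sentences:

    from collections import Counter

    def relational_transformation(sentences):
        # psi(d): the multiset sum (⊕) of the sentence multirelations, each sentence
        # being a Counter over (concept, concept) couples.
        return sum(sentences, Counter())

The value relational_transformation(S)[(c1, c2)] then gives the multiplicity of a couple in the document.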
An important question that arises with the introduction of the relational model,
is how to represent a piece of text with this model. Such a representation is possible
by use of the i.Know-engine. This engine is a patented technology of the Belgian
company i.Know that transforms textual documents into a semistructured model. An
important advantage of this technology, which is also the most significant innovation
with respect to existing methods, is the fact that the i.Know-engine is context independent. In the literature, many methods are based on semantic knowledge systems,
but the disadvantage of these methods is that they rely on a context-dependent component. Classical natural language processing (NLP) systems use a top-down approach
to identify terms based on a predefined thesaurus, ontology or statistical model. In
contrast to this, i.Know adheres to a bottom-up approach where the SIE (smart
indexing engine) automatically identifies all complex terms in a text, regardless of
their length or semantical complexity. The i.Know engine relies on the relational
structure of natural languages to find all concept–relation–concept (CRC) units in
a document. Thus, instead of using a predefined ontology, the i.Know technology
offers a method where concepts are dynamically mined from a text.
Example. Let us consider the following sentence:
“Heavy rain is falling over southern and south-western England, while more snow
is due to arrive in Wales.”
When this sentence is parsed by the SIE, the result is the following:
“Heavy rain IS FALLING OVER southern AND south-western England, WHILE
more snow IS DUE TO ARRIVE IN Wales.”
In this result, a group of words that is marked in bold font is considered to be
a concept. A group of words in CAPITAL LETTERS is considered to be a relation. Other
words are considered to be nonrelevant. This result is cast into the relational text
model by the multirelation M shown in Figure 2. Hereby, the multiplicity of each
couple is equal to 1. Notice that the multirelation is indeed 1-transitive, symmetric,
and irreflexive.
A remarkable property of the concepts produced by the SIE is that their multiplicity is a measure for their relevance. This property is induced by the fact that
concepts are meaningful groups of words. It is a well-known fact that this property
is not possessed by classical q-grams. In this paper, we would like to exploit this
interesting feature. Therefore, we shall define the relational transformation of a data
Figure 2. Relational representation of an example sentence (concepts: Southern, South-western England, Heavy rain, Snow, Wales).
set D as follows:
$$\Psi(D) = \bigoplus_{d \in D} \psi(d). \qquad (9)$$
The relational transformation of a data set of documents is a simple and elegant way
to represent the relational and conceptual information of a data set D. We will now
introduce a number of operators for documents represented in the relational text
model.
DEFINITION 6. Assume a document d ∈ D and a couple of concepts (c1 , c2 ) ∈ C 2 .
The couple (c1 , c2 ) belongs to d, denoted as (c1 , c2 ) ∈ d if and only if:
(c1 , c2 ) ∈ ψ(d).
(10)
DEFINITION 7 (Concepts of a document). Assume a document d ∈ D. The concepts
of d are given by the multiset Cd such that
$$\forall c_1 \in C : C_d(c_1) = \sum_{c_2 \in C} \psi(d)(c_1, c_2). \qquad (11)$$
Definition 7 implies that the concepts of a document are represented as a
multiset, where the multiplicity Cd (c) indicates the number of times c occurs as the
first concept in a couple of ψ(d). We only verify the first concept because ψ(d)
is symmetrical (Definition 5). In what follows, we will be interested in verifying
whether the concepts of a couple (possibly) occur in the multiset of concepts related
to a document.
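Continuing the Counter-based sketch (our illustration), the concepts of a document can be obtained from ψ(d) as follows:

    from collections import Counter

    def concepts_of(psi_d):
        # C_d: for every concept, add up the multiplicities of the couples of psi(d)
        # in which it occurs as the first component (Equation 11).
        cd = Counter()
        for (c1, _c2), multiplicity in psi_d.items():
            cd[c1] += multiplicity
        return cd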
DEFINITION 8. Assume a document d ∈ D with concept space C . We define
∀(c1 , c2 ) ∈ C 2 : (c1 , c2 ) ∈̂ d ⇔ (c1 ∈ Cd ∧ c2 ∈ Cd ).   (12)
DEFINITION 9. Assume a document d ∈ D with concept space C and an evaluator
EC . We define
∀(c1 , c2 ) ∈ C 2 : (c1 , c2 ) ∈̃ d ⇔ (∃(c1′ , c2′ ) ∈ d : B (EC (c1 , c1′ )) ∧ B (EC (c2 , c2′ ))),   (13)

where B is a binary decision model with z = 0.
We can now define several selection operators for documents.
DEFINITION 10 (Relational selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The relational selection of documents under (c1 , c2 ) is defined as a set
of documents:
D(c1 ,c2 ),∈ = {d ∈ D|(c1 , c2 ) ∈ d} .
(14)
The couple (c1 , c2 ) is called a (search) pattern.
DEFINITION 11 (Conceptual selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The conceptual selection of documents under (c1 , c2 ) is defined as a set
of documents:

D(c1 ,c2 ),∈̂ = {d ∈ D | (c1 , c2 ) ∈̂ d} .   (15)
The couple (c1 , c2 ) is called a (search) pattern.
DEFINITION 12 (Possibilistic selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The possibilistic selection of documents under (c1 , c2 ) is defined as a
set of documents:

D(c1 ,c2 ),∈̃ = {d ∈ D | (c1 , c2 ) ∈̃ d} .   (16)
The couple (c1 , c2 ) is called a (search) pattern.
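The three selection operators can be sketched as follows (an illustration of ours; psi maps a document to its Counter of couples, and possibly_coreferent stands in for the Boolean decision B(EC(·, ·)) with z = 0):

    def select_relational(D, c1, c2, psi):
        # Relational selection: documents whose transformation contains the couple.
        return {d for d in D if psi(d)[(c1, c2)] > 0}

    def select_conceptual(D, c1, c2, psi):
        # Conceptual selection: documents in which both concepts occur,
        # not necessarily as a couple (Definitions 8 and 11).
        def concepts(d):
            return {a for (a, _b) in psi(d)}
        return {d for d in D if c1 in concepts(d) and c2 in concepts(d)}

    def select_possibilistic(D, c1, c2, psi, possibly_coreferent):
        # Possibilistic selection: documents containing a couple whose components
        # are possibly coreferent with c1 and c2 (Definitions 9 and 12).
        return {d for d in D
                if any(possibly_coreferent(c1, a) and possibly_coreferent(c2, b)
                       for (a, b) in psi(d))}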
On the basis of the selection of documents, it is possible to define dependencies
between patterns.
DEFINITION 13 (Dependent patterns). Two patterns (c1 , c2 ) and (c1′ , c2′ ) are called
dependent (denoted as (c1 , c2 ) ∼ (c1′ , c2′ )) if and only if
D(c1 ,c2 ),∈ ∩ D(c1′ ,c2′ ),∈ ≠ ∅.   (17)
Dependency of patterns signifies that there is at least one document in which
both patterns occur. Although it is possible to define alternative dependencies based
on weaker selection operators, we omit these definitions because such weaker dependencies are not used in our approach for text clustering. When we consider a
vector of patterns, we can construct a dependency matrix.
DEFINITION 14 (Dependency matrix). Assume a vector of patterns v. The dependency matrix for v is a |v| × |v| matrix Mv such that
$$M_v(i, j) = \begin{cases} 1 & \text{if } v_i \sim v_j, \\ 0 & \text{else.} \end{cases} \qquad (18)$$
When all patterns in v are mutually independent, Mv equals the identity matrix
I|v| . An important tool to study dependencies between patterns is implied by the
following theorem.
THEOREM 1. For a vector of patterns v and the dependency matrix Mv , there exists
a permutation of v, denoted v∗ , such that
$$\forall j \in \{1, \ldots, |v| - 1\} : \sum_{i=1}^{|v|} M_{v^*}(i, j) \geq \sum_{i=1}^{|v|} M_{v^*}(i, j + 1). \qquad (19)$$
Theorem 1 states that we can order a vector of patterns such that highly dependent
patterns are put first. An example of a dependency matrix Mv and the derived Mv∗
are shown in Figure 3, where dependence is represented by a white square and
independence is represented by a black square. The permutation of v ensures that
the number of independencies on the columns of Mv∗ increases from left to right. As
Mv∗ is a symmetric matrix, the same observation holds for rows from top to bottom.
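As an illustration (ours), the dependency matrix and the permutation of Theorem 1 can be computed from the document sets selected by the patterns:

    import numpy as np

    def dependency_matrix(selections):
        # selections[i] is the set of documents selected by pattern v_i.
        n = len(selections)
        M = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                M[i, j] = int(len(selections[i] & selections[j]) > 0)
        return M

    def sort_by_dependence(M):
        # The permutation v* of Theorem 1: order the patterns so that the column
        # sums (numbers of dependencies) decrease from left to right.
        order = np.argsort(-M.sum(axis=0), kind="stable")
        return M[np.ix_(order, order)], order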
Our definition of dependence considers (in)dependence as a Boolean matter.
When information is required about the extent to which two patterns are dependent,
a degree of dependence fulfills this need.
DEFINITION 15 (Degree of dependence). The degree of dependence for two patterns
(c1 , c2 ) and (c1′ , c2′ ) is defined as
$$\mathrm{dep}((c_1, c_2), (c_1', c_2')) = \frac{|D_{(c_1,c_2),\in} \cap D_{(c_1',c_2'),\in}|}{\min(|D_{(c_1,c_2),\in}|, |D_{(c_1',c_2'),\in}|)}. \qquad (20)$$

Figure 3. Dependency matrix Mv (left) and the derived Mv∗ (right).
The degree of dependence lies in the unit interval and satisfies
dep((c1 , c2 ), (c1′ , c2′ )) = dep((c1′ , c2′ ), (c1 , c2 )).   (21)

We can also see that:

(c1 , c2 ) ∼ (c1′ , c2′ ) ⇒ dep((c1 , c2 ), (c1′ , c2′ )) > 0.   (22)
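A one-line sketch of the degree of dependence (our illustration), given the two document sets selected by the patterns:

    def degree_of_dependence(docs_a, docs_b):
        # Equation (20): overlap of the two selections, normalized by the smaller one.
        if not docs_a or not docs_b:
            return 0.0
        return len(docs_a & docs_b) / min(len(docs_a), len(docs_b))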
4. DOCUMENT EVALUATION
We have introduced evaluators in Section 2. An evaluator E is a function that
compares two objects and models the uncertainty about the coreference of these
two objects. Uncertainty is hereby represented as a possibility distribution over B.
In this section, we aim at constructing such an evaluator for documents. According
to Definition 1, this should be a function ED :
ED : D2 → F (B).
(23)
To construct such a function, we shall adopt the relational transformation to work
with. This means that we shall consider an evaluator as follows:
EM(C ×C ) : M(C × C )2 → F (B).
(24)
This notation signifies that two documents are compared by comparing their relational transformations. In general, not all patterns in the relational transformation
are needed during comparison. In this paper, we wish to adopt a model such that
two documents are coreferent if they share sufficient relevant patterns. When using
such a model, we are faced with two key questions: (i) what are the relevant patterns
and (ii) how is the linguistic quantity “sufficient” to be interpreted? In response to
the first question, we can say that due to the specific properties of the SIE, we can
use multiplicity (i.e., frequency) of concepts and patterns as a measure for their relevance. The second question is more difficult to answer. We can see, however, that the
interpretation of “sufficient” depends on the definition of relevant patterns. Indeed,
if the interpretation of “relevant” is relaxed, then the interpretation of “sufficient”
should be strengthened. Therefore, we adopt the following approach. We will give
a fixed interpretation to the linguistic quantity “sufficient” and we shall adapt the
definition of relevant to this fixed quantity. This can be formalized as follows. Let
us assume that there exists a function T defined as follows:
T : M(C × C ) → M(C × C )
(25)
such that T (ψ(d)) represents the set of relevant patterns of a document d. If we
assume that the linguistic quantity “sufficient” is modeled by “at least one” we find
the following:
$$E_D(d_1, d_2) = \begin{cases} (1, 0) & \text{if } T(\psi(d_1)) \cap T(\psi(d_2)) \neq \emptyset, \\ (0, 1) & \text{else.} \end{cases} \qquad (26)$$
We now make an assumption about the function T . More specifically, we assume that
there exists a binary relation R (D) ⊂ C 2 such that
∀d ∈ D : T (ψ(d)) = ψ(d) ∩ R (D) .
(27)
This means that R (D) is a relation that contains all relevant patterns. We now see
that
(T (ψ(d1 )) ∩ T (ψ(d2 )) = ∅) ⇔ ((ψ(d1 ) ∩ R (D) ∩ ψ(d2 )) = ∅).
(28)
We can thus rewrite the evaluation of two documents as follows:
$$E_D(d_1, d_2) = \begin{cases} (1, 0) & \text{if } \exists (c_1, c_2) \in R^{(D)} : (d_1, d_2) \in D^2_{(c_1,c_2),\in}, \\ (0, 1) & \text{else.} \end{cases} \qquad (29)$$
This last expression reveals that coreferent documents can be found by using the
selection of documents through search patterns. Some important consequences are
hereby implied.
First, usage of a selection operator is efficient. More specifically, the examination
of all pairs of documents (which would have quadratic complexity) can be avoided.
Instead, we will provide an algorithm in Section 5 that directly computes clusters
of documents without explicit calculation of ED .
Second, when we want to construct such an algorithm, we must take into
account that for two patterns (c1 , c2 ) and (c1′ , c2′ ), the set
D(c1 ,c2 ),∈ ∩ D(c1′ ,c2′ ),∈
(30)
is not bound to be empty. This means that documents are not bound to belong to one
cluster. This seems to be a natural property for textual documents. However, in light
of the coreference problem, we shall adhere to the fact that ρ is indeed a function,
which implies that a document refers to exactly one topic. As a consequence,
our approach is an approximation of reality. The authors would like to stress that a
model where a document can belong to several clusters has certain advantages such
as a more natural distribution of documents over topics. For reasons of simplicity,
this more complex model is not adopted here.
Third, the replacement of ∈ by ∈̂ or ∈̃ leads to alternative evaluators for
documents.
DEFINITION 16 (op-evaluator). An op-evaluator for documents, with op ∈ {∈, ∈̂, ∈̃}, is defined as

$$E_D^{op} : D^2 \to F(B) : (d_1, d_2) \mapsto E_D^{op}(d_1, d_2) \qquad (31)$$

such that

$$E_D^{op}(d_1, d_2) = \begin{cases} (1, 0) & \text{if } \exists (c_1, c_2) \in R^{(D)} : (d_1, d_2) \in D^2_{(c_1,c_2),op}, \\ (0, 1) & \text{else.} \end{cases} \qquad (32)$$
For this family of operators, we can prove the following theorem.
THEOREM 2. Assume a document space D, then

$$\forall (d_1, d_2) \in D^2 : E_D^{\in}(d_1, d_2) \leq E_D^{\hat{\in}}(d_1, d_2) \leq E_D^{\tilde{\in}}(d_1, d_2). \qquad (33)$$
Proof. The proof follows from the fact that for a pattern (c1 , c2 ) ∈ C 2 we have that
$$D_{(c_1,c_2),\in} \subseteq D_{(c_1,c_2),\hat{\in}} \subseteq D_{(c_1,c_2),\tilde{\in}}. \qquad (34)$$
By using the family of op-evaluators, we can define a set of coreferent documents more generally as D(c1 ,c2 ),op . Theorem 2 can be used to predict the impact
of choosing a different operator op. We will address this issue when discussing
experimental results (Section 6).
5. THE CR ALGORITHM
In this section, we will introduce the CR algorithm, an algorithm that is based
on the relational transformation of textual documents. During the explanation of our
approach, we shall frequently use definitions and operators introduced in the previous section. Our algorithm departs from the assumption that we have made about
evaluators for documents. These assumptions imply that the coreference problem
for textual documents can be written in terms of a relation R (D) and a selection operator D(c1 ,c2 ),op . The relation R (D) is an unknown relation that represents the set of
relevant patterns. To construct the unknown relation R (D) , we suggest the following
approach. We will develop a procedure that constructs clusters with high precision
for a given relation R. More specifically, if a binary relation R is given as a candidate
for R (D) , our approach produces clusters of which the precision is sufficiently high.
Because of the fact that multiplicity is a measure for relevance, we shall consider the
several k-cuts of Ψ(D), i.e., the relational transformation of the whole data set, as
possible candidates for R (D) . Finally, to control recall, we shall adopt an algorithm
developed in Ref. 4 that estimates the number of clusters. The best candidate for
R (D) is then the relation for which the number of clusters best approximates the
estimation of the number of clusters.
As we have mentioned in the previous section, the usage of a selection operator
D(c1 ,c2 ),op for finding groups of coreferent documents implies that some documents
can be selected by several patterns. Our definition of coreference implies that clusters
behave as equivalence classes. To construct such classes, we shall sort the relevant
patterns. We then pass over the patterns in the given order and we select documents
through a selection operator D(c1 ,c2 ),op . Hereby, documents that are selected earlier
are not taken into account anymore. This way, we ensure that equivalence classes
are obtained.
Assume that a data set D is given. We can calculate the relational transformation
of D as Ψ(D), which is a multirelation over the concept space C . For a given
k ∈ N, the k-cut of this multirelation is a binary relation and is denoted as (Ψ(D))k .
Figure 4 shows a typical k-cut for a relatively high value of k. Hereby, green circles
represent concepts and a line between concepts denotes a relation between concepts.
A remarkable property that we observe is that, for sufficiently high values of k,
(Ψ(D))k typically consists of components in a graph-theoretical sense. In Figure 4,
the components are delimited with dashed lines. Components can be found in linear
time12 and form the basis for our approach. Each component consists of patterns,
and we denote these patterns as a vector v. A naive approach would be to produce
clusters as follows:
$$\bigcup_{(c_1,c_2) \in v} D_{(c_1,c_2),op}. \qquad (35)$$
Figure 4. A typical k-cut of Ψ(D).
This would be a very simple way of selecting documents and putting them in a
cluster. However, there is no verification about the coreference of the selected documents. If we were able to perform such a verification, a simple way of producing
clusters would become available. It appears that the mentioned verification can be
done by analysis of dependencies. In particular, we shall focus on dependencies
within a component and dependencies between components. Both types are discussed in the following subsections. After that, we use the analysis of dependencies
to construct the CR algorithm.
5.1. Dependencies within Components
Suppose that we have a data set of documents D and suppose that we take
a k-cut of Ψ(D), denoted as (Ψ(D))k . An arbitrary component of this k-cut can
be represented as a vector v. For each pattern (c1 , c2 ) in this vector, D(c1 ,c2 ),op
represents the selection of documents when (c1 , c2 ) is used as a search pattern. Let
us now suppose that we can sort the patterns in v, such that
v = v(1) ⊕ v(2) ,
(36)
where ⊕ represents the concatenation operator for vectors and such that
$$\forall i \in \{1, \ldots, |v^{(1)}|\} : \forall j \in \{1, \ldots, |v^{(2)}|\} : \neg\big(v^{(1)}_i \sim v^{(2)}_j\big). \qquad (37)$$
This means that patterns in v(1) are independent of patterns in v(2) : there is no
document in D that contains a pattern from v(1) and a pattern from v(2) . In light
of our considerations about an evaluator for documents, this means that documents
under the selection of patterns from v(1) are not coreferent with documents under
the selection of patterns from v(2) . We can thus argue that, given a vector of patterns
v, the set of documents
$$\bigcup_{(c_1,c_2) \in v} D_{(c_1,c_2),op} \qquad (38)$$
contains coreferent documents if all patterns in v are mutually dependent. To make
our approach more robust, we shall replace the quantifier "all" by a fuzzy quantifier
"most",16 which leads to the constraint that most patterns in v must be mutually
dependent. To summarize, if we are given a component with patterns v, we must
split up that component into maximal subcomponents such that most patterns of
each subcomponent are mutually dependent. Algorithm 1 shows how such a split can
be obtained.
It can be seen that Algorithm 1 has quadratic complexity in the worst case
(O(|v|2 )). This is one of the reasons why k-cuts are first decomposed into components: the size of components is typically sufficiently small for acceptable efficiency.
Algorithm 1 split(Mv∗ )
Require: split factor sf ∈ [0, 1]
1: for j ∈ {1, ..., |v∗ |} do
2:    x ← Σ_{k=1}^{|v∗|} Mv∗ (j, k)
3:    if x/|v∗ | < sf then
4:        return {v∗(1,...,j) } ∪ {split(Mv∗(j+1,...,|v∗ |) )}
5:    end if
6: end for
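A runnable Python rendering of Algorithm 1 (our own sketch; the default split factor and the handling of a component that never triggers a split are assumptions, as the paper leaves them implicit):

    import numpy as np

    def split(M, sf=0.5):
        # Split a component, given the dependency matrix M of its sorted patterns v*,
        # into subcomponents in which most patterns are mutually dependent.
        # Returns lists of indices into v*, one list per subcomponent.
        n = M.shape[0]
        if n == 0:
            return []
        for j in range(n):
            x = M[j, :].sum()          # number of patterns on which pattern j depends
            if x / n < sf:             # too many independencies: split after position j
                rest = M[np.ix_(range(j + 1, n), range(j + 1, n))]
                return [list(range(j + 1))] + [[i + j + 1 for i in sub] for sub in split(rest, sf)]
        return [list(range(n))]        # assumption: no split needed, keep the whole component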
5.2. Dependencies between Components
In the same way that strong independencies within a component require a split
of the component, strong dependencies between components reveal that the documents corresponding with these components are in fact coreferent. Therefore, we
generalize the definition of degree of dependence toward vectors of patterns:
$$\mathrm{dep}(v, v') = \frac{\left|\bigcup_{i=1}^{|v|} D_{v_i,\in} \cap \bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right|}{\min\left(\left|\bigcup_{i=1}^{|v|} D_{v_i,\in}\right|, \left|\bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right|\right)}. \qquad (39)$$
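Generalizing the earlier sketch, the degree of dependence between two components follows Equation (39) (our illustration, taking the lists of document sets selected by their patterns):

    def degree_of_dependence_vectors(selections_v, selections_w):
        # Equation (39): union the selections of each component and compare the
        # overlap of the two unions to the smaller union.
        union_v = set().union(*selections_v) if selections_v else set()
        union_w = set().union(*selections_w) if selections_w else set()
        if not union_v or not union_w:
            return 0.0
        return len(union_v & union_w) / min(len(union_v), len(union_w))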
We will reason that if a component has a strong dependency with a bigger component
(i.e., a component with more patterns), we shall ignore the smaller component.
The fact that we choose to ignore a component rather than merge two
components can be explained as follows. On the one hand, a strong dependency
between two components with patterns v and v′ indicates by definition that the set
$$\bigcup_{i=1}^{|v|} D_{v_i,\in} \cap \left(\bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right) \qquad (40)$$
contains a large number of documents. Thus, we know a priori that the contribution
of the smaller component is rather low. On the other hand, due to the quadratic
complexity of Algorithm 1, the number of patterns to be analyzed with Algorithm 1
is best kept low.
5.3. Production of Clusters Based on (Ψ(D))k
In this section, we provide an algorithm to compute clusters when a k-cut of
Ψ(D) is given as a candidate for R (D) . This means that patterns in (Ψ(D))k are considered to be relevant patterns. Algorithm 2 summarizes the computation of clusters
in pseudo code, taking into account the observations regarding (in)dependencies.
First, a cut of the relational transformation of D is taken (line 2). This cut is
then split into maximal connected components (line 3). The set of components is
Algorithm 2 produce(D,k)
Require: delete factor df ∈ [0, 1]
1: K ← ∅
2: Rk(D) ← (Ψ(D))k
3: Rk ← components(Rk(D))
4: while Rk ≠ ∅ do
5:    v ← arg min_{v∈Rk} |v|
6:    if ¬∃v′ ∈ Rk : dep(v, v′ ) > df then
7:        (v(1) , ..., v(l) ) ← split(Mv∗ )
8:        for j ∈ {l, ..., 1} do
9:            cluster ← ∪_{(c1 ,c2 )∈v(j)} D(c1 ,c2 ),op
10:           add(K, cluster)
11:           D ← D \ cluster
12:       end for
13:   end if
14:   Rk ← Rk \ v
15: end while
16: return K
denoted as Rk . The components are then sorted according to their size (i.e., the
number of patterns). Smaller components are treated first because they represent a
topic for which little information is available. These topics are thus the hardest to
identify. For each component, it is checked whether a larger component exists with
which the selected component has a strong dependency (line 6). When no strong
dependencies are found, the component is split into subcomponents (line 7). For
each subcomponent, a cluster is composed as the union of the selections of the patterns
in the subcomponent. Documents from the new cluster are deleted from D to ensure
that clusters are mutually disjoint.
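Putting the pieces together, the following is a loose Python sketch of Algorithm 2 (our rendering, reusing the helpers sketched above; the treatment of bigger components and the default factors df and sf are our assumptions):

    def produce(docs, components, select, df=0.5, sf=0.5):
        # docs: set of document identifiers; components: list of pattern lists, one
        # per connected component of the k-cut; select(pattern): the chosen selection
        # operator D_{(c1,c2),op}, returning a set of documents.
        clusters = []
        remaining = set(docs)
        for comp in sorted(components, key=len):          # smallest components first
            sel = [select(p) & remaining for p in comp]
            # ignore a component that depends strongly on a bigger one (Section 5.2)
            bigger = [c for c in components if len(c) > len(comp)]
            if any(degree_of_dependence_vectors(sel, [select(p) & remaining for p in b]) > df
                   for b in bigger):
                continue
            # split the component and turn each subcomponent into a cluster (Section 5.1)
            M, order = sort_by_dependence(dependency_matrix(sel))
            for sub in reversed(split(M, sf)):
                cluster = set().union(*(sel[order[i]] for i in sub)) if sub else set()
                cluster &= remaining
                if cluster:
                    clusters.append(cluster)
                    remaining -= cluster
        return clusters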
5.4. Optimizing k
With the ability to compute clusters based on a k-cut of Ψ(D), a question
left unanswered is how to choose the best k-cut. Therefore, we adopt the following
reasoning. On the one hand, for a given k-cut, we compute clusters such that the
precision of clusters is sufficiently high. That is, by choosing groups of patterns with
a high ratio of mutual dependencies, we group documents that share a sufficient
amount of relevant patterns. Relevancy of the patterns hereby stems from the fact
that the selected patterns have a high multiplicity in Ψ(D). If not, they would not be
part of the k-cut. On the other hand, we should also be able to perform a control of
recall. Recall is a measure for the completeness of clusters, i.e., it measures to what
extent all documents about one topic are grouped in one cluster. To ensure recall,
we state that the best k-cut of Ψ(D) is the one for which the number of clusters best
approaches the number of topics. To do this, we adopt a method for the estimation of
the number of topics.4 If we denote the set of topics as T = {t1 , ..., tm }, the method
in Ref. 4 provides us with an estimate of m, denoted m̂.
Algorithm 3 CR(D)
Require: cut factor cf ∈ N
1: m̂ ← estimate(D)
2: stop ← false
3: K ← ∅
4: while ¬stop do
5:    koptimal ← arg min_k | |produce(D, k)| + |D \ Dproduce(D,k) | − m̂ |
6:    if koptimal > cf then
7:        K ← K ∪ produce(D, koptimal )
8:        D ← D \ DK
9:        stop ← |K| + |D| < m̂
10:   else
11:       stop ← true
12:   end if
13: end while
14: ∀d ∈ D : K ← K ∪ {d}
15: return K
Algorithm 3 shows how the CR algorithm estimates the number of topics
(line 1), to find an optimal k (line 5). The set K represents the set of clusters. It
is possible that the optimal k-cut leads to a number of clusters that is higher than
the estimated number of clusters. When this situation occurs, the whole procedure
is repeated on the remaining documents (line 9). The algorithm stops when (a) the
number of clusters becomes smaller than m̂ or (b) koptimal is not high enough. The
second stop condition is explained by noting that patterns should have sufficient
multiplicity. If not, they cannot be considered relevant anymore.
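A sketch of the optimization of k (ours; produce_at_k wraps the cluster production for a given k, m_hat stands for the estimate produced by the method of Ref. 4, and the candidate range for k is an arbitrary choice):

    def choose_k(docs, produce_at_k, m_hat, candidates=range(1, 20)):
        # Pick the k whose clustering best approximates the estimated number of
        # topics m_hat, counting unclustered documents as singleton clusters
        # (line 5 of Algorithm 3).
        def deviation(k):
            clusters = produce_at_k(k)
            clustered = set().union(*clusters) if clusters else set()
            return abs(len(clusters) + len(set(docs) - clustered) - m_hat)
        return min(candidates, key=deviation)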
6. EXPERIMENTAL RESULTS
In this section, a comparative study between the proposed method and existing
methods is conducted and reported. The data set used is a collection of 550 Dutch
documents collected on news sites and manually labeled with a topic. In total, 134
different topics are considered. If we calculate (Ψ(D))1 , we find that there are 73,681
different patterns. In total, there are 15,310 unique concepts. It can be seen that the
1-cut is a dense structure with many relations between the concepts. This observation
agrees with the fact that textual documents typically contain an overwhelming
amount of information, where most of this information is not relevant with respect
to the topic of the document. An interesting trend is shown in Figure 5. Here, the
average number of patterns per component is shown as a function of k. We can clearly
see that, as k increases, the components contain fewer patterns on average. This is
an important observation, considering the quadratic complexity of Algorithm 1. We
Figure 5. Average number of patterns per component as a function of k.
observe that for increasing k, the resulting k-cut becomes a sparse structure, which
is a promising result, considering the fact that classical vector spaces rapidly suffer
from high dimensionality in the case of text clustering.
The results of an algorithm are reported as follows. In each experiment, a set
of documents D is assumed. After execution of an algorithm, a set of clusters is
available. These clusters represent a partition of D. For each combination of topic
ti and cluster Kj , the precision p(ti , Kj ) and the recall r(ti , Kj ) are calculated as
follows:
$$p(t_i, K_j) = \frac{n_{i,j}}{|K_j|}, \qquad (41)$$

$$r(t_i, K_j) = \frac{n_{i,j}}{\mathrm{docs}(t_i)}, \qquad (42)$$
where ni,j is equal to the number of documents about topic ti in cluster Kj and
docs(ti ) is equal to the number of documents about topic ti in D. A balance between
precision and recall is given by their harmonic mean. This number is called the
f -value and is calculated as follows:
$$f(t_i, K_j) = \frac{2\, p(t_i, K_j)\, r(t_i, K_j)}{p(t_i, K_j) + r(t_i, K_j)}. \qquad (43)$$
The accuracy of a cluster algorithm can then be expressed as a linear combination
of the maximal f -values for the individual topics:
$$f = \sum_i \frac{\mathrm{docs}(t_i)}{|D|} \max_j f(t_i, K_j). \qquad (44)$$
To better understand the composition of clusters, we also report the overall precision
of a set of clusters, also called purity:
$$p = \sum_j \frac{|K_j|}{|D|} \max_i p(t_i, K_j). \qquad (45)$$
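The evaluation measures can be computed as follows (our sketch; it assumes the clusters form a partition of D and that every cluster is nonempty):

    from collections import Counter

    def purity_and_f(clusters, topic_of):
        # clusters: list of sets of document ids; topic_of: document id -> topic.
        docs = [d for K in clusters for d in K]
        n = len(docs)
        topic_sizes = Counter(topic_of[d] for d in docs)
        # Overall purity (Equation 45).
        purity = sum(max(Counter(topic_of[d] for d in K).values()) for K in clusters) / n
        # Weighted maximal f-value (Equation 44).
        f = 0.0
        for topic, size in topic_sizes.items():
            best = 0.0
            for K in clusters:
                n_ij = sum(1 for d in K if topic_of[d] == topic)
                if n_ij:
                    p, r = n_ij / len(K), n_ij / size
                    best = max(best, 2 * p * r / (p + r))
            f += size / n * best
        return purity, f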
6.1. Basic Methods on Vector Spaces
To compare the CR algorithm with existing approaches, we adopt a recent
overview paper on text clustering.17 This paper mentions some baseline algorithms
for text clustering and discusses the important preprocessing steps. The preprocessing steps we apply are length filtering, stopword removal, stemming, frequency
filtering, and principal component analysis (PCA). The stemming is done by using a
Snowball stemmer. PCA is performed by choosing components that together explain
95% of the total variance. We consider two ways of constructing vectors from textual
documents: binary vectors that indicate word occurrence and TFIDF vectors.13 On
the resulting vector spaces, we apply the k-means clustering algorithm20 and the hierarchical clustering algorithm.11 For hierarchical clustering, we consider both single
linkage and full linkage. Note that for both k-means clustering and hierarchical clustering, the number of clusters is assumed to be known. Therefore, we study the evolution of both the f -value and the purity in terms of a varying number of clusters.
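For reference, a scikit-learn based sketch of the TFIDF baseline (our illustration; the paper's exact preprocessing, such as Dutch stopword removal, Snowball stemming, and length and frequency filtering, is not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def baseline_kmeans(texts, n_clusters):
        X = TfidfVectorizer().fit_transform(texts).toarray()
        X = PCA(n_components=0.95).fit_transform(X)   # keep 95% of the total variance
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)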
Figure 6 shows the results for k-means clustering. The accuracy of the k-means
algorithm is very low for the given data set of documents. Note that we have also
Figure 6. Purity and f -value for k-means clustering with binary vectors (left) and TFIDF vectors
(right).
Figure 7. Purity and f -value for single linkage with binary vectors (left) and TFIDF vectors
(right).
Figure 8. Purity and f -value for full linkage with binary vectors (left) and TFIDF vectors (right).
experimented with kernel k-means clustering, using the cosine kernel function. The
results of this k-means variant were similar to the ones shown in Figure 6 and are,
therefore, omitted here. A better approach is the hierarchical clustering algorithm,
especially under the condition of full linkage. The results are shown in Figures 7
and 8.
In general, we conclude that TFIDF vectors give a slightly better result than
binary vectors, which is not surprising. For each tested algorithm, we shall keep the
maximal f -value that is obtained by modifying the number of clusters and we shall
compare this maximal f -value with the novel CR algorithm.
6.2. Latent Dirichlet Allocation
Next to the basic algorithms that are reported in Ref. 17, we have tested
an advanced and quite recent statistical method called latent Dirichlet allocation
(LDA).5 LDA starts from the idea that the topic of a document is in general not
clearly measurable. It therefore models the topic of a document as a probability
distribution over the topic space T . Each topic in T is, in turn, modeled as a
probability distribution over a word space, denoted as W . We explicitly use a
different notation to indicate that concepts in C are the result of SIE parsing. LDA
generates the word space W in a more classical way. The adopted probability
distribution is the Dirichlet distribution. LDA is closely related to latent semantic
indexing (LSI):19 the only difference lies in the prior distribution that is assumed. We
remark that LDA does not adopt a vector space. Instead, it uses the bag-of-words
model. For this reason, PCA is not performed as a preprocessing step. All other
preprocessing steps that are used in the case of basic methods, are applied in the
case of LDA. Again, the number of clusters is assumed to be known, so we report
the evolution of purity and f -value for a varying number of clusters. The results
are shown in Figure 9.
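For comparison, a scikit-learn based sketch of an LDA baseline (ours): each document is assigned to its most probable topic and these assignments are read as clusters.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def baseline_lda(texts, n_topics):
        X = CountVectorizer().fit_transform(texts)                 # bag-of-words counts
        theta = LatentDirichletAllocation(n_components=n_topics).fit_transform(X)
        return theta.argmax(axis=1)                                # hard cluster per document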
Comparison of the results of LDA with the results of the basic methods reveals
that LDA outperforms these methods. A remarkable difference between LDA and
the best performing basic method (i.e., the hierarchical algorithm adopting the full
linkage rule and TFIDF vectors) is the stability with respect to a variation in the
number of clusters. Indeed, the right panel of Figure 8 shows that there is an optimal
number of clusters for hierarchical clustering, whereas this is not the case with LDA
(Figure 9).
6.3. The CR Algorithm
In this section, we compare the novel CR algorithm with the tested algorithms from the literature. As can be seen in Algorithm 2 (line 9), the selection operator
Figure 9. Purity and f -value for LDA.
Table I. Comparison of the CR algorithm with existing methods.
Method           Purity    f-Value    Number of clusters
CR(1)            0.9508    0.6240     348
CR(2)            0.9326    0.8036     219
CR(3)            0.9399    0.8288     205
k-Means          0.6904    0.5731     100
Single linkage   0.8925    0.4714     390
Full linkage     0.8900    0.6927     240
LDA              0.8452    0.8107      92
D(c1 ,c2 ),op is a general one, where op ∈ {∈, ∈̂, ∈̃}. We therefore test three versions of our algorithm. We denote CR(1) as the CR algorithm that uses a relational
selection (i.e., D(c1 ,c2 ),∈ ), CR(2) as the CR algorithm that uses a conceptual selection
(i.e., D(c1 ,c2 ),∈̂ ), and CR(3) as the CR algorithm that uses a possibilistic selection
(i.e., D(c1 ,c2 ),∈̃ ). For CR(3) , we require an evaluator to compare concepts. Such an
evaluator is introduced in Refs. 2 and 3. We refer to those works for more information about the evaluators used for concepts. We report that the algorithm used for
estimation of the number of clusters4 provides us with an estimate of 137.
To make a comparison with existing methods, we report the maximal f -value
that is obtained by varying the number of clusters for existing methods. These
maximal f -values, together with the results of the CR algorithm, are summarized
in Table I. For the basic methods, we only report the results obtained with TFIDF
vectors, as these vectors have been shown to provide better results than binary vectors.
Table I also reports the actual number of clusters that is obtained. Let us now discuss
the most important observations that are revealed by Table I.
First, it can be seen that the CR algorithm results in a remarkably high purity.
This means that, regardless of the selection operator used, the clusters produced by
the CR algorithm mostly contain documents that are coreferent. The quality of the
individual clusters is thus very high. It can be seen that the CR algorithm outperforms
all the tested methods with respect to purity. This result shows that the analysis of
dependencies indeed allows for production of clusters with high precision. Second,
when studying the impact of the selection operator, we observe that the purity is
slightly higher for CR(1) . However, the f -value is significantly lower for CR(1) ,
which is explained by the fact that CR(2) and CR(3) yield a significantly higher recall.
This can be seen by consulting the obtained number of clusters. The observed
trends in purity, recall and f -value are not surprising, considering Theorem 2. In
general, we observe that CR(1) is too strict for the selection of documents, which
is why CR(2) or CR(3) should be used. Third, we can see that there is a significant
improvement in the f -value with respect to the basic methods that adopt a vector
space. Finally, although there is only a slight difference between LDA and CR with
respect to f -value, there are clearly some differences between both approaches. We
can, for example, see that the purity of CR is much higher than the purity of LDA.
Considering the small difference in f -value, this means that the recall of LDA must
be higher than the recall of CR. This is in fact confirmed by the difference in the
Figure 10. High purity (left) versus high recall (right) for equal f -value.
number of clusters (219 and 205 for CR(2) and CR(3) versus 92 for LDA). The
authors consider the emphasis on purity to be a benefit of the CR algorithm. For
example, when text clustering is used in a context of text summarization, clusters
with high purity will result in summaries of high quality. A lower recall can imply
that several summaries deal with the same topic. However, when the purity of
clusters decreases, the quality of summaries will decrease because of the fact that
several topics are represented in the same cluster.
Figure 10 shows two sets of clusters, both involving 40 samples and 3 topics. For
both sets of clusters, the f -value is the same. The left panel shows a set of clusters
for which the purity is high. All clusters contain documents that mostly refer to
the same topic. The right panel shows a set of clusters for which the recall is high.
All documents of one topic are mostly grouped into one cluster. However, the main
topic of a cluster is not easily identified. In a context of text summarization, such
clusters will result in summaries in which several topics are discussed.
7. CONCLUSION
In this paper, we have introduced a new model for textual documents, called
the relational document model. This model adopts the theory of multisets and
multirelations to represent a document. A textual document is transformed into this
relational model by using the SIE, a patented technology of the Belgian company
i.Know. The SIE automatically decomposes a text into concepts and relations. Each
concept and each relation is a substring of the original text and is extracted from
the text without prior knowledge such as a taxonomy or an ontology. Next, we use
the new text model for construction of the CR algorithm, which is an algorithm that
clusters textual documents according to the topic that is referred to by a document.
We hereby rely on the property that multiplicity (i.e., frequency) of (couples of)
concepts is a good measure for the relevance of (couples of) concepts. Given a set
of documents, the CR algorithm tries to construct a relation R (D) that contains the
relevant couples of concepts with respect to the topics that are described. On the
basis of such a relation, clusters are produced by use of a simple selection operator.
It is shown that the CR algorithm tries to optimize both precision (purity) and recall.
However, experimental results show that in practice, priority is given to precision.
These experiments show that the CR algorithm offers a major improvement with
respect to some basic cluster algorithms. In comparison with advanced methods
such as LDA, there is a major improvement with respect to purity.
References
1. Bronselaer A, Hallez A, De Tré G. A possibilistic view on set and multiset comparison. Control Cybern 2009;38(2):341–366.
2. Bronselaer A, De Tré G. A possibilistic approach on string comparison. IEEE Trans Fuzzy Syst 2009;17(1):208–223.
3. Bronselaer A, De Tré G. Properties of possibilistic string comparison. IEEE Trans Fuzzy Syst 2010;18(2):312–325.
4. Bronselaer A, Debergh S, Van Hyfte D, De Tré G. Estimation of topic cardinality in document collections. In: Towards natural language based data/text mining and summarization via soft approaches. Philadelphia, PA: SIAM; 2010. pp 31–39.
5. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. J Mach Learn Res 2003;3:993–1022.
6. Dubois D, Prade H. Possibility theory. Plenum Press; 1988.
7. Hisdal E. Conditional possibilities independence and noninteraction. Fuzzy Sets Syst 1978;1:283–297.
8. Shackle G. Decision, order and time in human affairs. Cambridge, UK: Cambridge University Press; 1961.
9. Salton G, Wong A, Yang C. A vector space model for automatic indexing. Commun ACM 1975;18(11):613–620.
10. De Cooman G. Towards a possibilistic logic. In: Ruan D, editor. Fuzzy set theory and advanced mathematical applications. Boston, MA: Kluwer Academic; 1995. pp 89–133.
11. Ward J. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 1963;58(301):236–244.
12. Hopcroft J, Tarjan R. Efficient algorithms for graph manipulation. Commun ACM 1973;16:372–378.
13. Jones K. A statistical interpretation of term specificity and its application in retrieval. J Doc 1972;28(1):11–21.
14. McKeown K, Passonneau R, Elson D, Nenkova A, Hirschberg J. Do summaries help? A task-based evaluation of multi-document summarization. In: Proc 28th Annu Int ACM SIGIR Conf Res Development in Information Retrieval. New York: ACM; 2005. pp 210–217.
15. Zadeh L. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1978;1:3–28.
16. Zadeh L. A computational approach to fuzzy quantifiers in natural languages. Comput Math Appl 1983;9:149–184.
17. Andrews N, Fox E. Recent developments in document clustering. Technical Report TR-07-35, Department of Computer Science, Virginia Tech, Blacksburg, VA; 2007.
18. Yager R. On the theory of bags. Int J Gen Syst 1986;13(1):23–27.
19. Deerwester S. Improving information retrieval with latent semantic indexing. In: Proc 51st Annu Meeting Am Soc Information Science. Maryland; 1988. pp 36–40.
20. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory 1982;28(2):129–137.