Concept-Relational Text Clustering
Antoon Bronselaer,∗ Guy De Tré†
Department of Telecommunications and Information Processing,
Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
The ongoing exponential growth of online information sources has led to a need for reliable
and efficient algorithms for text clustering. In this paper, we propose a novel text model called
the relational text model that represents each sentence as a binary multirelation over a concept
space C. Through usage of the smart indexing engine (SIE), a patented technology of the Belgian
company i.Know, the concept space adopted by the text model can be constructed dynamically. This
means that there is no need for an a priori knowledge base such as an ontology, which makes
our approach context independent. The concepts resulting from the SIE possess the property that
the frequency of a concept is a measure of its relevance. We exploit this property in the development
of the CR-algorithm. Our approach relies on the representation of a data set D as a multirelation,
of which k-cuts can be taken. These cuts can be seen as sets of relevant patterns with respect to
the topics that are described by the documents. Analysis of dependencies between patterns allows
us to produce clusters such that precision is sufficiently high. The best k-cut is the one that best
approximates the estimated number of clusters, to ensure recall. Experimental results on Dutch
news fragments show that our approach outperforms both basic and advanced methods. © 2012
Wiley Periodicals, Inc.
1. INTRODUCTION
In the past decades, text clustering has been a challenging problem for researchers in both computational linguistics and data mining. Several applications
such as automated text summarization and automated deduplication of textual data
sources rely on text clustering approaches. The key problem—deciding whether two
textual documents are dealing with the same topic—is far from trivial. In addition,
the ongoing exponential growth of online information sources has created a need for
efficient algorithms. To illustrate this, Figure 1 shows the number of really simple
syndication (RSS) feeds on news sites, generated on a daily basis in Belgium and
the Netherlands.
Taking into account that the number of RSS feeds produced by news sites is
a lower bound for the number of articles produced on these news sites, Figure 1
shows that manual tracking of all news published on a daily basis is infeasible.
∗ Author to whom all correspondence should be addressed: e-mail: antoon.bronselaer@ugent.be
† e-mail: [email protected]
Figure 1. Accumulated number of RSS feeds produced on a daily basis by 10 Belgian and Dutch
news sites, measured during one month.
This is partially due to the fact that, when multiple sources of information on the Web
are tracked, much of the information provided by these sources is repeated. If such
duplicate information could be removed, a manual overview of the information would
become more feasible. A solution to this problem is offered
by text clustering methods. The need for good clustering algorithms is emphasized
in Ref. 14 in a context of automated text summarization.
In this paper, we propose a new model for the representation of textual documents as an alternative to the vector space model.9 Whereas the vector space
model uses a naive strategy for the decomposition of text into a vector of words, the
relational model relies on a decomposition into concepts. Moreover, our model takes
into account the relations between concepts. The new model offers several benefits.
First, a semantically rich representation of textual documents is obtained that can be
processed efficiently by using the standard operators of multirelations.18 Second,
relevant information can be identified by thresholding the frequency of (couples of)
concepts. Moreover, the results of such thresholding are extremely sparse structures.
This is in contrast with the issues of high dimensionality that are present in vector
spaces. Third, the identification of concepts and relations does not rely on ontologies
or other a priori knowledge, which makes our approach context independent.
The remainder of the paper is structured as follows. In Section 2, some preliminary definitions regarding multirelations, possibility theory, and evaluators are
given. Section 3 introduces the new text model, called the relational text model,
which relies on the theory of multirelations. Section 4 discusses the evaluation of
documents and introduces a set of assumptions. As a result, the problem of text
clustering can be rewritten in terms of an unknown binary relation. Section 5 adopts
this problem formulation and deals with the estimation of the unknown relation.
Moreover, it is studied how clusters of high quality can be deduced from a candidate relation. From these considerations, the concept-relational (CR) algorithm is
formulated. To test our approach, some experimental results are reported in Section 6. Finally, Section 7 summarizes the most important contributions of this paper.
2. PRELIMINARIES
2.1. Multisets
In what follows, the concept of a multiset will be used intensively. Informally, a
multiset M is an extension of a (Cantorian) set in which elements can occur multiple
times. Formally, a multiset M derived from a universe U is defined by a counting
function M : U → N.18 For u ∈ U , M(u) represents the number of times that u
appears in M. The set of all multisets drawn from a universe U is denoted M(U ).
The concept of subset is extended to multisets as M1 ⊆ M2 ⇔ ∀u ∈ U : M1 (u) ≤ M2 (u),
and the cardinality of a multiset M is given by $|M| = \sum_{u \in U} M(u)$. Yager18
defines the following operators for multisets:
∀u ∈ U : (M1 ∪ M2 )(u) = max(M1 (u), M2 (u)),
∀u ∈ U : (M1 ∩ M2 )(u) = min(M1 (u), M2 (u)),
∀u ∈ U : (M1 ⊕ M2 )(u) = M1 (u) + M2 (u).
The ∈-operator applies for multisets as follows: u ∈ M ⇔ M(u) > 0. The k-cut of
a multiset M is a regular set Mk = {u|u ∈ U ∧ M(u) ≥ k}.
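For illustration, these operators can be mimicked in Python with collections.Counter, which already acts as a counting function (an illustrative sketch of ours, not an implementation used in the paper):

    from collections import Counter

    def m_union(m1, m2):
        # (M1 ∪ M2)(u) = max(M1(u), M2(u))
        return Counter({u: max(m1[u], m2[u]) for u in set(m1) | set(m2)})

    def m_intersection(m1, m2):
        # (M1 ∩ M2)(u) = min(M1(u), M2(u))
        return Counter({u: min(m1[u], m2[u]) for u in set(m1) & set(m2)})

    def m_sum(m1, m2):
        # (M1 ⊕ M2)(u) = M1(u) + M2(u)
        return m1 + m2

    def k_cut(m, k):
        # The k-cut of M: the regular set of elements with multiplicity at least k.
        return {u for u, count in m.items() if count >= k}

For example, k_cut(Counter({'rain': 3, 'snow': 1}), 2) yields {'rain'}.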
2.2. The Coreference Problem
In the remainder of this paper, let D = {d1 , ..., dn } denote a set of documents
and let T = {t1 , ..., tm } denote a set of topics. It is assumed that each document can
be assigned exactly one topic, given by the surjective function ρ : D → T . Two
arbitrary documents di and dj are called coreferent if and only if ρ(di ) = ρ(dj ),
i.e., if they refer to the same topic.
In a more general setting, where two objects are not bound to be textual
documents, the uncertainty about coreference for two objects can be expressed as a
possibility distribution6–8, 10, 15 over the Boolean domain B = {T , F }. A possibility
distribution π over an outcome space X can be derived from a possibility measure,
much like a probability distribution Pr can be derived from a probability measure.
It satisfies the following normalization constraint:
$$\max_{x \in X} \pi(x) = 1. \qquad (1)$$
This implies that a possibility distribution is modeled by a fuzzy set, which is why
we denote the set of all possibility distributions over X as F (X ). A possibility distribution for which X = B is called a possibilistic truth value, and a function that
generates such a truth value in the case of coreference is called an evaluator.1
DEFINITION 1 (Evaluator). Given an object space O, an evaluator is defined as a
function:
EO : O 2 → F (B) : (o1 , o2 ) → {(T , μp (T )), (F, μp (F ))}.
(2)
Hereby, μp (T ) represents the possibility that o1 and o2 are coreferent and μp (F )
represents the possibility that o1 and o2 are not coreferent. F (B) represents the set
of all possibility distributions over B.
In the remainder of this paper, we shall adopt the couple notation for possibilistic truth values. This means that {(T , μp (T )), (F, μp (F ))} is abbreviated to
(μp (T ), μp (F )). We will assume that an evaluator is both reflexive:
∀(o1 , o2 ) ∈ O 2 : o1 = o2 ⇒ EO (o1 , o2 ) = (1, 0)   (3)

and symmetric:

∀(o1 , o2 ) ∈ O 2 : EO (o1 , o2 ) = EO (o2 , o1 ).   (4)
On the basis of the possibilities that two objects are (not) coreferent, a Boolean
decision can be taken by use of a binary decision model.
DEFINITION 2 (Binary decision model). Assume z ∈ [0, 1]. A binary z-decision
model is a function B : F (B) → B such that:
$$B(p) = \begin{cases} F & \text{if } 1 - \mu_p(T) > z, \\ T & \text{else.} \end{cases} \qquad (5)$$
The value z is called the certainty threshold.
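As a minimal sketch (ours), a possibilistic truth value can be represented by the couple (μp(T), μp(F)) and the z-decision model by a threshold test; the equality-based evaluator below is only a placeholder and not the string evaluator of Refs. 2 and 3:

    from typing import Tuple

    PTV = Tuple[float, float]  # couple notation (mu_p(T), mu_p(F)) of a possibilistic truth value

    def evaluate_equality(o1, o2) -> PTV:
        # Placeholder evaluator: reflexive and symmetric, (1, 0) for equal objects.
        return (1.0, 0.0) if o1 == o2 else (0.0, 1.0)

    def decide(p: PTV, z: float = 0.0) -> bool:
        # Binary z-decision model (Equation 5): answer F only when the certainty that
        # the objects are not coreferent, 1 - mu_p(T), exceeds the threshold z.
        mu_T, _mu_F = p
        return not (1.0 - mu_T > z)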
3. A RELATIONAL MODEL FOR TEXT DOCUMENTS
In this section, we introduce a relational model for textual documents that is
based on the concept of multirelations. We define operators for the relational model.
These operators will be used to define a new approach to text clustering. To define
our model, we first define binary multirelations. Within the scope of this paper, it is
assumed that relations are homogeneous.
DEFINITION 3 (Binary multirelation). Consider a universe U . A binary multirelation
R over U is defined as a subset of M(U × U ).
A binary multirelation is thus a multiset of couples. The number of times that
a couple occurs in the multirelation is called the multiplicity of the couple. A binary
multirelation R over U is symmetric if and only if
∀(u1 , u2 ) ∈ U 2 : R(u1 , u2 ) = R(u2 , u1 ).
(6)
A binary multirelation R is called k-transitive if and only if the regular relation Rk
is transitive, with Rk the k-cut of R. A binary multirelation R is called k-reflexive if
and only if Rk is reflexive. A binary multirelation R is irreflexive if and only if
∀u ∈ U : R(u, u) = 0.
(7)
With the definition of multirelations at hand, it is possible to define the relational
model for documents.
DEFINITION 4 (Relational model for documents). Assume a concept space C . A
document d is defined as a set of l sentences S = {s1 , ..., sl }, where each sentence
si ⊂ M(C × C ) is a binary, irreflexive, symmetrical, and 1-transitive multirelation
over C . The set of all documents is denoted as D.
In what follows, we shall always assume that multirelations are binary. For
that reason, we omit the adjective “binary” in the following. The relational model
for documents considers a document as a set of sentences, where each sentence is
a multirelation over concepts. Each concept is a collection of words that together
constitute a meaningful, semantical unit. An important observation is that we can
represent the whole document as a multirelation by taking the relational transformation of the document.
DEFINITION 5 (Relational transformation). For a universe of documents D, the
relational transformation is defined as a function:
$$\psi : D \to M(C \times C) : d \mapsto \bigoplus_{i=1}^{|S|} s_i, \qquad (8)$$
where ⊕ is the sum operator for multisets, used in prefix notation. The relational
transformation ψ(d) is equal to the sum of all sentences.
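Under the Counter-based representation sketched earlier (again an illustration of ours), the relational transformation of a document is simply the multiset sum of its sentences:

    from collections import Counter

    def relational_transformation(sentences):
        # psi(d): the multiset sum (⊕) of the sentence multirelations, each sentence
        # being a Counter over (concept, concept) couples.
        return sum(sentences, Counter())

The value relational_transformation(S)[(c1, c2)] then gives the multiplicity of a couple in the document.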
An important question that arises with the introduction of the relational model,
is how to represent a piece of text with this model. Such a representation is possible
by use of the i.Know-engine. This engine is a patented technology of the Belgian
company i.Know that transforms textual documents into a semistructured model. An
important advantage of this technology, which is also the most significant innovation
with respect to existing methods, is the fact that the i.Know-engine is context independent. In the literature, many methods are based on semantic knowledge systems,
but the disadvantage of these methods is that they rely on a context-dependent component. Classical natural language processing (NLP) systems use a top-down approach
to identify terms based on a predefined thesaurus, ontology or statistical model. In
contrast to this, i.Know adheres to a bottom-up approach where the SIE (smart
indexing engine) automatically identifies all complex terms in a text, regardless of
their length or semantical complexity. The i.Know engine relies on the relational
structure of natural languages to find all concept–relation–concept (CRC) units in
a document. Thus, instead of using a predefined ontology, the i.Know technology
offers a method where concepts are dynamically mined from a text.
Example. Let us consider the following sentence:
“Heavy rain is falling over southern and south-western England, while more snow
is due to arrive in Wales.”
When this sentence is parsed by the SIE, the result is the following:
“Heavy rain IS FALLING OVER southern AND south-western England, WHILE
more snow IS DUE TO ARRIVE IN Wales.”
In this result, a group of words that is marked in bold font is considered to be
a concept. A group of words in CAPITAL LETTERS is considered to be a relation. Other
words are considered to be nonrelevant. This result is cast into the relational text
model by the multirelation M shown in Figure 2. Hereby, the multiplicity of each
couple is equal to 1. Notice that the multirelation is indeed 1-transitive, symmetric,
and irreflexive.
A remarkable property of the concepts produced by the SIE is that their multiplicity is a measure for their relevance. This property is induced by the fact that
concepts are meaningful groups of words. It is a well-known fact that this property
is not possessed by classical q-grams. In this paper, we would like to exploit this
interesting feature. Therefore, we shall define the relational transformation of a data
Figure 2. Relational representation of an example sentence (concepts: Southern, South-western England, Heavy rain, Snow, Wales).
set D as follows:
$$\Psi(D) = \bigoplus_{d \in D} \psi(d). \qquad (9)$$
The relational transformation of a data set of documents is a simple and elegant way
to represent the relational and conceptual information of a data set D. We will now
introduce a number of operators for documents represented in the relational text
model.
DEFINITION 6. Assume a document d ∈ D and a couple of concepts (c1 , c2 ) ∈ C 2 .
The couple (c1 , c2 ) belongs to d, denoted as (c1 , c2 ) ∈ d if and only if:
(c1 , c2 ) ∈ ψ(d).
(10)
DEFINITION 7 (Concepts of a document). Assume a document d ∈ D. The concepts
of d are given by the multiset Cd such that
$$\forall c_1 \in C : C_d(c_1) = \sum_{c_2 \in C} \psi(d)(c_1, c_2). \qquad (11)$$
Definition 7 implies that the concepts of a document are represented as a
multiset, where the multiplicity Cd (c) indicates the number of times c occurs as the
first concept in a couple of ψ(d). We only verify the first concept because ψ(d)
is symmetrical (Definition 5). In what follows, we will be interested in verifying
whether the concepts of a couple (possibly) occur in the multiset of concepts related
to a document.
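Continuing the Counter-based sketch (our illustration), the concepts of a document can be obtained from ψ(d) as follows:

    from collections import Counter

    def concepts_of(psi_d):
        # C_d: for every concept, add up the multiplicities of the couples of psi(d)
        # in which it occurs as the first component (Equation 11).
        cd = Counter()
        for (c1, _c2), multiplicity in psi_d.items():
            cd[c1] += multiplicity
        return cd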
DEFINITION 8. Assume a document d ∈ D with concept space C . We define
∀(c1 , c2 ) ∈ C 2 : (c1 , c2 ) ∈̂ d ⇔ (c1 ∈ Cd ∧ c2 ∈ Cd ).   (12)
DEFINITION 9. Assume a document d ∈ D with concept space C and an evaluator
EC . We define
∀(c1 , c2 ) ∈ C 2 : (c1 , c2 ) ∈̃ d ⇔ (∃(c1′ , c2′ ) ∈ d : B (EC (c1 , c1′ )) ∧ B (EC (c2 , c2′ ))),   (13)

where B is a binary decision model with z = 0.
We can now define several selection operators for documents.
DEFINITION 10 (Relational selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The relational selection of documents under (c1 , c2 ) is defined as a set
of documents:
D(c1 ,c2 ),∈ = {d ∈ D|(c1 , c2 ) ∈ d} .
(14)
The couple (c1 , c2 ) is called a (search) pattern.
DEFINITION 11 (Conceptual selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The conceptual selection of documents under (c1 , c2 ) is defined as a set
of documents:

D(c1 ,c2 ),∈̂ = {d ∈ D | (c1 , c2 ) ∈̂ d} .   (15)
The couple (c1 , c2 ) is called a (search) pattern.
DEFINITION 12 (Possibilistic selection of documents). Assume a couple of concepts
(c1 , c2 ) ∈ C 2 . The possibilistic selection of documents under (c1 , c2 ) is defined as a
set of documents:

D(c1 ,c2 ),∈̃ = {d ∈ D | (c1 , c2 ) ∈̃ d} .   (16)
The couple (c1 , c2 ) is called a (search) pattern.
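The three selection operators can be sketched as follows (an illustration of ours; psi maps a document to its Counter of couples, and possibly_coreferent stands in for the Boolean decision B(EC(·, ·)) with z = 0):

    def select_relational(D, c1, c2, psi):
        # Relational selection: documents whose transformation contains the couple.
        return {d for d in D if psi(d)[(c1, c2)] > 0}

    def select_conceptual(D, c1, c2, psi):
        # Conceptual selection: documents in which both concepts occur,
        # not necessarily as a couple (Definitions 8 and 11).
        def concepts(d):
            return {a for (a, _b) in psi(d)}
        return {d for d in D if c1 in concepts(d) and c2 in concepts(d)}

    def select_possibilistic(D, c1, c2, psi, possibly_coreferent):
        # Possibilistic selection: documents containing a couple whose components
        # are possibly coreferent with c1 and c2 (Definitions 9 and 12).
        return {d for d in D
                if any(possibly_coreferent(c1, a) and possibly_coreferent(c2, b)
                       for (a, b) in psi(d))}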
On the basis of the selection of documents, it is possible to define dependencies
between patterns.
DEFINITION 13 (Dependent patterns). Two patterns (c1 , c2 ) and (c1′ , c2′ ) are called
dependent (denoted as (c1 , c2 ) ∼ (c1′ , c2′ )) if and only if
D(c1 ,c2 ),∈ ∩ D(c1′ ,c2′ ),∈ ≠ ∅.   (17)
Dependency of patterns signifies that there is at least one document in which
both patterns occur. Although it is possible to define alternative dependencies based
on weaker selection operators, we omit these definitions because such weaker dependencies are not used in our approach for text clustering. When we consider a
vector of patterns, we can construct a dependency matrix.
DEFINITION 14 (Dependency matrix). Assume a vector of patterns v. The dependency matrix for v is a |v| × |v| matrix Mv such that
$$M_v(i, j) = \begin{cases} 1 & \text{if } v_i \sim v_j, \\ 0 & \text{else.} \end{cases} \qquad (18)$$
When all patterns in v are mutually independent, Mv equals the identity matrix
I|v| . An important tool to study dependencies between patterns is implied by the
following theorem.
THEOREM 1. For a vector of patterns v and the dependency matrix Mv , there exists
a permutation of v, denoted v∗ , such that
$$\forall j \in \{1, \ldots, |v| - 1\} : \sum_{i=1}^{|v|} M_{v^*}(i, j) \geq \sum_{i=1}^{|v|} M_{v^*}(i, j + 1). \qquad (19)$$
Theorem 1 states that we can order a vector of patterns such that highly dependent
patterns are put first. An example of a dependency matrix Mv and the derived Mv∗
are shown in Figure 3, where dependence is represented by a white square and
independence is represented by a black square. The permutation of v ensures that
the number of independencies on the columns of Mv∗ increases from left to right. As
Mv∗ is a symmetric matrix, the same observation holds for rows from top to bottom.
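As an illustration (ours), the dependency matrix and the permutation of Theorem 1 can be computed from the document sets selected by the patterns:

    import numpy as np

    def dependency_matrix(selections):
        # selections[i] is the set of documents selected by pattern v_i.
        n = len(selections)
        M = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                M[i, j] = int(len(selections[i] & selections[j]) > 0)
        return M

    def sort_by_dependence(M):
        # The permutation v* of Theorem 1: order the patterns so that the column
        # sums (numbers of dependencies) decrease from left to right.
        order = np.argsort(-M.sum(axis=0), kind="stable")
        return M[np.ix_(order, order)], order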
Our definition of dependence considers (in)dependence as a Boolean matter.
When information is required about the extent to which two patterns are dependent,
a degree of dependence fulfills this need.
DEFINITION 15 (Degree of dependence). The degree of dependence for two patterns
(c1 , c2 ) and (c1′ , c2′ ) is defined as
$$\mathrm{dep}((c_1, c_2), (c_1', c_2')) = \frac{|D_{(c_1,c_2),\in} \cap D_{(c_1',c_2'),\in}|}{\min(|D_{(c_1,c_2),\in}|, |D_{(c_1',c_2'),\in}|)}. \qquad (20)$$

Figure 3. Dependency matrix Mv (left) and the derived Mv∗ (right).
The degree of dependence lies in the unit interval and satisfies
dep((c1 , c2 ), (c1′ , c2′ )) = dep((c1′ , c2′ ), (c1 , c2 )).   (21)

We can also see that:

(c1 , c2 ) ∼ (c1′ , c2′ ) ⇒ dep((c1 , c2 ), (c1′ , c2′ )) > 0.   (22)
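A one-line sketch of the degree of dependence (our illustration), given the two document sets selected by the patterns:

    def degree_of_dependence(docs_a, docs_b):
        # Equation (20): overlap of the two selections, normalized by the smaller one.
        if not docs_a or not docs_b:
            return 0.0
        return len(docs_a & docs_b) / min(len(docs_a), len(docs_b))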
4. DOCUMENT EVALUATION
We have introduced evaluators in Section 2. An evaluator E is a function that
compares two objects and models the uncertainty about the coreference of these
two objects. Uncertainty is hereby represented as a possibility distribution over B.
In this section, we aim at constructing such an evaluator for documents. According
to Definition 1, this should be a function ED :
ED : D2 → F (B).
(23)
To construct such a function, we shall adopt the relational transformation to work
with. This means that we shall consider an evaluator as follows:
EM(C ×C ) : M(C × C )2 → F (B).
(24)
This notation signifies that two documents are compared by comparing their relational transformations. In general, not all patterns in the relational transformation
are needed during comparison. In this paper, we wish to adopt a model such that
two documents are coreferent if they share sufficient relevant patterns. When using
such a model, we are faced with two key questions: (i) what are the relevant patterns
and (ii) how is the linguistic quantity “sufficient” to be interpreted? In response to
the first question, we can say that due to the specific properties of the SIE, we can
use multiplicity (i.e., frequency) of concepts and patterns as a measure for their relevance. The second question is more difficult to answer. We can see, however, that the
interpretation of “sufficient” depends on the definition of relevant patterns. Indeed,
if the interpretation of “relevant” is relaxed, then the interpretation of “sufficient”
should be strengthened. Therefore, we adopt the following approach. We will give
a fixed interpretation to the linguistic quantity “sufficient” and we shall adapt the
definition of relevant to this fixed quantity. This can be formalized as follows. Let
us assume that there exists a function T defined as follows:
T : M(C × C ) → M(C × C )
(25)
such that T (ψ(d)) represents the set of relevant patterns of a document d. If we
assume that the linguistic quantity “sufficient” is modeled by “at least one” we find
the following:
$$E_D(d_1, d_2) = \begin{cases} (1, 0) & \text{if } T(\psi(d_1)) \cap T(\psi(d_2)) \neq \emptyset, \\ (0, 1) & \text{else.} \end{cases} \qquad (26)$$
We now make an assumption about the function T . More specifically, we assume that
there exists a binary relation R (D) ⊂ C 2 such that
∀d ∈ D : T (ψ(d)) = ψ(d) ∩ R (D) .
(27)
This means that R (D) is a relation that contains all relevant patterns. We now see
that
(T (ψ(d1 )) ∩ T (ψ(d2 )) = ∅) ⇔ ((ψ(d1 ) ∩ R (D) ∩ ψ(d2 )) = ∅).
(28)
We can thus rewrite the evaluation of two documents as follows:
$$E_D(d_1, d_2) = \begin{cases} (1, 0) & \text{if } \exists (c_1, c_2) \in R^{(D)} : (d_1, d_2) \in D^2_{(c_1,c_2),\in}, \\ (0, 1) & \text{else.} \end{cases} \qquad (29)$$
This last expression reveals that coreferent documents can be found by using the
selection of documents through search patterns. Some important consequences are
hereby implied.
First, usage of a selection operator is efficient. More specifically, the examination
of all pairs of documents (which would have quadratic complexity) can be avoided.
Instead, we will provide an algorithm in Section 5 that directly computes clusters
of documents without explicit calculation of ED .
Second, when we want to construct such an algorithm, we must take into
account that for two patterns (c1 , c2 ) and (c1′ , c2′ ), the set
D(c1 ,c2 ),∈ ∩ D(c1′ ,c2′ ),∈
(30)
is not bound to be empty. This means that documents are not bound to belong to one
cluster. This seems to be a natural property for textual documents. However, in light
of the coreference problem, we shall adhere to the fact that ρ is indeed a function,
which implies that a document refers to exactly one topic. As a consequence,
our approach is an approximation of reality. The authors would like to stress that a
model where a document can belong to several clusters has certain advantages such
as a more natural distribution of documents over topics. For reasons of simplicity,
this more complex model is not adopted here.
Third, the replacement of ∈ by ∈̂ or ∈̃ leads to alternative evaluators for
documents.
DEFINITION 16 (op-evaluator). An op-evaluator for documents, with op ∈ {∈, ∈̂, ∈̃}, is defined as

$$E_D^{op} : D^2 \to F(B) : (d_1, d_2) \mapsto E_D^{op}(d_1, d_2) \qquad (31)$$

such that

$$E_D^{op}(d_1, d_2) = \begin{cases} (1, 0) & \text{if } \exists (c_1, c_2) \in R^{(D)} : (d_1, d_2) \in D^2_{(c_1,c_2),op}, \\ (0, 1) & \text{else.} \end{cases} \qquad (32)$$
For this family of operators, we can prove the following theorem.
THEOREM 2. Assume a document space D, then

$$\forall (d_1, d_2) \in D^2 : E_D^{\in}(d_1, d_2) \leq E_D^{\hat{\in}}(d_1, d_2) \leq E_D^{\tilde{\in}}(d_1, d_2). \qquad (33)$$
Proof. The proof follows from the fact that for a pattern (c1 , c2 ) ∈ C 2 we have that
$$D_{(c_1,c_2),\in} \subseteq D_{(c_1,c_2),\hat{\in}} \subseteq D_{(c_1,c_2),\tilde{\in}}. \qquad (34)$$
By using the family of op-evaluators, we can define a set of coreferent documents more generally as D(c1 ,c2 ),op . Theorem 2 can be used to predict the impact
of choosing a different operator op. We will address this issue when discussing
experimental results (Section 6).
5. THE CR ALGORITHM
In this section, we will introduce the CR algorithm, an algorithm that is based
on the relational transformation of textual documents. During the explanation of our
approach, we shall frequently use definitions and operators introduced in the previous section. Our algorithm departs from the assumption that we have made about
evaluators for documents. These assumptions imply that the coreference problem
for textual documents can be written in terms of a relation R (D) and a selection operator D(c1 ,c2 ),op . The relation R (D) is an unknown relation that represents the set of
relevant patterns. To construct the unknown relation R (D) , we suggest the following
approach. We will develop a procedure that constructs clusters with high precision
for a given relation R. More specifically, if a binary relation R is given as a candidate
for R (D) , our approach produces clusters of which the precision is sufficiently high.
Because of the fact that multiplicity is a measure for relevance, we shall consider the
several k-cuts of Ψ(D), i.e., the relational transformation of the whole data set, as
possible candidates for R (D) . Finally, to control recall, we shall adopt an algorithm
developed in Ref. 4 that estimates the number of clusters. The best candidate for
R (D) is then the relation for which the number of clusters best approximates the
estimation of the number of clusters.
As we have mentioned in the previous section, the usage of a selection operator
D(c1 ,c2 ),op for finding groups of coreferent documents implies that some documents
can be selected by several patterns. Our definition of coreference implies that clusters
behave as equivalence classes. To construct such classes, we shall sort the relevant
patterns. We then pass over the patterns in the given order and we select documents
through a selection operator D(c1 ,c2 ),op . Hereby, documents that are selected earlier
are not taken into account anymore. This way, we ensure that equivalence classes
are obtained.
Assume that a data set D is given. We can calculate the relational transformation
of D as Ψ(D), which is a multirelation over the concept space C . For a given
k ∈ N, the k-cut of this multirelation is a binary relation and is denoted as (Ψ(D))k .
Figure 4 shows a typical k-cut for a relatively high value of k. Hereby, green circles
represent concepts and a line between concepts denotes a relation between concepts.
A remarkable property that we observe is that, for sufficiently high values of k,
(Ψ(D))k typically consists of components in a graph-theoretical sense. In Figure 4,
the components are delimited with dashed lines. Components can be found in linear
time12 and form the basis for our approach. Each component consists of patterns,
and we denote these patterns as a vector v. A naive approach would be to produce
clusters as follows:
$$\bigcup_{(c_1,c_2) \in v} D_{(c_1,c_2),op}. \qquad (35)$$
Figure 4. A typical k-cut of Ψ(D).
This would be a very simple way of selecting documents and putting them in a
cluster. However, there is no verification about the coreference of the selected documents. If we were able to perform such a verification, a simple way of producing
clusters would become available. It appears that the mentioned verification can be
done by analysis of dependencies. In particular, we shall focus on dependencies
within a component and dependencies between components. Both types are discussed in the following subsections. After that, we use the analysis of dependencies
to construct the CR algorithm.
5.1. Dependencies within Components
Suppose that we have a data set of documents D and suppose that we take
a k-cut of Ψ(D), denoted as (Ψ(D))k . An arbitrary component of this k-cut can
be represented as a vector v. For each pattern (c1 , c2 ) in this vector, D(c1 ,c2 ),op
represents the selection of documents when (c1 , c2 ) is used as a search pattern. Let
us now suppose that we can sort the patterns in v, such that
v = v(1) ⊕ v(2) ,
(36)
where ⊕ represents the concatenation operator for vectors and such that
$$\forall i \in \{1, \ldots, |v^{(1)}|\} : \forall j \in \{1, \ldots, |v^{(2)}|\} : \neg\big(v^{(1)}_i \sim v^{(2)}_j\big). \qquad (37)$$
This means that patterns in v(1) are independent of patterns in v(2) : there is no
document in D that contains a pattern from v(1) and a pattern from v(2) . In light
of our considerations about an evaluator for documents, this means that documents
under the selection of patterns from v(1) are not coreferent with documents under
the selection of patterns from v(2) . We can thus argue that, given a vector of patterns
v, the set of documents
$$\bigcup_{(c_1,c_2) \in v} D_{(c_1,c_2),op} \qquad (38)$$
contains coreferent documents if all patterns in v are mutually dependent. To make
our approach more robust, we shall replace the quantifier "all" by a fuzzy quantifier
"most",16 which leads to the constraint that most patterns in v must be mutually
dependent. To summarize, if we are given a component with patterns v, we must
split up that component into maximal subcomponents such that most patterns of
each subcomponent are mutually dependent. Algorithm 1 shows how such a split can
be obtained.
It can be seen that Algorithm 1 has quadratic complexity in the worst case
(O(|v|2 )). This is one of the reasons why k-cuts are first decomposed into components: the size of components is typically sufficiently small for acceptable efficiency.
Algorithm 1 split(Mv∗ )
Require: split factor sf ∈ [0, 1]
1: for j ∈ {1, ..., |v∗ |} do
2:    x ← Σ_{k=1}^{|v∗|} Mv∗ (j, k)
3:    if x/|v∗ | < sf then
4:        return {v∗(1,...,j) } ∪ {split(Mv∗(j+1,...,|v∗ |) )}
5:    end if
6: end for
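A runnable Python rendering of Algorithm 1 (our own sketch; the default split factor and the handling of a component that never triggers a split are assumptions, as the paper leaves them implicit):

    import numpy as np

    def split(M, sf=0.5):
        # Split a component, given the dependency matrix M of its sorted patterns v*,
        # into subcomponents in which most patterns are mutually dependent.
        # Returns lists of indices into v*, one list per subcomponent.
        n = M.shape[0]
        if n == 0:
            return []
        for j in range(n):
            x = M[j, :].sum()          # number of patterns on which pattern j depends
            if x / n < sf:             # too many independencies: split after position j
                rest = M[np.ix_(range(j + 1, n), range(j + 1, n))]
                return [list(range(j + 1))] + [[i + j + 1 for i in sub] for sub in split(rest, sf)]
        return [list(range(n))]        # assumption: no split needed, keep the whole component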
5.2. Dependencies between Components
In the same way that strong independencies within a component require a split
of the component, strong dependencies between components reveal that the documents corresponding with these components are in fact coreferent. Therefore, we
generalize the definition of degree of dependence toward vectors of patterns:
$$\mathrm{dep}(v, v') = \frac{\left|\bigcup_{i=1}^{|v|} D_{v_i,\in} \cap \bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right|}{\min\left(\left|\bigcup_{i=1}^{|v|} D_{v_i,\in}\right|, \left|\bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right|\right)}. \qquad (39)$$
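Generalizing the earlier sketch, the degree of dependence between two components follows Equation (39) (our illustration, taking the lists of document sets selected by their patterns):

    def degree_of_dependence_vectors(selections_v, selections_w):
        # Equation (39): union the selections of each component and compare the
        # overlap of the two unions to the smaller union.
        union_v = set().union(*selections_v) if selections_v else set()
        union_w = set().union(*selections_w) if selections_w else set()
        if not union_v or not union_w:
            return 0.0
        return len(union_v & union_w) / min(len(union_v), len(union_w))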
We will reason that if a component has a strong dependency with a bigger component
(i.e., a component with more patterns), we shall ignore the smaller component.
The fact that we choose to ignore a component rather than merge two
components can be explained as follows. On the one hand, a strong dependency
between two components with patterns v and v′ indicates by definition that the set
$$\bigcup_{i=1}^{|v|} D_{v_i,\in} \cap \left(\bigcup_{i=1}^{|v'|} D_{v'_i,\in}\right) \qquad (40)$$
contains a large number of documents. Thus, we know a priori that the contribution
of the smaller component is rather low. On the other hand, due to the quadratic
complexity of Algorithm 1, the number of patterns to be analyzed with Algorithm 1
is best kept low.
5.3. Production of Clusters Based on (Ψ(D))k
In this section, we provide an algorithm to compute clusters when a k-cut of
Ψ(D) is given as a candidate for R (D) . This means that patterns in (Ψ(D))k are considered to be relevant patterns. Algorithm 2 summarizes the computation of clusters
in pseudo code, taking into account the observations regarding (in)dependencies.
First, a cut of the relational transformation of D is taken (line 2). This cut is
then split into maximal connected components (line 3). The set of components is
Algorithm 2 produce(D,k)
Require: delete factor df ∈ [0, 1]
1: K ← ∅
2: Rk(D) ← (Ψ(D))k
3: Rk ← components(Rk(D))
4: while Rk ≠ ∅ do
5:    v ← arg min_{v∈Rk} |v|
6:    if ¬∃v′ ∈ Rk : dep(v, v′ ) > df then
7:        (v(1) , ..., v(l) ) ← split(Mv∗ )
8:        for j ∈ {l, ..., 1} do
9:            cluster ← ∪_{(c1 ,c2 )∈v(j)} D(c1 ,c2 ),op
10:           add(K, cluster)
11:           D ← D \ cluster
12:       end for
13:   end if
14:   Rk ← Rk \ v
15: end while
16: return K
denoted as Rk . The components are then sorted according to their size (i.e., the
number of patterns). Smaller components are treated first because they represent a
topic for which little information is available. These topics are thus the hardest to
identify. For each component, it is checked whether a larger component exists with
which the selected component has a strong dependency (line 6). When no strong
dependencies are found, the component is split into subcomponents (line 7). For
each subcomponent, a cluster is composed as the union of the selections of the patterns
in the subcomponent. Documents from the new cluster are deleted from D to ensure
that clusters are mutually disjoint.
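Putting the pieces together, the following is a loose Python sketch of Algorithm 2 (our rendering, reusing the helpers sketched above; the treatment of bigger components and the default factors df and sf are our assumptions):

    def produce(docs, components, select, df=0.5, sf=0.5):
        # docs: set of document identifiers; components: list of pattern lists, one
        # per connected component of the k-cut; select(pattern): the chosen selection
        # operator D_{(c1,c2),op}, returning a set of documents.
        clusters = []
        remaining = set(docs)
        for comp in sorted(components, key=len):          # smallest components first
            sel = [select(p) & remaining for p in comp]
            # ignore a component that depends strongly on a bigger one (Section 5.2)
            bigger = [c for c in components if len(c) > len(comp)]
            if any(degree_of_dependence_vectors(sel, [select(p) & remaining for p in b]) > df
                   for b in bigger):
                continue
            # split the component and turn each subcomponent into a cluster (Section 5.1)
            M, order = sort_by_dependence(dependency_matrix(sel))
            for sub in reversed(split(M, sf)):
                cluster = set().union(*(sel[order[i]] for i in sub)) if sub else set()
                cluster &= remaining
                if cluster:
                    clusters.append(cluster)
                    remaining -= cluster
        return clusters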
5.4. Optimizing k
With the ability to compute clusters based on a k-cut of Ψ(D), a question
left unanswered is how to choose the best k-cut. Therefore, we adopt the following
reasoning. On the one hand, for a given k-cut, we compute clusters such that the
precision of clusters is sufficiently high. That is, by choosing groups of patterns with
a high ratio of mutual dependencies, we group documents that share a sufficient
amount of relevant patterns. Relevancy of the patterns hereby stems from the fact
that the selected patterns have a high multiplicity in Ψ(D). If not, they would not be
part of the k-cut. On the other hand, we should also be able to perform a control of
recall. Recall is a measure for the completeness of clusters, i.e., it measures to what
extent all documents about one topic are grouped in one cluster. To ensure recall,
we state that the best k-cut of Ψ(D) is the one for which the number of clusters best
approaches the number of topics. To do this, we adopt a method for the estimation of
the number of topics.4 If we denote the set of topics as T = {t1 , ..., tm }, the method
in Ref. 4 provides us with an estimate of m, denoted m̂.
Algorithm 3 CR(D)
Require: cut factor cf ∈ N
1: m̂ ← estimate(D)
2: stop ← false
3: K ← ∅
4: while ¬stop do
5:    koptimal ← arg min_k | |produce(D, k)| + |D \ Dproduce(D,k) | − m̂ |
6:    if koptimal > cf then
7:        K ← K ∪ produce(D, koptimal )
8:        D ← D \ DK
9:        stop ← |K| + |D| < m̂
10:   else
11:       stop ← true
12:   end if
13: end while
14: ∀d ∈ D : K ← K ∪ {d}
15: return K
Algorithm 3 shows how the CR algorithm estimates the number of topics
(line 1), to find an optimal k (line 5). The set K represents the set of clusters. It
is possible that the optimal k-cut leads to a number of clusters that is higher than
the estimated number of clusters. When this situation occurs, the whole procedure
is repeated on the remaining documents (line 9). The algorithm stops when (a) the
number of clusters becomes smaller than m̂ or (b) koptimal is not high enough. The
second stop condition is explained by noting that patterns should have sufficient
multiplicity. If not, they cannot be considered relevant anymore.
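A sketch of the optimization of k (ours; produce_at_k wraps the cluster production for a given k, m_hat stands for the estimate produced by the method of Ref. 4, and the candidate range for k is an arbitrary choice):

    def choose_k(docs, produce_at_k, m_hat, candidates=range(1, 20)):
        # Pick the k whose clustering best approximates the estimated number of
        # topics m_hat, counting unclustered documents as singleton clusters
        # (line 5 of Algorithm 3).
        def deviation(k):
            clusters = produce_at_k(k)
            clustered = set().union(*clusters) if clusters else set()
            return abs(len(clusters) + len(set(docs) - clustered) - m_hat)
        return min(candidates, key=deviation)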
6. EXPERIMENTAL RESULTS
In this section, a comparative study between the proposed method and existing
methods is conducted and reported. The data set used is a collection of 550 Dutch
documents collected on news sites and manually labeled with a topic. In total, 134
different topics are considered. If we calculate (Ψ(D))1 , we find that there are 73,681
different patterns. In total, there are 15,310 unique concepts. It can be seen that the
1-cut is a dense structure with many relations between the concepts. This observation
agrees with the fact that textual documents typically contain an overwhelming
amount of information, where most of this information is not relevant with respect
to the topic of the document. An interesting trend is shown in Figure 5. Here, the
average number of patterns per component is shown as a function of k. We can clearly
see that, as k increases, the components contain fewer patterns on average. This is
an important observation, considering the quadratic complexity of Algorithm 1. We
Figure 5. Average number of patterns per component as a function of k.
observe that for increasing k, the resulting k-cut becomes a sparse structure, which
is a promising result, considering the fact that classical vector spaces rapidly suffer
from high dimensionality in the case of text clustering.
The results of an algorithm are reported as follows. In each experiment, a set
of documents D is assumed. After execution of an algorithm, a set of clusters is
available. These clusters represent a partition of D. For each combination of topic
ti and cluster Kj , the precision p(ti , Kj ) and the recall r(ti , Kj ) are calculated as
follows:
$$p(t_i, K_j) = \frac{n_{i,j}}{|K_j|}, \qquad (41)$$

$$r(t_i, K_j) = \frac{n_{i,j}}{\mathrm{docs}(t_i)}, \qquad (42)$$
where ni,j is equal to the number of documents about topic ti in cluster Kj and
docs(ti ) is equal to the number of documents about topic ti in D. A balance between
precision and recall is given by their harmonic mean. This number is called the
f -value and is calculated as follows:
$$f(t_i, K_j) = \frac{2\, p(t_i, K_j)\, r(t_i, K_j)}{p(t_i, K_j) + r(t_i, K_j)}. \qquad (43)$$
The accuracy of a cluster algorithm can then be expressed as a linear combination
of the maximal f -values for the individual topics:
$$f = \sum_i \frac{\mathrm{docs}(t_i)}{|D|} \max_j f(t_i, K_j). \qquad (44)$$
To better understand the composition of clusters, we also report the overall precision
of a set of clusters, also called purity:
$$p = \sum_j \frac{|K_j|}{|D|} \max_i p(t_i, K_j). \qquad (45)$$
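The evaluation measures can be computed as follows (our sketch; it assumes the clusters form a partition of D and that every cluster is nonempty):

    from collections import Counter

    def purity_and_f(clusters, topic_of):
        # clusters: list of sets of document ids; topic_of: document id -> topic.
        docs = [d for K in clusters for d in K]
        n = len(docs)
        topic_sizes = Counter(topic_of[d] for d in docs)
        # Overall purity (Equation 45).
        purity = sum(max(Counter(topic_of[d] for d in K).values()) for K in clusters) / n
        # Weighted maximal f-value (Equation 44).
        f = 0.0
        for topic, size in topic_sizes.items():
            best = 0.0
            for K in clusters:
                n_ij = sum(1 for d in K if topic_of[d] == topic)
                if n_ij:
                    p, r = n_ij / len(K), n_ij / size
                    best = max(best, 2 * p * r / (p + r))
            f += size / n * best
        return purity, f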
6.1. Basic Methods on Vector Spaces
To compare the CR algorithm with existing approaches, we adopt a recent
overview paper on text clustering.17 This paper mentions some baseline algorithms
for text clustering and discusses the important preprocessing steps. The preprocessing steps we apply are length filtering, stopword removal, stemming, frequency
filtering, and principal component analysis (PCA). The stemming is done by using a
Snowball stemmer. PCA is performed by choosing components that together explain
95% of the total variance. We consider two ways of constructing vectors from textual
documents: binary vectors that indicate word occurrence and TFIDF vectors.13 On
the resulting vector spaces, we apply the k-means clustering algorithm20 and the hierarchical clustering algorithm.11 For hierarchical clustering, we consider both single
linkage and full linkage. Note that for both k-means clustering and hierarchical clustering, the number of clusters is assumed to be known. Therefore, we study the evolution of both the f -value and the purity in terms of a varying number of clusters.
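For reference, a scikit-learn based sketch of the TFIDF baseline (our illustration; the paper's exact preprocessing, such as Dutch stopword removal, Snowball stemming, and length and frequency filtering, is not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def baseline_kmeans(texts, n_clusters):
        X = TfidfVectorizer().fit_transform(texts).toarray()
        X = PCA(n_components=0.95).fit_transform(X)   # keep 95% of the total variance
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)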
Figure 6 shows the results for k-means clustering. The accuracy of the k-means
algorithm is very low for the given data set of documents. Note that we have also
Figure 6. Purity and f -value for k-means clustering with binary vectors (left) and TFIDF vectors
(right).
Figure 7. Purity and f -value for single linkage with binary vectors (left) and TFIDF vectors
(right).
Figure 8. Purity and f -value for full linkage with binary vectors (left) and TFIDF vectors (right).
experimented with kernel k-means clustering, using the cosine kernel function. The
results of this k-means variant were similar to the ones shown in Figure 6 and are,
therefore, omitted here. A better approach is the hierarchical clustering algorithm,
especially under the condition of full linkage. The results are shown in Figures 7
and 8.
In general, we conclude that TFIDF vectors give a slightly better result than
binary vectors, which is not surprising. For each tested algorithm, we shall keep the
maximal f -value that is obtained by modifying the number of clusters and we shall
compare this maximal f -value with the novel CR algorithm.
6.2. Latent Dirichlet Allocation
Next to the basic algorithms that are reported in Ref. 17, we have tested
an advanced and quite recent statistical method called latent Dirichlet allocation
(LDA).5 LDA starts from the idea that the topic of a document is in general not
clearly measurable. It therefore models the topic of a document as a probability
distribution over the topic space T . Each topic in T is, in turn, modeled as a
probability distribution over a word space, denoted as W . We explicitly use a
different notation to indicate that concepts in C are the result of SIE parsing. LDA
generates the word space W in a more classical way. The adopted probability
distribution is the Dirichlet distribution. LDA is closely related to latent semantic
indexing (LSI):19 the only difference lies in the prior distribution that is assumed. We
remark that LDA does not adopt a vector space. Instead, it uses the bag-of-words
model. For this reason, PCA is not performed as a preprocessing step. All other
preprocessing steps that are used in the case of basic methods, are applied in the
case of LDA. Again, the number of clusters is assumed to be known, so we report
the evolution of purity and f -value for a varying number of clusters. The results
are shown in Figure 9.
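For comparison, a scikit-learn based sketch of an LDA baseline (ours): each document is assigned to its most probable topic and these assignments are read as clusters.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def baseline_lda(texts, n_topics):
        X = CountVectorizer().fit_transform(texts)                 # bag-of-words counts
        theta = LatentDirichletAllocation(n_components=n_topics).fit_transform(X)
        return theta.argmax(axis=1)                                # hard cluster per document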
Comparison of the results of LDA with the results of the basic methods reveals
that LDA outperforms these methods. A remarkable difference between LDA and
the best performing basic method (i.e., the hierarchical algorithm adopting the full
linkage rule and TFIDF vectors) is the stability with respect to a variation in the
number of clusters. Indeed, the right panel of Figure 8 shows that there is an optimal
number of clusters for hierarchical clustering, whereas this is not the case with LDA
(Figure 9).
6.3. The CR Algorithm
In this section, we compare the novel CR algorithm with the tested algorithms from the literature. As can be seen in Algorithm 2 (line 9), the selection operator
Figure 9. Purity and f -value for LDA.
Table I. Comparison of the CR algorithm with existing methods.
Method           Purity    f-Value    Number of clusters
CR(1)            0.9508    0.6240     348
CR(2)            0.9326    0.8036     219
CR(3)            0.9399    0.8288     205
k-Means          0.6904    0.5731     100
Single linkage   0.8925    0.4714     390
Full linkage     0.8900    0.6927     240
LDA              0.8452    0.8107      92
D(c1 ,c2 ),op is a general one, where op ∈ {∈, ∈̂, ∈̃}. We therefore test three versions of our algorithm. We denote CR(1) as the CR algorithm that uses a relational
selection (i.e., D(c1 ,c2 ),∈ ), CR(2) as the CR algorithm that uses a conceptual selection
(i.e., D(c1 ,c2 ),∈̂ ), and CR(3) as the CR algorithm that uses a possibilistic selection
(i.e., D(c1 ,c2 ),∈̃ ). For CR(3) , we require an evaluator to compare concepts. Such an
evaluator is introduced in Refs. 2 and 3. We refer to those works for more information about the evaluators used for concepts. We report that the algorithm used for
estimation of the number of clusters4 provides us with an estimate of 137.
To make a comparison with existing methods, we report the maximal f -value
that is obtained by varying the number of clusters for existing methods. These
maximal f -values, together with the results of the CR algorithm, are summarized
in Table I. For the basic methods, we only report the results obtained with TFIDF
vectors, as these vectors have been shown to provide better results than binary vectors.
Table I also reports the actual number of clusters that is obtained. Let us now discuss
the most important observations that are revealed by Table I.
First, it can be seen that the CR algorithm results in a remarkably high purity.
This means that, regardless of the selection operator used, the clusters produced by
the CR algorithm mostly contain documents that are coreferent. The quality of the
individual clusters is thus very high. It can be seen that the CR algorithm outperforms
all the tested methods with respect to purity. This result shows that the analysis of
dependencies indeed allows for production of clusters with high precision. Second,
when studying the impact of the selection operator, we observe that the purity is
slightly higher for CR(1) . However, the f -value is significantly lower for CR(1) ,
which is explained by the fact that CR(2) and CR(3) yield a significantly higher recall.
This can be seen by consulting the obtained number of clusters. The observed
trends in purity, recall and f -value are not surprising, considering Theorem 2. In
general, we observe that CR(1) is too strict for the selection of documents, which
is why CR(2) or CR(3) should be used. Third, we can see that there is a significant
improvement in the f -value with respect to the basic methods that adopt a vector
space. Finally, although there is only a slight difference between LDA and CR with
respect to f -value, there are clearly some differences between both approaches. We
can, for example, see that the purity of CR is much higher than the purity of LDA.
Considering the small difference in f -value, this means that the recall of LDA must
be higher than the recall of CR. This is in fact confirmed by the difference in the
Figure 10. High purity (left) versus high recall (right) for equal f -value.
number of clusters (219 and 205 for CR(2) and CR(3) versus 92 for LDA). The
authors consider the emphasis on purity to be a benefit of the CR algorithm. For
example, when text clustering is used in a context of text summarization, clusters
with high purity will result in summaries of high quality. A lower recall can imply
that several summaries deal with the same topic. However, when the purity of
clusters decreases, the quality of summaries will decrease because of the fact that
several topics are represented in the same cluster.
Figure 10 shows two sets of clusters, both involving 40 samples and 3 topics. For
both sets of clusters, the f -value is the same. The left panel shows a set of clusters
for which the purity is high. All clusters contain documents that mostly refer to
the same topic. The right panel shows a set of clusters for which the recall is high.
All documents of one topic are mostly grouped into one cluster. However, the main
topic of a cluster is not easily identified. In a context of text summarization, such
clusters will result in summaries in which several topics are discussed.
7. CONCLUSION
In this paper, we have introduced a new model for textual documents, called
the relational document model. This model adopts the theory of multisets and
multirelations to represent a document. A textual document is transformed into this
relational model by using the SIE, a patented technology of the Belgian company
i.Know. The SIE automatically decomposes a text into concepts and relations. Each
concept and each relation is a substring of the original text and is extracted from
the text without prior knowledge such as a taxonomy or an ontology. Next, we use
the new text model for construction of the CR algorithm, which is an algorithm that
clusters textual documents according to the topic that is referred to by a document.
We hereby rely on the property that multiplicity (i.e., frequency) of (couples of)
concepts is a good measure for the relevance of (couples of) concepts. Given a set
of documents, the CR algorithm tries to construct a relation R (D) that contains the
relevant couples of concepts with respect to the topics that are described. On the
basis of such a relation, clusters are produced by use of a simple selection operator.
It is shown that the CR algorithm tries to optimize both precision (purity) and recall.
However, experimental results show that in practice, priority is given to precision.
These experiments show that the CR algorithm offers a major improvement with
respect to some basic cluster algorithms. In comparison with advanced methods
such as LDA, there is a major improvement with respect to purity.
References
1. Bronselaer A, Hallez A, De Tré G. A possibilistic view on set and multiset comparison. Control Cybern 2009;38(2):341–366.
2. Bronselaer A, De Tré G. A possibilistic approach on string comparison. IEEE Trans Fuzzy Syst 2009;17(1):208–223.
3. Bronselaer A, De Tré G. Properties of possibilistic string comparison. IEEE Trans Fuzzy Syst 2010;18(2):312–325.
4. Bronselaer A, Debergh S, Van Hyfte D, De Tré G. Estimation of topic cardinality in document collections. In: Towards natural language based data/text mining and summarization via soft approaches. Philadelphia, PA: SIAM; 2010. pp 31–39.
5. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. J Mach Learn Res 2003;3:993–1022.
6. Dubois D, Prade H. Possibility theory. Plenum Press; 1988.
7. Hisdal E. Conditional possibilities independence and noninteraction. Fuzzy Sets Syst 1978;1:283–297.
8. Shackle G. Decision, order and time in human affairs. Cambridge, UK: Cambridge University Press; 1961.
9. Salton G, Wong A, Yang C. A vector space model for automatic indexing. Commun ACM 1975;18(11):613–620.
10. De Cooman G. Towards a possibilistic logic. In: Ruan D, editor. Fuzzy set theory and advanced mathematical applications. Boston, MA: Kluwer Academic; 1995. pp 89–133.
11. Ward J. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 1963;58(301):236–244.
12. Hopcroft J, Tarjan R. Efficient algorithms for graph manipulation. Commun ACM 1973;16:372–378.
13. Jones K. A statistical interpretation of term specificity and its application in retrieval. J Doc 1972;28(1):11–21.
14. McKeown K, Passonneau R, Elson D, Nenkova A, Hirschberg J. Do summaries help? A task-based evaluation of multi-document summarization. In: Proc 28th Annu Int ACM SIGIR Conf Res Development in Information Retrieval. New York: ACM; 2005. pp 210–217.
15. Zadeh L. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1978;1:3–28.
16. Zadeh L. A computational approach to fuzzy quantifiers in natural languages. Comput Math Appl 1983;9:149–184.
17. Andrews N, Fox E. Recent developments in document clustering. Technical Report TR-07-35, Department of Computer Science, Virginia Tech, Blacksburg, VA; 2007.
18. Yager R. On the theory of bags. Int J Gen Syst 1986;13(1):23–27.
19. Deerwester S. Improving information retrieval with latent semantic indexing. In: Proc 51st Annu Meeting Am Soc Information Science. Maryland; 1988. pp 36–40.
20. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory 1982;28(2):129–137.