Text Mining using Fuzzy Association Rules

Martín-Bautista, M. J.; Sánchez, D.; Serrano, J. M.; Vila, M. A.

María José Martín Bautista

2004, Studies in Fuzziness and Soft Computing

Abstract

In this paper, fuzzy association rules are used in a text framework. Text transactions are defined based on the concept of fuzzy association rules considering each attribute as a term of a collection. The purpose of the use of text mining technologies presented in this paper is to assist users to find relevant information. The system helps the user to formulate queries by including related terms to the query using fuzzy association rules. The list of possible candidate terms extracted from the rules can be added automatically to the original query or can be shown to the user who selects the most relevant for her/his preferences in a semi-automatic process.

M.J. Martín-Bautista, D. Sánchez, J.M. Serrano, and M.A. Vila
Dept. of Computer Science and Artificial Intelligence, University of Granada, C/ Periodista Daniel Saucedo Aranda s/n, 18071 Granada, Spain.

1 Introduction

Data on the Internet are not organized in a consistent way, because no authority supervises the addition of data to the web. Even within a single web site, documents lack structure. Hypertext could in principle give web documents a homogeneous structure, which would allow data mining techniques for structured data to be used, as with relational databases; in practice, however, nobody follows a single format when writing documents for the web. This is a disadvantage when techniques such as data mining are applied. It leads us to use techniques designed specifically for text, treating web documents as text in general, since all of them are unstructured. This lack of homogeneity makes searching the web by querying less successful than users expect.
This is due to two basic reasons: first, the user is not able to express her/his needs in query terms; second, the answer set of documents is so large that the user feels overwhelmed. In this work, we address the first problem, query specification.

Data mining techniques have been broadly applied to text, giving rise to what is called Text Mining. Data mining applications sometimes require the user to know how to operate the tool. In this paper, the rules extracted from texts are not shown to the user directly. The generated rules are applied to help the user refine the query, but the user only sees, when the process is not completely automatic, a list of candidate terms to add to the query.

When a user tries to express her/his needs in a query, the terms that finally appear in the query are usually not very specific, either because the user lacks background knowledge about the topic or simply because the right terms do not come to mind at query time. To help the user construct the query, terms related to the words of a first query may be added to it. From a first set of retrieved documents, data mining techniques are applied in order to find association rules among the terms in the set. The most accurate rules that include the original query words in the antecedent or the consequent are used to modify the query, either by automatically adding these terms to the query or by showing the user the related terms in those rules, so that the modification of the query depends on the user's decision. A generalization or specialization of the query occurs when the terms used to reformulate the query appear in the consequent or the antecedent of the rule, respectively. This suggestion of terms helps the user to reduce the set of documents, steering the search in the desired direction.
This paper is organized as follows. Section 2 summarizes the literature with the same purpose as this work. Sections 3 to 6 present general theory about data mining and new proposals in the fuzzy framework. Specifically, Sects. 3 and 4 present the concepts of association rules, fuzzy association rules and fuzzy transactions. Section 5 proposes new measures for the importance and accuracy of association rules. An algorithm to generate fuzzy association rules is presented in Sect. 6. An application of this theory to the text framework is proposed in Sects. 7 and 8: the definition of text transactions is given in Sect. 7, while the extracted text association rules are applied to query reformulation in an Information Retrieval framework in Sect. 8. Finally, concluding remarks and future trends are given in Sect. 9.

2 Related Work

One possible application of Text Mining is query refinement, a problem that has been treated within several frameworks. On the one hand, in the field of Information Retrieval, the problem has been defined as query expansion, and several solutions can be found in the literature. A good review of the topic can be found in [20]. On the other hand, techniques such as Data Mining, which have been applied successfully in the last decade in the field of Databases, have also been applied to classical Information Retrieval problems such as document classification [33] and query optimization [46]. In this section, prior work in both frameworks, Information Retrieval and Data Mining, is presented, although the number of approaches in the first is much larger than in the second.
2.1 Previous Research in the Data Mining and Knowledge Discovery Framework

In general terms, the application of Data Mining and Knowledge Discovery techniques to text has been called Text Mining and Knowledge Discovery in Texts, respectively. The main difference when applying these techniques in a text framework lies in the special characteristics of text as unstructured data, very different from the structured data of databases, where mining techniques are usually applied. Some general approaches to Text Mining and Knowledge Discovery in Texts can be found in [17], [21], [28], [31].

In this work, association rules are discovered with data mining techniques in order to select the terms to be added to the original query. Some other approaches follow this direction. In [46], a vocabulary generated by the association rules is used to improve the query. In [22], a system for Finding Associations in Collections of Text (FACT) is presented. The system takes background knowledge to show the user a simple graphical interface providing a query language with well-defined semantics for the discovery actions, based on a term taxonomy at different granularity levels. A different application of association rules, within the Information Retrieval framework, can be found in [33], where the extracted rules are employed for document classification.

2.2 Previous Research in the Information Retrieval Framework

Several classifications can be made in this field according to the documents considered to expand the query, the selection of the terms to include in the query, and the way to include them. In [50], the authors study expansion techniques according to the set of documents analyzed for query expansion. If these documents are the corpus as a whole, against which all queries are issued, the technique is called global analysis.
However, if the query is expanded on the basis of the documents retrieved by the first query, the technique is called local analysis, and the set of documents is called the local set. This local technique can in turn be classified into two types. On the one hand, local feedback adds common words from the top-ranked documents of the local set. These words are sometimes identified by clustering the document collection [3]. This group includes the relevance feedback process, since the user has to evaluate the top-ranked documents from which the terms to be added to the query are selected. On the other hand, local context analysis [50] combines global analysis and local feedback to add words based on relationships among the top-ranked documents. The co-occurrences of terms are calculated on passages (text windows of fixed size), as in global analysis, instead of on complete documents. The authors show that, in general, local analysis performs better than global analysis.

In our approach, both a global and a local technique are considered. On the one hand, association rules are extracted from the corpus and applied to expand the query; on the other hand, only the top-ranked documents are considered to carry out the same process.

Regarding the selection of the terms, several techniques are used to identify terms that should be added to the original query. The first group is based on association by co-occurrence with the query terms [47]. Instead of single terms, the authors of [50] find co-occurrences of concepts, given by noun groups, with the query terms. Some other approaches based on a concept space are [12]. The statistical information can be extracted from a clustering process and a ranking of the documents of the local set, as shown in [13], or from the similarity of the top-ranked documents [36].
All these approaches, in which co-occurrences are calculated, are said to be suitable for constructing domain-specific knowledge bases, since the terms are related but it cannot be determined how [8]. The second group searches for terms based on their similarity to the query terms, constructing a similarity term thesaurus [41]. Other approaches in this same group use techniques to find the most discriminatory terms, which are the candidates to be added to the query. These two characteristics can be combined by first calculating the nearest neighbors and then measuring the discriminatory ability of the terms [38]. The last group is formed by approaches based on lexical variants of query terms extracted from a lexical knowledge base such as WordNet [35]. Some approaches in this group are [49] and [8], where a semantic network with term hierarchies is constructed. The authors point out the adequacy of this approach for general knowledge bases, which can be identified in general terms with global analysis, since the set of documents from which the hierarchies are constructed is the corpus, and not the local set of a first query. Earlier approaches based on the idea of a hierarchical thesaurus can also be found in the literature, where an expert system of rules interprets the user's queries and controls the search process [25].

In our approach, since we are performing a local analysis, fuzzy association rules are used as a technique to find relations among the terms. The aim of using this technique is to detail and give more information, by means of inclusion relations, about the connection between the terms, avoiding the inherently statistical nature of systems that use co-occurrences as relationships among terms, whose performance is good only when the terms selected to expand the query come from relevant documents of the local set [27].
Good results of fuzzy association rules in comparison with crisp association rules and purely statistical methods have previously been presented in the relational database framework [4], [16], [18].

As for the way to include the terms in the query, we can distinguish between automatic and semi-automatic query expansion [41]. In the first group, the selected terms can replace or be added to the original query without the intervention of the user [10], [25], [47]. In the second group, a list of candidate terms is shown to the user, who makes the selection [48]. Generally, automatic query expansion is used in local analysis, and semi-automatic query expansion is more adequate for global analysis, since the user has to decide, from a broad set of terms from the corpus, which are more related to her/his needs.

3 Association Rules

The discovery and mining of association rules is one of the main research problems in the data mining framework [1]. Given a database of transactions, where each transaction is an itemset, the discovery of association rules is a process guided by the constraints of support and confidence specified by the user. Support is the percentage of transactions containing an itemset, calculated in a statistical manner, while confidence measures the strength of the rule. Formally, let $T$ be a set of transactions containing items of a set of items $I$. Let us consider two itemsets $I_1, I_2 \subseteq I$, where $I_1 \cap I_2 = \emptyset$. A rule $I_1 \Rightarrow I_2$ is an implication rule meaning that the appearance of itemset $I_1$ implies the appearance of itemset $I_2$ in the set of transactions $T$. $I_1$ and $I_2$ are called the antecedent and the consequent of the rule, respectively.
Given the support of an itemset, denoted by $\mathrm{supp}(I_k)$, and the rule $I_1 \Rightarrow I_2$, the support and the confidence of the rule, denoted by $\mathrm{Supp}(I_1 \Rightarrow I_2)$ and $\mathrm{Conf}(I_1 \Rightarrow I_2)$, respectively, are calculated as follows:

$$\mathrm{Supp}(I_1 \Rightarrow I_2) = \mathrm{supp}(I_1 \cup I_2) \tag{1}$$

$$\mathrm{Conf}(I_1 \Rightarrow I_2) = \frac{\mathrm{supp}(I_1 \cup I_2)}{\mathrm{supp}(I_1)} \tag{2}$$

The constraints of minimum support and minimum confidence are established by the user with two threshold values: minsupp for the support and minconf for the confidence. A strong rule is an association rule whose support and confidence are greater than the thresholds minsupp and minconf, respectively. Once the user has determined these values, the process of obtaining association rules can be decomposed into two steps:

Step 1.- Find all the itemsets that have a support above threshold minsupp. These itemsets are called frequent itemsets.
Step 2.- Generate the rules, discarding those rules below threshold minconf.

The rules obtained with this process are called boolean association rules, in the sense that they are generated from a set of boolean transactions where the values of the tuples are 1 or 0, meaning that the attribute is present in the transaction or not, respectively.

The application of these processes is becoming quite valuable for extracting knowledge in the business world. This is why the examples given in the literature to explain the generation and mining of association rules are generally based on sales data from customers' shopping. One of the most famous examples of this kind is the market basket example introduced in [1], where the baskets of customers are analyzed in order to find the relations among the products that customers usually buy. For instance, a rule of the form bread ⇒ milk means that customers who buy bread usually also buy milk, that is, the products bread and milk usually appear together in the market basket of customers.
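As an illustration of expressions (1) and (2) on the market basket example, the following minimal Python sketch computes the support and confidence of bread ⇒ milk over a handful of boolean transactions. The baskets, the thresholds and the function names `supp` and `conf` are invented for this example and do not come from [1].

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def conf(antecedent, consequent, transactions):
    """Conf(I1 => I2) = supp(I1 ∪ I2) / supp(I1), as in expression (2)."""
    supp_antecedent = supp(antecedent, transactions)
    if supp_antecedent == 0:
        return 0.0  # convention chosen here for rules that never fire
    joint = supp(set(antecedent) | set(consequent), transactions)
    return joint / supp_antecedent


# Hypothetical boolean transactions (an item is either present or absent).
baskets = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
)]

rule_supp = supp({"bread", "milk"}, baskets)      # Supp(bread => milk)
rule_conf = conf({"bread"}, {"milk"}, baskets)    # Conf(bread => milk)

# A strong rule exceeds both user thresholds (values assumed for the example).
minsupp, minconf = 0.4, 0.6
is_strong = rule_supp > minsupp and rule_conf > minconf
```

Here two of the four baskets contain both products and three contain bread, so the rule has support 2/4 and confidence 2/3.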
We have to take into account, however, that this rule discovery has an inherently statistical nature, and it is the role of an expert to interpret such rules in order to extract the knowledge that reflects human behavior. This implies generating rules that are easy to understand for an expert in the field described by the rules, who probably has no background knowledge of data mining concepts and techniques.

Considering rules that come from the real world implies, most of the time, handling uncertainty and quantitative association rules, that is, rules with quantitative attributes such as the age or the weight of a person. Since the origin of these rules is still considered to be a set of boolean transactions, a partition of the quantitative attributes into intervals is needed in order to transform the quantitative problem into a boolean one. The discovery of suitable intervals with enough support is one of the problems to be solved in the field, and it has been addressed in several works [14], [23], [39]. The first work presents an algorithm to deal with non-binary attributes that considers all the possible values the quantitative attributes can take to find the rules. In the last two works, however, the authors stress the suitability of the theory of fuzzy sets to model quantitative data and, therefore, to deal with the problem of quantitative rules. The rules generated using this theory are called fuzzy association rules, and their principal bases, as well as the concept of fuzzy transactions, are presented in the next section.

4 Fuzzy Transactions and Fuzzy Association Rules

Fuzzy association rules are defined as those rules that associate items of the form (Attribute, Label), where the label has an internal representation as a fuzzy set over the domain of the attribute [18]. These rules are obtained from the consideration of fuzzy transactions.
In the following, we present the main definitions and features related to fuzzy transactions and fuzzy association rules. The complete model and applications of these concepts can be found in [14].

4.1 Fuzzy Transactions

Given a finite set of items $I$, we define a fuzzy transaction as any nonempty fuzzy subset $\tilde{\tau} \subseteq I$. For every $i \in I$, the membership degree of $i$ in a fuzzy transaction $\tilde{\tau}$ is denoted by $\tilde{\tau}(i)$. Likewise, given an itemset $I_0 \subseteq I$, we denote by $\tilde{\tau}(I_0)$ the membership degree of $I_0$ in a fuzzy transaction $\tilde{\tau}$. We can deduce from this definition that boolean transactions are a special case of fuzzy transactions. We call FT-set a set of fuzzy transactions, remarking that it is a crisp set. An FT-set is represented as a table whose columns and rows are labeled with identifiers of items and transactions, respectively. Each cell for a pair (itemset, transaction) of the form $(I_0, \tilde{\tau}_j)$ contains the membership degree of $I_0$ in $\tilde{\tau}_j$, denoted $\tilde{\tau}_j(I_0)$ and defined as

$$\tilde{\tau}(I_0) = \min_{i \in I_0} \tilde{\tau}(i) \tag{3}$$

The representation of an itemset $I_0$ in an FT-set $T$ based on $I$ is a fuzzy set $\tilde{\Gamma}_{I_0} \subseteq T$, defined as

$$\tilde{\Gamma}_{I_0} = \sum_{\tilde{\tau} \in T} \tilde{\tau}(I_0) / \tilde{\tau} \tag{4}$$

4.2 Fuzzy Association Rules

A fuzzy association rule is a link of the form $A \Rightarrow B$ such that $A, B \subset I$ and $A \cap B = \emptyset$, where $A$ is the antecedent and $B$ is the consequent of the rule, both of them being fuzzy itemsets. An ordinary association rule is a fuzzy association rule. The meaning of a fuzzy association rule is, therefore, analogous to that of an ordinary association rule, except that the set of transactions where the rule holds is an FT-set. If we call $\tilde{\Gamma}_A$ and $\tilde{\Gamma}_B$ the degrees of attributes $A$ and $B$ in every transaction $\tilde{\tau} \in T$, we can assert that the rule $A \Rightarrow B$ holds with total accuracy in $T$ when $\tilde{\Gamma}_A \subseteq \tilde{\Gamma}_B$.
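The definitions of Sect. 4 can be made concrete with a small Python sketch. It represents an FT-set as a dictionary of fuzzy transactions, computes the membership degree of an itemset with the minimum of expression (3), builds the representation of expression (4), and checks the inclusion that characterizes a totally accurate rule. The data and the names `itemset_degree`, `representation` and `is_included` are illustrative assumptions, not part of the model in [14].

```python
def itemset_degree(transaction, itemset):
    """Expression (3): degree of a non-empty itemset = minimum over its items."""
    return min(transaction.get(item, 0.0) for item in itemset)


def representation(ft_set, itemset):
    """Expression (4): fuzzy set over transactions, as {transaction id: degree}."""
    return {tid: itemset_degree(t, itemset) for tid, t in ft_set.items()}


def is_included(gamma_a, gamma_b):
    """Fuzzy inclusion: every degree in gamma_a is <= the degree in gamma_b."""
    return all(degree <= gamma_b.get(tid, 0.0) for tid, degree in gamma_a.items())


# Hypothetical FT-set: rows are transactions, columns are items a, b, c.
ft_set = {
    "t1": {"a": 0.8, "b": 1.0, "c": 0.2},
    "t2": {"a": 0.5, "b": 0.6},
    "t3": {"b": 0.3, "c": 0.9},
}

gamma_a = representation(ft_set, {"a"})
gamma_b = representation(ft_set, {"b"})
gamma_ab = representation(ft_set, {"a", "b"})

# a => b holds with total accuracy because gamma_a is included in gamma_b.
rule_holds_fully = is_included(gamma_a, gamma_b)
```

Items missing from a transaction are given degree 0, which is how a boolean transaction appears as a special case with degrees restricted to 0 and 1.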
5 Importance and Accuracy Measures for Fuzzy Association Rules

The imprecision latent in fuzzy transactions leads us to generalize the classical measures of support and confidence by using approximate reasoning tools. One of these tools is the evaluation of quantified sentences presented in [51]. A quantified sentence is an expression of the form "Q of F are G", where $F$ and $G$ are two fuzzy subsets of a finite set $X$, and $Q$ is a relative fuzzy quantifier. We focus on quantifiers representing fuzzy percentages, with fuzzy values in the interval [0,1], such as "most", "almost all" or "many". These quantifiers are called relative quantifiers. Let us consider the quantifier $Q_M$ defined as $Q_M(x) = x$, $\forall x \in [0, 1]$. We define the support of an itemset $I_0$ in an FT-set $T$ as the evaluation of the quantified sentence

$$Q_M \text{ of } T \text{ are } \tilde{\Gamma}_{I_0} \tag{5}$$

while the support of a rule $A \Rightarrow B$ in $T$ is given by the evaluation of

$$Q_M \text{ of } T \text{ are } \tilde{\Gamma}_{A \cup B} = Q_M \text{ of } T \text{ are } \tilde{\Gamma}_A \cap \tilde{\Gamma}_B \tag{6}$$

and its confidence is the evaluation of

$$Q_M \text{ of } \tilde{\Gamma}_A \text{ are } \tilde{\Gamma}_B \tag{7}$$

We evaluate the sentences by means of the method GD presented in [19]. To evaluate the sentence "Q of F are G", a compatibility degree between the relative cardinality of $G$ with respect to $F$ and the quantifier is represented by $GD_Q(G/F)$ and defined as

$$GD_Q(G/F) = \sum_{\alpha_i \in \Delta(G/F)} (\alpha_i - \alpha_{i+1}) \, Q\!\left( \frac{\left| (G \cap F)_{\alpha_i} \right|}{\left| F_{\alpha_i} \right|} \right) \tag{8}$$

where $\Delta(G/F) = \Lambda(G \cap F) \cup \Lambda(F)$, $\Lambda(F)$ being the level set of $F$, and $\Delta(G/F) = \{\alpha_1, \ldots, \alpha_p\}$ with $\alpha_i > \alpha_{i+1}$ for every $i \in \{1, \ldots, p\}$. The set $F$ is assumed to be normalized. If not, $F$ is normalized and the normalization factor is applied to $G \cap F$.

We must point out, moreover, that when we are dealing with crisp data in a T-set $T$, the evaluation of the sentences yields the ordinary measures of support and confidence of crisp association rules. That is, the evaluation of the sentence "Q of F are G" is

$$Q\!\left( \frac{|F \cap G|}{|F|} \right) \tag{9}$$

when $F$ and $G$ are crisp.
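A minimal Python sketch of expression (8) may help to see how the evaluation behaves. It assumes that fuzzy sets are stored as dictionaries from elements to degrees, that the intersection is computed with the minimum, and that $\alpha_{p+1} = 0$ closes the sum; these are assumptions of the sketch, and the function name `gd` is illustrative rather than taken from [19].

```python
def gd(g, f, q=lambda x: x):
    """Evaluate "Q of F are G" following expression (8).

    f and g map elements to membership degrees; q defaults to Q_M(x) = x.
    """
    height = max(f.values(), default=0.0)
    if height == 0:
        raise ValueError("F needs at least one element with non-zero degree")
    # Normalize F and apply the same factor to G ∩ F (minimum as intersection).
    f_norm = {x: d / height for x, d in f.items()}
    gf = {x: min(g.get(x, 0.0), d) / height for x, d in f.items()}
    # Delta(G/F): non-zero levels of G ∩ F and of F, in decreasing order.
    levels = sorted({d for d in f_norm.values() if d > 0}
                    | {d for d in gf.values() if d > 0}, reverse=True)
    levels.append(0.0)  # assumption of this sketch: alpha_{p+1} = 0
    total = 0.0
    for alpha, alpha_next in zip(levels, levels[1:]):
        card_f = sum(d >= alpha for d in f_norm.values())    # |F_alpha|
        card_gf = sum(d >= alpha for d in gf.values())       # |(G ∩ F)_alpha|
        total += (alpha - alpha_next) * q(card_gf / card_f)
    return total


# Hypothetical data: T is crisp, the item representations are fuzzy.
T = {"t1": 1.0, "t2": 1.0, "t3": 1.0}
gamma_a = {"t1": 0.8, "t2": 0.5, "t3": 0.0}
gamma_b = {"t1": 1.0, "t2": 0.6, "t3": 0.3}

support_a = gd(gamma_a, T)           # sentence (5) for the itemset {a}
confidence = gd(gamma_b, gamma_a)    # sentence (7) for the rule a => b

# Crisp case: the result reduces to Q(|F ∩ G| / |F|), here 2/4.
crisp = gd({"t1": 1.0, "t2": 1.0}, {"t1": 1.0, "t2": 1.0, "t3": 1.0, "t4": 1.0})
```

With $Q_M$ and a crisp $T$, the support is the average membership degree, and the confidence of a ⇒ b is 1 in this data because every degree of a is below the corresponding degree of b.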
The GD method verifies this property. For more details, see [19]. We can interpret the ordinary measures of confidence and support as the degree to which the confidence and support of an association rule is $Q_M$. Other properties of this quantifier can be seen in [14]. This generalization of the ordinary measures basically allows us, using $Q_M$, to provide an accomplishment degree. Hence, for fuzzy association rules we can assert

$$Q_M \; \tau \in T, \; A \Rightarrow B \tag{10}$$

5.1 Certainty as a New Measure for Rule Accuracy

We propose the use of certainty factors to measure the accuracy of association rules. A previous study can be found in [15]. Certainty factors were developed as a model for the representation of uncertainty and reasoning in rule-based systems [45], although they have been used in knowledge discovery too [24].

We define the certainty factor (CF) of a fuzzy association rule $A \Rightarrow B$ based on the value of the confidence of the rule. If $\mathrm{Conf}(A \Rightarrow B) > \mathrm{supp}(B)$, the value of the factor is given by expression (11); otherwise, it is given by expression (12), considering that if $\mathrm{supp}(B) = 1$, then $CF(A \Rightarrow B) = 1$, and if $\mathrm{supp}(B) = 0$, then $CF(A \Rightarrow B) = -1$.

$$CF(A \Rightarrow B) = \frac{\mathrm{Conf}(A \Rightarrow B) - \mathrm{supp}(B)}{1 - \mathrm{supp}(B)} \tag{11}$$

$$CF(A \Rightarrow B) = \frac{\mathrm{Conf}(A \Rightarrow B) - \mathrm{supp}(B)}{\mathrm{supp}(B)} \tag{12}$$

We demonstrated in [7] that certainty factors verify the three properties stated in [29]. From now on, we shall use certainty factors to measure the accuracy of a fuzzy association rule. We consider a fuzzy association rule as strong when its support and certainty factor are greater than the thresholds minsupp and minCF, respectively.

6 Generation of Fuzzy Association Rules

Several approaches can be found in the literature where efficient algorithms for association rule generation, such as Apriori and AprioriTid [2], OCD [34], SETM [30], DHP [37], DIC [9], FP-Growth [26] and TBAR [6], have been presented.
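Returning briefly to the certainty factor of Sect. 5.1, expressions (11) and (12) and the definition of a strong rule can be sketched in Python as follows. The function names and the numeric values are chosen only for illustration.

```python
def certainty_factor(conf, supp_b):
    """CF(A => B) from the rule confidence and the support of the consequent."""
    if supp_b == 1:
        return 1.0       # boundary case stated in Sect. 5.1
    if supp_b == 0:
        return -1.0      # boundary case stated in Sect. 5.1
    if conf > supp_b:
        return (conf - supp_b) / (1 - supp_b)    # expression (11)
    return (conf - supp_b) / supp_b              # expression (12)


def is_strong(supp_rule, cf, minsupp, min_cf):
    """A rule is strong when support and CF exceed their thresholds."""
    return supp_rule > minsupp and cf > min_cf


# Illustrative values: the consequent B has support 0.5.
cf_positive = certainty_factor(0.75, 0.5)   # A raises the belief in B
cf_negative = certainty_factor(0.25, 0.5)   # A lowers the belief in B
cf_neutral = certainty_factor(0.5, 0.5)     # A tells nothing about B
```

The factor is positive when the antecedent makes the consequent more likely than its support alone suggests, negative when it makes it less likely, and zero when the confidence equals the support of the consequent.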
Most of them describe the process of generating fuzzy association rules in two basic steps, as we mentioned in Sect. 3: the generation of frequent itemsets and the obtaining of the rules, with their associated degrees of support and confidence. As we are considering fuzzy association rules, Algorithm 1 shows a process to find the frequent itemsets. For this purpose, the transactions are analyzed one by one and the itemsets whose support is greater than the threshold minsupp are selected. The itemsets are processed in order of size: first 1-itemsets, next 2-itemsets, and so on. The variable $l$ stores the current size. The set $L_l$ stores the $l$-itemsets that are being analyzed and, at the end, it stores the frequent $l$-itemsets.

In order to deal with fuzzy transactions, we need to store the difference between the cardinality of every α-cut of $\tilde{\Gamma}_{I_0}$ and the cardinality of the corresponding strong α-cut, $\alpha \in [0, 1]$, for all considered itemsets $I_0$. Specifically,

$$\left| \left( \tilde{\Gamma}_{I_0} \right)_{\alpha} \right| - \left| \left( \tilde{\Gamma}_{I_0} \right)_{\alpha+} \right|$$

where $\left( \tilde{\Gamma}_{I_0} \right)_{\alpha} = \{ \tilde{\tau} \in T \mid \tilde{\Gamma}_{I_0}(\tilde{\tau}) \geq \alpha \}$ and $\left( \tilde{\Gamma}_{I_0} \right)_{\alpha+} = \{ \tilde{\tau} \in T \mid \tilde{\Gamma}_{I_0}(\tilde{\tau}) > \alpha \}$. We use a fixed number $k$ of equidistant α-cuts (specifically $k = 100$, although a smaller value would be sufficient). From this information, we obtain the fuzzy cardinality of the representation of the items, which is stored in an array $V_{I_0}$. This array can be easily obtained from an FT-set by adding 1 to $V_{I_0}\left( \tilde{\Gamma}_{I_0}(\tilde{\tau}) \right)$ for every itemset $I_0$ each time a transaction $\tilde{\tau}$ is considered. The function $R(x, k)$ maps the real value $x$ to the nearest value in the set of $k$ equidistant levels we are using. The procedure CreateLevel$(i, L)$ generates a set of $i$-itemsets such that every proper subset with $i-1$ items is frequent (i.e. is in $L_{i-1}$).
Since every proper subset of a frequent itemset is also a frequent itemset, with this procedure we avoid analyzing itemsets that do not verify this property, saving space and time.

Algorithm 1 Basic algorithm to find frequent itemsets in an FT-set $T$
Input: a set $I$ of items and an FT-set $T$ based on $I$.
Output: a set of frequent itemsets $F$.
1. {Initialization}
   a) Create an array $V_{\{i\}}$ of size $k+1$ for every $i \in I$
   b) $L_1 \leftarrow \{ \{i\} \mid i \in I \}$
   c) $F \leftarrow \emptyset$
   d) $l \leftarrow 1$
2. Repeat until $l > |I|$ or $L_l = \emptyset$
   a) For every $\tilde{\tau} \in T$
      i. For every $I^* \in L_l$
         A. $V_{I^*}\left( R\left( \tilde{\Gamma}_{I^*}(\tilde{\tau}), k \right) \right) \leftarrow V_{I^*}\left( R\left( \tilde{\Gamma}_{I^*}(\tilde{\tau}), k \right) \right) + 1$
   b) For every $I^* \in L_l$
      i. Calculate $GD_Q\left( \tilde{\Gamma}_{I^*} / T \right)$
      ii. If $GD_Q\left( \tilde{\Gamma}_{I^*} / T \right) < \text{minsupp} \times |T|$
         A. $L_l \leftarrow L_l \setminus \{I^*\}$
         B. Free the memory used by $V_{I^*}$
   c) {Variables updating}
      i. $F \leftarrow F \cup L_l$
      ii. $L_{l+1} \leftarrow$ CreateLevel$(l+1, L_l)$
      iii. $l \leftarrow l + 1$
3. Return($F$)

The complexity of this algorithm is an exponential function of the number of items. The hidden constant is increased by a factor that depends on $k$, as this value affects the size of the arrays $V$. For more details of the algorithm, see [14].

Once we have obtained the frequent itemsets with the former algorithm, we obtain the confidence by calculating $GD_Q(B/A)$ from $V_A$ and $V_{A \cup B}$. From the confidence and the support of the consequent, both available, we obtain the certainty factor of the rules. Finally, we can identify the strong rules by analyzing the values of support and certainty factor of the rules.

7 Text Mining for Information Access

The main problem when general data mining techniques are applied to text is dealing with unstructured data, in contrast to the structured data of relational databases. Therefore, in order to perform a knowledge discovery process, we need to obtain some kind of structure in the texts. Different representations of text for association rule extraction have been considered: bag of words, indexing keywords, term taxonomy and multi-term text phrases [17].
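Before text representations are detailed, the frequent-itemset search of Algorithm 1 (Sect. 6) can be sketched in Python. This is a simplified reading under stated assumptions: support is evaluated with $Q_M$ only, it is treated as a fraction and compared with minsupp directly, the levels are $0, 1/k, \ldots, 1$, and the helper names (`r`, `support_from_hist`, `create_level`, `frequent_itemsets`) and the data are hypothetical. It is not the authors' implementation from [14].

```python
from itertools import combinations


def r(x, k):
    """R(x, k): index of the nearest of the equidistant levels 0, 1/k, ..., 1."""
    return round(x * k)


def support_from_hist(v, n, k):
    """Support with Q_M from the histogram v of rounded membership degrees."""
    total, at_least = 0.0, 0
    for j in range(k, 0, -1):      # levels j/k, from 1 down to 1/k
        at_least += v[j]           # transactions whose degree is >= j/k
        total += at_least / n
    return total / k               # every level step has width 1/k


def create_level(size, prev_level):
    """Candidates of `size` items whose (size - 1)-subsets are all frequent."""
    prev = set(prev_level)
    items = sorted({i for s in prev for i in s})
    return [frozenset(c) for c in combinations(items, size)
            if all(frozenset(sub) in prev for sub in combinations(c, size - 1))]


def frequent_itemsets(items, ft_set, minsupp, k=100):
    """Level-wise search over an FT-set given as a list of {item: degree}."""
    n = len(ft_set)
    level = [frozenset([i]) for i in sorted(items)]
    frequent, size = {}, 1
    while level and size <= len(items):
        hist = {s: [0] * (k + 1) for s in level}
        for t in ft_set:                       # one pass over the transactions
            for s in level:
                degree = min(t.get(i, 0.0) for i in s)
                hist[s][r(degree, k)] += 1
        kept = []
        for s in level:
            support = support_from_hist(hist[s], n, k)
            if support >= minsupp:
                frequent[s] = support
                kept.append(s)
        size += 1
        level = create_level(size, kept)
    return frequent


# Hypothetical FT-set with three fuzzy transactions over items a, b, c.
ft_set = [
    {"a": 0.8, "b": 1.0, "c": 0.2},
    {"a": 0.5, "b": 0.6},
    {"b": 0.3, "c": 0.9},
]
result = frequent_itemsets({"a", "b", "c"}, ft_set, minsupp=0.4)
```

With these data the itemsets {a}, {b} and {a, b} are kept, while {c} falls below the threshold, so no candidate containing c is ever generated.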
In our case, we use automatic indexing techniques from Information Retrieval [44]. We represent each document by a set of terms, each with a weight expressing the presence of the term in the document. Some weighting schemes for this purpose can be found in [43]. One of the most successful and widely used representation schemes is the tf-idf scheme, which takes into account the term frequency and the inverse document frequency: if a term occurs frequently in a document but infrequently in the collection, a high weight is assigned to that term in the document. This is the scheme we consider in this work. The well-known steps to obtain the representation of a document $d_i$ by terms and weights are detailed in Algorithm 2.

Algorithm 2 Basic algorithm to obtain the representation of documents in a collection
Input: a set of documents $D = \{d_1, \ldots, d_n\}$.
Output: a representation for all documents in $D$.
1. Let $D = \{d_1, \ldots, d_n\}$ be a collection of documents
2. Extract an initial set of terms $S$ from each document $d_i \in D$
3. Remove stop words
4. Apply stemming (via Porter's algorithm [40])
5. The representation of $d_i$ obtained is a set of keywords $\{t_1, \ldots, t_m\} \subseteq S$ with their associated weights $\{w_1, \ldots, w_m\}$

We must point out that, as has been discussed and shown in [21], [42], standard Text Mining usually deals with categorized documents, in the sense of documents whose representation is a set of keywords, that is, terms that really describe the content of the document. This means that a full text is usually not considered, and its description is formed not by all the words in the document, even without stop words, but only by keywords. The authors justify the use of keywords because useless rules appear otherwise.
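The representation produced by Algorithm 2 can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: the stop-word list is a tiny illustrative one, stemming (step 4) is omitted to keep the example self-contained, the weight is the raw term frequency times the logarithm of the inverse document frequency, and weights are normalized by the largest weight in each document so that they fall in [0, 1].

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}


def extract_terms(text):
    """Steps 2 and 3: tokenize and remove stop words (stemming is omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_representation(documents):
    """Step 5: one {term: weight} dictionary per document, weights in [0, 1]."""
    term_lists = [extract_terms(d) for d in documents]
    n = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    representations = []
    for terms in term_lists:
        tf = Counter(terms)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        top = max(weights.values(), default=0.0)
        representations.append(
            {t: (w / top if top > 0 else 0.0) for t, w in weights.items()})
    return representations


# Hypothetical three-document collection.
docs = [
    "fuzzy rules for text mining",
    "mining association rules in databases",
    "fuzzy sets and fuzzy logic",
]
reps = tfidf_representation(docs)
```

In the first document the term "text" occurs nowhere else in the collection and therefore receives the highest weight, while "fuzzy", "rules" and "mining" are shared with other documents and receive lower ones.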
Some additional comments about this problem, regarding the poor discriminatory power of frequent terms, can be found in [38], where the authors note that the expanded query may perform worse than the original one due to the poor discriminatory ability of the added terms.

However, in document collections where the categorization is not always available, full text must be taken as the starting point. Additionally, special pre-processing tasks of term extraction and selection can be applied to obtain keywords in these collections. We are not referring here to statistical counts of term occurrences and the assignment of weighting schemes such as tf-idf, but to more elaborate methods that require additional processing time, such as the construction of term taxonomies, thesauri or controlled vocabularies. Nevertheless, in dynamic environments or systems where the response time is important, this pre-processing stage may not be suitable. This is the case for the problem we deal with in this work, query refinement on the Internet, where an automatic process is necessary. Two time constraints have to be taken into account. First, not all web documents have identified keywords when they are retrieved and, if they have, there is no guarantee that the keywords are appropriate in all cases. Second, in the case of query refinement, the information from the rules must be shown to the user online, that is, while she/he is querying the system. Therefore, instead of improving the document representation in this situation, we can improve the rule discovery process. The use of alternative measures of importance and accuracy such as the ones presented in Sect. 5 is considered in this work in order to avoid the generation of inappropriate rules.

7.1 Text Transactions

From a collection of documents $D = \{d_1, \ldots, d_n\}$ we can obtain a set of terms $I = \{t_1, \ldots, t_m\}$, which is the union of the keywords of all the documents in the collection. The weights associated with these terms are represented by $W = \{w_1, \ldots, w_m\}$. Therefore, for each document $d_i$, we consider an extended representation where a weight of 0 is assigned to every term that appears in some document of the collection but not in $d_i$. Considering these elements, we can define a text transaction $\tau_i \in T$ as the extended representation of document $d_i$. Without loss of generality, we can write $T = \{d_1, \ldots, d_n\}$.

However, as we are dealing with fuzzy association rules, we consider a fuzzy representation of the presence of the terms in documents, using the normalized tf-idf scheme [32]. Analogously to the former case, we can define a set of fuzzy text transactions $FT = \{d_1, \ldots, d_n\}$, where each document $d_i$ corresponds to a fuzzy transaction $\tilde{\tau}_i$, and where the weights $W = \{w_1, \ldots, w_m\}$ of the keyword set $I = \{t_1, \ldots, t_m\}$ are fuzzy values.

8 Query Reformulation Procedure

The purpose of this work is to provide a system with a query reformulation ability in order to improve the retrieval process. We represent the query as $Q = \{q_1, \ldots, q_m\}$ with associated weights $P = \{p_1, \ldots, p_m\}$. To obtain a relevance value for each document, the query representation is matched against each document representation, obtained as explained in Algorithm 2. If a document term does not appear in the query, its value is assumed to be 0. The operators and measures considered are those of the generalized Boolean model with fuzzy logic [11].

The user's initial query generates a set of ranked documents. If the top-ranked documents do not satisfy the user's needs, the query improvement process starts. From the retrieved set of documents, association relations are found.
As explained in Sect. 2, two different approaches can be considered at this point: an automatic expansion of the query, or a semi-automatic expansion based on the intervention of the user in the selection of the terms to be added to the query. The complete process in both cases is detailed below.

Case 1: Automatic query reformulation process
1. The user queries the system.
2. A first set of documents is retrieved.
3. From this set, the representation of the documents is extracted following Algorithm 2, and fuzzy association rules are generated following Algorithm 1 and the rule extraction procedure.
4. The terms co-occurring in the rules with the query terms are added to the query.
5. The system is queried again with the expanded query.

Case 2: Semi-automatic query reformulation process
1. The user queries the system.
2. A first set of documents is retrieved.
3. From this set, the representation of the documents is extracted following Algorithm 2, and fuzzy association rules are generated following Algorithm 1 and the rule extraction procedure.
4. The terms co-occurring in the rules with the query terms are shown to the user.
5. The user selects the terms most related to her/his needs.
6. The selected terms are added to the query, which is then used to query the system again.

We must point out that, in both cases, the association rules obtained form a knowledge base specific to the domain of the first query. When several queries are performed, a broader knowledge base may be constructed, so that original queries will be enriched with more terms as time passes. However, a huge knowledge base obtained from iterated query expansions, even across different domains, probably cannot be used successfully for any query, since additional information about semantic relations should also be taken into account in order to obtain a general knowledge base.
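Cases 1 and 2 above differ only in whether the user filters the candidate terms before they are added to the query. A minimal Python sketch of that final step is given here under stated assumptions: rules are represented as (antecedent, consequent) pairs of term sets, and the names `candidate_terms`, `reformulate` and the `select` callback are hypothetical, not part of the original system. Rule mining itself (Algorithm 1) and retrieval are outside the sketch.

```python
def candidate_terms(query_terms, rules):
    """Collect terms that co-occur with any query term in some rule.

    rules: iterable of (antecedent, consequent) pairs, each an iterable of
    terms (assumed representation of the extracted association rules).
    """
    query = set(query_terms)
    candidates = set()
    for antecedent, consequent in rules:
        rule_terms = set(antecedent) | set(consequent)
        # Keep the rule only if it involves at least one query term.
        if rule_terms & query:
            candidates |= rule_terms - query
    return candidates


def reformulate(query_terms, rules, select=None):
    """Return the expanded query as a list of terms.

    select=None       -> Case 1: all candidate terms are added automatically.
    select=callable   -> Case 2: the callable stands in for the user and
                         returns the subset of candidates to keep.
    """
    candidates = candidate_terms(query_terms, rules)
    if select is not None:
        # Ignore anything the selector returns that was not a candidate.
        candidates = set(select(sorted(candidates))) & candidates
    return list(query_terms) + sorted(candidates)


rules = [
    ({"fuzzy"}, {"logic"}),
    ({"mining"}, {"fuzzy", "rule"}),
    ({"web"}, {"search"}),
]
automatic = reformulate(["fuzzy"], rules)
semi_automatic = reformulate(
    ["fuzzy"], rules, select=lambda shown: [t for t in shown if t != "mining"]
)
```

In an interactive system the `select` callback would be replaced by whatever interface presents the candidate list to the user; the expanded query is then submitted to the retrieval system again.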
As a future proposal, we can consider combining the domain-specific knowledge base with a general one: looking at the terms that appear in association rules together with query terms, and searching a general knowledge base, WordNet [35] for instance, for additional terms that have a semantic relation with all the terms in the rule. Further discussion of this point can be found in [8].

8.1 Generalization and Specialization of a Query

Once the first query is constructed and the association rules are extracted, we select the rules in which the terms of the original query appear. However, the terms of the query can appear either in the antecedent or in the consequent of a rule. If a query term appears in the antecedent of a rule and we use the terms in the consequent to expand the query, a generalization of the query is carried out. A generalization therefore gives us a query on the same topic as the original one, but looking for more general information. However, if a query term appears in the consequent of the rule and we reformulate the query by adding the terms in the antecedent, then a specialization of the query is performed, and the precision of the system should increase. The specialization of a query looks for more specific information than the original query, but on the same topic. In order to obtain as many documents as possible, terms appearing on both sides of the rules can also be considered.

9 Conclusion and Future Work

In this work, an application of traditional data mining techniques in a text framework is proposed. Classical transactions in data mining are first extended to fuzzy transactions, and new measures of the accuracy of a rule are proposed. Text transactions are defined based on fuzzy transactions, considering that each transaction corresponds to a document representation.
The set of transactions therefore represents a document collection from which the fuzzy association rules are extracted. One application of this process is query refinement, a well-known problem in the field of Information Retrieval. A list of terms extracted from the fuzzy association rules and related to the terms in the query can be added automatically to the original query to optimize the search. This process can also be carried out with the intervention of the user, who selects the terms most related to her/his preferences.

As future work, we will implement the application of the model to this query reformulation procedure and compare the results with other approaches to query refinement from Information Retrieval.

References

1. Agrawal R, Imielinski T, Swami A (1993) Mining Association Rules between Sets of Items in Large Databases. Proc. of the 1993 ACM SIGMOD Conference, pp 207-216
2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. Proc. of the 20th VLDB Conference, pp 478-499
3. Attar R, Fraenkel AS (1977) Local Feedback in Full-Text Retrieval Systems. Journal of the Association for Computing Machinery 24(3):397-417
4. Au WH, Chan KCC (1998) An effective algorithm for discovering fuzzy rules in relational databases. Proc. of IEEE International Conference on Fuzzy Systems, vol II, pp 1314-1319
5. Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval. Addison-Wesley, USA
6. Berzal F, Cubero JC, Marín N, Serrano JM (2001) TBAR: An efficient method for association rule mining in relational databases. Data and Knowledge Engineering 37(1):47-84
7. Berzal F, Blanco I, Sánchez D, Vila MA (2002) Measuring the Accuracy and Importance of Association Rules: A New Framework. Intelligent Data Analysis 6:221-235
8. Bodner RC, Song F (1996) Knowledge-Based Approaches to Query Expansion in Information Retrieval. In: McCalla G (ed) Advances in Artificial Intelligence, pp 146-158. Springer, New York
9. Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. SIGMOD Record 26(2):255-264
10. Buckley C, Salton G, Allan J, Singhal A (1993) Automatic Query Expansion using SMART: TREC 3. Proc. of the 3rd Text Retrieval Conference. NIST Special Publication 500-225, pp 69-80
11. Buell DA, Kraft DH (1981) Performance Measurement in a Fuzzy Retrieval Environment. Proceedings of the Fourth International Conference on Information Storage and Retrieval, ACM/SIGIR Forum 16(1):56-62, Oakland, CA
12. Chen H, Ng T, Martinez J, Schatz BR (1997) A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System. Journal of the American Society for Information Science 48(1):17-31
13. Croft WB, Thompson RH (1987) I3R: A New Approach to the Design of Document Retrieval Systems. Journal of the American Society for Information Science 38(6):389-404
14. Delgado M, Marín N, Sánchez D, Vila MA (2001) Fuzzy Association Rules: General Model and Applications. IEEE Transactions on Fuzzy Systems (accepted)
15. Delgado M, Martín-Bautista MJ, Sánchez D, Vila MA (2000) Mining strong approximate dependences from relational databases. Proc. of IPMU 2000 2:1123-1130. Madrid, Spain
16. Delgado M, Martín-Bautista MJ, Sánchez D, Vila MA (2001) Mining association rules with improved semantics in medical databases. Artificial Intelligence in Medicine 21:241-245
17. Delgado M, Martín-Bautista MJ, Sánchez D, Vila MA (2002) Mining Text Data: Special Features and Patterns. Proc. of EPS Exploratory Workshop on Pattern Detection and Discovery in Data Mining, pp 140-153. Imperial College London, UK
18. Delgado M, Sánchez D, Vila MA (2000) Acquisition of fuzzy association rules from medical data. In: Barro S, Marín R (eds) Fuzzy Logic in Medicine. Physica-Verlag
19. Delgado M, Sánchez D, Vila MA (2000) Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning 23:23-66
20. Efthimiadis E (1996) Query Expansion. Annual Review of Information Science and Technology 31:121-187
21. Feldman R, Fresko M, Kinar Y, Lindell Y, Liphstat O, Rajman M, Schler Y, Zamir O (1998) Text Mining at the Term Level. Proc. of the 2nd European Symposium of Principles of Data Mining and Knowledge Discovery, pp 65-73
22. Feldman R, Hirsh H (1996) Mining associations in text in the presence of Background Knowledge. Proc. of the Second International Conference on Knowledge Discovery from Databases
23. Fu AW, Wong MH, Sze SC, Wong WC, Wong WL, Yu WK (1998) Finding Fuzzy Sets for the Mining of Fuzzy Association Rules for Numerical Attributes. Proc. of Int. Symp. on Intelligent Data Engineering and Learning (IDEAL'98), pp 263-268. Hong Kong
24. Fu LM, Shortliffe EH (2000) The application of certainty factors to neural computing for rule discovery. IEEE Transactions on Neural Networks 11(3):647-657
25. Gauch S, Smith JB (1993) An Expert System for Automatic Query Reformulation. Journal of the American Society for Information Science 44(3):124-136
26. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. Proc. ACM SIGMOD Int. Conf. on Management of Data, pp 1-12. Dallas, TX, USA
27. Harman D (1988) Towards interactive query expansion. Proc. of the Eleventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 321-331. ACM Press
28. Hearst M (1999) Untangling Text Data Mining. Proc. of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99). University of Maryland
29. Hearst M (2000) Next Generation Web Search: Setting our Sites. IEEE Data Engineering Bulletin, Special issue on Next Generation Web Search, Gravano L (ed)
30. Houtsma M, Swami A (1995) Set-oriented mining for association rules in relational databases. Proc. of the 11th International Conference on Data Engineering, pp 25-33
31. Kodratoff Y (1999) Knowledge Discovery in Texts: A Definition, and Applications. In: Ras ZW, Skowron A (eds) Foundations of Intelligent Systems, Lecture Notes in Artificial Intelligence 1609. Springer Verlag
32. Kraft D, Petry FE, Buckles BP, Sadasivan T (1997) Genetic Algorithms for Query Optimization in Information Retrieval: Relevance Feedback. In: Sanchez E, Shibata T, Zadeh LA (eds) Genetic Algorithms and Fuzzy Logic Systems, Advances in Fuzziness: Applications and Theory 7:157-173. World Scientific
33. Lin SH, Shih CS, Chen MC, Ho JM, Ko MT, Huang YM (1998) Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach. Proc. of ACM/SIGIR'98, pp 241-249. Melbourne, Australia
34. Mannila H, Toivonen H, Verkamo I (1994) Efficient algorithms for discovering association rules. Proc. of AAAI Workshop on Knowledge Discovery in Databases, pp 181-192
35. Miller G (1990) WordNet: An on-line lexical database. International Journal of Lexicography 3(4)
36. Mitra M, Singhal A, Buckley C (1998) Improving Automatic Query Expansion. Proc. of ACM SIGIR, pp 206-214. Melbourne, Australia
37. Park JS, Chen MS, Yu PS (1995) An effective hash based algorithm for mining association rules. SIGMOD Record 24(2):175-186
38. Peat HJ, Willett P (1991) The limitations of term co-occurrence Data for Query Expansion in Document Retrieval Systems. Journal of the American Society for Information Science 42(5):378-383
39. Piatetsky-Shapiro G (1991) Discovery, Analysis, and Presentation of Strong Rules. In: Piatetsky-Shapiro G, Frawley WJ (eds) Knowledge Discovery in Databases. AAAI/MIT Press
40. Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130-137
41. Qiu Y, Frei HP (1993) Concept Based Query Expansion. Proc. of the Sixteenth Annual International ACM-SIGIR'93 Conference on Research and Development in Information Retrieval, pp 160-169
42. Rajman M, Besançon R (1997) Text Mining: Natural Language Techniques and Text Mining Applications. Proc. of the 3rd International Conference on Database Semantics (DS-7). Chapman & Hall IFIP Proceedings series
43. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5):513-523
44. Salton G, McGill MJ (1983) Introduction to Modern Information Retrieval. McGraw-Hill
45. Shortliffe E, Buchanan B (1975) A model of inexact reasoning in medicine. Mathematical Biosciences 23:351-379
46. Srinivasan P, Ruiz ME, Kraft DH, Chen J (2001) Vocabulary mining for information retrieval: rough sets and fuzzy sets. Information Processing and Management 37:15-38
47. Van Rijsbergen CJ, Harper DJ, Porter MF (1981) The selection of good search terms. Information Processing and Management 17:77-91
48. Vélez B, Weiss R, Sheldon MA, Gifford DK (1997) Fast and Effective Query Refinement. Proc. of the 20th ACM Conference on Research and Development in Information Retrieval (SIGIR'97). Philadelphia, Pennsylvania
49. Voorhees EM (1994) Query expansion using Lexical-Semantic Relations. ACM SIGIR, pp 61-70
50. Xu J, Croft WB (1996) Query Expansion Using Local and Global Document Analysis. Proc. of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 4-11
51. Zadeh LA (1983) A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics with Applications 9(1):149-184
