
2 Rough Sets In Data Analysis: Foundations and Applications

2008


Lech Polkowski (1,2) and Piotr Artiemjew (2)

1 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland, [email protected]
2 Department of Mathematics and Computer Science, University of Warmia and Mazury, Żołnierska 14, Olsztyn, Poland, [email protected]

Summary. Rough set theory is a paradigm introduced in order to deal with uncertainty due to ambiguity of classification caused by incompleteness of knowledge. The idea, proposed by Z. Pawlak in 1982, goes back to the classical idea of representing uncertain and/or inexact notions due to the founder of modern logic, Gottlob Frege: uncertain notions should possess around them a region of uncertainty consisting of objects that can be qualified with certainty neither to the notion nor to its complement. The central tool in realizing this idea in rough sets is the relation of uncertainty based on the classical notion of indiscernibility due to Gottfried W. Leibniz: objects are indiscernible when no operator applied to them yields distinct values.

In applications, knowledge comes in the form of data; in rough sets those data are organized into an information system: a pair of the form (U, A), where U is a set of objects and A is a set of attributes, each of them a mapping a : U → V_a into the value set V_a of a. Each attribute a produces the a-indiscernibility relation IND(a) = {(u, v) : a(u) = a(v)}, and each set B of attributes induces the B-indiscernibility relation IND(B) = ⋂{IND(a) : a ∈ B}. Objects u, v that are in the relation IND(B) are B-indiscernible. Classes [u]_B of the relation IND(B) form B-elementary granules of knowledge.

Rough sets allow for establishing dependencies among groups of attributes: a group B depends functionally on a group C when IND(C) ⊆ IND(B); in that case values of attributes in B are functions of values of attributes in C. An important case is when data are organized into a decision system: a triple (U, A, d), where d is a new attribute called the decision. The decision gives a classification of objects due to an expert, an external oracle; establishing dependencies between groups B of attributes in A and the decision is one of the tasks of rough set theory.

The language for expressing dependencies is the descriptor logic. A descriptor is a formula (a = v), where v ∈ V_a, interpreted in the set U as [a = v] = {u : a(u) = v}. Descriptor formulas are obtained from descriptors by means of the connectives ∨, ∧, ¬, ⇒ of propositional calculus; their semantics is: [α ∨ β] = [α] ∪ [β], [α ∧ β] = [α] ∩ [β], [¬α] = U \ [α], [α ⇒ β] = [¬α] ∪ [β]. In the language of descriptors, a dependency between a group B of attributes and the decision is expressed as a decision rule ⋀_{a ∈ B}(a = v_a) ⇒ (d = v); a set of decision rules is a decision algorithm. There exist a number of algorithms for inducing decision rules.

Indiscernibility relations proved to be too rigid for classification, and the search in rough sets has been for more flexible similarity relations. Among them, one class that is formally rooted in logic is the class of rough inclusions. They allow for forming granules of knowledge more robust than traditional ones.
Algorithms based on them allow for a substantial knowledge reduction yet with good classification quality. A problem often met in real data is the problem of missing values. Algorithms based on granulation of knowledge allow for solving this problem with a good quality of classification.

In this Chapter, we discuss:

• Basics of rough sets;
• Language of descriptors and decision rules;
• Algorithms for rule induction;
• Examples of classification on real data;
• Granulation of knowledge;
• Algorithms for rule induction based on granulation of knowledge;
• Examples of classification of real data;
• The problem of missing values in data.

2.1 Basics of rough sets

Introduced by Pawlak in [20], rough set theory is based on ideas that, although independently fused into a theory of knowledge, borrow some thoughts from Gottlob Frege, Gottfried Wilhelm Leibniz, Jan Łukasiewicz and Stanisław Leśniewski, to mention a few names of importance. The rough set approach rests on the assumption that knowledge is classification of entities into concepts (notions). To perform the classification task, entities should be described in a formalized symbolic language. In the case of rough set theory, this language is the language of attributes and values. The formal framework allowing this description is an information system, see Pawlak [21].

2.1.1 Information systems: formal rendering of knowledge

An information system is a pair (U, A), in which U is a set of objects and A is a set of attributes. Each attribute a ∈ A is a mapping a : U → V_a from the universe U into the value set V_a of a. A variant of this notion, basic in data mining, is the notion of a decision system: a pair (U, A ∪ {d}), where d ∉ A is the decision. In applications, the decision d is the attribute whose value is set by an expert, whereas the attributes in A, called in this case conditional attributes, are selected and valued by the system user. Description of entities is done in the attribute-value language.

2.1.2 Attribute-value language. Indiscernibility

The attribute-value language is built from elementary formulas called descriptors; a descriptor is a formula of the form (a = v), where v ∈ V_a. From descriptors, complex formulas are formed by means of the connectives ∨, ∧, ¬, ⇒ of propositional calculus: if α, β are formulas then α ∨ β, α ∧ β, ¬α, α ⇒ β are formulas. These formulas and no others constitute the syntax of the descriptor logic.

Semantics of descriptor logic formulas is defined recursively: for a descriptor (a = v), its meaning [a = v] is defined as the set {u ∈ U : a(u) = v}. For complex formulas, one adopts the recursive procedure given by the following identities:

• [α ∨ β] = [α] ∪ [β].
• [α ∧ β] = [α] ∩ [β].
• [¬α] = U \ [α].
• [α ⇒ β] = [¬α] ∪ [β].

Descriptor logic allows for coding of objects in the set U as sets of descriptors: for an object u ∈ U, the information set Inf_A(u) is defined as the set {(a = a(u)) : a ∈ A}. It may happen that two objects, u and v, have the same information set: Inf_A(u) = Inf_A(v); in this case, one says that u and v are A-indiscernible. This notion may be relativized to any set B ⊆ A of attributes: the B-indiscernibility relation is defined as IND(B) = {(u, v) : Inf_B(u) = Inf_B(v)}, where Inf_B(u) = {(a = a(u)) : a ∈ B} is the information set of u restricted to the set B of attributes.
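As a minimal illustration of the notions above, the sketch below computes B-indiscernibility classes for a small, made-up information system; the data, object names and function names are ours, not taken from the chapter.

```python
from collections import defaultdict

def ind_classes(universe, B):
    """Partition the universe into B-indiscernibility classes [u]_B."""
    classes = defaultdict(list)
    for obj, values in universe.items():
        # Inf_B(u), the information set restricted to B, used as a hashable key
        inf_B = tuple((a, values[a]) for a in sorted(B))
        classes[inf_B].append(obj)
    return list(classes.values())

# hypothetical information system (U, A) with A = {a1, a2}
U = {"u1": {"a1": 1, "a2": 0},
     "u2": {"a1": 1, "a2": 0},
     "u3": {"a1": 0, "a2": 1}}
print(ind_classes(U, ["a1"]))          # [['u1', 'u2'], ['u3']]
print(ind_classes(U, ["a1", "a2"]))    # [['u1', 'u2'], ['u3']]
```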
A more general notion of a template was proposed and studied in [18]: a template is a formula of the form (a ∈ W_a), where W_a ⊆ V_a is a set of values of the attribute a; the meaning [a ∈ W_a] of the template (a ∈ W_a) is the set {u ∈ U : a(u) ∈ W_a}. Templates can also (like descriptors) be combined by means of propositional connectives, with semantics defined as for descriptors.

The indiscernibility relations are very important in rough sets: one easily observes that for u ∈ U and B ⊆ A, the formula in descriptor logic φ^B_u : ⋀_{a ∈ B}(a = a(u)) has the meaning [φ^B_u] equal to the equivalence class [u]_B = {v ∈ U : (u, v) ∈ IND(B)} of the equivalence relation IND(B). The moral is: classes [u]_B are definable, i.e., they have descriptions in the descriptor logic; also unions of those classes are definable: for a union X = ⋃_{j ∈ J}[u_j]_{B_j} of such classes, the formula ⋁_{j ∈ J} φ^{B_j}_{u_j} has the meaning equal to X. Concepts X ⊆ U that are definable are also called exact; other concepts are called rough. The fundamental difference between the two kinds of concepts is that only exact concepts are "seen" in data; rough concepts are "blurred" and they can be described by means of exact concepts only; to this aim, rough sets offer the notion of an approximation.

2.1.3 Approximations

Due to the Fregean idea [6], an inexact concept should possess a boundary into which fall objects that can be classified with certainty neither to the concept nor to its complement. This boundary to a concept is constructed from indiscernibility relations induced by attributes (features) of objects. To express the B-boundary of a concept X induced by the set B of attributes, approximations over B are introduced: B̲X = ⋃{[u]_B : [u]_B ⊆ X} (the B-lower approximation) and B̄X = ⋃{[u]_B : [u]_B ∩ X ≠ ∅} (the B-upper approximation). The difference Bd_B X = B̄X \ B̲X is the B-boundary of X; when non-empty it witnesses that X is rough. For a rough concept X, one has the double strict inclusion B̲X ⊂ X ⊂ B̄X as the description of X in terms of the two exact concepts nearest to it.

2.1.4 Knowledge reduction. Reducts

Knowledge represented in an information system (U, A) can be reduced: a reduct B of the set A of attributes is a minimal subset of A with the property that IND(B) = IND(A). Thus, reducts are sets of attributes, minimal with respect to inclusion, which preserve classification, i.e., knowledge. Finding all reducts is computationally hard: the problem of finding a minimal length reduct is NP-hard, see [35]. An algorithm for finding reducts based on the Boolean reasoning technique was proposed in [35]; the method of Boolean reasoning consists in solving a problem by constructing a Boolean function whose prime implicants give solutions to the problem [3].

The Skowron-Rauszer algorithm for reduct induction: a case of Boolean reasoning

In the context of an information system (U, A) with U = {u_1, ..., u_n}, the method of Boolean reasoning for reduct finding proposed by Skowron and Rauszer [35] starts with the discernibility matrix M_{U,A} = [c_{i,j}], where c_{i,j} = {a ∈ A : a(u_i) ≠ a(u_j)} for 1 ≤ i, j ≤ n, and builds the Boolean function in CNF form f_{U,A} = ⋀_{c_{i,j} ≠ ∅, i < j} ⋁_{a ∈ c_{i,j}} ā, where ā is the Boolean variable assigned to the attribute a ∈ A. The function f_{U,A} is then converted to its DNF form f*_{U,A} = ⋁_{j ∈ J} ⋀_{k ∈ K_j} a_{j,k}. Then the sets of the form R_j = {a_{j,k} : k ∈ K_j} for j ∈ J, corresponding to the prime implicants ⋀_{k ∈ K_j} a_{j,k}, are all reducts of A.
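The sketch below is a naive rendering of this idea, assuming the dictionary encoding of (U, A) used in the previous sketch: instead of converting the CNF function symbolically, it enumerates attribute subsets by increasing size and keeps the minimal sets that intersect every non-empty cell of the discernibility matrix, which amounts to collecting the prime implicants of f_{U,A}. Function names and the data layout are ours, not the chapter's.

```python
from itertools import combinations

def discernibility_cells(objs, A):
    """Non-empty entries c_{i,j} = {a in A : a(u_i) != a(u_j)}, i < j."""
    names = list(objs)
    cells = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            c = {a for a in A if objs[names[i]][a] != objs[names[j]][a]}
            if c:
                cells.append(c)
    return cells

def reducts(objs, A):
    """All reducts of A: inclusion-minimal attribute sets meeting every cell,
    i.e. the prime implicants of the CNF function f_{U,A}."""
    cells = discernibility_cells(objs, A)
    found = []
    for k in range(1, len(A) + 1):
        for B in map(set, combinations(A, k)):
            if any(R <= B for R in found):     # a smaller reduct already sits inside B
                continue
            if all(B & c for c in cells):      # val_B(f_{U,A}) = 1
                found.append(B)
    return found
```

Applied to the conditional part of the decision system Simple of Table 2.1 below, this sketch returns the three reducts listed in sect. 2.1.5.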
On the soundness of the algorithm

We give here a proof of the soundness of the algorithm in order to acquaint the reader with this method, which is also exploited in a few variants described below; the reader will be able to supply their own proofs in those cases along the lines shown here.

We consider a set B of attributes and the valuation val_B on the Boolean variable set {ā : a ∈ A}: val_B(ā) = 1 in case a ∈ B and 0 otherwise. Assume that the Boolean function f_{U,A} is satisfied under this valuation: val_B(f_{U,A}) = 1. This means that val_B(⋁_{a ∈ c_{i,j}} ā) = 1 for each c_{i,j} ≠ ∅. An equivalent formulation of this statement is: ∀i, j. c_{i,j} ≠ ∅ ⇒ ∃a ∈ c_{i,j}. a ∈ B. Applying the tautology p ⇒ q ⇔ ¬q ⇒ ¬p to the last implication, we obtain: (∀a ∈ B. a ∉ c_{i,j}) ⇒ (∀a ∈ A. a ∉ c_{i,j}) for each pair i, j. By the definition of the set c_{i,j}, the last implication reads: IND(B) ⊆ IND(A). This means IND(B) = IND(A), as IND(A) ⊆ IND(B) always holds because B ⊆ A.

Now, we have val_B(f*_{U,A}) = 1 as well; this means that val_B(⋀_{k ∈ K_{j0}} a_{j0,k}) = 1 for some j0 ∈ J. In turn, by the definition of val_B, this implies that {a_{j0,k} : k ∈ K_{j0}} ⊆ B.

A conclusion from the comparison of the values of val_B on f_{U,A} and f*_{U,A} is that IND(B) = IND(A) if and only if B contains a set {a_{j,k} : k ∈ K_j} corresponding to some prime implicant of f_{U,A}. Thus, any set B of attributes, minimal with respect to inclusion, such that IND(B) = IND(A) coincides with a set of attributes {a_{j,k} : k ∈ K_j} corresponding to a prime implicant of the function f_{U,A}. Choosing a reduct R and forming the reduced information system (U, R), one is assured that no information encoded in (U, A) has been lost.

2.1.5 Decision systems. Decision rules: an introduction

A decision system (U, A ∪ {d}) encodes information about the external classification d (by an oracle, expert etc.). Methods based on rough sets aim at finding a description of the concept d in terms of the conditional attributes in A in the language of descriptors. This description is fundamental for expert systems, knowledge based systems and applications in Data Mining and Knowledge Discovery.

Formal expressions relating knowledge in the conditional part (U, A) to knowledge of an expert in (U, d) are decision rules; in descriptor logic they are of the form φ^B_u ⇒ (d = w), where w ∈ V_d, the value set of the decision. Semantics of decision rules is given by the general rules set out in sect. 2.1.2: the rule φ^B_u ⇒ (d = w) is certain, or true, in case [φ^B_u] ⊆ [d = w], i.e., in case each object v that satisfies φ^B_u, i.e., (u, v) ∈ IND(B), satisfies also d(v) = w; otherwise the rule is said to be partial.

The simpler case is when the decision system is deterministic, i.e., IND(A) ⊆ IND(d). In this case the relation between A and d is functional, given by the unique assignment f_{A,d} : Inf_A(u) → Inf_d(u), or, in decision rule form, by the set of rules ⋀_{a ∈ A}(a = a(u)) ⇒ (d = d(u)). Each of these rules is clearly certain. In place of A, any reduct R of A can be substituted, leading to shorter certain rules.

In the contrary case, some classes [u]_A are split into more than one decision class [v]_d, leading to ambiguity in classification. In order to resolve the ambiguity, the notion of a δ-reduct was proposed in [35]; it is called a relative reduct in [2]. To define δ-reducts, first the generalized decision δ_B is defined for any B ⊆ A: for u ∈ U, δ_B(u) = {v ∈ V_d : d(u′) = v ∧ (u, u′) ∈ IND(B) for some u′ ∈ U}.
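A minimal sketch of the generalized decision, under the same dictionary encoding assumed in the earlier sketches:

```python
def generalized_decision(objs, dec, B):
    """delta_B(u): the set of decision values met in the B-class of u.

    objs : dict object -> dict of conditional attribute values
    dec  : dict object -> decision value
    B    : subset of the conditional attributes
    """
    delta = {}
    for u in objs:
        delta[u] = {dec[v] for v in objs
                    if all(objs[v][a] == objs[u][a] for a in B)}
    return delta
```

The δ-reducts discussed next are exactly the inclusion-minimal sets B for which this function returns the same map as for the full attribute set A.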
A subset B of A is a δ-reduct to d when it is a minimal subset of A with respect to the property that δ_B = δ_A. δ-reducts can be obtained from the modified Skowron and Rauszer algorithm [35]: it suffices to modify the entries c_{i,j} of the discernibility matrix, by letting c^d_{i,j} = {a ∈ A ∪ {d} : a(u_i) ≠ a(u_j)} and then setting c′_{i,j} = c^d_{i,j} \ {d} in case d(u_i) ≠ d(u_j), and c′_{i,j} = ∅ in case d(u_i) = d(u_j). The algorithm described above, given input with entries c′_{i,j} forming the matrix M^δ_{U,A}, outputs all δ-reducts to d, encoded as prime implicants of the associated Boolean function f^δ_{U,A}. For any δ-reduct R, rules of the form φ^R_u ⇒ (δ_R = δ_R(u)) are certain.

An example of reduct finding and decision rule induction

We conclude the first step into rough sets with a simple example of a decision system, its reducts and decision rules. Table 2.1 shows a simple decision system.

Table 2.1. Decision system Simple

obj.  a1  a2  a3  a4  d
u1    1   0   0   1   0
u2    0   1   0   0   1
u3    1   1   0   0   1
u4    1   0   0   1   1
u5    0   0   0   1   1
u6    1   1   1   1   0

Reducts of the information system (U, A = {a1, a2, a3, a4}) can be found from the discernibility matrix M_{U,A} in Table 2.2; by symmetry, cells c_{i,j} = c_{j,i} with i > j are not filled. Each attribute a_i is encoded by the Boolean variable i. After reduction by means of the absorption rules of sentential calculus, (p ∨ q) ∧ p ⇔ p and (p ∧ q) ∨ p ⇔ p, the DNF form f*_{U,A} is 1∧2∧3 ∨ 1∧2∧4 ∨ 1∧3∧4. The reducts of A in the information system (U, A) are: {a1, a2, a3}, {a1, a2, a4}, {a1, a3, a4}.

δ-reducts of the decision d in the decision system Simple can be found from the modified discernibility matrix M^δ_{U,A} in Table 2.3.

Table 2.2. Discernibility matrix M_{U,A} for reducts in (U, A)

obj.  u1  u2       u3     u4       u5       u6
u1    ∅   {1,2,4}  {2,4}  ∅        {1}      {2,3}
u2    −   ∅        {1}    {1,2,3}  {2,4}    {1,3,4}
u3    −   −        ∅      {2,4}    {2,4}    {3,4}
u4    −   −        −      ∅        {1}      {2,3}
u5    −   −        −      −        ∅        {1,2,3}
u6    −   −        −      −        −        ∅

Table 2.3. Discernibility matrix M^δ_{U,A} for δ-reducts in (U, A, d)

obj.  u1  u2       u3     u4  u5   u6
u1    ∅   {1,2,4}  {2,4}  ∅   {1}  ∅
u2    −   ∅        ∅      ∅   ∅    {1,3,4}
u3    −   −        ∅      ∅   ∅    {3,4}
u4    −   −        −      ∅   ∅    {2,3}
u5    −   −        −      −   ∅    {1,2,3}
u6    −   −        −      −   −    ∅

From the Boolean function f^δ_{U,A} we read off the δ-reducts R1 = {a1, a2, a3}, R2 = {a1, a2, a4}, R3 = {a1, a3, a4}. Taking R1 as the reduct for inducing decision rules, we read off the following certain rules:

r1: (a1 = 0) ∧ (a2 = 1) ∧ (a3 = 0) ⇒ (d = 1);
r2: (a1 = 1) ∧ (a2 = 1) ∧ (a3 = 0) ⇒ (d = 1);
r3: (a1 = 0) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 1);
r4: (a1 = 1) ∧ (a2 = 1) ∧ (a3 = 1) ⇒ (d = 0);

and two possible rules

r5: (a1 = 1) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 0);
r6: (a1 = 1) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 1),

each with certainty factor 0.5, as the class matching their common premise consists of two objects (u1 and u4) with distinct decision values.

2.1.6 Decision rules: advanced topics

In order to precisely discriminate between certain and possible rules, the notion of a positive region, along with the notion of a relative reduct, was proposed and studied in [35]. The positive region pos_B(d) is the set {u ∈ U : [u]_B ⊆ [u]_d} = ⋃_{v ∈ V_d} B̲[(d = v)]; pos_B(d) is the greatest subset X of U such that (X, B ∪ {d}) is deterministic; it generates certain rules. Objects in U \ pos_B(d) are subject to ambiguity: given such u and the collection v_1, ..., v_k of values of the decision d on the class [u]_B, the decision rule describing u can be formulated as ⋀_{a ∈ B}(a = a(u)) ⇒ ⋁_{i=1,...,k}(d = v_i); each of the rules ⋀_{a ∈ B}(a = a(u)) ⇒ (d = v_i) is possible but not certain, as the decision takes the value v_i only on a fraction of the objects in the class [u]_B.
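A small sketch, re-encoding Table 2.1 as Python dictionaries (our own encoding of the example, not code from the chapter), that computes the positive region and the certainty (consistency) factor of a rule; it reproduces the 0.5 factor of rules r5 and r6 above.

```python
# The decision system Simple of Table 2.1.
simple = {
    "u1": {"a1": 1, "a2": 0, "a3": 0, "a4": 1},
    "u2": {"a1": 0, "a2": 1, "a3": 0, "a4": 0},
    "u3": {"a1": 1, "a2": 1, "a3": 0, "a4": 0},
    "u4": {"a1": 1, "a2": 0, "a3": 0, "a4": 1},
    "u5": {"a1": 0, "a2": 0, "a3": 0, "a4": 1},
    "u6": {"a1": 1, "a2": 1, "a3": 1, "a4": 1},
}
d = {"u1": 0, "u2": 1, "u3": 1, "u4": 1, "u5": 1, "u6": 0}

def positive_region(objs, dec, B):
    """pos_B(d): objects whose whole B-class carries a single decision value."""
    pos = set()
    for u in objs:
        cls = [v for v in objs if all(objs[v][a] == objs[u][a] for a in B)]
        if len({dec[v] for v in cls}) == 1:
            pos.add(u)
    return pos

def certainty(objs, dec, premise, value):
    """cons(r) = supp(r) / match(r) for the rule 'premise => (d = value)'."""
    matched = [u for u in objs
               if all(objs[u][a] == w for a, w in premise.items())]
    supp = [u for u in matched if dec[u] == value]
    return len(supp) / len(matched) if matched else 0.0

R1 = ["a1", "a2", "a3"]
print(positive_region(simple, d, R1))                        # all objects except u1, u4
print(certainty(simple, d, {"a1": 1, "a2": 0, "a3": 0}, 0))  # 0.5 (rule r5)
```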
Relative reducts are minimal sets B of attributes with the property that pos_B(d) = pos_A(d); they can also be found by means of a discernibility matrix M*_{U,A} [35]: c*_{i,j} = c^d_{i,j} \ {d} in case either d(u_i) ≠ d(u_j) and u_i, u_j ∈ pos_A(d), or pos(u_i) ≠ pos(u_j), where pos is the characteristic function of pos_A(d); otherwise, c*_{i,j} = ∅.

For a relative reduct B, certain rules are induced from the deterministic system (pos_B(d), A ∪ {d}); possible rules are induced from the non-deterministic system (U \ pos_B(d), A ∪ {d}). In the latter case, one can find δ-reducts to d in this system and turn it into a deterministic one, (U \ pos_B(d), A, δ), inducing certain rules of the form ⋀_{a ∈ B}(a = a(u)) ⇒ ⋁_{v ∈ δ(u)}(d = v).

A method for obtaining decision rules with a minimal number of descriptors [22], [34] consists in reducing a given rule r : φ/B, u ⇒ (d = v) by finding a set R_r ⊆ B consisting only of attributes irreducible in B, in the sense that removing any a ∈ R_r causes the inequality [φ/R_r, u ⇒ (d = v)] ≠ [φ/R_r \ {a}, u ⇒ (d = v)] to hold. In case B = A, the reduced rules φ/R_r, u ⇒ (d = v) are called optimal basic rules (with minimal number of descriptors). The method for finding all irreducible subsets of the set A [34] consists in considering another modification of the discernibility matrix: for each object u_k ∈ U, the entry c′_{i,j} of the matrix M^δ_{U,A} for δ-reducts is modified into c^k_{i,j} = c′_{i,j} in case d(u_i) ≠ d(u_j) and i = k ∨ j = k; otherwise c^k_{i,j} = ∅. The matrices M^k_{U,A} and the associated Boolean functions f^k_{U,A} for all u_k ∈ U allow for finding all irreducible subsets of the set A and, in consequence, all optimal basic rules (with minimal number of descriptors).

Decision rules are judged by their quality on the basis of the training set and by their quality in classifying new, as yet unseen objects, i.e., by their performance on the test set. Quality evaluation is done on the basis of some measures: for a rule r : φ ⇒ (d = v) and an object u ∈ U, one says that u matches r in case u ∈ [φ]; match(r) is the number of objects matching r. The support supp(r) of r is the number of objects in [φ] ∩ [(d = v)]; the fraction cons(r) = supp(r)/match(r) is the consistency degree of r, and cons(r) = 1 means that the rule is certain. The strength of the rule r, strength(r), is defined as the number of objects correctly classified by the rule in the training phase [15], [1], [8]; relative strength is defined as the fraction rel-strength(r) = supp(r)/|[(d = v)]|. The specificity of the rule r, spec(r), is the number of descriptors in the premise φ of the rule r.

In the testing phase, rules vie among themselves for object classification when they point to distinct decision classes; in such a case, negotiations among rules or their sets are necessary. In these negotiations, rules with better characteristics are privileged. For a given decision class c : d = v and an object u in the test set, the set Rule(c, u) of all rules matched by u and pointing to the decision v is characterized globally by Support(Rule(c, u)) = Σ_{r ∈ Rule(c,u)} strength(r) · spec(r). The class c for which Support(Rule(c, u)) is the largest wins the competition and the object u is classified into the class c : d = v.
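A minimal sketch of this voting scheme; the rule encoding (premise as a dict, a precomputed strength field) and the example rules are hypothetical, not taken from the chapter.

```python
def matches(rule, obj):
    """An object matches a rule when it satisfies every premise descriptor."""
    return all(obj.get(a) == v for a, v in rule["premise"].items())

def class_support(rules, obj, cls):
    """Support(Rule(cls, obj)) = sum of strength(r) * spec(r) over the rules
    matched by obj that point to the decision cls."""
    return sum(r["strength"] * len(r["premise"])
               for r in rules if r["decision"] == cls and matches(r, obj))

def classify(rules, obj):
    """Assign the decision class with the largest support; None if no rule fires."""
    scores = {c: class_support(rules, obj, c) for c in {r["decision"] for r in rules}}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

rules = [  # hypothetical rules induced in a training phase
    {"premise": {"a1": 1, "a2": 0}, "decision": 0, "strength": 3},
    {"premise": {"a2": 0},          "decision": 1, "strength": 2},
]
print(classify(rules, {"a1": 1, "a2": 0, "a3": 5}))  # 0: support 6 beats 2
```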
It may happen that no rule in the available set of rules is matched by the test object u, and partial matching is necessary: for a rule r, the matching factor match-fact(r, u) is defined as the fraction of descriptors in the premise φ of r matched by u to the number spec(r) of descriptors in φ. The class for which the partial support Part-Support(Rule(c, u)) = Σ_{r ∈ Rule(c,u)} match-fact(r, u) · strength(r) · spec(r) is the largest wins the competition and assigns the value of the decision to u.

2.2 Discretization of continuous valued attributes

The important problem of treating continuous values of attributes has been resolved in rough sets with the help of the technique of discretization of attributes, common to many paradigms like decision trees, etc. For a decision system (U, A, d), a cut is a pair (a, c), where a ∈ A and c is a real number. The cut (a, c) induces the binary attribute b_{a,c} with b_{a,c}(u) = 1 if a(u) ≥ c and 0 otherwise. Given a finite sequence p^a = c^a_0 < c^a_1 < ... < c^a_m of reals, the set V_a of values of a is split into disjoint intervals (←, c^a_0), [c^a_0, c^a_1), ..., [c^a_m, →); the new attribute D_a with D_a(u) = i when b_{a,c^a_{i+1}}(u) = 0 and b_{a,c^a_i}(u) = 1 is a discrete counterpart to the continuous attribute a. Given a collection P = {p^a : a ∈ A} (a cut system), the set D_P = {D_a : a ∈ A} of attributes transforms the system (U, A, d) into the discrete system (U, D_P, d), called the P-segmentation of the original system. The set P is consistent in case the generalized decision in both systems is identical, i.e., δ_A = δ_{D_P}; a consistent P is irreducible if no proper subset P′ ⊂ P is consistent; P is optimal if its cardinality is minimal among all consistent cut systems, see [16], [17].
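A minimal sketch of the cut-to-interval mapping just described; the cut values are made up, and the leftmost interval is indexed 0 here, which is a slightly different indexing convention from the text.

```python
from bisect import bisect_right

def binary_cut(value, c):
    """The binary attribute b_{a,c}(u): 1 iff a(u) >= c."""
    return 1 if value >= c else 0

def discretize(value, cuts):
    """Index of the interval of `value` in (<-, c0), [c0, c1), ..., [cm, ->),
    counting the leftmost interval as 0."""
    return bisect_right(sorted(cuts), value)

p_a = [2.5, 4.0, 6.1]   # a hypothetical cut sequence p^a
print([discretize(x, p_a) for x in (1.0, 2.5, 5.0, 7.2)])   # [0, 1, 2, 3]
```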
2.3 Classification

Classification methods can be divided, according to the adopted methodology, into classifiers based on reducts and decision rules, classifiers based on templates and similarity, classifiers based on descriptor search, classifiers based on granular descriptors, and hybrid classifiers. For a decision system (U, A, d), classifiers are sets of decision rules. Induction of rules has been a subject of research in rough set theory since its beginning. In most general terms, building a classifier consists in searching the pool of descriptors for conjuncts that describe decision classes sufficiently well. As distinguished in [37], there are three main kinds of classifiers searched for: minimal, i.e., consisting of the minimum possible number of rules describing decision classes in the universe; exhaustive, i.e., consisting of all possible rules; and satisfactory, i.e., containing rules tailored to a specific use. Classifiers are evaluated globally with respect to their ability to properly classify objects, usually by the error, i.e., the fraction of test objects classified incorrectly, the total accuracy, i.e., the ratio of the number of correctly classified cases to the number of recognized cases, and the total coverage, i.e., the ratio of the number of recognized test cases to the number of test cases.

Minimum size algorithms include the LEM2 algorithm due to Grzymala-Busse [9] and the covering algorithm in the RSES package [33]; exhaustive algorithms include, e.g., the LERS system due to Grzymala-Busse [7] and systems based on discernibility matrices and Boolean reasoning [34], see also [1], [2], implemented in the RSES package [33]. Minimal consistent sets of rules were introduced in Skowron and Rauszer [35]. Further developments include dynamic rules, approximate rules, and relevant rules, as described in [1], [2], as well as local rules (op. cit.), effective in implementations of algorithms based on minimal consistent sets of rules. Rough set based classification algorithms, especially those implemented in the RSES system [33], were discussed extensively in [2].

In [1], a number of techniques were verified in experiments with real data, based on various strategies: discretization of attributes (codes: N - no discretization, S - standard discretization, D - cut selection by dynamic reducts, G - cut selection by generalized dynamic reducts); dynamic selection of attributes (codes: N - no selection, D - selection by dynamic reducts, G - selection based on generalized dynamic reducts); decision rule choice (codes: A - optimal decision rules, G - decision rules on the basis of approximate reducts computed by Johnson's algorithm, simulated annealing and Boltzmann machines etc., N - without computing of decision rules); approximation of decision rules (codes: N - consistent decision rules, P - approximate rules obtained by descriptor dropping); negotiations among rules (codes: S - based on strength, M - based on maximal strength, R - based on global strength, D - based on stability). Any choice of a strategy in the particular areas yields a compound strategy denoted with an alias that is the concatenation of the symbols of the strategies chosen in the consecutive areas, e.g., NNAND etc. We record here in Table 2.4 an excerpt from the comparison (Tables 8, 9, 10 in [1]) of the best of these strategies with results based on other paradigms in classification, for two sets of data: Diabetes and Australian credit from the UCI Repository [40].

Table 2.4. A comparison of errors in classification by rough set and other paradigms

paradigm         system/method     Diabetes       Austr. credit
Stat. Methods    Logdisc           0.223          0.141
Stat. Methods    SMART             0.232          0.158
Neural Nets      Backpropagation2  0.248          0.154
Neural Networks  RBF               0.243          0.145
Decision Trees   CART              0.255          0.145
Decision Trees   C4.5              0.270          0.155
Decision Trees   ITrule            0.245          0.137
Decision Rules   CN2               0.289          0.204
Rough Sets       NNANR             0.335          0.140
Rough Sets       DNANR             0.280          0.165
Rough Sets       best result       0.255 (DNAPM)  0.130 (SNAPM)

An adaptive method of classifier construction was proposed in [43]: reducts are determined by means of a genetic algorithm, see [2], and in turn reducts induce subtables of data regarded as classifying agents; the choice of optimal ensembles of agents is done by a genetic algorithm.

2.4 Approaches to classification in data based on similarity

The algorithms mentioned in sect. 2.3 were based on indiscernibility relations, which are equivalence relations. A softer approach is based on similarity relations, i.e., relations that are reflexive and possibly symmetric but need not be transitive. Classes of these relations provide coverings of the universe U instead of its partitions.

2.4.1 Template approach

Classifiers of this type were constructed by means of templates matching a given object, or closest to it with respect to a certain distance function, or by means of coverings of the universe of objects by tolerance classes, with the decision value assigned on the basis of some of them [18]; we include in Table 2.5 excerpts from the classification results in [18].
Table 2.5. Accuracy of classification by template and similarity methods

paradigm    system/method            Diabetes  Austr. credit
Rough Sets  Simple.templ./Hamming    0.6156    0.8217
Rough Sets  Gen.templ./Hamming       0.742     0.855
Rough Sets  Simple.templ./Euclidean  0.6312    0.8753
Rough Sets  Gen.templ./Euclidean     0.7006    0.8753
Rough Sets  Match.tolerance          0.757     0.8747
Rough Sets  Clos.tolerance           0.743     0.8246

A combination of rough set methods with the k-nearest neighbor idea is a further refinement of classification based on similarity or analogy in [42]. In this approach, training set objects are endowed with a metric, and test objects are classified by voting of the k nearest training objects, for some k that is subject to optimization.

2.4.2 Similarity measures based on rough inclusions

Rough inclusions offer a systematic way of introducing similarity into object sets. A rough inclusion µ(u, v, r) (read: u is a part of v to a degree of at least r) introduces a similarity that is not symmetric. Rough inclusions in an information system (U, A) can be induced in some distinct ways, as in [25], [27]. We describe here just one method, based on Archimedean t-norms, i.e., t-norms t(x, y) that are continuous and have no idempotents (values x with t(x, x) = x) except 0 and 1. It is well known, see, e.g., [23], that up to isomorphism there are two Archimedean t-norms: the Łukasiewicz t-norm L(x, y) = max{0, x + y − 1} and the product (Menger) t-norm P(x, y) = x · y. Archimedean t-norms admit a functional characterization, see, e.g., [23]: t(x, y) = g(f(x) + f(y)), where the function f : [0, 1] → R is continuous decreasing with f(1) = 0, and g : R → [0, 1] is the pseudo-inverse to f, i.e., f ∘ g = id.

The t-induced rough inclusion µ_t is defined [24] by µ_t(u, v, r) ⇔ g(|DIS(u, v)|/|A|) ≥ r, where DIS(u, v) = {a ∈ A : a(u) ≠ a(v)}. With the Łukasiewicz t-norm, f(x) = 1 − x = g(x) and, letting IND(u, v) = A \ DIS(u, v) = {a ∈ A : a(u) = a(v)}, the formula becomes µ_L(u, v, r) ⇔ |IND(u, v)|/|A| ≥ r. Thus, in the case of the Łukasiewicz t-norm, µ_L becomes the similarity measure based on the Hamming distance between information vectors of objects, normalized by |A|; from the probabilistic point of view, it is based on the relative frequency of common descriptors in the information sets of u and v. This formula permeates data mining algorithms and methods, see [10].
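A one-function sketch of µ_L under the dictionary encoding used in the earlier sketches; the example objects are made up.

```python
def mu_L(u, v, A, r):
    """Lukasiewicz rough inclusion: mu_L(u, v, r) holds iff
    |IND(u, v)| / |A| >= r, with IND(u, v) = {a in A : a(u) = a(v)}."""
    ind = sum(1 for a in A if u[a] == v[a])
    return ind / len(A) >= r

u = {"a1": 1, "a2": 0, "a3": 1, "a4": 0}
v = {"a1": 1, "a2": 1, "a3": 1, "a4": 0}
print(mu_L(u, v, ["a1", "a2", "a3", "a4"], 0.75))   # True: 3/4 >= 0.75
```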
2.5 Granulation of knowledge

The issue of granulation of knowledge as a problem in its own right was posed by L.A. Zadeh [44]. Granulation can be regarded as a form of clustering, i.e., grouping objects into aggregates characterized by closeness of certain parameter values among objects in the aggregate and greater differences in those values from aggregate to aggregate. Granulation has been a subject of intensive studies within the rough set community in, e.g., [14], [29], [31]. The rough set context offers a natural venue for granulation: indiscernibility classes were recognized as elementary granules, whereas their unions serve as granules of knowledge.

For an information system (U, A) and a rough inclusion µ on U, granulation with respect to the similarity induced by µ is formally performed by exploiting the class operator Cls of mereology [13]. The class operator is applied to any non-vacuous property F of objects (i.e., a distributive entity) in the universe U and produces the object ClsF (the collective entity) representing the wholeness of F. The formal definition of Cls is as follows: assuming a part relation in U and the associated ingredient relation ing, ClsF satisfies the conditions:

1. if u ∈ F then u is an ingredient of ClsF;
2. if v is an ingredient of ClsF then some ingredient w of v is also an ingredient of some object t in F; in plain words, each ingredient of ClsF has an ingredient in common with an object in F.

An example of a part relation is the proper subset relation ⊂ on a family of sets; then the subset relation ⊆ is the ingredient relation, and the class of a family F of sets is its union ⋃F. The merit of the class operator is in the fact that it always projects hierarchies onto the collective entity plane containing objects.

For an object u and a real number r ∈ [0, 1], we define the granule g_µ(u, r) about u of the radius r, relative to µ, as the class ClsF(u, r), where the property F(u, r) is satisfied by an object v if and only if µ(v, u, r) holds. It was shown in [24] that in the case of a transitive µ, v is an ingredient of the granule g_µ(u, r) if and only if µ(v, u, r). This fact allows for writing down the granule g_µ(u, r) as a distributive entity (a set, a list) of the objects v satisfying µ(v, u, r). Granules of the form g_µ(u, r) have the regular properties of a neighborhood system [25].

Granules generated from a rough inclusion µ can be used in defining a compressed form of the decision system: a granular decision system [25]. For a granulation radius r and a rough inclusion µ, we form the collection U^G_{r,µ} = {g_µ(u, r) : u ∈ U}. We apply a strategy G to choose a covering Cov^G_{r,µ} of the universe U by granules from U^G_{r,µ}. We apply a strategy S in order to assign the value a*(g) of each attribute a ∈ A to each granule g ∈ Cov^G_{r,µ}: a*(g) = S({a(u) : u ∈ g}). The granular counterpart to the decision system (U, A, d) is the tuple (U^G_{r,µ}, G, S, {a* : a ∈ A}, d*).

The heuristic principle H, which is at the heart of all classification paradigms, can also be formulated in this context [25]: objects similar with respect to the conditional attributes in the set A should also reveal similar (i.e., close) decision values; therefore, granular counterparts to decision systems should lead to classifiers satisfactorily close in quality to those induced from the original decision systems. Experimental results bear out this hypothesis [28].
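The following sketch shows one way such a granular system can be built, assuming the µ_L inclusion and the dictionary encoding from the earlier sketches; for brevity it keeps a granule around every object instead of selecting a random covering, and it uses majority voting as the strategy S, so it is only an approximation of the procedure used in the experiments reported below.

```python
from collections import Counter

def granule(center, objs, A, r):
    """g_{mu_L}(center, r): objects agreeing with center on >= r of the attributes."""
    return [v for v in objs
            if sum(objs[v][a] == objs[center][a] for a in A) / len(A) >= r]

def granular_system(objs, A, dec, r):
    """Granular counterpart of (U, A, d): one granule per object (a trivial
    covering strategy G), with attribute and decision values factored through
    each granule by majority voting (strategy S)."""
    table = {}
    for u in objs:
        g = granule(u, objs, A, r)
        row = {a: Counter(objs[v][a] for v in g).most_common(1)[0][0] for a in A}
        row["d"] = Counter(dec[v] for v in g).most_common(1)[0][0]
        table[u] = row
    return table
```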
The granulated data set offers a compression of the size of the training set and, a fortiori, a compression of the size of the rule set. Table 2.6 shows this on the example of the Pima Indians Diabetes data set [40]. The exhaustive algorithm of RSES [33] has been applied as the rule-inducing algorithm. The granular covering has been chosen randomly, and majority voting has been chosen as the strategy S. Results have been validated by means of 10-fold cross validation, see, e.g., [5]. The radii of granulation have been determined by the chosen rough inclusion µ_L: according to its definition in sect. 2.4.2, an object v is in the granule g_r(u) in case at least a fraction r of the attributes agree on u and v; thus, the values of r are multiples of the fraction 1/|A|, less than or equal to 1. The radius "nil" denotes the results of non-granulated data analysis.

Table 2.6. 10-fold CV; Pima; exhaustive algorithm. r = radius, macc = mean accuracy, mcov = mean coverage, mrules = mean rule number, mtrn = mean size of training set

r      macc    mcov    mrules  mtrn
nil    0.6864  0.9987  7629    692
0.125  0.0618  0.0895  5.9     22.5
0.250  0.6627  0.9948  450.1   120.6
0.375  0.6536  0.9987  3593.6  358.7
0.500  0.6645  1.0     6517.6  579.4
0.625  0.6877  0.9987  7583.6  683.1
0.750  0.6864  0.9987  7629.2  692
0.875  0.6864  0.9987  7629.2  692

For the exhaustive algorithm, the accuracy in the granular case exceeds or equals that in the non-granular case from the radius of 0.625 on, with slightly smaller sizes of the training and rule sets; already from the radius of 0.25, it reaches 95.2 percent of the accuracy of the non-granular case, with a reduction in the size of the training set of 82.6 percent and in the size of the rule set of 94 percent. The difference in coverage is less than 0.4 percent from r = 0.25 on, where the reduction in training set size is 82.6 percent, and the coverage in both cases is the same from the radius of 0.375 on, with reductions in the sizes of the training and rule sets of 48 and 53 percent, respectively. The substantial reduction in the size of the training set as well as in the size of the rule set, coupled with only a slight decrease in classification accuracy, testifies to the validity of the idea of granulated data sets; this can be of importance in the case of large biological or medical data sets, which after granulation would become much smaller and easier to analyze.

2.5.1 Concept-dependent granulation

A variant of the granulation idea is concept-dependent granulation [28], in which granules are computed relative to decision classes, i.e., the restricted granule g^d_µ(u, r) is equal to the intersection g_µ(u, r) ∩ [d = d(u)] of the granule g_µ(u, r) with the decision class [d = d(u)] of u. At the cost of an increased number of granules, the accuracy of classification is increased. In Table 2.7, we show the best results of classification obtained by means of various rough set methods on the Australian credit data set [40]. The best result is obtained with concept-dependent granulation.

Table 2.7. Best results for Australian credit by some rough set based algorithms; in case ∗, the reduction in object size is 49.9 percent and the reduction in rule number is 54.6 percent; in case ∗∗, resp., 19.7 and 18.2; in case ∗∗∗, resp., 3.6 and 1.9

source  method                         accuracy       coverage
[1]     SNAPM(0.9)                     error = 0.130  −
[18]    simple.templates               0.929          0.623
[18]    general.templates              0.886          0.905
[18]    closest.simple.templates       0.821          1.0
[18]    closest.gen.templates          0.855          1.0
[18]    tolerance.simple.templ.        0.842          1.0
[18]    tolerance.gen.templ.           0.875          1.0
[43]    adaptive.classifier            0.863          −
[28]    granular∗.r = 0.642            0.8990         1.0
[28]    granular∗∗.r = 0.714           0.964          1.0
[28]    granular∗∗∗.concept.r = 0.785  0.9970         0.9995

2.6 Missing values

Incompleteness of data sets is an important problem in data, especially biological and medical data, in which some attribute values have often not been recorded due to the difficulty or impossibility of obtaining them. An information/decision system is incomplete in case some values of the conditional attributes from A are not known; some authors, e.g., Grzymala-Busse [8], [9], make a distinction between values that are lost (denoted ?), i.e., not recorded or destroyed in spite of their importance for classification, and values that are missing (denoted ∗), i.e., values that are not essential for classification. Here, we regard all lacking values as missing, without making any distinction among them, and denote all of them with ∗.
Analysis of systems with missing values requires a decision on how to treat such values. Grzymala-Busse in his work [8] analyzes nine such methods known in the literature, among them: 1. most common attribute value; 2. concept-restricted most common attribute value; (...); 4. assigning all possible values to the missing location; (...); 9. treating the unknown value as a new valid value. Results of tests presented in [8] indicate that methods 4 and 9 perform very well among all nine methods. For this reason we adopt these methods in this work for the treatment of missing values; they are combined in our work with a modified method 1: the missing value is defined as the most frequent value in the granule closest, with respect to a chosen rough inclusion, to the object with the missing value.

Analysis of decision systems with missing data in the existing rough set literature relies on an appropriate treatment of indiscernibility: one has to reflect in this relation the fact that some values acquire a distinct character and must be treated separately; in the case of missing or lost values, the relation of indiscernibility is usually replaced with a new relation called a characteristic relation. Examples of such characteristic relations are given in, e.g., Grzymala-Busse [9]: a function ρ is introduced, with ρ(u, a) = v meaning that the attribute a takes on u the value v. The semantics of descriptors is changed, viz., the meaning [(a = v)] has as elements all u such that ρ(u, a) = v; in case ρ(u, a) = ?, the entity u is not included into [(a = v)], and in case ρ(u, a) = ∗, the entity u is included into [(a = v)] for all values v ≠ ∗, ?. Then the characteristic relation is R(B) = {(u, v) : ∀a ∈ B. ρ(u, a) ≠ ? ⇒ (ρ(u, a) = ρ(v, a) ∨ ρ(u, a) = ∗ ∨ ρ(v, a) = ∗)}, where B ⊆ A. Classes of the relation R(B) are then used in defining approximations to decision classes, from which certain and possible rules are induced, see [9]. Specializations of the characteristic relation R(B) were defined in [38] (in the case of only lost values) and in [11] (in the case of only "don't care" missing values). An analysis of the problem of missing values, along with the algorithms IAprioriCertain and IAprioriPossible for certain and possible rule generation, was given in [12].

We will use the symbol ∗, commonly used for denoting the missing value; we will use the two methods 4 and 9 for treating ∗, i.e., either ∗ is a "don't care" symbol, meaning that any value of the respective attribute can be substituted for ∗ (thus ∗ = v for each value v of the attribute), or ∗ is a new value on its own, i.e., if ∗ = v then v can only be ∗.

Our procedure for treating missing values is based on the granular structure (U^G_{r,µ}, G, S, {a* : a ∈ A}); the strategy S is majority voting, i.e., for each attribute a, the value a*(g) is the most frequent of the values in {a(u) : u ∈ g}, with ties broken randomly. The strategy G consists in random selection of granules for a covering.

For an object u with the value ∗ at an attribute a, and a granule g = g(v, r) ∈ U^G_{r,µ}, the question whether u is included in g is resolved according to the adopted strategy of treating ∗: in case ∗ = "don't care", the value ∗ is regarded as identical with any value of a, hence |IND(u, v)| is automatically increased by 1, which increases the granule; in case ∗ = ∗, the granule size is decreased.
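A small sketch of how the count |IND(u, v)| behind µ_L changes under the two treatments of ∗; the helper name and sample objects are ours, not the chapter's.

```python
def agreeing_attributes(u, v, A, star_dont_care=True):
    """|IND(u, v)| when values may be missing ('*').

    With star_dont_care=True (method 4), '*' agrees with any value, so a
    missing value enlarges the count; with False (method 9), '*' is a value
    on its own and agrees only with another '*'."""
    count = 0
    for a in A:
        x, y = u[a], v[a]
        if star_dont_care and "*" in (x, y):
            count += 1
        elif x == y:
            count += 1
    return count

u = {"a1": 1, "a2": "*", "a3": 0}
v = {"a1": 1, "a2": 2,   "a3": 0}
A = ["a1", "a2", "a3"]
print(agreeing_attributes(u, v, A, True))    # 3: '*' matches any value
print(agreeing_attributes(u, v, A, False))   # 2: '*' matches only '*'
```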
Assuming that ∗ is sparse in the data, majority voting on g would produce values of a* distinct from ∗ in most cases; nevertheless, the value ∗ may appear in new objects g*, and then, in the process of classification, such a value is repaired by means of the granule closest to g* with respect to the rough inclusion µ_L, in accordance with the chosen method for treating ∗. In plain words, objects with missing values are in a sense absorbed by granules close to them, and missing values are replaced with the most frequent values among the objects collected in the granule; in this way method 4 or 9 of [8] is combined, in a novel way, with the idea of the most frequent value (method 1). We have thus four possible strategies:

• Strategy A: in building granules, ∗ = "don't care"; in repairing values of ∗, ∗ = "don't care";
• Strategy B: in building granules, ∗ = "don't care"; in repairing values of ∗, ∗ = ∗;
• Strategy C: in building granules, ∗ = ∗; in repairing values of ∗, ∗ = "don't care";
• Strategy D: in building granules, ∗ = ∗; in repairing values of ∗, ∗ = ∗.

2.7 Case of real data with missing values

We include results of tests with the Breast cancer data set [40], which contains missing values. We show in Tables 2.8, 2.9, 2.10 and 2.11 results for intermediate values of the radii of granulation for strategies A, B, C, D and the exhaustive algorithm of RSES [33]. For comparison, results on the error in classification by the LERS system, taken from [8], for approaches similar to our strategies A and D (methods 4 and 9, resp., in Tables 2 and 3 in [8]), in which ∗ is either always "don't care" (method 4) or always ∗ (method 9), are recalled in Tables 2.8 and 2.11. We have applied here the 1-train-and-9-test scheme, i.e., the data set is split randomly into 10 equal parts, the training set is one part, the rules are tested on each of the remaining 9 parts separately, and the results are averaged.

Table 2.8. Breast cancer data set with missing values. Strategy A: r = granule radius, mtrn = mean granular training sample size, macc = mean accuracy, mcov = mean covering, gb = LERS method 4 [8]

r         mtrn  macc    mcov  gb
0.555556  9     0.7640  1.0   0.7148
0.666667  14    0.7637  1.0
0.777778  17    0.7129  1.0
0.888889  25    0.7484  1.0

Table 2.9. Breast cancer data set with missing values. Strategy B: r = granule radius, mtrn = mean granular training sample size, macc = mean accuracy, mcov = mean covering

r         mtrn  macc    mcov
0.555556  7     0.0     0.0
0.666667  13    0.7290  1.0
0.777778  16    0.7366  1.0
0.888889  25    0.7520  1.0

Table 2.10. Breast cancer data set with missing values. Strategy C: r = granule radius, mtrn = mean granular training sample size, macc = mean accuracy, mcov = mean covering

r         mtrn  macc    mcov
0.555556  8     0.7132  1.0
0.666667  14    0.6247  1.0
0.777778  17    0.7328  1.0
0.888889  25    0.7484  1.0

Table 2.11. Breast cancer data set with missing values. Strategy D: r = granule radius, mtrn = mean granular training sample size, macc = mean accuracy, mcov = mean covering, gb = LERS method 9 [8]

r         mtrn  macc    mcov  gb
0.555556  9     0.7057  1.0   0.6748
0.666667  16    0.7640  1.0
0.777778  17    0.6824  1.0
0.888889  25    0.7520  1.0

A look at Tables 2.8-2.11 shows that, on the Breast cancer data, the granulated approach gives better results than those obtained earlier with the LERS method. This approach therefore deserves attention.
2.8 Applications of rough sets

A number of software systems for inducing classifiers have been proposed based on rough set methodology, among them LERS by Grzymala-Busse; TRANCE due to Kowalczyk; RoughFamily by Słowiński and Stefanowski; TAS by Suraj; PRIMEROSE due to Tsumoto; KDD-R by Ziarko; RSES by Skowron et al.; ROSETTA due to Komorowski, Skowron et al.; RSDM by Fernandez-Baizan et al.; GROBIAN due to Duentsch and Gediga; and RoughFuzzyLab by Swiniarski. All these systems are presented in [30].

Rough set techniques have been applied in many areas of data exploration, among them the following exemplary areas: processing of audio signals [4]; pattern recognition [36]; signal classification [41]; image processing [39]; rough neural computation modeling [26]; self organizing maps [19]; learning cognitive concepts [32].

2.9 Concluding remarks

Basic ideas, methods and results obtained within the paradigm of rough sets by the efforts of many researchers, both in theoretical and application oriented aspects, have been recorded in this Chapter. Further reading, in addition to the works listed in the References, may be directed to the following monographs and collections of papers:

A. Polkowski L, Skowron A (eds.) (1998) Rough Sets in Knowledge Discovery, Vols. 1 and 2, Physica Verlag, Heidelberg
B. Inuiguchi M, Hirano S, Tsumoto S (eds.) (2003) Rough Set Theory and Granular Computing, Springer, Berlin
C. Transactions on Rough Sets I. Lecture Notes in Computer Science (2004) 3100, Springer, Berlin
D. Transactions on Rough Sets II. Lecture Notes in Computer Science (2004) 3135, Springer, Berlin
E. Transactions on Rough Sets III. Lecture Notes in Computer Science (2005) 3400, Springer, Berlin
F. Transactions on Rough Sets IV. Lecture Notes in Computer Science (2005) 3700, Springer, Berlin
G. Transactions on Rough Sets V. Lecture Notes in Computer Science (2006) 4100, Springer, Berlin
H. Transactions on Rough Sets VI. Lecture Notes in Computer Science (2006) 4374, Springer, Berlin

References

1. Bazan JG (1998) A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: Polkowski L, Skowron A (eds.) Rough Sets in Knowledge Discovery 1. Physica Verlag, Heidelberg, 321–365
2. Bazan JG, Synak P, Wróblewski J, Nguyen SH, Nguyen HS (2000) Rough set algorithms in classification problems. In: Polkowski L, Tsumoto S, Lin TY (eds.) Rough Set Methods and Applications. New Developments in Knowledge Discovery in Information Systems. Physica Verlag, Heidelberg, 49–88
3. Brown MF (2003) Boolean Reasoning: The Logic of Boolean Equations, 2nd ed. Dover, New York
4. Czyżewski A et al. (2004) Musical phrase representation and recognition by means of neural networks and rough sets. Transactions on Rough Sets I. Lecture Notes in Computer Science 3100, Springer, Berlin, 254–278
5. Duda RO, Hart PE, Stork DG (2001) Pattern Classification. John Wiley and Sons, New York
6. Frege G (1903) Grundgesetze der Arithmetik II, Jena
7. Grzymala-Busse JW (1992) LERS – a system for learning from examples based on rough sets. In: Słowiński R (ed.) Intelligent Decision Support: Handbook of Advances and Applications of the Rough Sets Theory. Kluwer, Dordrecht, 3–18
8. Grzymala-Busse JW, Hu M (2000) A comparison of several approaches to missing attribute values in data mining. Lecture Notes in AI 2005, Springer, Berlin, 378–385
9. Grzymala-Busse JW (2004) Data with missing attribute values: Generalization of indiscernibility relation and rule induction. Transactions on Rough Sets I. Lecture Notes in Computer Science 3100, Springer, Berlin, 78–95
10. Klösgen W, Żytkow J (eds.) (2002) Handbook of Data Mining and Knowledge Discovery. Oxford University Press, Oxford
11. Kryszkiewicz M (1999) Rules in incomplete information systems. Information Sciences 113:271–292
12. Kryszkiewicz M, Rybiński H (2000) Data mining in incomplete information systems from rough set perspective. In: Polkowski L, Tsumoto S, Lin TY (eds.) Rough Set Methods and Applications. Physica Verlag, Heidelberg, 568–580
13. Leśniewski S (1916) Podstawy Ogólnej Teoryi Mnogości (On the Foundations of Set Theory), in Polish. See English translation (1982) Topoi 2:7–52
14. Lin TY (2005) Granular computing: Examples, intuitions, and modeling. In: Proceedings of IEEE 2005 Conference on Granular Computing GrC05, Beijing, China. IEEE Press, 40–44
15. Michalski RS et al. (1986) The multi-purpose incremental learning system AQ15 and its testing to three medical domains. In: Proceedings of AAAI-86. Morgan Kaufmann, San Mateo CA, 1041–1045
16. Nguyen HS (1997) Discretization of Real Valued Attributes: Boolean Reasoning Approach. PhD Dissertation, Warsaw University, Department of Mathematics, Computer Science and Mechanics
17. Nguyen HS, Skowron A (1995) Quantization of real valued attributes: Rough set and Boolean reasoning approach. In: Proceedings 2nd Annual Joint Conference on Information Sciences, Wrightsville Beach NC, 34–37
18. Nguyen SH (2000) Regularity analysis and its applications in Data Mining. In: Polkowski L, Tsumoto S, Lin TY (eds.) Rough Set Methods and Applications. Physica Verlag, Heidelberg, 289–378
19. Pal SK, Dasgupta B, Mitra P (2004) Rough-SOM with fuzzy discretization. In: Pal SK, Polkowski L, Skowron A (eds.) Rough-Neural Computing. Techniques for Computing with Words. Springer, Berlin, 351–372
20. Pawlak Z (1982) Rough sets. Int. J. Computer and Information Sci. 11:341–356
21. Pawlak Z (1991) Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht
22. Pawlak Z, Skowron A (1993) A rough set approach for decision rules generation. In: Proceedings of IJCAI'93 Workshop W12. The Management of Uncertainty in AI; also ICS Research Report 23/93, Warsaw University of Technology, Institute of Computer Science
23. Polkowski L (2002) Rough Sets. Mathematical Foundations. Physica Verlag, Heidelberg
24. Polkowski L (2004) Toward rough set foundations. Mereological approach. In: Proceedings RSCTC04, Uppsala, Sweden. Lecture Notes in AI 3066, Springer, Berlin, 8–25
25. Polkowski L (2005) Formal granular calculi based on rough inclusions. In: Proceedings of IEEE 2005 Conference on Granular Computing GrC05, Beijing, China. IEEE Press, 57–62
26. Polkowski L (2005) Rough-fuzzy-neurocomputing based on rough mereological calculus of granules. International Journal of Hybrid Intelligent Systems 2:91–108
27. Polkowski L (2006) A model of granular computing with applications. In: Proceedings of IEEE 2006 Conference on Granular Computing GrC06, Atlanta, USA. IEEE Press, 9–16
28. Polkowski L, Artiemjew P (2007) On granular rough computing: Factoring classifiers through granular structures. In: Proceedings RSEISP'07, Warsaw. Lecture Notes in AI 4585, Springer, Berlin, 280–289
29. Polkowski L, Skowron A (1997) Rough mereology: a new paradigm for approximate reasoning. International Journal of Approximate Reasoning 15:333–365
30. Polkowski L, Skowron A (eds.) (1998) Rough Sets in Knowledge Discovery 2. Physica Verlag, Heidelberg
31. Polkowski L, Skowron A (1999) Towards an adaptive calculus of granules. In: Zadeh LA, Kacprzyk J (eds.) Computing with Words in Information/Intelligent Systems 1. Physica Verlag, Heidelberg, 201–228
32. Semeniuk-Polkowska M (2007) On conjugate information systems: A proposition on how to learn concepts in humane sciences by means of rough set theory. Transactions on Rough Sets VI. Lecture Notes in Computer Science 4374:298–307, Springer, Berlin
33. Skowron A et al. (1994) RSES: A system for data analysis. Available: http://logic.mimuw.edu.pl/~rses/
34. Skowron A (1993) Boolean reasoning for decision rules generation. In: Komorowski J, Ras Z (eds.) Proceedings of ISMIS'93. Lecture Notes in AI 689:295–305, Springer, Berlin
35. Skowron A, Rauszer C (1992) The discernibility matrices and functions in decision systems. In: Słowiński R (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory. Kluwer, Dordrecht, 311–362
36. Skowron A, Swiniarski RW (2004) Information granulation and pattern recognition. In: Pal SK, Polkowski L, Skowron A (eds.) Rough-Neural Computing. Techniques for Computing with Words. Springer, Berlin, 599–636
37. Stefanowski J (2006) On combined classifiers, rule induction and rough sets. Transactions on Rough Sets VI. Lecture Notes in Computer Science 4374:329–350, Springer, Berlin
38. Stefanowski J, Tsoukias A (2001) Incomplete information tables and rough classification. Computational Intelligence 17:545–566
39. Swiniarski RW, Skowron A (2004) Independent component analysis, principal component analysis and rough sets in face recognition. Transactions on Rough Sets I. Lecture Notes in Computer Science 3100:392–404, Springer, Berlin
40. UCI Repository: http://www.ics.uci.edu./~mlearn/databases/
41. Wojdyłło P (2004) WaRS: A method for signal classification. In: Pal SK, Polkowski L, Skowron A (eds.) Rough-Neural Computing. Techniques for Computing with Words. Springer, Berlin, 649–688
42. Wojna A (2005) Analogy-based reasoning in classifier construction. Transactions on Rough Sets IV. Lecture Notes in Computer Science 3700:277–374, Springer, Berlin
43. Wróblewski J (2004) Adaptive aspects of combining approximation spaces. In: Pal SK, Polkowski L, Skowron A (eds.) Rough-Neural Computing. Techniques for Computing with Words. Springer, Berlin, 139–156
44. Zadeh LA (1979) Fuzzy sets and information granularity. In: Gupta M, Ragade R, Yager RR (eds.) Advances in Fuzzy Set Theory and Applications. North-Holland, Amsterdam, 3–18