Minimal Decision Rules Based On The Apriori Algorithm: Int. J. Appl. Math. Comput. Sci., 2001, Vol.11, No.3, 691–704
Based on rough set theory, many algorithms for rule extraction from data have been proposed. Decision rules can be obtained directly from a database, but some condition values in a rule produced this way may be unnecessary. Such values can then be eliminated to create a more comprehensible (minimal) rule. Most of the algorithms proposed to calculate minimal rules are based on rough set theory or machine learning. In our approach, we apply the Apriori algorithm in a post-processing stage to reduce the decision rules obtained through rough sets. The set of dependencies thus obtained helps us discover irrelevant attribute values.
1. Introduction
One way to construct a simpler model computed from data, easier to understand and with more predictive power, is to create a set of simplified rules (Shan, 1995). A simplified rule (also referred to as a minimal rule or a kernel rule) is one in which the number of conditions in its antecedent is minimal. Thus, when dealing with decision rules, some condition values can be unnecessary and can be dropped to generate a simplified rule preserving the essential information. In (Pawlak, 1991) an approach to simplify decision tables is presented. It consists of three steps (a toy sketch follows the list):
• computation of reducts of condition attributes,
• elimination of duplicate rows,
• elimination of superfluous values of attributes.
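To make these steps concrete, the following toy brute-force sketch in Python (all data and names are illustrative; the third step is omitted for brevity) enumerates every subset of condition attributes to find the reducts, then projects the table onto each reduct and drops duplicate rows:

from itertools import combinations

# Toy decision table: each row maps attribute names to values; "d" is the decision.
ROWS = [
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
    {"a": 0, "b": 1, "c": 0, "d": "no"},
    {"a": 0, "b": 0, "c": 1, "d": "no"},
]
CONDITIONS = ("a", "b", "c")

def consistent(rows, attrs):
    """True if rows that agree on attrs also agree on the decision."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if seen.setdefault(key, r["d"]) != r["d"]:
            return False
    return True

def reducts(rows, conditions):
    """Step 1: all minimal consistency-preserving attribute subsets
    (brute force, scanning all 2^|conditions| subsets)."""
    found = []
    for k in range(1, len(conditions) + 1):
        for subset in combinations(conditions, k):
            if consistent(rows, subset) and \
               not any(set(r) <= set(subset) for r in found):
                found.append(subset)
    return found

for red in reducts(ROWS, CONDITIONS):
    # Step 2: project onto the reduct and eliminate duplicate rows.
    unique = {tuple(r[a] for a in red) + (r["d"],) for r in ROWS}
    print(red, sorted(unique))

The exhaustive subset scan makes the exponential cost of the first step plain, which is precisely the practical problem discussed next.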
This approach to the problem is not very useful from the practical point of view, because both the computation of reducts and the elimination of superfluous values are NP-hard.
Many algorithms and methods have been proposed and developed to generate
minimal decision rules. From the point of view of algorithms, some of them are based
on inductive learning (Grzymala-Busse, 1993; Michalski et al., 1986; Quinlan, 1986),
and some others are based on rough set theory (Bazan, 1996; 1998; Shan, 1995; Shan
and Ziarko, 1994; 1995; Skowron, 1995; Stefanowski, 1998). Some of the procedures associated with rough set theory are computationally complex. Thus, rough set theory has been applied to data mining problems in combination with other knowledge extraction methods.
Shan (1995) proposes and develops a systematic method for computing all mini-
mal rules, called maximally general rules, based on decision matrices. Based on rough
set and Boolean reasoning, Bazan (1996) proposes a method to generate decision rules
using the concept of dynamic reducts. Dynamic reducts are stable reducts of a given
decision table that appear frequently in random samples of a decision table. Skowron
(1995) proposes a method that, when applied to consistent decision tables, makes
it possible to obtain minimal decision rules. Based on the relative discernibility matrix
notion, the method enables us to calculate the set of all relative reducts of a decision
table. An incremental learning algorithm for computing the set of all minimal decision
rules based on the decision matrix method is proposed in (Shan and Ziarko, 1995).
On the other hand, some efforts have been made to integrate different data mining
tasks for obtaining better rules. In (Choobineh et al., 1997), an approach integrating fuzzy logic and modified rough sets is proposed in order to classify new objects into one of the different decision categories when this classification cannot be made perfectly. Świniarski (1998a; 1998b) proposes the integration of rough sets with statistical
methods and neural networks. In the former proposal, Principal Component Analysis
(PCA) and Bayesian inference are integrated with rough sets for data mining applied
to breast cancer detection. PCA is applied for feature extraction and reduction. Rough
set theory is used to find reduct sets from patterns obtained as a result of PCA. In
the latter, rough sets and neural networks are combined for data mining and pattern
recognition and then implemented as a rough-neural system. Rough sets are used to
reduce the size of a knowledge base, while neural networks are used for classification.
Furthermore, in (Kosters et al., 1999) a method to extract clusters based on the Apriori algorithm is proposed. The method considers the highest possible order association rules. One of the rules with the highest confidence is selected, and all the customers that bought the items in the rule's antecedent constitute a first cluster. The customers in this cluster are removed from the original data set and the process is iterated, leading to a hierarchical clustering. The stopping criterion uses either a maximum-number-of-clusters parameter or a minimum support threshold for the generated rules. Another combined algorithm based on both the Apriori algorithm and rough
set theory can be found in (Lin, 1996).
Kryszkiewicz (1998a; 1998b; Kryszkiewicz and Rybinski, 1998) proposed a new
algorithm which mines very large databases based on association rule generation.
This algorithm is a modification of the Apriori algorithm (Agrawal et al., 1993).
2. Preliminaries
2.1. Rough Set Theory
The original rough set model was proposed by Pawlak (1991). This model is concerned with the analysis of deterministic data dependencies. According to Ziarko (1993), rough set theory deals with the discovery, representation and analysis of data regularities. In this model, objects are classified into indiscernibility classes based on (attribute, value) pairs.
Let OB be a non-empty set called the universe, and let IND be an equivalence relation over the universe OB, called the indiscernibility relation, which represents a classification of the universe into classes of objects which are indiscernible or identical in terms of the knowledge provided by the given attributes. The main notion in rough set theory is that of the approximation space, which is formally defined as

A = (OB, IND). (1)
The equivalence classes of the relation are also called the elementary sets. Any fi-
nite union of elementary sets is called a definable set. Let us take X ⊆ OB
which represents a concept. It is not always the case that X can be defined ex-
actly as the union of some elementary sets. That is why two new sets are defined: $\underline{Apr}(X) = \{o \in OB \mid [o] \subseteq X\}$ will be called the lower approximation, and $\overline{Apr}(X) = \{o \in OB \mid [o] \cap X \neq \emptyset\}$ will be called the upper approximation. Any set
defined in terms of its lower and upper approximations is called a rough set.
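As a quick illustration, a minimal Python sketch (with a made-up universe and an illustrative indiscernibility key) computes the elementary sets and the two approximations exactly as defined above:

from collections import defaultdict

def elementary_sets(objects, key):
    """Partition the universe OB into the equivalence classes of IND."""
    classes = defaultdict(set)
    for o in objects:
        classes[key(o)].add(o)
    return classes.values()

def approximations(objects, key, X):
    """Lower and upper approximations of a concept X ⊆ OB."""
    lower, upper = set(), set()
    for cls in elementary_sets(objects, key):
        if cls <= X:              # [o] ⊆ X
            lower |= cls
        if cls & X:               # [o] ∩ X ≠ ∅
            upper |= cls
    return lower, upper

OB = set(range(6))                # toy universe
X = {0, 1, 2}                     # the concept to approximate
# Objects are indiscernible here when they share the same parity.
print(approximations(OB, key=lambda o: o % 2, X=X))   # (set(), {0, 1, 2, 3, 4, 5})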
In what follows, a data table is described by the following elements (a minimal Python representation is sketched below):
• OB is a set of objects,
• AT is a set of attributes,
• f : OB × AT → V is the information function assigning a value to every (object, attribute) pair.
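A toy rendering of these elements, with purely illustrative objects, attributes and values; later sketches reuse this value function:

# A toy data table (OB, AT, f); every name and value here is made up.
OB = ["o1", "o2", "o3"]
AT = ["colour", "size", "d"]          # "d" plays the role of the decision attribute

f_table = {
    ("o1", "colour"): "red",  ("o1", "size"): "big",   ("o1", "d"): "yes",
    ("o2", "colour"): "red",  ("o2", "size"): "small", ("o2", "d"): "no",
    ("o3", "colour"): "blue", ("o3", "size"): "big",   ("o3", "d"): "yes",
}

def value(o, a):
    """The information function f : OB × AT → V."""
    return f_table[(o, a)]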
The Apriori algorithm proceeds in two phases: find all large itemsets for a given minsup, and then compute rules for a given minconf based on the itemsets obtained before.
The Apriori algorithm for finding all large itemsets makes multiple passes over the database. In the first pass, the algorithm counts item occurrences to determine the large one-item sets. Each of the remaining passes consists of two steps: first, the large itemsets L_{k−1} found in the (k−1)-th pass are used to generate the candidate itemsets C_k; next, all those candidates which have some (k−1)-subset that is not in L_{k−1} are deleted, yielding C_k. Once the large itemsets are obtained, rules of the form a → (l − a) are computed, where a ⊂ l and l is a large itemset. A minimal sketch of both phases is given below.
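The sketch is a compact, unoptimized Python rendering of both phases; minsup is taken as an absolute count and the transactions are illustrative:

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search for large itemsets, as described above."""
    items = {frozenset([i]) for t in transactions for i in t}
    large = {s for s in items if sum(s <= t for t in transactions) >= minsup}
    all_large = set(large)
    k = 2
    while large:
        # Join step: build candidate k-itemsets from the large (k-1)-itemsets...
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # ...prune step: drop candidates with a (k-1)-subset outside L_{k-1}.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        large = {c for c in candidates
                 if sum(c <= t for t in transactions) >= minsup}
        all_large |= large
        k += 1
    return all_large

def rules(all_large, transactions, minconf):
    """Rules a → (l − a) for every large itemset l and non-empty a ⊂ l."""
    def support(s):
        return sum(s <= t for t in transactions)
    return [(a, l - a)
            for l in all_large
            for r in range(1, len(l))
            for a in map(frozenset, combinations(l, r))
            if support(l) / support(a) >= minconf]

T = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(rules(apriori(T, minsup=2), T, minconf=0.8))    # {a}→{b} and {b}→{a}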
➀ The cardinality of the ruleset is the number of rules it contains. When dealing with real-world problems, many algorithms tend to generate a large number of very specialized rules, each of which describes a very small set of the training cases (i.e., has a very low support).
➁ Some of the rules in the ruleset have redundant conditions in their antecedents, allowing one or more conditions to be removed from the antecedent while keeping the meaning and classification power of the rule.
Rule matching: Let OB be the set of objects and R be a ruleset. We define rule matching as a partial function Match : OB × R → V_d,

$$\mathrm{Match}(o, \mathit{Left} \to \mathit{Right}) = \begin{cases} \mathit{Right} & \text{if } \forall (A = v) \in \mathit{Left}:\; f(o, A) = v, \\ \text{undefined} & \text{otherwise.} \end{cases} \tag{2}$$
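Transcribed into Python (continuing the toy table above; the rule shown is illustrative and None stands for 'undefined'):

def match(o, rule, f):
    """The partial function of eq. (2)."""
    left, right = rule                        # a rule is a (Left, Right) pair
    if all(f(o, a) == v for a, v in left.items()):
        return right                          # every condition A = v holds for o
    return None                               # 'undefined' otherwise

rule = ({"colour": "red", "size": "big"}, ("d", "yes"))
# match("o1", rule, value) -> ("d", "yes");  match("o2", rule, value) -> None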
Ruleset equivalence: Let OB be the set of objects. Then two rulesets, R and R′, are defined as equivalent according to D if

R ∼_D R′ ⟺ Cov(R, D, OB) = Cov(R′, D, OB). (12)
Rule equivalence: Let OB be the set of objects. Then two rules, r and r′, are defined as equivalent if

$$r \sim r' \iff \forall o \in OB: \begin{cases} \mathrm{Match}(o, r) = \mathrm{Match}(o, r'), \text{ or} \\ \text{both } \mathrm{Match}(o, r) \text{ and } \mathrm{Match}(o, r') \text{ are undefined.} \end{cases} \tag{13}$$
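Continuing the same sketch, both notions translate into small predicates; note that Cov is read here simply as the set of objects matched by some rule, which is an assumption, since the exact definition of Cov falls outside this excerpt:

def rules_equivalent(r1, r2, objects, f):
    """Eq. (13): identical Match on every object (None == None covers the
    case where both are undefined)."""
    return all(match(o, r1, f) == match(o, r2, f) for o in objects)

def coverage(ruleset, objects, f):
    """The objects matched by at least one rule of the ruleset."""
    return {o for o in objects
            if any(match(o, r, f) is not None for r in ruleset)}

def rulesets_equivalent(R1, R2, objects, f):
    """Eq. (12): equivalence according to D via equality of coverage sets."""
    return coverage(R1, objects, f) == coverage(R2, objects, f)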
➀ If R is the original ruleset, the first objective is to find a new ruleset R′ with the restriction R ∼_D R′. The new ruleset should have fewer rules than the previous one (|R′| < |R|). This can be done in two different ways, namely by deleting redundant rules, or by combining a subset of the ruleset into a reduced subset:
• Redundant rule deletion: r ∈ R is a redundant rule if

$$\forall o \in OB\ \exists r' \in R,\, r' \neq r: \begin{cases} \mathrm{Match}(o, r) = \mathrm{Match}(o, r'), \text{ or} \\ \text{both } \mathrm{Match}(o, r) \text{ and } \mathrm{Match}(o, r') \text{ are undefined.} \end{cases} \tag{14}$$
With this definition, it is easy to prove that R ∼_D R − {r}. Indeed, suppose that R and R − {r} are not equivalent. Then

Cov(R, D, OB) ≠ Cov(R − {r}, D, OB). (15)
This definition can be used to prove that R ∼_D ((R − R′) ∪ R″). Indeed, the equivalence relationship between rulesets is calculated based on the coverage sets:

R ∼_D ((R − R′) ∪ R″) ⟺ Cov(R, D, OB) = Cov((R − R′) ∪ R″, D, OB). (26)

Using the distributive property of the coverage over the ruleset union and difference, we see that R ∼_D ((R − R′) ∪ R″) iff

Cov(R, D, OB) = (Cov(R, D, OB) − Cov(R′, D, OB)) ∪ Cov(R″, D, OB). (27)

Since R″ covers the same objects as R′, i.e. Cov(R″, D, OB) = Cov(R′, D, OB), this becomes

Cov(R, D, OB) = (Cov(R, D, OB) − Cov(R′, D, OB)) ∪ Cov(R′, D, OB), (28)

Cov(R, D, OB) = Cov(R, D, OB). (29)
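A literal Python reading of eq. (14), continuing the sketch above, deletes redundant rules one at a time; since each deletion preserves the coverage set, R ∼_D R − {r} holds at every step:

def redundant(r, ruleset, objects, f):
    """Eq. (14): on every object, some other rule yields the same Match
    value (None == None covers the jointly undefined case)."""
    others = [s for s in ruleset if s is not r]
    return all(any(match(o, s, f) == match(o, r, f) for s in others)
               for o in objects)

def drop_redundant(ruleset, objects, f):
    """Greedily delete redundant rules until none is left."""
    kept = list(ruleset)
    i = 0
    while i < len(kept):
        if redundant(kept[i], kept, objects, f):
            del kept[i]       # coverage is preserved, so restart the scan
            i = 0
        else:
            i += 1
    return kept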
➁ The second objective deals with the reduction of the rule complexity. For each rule (Left → Right) ∈ R, the left-hand side contains one or more predicate conditions that describe the objects covered by the rule; a value is therefore predicted for all the objects satisfying these conditions. One of these conditions, (A_i = v_i) ∈ Left, is called superfluous if the rule obtained by removing it from Left is equivalent to the original rule.
Input: The algorithm takes a ruleset R_o extracted by applying the positive region procedures of the VPRSM (Variable Precision Rough Set Model) to a decision table. OB is the set of objects successfully classified by R_o using the attribute D ∈ AT as the decision attribute.
Output: A new ruleset R_f that is able to classify the same objects OB.
Algorithm:
➍ Once all the rules in AR have been processed, the remaining rules in R_o are moved to R_f. These rules are called the reduced rules.
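Since the steps preceding ➍ fall outside this excerpt, the following Python sketch is only one plausible reading of the pass, consistent with the description above and with the experimental account later on: decision rules subsumed by a direct a-priori rule (same decision, antecedent contained in the rule's conditions) are deleted, the survivors have superfluous conditions removed, and everything is collected into R_f. The helpers match and rules_equivalent come from the earlier sketches.

def reduce_conditions(rule, objects, f):
    """Drop conditions that prove superfluous on OB: a condition is removed
    when the reduced rule still matches exactly the same objects."""
    left, right = dict(rule[0]), rule[1]
    for a in list(left):
        v = left.pop(a)
        if not rules_equivalent((dict(left), right),
                                ({**left, a: v}, right), objects, f):
            left[a] = v                   # not superfluous; restore it
    return left, right

def postprocess(Ro, AR, objects, f):
    """A hedged sketch of the post-processing pass (not the paper's exact
    algorithm, whose earlier steps are not shown here)."""
    direct = list(AR)                     # rules with the decision on the RHS
    Rf = []
    for left, right in Ro:
        if any(al.items() <= left.items() and ar == right
               for al, ar in direct):
            continue                      # described by a direct a-priori rule
        Rf.append(reduce_conditions((left, right), objects, f))
    return direct + Rf                    # the final ruleset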
❶ All the attributes must be discrete (nominal), because we do not deal with
continuous values.
❷ The dataset should have more than 8 attributes and more than 100 objects. When the positive region is calculated for datasets with either few attributes or few objects, the number of rules and their complexity are less significant than in large datasets. Because our algorithm is aimed at complex datasets, we restrict the cases studied to datasets of these sizes.
Mean support (MS): The mean number of objects successfully classified by the
rules.
Mean rule complexity (MRC): The mean number of conditions on the left-hand
side of the rules.
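Under the simplifying assumption that a match counts as a successful classification (which holds when the input rules are consistent), both measures reduce to a few lines, continuing the earlier sketch:

def mean_support(ruleset, objects, f):
    """MS: mean number of objects classified per rule."""
    per_rule = [sum(match(o, r, f) is not None for o in objects)
                for r in ruleset]
    return sum(per_rule) / len(ruleset)

def mean_rule_complexity(ruleset):
    """MRC: mean number of conditions on the rules' left-hand sides."""
    return sum(len(left) for left, _ in ruleset) / len(ruleset)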
The results obtained are presented in Table 1. The columns are grouped in two blocks:
The first three columns are the values measured when the positive region algorithm is executed. As can be seen, the mean support (MS) is very low, and in most cases only one sample is described by each rule. The positive region rules are also very complex, since all the condition attributes appear in each rule.
The second block of columns belongs to the results extracted by the post-processing algorithm, using as input the rules from the positive region calculation. These results are divided into three groups:
– the rules extracted directly by the Apriori algorithm (association rules with
the decision attribute on the right-hand sides of the rules),
– the original decision rules after the post-processing algorithm; some rules were deleted because they are described by direct a-priori rules, and the others were filtered, removing redundant conditions and yielding rules of lower complexity,
– the measures of both sets of rules (direct a-priori rules and post-processed rules).
As is shown, the ruleset complexity was reduced in terms of both the number of rules
and the number of conditions per rule. Some of the rulesets were reduced to 3.68% of their original size (from 8107 rules to 298), and the number of conditional predicates to 29.09% (from 22.00 to 6.40).
There is also another main difference between the original model and the final ruleset. The rules extracted by any rough set algorithm are independent of one another (no object is described by more than one rule). The rules extracted by our algorithm, and especially the rules derived directly by the Apriori algorithm, may overlap one another, and therefore there may exist objects described by more than one rule. On the one hand, this characteristic alters the mean support measure (MS) in terms of the number of objects classified successfully by direct a-priori rules; on the other hand, this MS is correct for evaluating the strength of the rules in terms of how general they are.
The accuracy measures, like confusion matrices, are not included because our post-processing algorithm does not change this feature of the ruleset. If the association rules are extracted by the Apriori algorithm with 100% accuracy, the deletion of rules and redundant conditions does not modify the predictive power of the model.
The Size parameter in Table 1 defines the maximum itemset size when executing the Apriori algorithm. The selected value depends on the characteristics of each dataset. There is an important trend when different values are used: as the itemset size increases, the number of direct a-priori rules goes up and, consequently, the number of post-processed rules decreases. An example of how the number of rules changes depending on the value of this parameter is presented in Fig. 1.
A very important element in the performance of the algorithm is the number of association rules extracted by the Apriori algorithm. This set of rules is scanned by our algorithm in order to classify the rules according to whether the decision attribute belongs to the left-hand side of the rule, to the right-hand side, or is not present. The size of this set of rules also increases when the itemset size increases (see Fig. 2).
5. Discussion
In this paper, a new approach in which the Apriori algorithm is used to obtain minimal rules has been presented. Knowledge discovery by means of rough sets results in a set of rules all having high confidence but very low support. Moreover, all condition attributes appear in each rule. In order to reduce the set of condition attributes, reduct calculation and other methods have traditionally been used. Our approach is an alternative way of reducing both the number of condition attributes and the number of rules. Besides, the average support of the rules increases with the application of our method. In this paper, we have presented some experimental results using datasets with more than 100 records described by more than 8 attributes (large datasets). The results highlight the fact that in all cases the approach enhances the output of the classification algorithm.
Fig. 1. Number of rules (Apriori direct rules, post-processed rules, and both) vs. the itemset size.
Fig. 2. Number of association rules vs. the itemset size.
The main drawback of our approach is the complexity of the resulting algorithm. It contains two main nested loops: one scans the association rules extracted by the Apriori algorithm, and the other scans the original rules. Nevertheless, let us observe that, when dealing with huge datasets, the original ruleset is neither significant nor comprehensible, so some approach is needed to improve the quality of the rules. As the effectiveness of the algorithm has been demonstrated, we are currently working on improving its efficiency. We are also working on the application of this approach to models obtained by other classification methods.
References
Agrawal R., Imielinski T., Swami A. (1993): Mining association rules between sets of items in
large databases. — Proc. ACM SIGMOD Int. Conf. Management of Data, Washington,
pp.207–216.
Bazan J.G. (1996): Dynamic reducts and statistical inference. — Proc. 5-th Int. Conf.
Information Processing and Management of Uncertainty in Knowledge-Based Systems
IPMU’96, Granada, Spain, pp.1147–1151.
Bazan J.G. (1998): A comparison of dynamic and non-dynamic rough set methods for ex-
tracting laws from decision tables, In: Rough Sets in Knowledge Discovery (Polkowski
L. and Skowron A., Eds.). — Heidelberg: Physica-Verlag, pp.321–365.
Blake C.L. and Merz C.J. (2001): UCI Repository of machine learning databases. —
Irvine, CA: University of California, Department of Information and Computer Sci-
ence, http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Choobineh F., Paule M., Silkker W. and Hashemei R. (1997): On integration of modified rough set and fuzzy logic as classifiers. — Proc. Joint Conf. Information Sciences, North Carolina, pp.255–258.
Grzymala-Busse J.W. (1993): LERS: A system for learning from examples based on rough sets, In: Intelligent Decision Support: Handbook of Applications and Advances of Rough Set Theory (Slowinski R., Ed.). — Dordrecht: Kluwer, pp.3–18.
Kosters W., Marchiori E. and Oerlemans A. (1999): Mining clusters with association rules. — Lecture Notes in Computer Science 1642, Springer, pp.39–50.
Kryszkiewicz M. (1998a): Strong rules in large databases. — Proc. 7-th Int. Conf. Information
Processing and Management of Uncertainty in Knowledge-Based Systems IPMU98,
Paris, Vol.2, pp.1520–1527.
Kryszkiewicz M. (1998b): Fast discovery of representative association rules. — Proc. 1-st
Int. Conf. Rough Sets and Current Trends in Computing, RSCTC’98, Warsaw, Poland,
pp.214–221.
Kryszkiewicz M. and Rybinski H. (1998): Knowledge discovery from large databases using rough sets. — Proc. 6-th Europ. Congr. Intelligent Techniques and Soft Computing EUFIT’98, Aachen, Germany, Vol.1, pp.85–89.
Lin T.Y. (1996): Rough set theory in very large databases. — Proc. Computational Engineering in Systems Applications, CESA’96, Lille, France, Vol.2, pp.936–941.
Michalski R., Carbonell J. and Mitchell T.M. (1986): Machine Learning: An Artificial In-
telligence Approach, Vol. 1. — Palo Alto CA: Tioga Publishing.
Pawlak Z. (1991): Rough Sets—Theoretical Aspects of Reasoning about Data. — Dordrecht:
Kluwer.
Quinlan J.R. (1986): Induction of decision trees. — Mach. Learn., Vol.1, pp.81–106.
Shan N. (1995): Rule discovery from data using decision matrices. — M.Sc. Thesis, University
of Regina.
Shan N. and Ziarko W. (1994): An incremental learning algorithm for constructing decision
rules, In: Rough Sets, Fuzzy Sets and Knowledge Discovery (W. Ziarko, Ed.). — Berlin:
Springer, pp.326–334.
Shan N. and Ziarko W. (1995): Data-base acquisition and incremental modification of clas-
sification rules. — Comput. Intell., Vol.11, No.2, pp.357–369.
Skowron A. (1995): Extracting laws from decision tables: A rough set approach. — Comput.
Intell., Vol.11, No.2, pp.371–387.
Stefanowski J. (1998): On rough set based approaches to induction of decision rules, In: Rough
Sets in Knowledge Discovery (Polkowski L. and Skowron A., (Eds.). — Heidelberg:
Physica-Verlag, pp.500–529.
Świniarski R. (1998a): Rough sets and Bayesian methods applied to cancer detection. — Proc.
Rough Sets and Current Trends in Computing, RSCTC’98, Lecture Notes in Artificial
Intelligence 1424, Berlin, pp.609–616.
Świniarski R. (1998b): Rough sets and neural networks application to handwritten character
recognition by complex Zernike moments. — Proc. Rough Sets and Current Trends in
Computing, RSCTC’98, Lecture Notes in Artificial Intelligence 1424, Berlin, pp.616–
624.
Ziarko W. (1993): The discovery, analysis, and representation of data dependencies in databases. — Proc. Knowledge Discovery in Databases, KDD-93, Washington, pp.195–209.