Boosting Fuzzy Rules in Classification Problems Under Single-Winner Inference
Abstract
In previous studies, we have shown that an Adaboost-based fitness can be successfully combined with a Genetic Algorithm to iteratively learn fuzzy rules from examples in classification problems. Unfortunately, some restrictive constraints on the implementation of the logical connectives and the inference method were assumed: the knowledge bases that Adaboost produces are only compatible with an inference based on the maximum sum of votes scheme, and they can only use the product t-norm to model the 'and' operator.
This design is not optimal from the point of view of linguistic interpretability. Using the sum to aggregate votes allows many rules to be combined when the class of an example is decided; since the contribution of an individual rule to the knowledge base cannot be easily isolated, the fuzzy rules produced by Adaboost may be difficult to understand linguistically.
From this point of view, single-winner inference would be a better choice, but adopting it means dropping some non-trivial hypotheses. In this work we introduce our first results in the search for a boosting-based genetic method able to learn weighted fuzzy rules that are compatible with this inference method.
1 Introduction
The first application of a boosting algorithm to the learning of fuzzy classifiers was given in [15]. That work proposed combining a search algorithm with a fitness function taken from Real Adaboost to incrementally learn descriptive fuzzy rules from examples in classification problems. In subsequent works, approximate rules [12] were also learned. A comprehensive description of the use of boosting in fuzzy classifiers is given in [4].
In [21][23], following the work of [7], Adaboost is regarded as a forward stepwise estimation of the statistical parameters defining a logit transform of a Generalized Additive Model, and this property is used to extend that estimation to learn fuzzy models in regression problems. A similar statistical interpretation has later been used to improve the fuzzy Adaboost algorithm, again in classification problems. Adaboost was regarded as an application of the same forward stepwise procedure, called "Matching Pursuit" in works related to signal theory [17][25], and an instance of the matching pursuit algorithm was successfully used to extend the Logitboost algorithm to learn descriptive fuzzy rules in classification problems [19], solving some difficulties that the Adaboost algorithm poses in multiclass problems.
Although all these methods are fast and produce accurate classifiers and models, they share a common problem: their output has a low degree of interpretability. This is rooted in their very definition. Fuzzy systems can only be compared to Generalized Additive Models when the sum of votes scheme [16] is adopted. But the use of the sum to aggregate rules allows the existence of rules that have no linguistic meaning by themselves, only when combined with other, overlapping ones. In other words, one cannot isolate the contribution of a single rule to the fuzzy classifier; the rules can be thought of as weights in a neural network.
A better inference method, in terms of linguistic interpretability, is the "single winner" one [14]. This mechanism is compatible with the idea of a fuzzy rule being an imprecise assertion, which states that all patterns in a given fuzzy region belong to the same class. But single winner inference does not combine the votes of the rules with the arithmetic sum, but with the maximum operator. Apparently, this prevents us from using the analogy between fuzzy classifiers and additive models on which fuzzy Adaboost depends. However, we will show later in this paper that the problem can be reformulated so that a matching pursuit, with a prefitting stage, can be used to solve it.
1.1 Summary
The structure of this paper is as follows: in the next section, fuzzy classifiers are intro-
duced and it is explained how boosting can be applied to induce them from data. Then,
it is explained how single winner inference can be expressed in terms of additive mod-
els, and a new algorithm is proposed. The paper finishes with an empirical evaluation
of the new algorithm and some preliminary numerical results.
classes. In other words, for every antecedent A^j we may have up to p numbers between 0 and 1 that represent our degree of knowledge about the assertion "All elements in the fuzzy set A^j belong to class number k". Values near 1 mean "high confidence," and values near 0 mean "absence of knowledge about the assertion."
In practical cases, we work with antecedents A^j that can be decomposed into a Cartesian product of fuzzy sets defined over each feature, A^j = A^j_1 × A^j_2 × ... × A^j_n, thus the rules are

if x_1 is A^j_1 and ... and x_n is A^j_n
then truth(c_1) = s^j_1 and ... and truth(c_p) = s^j_p.
We can restrict the definition further by defining n linguistic variables (one linguistic variable for every feature) and requiring that all the sets A^j_f in the antecedents be associated with one linguistic term of their corresponding linguistic variable. In this case, we obtain a fuzzy rule-based descriptive classifier. If we do not apply the latter restriction, we obtain an approximate classifier. This work deals with descriptive classifiers.
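To make the structure of a descriptive classifier concrete, the following minimal Python sketch represents linguistic partitions and rules whose antecedents are tuples of term indices, with the matching degree computed by the product t-norm. The triangular memberships, the uniform partitions and every identifier are illustrative assumptions, not the implementation used in this paper.

import numpy as np

def triangular(x, left, center, right):
    """Triangular membership function on a single feature."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def make_partition(lo, hi, n_terms=3):
    """Uniformly spaced linguistic terms (e.g. LOW/MED/HIGH) on [lo, hi]."""
    centers = np.linspace(lo, hi, n_terms)
    step = centers[1] - centers[0]
    return [(c - step, c, c + step) for c in centers]

def antecedent_membership(x, rule_terms, partitions):
    """Product t-norm of the memberships of the terms selected by the rule."""
    degree = 1.0
    for feature, term in enumerate(rule_terms):
        left, center, right = partitions[feature][term]
        degree *= triangular(x[feature], left, center, right)
    return degree

# Example: 2 features partitioned into {LOW, MED, HIGH}, one rule whose
# antecedent is "x1 is HIGH and x2 is LOW" (terms are indices into the partitions).
partitions = [make_partition(0.0, 1.0), make_partition(0.0, 1.0)]
rule = (2, 0)
print(antecedent_membership(np.array([0.9, 0.1]), rule, partitions))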
where "∧" and "∨" can be implemented by different operators. "∧" is always a t-norm, usually the minimum or the product. In this work, we have chosen the product.
Selecting an implementation of the "∨" operator is not as immediate. Fuzzy Adaboost relies on the "sum of votes" scheme [14], for reasons explained in [4]. It was mentioned in the introduction that this scheme may be criticized on interpretability grounds. We are interested instead in defining "∨" to be the maximum operator [16]. Observe that, in this case, all the terms s^j_k in rule number j but the maximum one can be removed without affecting the output of the classifier, which will then be formed by rules of the form

if x_1 is A^j_1 and ... and x_n is A^j_n
then truth(c_{q(j)}) = s^j_{q(j)}

where q(j) = arg max_k s^j_k. There is an alternate linguistic expression for this rule:
that allow us to transform the classification problem into a regression one, that of estimating the conditional expectations E(y_k|x) = P(class(x) = c_k), which is solved as shown in the next subsection.
E(y|x) = β_0 + β_1 x_1 + ... + β_n x_n    (3)

r_j = β_j r(x; γ_j)    (6)
Generalized additive models encompass many machine learning algorithms. For example, in radial basis neural networks the functions r(x, γ_j) = exp{−||x − γ_j||²} are the "basis functions"; the γ_j are their centers and the β_j are the weights that connect the hidden layer with the output. In support vector machines, r(x, γ_j) is a kernel, and the γ_j are the support vectors. In our case, we will propose a model where r(x, γ_j) is an expression that contains the membership A^j of the antecedent of the j-th fuzzy rule, γ_j identifies the linguistic terms that participate in the rule, and β_j is the weight of the rule (which we called s^j_{q(j)} before). The precise expression will be made clear later.
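As an illustration of this common structure, the short sketch below evaluates a generic additive model g(x) = β_0 + Σ_j β_j r_j(x) with Gaussian basis functions; in the fuzzy case each r_j(x) would instead be the antecedent membership A^j(x). The Gaussian width and all names are assumptions made only for the example.

import numpy as np

def rbf_basis(center):
    """Returns r(., gamma): a unit-width Gaussian bump centered at gamma."""
    return lambda x: np.exp(-np.sum((x - center) ** 2))

def additive_model(x, beta0, betas, bases):
    """g(x) = beta_0 + sum_j beta_j * r_j(x), for any choice of basis functions."""
    return beta0 + sum(b * r(x) for b, r in zip(betas, bases))

# Two Gaussian basis functions; in the fuzzy case r_j(x) would be A^j(x).
bases = [rbf_basis(np.array([0.0, 0.0])), rbf_basis(np.array([1.0, 1.0]))]
print(additive_model(np.array([0.5, 0.5]), 0.0, [1.0, -0.5], bases))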
2.2.2 Backfitting and the Logitboost Algorithm
Extended additive models are usually estimated by maximum likelihood. The numerical techniques involved can be very different, but they all share a common objective: given a cost function d that measures the differences between the conditional expectation and its approximation, the learning algorithm consists in finding N pairs of values {β_j, γ_j}, each of them minimizing

E[ d( y, Σ_{α=1...N, α≠j} β_α r(x; γ_α) + β r(x; γ) ) ]    (8)

with respect to β, γ [7]. We are interested in a greedy learning algorithm that finds {β_1, γ_1} first, then {β_2, γ_2}, and so on.
Algorithms that learn a weighted sum of basis functions, by sequentially appending functions to an initially empty basis so as to approximate a target function in the least-squares sense, belong to the family of matching pursuit algorithms [17]. These algorithms have been compared to those used to learn support vector machines [27] and radial basis neural networks in machine learning problems [25], and also to the Genetic Iterative Learning of fuzzy rules [4].
One of the most interesting properties of matching pursuit algorithms is that they are good at preserving the sparsity of the solution; this explains the good generalization properties of the methods listed above. We will also see in the following sections that the same property guarantees a small number of rules in the fuzzy case that will be described later.
As we mentioned in the preceding section, the objective of a binary classification
problem is to approximate the value E(y|x) = p(c = 1|x), which we will abbrevi-
ate by p(x). The response variable in a classification problem follows the binomial
distribution, whose corresponding extended additive model is [10]
log[ p(class(x) = 1) / p(class(x) = 0) ] = g(E(y|x)) = r_0 + β_1 r(x, γ_1) + ...    (9)

and the output of the model, reversing the logistic transform, is

p(x) = e^{g(E(y|x))} / (1 + e^{g(E(y|x))})    (10)
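For reference, eqs. 9-10 amount to the usual logit link and its inverse; a direct transcription in Python might look as follows (names are illustrative).

import numpy as np

def inverse_logit(g):
    """Eq. 10: recover p(x) from the additive predictor g(E(y|x))."""
    return np.exp(g) / (1.0 + np.exp(g))

def logit(p):
    """Left-hand side of eq. 9: log-odds of class 1 versus class 0."""
    return np.log(p / (1.0 - p))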
If the greedy version of generalized backfitting (the "matching pursuit" algorithm), also mentioned in the preceding subsection, is applied to this model, the Logitboost algorithm [7] is obtained. At each iteration, an adjusted dependent variable is fitted by weighted least squares. The adjusted variable that arises from the binomial distribution is

z = g_0(E(y|x)) + (1_{y=1} − p(x)) / (p(x)(1 − p(x)))    (11)

This response depends on the values p(x), defined by means of eq. 10. g_0(E(y|x)) is the output of the additive model at the previous iteration, as defined in eq. 9. When the method is particularized to learn fuzzy rules, g_0(E(y|x)) is the output of the fuzzy rule base before the new rule is added, and the values s^j_k and the membership function A^j are chosen to minimize

fitness(A^j) = Σ_{i=1...n} p(x_i)(1 − p(x_i)) [ s^j_k · A^j(x_i) − (d(x_i) − p(x_i)) / (p(x_i)(1 − p(x_i))) ]²    (12)
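The following sketch evaluates the fitness of eq. 12 for one candidate antecedent, together with the weighted least-squares weight that minimizes it. The vectorized layout and the identifier names are assumptions; only the arithmetic follows eq. 12.

import numpy as np

def rule_fitness(membership, d, p, s):
    """Eq. 12: sum_i w_i * (s * A^j(x_i) - (d_i - p_i)/w_i)^2 with w_i = p_i(1 - p_i)."""
    w = p * (1.0 - p)
    z = (d - p) / w                      # working response of Logitboost
    return np.sum(w * (s * membership - z) ** 2)

def best_weight(membership, d, p):
    """Weighted least-squares value of s that minimizes eq. 12."""
    w = p * (1.0 - p)
    z = (d - p) / w
    return np.sum(w * z * membership) / np.sum(w * membership ** 2)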
f_{i0k} = 0
For step number j = 1, ..., N
  For class number k = 1, ..., p
    for i = 1, ..., n do p_{ijk} = e^{f_{i,j-1,k}} / (1 + e^{f_{i,j-1,k}})
    for i = 1, ..., n do w_{ijk} = p_{ijk}(1 − p_{ijk})
    Find with a Genetic Algorithm the antecedent A^j that minimizes
      fitness(A^j) = Σ_{i=1...n} w_{ijk} [ s^j · A^j(x_i) − (y_{ik} − p_{ijk}) / w_{ijk} ]²
and are searched by means of a genetic algorithm. An outline of the adaptation of this
method to learn fuzzy rules, as described in [19], is shown in Figure 1.
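A runnable approximation of the loop in Figure 1 for a binary problem is sketched below. The genetic search is replaced by a plain scan over a list of candidate antecedents, so only the boosting arithmetic mirrors the figure; the candidate generator and all names are assumptions.

import numpy as np

def fuzzy_logitboost(X, y, candidate_antecedents, n_rules=5):
    """X: (n, d) inputs, y: 0/1 labels, candidate_antecedents: list of callables A(x)."""
    n = len(y)
    f = np.zeros(n)                        # f_{i0} = 0: additive predictor so far
    rules = []
    for _ in range(n_rules):
        p = np.exp(f) / (1.0 + np.exp(f))  # p_{ij} = e^f / (1 + e^f)
        w = p * (1.0 - p)
        z = (y - p) / w                    # adjusted dependent variable
        best = None
        for A in candidate_antecedents:    # stand-in for the genetic search
            m = np.array([A(x) for x in X])
            s = np.sum(w * z * m) / max(np.sum(w * m ** 2), 1e-12)
            fit = np.sum(w * (s * m - z) ** 2)
            if best is None or fit < best[0]:
                best = (fit, A, s, m)
        _, A, s, m = best
        rules.append((A, s))
        f = f + s * m                      # append the new weak learner
    return rules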
3 Proposed algorithm
3.1 Definition of the weak learners
Let us recall eq. 1. We mentioned that an instance x is assigned to the class

arg max_k ⋁_j ( A^j(x) ∧ s^j_k )    (13)

where "∧" and "∨" could be implemented by different operators. In fuzzy boosting, this last expression became

arg max_k Σ_j A^j(x) · s^j_k    (14)

and that allowed us to use the fuzzy memberships A^j of the antecedents as weak learners and to obtain the rule weights s^j_k by means of a boosting algorithm, as shown in Figure 1.
To use single-winner inference, we want to use the max operator instead of the sum. We have to obtain the weights s^j_k of the expression that follows:

arg max_k max_j A^j(x) · s^j_k    (15)
which is not a sum of terms, and therefore not an additive model, so boosting cannot be applied to it directly. But we can define a function

I(x, j) = 1 if j = arg max_j A^j(x) · s^j_k, and 0 elsewhere    (16)

(in words, I(x, j) = 1 if rule number j is the winner when classifying the input x, and 0 if not) and rewrite eq. 15 as

arg max_k Σ_j I(x, j) · A^j(x) · s^j_k    (17)

The rewritten expression is related to eq. 7 as follows. Observe that the products I(x, j) · A^j(x) can be regarded as weak learners, and their weights s^j_k determined by backfitting. Therefore, we have transformed the initial problem into another one, which can be solved after estimating the function I. In the next subsection, we propose an iterative algorithm to complete the task.
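The sketch below illustrates eqs. 16-17 for a single input: it marks the winning rule and assigns the class of eq. 17. It takes the winner to be the rule maximizing A^j(x) times its largest consequent weight, which is one reading of eq. 16; the array layout and names are assumptions.

import numpy as np

def winner_indicator(memberships, S):
    """I(x, j): memberships[j] = A^j(x), S[j, k] = s^j_k. One rule wins, the rest get 0."""
    scores = memberships * S.max(axis=1)
    I = np.zeros_like(scores)
    I[np.argmax(scores)] = 1.0
    return I

def single_winner_class(memberships, S):
    """arg max_k sum_j I(x, j) * A^j(x) * s^j_k  (eq. 17)."""
    I = winner_indicator(memberships, S)
    votes = (I * memberships) @ S          # only the winning rule contributes
    return int(np.argmax(votes))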
We know the values of I(x_i, j) for all the rules in the initial base, at the points of the training set: I(x_i, j) = 1 if rule number j was the winner for example i, and 0 otherwise. Now we want to improve the base by adding the N-th rule. It is clear that some of these values of I will change from 1 to 0 after the new rule is added (otherwise, the new rule would not win at any example of the training set and it would be useless to add it to the base). But I participates in the definition of the weak learners, therefore all values s^j_k must be recalculated every time a rule is added. Clearly, a prefitting version of the matching pursuit algorithm [25] is mandatory for this problem: the consequents of all the rules in the initial base are affected when a new rule is added to it.
If all the values I(x_i, j) and A^j(x_i) were known, the least squares choice of the s^j_k would be a standard linear regression problem, which can be solved by means of a pseudoinverse. Let us suppose that the initial rule base is empty, and define f_{i0k} = 0, p_{i0k} = 1/2 and d_{ik} = 1 if class(x_i) = k, and 0 otherwise. The adjusted dependent variable is

z_{ik} = 0 + (d_{ik} − 1/2) / (1/2 · (1 − 1/2)) = 4(d_{ik} − 1/2)    (20)

and, given a set of N membership functions A^j, the N · p values s^j_k can be found by minimizing

fitness(A^1, ..., A^N) = Σ_{i,k} [ z_{ik} − Σ_j I(x_i, j) · A^j(x_i) · s^j_k ]²    (21)
procedure AddFuzzyRule
Input: a rule base of size N and the antecedent of the new fuzzy rule A^{N+1}
Output: a rule base of size N+1 and a numerical value of fitness
  S_{kj} = s^j_k,  k = 1, ..., p,  j = 1, ..., N
  Initialize S_{k,N+1} at random
  Repeat
    I_0 = I
    For j = 1, ..., N+1, i = 1, ..., m do
      I_{ij} = 1 if rule j wins in example x_i, 0 otherwise
    End For
    F_{ji} = A^j(x_i) · I_{ij},  j = 1, ..., N+1,  i = 1, ..., m
    Z_{ki} = 4(y_k(x_i) − 0.5),  k = 1, ..., p,  i = 1, ..., m
    S_0 = S
    S = Z · F^t · (F · F^t)^{−1}
    S = α·S + (1 − α)·S_0
  Until ||S − S_0|| < ε
  Output S and fitness = ||Z − S · F||
In matrix form, let S = [σ_{kj}], Z = [θ_{ki}] and F = [φ_{ji}], where σ_{kj} = s^j_k, θ_{ki} = z_{ik} and φ_{ji} = I(x_i, j) · A^j(x_i). Then, the least squares solution for S is

S = Z F^t (F · F^t)^{−1}    (22)
We can use a search algorithm (a genetic algorithm, in this case) to find the set of values of A^j and I that minimize eq. 21, but the simultaneous search for A^j and I would not be efficient. We propose instead the recurrent algorithm shown in Figure 2 to do the task. In the next section it will be explained where this function is called from, and an outline of the complete procedure will be given.
Notice that the values of s^j_k are randomly generated first, to obtain an initial guess of I; then the s^j_k are calculated by means of eq. 22. I is estimated again; if the old and new values are the same, the procedure stops. Otherwise, the process is repeated. In practical simulations we have observed that a smoothing term α, as shown in the figure, improves the speed of convergence.
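A compact Python version of the recurrence of Figure 2 is sketched below: it alternates between recomputing the winner indicators I and solving S through a pseudoinverse, with the smoothing term alpha. The winner criterion (largest A^j(x_i) times the largest consequent weight of the rule), the tolerance and alpha are illustrative assumptions.

import numpy as np

def add_fuzzy_rule(memberships, Y, S_init, alpha=0.5, tol=1e-6, max_iter=200):
    """memberships: (N+1, m) with A^j(x_i); Y: (p, m) 0/1 class indicators;
    S_init: (p, N+1) initial weights. Returns (S, fitness)."""
    S = S_init.copy()
    m = memberships.shape[1]
    Z = 4.0 * (Y - 0.5)                                   # Z_{ki} = 4(y_k(x_i) - 0.5)
    for _ in range(max_iter):
        # I_{ij} = 1 if rule j wins in example x_i, 0 otherwise
        scores = memberships * S.max(axis=0)[:, None]     # (N+1, m)
        winners = np.argmax(scores, axis=0)               # winning rule per example
        I = np.zeros_like(scores)
        I[winners, np.arange(m)] = 1.0
        F = memberships * I                               # F_{ji} = A^j(x_i) * I_{ij}
        S_old = S
        S_ls = Z @ F.T @ np.linalg.pinv(F @ F.T)          # S = Z F^t (F F^t)^{-1}
        S = alpha * S_ls + (1.0 - alpha) * S_old          # smoothing step
        if np.linalg.norm(S - S_old) < tol:
            break
    fitness = np.linalg.norm(Z - S @ F)
    return S, fitness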
For step number j = 1, ..., N
  Call AddFuzzyRule from a GA and select the rule base of size j
  with the minimum fitness value.
End For
For j = 1, ..., N
  Make all s^j_k = 0 but the maximum one, s^j_{q(j)}
  Emit the rule "if x is A^j then class = q(j) with confidence [s^j_{q(j)}]"
End For
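The final step above can be sketched as follows: every rule keeps only its largest consequent weight s^j_{q(j)} and is emitted as a single-winner rule. The rule container and the string format are illustrative, not the structures used in our implementation.

import numpy as np

def emit_single_winner_rules(S, antecedent_labels):
    """S: (p, N) learned weights; antecedent_labels: textual antecedent of each rule."""
    rules = []
    for j, label in enumerate(antecedent_labels):
        q = int(np.argmax(S[:, j]))                  # q(j) = arg max_k s^j_k
        confidence = float(S[q, j])
        rules.append(f"if x is {label} then class = {q} with confidence [{confidence:.3f}]")
    return rules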
let {LOW, MED, HIGH} be the linguistic labels of all features in a problem involving three input variables and two classes. The antecedent

x_1 is High and x_2 is Med and x_3 is Low

is codified with the chain 001 010 100. We could use this encoding to represent rules in which not all variables appear in the antecedent, as well as 'OR' combinations of terms in the antecedent. For example, the antecedent of the rule
• Gauss-5: synthetic 5-class problem proposed in [4], comprising 50, 100, 150, 200 and 250 samples drawn from 5 bi-dimensional Gaussian distributions with centers at (0, 0), (−1, −1), (−1, 1), (1, −1), (1, 1), and unit covariance matrices (see the data-generation sketch after this list).
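As referenced in the item above, the Gauss-5 data can be generated with a few lines of numpy; the random seed is an arbitrary choice made only for reproducibility of the sketch.

import numpy as np

def make_gauss5(seed=0):
    """50, 100, 150, 200 and 250 samples from five bi-dimensional Gaussians, unit covariance."""
    rng = np.random.default_rng(seed)
    centers = [(0, 0), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    sizes = [50, 100, 150, 200, 250]
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n, 2))
                   for c, n in zip(centers, sizes)])
    y = np.concatenate([np.full(n, k) for k, n in enumerate(sizes)])
    return X, y

X, y = make_gauss5()
print(X.shape, np.bincount(y))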
5 Concluding Remarks
The advantages of boosting methods when learning fuzzy classifiers are twofold: as far as we know, the size of the rule base is much smaller than that obtained with any other genetic fuzzy classifier, and the learning is very fast (between seconds and minutes for the problems used in this paper). But there are also drawbacks: the inference is not standard, and the quality of the rule base is low, because the interaction between rules is very high.
Figure 4: Boxplots comparing black-box classifiers (linear and quadratic discriminant analysis, 3-layer perceptron, k-nearest neighbors, kernel estimation of densities) and fuzzy rule-based classifiers (Wang and Mendel, Ishibuchi, Pal and Mandal, Genetic Iterative Learning, Random Set based, Fuzzy Adaboost and Max-Fuzzy Boosting). The datasets are Pima, Cancer, Gauss, Glass and Gauss5.
Dataset  LIN    QUA    NEU    KNN    KER    WM     ISH    PM     GIL    KRE    ABD    AMM
pima     0.227  0.251  0.234  0.270  0.313  0.287  0.301  0.464  0.269  0.308  0.255  0.257
cancer   0.044  0.051  0.035  0.048  0.099  0.039  0.096  0.145  0.099  0.221  0.038  0.039
gauss    0.239  0.190  0.194  0.216  0.191  0.329  0.322  0.306  0.205  0.215  0.206  0.200
glass    0.404  -      0.389  0.354  0.621  0.453  0.503  0.647  0.363  0.606  0.522  0.388
gauss5   0.318  0.317  0.321  0.343  0.332  0.410  0.345  0.974  0.338  0.321  0.344  0.337
The high interaction between rules in Adaboost and Logitboost is a consequence of the "sum of votes" inference scheme. The preferred inference method, in terms of linguistic interpretability, is the "single winner" one. This mechanism is compatible with the idea of a fuzzy rule being an imprecise assertion, which states that all patterns in a given fuzzy region belong to the same class. But single winner inference does not combine the votes of the rules with the arithmetic sum, but with the maximum operator, and this hinders the application of boosting algorithms. We have solved the problem by introducing an intermediate function in the definition of the weak learner, and a recurrent algorithm to estimate it. The final algorithm produces fuzzy rule bases of high descriptive quality, while preserving good accuracy. A drawback of this new procedure is its computational complexity, which is higher than that of fuzzy Adaboost and Logitboost.
Acknowledgments
This work was funded by the Spanish Ministry of Science and Technology and by FEDER funds, under grant TIC-04036-C05-05.
References
[1] Alpaydin, E. (1999) “Combined 5x2cv F Test for Comparing Supervised Classification Learning Algorithms,” Neural Computation, 11(8), 1885-1892.
[2] Cordón, O., Del Jesus, M. J., Herrera, F. “A proposal on reasoning methods in
fuzzy rule-based classification systems”. International Journal of Approximate
Reasoning 20(1), pp. 21-45, 1999.
[3] Cordón, O., Herrera, F. (2000) “A proposal for improving the accuracy of linguis-
tic modeling”. IEEE Transactions on Fuzzy Systems, 8, 3, pp. 335-344.
[4] Del Jesus, M. J., Hoffmann, F., Junco, L., Sánchez, L. “Induction of Fuzzy Rule Based Classifiers with Evolutionary Boosting Algorithms”. IEEE Transactions on Fuzzy Systems. Accepted for publication.
[5] Dietterich, T. G. (1998) “Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms”. Neural Computation, 10, 7, pp 1895-1924.
[6] Freund, Y., Schapire, R. (1996) “Experiments with a new boosting algorithm”. In Machine Learning: Proceedings of the 13th International Conference, pp. 148-156.
[7] Friedman, J., Hastie, T., Tibshirani, R. (1998) “Additive Logistic Regression: A
Statistical View of Boosting”. Machine Learning.
[8] Gonzalez, A., Perez, R. (1996) “Completeness and consistency conditions for
learning fuzzy rules,” Fuzzy Sets and Systems, vol 96, pp 37-51.
[9] Hand, D. J. Discrimination and Classification. Wiley. 1981
[10] Hastie, T. J., Tibshirani, R. (1986) “Generalized Additive Models”. Statistical
Science, 1, pp 297-318
[11] Haykin, S. Neural Networks. Prentice Hall, 1999.
[12] Hoffmann, F. “Boosting a Genetic Fuzzy Classifier”. In Proc. Joint 9th IFSA World Congress and 20th NAFIPS International Conference, vol. 3, (Vancouver, Canada), pp. 1564-1569, July 2001.
[13] Ishibuchi, H., “Distributed representation of fuzzy rules and its application to
pattern classification”, Fuzzy Sets and Systems 52, pp. 21-32, 1992.
[14] Ishibuchi, H., Nakashima, T., Morisawa, T. “Voting in fuzzy rule-based systems
for pattern classification problems,” Fuzzy Sets and Systems, vol 103, no 2, pp.
223-239, 1999.
[15] Junco, L., Sánchez, L. “Using the Adaboost algorithm to induce fuzzy rules in classification problems”. Proc. ESTYLF 2000, Sevilla, pp 297-301.
[16] Kuncheva, L. I. Fuzzy Classifier Design. Springer-Verlag, NY, 2000.
[17] Mallat, S., Zhang, Z. (1993) “Matching pursuits with time-frequency dictionaries”. IEEE Trans. on Signal Processing 41, pp 3397-3415.
[18] Merz, C. J., Murphy, P.M. (1998). UCI repository of machine learning databases.
Available at: http://www.ics.uci.edu/mlearn/MLRepository.html.
[19] Otero, J., Sánchez, L. Induction of descriptive fuzzy classifiers with the Logit-
boost algorithm. Submitted to Soft Computing.
[20] Pal, S. K., Mandal, D. P. “Linguistic recognition system based in approximate
reasoning”. Information Sciences 61, pp. 135-161. 1992.
[21] Sánchez, L. “A Fast Genetic Method for Inducting Linguistically Understandable
Fuzzy Models”. Proc. IFSA NAFIPS, 2001.
[22] Sánchez, L., Casillas, J., Cordón, O., del Jesus, M. J. (2002) “Some relationships
between fuzzy and random classifiers and models”. International Journal of Ap-
proximate Reasoning 29, 175-213.
[23] Sánchez, L., Otero, J. “A fast genetic method for inducting descriptive fuzzy models”. Fuzzy Sets and Systems. Accepted for publication.
[24] Schapire, R., Singer, Y. Improved Boosting Algorithms Using Confidence-rated
Predictions. Machine Learning 37(3): 297-336. 1999
[25] P. Vincent and Y. Bengio, (2002) “Kernel Matching Pursuit”, Machine Learn-
ing Journal, Special Issue on New Methods for Model Combination and Model
Selection.
[26] Wang, L. X., Mendel, J. (1992) “Generating fuzzy rules by learning from exam-
ples”. IEEE Trans. on Systems, Man and Cybernetics, 25, 2, pp. 353-361.
[27] Zhu, J., Hastie, T. (2001) “Kernel Logistic Regression and the Import Vector
Machine”. Proc. NIPS 2001, Vancouver, Canada.