
Stacking MF Networks to Combine the Outputs Provided by RBF Networks

Joaquín Torres-Sospedra, Carlos Hernández-Espinosa, and Mercedes Fernández-Redondo

Departamento de Ingeniería y Ciencia de los Computadores, Universitat Jaume I,
Avda. Sos Baynat s/n, C.P. 12071, Castellón, Spain
{jtorres, espinosa, redondo}@icc.uji.es

Abstract. Radial Basis Functions (RBF) networks are successfully applied to classification problems and can be trained by gradient descent algorithms, so the performance of a single RBF network can be increased by using an ensemble of RBF networks. A review of the literature shows that the performance of ensembles of Multilayer Feedforward (MF) networks can be improved by the two combination methods based on Stacked Generalization described in [1]. We expect that applying these combiners to an RBF ensemble produces a better classification system. In this paper we successfully apply these two methods, Stacked and Stacked+, to ensembles of RBF networks. Increasing the number of networks used in the combination module is also successfully proposed in this paper. The results show that training 3 MF networks to combine an RBF ensemble is the best alternative.

1 Introduction
A Radial Basis Functions (RBF) network is a commonly applied architecture which
is used to solve classification problems. This network can also be trained by gradient
descent [2,3], so with fully supervised training it can be an element of an ensemble of neural networks. Previous comparisons showed that RBF networks performed better than Multilayer Feedforward (MF) networks, and that the Simple Ensemble was the best method to train an ensemble of RBF networks.
Among the methods of combining the outputs of an ensemble of neural networks,
the two most popular are the Majority voting and the Output average [4]. Recently, two new combination methods based on the idea of Stacked Generalization, called Stacked and Stacked+, were successfully proposed in [1]. These methods consist of training a single MF network to combine the outputs of the ensemble.
In this paper, we want to apply these combiners to an ensemble of Radial Basis Functions networks trained with the Simple Ensemble. Moreover, we want to increase the number of networks used to combine the ensemble in our experiments in order to get a better combiner. Finally, we compare the results obtained with these two new combiners with the results we previously obtained with other classical combiners.
To test Stacked and Stacked+ with RBF ensembles, we have selected nine databases
from the UCI repository. This paper is organized as follows. The concepts related to ensemble training and combination are briefly described in Section 2, whereas the experimental setup, results and discussion appear in Section 3.

J. Marques de Sá et al. (Eds.): ICANN 2007, Part I, LNCS 4668, pp. 450–459, 2007.
© Springer-Verlag Berlin Heidelberg 2007

2 Theory

2.1 Training a Neural Network


There are conceptual similarities in the training process of an RBF network and an MF
network since a gradient descent method can be applied to train both networks [5,6].
In our experiments we have trained the networks for a fixed number of iterations. In each iteration, the trainable parameters of the network are adapted with Backpropagation over the training set. These parameters are the weights in the case of the MF networks, whereas the weights and the centers of the Gaussian units are the trainable parameters in the case of the RBF networks. At the end of each iteration the Mean Square Error (MSE) is calculated over the validation set. When the learning process has finished, we assign to the final network the configuration of the iteration with minimum MSE.
For this reason, the original learning set L used in the learning process is divided into two subsets: the training set T, which is used to train the networks, and the validation set V, which is used to select the final configuration of the network.

Algorithm 1. Neural Network Training {T ,V }


for i = 1 to iterations do
Train the network with the patterns from the training set T
Calculate MSE over validation set V
Save epoch weights and calculated MSE
end for
Select epoch with minimum MSE
Assign best epoch configuration to the network
Save network configuration
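As an illustrative sketch (not the code used by the authors), the procedure of Algorithm 1 can be expressed in Python as follows; train_one_epoch and compute_mse are hypothetical callbacks standing in for the Backpropagation update and the MSE evaluation described above.

import copy

def train_with_validation(network, train_set, val_set, iterations,
                          train_one_epoch, compute_mse):
    # Minimal sketch of Algorithm 1: keep the configuration of the epoch
    # with minimum MSE over the validation set.
    best_mse = float("inf")
    best_state = copy.deepcopy(network)
    for epoch in range(iterations):
        train_one_epoch(network, train_set)    # adapt weights (and centers, for RBF)
        mse = compute_mse(network, val_set)    # validation MSE after this epoch
        if mse < best_mse:                     # remember the best configuration so far
            best_mse = mse
            best_state = copy.deepcopy(network)
    return best_state                          # final network: best-epoch configuration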

2.2 The Multilayer Feedforward Network


The Multilayer Feedforward architecture is the best-known neural network architecture. This kind of network consists of three layers of computational units. The neurons of the first layer apply the identity function, whereas the neurons of the second and third layers apply the sigmoid function. It has been proved that MF networks with one hidden layer and threshold nodes can approximate any function to a specified precision [7,8]. Figure 1 shows the diagram of the Multilayer Feedforward network.

2.3 The Radial Basis Function Network


An RBF network has two layers of neurons. The first one, in its usual form, is composed of neurons with Gaussian transfer functions (GF). The second layer is composed of neurons with linear transfer functions. This network can be the base classifier of an ensemble of neural networks since gradient descent can be applied to train it [2,3]. A basic RBF network can be seen in Figure 2.
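A minimal forward-pass sketch of such a network, assuming one width per Gaussian unit and already trained centers and output weights (the paper trains both the centers and the weights by gradient descent), could look like this:

import numpy as np

def rbf_forward(x, centers, widths, W, b):
    # Basic RBF network: Gaussian hidden units followed by linear output units.
    # Assumed shapes: centers (units, n_inputs), widths (units,), W (n_outputs, units).
    dist2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center
    phi = np.exp(-dist2 / (2.0 * widths ** 2))   # Gaussian activations
    return W @ phi + b                           # linear output layer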

Fig. 1. A MF network (input layer x1 ... xn, hidden layer, output layer y1 ... yq)

Fig. 2. An RBF network (input units x1 ... xn, Gaussian units, linear output units y1 ... yq)

2.4 Combining Networks with Stacked Generalization


Stacked Generalization was introduced by Wolpert [9]. Some authors have adapted Wolpert's method for use with neural networks [10,11,12]. In [1], two combination methods based on Stacked Generalization were proposed, Stacked and Stacked+. In these methods a single combination network is trained to combine the expert networks of the ensemble. For both kinds of networks, expert and combination, the architecture chosen in [1] was the MF network.
The use of the original pattern input vector is the difference between Stacked and Stacked+. Stacked uses the outputs provided by the expert networks on patterns from the training set, whereas Stacked+ uses the outputs provided by the experts along with the original pattern input vector to train the combination network.
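The following sketch illustrates the difference in the input of the combination network; expert_outputs is a hypothetical list holding the output vectors y1(x), ..., yk(x) of the k expert networks for a pattern x, and each combination network is assumed to be a callable returning its output vector.

import numpy as np

def combination_input(x, expert_outputs, include_original=False):
    # Stacked: concatenate the outputs of the expert networks.
    # Stacked+ (include_original=True): append the original pattern as well.
    features = np.concatenate(expert_outputs)
    if include_original:
        features = np.concatenate([features, x])
    return features

def combine(combination_networks, features):
    # Output average over the ensemble of combination networks (1, 3 or 9 MF networks).
    outputs = [net(features) for net in combination_networks]
    return np.mean(outputs, axis=0)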
In this paper we want to combine ensembles of RBF networks with Stacked and
Stacked+. Moreover we want to increase the number of combination networks from one
to three and nine, as an ensemble of combiners, to test whether the system can be improved by adding more combination networks. With this procedure, the expert networks of an ensemble of RBF networks are combined with an ensemble of MF networks. Finally, we have applied the output average in order to combine the outputs of the combination networks. Figure 3 shows the diagrams of Stacked and Stacked+ used in our experiments.

          
Fig. 3. Stacked and Stacked+ diagrams. The expert networks EN1 ... ENk process the input pattern x = {x1, x2, ..., xn}; their outputs form the input xsg (xsg+ in Stacked+, where x is appended) of the combination networks CN1 ... CNk, whose outputs give ysg(x) and ysg+(x).

3 Experimental Setup

In this section we describe the experimental setup and the datasets we have used in our
experiments. Then, we show the main results we have obtained with the combination
methods on the different datasets. Moreover, we calculate two general measurements in
order to compare the methods. Finally, we discuss the results we have obtained.
In our experiments we have used ensembles of 3 and 9 RBF networks previously trained with the Simple Ensemble on nine different classification problems. Moreover, we have trained 1, 3 and 9 MF networks with Stacked and Stacked+ in order to combine the networks of the ensemble. In addition, we have generated 10 different random partitions of the data into training, validation and test sets, and repeated the whole learning process 10 times in order to get the mean performance of the ensemble and an error estimate given by standard error theory.
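For instance, the mean performance and its associated error for one method on one dataset can be obtained from the 10 repetitions with a short computation like the one below (the accuracy values are illustrative placeholders, not results from the paper):

import numpy as np

accuracies = np.array([92.8, 93.6, 92.0, 93.2, 94.4, 92.8, 93.6, 92.4, 93.2, 92.0])
mean = accuracies.mean()
std_error = accuracies.std(ddof=1) / np.sqrt(len(accuracies))   # standard error of the mean
print(f"{mean:.2f} +/- {std_error:.2f}")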

3.1 Databases

The datasets we have used in our experiments and their characteristics are described in
this subsection. We have applied the new combination methods, Stacked and Stacked+,
to nine classification problems from the UCI repository [13]. The datasets we have used
are the following ones:
Balance Scale Database (bala)
The aim is to determine whether a balance scale tips to the right, tips to the left, or is balanced.
This dataset contains 625 instances, 4 attributes and 3 classes.
Cylinder Bands Database (band)
Used in decision tree induction for mitigating process delays known as “cylinder bands”
in rotogravure printing. This dataset contains 512 instances, 19 attributes and 2 classes.
BUPA liver disorders (bupa)
The aim of this dataset is to try to detect liver disorders. This dataset contains 345
instances, 6 attributes and 2 classes.
Australian Credit Approval (cred)
This dataset concerns credit card applications. This dataset contains 653 instances, 15
attributes and 2 classes.
Glass Identification Database (glas)
The aim of the dataset is to determine whether the analysed glass is a type of ‘float’ glass or not, for forensic science purposes. This dataset contains 2311 instances, 34 attributes and 2 classes.
Heart Disease Databases (hear)
The aim of the dataset is to determine the presence
of heart disease in the patient. This dataset contains 297 instances, 13 attributes and 2
classes.
The Monk’s Problem 1 (mok1)
Artificial problem with binary inputs. This dataset contains 432 instances, 6 attributes
and 2 classes.
The Monk’s Problem 2 (mok2)
Artificial problem with binary inputs. This dataset contains 432 instances, 6 attributes
and 2 classes.
Congressional Voting Records Database (Vote)
Classification of congressmen as Republican or Democrat. All attributes are boolean. This dataset
contains 432 instances, 6 attributes and 2 classes.
Table 1 shows the training parameters (number of clusters, iterations, adaptation step and width of the Gaussian units) of the expert networks and the performance of a single network on each database. Moreover, we have added to this table the performance of the ensembles of 3 and 9 RBF networks previously trained with the Simple Ensemble, in order to see whether the new combination methods increase the performance of the classification systems.
Table 2 shows the training parameters we have used to train the combination networks (hidden units, adaptation step, momentum rate and number of iterations) with the two new combiners, Stacked and Stacked+.

Table 1. Expert networks training parameters

           Experts - RBF parameters            Experts performance
dataset    clusters  iterations  step  width    1 net   3 nets   9 nets
bala 60 6000 0.005 0.6 90.2 89.68 89.68
band 40 10000 0.01 1 74 73.82 73.27
bupa 40 8000 0.01 0.4 70.1 71.86 72.43
cred 30 10000 0.005 2 86 87.15 87.23
glas 110 20000 0.01 0.4 93 93.2 93
hear 20 15000 0.005 2 82 83.9 83.9
mok1 30 20000 0.005 0.8 98.5 99.63 99.63
mok2 45 60000 0.005 0.6 91.3 91.5 91.38
vote 5 5000 0.01 1.8 95.4 96.25 96.25

Table 2. Combination network parameters - Stacked and Stacked+

                   Stacked - MF parameters       Stacked+ - MF parameters
dataset  experts   hidden  step   mom    ite      hidden  step   mom    ite
bala        3        22    0.4    0.2    1500       10    0.4    0.05   3500
bala        9         3    0.1    0.01   3000       11    0.1    0.01   6500
band        3         7    0.4    0.1     500       26    0.05   0.01   2500
band        9        30    0.1    0.1     500       30    0.05   0.05    500
bupa        3         3    0.4    0.1     750        4    0.2    0.1     750
bupa        9         6    0.2    0.05    750       19    0.4    0.05    750
cred        3        21    0.4    0.2     500       27    0.05   0.01    750
cred        9        21    0.4    0.2     500       25    0.4    0.1    3500
glas        3         4    0.1    0.001  7500        4    0.4    0.1    7500
glas        9         4    0.4    0.2    7500        4    0.4    0.2    7500
hear        3         4    0.4    0.2    5000        2    0.2    0.05   3500
hear        9        17    0.1    0.2    1500        3    0.4    0.1    7500
mok1        3         2    0.4    0.2    7500        4    0.4    0.2    7500
mok1        9         2    0.4    0.2    7500        3    0.4    0.2    7500
mok2        3         2    0.1    0.1    1000        2    0.4    0.1    7500
mok2        9         9    0.4    0.1    7500        1    0.4    0.1    7500
vote        3        28    0.4    0.2     500       30    0.05   0.01    750
vote        9        26    0.4    0.1     500       12    0.4    0.05    500

3.2 Results

The main results we have obtained with the application of stacking methods are pre-
sented in this subsection. Tables 3 and 4 show the results obtained when combining ensembles of 3 and 9 networks with Stacked and Stacked+.
In [14] the complete results of the combination of RBF ensembles with 14 different
combination methods are published. Although we have omitted these results to keep
the length of the paper short, the general measurements related to these combination
methods appear in subsection 3.3. These methods are: Majority Vote (vote), Winner

Table 3. Results of the 3 RBF Network Ensemble

dataset  stacked      stacked3     stacked9     stacked+     stacked3+    stacked9+
bala 92.88±0.72 92.96±0.70 92.96±0.70 93.44±0.70 93.36±0.70 93.44±0.68
band 74.36±1.03 74.55±0.94 74.36±0.88 74.73±1.03 75.27±0.95 75.45±0.99
bupa 72.00±1.15 71.43±1.09 71.00±1.11 72.00±1.21 72.14±1.19 72.29±1.19
cred 86.85±0.58 86.85±0.58 86.85±0.58 87.23±0.74 87.00±0.75 87.15±0.73
glas 93.80±1.25 93.20±1.04 93.40±1.08 93.00±1.09 93.00±1.09 93.00±1.09
hear 83.22±1.63 82.88±1.68 83.05±1.64 83.39±1.36 83.73±1.34 83.22±1.44
mok1 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38
mok2 91.25±1.26 91.25±1.26 91.25±1.26 91.50±1.15 91.38±1.13 91.38±1.13
vote 95.38±0.88 95.50±0.84 95.63±0.77 96.13±0.44 96.13±0.44 96.13±0.44

Table 4. Results of the 9 RBF Network Ensemble

dataset  stacked      stacked3     stacked9     stacked+     stacked3+    stacked9+
bala 92.88±0.70 93.04±0.66 92.88±0.69 94.08±0.64 93.84±0.66 93.76±0.66
band 73.82±1.02 74.00±1.27 74.00±1.27 74.73±1.20 74.73±1.20 74.55±1.21
bupa 72.29±1.21 71.86±1.31 72.14±1.21 71.57±1.12 71.71±1.16 71.29±1.19
cred 86.46±0.63 86.46±0.63 86.46±0.63 86.46±0.63 86.38±0.64 86.38±0.62
glas 92.80±0.85 93.40±0.90 93.20±1.04 93.60±0.93 93.60±0.93 93.40±0.90
hear 82.88±1.78 82.88±1.78 83.05±1.60 83.22±1.39 83.22±1.37 82.88±1.42
mok1 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38 99.63±0.38
mok2 91.25±1.24 91.25±1.24 91.25±1.24 91.63±1.24 91.63±1.24 91.75±1.18
vote 96.13±0.68 96.13±0.68 96.13±0.68 95.88±0.46 96.00±0.49 96.00±0.49

Takes All (wta), Borda Count (borda), Bayesian Combination (bayesian), Weighted Average (w.ave), Choquet Integral (choquet), Choquet Integral with Data-Dependent Densities (choquet.dd), Weighted Average with Data-Dependent Densities (w.ave.dd), BADD Defuzzification Strategy (badd), Zimmermann's Compensatory Operator (zimm), Dynamically Averaged Networks versions 1 and 2 (dan and dan2), and Nash vote (nash).

3.3 General Measurements

We have also calculated the percentage of error reduction (PER) of the results with
respect to a single network to get a general value for the comparison among all the
methods we have studied. We have used equation 1 to calculate the PER value.

\[ PER = 100 \cdot \frac{Error_{single\ network} - Error_{ensemble}}{Error_{single\ network}} \qquad (1) \]

\[ IoP = Performance_{ensemble} - Performance_{single\ network} \qquad (2) \]


The PER value ranges from 0%, where there is no improvement by the use of a
particular ensemble method with respect to a single network, to 100%. There can also
be negative values, which means that the performance of the ensemble is worse.

Furthermore, we have calculated the mean increase of performance (IoP) with respect to a single network and the mean percentage of error reduction (PER) across all databases, both for the methods proposed in this paper, Stacked and Stacked+, and for the combination methods that appear in [14]. The PER is calculated by equation (1), whereas the IoP with respect to a single network is calculated by equation (2). Table 5 shows the results of the mean PER and the mean IoP.
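Both measurements can be computed directly from the error and performance rates expressed as percentages; a short sketch follows, using the single-network and 3-network figures for the bala dataset from Table 1 as an example.

def per(error_single, error_ensemble):
    # Percentage of error reduction, equation (1); rates in %.
    return 100.0 * (error_single - error_ensemble) / error_single

def iop(perf_ensemble, perf_single):
    # Increase of performance with respect to a single network, equation (2).
    return perf_ensemble - perf_single

# bala (Table 1): single network 90.2%, 3-network ensemble 89.68%
print(per(100 - 90.2, 100 - 89.68))   # about -5.3: negative PER, the plain ensemble is worse here
print(iop(89.68, 90.2))               # -0.52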

Table 5. General Measurements

                 mean PER                  mean IoP
combiner     3 experts   9 experts     3 experts   9 experts
average 13.06 12.63 0.72 0.69
stacked 14.85 14.43 0.98 0.84
stacked3 13.85 15.48 0.85 0.9
stacked9 14.34 15.19 0.83 0.91
stacked+ 16.27 17.27 1.11 1.14
stacked3+ 16.43 17.28 1.18 1.13
stacked9+ 16.37 16.58 1.18 1.01
vote 13.51 13.92 0.83 0.85
wta 12.71 12.63 0.57 0.62
borda 13.42 13.83 0.82 0.84
bayesian 11.65 12.91 0.8 0.88
w.ave 14.16 13.61 0.71 0.57
choquet 12.47 11.7 0.55 0.62
choquet.dd 11.77 12.09 0.51 0.65
w.ave.dd 13.86 12.42 0.81 0.66
badd 13.06 12.63 0.72 0.69
zimm −146.23 −168.3 −10.53 −11.72
dan 11.02 6.46 0.58 0.24
dan2 12.27 7.44 0.55 0.28
nash 13.57 12.67 0.82 0.69

3.4 Discussion
The main results (tables 3-4) show that Stacked and Stacked+ get an improvement in a
wide majority of problems: bupa, cred, glas, hear, mok1 and vote.
The results show that the improvement in performance obtained by training an ensemble of nine combination networks (Stacked9/Stacked9+) instead of three (Stacked3/Stacked3+) is low. Taking into account the computational cost, the best alternative might be an ensemble of three combination networks, Stacked3/Stacked3+.
Comparing the results of the different traditional combination methods with Stacked
and Stacked+, we can see that there is an improvement by the use of these new methods.
For example, in the databases band and bala the results with the methods based on Stacked Generalization are quite good. The largest difference between the simple average and another method is around 4.0% in the problem bala and around 1.5% in the problem band.
Comparing the general measurements, the mean PER and the mean IoP, we can
see that Stacked and Stacked+ are the best alternative to combine an ensemble of RBF
networks. Stacked+ with 3 combination networks is the best way to combine ensembles
of 3 and 9 RBF networks according to the values of the general measurements.

4 Conclusions
In this paper, we have presented experimental results obtained by using Stacked and Stacked+, two new methods based on Stacked Generalization, to combine the outputs of an ensemble of RBF networks on nine different databases.
We have trained ensembles of 3 and 9 combination networks (MF) to combine a
previously trained ensemble of expert networks (RBF). The results show that, in gen-
eral, there is a reasonable improvement by the use of Stacked and Stacked+ in a wide
majority of databases.
In addition, we have calculated the mean percentage of error reduction over all databases. According to the values of the mean percentage of error reduction, the new combination methods, Stacked and Stacked+, are the best methods to combine ensembles of RBF networks.
Finally, taking into account the computational cost and the values of the general measurements, we can conclude that training 3 combination networks, as an ensemble of MF networks, is the best alternative when combining ensembles of RBF networks.

Acknowledgments
This research was supported by project number P1·1B2004-03, entitled ‘Desarrollo de métodos de diseño de conjuntos de redes neuronales’, of Universitat Jaume I - Bancaja in Castellón de la Plana, Spain.

References
1. Torres-Sospedra, J., Hernández-Espinosa, C., Fernández-Redondo, M.: Combining MF networks: A comparison among statistical methods and stacked generalization. In: Schwenker, F., Marinai, S. (eds.) ANNPR 2006. LNCS (LNAI), vol. 4087, pp. 302–9743. Springer, Heidelberg (2006)
2. Hernández-Espinosa, C., Fernández-Redondo, M., Torres-Sospedra, J.: First experiments on ensembles of radial basis functions. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077, pp. 253–262. Springer, Heidelberg (2004)
3. Torres-Sospedra, J., Hernández-Espinosa, C., Fernández-Redondo, M.: An experimental study on training radial basis functions by gradient descent. In: Schwenker, F., Marinai, S. (eds.) ANNPR 2006. LNCS (LNAI), vol. 4087, pp. 302–9743. Springer, Heidelberg (2006)
4. Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., Vapnik, V.: Boosting and other ensemble
methods. Neural Computation 6, 1289–1301 (1994)
5. Karayiannis, N.B.: Reformulated radial basis neural networks trained by gradient descent.
IEEE Transactions on Neural Networks 10, 657–671 (1999)
6. Karayiannis, N.B., Randoph-Gips, M.M.: On the construction and training of reformulated
radial basis function neural networks. IEEE Transactions on Neural Networks 14, 835–846
(2003)

7. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Inc., New
York, NY, USA (1995)
8. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience,
Chichester (2004)
9. Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
10. Ghorbani, A.A., Owrangh, K.: Stacked generalization in neural networks: Generalization
on statistically neutral problems. In: IJCNN 2001. Proceedings of the International Joint
conference on Neural Networks, Washington DC, USA, pp. 1715–1720. IEEE Computer
Society Press, Los Alamitos (2001)
11. Ting, K.M., Witten, I.H.: Stacked generalizations: When does it work? In: International Joint
Conference on Artificial Intelligence proceedings, vol. 2, pp. 866–873 (1997)
12. Ting, K.M., Witten, I.H.: Issues in stacked generalization. Journal of Artificial Intelligence
Research 10, 271–289 (1999)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of machine learning
databases (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
14. Torres-Sospedra, J., Hernández-Espinosa, C., Fernández-Redondo, M.: A comparison of combination methods for ensembles of RBF networks. In: IJCNN 2005. Proceedings of the International Joint Conference on Neural Networks, Montreal, Canada, vol. 2, pp. 1137–1141 (2005)
