
A New Learning Method for Piecewise Linear Regression

Giancarlo Ferrari-Trecate¹ and Marco Muselli²

¹ INRIA, Domaine de Voluceau
  Rocquencourt - B.P. 105, 78153 Le Chesnay Cedex, France
  [email protected]
² Istituto per i Circuiti Elettronici - CNR
  via De Marini, 6 - 16149 Genova, Italy
  [email protected]

Abstract. A new connectionist model for the solution of piecewise linear regression problems is introduced; it is able to reconstruct both continuous and discontinuous real-valued mappings starting from a finite set of possibly noisy samples. The approximating function can assume a different linear behavior in each region of an unknown polyhedral partition of the input domain.
The proposed learning technique combines local estimation, clustering in weight space, multicategory classification, and linear regression in order to achieve the desired result. Through this approach piecewise affine solutions for general nonlinear regression problems can also be found.

1 Introduction
The solution of any learning problem involves the reconstruction of an unknown
function f : X → Y from a finite set S of samples of f (the training set), possibly
affected by noise. Different approaches are usually adopted when the range Y
contains a small number of elements, typically without a specific ordering
among them (classification problems), or when Y is an interval of the real axis
with the usual topology (regression problems).
However, some applications lie on the borderline between classification and regression; these occur when the input space X can be subdivided into disjoint regions Xi characterized by different behaviors of the function f to be reconstructed. One of the simplest situations of this kind is piecewise linear regression (PLR): in this case X is a polyhedron in the n-dimensional space IR^n and {Xi}_{i=1}^s is a polyhedral partition of X, i.e. Xi ∩ Xj = ∅ for every i, j = 1, ..., s and ∪_{i=1}^s Xi = X. The target of a PLR problem is to reconstruct an unknown function f : X → IR having a linear behavior in each region Xi

$$f(x) = w_{i0} + \sum_{j=1}^{n} w_{ij} x_j \qquad \text{if } x \in X_i \qquad (1)$$

when only a training set S containing m samples (xk, yk), k = 1, ..., m, is available. The output yk gives a noisy evaluation of f(xk), with xk ∈ X; the region
Xi to which xk belongs is not known in advance. The scalars wi0, wi1, ..., win,
for i = 1, ..., s, characterize the function f and their estimate is a target of the
PLR problem; for notational purposes they will be included in a vector wi.
Since regions Xi are polyhedral, they are defined by a set of li linear inequalities, which can be written in the following form:

$$A_i \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0 \qquad (2)$$

where Ai is a matrix with li rows and n + 1 columns, whose estimate is still an output of the reconstruction process for every i = 1, ..., s.
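As a concrete illustration of condition (2), the following is a minimal membership test for a single region; it is only a sketch in Python with NumPy, and the function name is our own:

```python
import numpy as np

def in_region(A_i, x):
    """Check whether x satisfies the l_i inequalities A_i [1, x]^T >= 0
    that define the polyhedral region X_i (cf. equation (2))."""
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return bool(np.all(A_i @ x_aug >= 0))
```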
The target of the learning problem is consequently twofold: to generate both
the collection of regions Xi and the behavior of the unknown function f in
each of them, by using the information contained in the training set. In these
cases, classical learning algorithms for connectionist models cannot be directly
employed, since they require some knowledge about the problem, which is not
available a priori.
Several authors have treated this kind of problem [2,3,4,7], providing algorithms
for reaching the desired result. Unfortunately, most of them are difficult to
extend beyond two dimensions [2], whereas others consider only local approximations
[3,4], thus missing the actual extent of the regions Xi. In this contribution a
new connectionist model for solving PLR problems is proposed, together with a
proper learning algorithm that combines clustering, multicategory classification,
and linear regression to select a reduced subset of relevant training patterns and
to derive from them suitable values for the network weights.

2 The Proposed Learning Algorithm

Following the general idea presented in [7], a connectionist model realizing a piecewise linear function f can be depicted as in Fig. 1. It is formed by three layers: the hidden layer, containing a linear neuron for each of the regions Xi; the gate layer, whose units delimit the extension of each Xi; and the output layer, which provides the desired output value for the pattern given as input to the network. The task of the gate layer is to verify inequalities (2) and to decide which of the terms zi must be used as the output y of the whole network. Thus, the i-th unit in the gate layer has output equal to its input zi if the corresponding constraint (2) is satisfied and equal to 0 in the opposite case. All the other units perform a weighted sum of their inputs; the weights of the output neuron, which has no bias, are always set to 1.
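A minimal sketch of how such a network could compute its output, assuming the weight matrix W and the gate matrices A are already known (the helper name and the array layout are our own choices):

```python
import numpy as np

def pwl_network_output(x, W, A):
    """Forward pass of the model in Fig. 1.

    W : (s, n+1) array; row i holds [w_i0, w_i1, ..., w_in] (hidden layer).
    A : list of s gate matrices, A[i] of shape (l_i, n+1) (gate layer).
    """
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    z = W @ x_aug                                   # hidden-layer terms z_i
    # gate unit i forwards z_i only if A_i [1; x] >= 0, otherwise it outputs 0
    gates = np.array([np.all(A_i @ x_aug >= 0) for A_i in A], dtype=float)
    return float(np.sum(gates * z))                 # output neuron: unit weights, no bias
```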
As previously noted, the solution of a PLR problem requires a technique that
combines classification and regression: the former has the aim of finding the matrices
Ai to be inserted in the gate layer of the neural network in Fig. 1, whereas the
latter provides the weight vectors wi for the input-to-hidden layer connections.
The method we propose is summarized in Fig. 2; it is composed of four steps,
each one devoted to a specific task.

[Figure 1 shows the three-layer connectionist model: an input layer with units x1, ..., xn; a hidden layer of linear units with weights wij producing the terms z1, ..., zs; a gate layer with matrices A1, ..., As; and an output layer summing the gated terms into y.]

Fig. 1. Connectionist model realizing a piecewise linear function.

The first one (Step 1) aims at obtaining a first estimate of the weight vectors
wi by performing local linear regressions based on small subsets of the whole
training set S. In fact, points xk that are close to each other are likely to belong
to the same region Xi. Then, for each sample (xk, yk), with k = 1, ..., m, we
build a local dataset Ck containing (xk, yk) and the c − 1 distinct pairs (x, y) ∈ S
that score the lowest values of the distance ||xk − x||. It can be easily seen that
most sets Ck, called pure, contain samples belonging to the same region Xi, while
the remaining ones, named mixed, include input patterns deriving from different
Xi. The mixed sets lead to wrong estimates for wi and consequently their number must
be kept to a minimum.
Denote with vk the weight vector produced through the local linear regression on the samples (x_k^1, y_k^1), (x_k^2, y_k^2), ..., (x_k^c, y_k^c) of Ck. If Φk and ψk are defined as

$$\Phi_k = \begin{bmatrix} x_k^1 & x_k^2 & \cdots & x_k^c \\ 1 & 1 & \cdots & 1 \end{bmatrix}^{\top}, \qquad \psi_k = \begin{bmatrix} y_k^1 & y_k^2 & \cdots & y_k^c \end{bmatrix}^{\top}$$

where ⊤ denotes the transpose operator, vk is generated through the well-known formula

$$v_k = \left(\Phi_k^{\top} \Phi_k\right)^{-1} \Phi_k^{\top} \psi_k$$

A classical result in least-squares theory also allows one to obtain the empirical covariance matrix Vk of the vector vk and the scatter matrix [5] Qk:

$$V_k = \frac{S_k}{c - n + 1}\left(\Phi_k^{\top} \Phi_k\right)^{-1}, \qquad Q_k = \sum_{i=1}^{c} \left(x_k^i - m_k\right)\left(x_k^i - m_k\right)^{\top}$$

where $S_k = \psi_k^{\top}\left(I - \Phi_k \left(\Phi_k^{\top} \Phi_k\right)^{-1} \Phi_k^{\top}\right)\psi_k$ and $m_k = \sum_{i=1}^{c} x_k^i / c$.
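A possible implementation of Step 1 is sketched below in NumPy; the function name, array layout, and the use of np.linalg.lstsq are our own choices, while the normalisation of Vk follows the formula above:

```python
import numpy as np

def local_estimates(X, y, c):
    """Step 1: for each sample, build the local dataset C_k (the sample plus its
    c-1 nearest neighbours) and fit a local linear model by least squares.
    Returns for every k the weight vector v_k, the covariance estimate V_k,
    the scatter matrix Q_k and the mean m_k of the local input patterns."""
    m, n = X.shape
    v_all, V_all, Q_all, m_all = [], [], [], []
    for k in range(m):
        # indices of the c points closest to x_k (x_k itself included)
        idx = np.argsort(np.linalg.norm(X - X[k], axis=1))[:c]
        Xk, yk = X[idx], y[idx]
        Phi = np.hstack([np.ones((c, 1)), Xk])            # rows [1, x^T], so v_k = [w_0, ..., w_n]
        v_k, *_ = np.linalg.lstsq(Phi, yk, rcond=None)
        S_k = float((yk - Phi @ v_k) @ (yk - Phi @ v_k))  # sum of squared residuals
        V_k = S_k / (c - n + 1) * np.linalg.inv(Phi.T @ Phi)   # normalisation as in the text
        m_k = Xk.mean(axis=0)
        Q_k = (Xk - m_k).T @ (Xk - m_k)                   # scatter matrix of the local inputs
        v_all.append(v_k); V_all.append(V_k); Q_all.append(Q_k); m_all.append(m_k)
    return np.array(v_all), V_all, Q_all, np.array(m_all)
```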

ALGORITHM FOR PIECEWISE LINEAR REGRESSION

1. (Local regression) For every k = 1, . . . , m do
   1a. Build the local dataset Ck containing the sample (xk, yk) and the
       pairs (x, y) ∈ S associated with the c − 1 nearest neighbors x to xk.
   1b. Perform a linear regression to obtain the weight vector vk of a
       linear unit fitting the samples in Ck.
2. (Clustering) Perform a proper clustering process in the space IR^{n+1} to
   subdivide the set of weight vectors vk into s groups Ui.
3. (Classification) Build a new training set S′ containing the m pairs
   (xk, ik), where Uik is the cluster including vk. Train a multicategory
   classification method to produce the matrices Ai for the regions X̂i.
4. (Regression) For every i = 1, . . . , s perform a linear regression on the
   samples (x, y) ∈ S, with x ∈ X̂i, to obtain the weight vector wi for the
   i-th unit in the hidden layer.

Fig. 2. Proposed learning method for piecewise linear regression.
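The four steps can be strung together as in the skeleton below; this is only an illustrative sketch, and the helper functions local_estimates, cluster_feature_vectors, and fit_multicategory_classifier are hypothetical names whose possible implementations are sketched near the corresponding steps above and below:

```python
import numpy as np

def piecewise_linear_regression(X, y, s, c):
    """Skeleton of the algorithm of Fig. 2 for m samples (X: m x n, y: length m)."""
    # Step 1: local least-squares fits around every sample
    v, V, Q, m_loc = local_estimates(X, y, c)
    # Step 2: cluster the feature vectors built from the local estimates into s groups
    labels = cluster_feature_vectors(v, m_loc, V, Q, s)
    # Step 3: linear multicategory classification of the input patterns
    classifier, A = fit_multicategory_classifier(X, labels)
    # Step 4: one final least-squares fit per estimated region
    region = classifier.predict(X)
    W = np.zeros((s, X.shape[1] + 1))
    for i in range(s):
        Phi = np.hstack([np.ones((int(np.sum(region == i)), 1)), X[region == i]])
        W[i], *_ = np.linalg.lstsq(Phi, y[region == i], rcond=None)
    return W, A
```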

Then, consider the feature vectors ξk = [vk^⊤ mk^⊤]^⊤, for k = 1, . . . , m; they can be approximately modeled as the realization of random vectors with covariance matrix

$$R_k = \begin{bmatrix} V_k & 0 \\ 0 & Q_k \end{bmatrix}$$
If the generation of the samples in the training set is not affected by noise, most
of the vk coincide with the desired weight vectors wi. Only mixed sets Ck yield
spurious vectors vk, which can be considered as outliers. Nevertheless, even in the
presence of noise, a clustering algorithm (Step 2) can be used to determine the
sets Ui of feature vectors ξk associated with the same wi. If the number s of
regions is fixed beforehand, a proper version of the K-means algorithm [6] can
be adopted. It uses the following cost functional

$$J\left(\{U_i\}_{i=1}^{s}, \{\mu_i\}_{i=1}^{s}\right) = \sum_{i=1}^{s} \sum_{\xi_k \in U_i} \left\|\xi_k - \mu_i\right\|^2_{R_k^{-1}}$$

where µi is the center of the cluster Ui. This choice helps reduce the influence
of poor initializations and of outliers on the clustering process.
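A minimal sketch of such a weighted K-means follows; it is our own illustration of the idea rather than the exact algorithm of [6], and the centre update simply minimises the cost above over each cluster:

```python
import numpy as np
from scipy.linalg import block_diag

def cluster_feature_vectors(v, m_loc, V, Q, s, n_iter=50, seed=0):
    """Step 2 (sketch): K-means on the feature vectors xi_k = [v_k, m_k] with the
    confidence-weighted distance ||xi_k - mu_i||^2_{R_k^{-1}}, R_k = blockdiag(V_k, Q_k)."""
    rng = np.random.default_rng(seed)
    xi = np.hstack([v, m_loc])                         # feature vectors, one per sample
    R_inv = [np.linalg.inv(block_diag(Vk, Qk)) for Vk, Qk in zip(V, Q)]
    mu = xi[rng.choice(len(xi), size=s, replace=False)]   # random initial centres
    for _ in range(n_iter):
        # assignment step: weighted squared distance of every xi_k to every centre
        d = np.array([[(x - c) @ Ri @ (x - c) for c in mu]
                      for x, Ri in zip(xi, R_inv)])
        labels = d.argmin(axis=1)
        # update step: centre minimising the weighted cost over its cluster
        for i in range(s):
            members = np.flatnonzero(labels == i)
            if len(members) == 0:
                continue
            A = sum(R_inv[k] for k in members)
            b = sum(R_inv[k] @ xi[k] for k in members)
            mu[i] = np.linalg.solve(A, b)
    return labels
```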
The sets Ui generated by the clustering process induce a classification of the
input patterns xk belonging to the training set S, due to the chain of bijections
among the set of input patterns xk, the collection of local datasets Ck, and
the class of feature vectors ξk. Now, if ξk ∈ Ui for a given i, it is likely that
the local dataset Ck is fitted by the linear neuron with weight vector wi and
consequently that xk lies in the region Xi. An estimate X̂i for each of these
regions can then be determined by solving a linear multicategory classification
problem (Step 3), whose training set S′ is built by adding as output to each
input pattern xk the index ik of the set Uik to which the corresponding feature
vector ξ k belongs.
To avoid the presence of multiply classified points or of unclassified patterns
in the input space, multicategory techniques [1] deriving from the Support Vector
Machine approach and based on linear and quadratic programming can be
employed. In this way the s matrices Ai for the gate layer are generated. Finally,
the weight vectors wi for the neural network in Fig. 1 can be directly obtained by
solving s linear regression problems (Step 4) having as training sets the samples
(x, y) ∈ S with x ∈ X̂i, where X̂1, ..., X̂s are the regions found by the classification
process.
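As an illustration, a one-versus-rest linear SVM (here scikit-learn's LinearSVC, our own choice rather than the specific technique of [1]) can play this role; the gate matrices then follow from pairwise comparison of the linear decision functions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_multicategory_classifier(X, labels):
    """Step 3 (sketch, assuming more than two regions): fit one linear decision
    function g_i(x) = b_i + u_i . x per cluster; region X_hat_i is where
    g_i(x) >= g_j(x) for all j, so the rows of A_i are [b_i - b_j, (u_i - u_j)^T]."""
    clf = LinearSVC().fit(X, labels)
    s = len(clf.classes_)
    A = []
    for i in range(s):
        rows = [np.concatenate(([clf.intercept_[i] - clf.intercept_[j]],
                                clf.coef_[i] - clf.coef_[j]))
                for j in range(s) if j != i]
        A.append(np.vstack(rows))
    return clf, A
```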

3 Simulation Results
The proposed algorithm for piecewise linear regression has been tested on a two-dimensional benchmark problem, in order to analyze the quality of the resulting connectionist model. The unknown function to be reconstructed is

$$f(x_1, x_2) = \begin{cases} 3 + 4x_1 + 2x_2 & \text{if } 0.5x_1 + 0.29x_2 \ge 0 \text{ and } x_2 \ge 0 \\ -5 - 6x_1 + 6x_2 & \text{if } 0.5x_1 + 0.29x_2 < 0 \text{ and } 0.5x_1 - 0.29x_2 < 0 \\ -2 + 4x_1 - 2x_2 & \text{if } 0.5x_1 - 0.29x_2 \ge 0 \text{ and } x_2 < 0 \end{cases} \qquad (3)$$

with X = [−1, 1] × [−1, 1] and s = 3. A training set S containing m = 300 samples (x1, x2, y) has been generated according to the model y = f(x1, x2) + ε, where ε is a normal random variable with zero mean and standard deviation σ = 0.1. The behavior of f(x1, x2) together with the elements of S are depicted in Fig. 3a. Shaded areas in the (x1, x2) plane show the polyhedral regions Xi.
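For reference, data of this kind could be generated as follows; the random seed and the uniform sampling of the inputs over X are our own assumptions, since the paper does not state how the input patterns were drawn:

```python
import numpy as np

def f_true(x1, x2):
    """The piecewise linear benchmark function (3) on X = [-1, 1] x [-1, 1]."""
    if 0.5 * x1 + 0.29 * x2 >= 0 and x2 >= 0:
        return 3 + 4 * x1 + 2 * x2
    if 0.5 * x1 + 0.29 * x2 < 0 and 0.5 * x1 - 0.29 * x2 < 0:
        return -5 - 6 * x1 + 6 * x2
    return -2 + 4 * x1 - 2 * x2          # remaining case: third region

rng = np.random.default_rng(0)                        # arbitrary seed
X = rng.uniform(-1, 1, size=(300, 2))                 # m = 300 input patterns (uniform sampling assumed)
y = np.array([f_true(x1, x2) for x1, x2 in X]) + rng.normal(0.0, 0.1, size=300)
```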

[Figure 3: three surface plots of y over the (x1, x2) plane, with y ranging from about −10 to 10.]

Fig. 3. Two-dimensional benchmark problem: a) behavior of the unknown piecewise linear function and training set, b) simulation results obtained with the proposed algorithm and c) with a conventional two-layer neural network.

The method described in Fig. 2 has been applied by choosing at Step 1 the value c = 8. The resulting connectionist model realizes the following function, graphically represented in Fig. 3b:

$$f(x_1, x_2) = \begin{cases} 3.02 + 3.91x_1 + 2.08x_2 & \text{if } 0.5x_1 + 0.28x_2 \ge 0 \text{ and } 0.12x_1 + x_2 \ge 0.07 \\ -5 - 5.99x_1 + 6.01x_2 & \text{if } 0.5x_1 + 0.28x_2 < 0 \text{ and } 0.5x_1 - 0.26x_2 < 0.04 \\ -2.09 + 4.04x_1 - 2.08x_2 & \text{if } 0.5x_1 - 0.26x_2 \ge 0.04 \text{ and } 0.12x_1 + x_2 < 0.07 \end{cases}$$

As one can note, this is a good approximation to the unknown function (3);
the generalization Root Mean Square (RMS) error, estimated through a 10-fold
cross validation, amounts to 0.296. Significant differences can only be detected
at the boundaries between two adjacent regions Xi ; they are mainly due to the
effect of mixed sets Ck on the classification process.
As a comparison, a two-layer neural network trained with a combination of
the Levenberg-Marquardt algorithm and Bayesian regularization yields the nonlinear
approximating function shown in Fig. 3c. Discontinuities between adjacent
regions are modeled at the expense of reducing the precision on flat areas. It
is important to observe that neural networks are not able to reconstruct the
partition {Xi }si=1 of the input domain, thus missing relevant information for the
problem at hand.
Seven hidden nodes are needed to produce this result; their number has been
obtained by performing several trials with different values and by taking the
best performance. A 10-fold cross validation scores an estimate of 0.876 for the
generalization RMS error, significantly higher than that obtained through the
PLR procedure.

References
1. E. J. Bredensteiner and K. P. Bennett, Multicategory classification by sup-
port vector machines. Computational Optimization and Applications, 12 (1999)
53–79.
2. V. Cherkassky and H. Lari-Najafi, Constrained topological mapping for non-
parametric regression analysis. Neural Networks, 4 (1991) 27–40.
3. C.-H. Choi and J. Y. Choi, Constructive neural networks with piecewise in-
terpolation capabilities for function approximation. IEEE Transactions on Neural
Networks, 5 (1994) 936–944.
4. J. Y. Choi and J. A. Farrell, Nonlinear adaptive control using networks of
piecewise linear approximators. IEEE Transactions on Neural Networks, 11 (2000)
390–401.
5. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. (1973)
New York: John Wiley and Sons.
6. G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, A clustering
technique for the identification of piecewise affine systems. In HSCC 2001, vol
2034 of Lecture Notes in Computer Science (2001) Berlin: Springer-Verlag, 218–
231.
7. K. Nakayama, A. Hirano, and A. Kanbe, A structure trainable neural network
with embedded gating units and its learning algorithm. In Proceedings of the In-
ternational Joint Conference on Neural Networks (2000) Como, Italy, III–253–258.
