A New Learning Method for Piecewise Linear Regression
G. Ferrari-Trecate and M. Muselli
1 Introduction
The solution of any learning problem involves the reconstruction of an unknown
function f : X → Y from a finite set S of samples of f (training set), possibly
affected by noise. Different approaches are usually adopted when the range Y
contains a small number of elements, typically without a specific ordering
among them (classification problems), or when Y is an interval of the real axis
with the usual topology (regression problems).
However, applications can be found that lie on the borderline between classification
and regression; these occur when the input space X can be subdivided
into disjoint regions Xi characterized by different behaviors of the function f
to be reconstructed. One of the simplest situations of such kind is piecewise
linear regression (PLR): in this case X is a polyhedron in the n-dimensional
space IR^n and {X_i}_{i=1}^{s} is a polyhedral partition of X, i.e. Xi ∩ Xj = ∅ for every
i, j = 1, . . . , s with i ≠ j, and ∪_{i=1}^{s} Xi = X. The target of a PLR problem is to reconstruct
an unknown function f : X → IR having a linear behavior in each region Xi:
$$f(x) = w_{i0} + \sum_{j=1}^{n} w_{ij}\, x_j \qquad \text{if } x \in X_i \qquad (1)$$
Note that, for each training sample (xk, yk), the region Xi to which xk belongs is not
known in advance. The scalars wi0, wi1, . . . , win, for i = 1, . . . , s, characterize the
function f and their estimate is a target of the PLR problem; for notational purposes
they will be included in a vector wi.
Since regions Xi are polyhedral, they are defined by a set of li linear inequal-
ities, which can be written in the following form:
$$A_i \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0 \qquad (2)$$
[Fig. 1. The proposed connectionist model: an input layer with components x1, . . . , xn, a hidden layer of linear neurons with weights wij producing the values z1, . . . , zs, a gate layer with matrices A1, . . . , As selecting the active region, and an output layer yielding y.]
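To make the interplay of (1), (2) and the gate layer of Fig. 1 concrete, the following Python sketch evaluates such a gated piecewise linear model at a single point; the function name, the argument layout and the use of NumPy are illustrative assumptions of this sketch, not part of the original formulation.

import numpy as np

def pwl_predict(x, A_list, w_list):
    """Evaluate the piecewise linear model of Eqs. (1)-(2) at a point x.

    A_list[i] is the matrix A_i of the l_i inequalities defining X_i (acting on [1, x]);
    w_list[i] is the weight vector [w_i0, w_i1, ..., w_in] of the i-th linear neuron.
    Assumes the regions X_i form a partition of X, so exactly one gate fires.
    """
    x = np.asarray(x, dtype=float)
    x_aug = np.concatenate(([1.0], x))      # augmented input [1, x_1, ..., x_n]
    for A_i, w_i in zip(A_list, w_list):
        if np.all(A_i @ x_aug >= 0):        # gate layer: A_i [1; x] >= 0, Eq. (2)
            return w_i[0] + w_i[1:] @ x     # hidden linear neuron of region X_i, Eq. (1)
    raise ValueError("x violates the gate conditions of every region")

In the network of Fig. 1 the same computation is carried out in parallel: every hidden neuron outputs its linear combination zi, and the gate layer lets only the value associated with the active region reach the output y.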
The first one (Step 1) aims at obtaining a first estimate of the weight vectors
wi by performing local linear regressions based on small subsets of the whole
training set S. In fact, points xk that are close to each other are likely to belong
to the same region Xi . Then, for each sample (xk , yk ), with k = 1, . . . , m, we
build a local dataset Ck containing (xk , yk ) and the c−1 distinct pairs (x, y) ∈ S
that have the smallest distances ∥xk − x∥. It can be easily seen that
most sets Ck, called pure, contain samples belonging to the same region Xi, while
the remaining ones, named mixed, include input patterns deriving from different
regions Xi. Mixed sets lead to wrong estimates for wi and consequently their number
must be kept to a minimum.
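As a minimal sketch of how the local datasets Ck of Step 1 can be formed, the following Python fragment uses plain Euclidean distances; the function name and the (arbitrary) handling of ties are assumptions of this sketch rather than prescriptions of the method.

import numpy as np

def build_local_datasets(X, y, c):
    """Step 1 (sketch): each C_k contains (x_k, y_k) plus its c-1 nearest samples."""
    local_sets = []
    for k in range(len(X)):
        d = np.linalg.norm(X - X[k], axis=1)   # distances ||x_k - x|| to every sample
        idx = np.argsort(d)[:c]                # x_k itself (distance 0) and c-1 neighbours
        local_sets.append((X[idx], y[idx]))
    return local_sets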
Denote with vk the weight vector produced through the local linear regression
on the samples (xk^1, yk^1), (xk^2, yk^2), . . . , (xk^c, yk^c) of Ck. If Φk and ψk are defined as
$$\Phi_k = \begin{bmatrix} x_k^1 & x_k^2 & \cdots & x_k^c \\ 1 & 1 & \cdots & 1 \end{bmatrix}', \qquad \psi_k = \begin{bmatrix} y_k^1 & y_k^2 & \cdots & y_k^c \end{bmatrix}',$$
where ′ denotes the transpose operator, vk is generated through the well-known least
squares formula $v_k = (\Phi_k' \Phi_k)^{-1} \Phi_k' \psi_k$.
A classical result in least squares theory also allows one to obtain the empirical
covariance matrix Vk of the vector vk and the scatter matrix [5] Qk:
$$V_k = \frac{S_k}{c - n + 1}\,(\Phi_k' \Phi_k)^{-1}, \qquad Q_k = \sum_{i=1}^{c} (x_k^i - m_k)(x_k^i - m_k)',$$
where $S_k = \psi_k' \left( I - \Phi_k (\Phi_k' \Phi_k)^{-1} \Phi_k' \right) \psi_k$ and $m_k = \sum_{i=1}^{c} x_k^i / c$.
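A possible implementation of these quantities for a single local dataset Ck is sketched below in Python; it follows the formulas above (including the c − n + 1 denominator), while the function name and the NumPy-based layout are assumptions of this sketch.

import numpy as np

def local_fit(Xk, yk):
    """Least squares on one local dataset C_k: returns v_k, V_k, Q_k and m_k."""
    c, n = Xk.shape
    Phi = np.hstack([Xk, np.ones((c, 1))])                        # rows [x_k^i', 1]
    G = Phi.T @ Phi                                               # Phi_k' Phi_k
    v = np.linalg.solve(G, Phi.T @ yk)                            # v_k = (Phi'Phi)^{-1} Phi' psi_k
    S = yk @ (np.eye(c) - Phi @ np.linalg.solve(G, Phi.T)) @ yk   # S_k
    V = S / (c - n + 1) * np.linalg.inv(G)                        # empirical covariance V_k
    mk = Xk.mean(axis=0)                                          # centre m_k of the inputs in C_k
    Q = (Xk - mk).T @ (Xk - mk)                                   # scatter matrix Q_k
    return v, V, Q, mk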
where µi is the center of the cluster Ui. This choice makes it possible to reduce the influence
of poor initializations and of outliers on the clustering process.
The sets Ui generated by the clustering process induce a classification on the
input patterns xk belonging to the training set S, due to the chain of bijections
among the set of input patterns xk , the collection of local datasets Ck , and
the class of feature vectors ξ k . Now, if ξ k ∈ Ui for a given i, it is likely that
the local dataset Ck is fitted by the linear neuron with weight vector wi and
consequently xk lies in the region Xi. An estimate X̂i for each of these
regions can then be determined by solving a linear multicategory classification
problem (Step 3), whose training set is built by adding, as output to each
input pattern xk, the index ik of the set Uik to which the corresponding feature
vector ξk belongs.
To avoid the presence of multiply classified points or of unclassified patterns
in the input space, multicategory techniques [1] deriving from the Support Vec-
tor Machine approach and based on linear and quadratic programming can be
employed. In this way the s matrices Ai for the gate layer are generated. Finally,
weight vectors wi for the neural network in Fig. 1 can be directly obtained by
solving s linear regression problems (Step 4) having as training sets the samples
(x, y) ∈ S with x ∈ X̂i, where X̂1, . . . , X̂s are the regions found by the classification
process.
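Steps 3 and 4 can be sketched as follows. The paper relies on the multicategory SVM techniques of [1]; in this sketch a generic one-vs-rest linear SVM from scikit-learn is used purely as a stand-in for that classifier, and the function name and data layout are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC  # stand-in for the multicategory method of [1]

def fit_regions_and_models(X, y, labels, s):
    """Steps 3-4 (sketch): labels[k] is the index i_k of the cluster U_i assigned to x_k.
    A linear classifier estimates the regions, then one linear model is refitted per region."""
    clf = LinearSVC().fit(X, labels)              # linear multicategory classification (Step 3)
    region = clf.predict(X)                       # estimated region membership of each x_k
    weights = []
    for i in range(s):
        Xi, yi = X[region == i], y[region == i]
        Phi = np.hstack([np.ones((len(Xi), 1)), Xi])
        w_i = np.linalg.lstsq(Phi, yi, rcond=None)[0]   # [w_i0, w_i1, ..., w_in] (Step 4)
        weights.append(w_i)
    return clf, weights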
3 Simulation Results
The proposed algorithm for piecewise linear regression has been tested on a two-
dimensional benchmark problem, in order to analyze the quality of the resulting
connectionist model. The unknown function to be reconstructed is
$$f(x_1, x_2) = \begin{cases} 3 + 4x_1 + 2x_2 & \text{if } 0.5x_1 + 0.29x_2 \ge 0 \text{ and } x_2 \ge 0 \\ -5 - 6x_1 + 6x_2 & \text{if } 0.5x_1 + 0.29x_2 < 0 \text{ and } 0.5x_1 - 0.29x_2 < 0 \\ -2 + 4x_1 - 2x_2 & \text{if } 0.5x_1 - 0.29x_2 \ge 0 \text{ and } x_2 < 0 \end{cases} \qquad (3)$$
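For completeness, the benchmark function (3) is straightforward to reproduce in code; the sampling domain, the number of samples and the noise level in the following Python sketch are hypothetical, since they are not reported in this excerpt.

import numpy as np

def f_true(x1, x2):
    """Target function (3); the three branches partition the plane."""
    if 0.5 * x1 + 0.29 * x2 >= 0 and x2 >= 0:
        return 3 + 4 * x1 + 2 * x2
    if 0.5 * x1 + 0.29 * x2 < 0 and 0.5 * x1 - 0.29 * x2 < 0:
        return -5 - 6 * x1 + 6 * x2
    return -2 + 4 * x1 - 2 * x2            # remaining case: 0.5*x1 - 0.29*x2 >= 0 and x2 < 0

# hypothetical training set (domain, size and noise level are assumptions of this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = np.array([f_true(a, b) for a, b in X]) + rng.normal(0.0, 0.1, size=300)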
[Fig. 3. Three surface plots a), b), c); panel b) shows the piecewise linear model produced by the proposed method and panel c) the approximation given by a two layer neural network (see text).]
The method described in Fig. 2 has been applied by choosing at Step 1 the
value c = 8. The resulting connectionist model realizes the following function,
graphically represented in Fig. 3b:
$$f(x_1, x_2) = \begin{cases} 3.02 + 3.91x_1 + 2.08x_2 & \text{if } 0.5x_1 + 0.28x_2 \ge 0 \text{ and } 0.12x_1 + x_2 \ge 0.07 \\ -5 - 5.99x_1 + 6.01x_2 & \text{if } 0.5x_1 + 0.28x_2 < 0 \text{ and } 0.5x_1 - 0.26x_2 < 0.04 \\ -2.09 + 4.04x_1 - 2.08x_2 & \text{if } 0.5x_1 - 0.26x_2 \ge 0.04 \text{ and } 0.12x_1 + x_2 < 0.07 \end{cases}$$
As one can note, this is a good approximation to the unknown function (3);
the generalization Root Mean Square (RMS) error, estimated through a 10-fold
cross validation, amounts to 0.296. Significant differences can only be detected
at the boundaries between two adjacent regions Xi ; they are mainly due to the
effect of mixed sets Ck on the classification process.
As a comparison, a two layer neural network trained with a combination of the
Levenberg-Marquardt algorithm and Bayesian regularization yields the nonlinear
approximating function shown in Fig. 3c. Discontinuities between adjacent
regions are modeled at the expense of reducing the precision on flat areas. It
is important to observe that neural networks are not able to reconstruct the
partition {X_i}_{i=1}^{s} of the input domain, thus missing relevant information for the
problem at hand.
Seven hidden nodes are needed to produce this result; their number has been
obtained by performing several trials with different values and by taking the
best performance. A 10-fold cross validation scores an estimate of 0.876 for the
generalization RMS error, significantly higher than that obtained through the
PLR procedure.
References
1. E. J. Bredensteiner and K. P. Bennett, Multicategory classification by sup-
port vector machines. Computational Optimization and Applications, 12 (1999)
53–79.
2. V. Cherkassky and H. Lari-Najafi, Constrained topological mapping for non-
parametric regression analysis. Neural Networks, 4 (1991) 27–40.
3. C.-H. Choi and J. Y. Choi, Constructive neural networks with piecewise in-
terpolation capabilities for function approximation. IEEE Transactions on Neural
Networks, 5 (1994) 936–944.
4. J. Y. Choi and J. A. Farrell, Nonlinear adaptive control using networks of
piecewise linear approximators. IEEE Transactions on Neural Networks, 11 (2000)
390–401.
5. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. (1973)
New York: John Wiley and Sons.
6. G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, A Clustering
Technique for the Identification of Piecewise Affine Systems. In HSCC 2001, vol
2304 of Lecture Notes in Computer Science (2001) Berlin: Springer-Verlag, 218–
231.
7. K. Nakayama, A. Hirano, and A. Kanbe, A structure trainable neural network
with embedded gating units and its learning algorithm. In Proceedings of the In-
ternational Joint Conference on Neural Networks (2000) Como, Italy, III–253–258.