RESEARCH PAPER
QUANTITATIVE FINANCE Volume 3 (2003) 1–10
Institute of Physics Publishing
quant.iop.org
Estimating GARCH models using
support vector machines*
Fernando Pérez-Cruz¹, Julio A Afonso-Rodríguez² and Javier Giner³
¹ Department of Signal Theory and Communications, University Carlos III, Leganés, 28911 Madrid, Spain
² Department of Institutional Economics, Economic Statistics and Econometrics, University of La Laguna, 38071 Tenerife, Canary Islands, Spain
³ Department of Financial Economy and Accounting, University of La Laguna, 38071 Tenerife, Canary Islands, Spain
E-mail: [email protected], [email protected] and [email protected]

Received 17 February 2002, in final form 20 February 2003
Published online at stacks.iop.org/Quant/3

* Paper presented at Applications of Physics in Financial Analysis (APFA) 3, 5–7 December 2001, Museum of London, UK.
Abstract
Support vector machines (SVMs) are a new nonparametric tool for regression
estimation. We will use this tool to estimate the parameters of a GARCH
model for predicting the conditional volatility of stock market returns.
GARCH models are usually estimated using maximum likelihood (ML)
procedures, assuming that the data are normally distributed. In this paper, we
will show that GARCH models can be estimated using SVMs and that such
estimates have a higher predicting ability than those obtained via common
ML methods.
1. Introduction
Financial returns series are mainly characterized by having
a zero mean, exhibiting high kurtosis and little, if any,
correlation. The squares of these returns often present
high correlation and persistence, which makes ARCH-type
models suitable for estimating the conditional volatility of
such processes; see Engle (1982) for the seminal work,
Bollerslev et al (1994) for a survey on volatility models and
Engle and Patton (2001) for several extensions. The ARCH
parameters are usually estimated using maximum likelihood
(ML) procedures that are optimal when the data are drawn from
a Gaussian distribution.
Support vector machines (SVMs) are state-of-the-art tools
for linear and nonlinear input–output knowledge discovery
(Vapnik 1998, Schölkopf and Smola 2001). SVMs can
be employed for solving pattern recognition and regression
estimation problems. SVMs were developed in the machine-learning community and resemble, in some ways, a neural network (NN). But SVMs are superior to most common NNs (such as multi-layered perceptrons or radial basis function networks) because the SVM optimization procedure yields not only the weights of the network but also its architecture. Furthermore, one of the most desirable properties of an SVM is that its optimizing functional is quadratic and linearly restricted, so it presents a single minimum without undesirable local solutions.
In this paper we propose using an SVM instead of an ML method to estimate GARCH parameters. The benefit of the SVM in regression (also known as a support vector regressor, SVR) lies in not assuming a probability density function (pdf) over the return series: it adjusts the parameters following the empirical risk minimization inductive principle (Vapnik 1998). The SVR defines an insensitivity zone (detailed in section 2) that allows it to deal with any pdf. Therefore, if the variable to be estimated has not been
sampled from a Gaussian distribution, the SVR can actually lead to better predictions than those obtained using a least-squares (ML) approach.

Figure 1. The cost function associated with SVR errors.
The rest of the paper is outlined as follows. We will
introduce the SVM for regression estimation in section 2.
Section 3 is devoted to the GARCH model and its estimation
via the SVR and ML procedures, including a simulation
experiment. In section 4 we consider the estimation of a
GARCH (1, 1) model with six different financial data sets,
along with the usual statistics employed in time-series analysis
to assess the adequacy of the model when using both estimation
methods. We compare the ML and SVR forecasts within the
GARCH model in section 5 using the same financial series as
in section 4. The paper ends in section 6 with a discussion and suggestions for further work.
2. The SVMs
The SVR needs a training set to adjust its parameters; afterwards, the machine can be used to predict any possible outcome. Prediction over the samples used for training is known as in-sample prediction, and prediction over the samples the algorithm did not use is known as out-sample prediction; these correspond, respectively, to the training and generalization (test) sets in the machine-learning community. The SVR needs a labelled training data set (x_i ∈ ℝ^d and y_i ∈ ℝ, for i = 1, …, n, where x_i is the input vector and y_i is its corresponding label)⁴ to solve the regression estimation problem (y_i = wᵀx_i + b, where w and b define the linear regressor), i.e. to find the values of w and b. The SVR uses the penalty function shown in figure 1, in which the samples with a prediction error (e_i = y_i − wᵀx_i − b) lower than ε in absolute value are not penalized and those samples with a prediction error greater than ε are penalized linearly.
The regression SVM is an extension of the SVM for
classification proposed in 1992 (Boser et al 1992). The SVR
cost function has been proposed to follow the structural risk
minimization (SRM) principle (Vapnik 1998). The SRM principle provides an upper bound on the prediction error that depends on the empirical error and on a factor measuring the complexity of the regressor used. The SRM defines a nested set of possible functions that approximate the given series and chooses the one providing the minimum bound; more details can be found in Vapnik (1998) and Schölkopf and Smola (2001). The SRM principle needs a nonzero insensitivity zone in order to keep the upper bound on the error finite, but this ε can be made arbitrarily small. The samples that present a prediction error greater than ε are linearly penalized. This choice is based on a result by Huber (1964), who demonstrated that the best cost function for the worst model over any probability density function of y given x, p(y|x), is the linear cost function (absolute-value penalization of the error). Therefore, if the pdf p(y|x) is unknown, the best cost function is the linear penalization of the errors. Combining these two results gives the SVM cost function, which guarantees that the SRM bound is finite and that the solution is the best one when p(y|x) is unknown.

Most regression estimation problems use a quadratic loss function, f(e_i) = e_i² (least squares), but for problems in which y_i has not been drawn from a Gaussian distribution, least-squares techniques are suboptimal (Vapnik 1982) and can lead to severely mismatched solutions for some densities. Furthermore, the value of ε can be set optimally if the probability density function of e_i is known, as shown in Smola et al (1998).
⁴ For compactness we will use matrix notation. Vectors will be column vectors denoted by bold lower case, matrices will be bold upper case and scalars will be italic (lower case and occasionally upper case). The dot product between column vectors will be shown as a matrix multiplication (wᵀx), where ᵀ indicates the matrix transpose operation.

2.1. SVR optimization

The SVR is stated as the constrained optimization problem
\[ \min_{\mathbf{w},\,b,\,\xi_i,\,\xi_i^*}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i+\xi_i^*) \qquad (2.1) \]

subject to

\[ y_i - \mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i) - b \le \varepsilon + \xi_i \qquad (2.2) \]

\[ \mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^* \qquad (2.3) \]

\[ \xi_i,\ \xi_i^* \ge 0 \qquad (2.4) \]

where φ(·) is a nonlinear transformation to a higher-dimensional space (x_i ∈ ℝ^d → φ(x_i) ∈ ℝ^H, d ≤ H). The SVR defines a linear regressor in the transformed space (ℝ^H), also known as the feature space, which is nonlinear in the input space unless φ(x_i) = x_i (linear regression). ξ_i and ξ_i^* are positive slack variables, introduced to deal with the samples whose prediction errors are greater than ε. The parameter C is the penalization applied to the samples with prediction error greater than ε. This problem is usually solved by introducing the constraints (2.2)–(2.4) using Lagrange multipliers, leading to the minimization of

\[ L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i+\xi_i^*) - \sum_{i=1}^{n}\alpha_i(\varepsilon+\xi_i-y_i+\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i)+b) - \sum_{i=1}^{n}\alpha_i^*(\varepsilon+\xi_i^*+y_i-\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i)-b) - \sum_{i=1}^{n}\mu_i\xi_i - \sum_{i=1}^{n}\mu_i^*\xi_i^* \qquad (2.5) \]
with respect to w, b, ξ_i and ξ_i^*, and its maximization with respect to the Lagrange multipliers α_i, α_i^*, μ_i and μ_i^*. To solve this problem we need to compute the Karush–Kuhn–Tucker (KKT) conditions (Fletcher 1987), which state the conditions that the variables in (2.5) must satisfy, together with (2.2)–(2.4), to form an admissible solution set:

\[ \frac{\partial L_P}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n}(\alpha_i-\alpha_i^*)\phi(\mathbf{x}_i) = 0 \qquad (2.6) \]

\[ \frac{\partial L_P}{\partial b} = \sum_{i=1}^{n}(\alpha_i-\alpha_i^*) = 0 \qquad (2.7) \]

\[ \frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \qquad (2.8) \]

\[ \frac{\partial L_P}{\partial \xi_i^*} = C - \alpha_i^* - \mu_i^* = 0 \qquad (2.9) \]

\[ \alpha_i,\ \alpha_i^*,\ \mu_i,\ \mu_i^* \ge 0 \qquad (2.10) \]

\[ \alpha_i\{\varepsilon+\xi_i-y_i+\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i)+b\} = 0 \qquad (2.11) \]

\[ \alpha_i^*\{\varepsilon+\xi_i^*-\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_i)-b+y_i\} = 0 \qquad (2.12) \]

\[ \mu_i\xi_i = 0 \quad\text{and}\quad \mu_i^*\xi_i^* = 0. \qquad (2.13) \]

The usual procedure to optimize the SVR introduces the KKT conditions (2.6)–(2.9) into (2.5), leading to the maximization of

\[ L_d = -\varepsilon\sum_{i=1}^{n}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{n}y_i(\alpha_i-\alpha_i^*) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\phi^{\mathrm T}(\mathbf{x}_i)\phi(\mathbf{x}_j) \qquad (2.14) \]

subject to (2.7) and 0 ≤ α_i, α_i^* ≤ C. This problem can be solved using quadratic programming (QP) schemes (Schölkopf and Smola 2001). To solve (2.14) we do not need to know the nonlinear mapping φ(·), only its reproducing kernel Hilbert space (RKHS) kernel κ(x_i, x_j) = φᵀ(x_i)φ(x_j) (Schölkopf and Smola 2001). Some of the most widely used kernels are shown in table 1, where k is a natural number and σ is a real positive number. We must recall that the Mercer theorem (Schölkopf and Smola 2001) states the necessary and sufficient conditions for a function κ(x_i, x_j) to be a kernel in a Hilbert space.

Table 1. Some of the kernels used in the SVR.

Linear        κ(x_i, x_j) = x_iᵀx_j
Polynomial    κ(x_i, x_j) = (x_iᵀx_j + 1)^k
RBF           κ(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²))
One of the most difficult tasks when solving the SVR is choosing an appropriate ε, because it depends strongly on the given data, which makes it very hard to choose a good value a priori. There is an alternative formulation, known as the ν-SVR (Schölkopf et al 2000), which yields the same solution as the SVR (also known as the ε-SVR) but in which ε is replaced by ν. The value of ν roughly determines the fraction of support vectors; it therefore measures the complexity of the machine and is bounded between 0 and 1. Furthermore, the solution is not very sensitive to ν, and for a wide range of values the obtained ε is close to the optimal one. Details of the ν-SVR formulation, together with the proof that ν roughly represents the fraction of support vectors and that the solution is not very sensitive to the value of ν, can be found in Schölkopf et al (2000).
The SVR can also be solved by relying on an iterative re-weighted least squares (IRWLS) procedure, which is easier to implement and much faster than the usual QP schemes, as shown in Pérez-Cruz et al (2000) for the regular SVR and in Pérez-Cruz and Artés-Rodríguez (2001) for the ν-SVR; the latter is what we use throughout this paper.
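As a concrete illustration of the regression machinery just described, the following minimal sketch fits a linear ν-SVR on synthetic data. It relies on scikit-learn's NuSVR, a QP-based solver, rather than the IRWLS implementation used in this paper; the data and parameter values are ours, chosen only for illustration.

```python
# A minimal sketch of nu-SVR regression estimation (illustrative; not the
# authors' IRWLS solver). NuSVR solves the nu-SVR problem described above.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                  # input vectors x_i in R^d
w_true, b_true = np.array([0.5, -0.3]), 0.1
y = X @ w_true + b_true + 0.05 * rng.standard_t(6, size=1000)  # heavy-tailed noise

svr = NuSVR(kernel="linear", C=10.0, nu=0.1)    # nu ~ fraction of support vectors
svr.fit(X, y)

print("estimated w:", svr.coef_.ravel())        # weights of the linear regressor
print("estimated b:", svr.intercept_[0])
print("support-vector fraction:", len(svr.support_) / len(y))
```

With ν = 0.1, a fraction of the samples of the order of ν ends up as support vectors, in line with the interpretation of ν given above.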
3. Estimating GARCH models using SVR

3.1. The GARCH (1, 1) model

The GARCH (1, 1) model provides a simple representation of the main statistical characteristics of a return series for a wide range of assets and, consequently, it is extensively used to model real financial time series. It serves as a natural benchmark for the forecast performance of heteroskedastic models based on ARCH. In the simplest setup, if y_t follows a GARCH (1, 1) model, then
\[ y_t = \mu + \sigma_t\varepsilon_t, \qquad \sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta\sigma_{t-1}^2 \qquad (3.1) \]
where ε_t is an uncorrelated process with zero mean and unit variance. Although the mean of a financial return series may not be zero, it can usually be neglected (μ ≈ 0) without significantly degrading the performance of the proposed model. Following the definition in (3.1), the conditional variance σ_t² is a stochastic process assumed to be a constant plus a weighted average of the last period's forecast, σ²_{t−1}, and the last period's squared observation, y²_{t−1}. The parameters ω, α and β must satisfy ω > 0 and α, β ≥ 0 to ensure that the conditional variance is positive. The parameter ω has to be strictly positive for the process y_t not to degenerate. The process y_t is stationary iff α + β < 1.
This model is usually estimated using the (conditionally) Gaussian log-likelihood function and maximizing it through an iterative algorithm such as BHHH (Berndt et al 1974), because the function to be maximized is nonlinear in its arguments. The estimates are called ML when the Gaussian distribution is the underlying pdf from which the data have been sampled; if this is not the case, they are called quasi-ML. Bollerslev and Wooldridge (1992) showed the consistency of these estimates in this case, which does not ensure that they are the best estimates for a finite sample set.
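For concreteness, the quasi-ML estimation just described can be sketched as follows; we use a generic optimizer from SciPy instead of the BHHH algorithm, and the initialization of the variance recursion is an assumption of convenience.

```python
# Sketch of (quasi-)ML estimation of GARCH(1,1) under a conditionally Gaussian
# likelihood. A generic simplex optimizer stands in for BHHH (assumption).
import numpy as np
from scipy.optimize import minimize

def garch_sigma2(y, omega, alpha, beta):
    """Recursion sigma_t^2 = omega + alpha*y_{t-1}^2 + beta*sigma_{t-1}^2."""
    s2 = np.empty_like(y)
    s2[0] = y.var()                       # initialize at the sample variance (assumption)
    for t in range(1, len(y)):
        s2[t] = omega + alpha * y[t - 1] ** 2 + beta * s2[t - 1]
    return s2

def neg_loglik(params, y):
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                     # positivity and stationarity constraints
    s2 = garch_sigma2(y, omega, alpha, beta)
    return 0.5 * np.sum(np.log(s2) + y ** 2 / s2)  # Gaussian NLL up to a constant

def fit_garch_ml(y):
    res = minimize(neg_loglik, x0=np.array([0.1 * y.var(), 0.1, 0.8]),
                   args=(y,), method="Nelder-Mead")
    return res.x                          # (omega_hat, alpha_hat, beta_hat)
```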
The SVR can also be used to estimate this model. Selecting ν by cross validation helps to adjust the GARCH parameters to the pdf actually underlying the return series without knowing it beforehand. If the pdf is not Gaussian, the SVR will probably improve on the results attained using ML procedures.
The SVR needs to deal with observable variables in the model to be estimated. In equation (3.1) both σ_t (as the dependent variable) and x_t = [y_{t−1} σ_{t−1}]ᵀ (as the regressor vector) must be known in order to set the weights (α_i) in equation (2.14). With the time series being persistent, we decided to measure σ_t as a moving average of the contemporaneous and four lagged squared returns at each time point in the in-sample set, that is

\[ \hat\sigma_t^2 = \frac{1}{5}\sum_{k=0}^{4} y_{t-k}^2 \qquad (3.2) \]

and we can now use the series of y_t and σ_t to train the SVM. The value of C has been set to 10 as a compromise between the regularization of w and the weight of the errors (the solution is not very sensitive to this parameter). The value of ν has been set using eightfold cross validation (eightfold CV; Bishop 1995). We have divided the in-sample set into eight disjoint sets and have used seven of them to train several machines with different values of ν. The eighth set has been used to compute the validation error of each machine for each specific value of ν. This process is repeated seven more times, leaving aside in each repetition a different set for computing the validation error. Finally, all the validation errors for the same ν value are added together and the best ν is the one attaining the minimum validation error. This ν value is then used to train the machine with the full in-sample set. Further details on how to compute hyper-parameters in machine learning can be found in Bishop (1995). The software that has been used for solving the SVR has been placed on the http://www.tsc.uc3m.es/∼fernando web page.
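A sketch of this construction follows, with scikit-learn's NuSVR and GridSearchCV standing in for the authors' IRWLS solver and validation loop (an assumption of convenience); the regression uses the squared quantities y²_{t−1} and σ̂²_{t−1} so that the linear weights can be identified with (ω, α, β) in (3.1).

```python
# Sketch: build the volatility proxy (3.2), form regressors [y_{t-1}^2, sigma_{t-1}^2],
# and select nu by eightfold cross validation with C fixed to 10.
import numpy as np
from sklearn.svm import NuSVR
from sklearn.model_selection import GridSearchCV

def volatility_proxy(y):
    """sigma_hat_t^2 = (1/5) * sum_{k=0}^{4} y_{t-k}^2, defined for t >= 4."""
    return np.convolve(y ** 2, np.ones(5) / 5, mode="valid")   # length len(y) - 4

def fit_garch_svr(y_train):
    s2 = volatility_proxy(y_train)             # proxy for sigma_t^2, t = 4, ..., n-1
    y2 = y_train[4:] ** 2                      # aligned squared returns
    X = np.column_stack([y2[:-1], s2[:-1]])    # regressors built from y_{t-1}^2, sigma_{t-1}^2
    target = s2[1:]                            # dependent variable sigma_t^2
    search = GridSearchCV(NuSVR(kernel="linear", C=10.0),
                          param_grid={"nu": np.arange(0.1, 1.0, 0.1)},
                          cv=8)                # eightfold CV over nu
    search.fit(X, target)
    w = search.best_estimator_.coef_.ravel()
    b = search.best_estimator_.intercept_[0]   # identify (omega, alpha, beta) with (b, w1, w2)
    return b, w[0], w[1], search.best_params_["nu"]
```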
3.2. A simulation experiment
As an illustration of how the SVM works with GARCH-type
models we have performed a simulation study. We have
generated time series that follow the model in (3.1) setting
µ = 0, ω = 0.1, α = 0.1 and β = 0.8 and a disturbance
term εt distributed first as Gaussian and then as a Student’s
t with six degrees of freedom (kurtosis = 6). This second
distribution tries to model the excess kurtosis that appears
in real financial series.
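A sketch of this data-generating process; the Student's t draws are rescaled so that the innovations have unit variance, as required by (3.1):

```python
# Sketch: simulate a GARCH(1,1) path following (3.1) with mu = 0, omega = 0.1,
# alpha = 0.1, beta = 0.8 and Gaussian or unit-variance Student's t innovations.
import numpy as np

def simulate_garch(n, omega=0.1, alpha=0.1, beta=0.8, dist="gaussian", df=6, seed=0):
    rng = np.random.default_rng(seed)
    if dist == "gaussian":
        eps = rng.standard_normal(n)
    else:
        eps = rng.standard_t(df, size=n) * np.sqrt((df - 2) / df)  # unit variance
    y = np.empty(n)
    s2 = omega / (1 - alpha - beta)        # start at the unconditional variance
    for t in range(n):
        y[t] = np.sqrt(s2) * eps[t]        # y_t = sigma_t * eps_t (mu = 0)
        s2 = omega + alpha * y[t] ** 2 + beta * s2
    return y

y = simulate_garch(2000, dist="student", df=6)  # first 1000 in-sample, last 1000 out-sample
```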
We have drawn 2000 samples for each model and have
used the first 1000 samples to train the model, i.e. to obtain the
parameter estimates using ML and SVR methods, and have
used the remaining 1000 samples to assess the quality of the
obtained model. To measure the predicting ability of both
estimates, we have used R² statistics, as described in detail in section 5. Basically, the larger the R² value, the better the model predicts real data. We have reported the results obtained for the Gaussian and Student's t error distributions, respectively, in tables 2 and 3, with ν taking values across its whole range and C = 10. The reported results are the mean values of 50 independent trials.
In table 2, we report the R² statistics for the in-sample
period and the out-sample period. The best result over the
out-sample period, in which the predicting ability can be
best measured (the in-sample estimate can be over-fitted), is
obtained for the ML estimate because, for Gaussian data, the ML procedure gives the best estimate. But one notices that the best result for the SVM is not far from the ML estimate and that the solution is very stable for a wide range of possible ν values; recall that the values of ν are limited to between 0 and 1.
In table 3, the best result is obtained for the SVM with ν =
0.1. Although the ML procedure gives consistent estimates,
the SVM can give better predictions thanks to its robust cost
function, which allows one to obtain good predictions for any
distribution of ε_t, and to its regularization term, ‖w‖².
Finally, to assess the behaviour of both schemes with
different numbers of samples, we repeated the previous trials
with 1000 samples (the first 500 for the in-sample set and the
last 500 for the out-sample set) and with 2000 samples (1000
for the in-sample set and 1000 for the out-sample set), using
the same values for µ, ω, α and β. We considered εt to be
distributed as Gaussian and Student’s t random variables with
ten, seven, six and five degrees of freedom, which, respectively,
lead to kurtosis values of 4, 5, 6 and 9. We have set the value of
ν for the SVM to be equal to 0.1 and C = 10. We have run 50
independent trials for each distribution and number of samples
and we have reported the mean values for the out-sample R² statistics in figure 2.

Figure 2. A comparison of the average R² statistic for simulated data with different numbers of samples and kurtosis levels for ε_t (vertical axis: 100(R²_SVM − R²_ML)/R²_ML (%), for 500- and 1000-sample in-sample sets).
In figure 2, one first notices that as the kurtosis gets further
from 3 the SVM is a better predictor than the ML procedure
for both sample sizes. This occurs because the ML procedure
is tuned only for Gaussian distributions whereas the SVM is
appropriate for any distribution, which means it gives better
results for non-Gaussian distributions.
It is also relevant to mention that, for the reduced set, the SVM outperforms the ML procedure for every kurtosis value, i.e.

\[ \frac{R^2_{\mathrm{SVM}} - R^2_{\mathrm{ML}}}{R^2_{\mathrm{ML}}} > 0, \]

even when ε_t is distributed as a Gaussian random variable. This can be explained by VC theory: when we have very few samples, a simple model is sometimes better than the actual model, which allows the SVM, with the regularization over w, to seek better solutions, even when normality assumptions hold. As the number of samples in the in-sample set increases, the SVM and the ML method will both tend to the same solution, because both are consistent, but for reduced sets the SVM can give a much better result.

Table 2. Average R² statistic for simulated data: Gaussian distribution.

                      SVM (ν)                                               ML
                0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
In-sample R²    4.49   4.31   4.18   3.92   3.94   3.28   3.28   3.80   3.50   5.32
Out-sample R²   3.25   3.30   3.23   3.11   3.10   2.49   2.56   2.91   2.71   3.51

Table 3. Average R² statistic for simulated data: Student's t (6) distribution.

                      SVM (ν)                                               ML
                0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
In-sample R²    3.34   3.31   2.64   2.58   2.57   2.73   2.60   2.79   2.42   4.13
Out-sample R²   2.97   2.75   2.53   2.42   2.24   2.40   1.96   2.03   2.21   2.54
4. Empirical modelling
To illustrate the main empirical properties often observed in high-frequency financial time series, table 4 contains descriptive statistics of six financial time series observed daily. Let p_t be the observed daily price at time t and y_t the corresponding daily return, defined by

y_t = ln p_t − ln p_{t−1}.

The series considered are the returns of four international stock market indexes: the S&P100 index observed from January 1996 until October 2000; the FTSE100 index observed from January 1995 until December 2000; the IBEX35 of the Madrid Stock Exchange observed from January 1990 until December 1999; and the NIKKEI index from January 1995 until November 1999. The returns of two equities, General Motors and Hewlett Packard, were also analysed, from January 1996 until October 2000. In table 4 it is possible to observe that all the series show almost zero means and kurtosis in excess (always above 3) of the normal distribution value. We must point out that the return series show little or no correlation while their squares show high correlation coefficients. The analysis of serial correlation levels using the standard Box–Ljung statistic, denoted Q(20), indicates that the S&P100, FTSE100 and IBEX35 are significantly correlated. However, this evidence disappears when the test is corrected for conditional heteroskedasticity as proposed by West and Cho (1995); this corrected test is denoted in table 4 by Q*(20). The difference that can be observed between them seems to reinforce the evidence in favour of modelling their time-varying conditional variance using GARCH schemes. Furthermore, the analysis of the squared observations shows significant correlation and persistence. Thus, the Box–Ljung statistics corresponding to the squared observations (asymptotically equivalent to the test for ARCH effects of Engle (1982)) are greater than their critical values for all series, so we can expect to find more serial dependence in these series when they are modelled.
Table 4. Descriptive statistics of the daily returns (N, sample size; r(1), autocorrelation of order 1 of the original observations y_t; r2(k), autocorrelation of order k of the squared observations y_t²; Q(20), Box–Ljung statistic for y_t (31.4 is the 5% critical value); Q*(20), modified Box–Ljung statistic for y_t suggested by West and Cho (1995) (31.4 is the 5% critical value); Q2(20), Box–Ljung statistic for y_t² (31.4 is the 5% critical value)).

           S&P100      FTSE100     IBEX35      NIKKEI       GM          HP
N          1220        1480        2009        1070         1220        1220
Mean       7.51E−04    4.78E−04    7.46E−04    −5.11E−05    3.10E−04    8.05E−04
S.D.       0.0120      0.0103      0.0125      0.0154       0.0201      0.0287
Skewness   −0.357      −0.157      −0.360      0.0836       0.0543      −0.00372
Kurtosis   6.51        4.20        6.84        5.63         4.42        6.21
r(1)       −0.018      0.057       0.118       −0.042       −0.061      −0.040
Q(20)      32.6        44.9        73.3        22.3         25.0        26.4
Q*(20)     23.9        28.6        29.5        12.7         19.5        22.3

Squared observations y_t²
Mean       1.45E−04    1.06E−04    1.57E−04    2.38E−04     4.04E−04    8.25E−04
S.D.       3.38E−04    1.89E−04    3.75E−04    5.13E−04     7.49E−04    1.88E−03
r2(1)      0.247       0.130       0.211       0.119        0.112       0.031
r2(2)      0.142       0.193       0.197       0.123        0.070       0.032
r2(5)      0.131       0.181       0.179       0.107        0.064       −0.015
r2(10)     0.042       0.187       0.227       0.016        0.049       0.065
Q2(20)     175.9       770.3       1414.4      161.4        98.9        49.2
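Statistics of the kind reported in table 4 can be reproduced along the following lines; the standard Ljung–Box statistic comes from statsmodels, while the heteroskedasticity-corrected Q*(20) of West and Cho (1995) is not implemented in this sketch.

```python
# Sketch: descriptive statistics and Box-Ljung Q(20)/Q2(20) for a daily price series.
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def describe_returns(p):
    y = np.diff(np.log(p))                      # y_t = ln p_t - ln p_{t-1}
    return {"N": len(y), "mean": y.mean(), "sd": y.std(ddof=1),
            "skewness": stats.skew(y),
            "kurtosis": stats.kurtosis(y, fisher=False),   # equals 3 for a Gaussian
            "Q(20)": acorr_ljungbox(y, lags=[20]).lb_stat.iloc[0],
            "Q2(20)": acorr_ljungbox(y ** 2, lags=[20]).lb_stat.iloc[0]}
```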
We have plotted the S&P100 and General Motors return series in figure 3. It can readily be seen that the volatility concentrates in clusters, i.e. periods of high and low volatility can be observed in the data. We have also depicted nonparametric estimates of the pdf of the returns together with the corresponding normal density (same mean and variance). The density plots confirm the results reported in table 4 about the returns being heavy-tailed. Finally, correlograms of the squared return series, y_t², are also reported. The volatility clustering is reflected in the significant correlations of the squared returns. The y_t² autocorrelation coefficients are larger and last longer (persistence) than those of the y_t series.

Figure 3. Daily returns, densities of y_t (normal and estimated) and ACF of y_t² for the S&P100 and General Motors series.
Table 5 reports the ML estimates of the parameters of the GARCH (1, 1) model for all the return series considered. The original series of length N was divided into two sets: the in-sample (training) set with the first N/2 observations and the out-sample (test) set with the last N/2 observations. In this table it is possible to observe that all the series considered have significant ARCH effects and high persistence measured as α̂ + β̂ (except the HP series, which presents α̂ + β̂ = 0.6712). We check the model estimation using several statistics based on the standardized observations ε̂_t = y_t/σ̂_t, where σ̂_t is the estimated volatility from the GARCH model. The standardized observations still have heavy tails, as can be noted when scrutinizing table 5. However, the autocorrelations of the squared returns are no longer significant using either the usual Box–Ljung statistic or the corrected test suggested by Li and Mak (1994), in which the correction factors depend on the model specification and the results of the ML estimation. Therefore, the GARCH (1, 1)-ML model seems able to adequately represent the dynamics of the squared return series considered, although it is not able to explain the excess kurtosis present in the standardized observations.
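The residual diagnostics reported in tables 5 and 6 can be sketched in the same spirit; the Li and Mak (1994) corrected statistic Q2*(20) is not implemented here.

```python
# Sketch: model checking via standardized observations eps_hat_t = y_t / sigma_hat_t,
# with sigma_hat_t^2 from the GARCH(1,1) recursion of (3.1).
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def standardized_diagnostics(y, omega, alpha, beta):
    s2 = np.empty_like(y)
    s2[0] = y.var()                             # initialization (assumption)
    for t in range(1, len(y)):
        s2[t] = omega + alpha * y[t - 1] ** 2 + beta * s2[t - 1]
    eps_hat = y / np.sqrt(s2)
    return {"kurtosis": stats.kurtosis(eps_hat, fisher=False),
            "Q(20)": acorr_ljungbox(eps_hat, lags=[20]).lb_stat.iloc[0],
            "Q2(20)": acorr_ljungbox(eps_hat ** 2, lags=[20]).lb_stat.iloc[0]}
```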
Table 5. Estimated GARCH (1, 1) models by ML estimation and diagnostics for the in-sample data set (r(1), autocorrelation of order 1 of the standardized observations ε̂_t; r2(k), autocorrelation of order k of the squared standardized observations ε̂_t²; Q(20), Box–Ljung statistic for ε̂_t (31.4 is the 5% critical value); Q2(20), Box–Ljung statistic for ε̂_t² (31.4 is the 5% critical value); Q2*(20), modified Box–Pierce statistic for ε̂_t² suggested by Li and Mak (1994) when the GARCH model is estimated by ML (31.4 is the 5% critical value)).

           S&P100     FTSE100     IBEX35     NIKKEI      GM          HP
ω          6.34E−06   9.90E−08    6.34E−06   1.81E−06    4.07E−05    2.10E−04
α          0.111      0.0220      0.111      0.0318      0.0730      0.132
β          0.828      0.978       0.828      0.958       0.768       0.539
α+β        0.939      1.000ᵃ      0.939      0.990       0.841       0.671

Standardized observations ε̂_t = y_t/σ̂_t
Mean       0.0855     0.0949      0.0327     0.0411      0.0234      −0.0015
S.D.       0.997      0.995       1.00       0.999       1.000       1.011
Skewness   −0.677     −0.251      0.058      0.258       −0.257      0.014
Kurtosis   5.55       4.06        4.29       3.57        7.27        5.94
r(1)       0.0671     0.0417      0.1470     −9.81E−04   −0.0449     −0.0132
Q(20)      28.7       19.6        41.3       31.4        11.7        18.1
r2(1)      0.0258     −0.0441     0.0106     0.0305      −0.0163     0.0214
r2(2)      0.0093     0.0766      −0.0275    −0.0193     −0.0107     0.0248
r2(5)      0.0073     −0.0369     −0.0018    0.0167      −0.0153     0.0415
r2(10)     −0.0131    −0.0018     0.0426     −0.0414     −0.0181     −0.0072
Q2(20)     8.9        16.8        18.9       16.0        22.9        6.6
Q2*(20)    10.2       18.2        20.4       15.3        19.2        6.9

ᵃ 1 − (α + β) = +2.0E−06.

We show in table 6 the same parameters for the SVR estimation, but we must point out that the parameter values are quite different, owing to the different optimization procedures used by the two schemes. We have also included the value of the SVR parameter ν (computed using cross validation, as previously discussed). For every series this gave rise to a value far from the one that would be optimal if the data came from a Gaussian random variable, and we also report the value of ε associated with the optimal ν. The C parameter was set to 10 in all cases. The analysis of the standardized observations in table 6 shows that the
remaining errors exhibit relatively high correlation for the S&P100 and IBEX35; the squared errors of the FTSE100 and IBEX35 also show a high level of correlation when using the standard Box–Ljung test statistic, whose asymptotic distribution is based on the normality assumption. We must also point out that the SVM neither assumes normality of the return series nor produces normally distributed errors, so the results obtained should not be considered erroneous estimates by the SVR, but an expected consequence of not assuming a given distribution function. The results are almost identical when using the corrected Box–Ljung statistic, Q2*(20), which is also based on Gaussianity of the conditional distribution, leading to the conclusion that this assumption is far from valid in our case.

Table 6. Estimated GARCH (1, 1) models by SVR and diagnostics for the in-sample data set (r(1), autocorrelation of order 1 of the standardized observations ε̂_t; r2(k), autocorrelation of order k of the squared standardized observations ε̂_t²; Q(20), Box–Ljung statistic for ε̂_t (31.4 is the 5% critical value); Q2(20), Box–Ljung statistic for ε̂_t² (31.4 is the 5% critical value); Q2*(20), modified Box–Pierce statistic for ε̂_t² suggested by Li and Mak (1994) when the GARCH model is estimated using the ML procedure (31.4 is the 5% critical value)).

           S&P100     FTSE100     IBEX35     NIKKEI      GM          HP
ω          1.36E−05   7.43E−06    2.11E−05   4.03E−05    3.61E−05    2.47E−05
α          0.102      0.0481      0.0730     0.0478      0.0459      0.0171
β          0.785      0.810       0.774      0.841       0.804       0.924
α+β        0.886      0.858       0.847      0.888       0.849       0.941
ν          0.150      0.325       0.225      0.100       0.325       0.950
ε          4.61E−05   1.35E−05    3.95E−05   1.19E−04    6.17E−05    2.21E−06

Standardized observations ε̂_t = y_t/σ̂_t
Mean       0.0837     0.0968      0.0274     0.0383      0.0214      −9.12E−05
S.D.       0.939      1.02        0.915      1.03        1.19        0.777
Skewness   −0.668     −0.205      −0.019     0.241       −0.227      0.058
Kurtosis   5.88       4.27        4.01       3.55        7.15        6.77
r(1)       0.0604     0.0587      0.1430     0.0006      −0.0399     −0.0021
Q(20)      31.7       21.3        40.2       30.9        10.2        17.8
r2(1)      0.0449     −0.0338     0.0390     0.0529      0.0261      0.0173
r2(2)      0.0193     0.0690      0.0124     −0.0039     0.0018      0.0332
r2(5)      0.0226     −0.0124     0.0399     0.0326      −0.0087     0.0518
r2(10)     −0.0006    0.0558      0.0828     −0.0398     −0.0259     0.0124
Q2(20)     9.1        39.8        35.3       16.5        18.0        10.1
Q2*(20)    10.4       63.9        79.7       23.4        19.2        12.2
Thus, we encounter a problem when evaluating the model adequacy of the GARCH (1, 1) specification based on the SVR estimates, because all the test statistics for model checking are based (even asymptotically) on the Gaussian assumption. Finally, in the next section we will evaluate both methods of estimating the GARCH (1, 1) model in terms of their forecasting ability to replicate the observed series.
5. A comparison of the forecasting ability of the GARCH models
The ability of the GARCH models to provide good estimates
of equity and index return volatility is well documented. Many
studies show that the parameters of a variety of different
GARCH models are highly significant in the sample (see,
for example, Bollerslev 1986, 1987, Nelson 1991, Andersen
and Bollerslev 1998). There is, however, less evidence
that GARCH models provide good forecasts of equity return
volatility. Some studies (Franses and Van Dijk 1995, Figlewski
1997) examine the out-sample predictive ability of GARCH
models using the daily squared observations as a measure
of the realized volatility. All find that a regression of
realized volatility on forecast volatility produces a low R 2
statistic, defined below (often less than 10%) and hence the
predictive power of the forecasts may be questionable. Recent
works are introducing important improvements in the volatility
forecasting and testing procedures (Blair et al 2001), as we
will see in section 5.2, but SVR estimation can be used
independently of the initial approach used.
5.1. The daily squared observations as a measure of realized volatility
To forecast the squared daily returns when the underlying volatility process is GARCH (1, 1) we can employ equation (3.1), so the one-period-ahead forecast is simply given by

\[ \hat y_t^2 = \hat\omega + (\hat\alpha + \hat\beta)\, y_{t-1}^2. \]
Given the forecasts ŷ_t² of the squared returns y_t², we report the proportion of the sample variation explained by the forecasts with the R² statistic (Theil 1971), defined by

\[ R^2 = 1 - \frac{\sum_{t=1}^{N}(y_t^2-\hat y_t^2)^2}{\sum_{t=1}^{N}\bigl(y_t^2-\frac{1}{N}\sum_{s=1}^{N}y_s^2\bigr)^2} = 1 - \frac{\sum_{t=1}^{N}(y_t^2-\hat y_t^2)^2}{\sum_{t=1}^{N}(y_t^2-\overline{y^2})^2}. \qquad (5.1) \]

This relative accuracy statistic indicates that the model accounts for 100 × R² per cent of the variability in the observations. For example, R² = 0.11 means that the model accounts for 11% of the variability in the observations. If R² = 0, the model is incapable of extracting the deterministic part of the time series, if there is any. If R² is negative, the model introduces more variability than the sample mean of the original time series.
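Both pieces fit in a few lines; a sketch, with the array y assumed to hold the returns of either the in-sample or the out-sample period:

```python
# Sketch: one-step-ahead forecasts y_hat_t^2 = omega + (alpha + beta) * y_{t-1}^2
# and the relative accuracy statistic R^2 of equation (5.1).
import numpy as np

def forecast_squared_returns(y, omega, alpha, beta):
    return omega + (alpha + beta) * y[:-1] ** 2   # forecasts for t = 1, ..., N-1

def r2_statistic(y2, y2_hat):
    """R^2 = 1 - SSE(forecasts) / SSE(sample mean of y_t^2)."""
    return 1.0 - np.sum((y2 - y2_hat) ** 2) / np.sum((y2 - y2.mean()) ** 2)

# usage: r2_statistic(y[1:] ** 2, forecast_squared_returns(y, omega, alpha, beta))
```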
Table 7 shows the R² values calculated for the six financial return series analysed, using the GARCH model estimated by the SVR and ML schemes.

Table 7. R² statistic for GARCH forecasts; the in-sample and out-sample size is N/2.

         GARCH ML                      GARCH SVR
         In-sample R²  Out-sample R²   In-sample R²  Out-sample R²
SP100    0.0466        0.0365          0.0565        0.0427
FTSE     0.0911        0.0352          0.0475        0.0423
IBEX     0.0590        0.1341          0.0502        0.0999
NIKKEI   0.0110        0.0423          0.0108        0.0479
GM       0.0175        0.0055          0.0153        0.0066
HP       −0.0116       −0.0171         2.16E−04      0.0048

The SVR is able to explain a higher percentage of the variability of all the time series over the out-sample sets except for the IBEX35, where the ML method is superior. In addition, the SVR is always able to predict better than the sample mean, which is not the case for the ML technique (see the HP data set). These results are as expected, because the data sets do not resemble a Gaussian distribution, and a technique such as the SVR that does not assume normality seems able to extract more knowledge from the return series than the usual ML techniques.
We have plotted the values of ŷ_t² predicted by the SVR and the squared observations y_t² for the S&P100 index, for the in-sample and out-sample data sets, in figure 4. The predictions made by the two methods are very similar although, as we can see in table 7, the prediction obtained by the SVM explains over half a percentage point more of the variability than that of the ML scheme.

Figure 4. S&P100 (January 1996–October 2000). Squared observations y_t² and σ̂_t² for GARCH (1, 1) estimates using SVM procedures (R² in-sample = 0.0565; R² out-sample = 0.0427).
5.2. The intraday return observations as a measure of realized volatility

The volatility process is not an observable quantity, and there have been various attempts to propose a computational mechanism for ex post estimates of volatility, called realized volatility. The most common method for computing a realized volatility is to square the observed daily returns. If y_t = ε_tσ_t, with σ_t independent of ε_t ∼ i.i.d.(0, 1), then E[y_t²] = E[ε_t²σ_t²] = E[σ_t²]. Therefore, accurate forecasting of σ_t² will translate into accurate forecasting of y_t². This has been the methodology used in the previous sections, independent of the
estimation of the GARCH parameters, and it can also be seen from figure 4 that the squared volatilities σ_t² are quite similar to the squared observations y_t².
Andersen and Bollerslev (1998) showed that this method produces inaccurate measures of forecast ability, such as the former R² criterion, even for correctly specified volatility models. Independently of the adequacy of the model used, we will obtain poor predictions for σ_t² owing to the noisy component ε_t².
To avoid this problem, one solution is to produce a different volatility measure by sampling intraday data:

\[ \sigma^2_{\mathrm{intra},t} = \sum_{j=1}^{m} y^2_{(m),t+j/m} \qquad (5.2) \]
where m is the sampling frequency. Andersen and Bollerslev
(1998) have shown that the noisy component is diminished
and that theoretically the realized volatility is then much
closer to actual volatility during the day. They found, for
foreign exchange data, that the performance of ARCH models
improves as the amount of intraday data used to measure the
realized volatility increases. Related results for equity and
FX data are presented, respectively, in Andersen et al (2001a,
2001b).
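Given the intraday returns of each day, the realized volatility (5.2) is simply a row-wise sum of squares; a sketch, assuming the m intraday returns of each day are stored as the rows of an array:

```python
# Sketch: realized volatility (5.2) from intraday returns; `intraday` has shape
# (n_days, m), holding the m intraday returns observed within each day.
import numpy as np

def realized_volatility(intraday):
    return np.sum(intraday ** 2, axis=1)    # sigma2_intra_t, one value per day

# e.g. 248 business days of 15 min IBEX35 returns with m = 34 quotes per day:
# sigma2_intra = realized_volatility(intraday_returns.reshape(248, 34))
```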
As an example, we have tested the use of this different measure of realized volatility. We use 15 min intraday data for the IBEX35 stock market index during the year 2000 (248 business days with m = 34 quotes each day). We have plotted the squared returns, the intraday volatility (5.2) and the forecasted volatility for the SVR estimation in figure 5. In this plot, it can be seen that the intraday volatility presents a less abrupt behaviour than the squared daily observations.

Figure 5. IBEX35 (year 2000). Squared observations y_t² in blue (back, thin line), intraday volatility (equation (5.2)) in red (middle) and SVM forecasted volatilities σ̂_t² in black (front).
The out-sample R² value improves with this new measure of realized volatility. The out-sample R² value for this year, with the parameters estimated using the ML method, increases from 4 to 8%. Using SVM estimation, the same measure changes from 3 to 21%. The use of intraday data has increased the explicability of the time series, but this increase has been far more noticeable for the SVM (over seven times more explicability than with the daily data) than for the ML scheme (about twice as much explicability). Not only do these results show that the intraday volatility measure improves explicability, but they also suggest that the SVM is a far more adequate way of estimating the GARCH parameters than the usual ML schemes.
These results highlight the expected relative superiority of
SVM estimation against the ML procedure, developed under
more restrictive distributional assumptions (at least under the
assumption of Gaussianity in our case), because the former can
be interpreted as a robust estimation procedure. Furthermore,
Acosta et al (2002) show that, even using simulations of a conditionally Gaussian GARCH (1, 1) process, ML estimation of the conditional variance results in a measure that usually overestimates the magnitude of volatility in the time series. These authors also suggest that the bias in the estimation of the persistence (α + β) increases when the variance of the process is highly integrated. This is a characteristic displayed by all the series studied in this paper; therefore we would expect the same results for all of them, if we could compute the intraday volatility as for the IBEX35 time series.
6. Discussion
In this paper we have used the SVM to estimate the parameters
of a GARCH model instead of using the standard ML
procedure and have shown that the SVM is able to give more
accurate predictions than regular ML estimation. This can be
explained by the different natures of the estimates produced by the ML and SVM methods. The former tries to fit the residuals to a Gaussian distribution which, if correct, will provide the best fit but, if not, will give an extra error term caused by forcing the residuals to be Gaussian. The SVM tries to get the best fit to the data without relying on any prior knowledge, concentrating only on minimizing the prediction error for a given machine complexity. If the residuals were Gaussian, the SVM would not give as good a solution as the ML estimate, because the latter is based on this pdf model, but it would give a set of residuals that are Gaussian. As one can see from table 6, the residuals given by the SVM are not Gaussian, explaining why the predictive ability of the SVM is greater than that of ML estimation.
We have used the SVM as a linear machine to replace the ML estimation process and have not tried to obtain a nonlinear estimate using kernels, with which the comparison with other NNs could be readily made. We leave for future work the development of nonlinear machines in which the estimate of future volatilities will depend nonlinearly on past volatilities and observations. We also leave for future work the integration of the estimation, similar to the BHHH algorithm, in which one will not need to use an intermediate estimate of σ_t. These future studies will allow us to make a full comparison between NNs and SVMs and explain their advantages compared with linear estimation techniques.
Before ending, we would like to discuss the differences between the SVM and most NNs. The SVM has several properties that make it suitable for solving problems in which a linear or nonlinear dependency has to be estimated from the data. The most relevant property is that the SVM is based on a well established learning theory that is able to give bounds on the expected errors and on the convergence rate given the number of samples and the machine complexity (measured as the VC dimension) (Vapnik 1998). This is a nonasymptotic theory that holds for any number of samples. In addition, the SVM has two further desirable properties: the functional to be minimized is quadratic and linearly restricted, and the machine architecture is given by the learning procedure. The former property ensures that the solution cannot get trapped in local minima (there are none) and the latter precludes the need to search for the best connections and hidden layers, because the SVM solution provides them. The practitioner only needs to find the hyper-parameters of the SVM, which can be found either by using the SRM bounds or by cross validation.
References
Acosta E, Fernández F and Pérez J 2002 Volatility bias in the GARCH model: a simulation study Working Paper 20026-02 (http://www.fcee.ulpgc.es/hemeroteca/) University of Las Palmas de Gran Canaria
Andersen T G and Bollerslev T 1998 Answering the skeptics: yes
standard volatility models do provide accurate forecasts Int.
Econ. Rev. 39 885–905
Andersen T G, Bollerslev T, Diebold F X and Ebens H 2001a The
distribution of realized stock return volatility J. Financial
Economics 61 43–76
Andersen T G, Bollerslev T, Diebold F X and Labys P 2001b The
distribution of exchange rate volatility J. Am. Stat. Assoc. 96
42–55
Berndt E K, Hall B H, Hall R E and Hausman J A 1974 Estimation and inference in nonlinear structural models Ann. Econ. Social Meas. 4 653–65
Bishop C M 1995 Neural Networks for Pattern Recognition
(Oxford: Clarendon)
Blair B J, Poon S H and Taylor S J 2001 Forecasting S&P100
volatility: the incremental information content of implied
volatilities and high-frequency index returns J. Econometrics
105 5–26
Bollerslev T 1986 Generalized autoregressive conditional
heteroskedasticity J. Econometrics 31 307–27
Bollerslev T 1987 A conditional heteroskedasticity time series
model for speculative prices and rates of returns Rev.
Economics Statistics 69 542–7
Bollerslev T, Engle R F and Nelson D B 1994 ARCH models The
Handbook of Econometrics vol 4 (Amsterdam: Elsevier)
pp 2959–3038
Bollerslev T and Wooldridge J M 1992 Quasi maximum likelihood
estimation and inference in dynamic models with time varying
covariances Econometric Rev. 11 143–72
Boser B E, Guyon I M and Vapnik V N 1992 A training algorithm
for optimal margin classifiers 5th Annual ACM Workshop on
COLT (Pittsburgh, PA) pp 144–52
Engle R F 1982 Autoregressive conditional heteroskedasticity with estimates of the variance of UK inflation Econometrica 50 987–1008
Engle R F and Patton A J 2001 What good is a volatility model?
Quant. Finance 1 237–45
Figlewski S 1997 Forecasting volatility Financial Markets Inst. Instrum. 6 1–88
Fletcher R 1987 Practical Methods of Optimization (New York:
Wiley)
Franses P H and Van Dijk D 1995 Forecasting stock market volatility
using (nonlinear) GARCH models J. Forecasting 15 229–35
Huber P J 1964 Robust estimation of a location parameter Ann. Math. Stat. 35 73–101
Li W K and Mak T K 1994 On the squared residual autocorrelations
in nonlinear time series with conditional heteroskedasticity
J. Time Series Anal. 15 627–36
Nelson D B 1991 Conditional heteroskedasticity in asset returns: a
new approach Econometrica 59 347–70
Pérez-Cruz F and Artés-Rodríguez A 2001 An IRWLS procedure for ν-SVR Proc. ICASSP'01
Pérez-Cruz F, Navia-Vázquez A, Alarcón-Diana P L and
Artés-Rodríguez A 2000 An IRWLS procedure for SVR Proc. EUSIPCO'00
Schölkopf B and Smola A 2001 Learning with Kernels (Cambridge,
MA: MIT Press)
Schölkopf B, Smola A, Williamson R and Bartlett P L 2000 New
support vector algorithms Neural Comput. 12 1207–45
Smola A, Murata N, Schölkopf B and Müller K R 1998
Asymptotically optimal choice of ε-loss for support vector
machines Proc. ICANN’98
Theil H 1971 Principles of Econometrics (New York: Wiley)
Vapnik V 1982 Estimation of Dependences Based on Empirical
Data (Berlin: Springer)
Vapnik V 1998 Statistical Learning Theory (New York: Wiley)
West K D and Cho D 1995 The predictive ability of several models of exchange rate volatility J. Econometrics 69 367–91