

Orbit: Probabilistic Forecast with Exponential Smoothing

Edwin Ng, Zhishi Wang, Huigang Chen, Steve Yang, Slawek Smyl


{edwinng, zhishiw, huigang, steve.yang, slawek}@uber.com

arXiv:2004.08492v4 [stat.CO] 22 Jan 2021

Abstract

Time series forecasting is an active research topic in academia as well as in industry. Although we see an increasing adoption of machine learning methods in solving some of these forecasting challenges, statistical methods remain powerful when dealing with low-granularity data. This paper introduces Orbit, a package of refined Bayesian exponential smoothing models built with probabilistic programming packages such as Stan and Pyro. Our model refinements include an additional global trend, a transformation for the multiplicative form, the noise distribution, and the choice of priors. A benchmark study is conducted on a rich set of time series data sets, comparing our models with other well-known time series models.

1. Introduction

Time series forecasting is one of the most popular and yet most challenging tasks faced by researchers and practitioners. Its industrial applications span areas such as social science, bioinformatics, finance, business operations, and revenue forecasting. At Uber, time series forecasting is applied to problems ranging from demand/supply prediction in the marketplace to Ads budget optimization in marketing data science.

In recent years, machine learning and deep learning methods (Gers et al., 1999; Huang et al., 2015; Selvin et al., 2017) have gained increasing attention in time series forecasting, due to their ability to capture non-linear trends and complex interactions within multivariate time series. However, the theory of deep learning is still under active research and development and is not yet well established in the forecasting literature, especially in the case of univariate time series. At the same time, the black-box nature of machine learning/deep learning models causes difficulties in interpretability and explainability (Gunning, 2017; Gilpin et al., 2018). In a benchmark study with different time series data (Hewamalage et al., 2019), the authors show mixed performance of various deep learning models when compared against traditional statistical models.

On the other hand, traditional statistical parametric models have a well-established theoretical foundation, such as the popular autoregressive integrated moving average (ARIMA) model (Box & Jenkins, 1968) and exponential smoothing (Gardner Jr, 1985). Recently, researchers from Facebook developed Prophet (Taylor & Letham, 2018), which is based on an additive model where non-linear trends are fit with seasonality and holidays.

In this paper we propose a family of refined Bayesian exponential smoothing models, with great flexibility in the choice of priors, model type specifications, and noise distribution. Our model introduces a novel global trend term, which works well for short-term time series. Most importantly, our models come with a well-crafted software package in Python, called Orbit (Object-oriented Bayesian Time Series). Our package leverages the probabilistic programming languages Stan (Carpenter et al., 2017) and Pyro (Bingham et al., 2019) for the underlying MCMC sampling process and optimization. Pyro, developed by researchers at Uber, is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch and JAX on the backend. Orbit currently supports a subset of the available prediction and sampling methods for estimation using Pyro.

The remainder of this paper is organized as follows. In Section 2, we review some popular statistical parametric time series models. In Section 3, we give our refined model equations and review the Orbit package design. In Section 4, an extensive benchmark study is conducted to evaluate the proposed models' performance against the other time series models discussed in the paper. Section 5 covers the conclusion and future work. We focus on univariate time series in this work.
Uber Technologies, Inc.

Preliminary work. To be submitted to StanCon 2020.



2. Review

2.1. Problem Definition

Let {y_1, ..., y_t} be a sequence of observations at time t. Then a point forecast at time T denotes the process of estimating the set of future values {ŷ_{T+1}, ..., ŷ_{T+h}}, given information at t, ∀t ≤ T, where h denotes the forecasting horizon.

2.2. The State Space Models

State space models are a family of models which can be written in the general form

y_t = Z_t^T α_t + ε_t
α_{t+1} = T_t α_t + R_t η_t

One big advantage is that they are modular, in the sense that independent state components can be combined by concatenating their observation vectors Z_t and arranging the other model matrices as elements in a block diagonal matrix. This provides considerable flexibility for modeling trend, seasonality, regressors, and potentially other state components that may be necessary in practice.

However, such a form is not unique, since the same model can be expressed in multiple ways. In practice, there are various reduced forms introduced by researchers. We will go over some of those widely used in industry.

2.2.1. ARIMA

The ARMA model is one of the most commonly used methods to model univariate time series. ARMA(p, q) combines two components: AR(p) and MA(q). In the AR(p) model, the value of a given time series, y_t, can be estimated using a linear combination of the p past observations, together with an error term ε_t and a constant term c:

y_t = c + Σ_{i=1}^{p} ψ_i y_{t−i} + ε_t

where ψ_i, ∀i ∈ {1, ..., p} denote the model parameters, and p represents the order of the model. Similarly, the MA(q) model uses past errors as explanatory variables:

y_t = μ + Σ_{i=1}^{q} θ_i ε_{t−i} + ε_t

where μ denotes the mean of the observations, θ_i, ∀i ∈ {1, ..., q} represent the parameters of the model, and q denotes the order of the model. Essentially, MA(q) models the time series according to the random errors that occurred in the past q lags.

The ARMA(p, q) model can be constructed by combining the AR(p) model with the MA(q) model, i.e.,

y_t = c + Σ_{i=1}^{p} ψ_i y_{t−i} + Σ_{i=1}^{q} θ_i ε_{t−i} + ε_t

The ARMA(p, q) model is defined for stationary data. However, many realistic time series in practice exhibit a non-stationary structure, e.g., time series with trend and seasonality. ARIMA(p, d, q) (or SARIMA) overcomes this limitation by including an integration parameter of order d. In principle, ARIMA works by applying d differencing transformations to the time series until it becomes stationary before applying ARMA(p, q).

2.2.2. Exponential Smoothing

Another popular approach is exponential smoothing, in a reduced form

ŷ_{t|t−1} = Z^T α_{t−1}
ε_t = y_t − ŷ_{t|t−1}
α_t = T α_{t−1} + k ε_t

This was first described by Box and Jenkins (Box, 1998). Note that ŷ_{t|t−1} can be computed recursively. Hyndman et al. (2008) provide a complete review of such forms. The literature introduces the ETS form with the notation "A" and "M", which stand for additive and multiplicative, respectively. Well-known methods such as the Holt-Winters model can be viewed as ETS(A, A, A) or ETS(A, M, A). For notation such as A_d, the subscript d denotes that a damped factor is introduced. Many of these are implemented in the forecast package written in the statistical programming language R.

2.2.3. Bayesian Structural Time Series (BSTS)

Scott and Varian (Scott & Varian, 2014) provide a Bayesian approach on a reduced form as such

y_t = μ_t + τ_t + β^T x_t + ε_t
μ_{t+1} = μ_t + δ_t + η_{0t}
δ_{t+1} = δ_t + η_{1t}
τ_{t+1} = −Σ_{s=1}^{S−1} τ_{t+1−s} + η_{2t}

This model is named the local linear trend model and is available in the bsts package written in R. It allows additional regression estimation with coefficients β.

2.3. Prophet

Researchers at Facebook (Taylor & Letham, 2018) use a generalized additive model (GAM) with three main components: trend, seasonality, and holidays. They can be combined in the following equation:

y_t = g_t + s_t + h_t + ε_t



where g_t is the piecewise linear or logistic growth curve modeling the non-periodic changes in the time series, s_t is the seasonality term, h_t is the holiday effect with irregular schedules, and ε_t is the error term. At a high level, Prophet frames the forecasting problem as a curve-fitting exercise rather than looking explicitly at the time-based dependence of each observation within a time series. Moreover, as a computational tool/software, Prophet allows users to manually supply change points in fitting the trend term and to set the boundaries for saturating growth, which gives great flexibility in business applications.

3. The Orbit Package

3.1. Refined Model I - LGT

Our proposed Local and Global Trend (LGT) model is an additive version of the multiplicative model proposed in Rlgt (Smyl et al., 2019). In the forecast process, we have

y_t = μ_t + s_t + ε_t
μ_t = l_{t−1} + ξ_1 b_{t−1} + ξ_2 l_{t−1}^λ
ε_t ∼ Student(ν, 0, σ)
σ ∼ HalfCauchy(0, γ_0)

where l_{t−1}, ξ_1 b_{t−1}, ξ_2 l_{t−1}^λ, s_t, and ε_t can be viewed as the level, local trend, global trend, seasonality, and error term, respectively.

The update process is similar to the triple exponential smoothing form

l_t = ρ_l (y_t − s_t) + (1 − ρ_l) l_{t−1}
b_t = ρ_b (l_t − l_{t−1}) + (1 − ρ_b) b_{t−1}
s_{t+m} = ρ_s (y_t − l_t) + (1 − ρ_s) s_t

where ρ_l, ρ_b, and ρ_s are the smoothing parameters. Finally, we fit ρ_l, ρ_b, ρ_s, θ, ξ, λ, and ν in our training process, while γ_0 is a data-driven scaler.

This model structure is similar to the ETS family established by Rob Hyndman (De Livera et al., 2011; Hyndman & Athanasopoulos, 2018), except that the form we propose is a hybrid model generalizing both ETS(A, A, A) (when ξ = 0) and the autoregressive model (when ξ > 0). A similar approach was found earlier in (Smyl, Zhang et al., 2015). The advantages over Rlgt with such a form are two-fold. First, computation is more efficient in the additive form, where we can apply a simple log transformation to maintain multiplicative properties. Second, σ in Rlgt is parameterized with y_t, which introduces dependency in the noise generation, while σ is refined as an independent noise term in LGT. This change reduces the computation cost by vectorizing the noise generation process.

One limitation of this model is that it assumes l_t > 0, ∀t. To ensure this condition is satisfied during the training period, it imposes the requirement y_t > 0, ∀t.

3.2. Refined Model II - DLT

For use cases with y_t ∈ R, we provide an alternative, the Damped Local Trend (DLT) model. This model is an extension of ETS(A, A_d, A) in Hyndman et al. (2008), where θ is known as the damped factor. In the forecast process, we have

y_t = μ_t + s_t + r_t + ε_t
μ_t = D(t) + l_{t−1} + θ b_{t−1}

with the update process as such

g_t = D(t)
l_t = ρ_l (y_t − g_t − s_t − r_t) + (1 − ρ_l)(l_{t−1} + b_{t−1})
b_t = ρ_b (l_t − l_{t−1}) + (1 − ρ_b) θ b_{t−1}
s_{t+m} = ρ_s (y_t − l_t − r_t) + (1 − ρ_s) s_t
r_t = Σ_j β_j x_{jt}

There are a few choices of D(t) as a deterministic global trend. Options such as linear, log-linear, and logistic are included in the package. Another feature of DLT is the introduction of the regression component r_t. This serves the purpose of nowcasting or forecasting when exogenous regressors are known, such as events and holidays. Without loss of generality, assume

β_j ∼ Normal(μ_j, σ_j)

where μ_j = 0 and σ_j = 1 by default as a non-informative prior. More choices of priors for the regression component are available in the package.

3.3. Multiplicative Form

In the previous sections, we derived a general additive form for LGT and DLT. It is trivial that by using the transformation y′_t = ln(y_t), our forecast process yields a multiplicative form as such

ŷ_t = μ′_t · s′_t · r′_t

In the later benchmarking process, we use models fitted in such a multiplicative form to report performance metrics.

3.4. Package Design

Orbit is our Python package implementing the refined models discussed above. The package is written and designed from a strict object-oriented perspective, with the goals of re-usability, ease of maintenance, and high efficiency.

The base Estimator class contains generic logic to handle interaction with the underlying inference engine (e.g., PyStan, Pyro) along with utilities to load and save Orbit models, and the specifics of each model are implemented in its own model class. Figure 1 shows the overall workflow of the Orbit package.

Figure 1. Overall Design of Orbit Package.

4. Model Benchmark

4.1. Data

We performed a comprehensive benchmark study on five datasets:

• US and Canada rider first-trips with Uber (20 weekly series by city)
• US and Canada driver weekly first-trips with Uber (20 weekly series by city)
• Worldwide first-orders with Uber Eats (15 daily series by country)
• M3 series (1428 monthly series)
• M4 series (359 weekly series)

where the M3/M4 time series are well known in the forecast community (Makridakis et al., 2018).

4.2. Performance Metric

We use the symmetric mean absolute percentage error (SMAPE), a widely adopted forecast metric, as our performance benchmark metric

SMAPE = Σ_{t=1}^{h} |F_t − A_t| / ((|F_t| + |A_t|) / 2)

where F_t is the forecast value and A_t the actual value at time t, and h is the forecast horizon, which can also be considered as the "holdouts" in a backtest process. Following what the competitions suggest, we use forecast horizons of 13 and 18, respectively, for the M4 weekly and M3 monthly series with 1 split; for the Uber datasets, we use h = 13, 3 splits with 26 incremental steps for the weekly series, and h = 28, 4 splits with 14 incremental steps for the daily series. With multiple splits, we expect more robust results. The calculation is done with the help of the backtest utilities built into the Orbit package.

4.3. Results

We compared our proposed models, LGT and DLT, to other popular time series models such as SARIMA (Seabold & Perktold, 2010) and Facebook Prophet (Taylor & Letham, 2018). Both the Prophet and Orbit models use Maximum A Posteriori (MAP) estimates, and they are configured as similarly as possible in terms of optimization and seasonality settings. For SARIMA, we fit the (1, 1, 1) × (1, 0, 0)_S structure by maximum likelihood estimation (MLE), where S represents the choice of seasonality.
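The backtest-and-score procedure described above can be sketched in plain Python. The helper below implements the SMAPE formula exactly as written in Section 4.2 (summed over the horizon, with no 1/h normalization), and the last-value "naive" forecaster is only a hypothetical stand-in for the actual LGT/DLT/Prophet/SARIMA models; this is not the Orbit backtest utility itself.

```python
def smape(actuals, forecasts):
    """SMAPE as in Section 4.2: sum_t |F_t - A_t| / ((|F_t| + |A_t|) / 2)."""
    return sum(abs(f - a) / ((abs(f) + abs(a)) / 2.0)
               for a, f in zip(actuals, forecasts))

def backtest_smape(series, h, n_splits, step):
    """Average SMAPE of a naive (last-value) forecaster over several splits.

    Each split holds out h points; successive splits move the training
    cutoff back by `step` observations, mimicking incremental backtesting.
    """
    scores = []
    for i in range(n_splits):
        cut = len(series) - h - i * step          # end of training window
        train, holdout = series[:cut], series[cut:cut + h]
        naive = [train[-1]] * h                   # stand-in model
        scores.append(smape(holdout, naive))
    return sum(scores) / len(scores)              # aggregate across splits

series = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0]
avg_score = backtest_smape(series, h=2, n_splits=3, step=2)
```

Note that many references normalize SMAPE by the horizon h; we follow the unnormalized form given above.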

Table 1. Model Average SMAPE Comparison

MODEL           LGT             DLT             PROPHET         SARIMA
RIDER-WEEKLY    0.108 (0.030)   0.106 (0.033)   0.121 (0.035)   0.110 (0.041)
DRIVER-WEEKLY   0.205 (0.098)   0.206 (0.095)   0.274 (0.162)   0.207 (0.123)
EATER-DAILY     0.178 (0.036)   0.182 (0.043)   0.309 (0.055)   0.221 (0.040)
M4-WEEKLY       0.077 (0.073)   0.081 (0.080)   0.192 (0.321)   0.076 (0.131)
M3-MONTHLY      0.145 (0.155)   0.148 (0.158)   0.193 (0.234)   0.150 (0.171)

Within a dataset, the models under consideration were run on each time series separately, and we then report the aggregated metrics. Table 1 gives the average and the standard deviation (within parentheses) of SMAPE across the different models and datasets.

It shows that our models consistently deliver better accuracy than the other candidate time series models in terms of SMAPE. Orbit is also computationally efficient. For example, the average compute time per series with full MCMC sampling and prediction on a subset of the M4 weekly data is about 2.5 minutes and 16 ms, respectively. The run time for the same series in Prophet is about 10 minutes for sampling and 2.4 s for prediction. That is a 4x speed-up in training and an orders-of-magnitude difference in prediction.

Code and the M3/M4 data used in this benchmark study are available upon request.

5. Conclusion

We have shown that our proposed models consistently outperform the baseline time series models in terms of the SMAPE metric. Furthermore, we also identified compute cost improvements when using the Orbit package. In future work, we will continue to actively maintain the package, incorporate new models, and provide new features and enhancements (support for dual seasonality, full Pyro integration, etc.).

References

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.

Box, G. E. P., Jenkins, G. M., and Reinsel, G. C. Time series analysis: forecasting and control, volume 54, pp. 176–180. 1998.

Box, G. E. and Jenkins, G. M. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

De Livera, A. M., Hyndman, R. J., and Snyder, R. D. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496):1513–1527, 2011.

Gardner Jr, E. S. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985.

Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. 1999.

Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. IEEE, 2018.

Gunning, D. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web, 2, 2017.

Hewamalage, H., Bergmeir, C., and Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. arXiv preprint arXiv:1909.00590, 2019.

Huang, Z., Xu, W., and Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.

Hyndman, R., Koehler, A. B., Ord, J. K., and Snyder, R. D. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.

Hyndman, R. J. and Athanasopoulos, G. Forecasting: principles and practice. OTexts, 2018.

Hyndman, R. J., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O'Hara-Wild, M., Petropoulos, F., Razbash, S., Wang, E., and Yasmeen, F. forecast: Forecasting functions for time series and linear models. 2018.

Makridakis, S., Spiliotis, E., and Assimakopoulos, V. The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.

Scott, S. L. and Varian, H. R. Predicting the present with Bayesian structural time series. International Journal of Mathematical Modelling and Numerical Optimisation, 5(1-2):4–23, 2014.

Seabold, S. and Perktold, J. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010. Package version 0.11.1.

Selvin, S., Vinayakumar, R., Gopalakrishnan, E., Menon, V. K., and Soman, K. Stock price prediction using LSTM, RNN and CNN-sliding window model. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1643–1647. IEEE, 2017.

Smyl, S., Bergmeir, C., Wibowo, E., and Ng, T. W. Bayesian exponential smoothing models with trend modifications. https://cran.r-project.org/web/packages/Rlgt/index.html, 2019.

Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 72(1):37–45, 2018. Package version 0.7.1.
