Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul *1, Arjun Ashok *2 3 4, Andrew Robert Williams 3 4 ♢, Hena Ghonia 3 4, Rishika Bhagwatkar 3 4, Arian Khorasani 3 4 ♠, Mohammad Javad Darvishi Bayazi 3 ♠, George Adamopoulos 5 ♠, Roland Riachi 3 4, Nadhir Hassen 3 4, Marin Biloš 1 △, Sahil Garg 1 △, Anderson Schneider 1 △, Nicolas Chapados 2 4, Alexandre Drouin 2 4 △, Valentina Zantedeschi 2 ♣, Yuriy Nevmyvaka 1 ♣, Irina Rish 3 4

* Co-first authorship, authors contributed equally, order arbitrary. ♢♠△♣ Authors in each group contributed equally, order arbitrary. 1 Morgan Stanley, New York, USA; 2 ServiceNow Research, Montréal, Canada; 3 Université de Montréal, Montréal, Canada; 4 Mila-Quebec AI Institute, Montréal, Canada; 5 McGill University, Montréal, Canada. Correspondence to: Arjun Ashok <[email protected]>, Kashif Rasul <[email protected]>.
Preprint.

Abstract

Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-shot and few-shot generalization. However, despite the success of foundation models in modalities such as natural language processing and computer vision, the development of foundation models for time series forecasting has lagged behind. We present Lag-Llama, a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches and emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state-of-the-art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.

1. Introduction

Probabilistic time series forecasting is an important practical problem arising in a wide range of applications, from finance and weather forecasting to brain imaging and computer systems performance management (Peterson, 2017). Accurate probabilistic forecasting is usually an essential step towards subsequent decision-making in such practical domains. The probabilistic nature of such forecasting endows decision-makers with a notion of uncertainty, allowing them to consider a variety of future scenarios, along with their respective likelihoods. Various methods have been proposed for this task, ranging from classical autoregressive models (Hyndman & Athanasopoulos, 2021) to the more recent neural forecasting methods based on deep learning architectures (Torres et al., 2021). Note that the overwhelming majority of these previous approaches are focused on building dataset-specific models, i.e. models tested on the same dataset on which training is performed.

Recently, however, machine learning has been witnessing a paradigm shift due to the rise of foundation models (Bommasani et al., 2022) — large-scale, general-purpose neural networks pretrained in an unsupervised manner on large amounts of diverse data across various data distributions. Such models demonstrate remarkable few-shot generalization capabilities on a wide range of downstream datasets (Brown et al., 2020a), often outperforming dataset-specific models. Following the successes of foundation models in the language and image processing domains (OpenAI, 2023; Radford et al., 2021), we aim to develop foundation models for time series, investigate their behaviour at scale, and push the limits of transfer achievable across diverse time series domains.

In this paper, we present Lag-Llama — a foundation model for probabilistic time series forecasting trained on a large collection of open time series data, and evaluated on unseen time series datasets. We investigate the performance of Lag-Llama across several settings where unseen time series


datasets are encountered downstream with different levels of data history being available, and show that Lag-Llama performs comparably to or better than state-of-the-art dataset-specific models.

Our contributions:

• We present Lag-Llama, a foundation model for univariate probabilistic time series forecasting based on a simple decoder-only transformer architecture that uses lags as covariates.
• We show that Lag-Llama, when pretrained from scratch on a broad, diverse corpus of datasets, has strong zero-shot performance on unseen datasets, and performs comparably to models trained on the specific datasets.
• Lag-Llama also demonstrates state-of-the-art performance across diverse datasets from different domains after finetuning, and emerges as the best general-purpose model without any knowledge of downstream datasets.
• We demonstrate the strong few-shot adaptation performance of Lag-Llama on previously unseen datasets, across varying fractions of data history being available.
• We investigate the diversity of the pretraining corpus used to train Lag-Llama, and present the scaling laws of Lag-Llama with respect to the pretraining data.

2. Related Work

Statistical models have been the cornerstone of time series forecasting for decades, evolving continuously to address complex forecasting challenges. Traditional models such as ARIMA (Autoregressive Integrated Moving Average) set the foundation by using autocorrelation to forecast future values. ETS (Error, Trend, Seasonality) models advanced this by decomposing a time series into its fundamental components, allowing for more nuanced forecasting that captures trends and seasonal patterns. Theta models, introduced by Assimakopoulos & Nikolopoulos (2000), represented another significant advancement in time series forecasting. By applying a decomposition technique combining both long-term trend and seasonality, these models offer a simple yet effective method for forecasting. Despite the considerable successes of these statistical models and more advanced ones (Croston, 1972; Syntetos & Boylan, 2005; Hyndman & Athanasopoulos, 2018), these models share common limitations. Their primary shortfall lies in their inherent assumption of linear relationships and stationarity in time series data, which is often not the case in real-world scenarios marked by abrupt changes and non-linear dynamics. Furthermore, they may require extensive manual tuning and domain knowledge to select appropriate models and parameters for specific forecasting tasks.

Neural forecasting is a rapidly developing research area following the explosion of machine learning (Benidis et al., 2022). Various architectures have been developed for this setting, starting with RNN-based and LSTM-based models (Salinas et al., 2020; Wen et al., 2018). More recently, in light of the success of transformers (Vaswani et al., 2017) for sequence-to-sequence modelling in natural language processing, many variations of transformers have been proposed for time series forecasting. Different models (Nie et al., 2023a; Wu et al., 2020a;b) process the input time series in different ways to be digestible by a vanilla transformer, then re-process the output of the transformer for a point forecast or a probabilistic forecast. On the other hand, various other works propose alternative strategies to vanilla attention and build off the transformer architecture, for better models tailored to time series (Lim et al., 2021; Li et al., 2023; Ashok et al., 2023; Oreshkin et al., 2020a; Zhou et al., 2021a; Wu et al., 2021; Woo et al., 2023; Liu et al., 2022b; Zhou et al., 2022; Liu et al., 2022a; Ni et al., 2023; Li et al., 2019; Gulati et al., 2020).

Foundation models are an emerging paradigm of self-supervised or unsupervised learning on large datasets (Bommasani et al., 2022). Many such models (Devlin et al., 2019; OpenAI, 2023; Chowdhery et al., 2022; Radford et al., 2021; Wang et al., 2022) have demonstrated adaptability across modalities, extending beyond web data to scientific domains such as protein design (Robert Verkuil, 2022). Scaling the model, the dataset size, and the data diversity has also been shown to result in remarkable transfer capabilities and excellent few-shot learning on novel datasets and tasks (Thrun & Pratt, 1998; Brown et al., 2020b). Self-supervised learning techniques have also been proposed for time series (Li et al., 2023; Woo et al., 2022a; Yeh et al., 2023). Most related to our work is Yeh et al. (2023), who train on a corpus of time series datasets. The key difference is that they validate their model only on downstream classification tasks, and do not validate on forecasting tasks. Works such as Time-LLM (Jin et al., 2023), LLM4TS (Chang et al., 2023), GPT2(6) (Zhou et al., 2023a), UniTime (Liu et al., 2023), and TEMPO (Anonymous, 2024) freeze LLM encoder backbones while simultaneously fine-tuning/adapting the input and distribution heads for forecasting. The main goal of our work is to apply the foundation model approach to time series data and to investigate the extent of the transfer achievable across a wide range of time series domains.

3. Probabilistic Time Series Forecasting

We consider a dataset of $D \geq 1$ univariate time series, $\mathcal{D}^{\mathrm{train}} = \{x^i_{1:T^i}\}_{i=1}^{D}$, sampled at a specific discrete set of time points $t \in \{1, \ldots, T^i\}$, where $T^i$ represents the length of time series $i$. Given this dataset, we aim to train a predictive model that can accurately predict the values at the future $P \geq 1$ time points; we refer to these timesteps of our $D$ time series as the test dataset, denoted


$\mathcal{D}^{\mathrm{test}} = \{x^i_{T^i+1:T^i+P}\}_{i=1}^{D}$.

The univariate probabilistic time series forecasting problem involves modelling an unknown joint distribution of the $P$ future values of a one-dimensional sequence given its observed past until timestep $t$, from which prediction should be performed, and covariates:

$p_\phi(x^i_{t+1:t+P} \mid x^i_{1:t}, c^i_{1:t+P})$.   (1)

where $\phi$ represents the parameters of a parametric distribution. In practice, rather than considering the whole history of each time series $i$, which can vary considerably, we can instead sub-sample fixed context windows of size $C \geq 1$ of our choosing from the complete time series and learn an approximation of the unknown distribution of the next $P$ future values given the covariates:

$p_\phi(x^i_{C+1:C+P} \mid x^i_{1:C}, c^i_{1:C+P})$.   (2)

When the distribution is modeled by a neural network with parameters $\theta$, predictions are then conditioned on these (learned) parameters $\theta$. We approximate the distribution in Eq. (2) by an autoregressive model, using the chain rule of probability as follows:

$p_\phi(x^i_{C+1:C+P} \mid x^i_{1:C}, c^i_{1:C+P}; \theta) = \prod_{t=C+1}^{C+P} p_\phi(x^i_t \mid x^i_{1:t-1}, c^i_{1:t-1}; \theta)$.

4. Lag-Llama

We present Lag-Llama, a foundation model for univariate probabilistic forecasting. The first step in building such a foundation model for time series is training on a large corpus of diverse time series. When training on heterogeneous univariate time series corpora, the frequency of the time series in our corpus varies. Further, when adapting our foundation model to downstream datasets, we may encounter new frequencies and combinations of seen frequencies, which our model should be capable of handling. We now present a general method for tokenizing series from such a dataset, without directly relying on the frequency of any specific dataset, and thus potentially allowing unseen frequencies and combinations of seen frequencies to be used at test time.
4.1. Tokenization: Lag Features

The tokenization scheme of Lag-Llama involves constructing lagged features from the prior values of the time series, constructed according to a specified set of appropriate lag indices that include quarterly, monthly, weekly, daily, hourly, and second-level frequencies. Given a sorted set of positive lag indices $\mathcal{L} = \{1, \ldots, L\}$*, we define the lag operation on a particular time value as $x_t \mapsto k_t \in \mathbb{R}^{|\mathcal{L}|}$, where each entry $j$ of $k_t$ is given by $k_t[j] = x_{t-\mathcal{L}[j]}$. Thus, to create lag features for some context-length window $x_{1:C}$, we need to sample a larger window with $L$ more historical points, denoted by $x_{-L:C}$¶. In addition to these lagged features, we add date-time features of all the frequencies in our corpus, namely second-of-minute, hour-of-day, etc., up to the quarter-of-year, from the time index $t$. Note that while the primary goal of these date-time features is to provide additional information, for any time series, all except one date-time feature will remain constant from one time-step to the next, and from this the model can implicitly make sense of the frequency of the time series as well. Assuming we employ a total of $F$ date-time features, each of our tokens is of size $|\mathcal{L}| + F$. Fig. 1 shows an example tokenization. We note that a downside to using lagged features in tokenization is that it requires an $L$-sized or larger context window.

* Note that $\mathcal{L}$ refers to the list of lag indices, while $L$ is the last lag index in the sorted list $\mathcal{L}$.
¶ This is since a history of $L$ points in time is needed for all points in the context, starting from the first point in the context.

Figure 1: For a time series, we depict the tokenization at timestep $t$ of the value $x_t$, which contains lag features constructed using an example set of lag indices $\mathcal{L}$, where each value in the vector is from the past of $x_t$ (in blue), and $F$ possible temporal covariates (date-time features) constructed from timestamp $t$ (red).
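To make the tokenization concrete, the following is a minimal sketch of how a single token could be assembled from a raw series. The specific lag set, the particular date-time features, and the array layout here are illustrative assumptions for exposition, not the exact choices used by Lag-Llama.

```python
import numpy as np
import pandas as pd

# Illustrative lag indices (a real deployment would cover quarterly, monthly,
# weekly, daily, hourly, and second-level lags).
LAGS = [1, 2, 3, 7, 24, 168]   # assumed example set of lag indices
L = max(LAGS)                  # largest lag index; tokens need at least L points of history


def make_token(values: np.ndarray, timestamps: pd.DatetimeIndex, t: int) -> np.ndarray:
    """Build the token for position t: |L| lagged values plus F date-time features."""
    # Lagged features: k_t[j] = x_{t - L[j]}; requires t >= L.
    lag_feats = np.array([values[t - lag] for lag in LAGS], dtype=np.float32)

    # A few date-time covariates derived from the timestamp, scaled to [-0.5, 0.5].
    ts = timestamps[t]
    datetime_feats = np.array([
        ts.hour / 23.0 - 0.5,
        ts.dayofweek / 6.0 - 0.5,
        (ts.day - 1) / 30.0 - 0.5,
        (ts.month - 1) / 11.0 - 0.5,
    ], dtype=np.float32)

    # Token of size |L| + F; the current value x_t is the prediction target.
    return np.concatenate([lag_feats, datetime_feats])


# Usage: tokenize one position of an hourly series.
idx = pd.date_range("2020-01-01", periods=500, freq="h")
series = np.sin(np.arange(500) * 2 * np.pi / 24) + np.random.default_rng(0).normal(0, 0.1, 500)
token = make_token(series, idx, t=200)
print(token.shape)  # (len(LAGS) + 4,)
```

Because the token is built purely from lagged values and calendar features of the timestamp, the same construction applies to series of any sampling frequency, which is the point of this frequency-agnostic tokenization.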

4.2. Lag-Llama Architecture

Lag-Llama's architecture is based on the decoder-only transformer-based architecture LLaMA (Touvron et al., 2023). Fig. 2 shows a general schematic of this model with $M$ decoder layers. A univariate sequence $x^i_{-L:C}$, along with its covariates, is tokenized by concatenating the covariate vectors to a sequence of $C$ tokens $x^i_{1:C}$. These tokens are passed through a shared linear projection layer that maps the features to the hidden dimension of the attention module. Similar to Touvron et al. (2023), Lag-Llama incorporates pre-normalization via RMSNorm (Zhang & Sennrich, 2019) and Rotary Positional Encoding (RoPE) (Su et al., 2021) at each attention layer's query and key representations, as in LLaMA (Touvron et al., 2023).

Figure 2: The Lag-Llama architecture. Lag-Llama learns to output a distribution over the values of the next time step based on lagged input features. The input to the model is the token of a univariate time series $i$ at a given timestep, $x^i_t$, constructed as described in Sec. 4.1. Here, we use $c^i_t$ to refer to all additional covariates used along with the value at a timestep $t$, which include the $|\mathcal{L}|$ lags, $F$ date-time features, and summary statistics. The inputs are projected through $M$ masked decoder layers. The features are then passed through the distribution head and trained to predict the parameters of the forecast distribution of the next timestep.

After passing through the causally masked transformer layers, the model predicts the parameters $\phi$ of the forecast distribution of the next timestep, where the parameters are output by a parametric distribution head, as described in Sec. 4.3. The negative log-likelihood of the predicted distribution of all predicted timesteps is minimized.

At inference time, given a time series of size at least $L$, we can construct a feature vector that is passed to the model to obtain the distribution of the next time point. In this fashion, via greedy autoregressive decoding, we can obtain many simulated trajectories of the future up to our chosen prediction horizon $P \geq 1$. From these empirical samples, we can calculate uncertainty intervals for downstream decision-making tasks and metrics with respect to held-out data.

4.3. Choice of Distribution Head

The last layer of Lag-Llama is a distinct layer known as the distribution head, which projects the model's features to the parameters of a probability distribution. We can combine different distribution heads with the representational capacity of the model to output the parameters $\phi$ of any parametric probability distribution. For our experiments, we adopt a Student's t-distribution (Student, 1908) and output the three parameters corresponding to this distribution, namely its degrees of freedom, mean, and scale, with appropriate non-linearities to ensure that the relevant parameters stay positive. More expressive choices of distributions, such as normalizing flows (Rasul et al., 2021b) and copulas (Salinas et al., 2019a; Drouin et al., 2022; Ashok et al., 2023), are potential choices of distribution heads, however with the potential overhead of difficulties in model training and optimization. The goal of our work was to keep the model as simple as possible, which led us to adopt a simple parametric distribution head. We leave the exploration of such distribution heads for future work.
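As an illustration of how such a head can be wired up, here is a minimal PyTorch sketch of a Student's t distribution head with positivity constraints on the degrees of freedom and scale, together with the negative log-likelihood loss and per-timestep sampling. The layer sizes, the softplus parameterization, and the offset on the degrees of freedom are assumptions made for this sketch, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT


class StudentTHead(nn.Module):
    """Projects decoder features to (df, loc, scale) of a Student's t-distribution."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 3)  # one output per distribution parameter

    def forward(self, h: torch.Tensor) -> StudentT:
        raw_df, loc, raw_scale = self.proj(h).unbind(dim=-1)
        df = 2.0 + F.softplus(raw_df)          # keep degrees of freedom > 2 (finite variance)
        scale = F.softplus(raw_scale) + 1e-6   # keep scale strictly positive
        return StudentT(df=df, loc=loc, scale=scale)


# Usage on dummy decoder outputs: one feature vector per predicted timestep.
head = StudentTHead(hidden_dim=64)
features = torch.randn(8, 32, 64)             # (batch, timesteps, hidden)
targets = torch.randn(8, 32)                  # observed next-step values

dist = head(features)
nll = -dist.log_prob(targets).mean()          # negative log-likelihood to minimize
next_step_samples = dist.sample((100,))       # 100 empirical samples per timestep
print(nll.item(), next_step_samples.shape)
```

At inference time, repeating the sampling step and feeding each sampled value back as the next lag input yields the simulated trajectories described above; swapping this module for a flow- or copula-based head would only change the object returned by `forward`.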
4.4. Value Scaling

When training on a large corpus of time series data from different datasets and domains, each time series can be of a different numerical magnitude. Since we pretrain a foundation model over such data, we utilize the scaling heuristic (Salinas et al., 2019b) where, for each univariate window, we calculate its mean value $\mu^i = \sum_{t=1}^{C} x^i_t / C$ and variance $\sigma^i$. We can then replace the time series $x^i_{1:C}$ in the window by $\{(x^i_t - \mu^i)/\sigma^i\}_{t=1}^{C}$. We also incorporate $\mu^i$ and $\sigma^i$ as time-independent real-valued covariates for each token, to give the model information about the statistics of the inputs, which we call summary statistics.

During training and when obtaining the likelihood, the values are transformed using the mean and variance, while during sampling, every timestep of data that is sampled is de-standardized using the same mean and variance. In practice, instead of the standard scaler, we find that the following standardization strategy works well when pretraining our model.

Robust Standardization ensures that our time series processing is robust to outliers. This procedure normalizes the series by removing the median and scaling according to the Interquartile Range (IQR) (Dekking et al., 2005). For a context-window sized series $x_{1:C} = \{x_1, x_2, \ldots, x_C\}$, we standardize each time point as:

$x'_t = \dfrac{x_t - \mathrm{Med}(x_{1:C})}{\mathrm{IQR}(x_{1:C})}$,   (3)

where $\mathrm{IQR}(x_{1:C}) = \mathrm{Med}(\{x_{\lceil C/2 \rceil:C}\}) - \mathrm{Med}(\{x_{1:\lfloor C/2 \rfloor}\})$.   (4)
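A minimal sketch of this scaling step is shown below. It assumes the usual quantile-based IQR (upper quartile minus lower quartile) as the denominator, which is a closely related robust spread to the split-median form of Eq. (4); the epsilon guard is likewise an illustrative choice rather than the released code.

```python
import numpy as np


def robust_standardize(window: np.ndarray, eps: float = 1e-8):
    """Remove the median and scale by the interquartile range of a context window."""
    median = np.median(window)
    q25, q75 = np.percentile(window, [25, 75])
    iqr = max(q75 - q25, eps)                 # guard against near-constant windows
    return (window - median) / iqr, median, iqr


def de_standardize(scaled: np.ndarray, median: float, iqr: float) -> np.ndarray:
    """Map model outputs (or samples) back to the original scale of the series."""
    return scaled * iqr + median


# Usage: scale a context window containing an outlier, then invert the transform.
window = np.array([10.0, 11.0, 9.5, 10.5, 500.0, 10.2, 9.8, 10.1])
scaled, med, iqr = robust_standardize(window)
restored = de_standardize(scaled, med, iqr)
assert np.allclose(restored, window)
```

Because the median and IQR are insensitive to the single large value in the example window, the scaled inputs stay in a well-behaved range, which is the motivation for preferring this scaler over the mean/variance scaler during pretraining.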

4.5. Training Strategies

We employ a series of training strategies to effectively pretrain Lag-Llama on the corpus of datasets. Firstly, we find that employing a stratified sampling approach, where the datasets in the corpus are weighted by the total number of series they contain, is useful when sampling random windows from the pretraining corpus. Further, we find that employing the time series augmentation techniques Freq-Mix and Freq-Mask (Chen et al., 2023) serves to reduce overfitting. We search the hyperparameters of these augmentation strategies as part of our hyperparameter search.

5. Experimental Setup

5.1. Datasets

We collate a diverse corpus of 27 time series datasets from several sources across six different semantically grouped domains, such as energy, transportation, economics, nature, air quality and cloud operations; each dataset has a different set of characteristics, such as prediction lengths, number of series, lengths of each series, and frequencies. We leave out a few datasets from each domain for testing the few-shot generalization abilities of the pretrained model, while using the remaining datasets for pretraining the foundation model. Furthermore, we set aside datasets from entirely different domains to assess our model's performance on data that may lack any potential similarity to the datasets in pretraining. Such a setup mimics the real-world use of our model, where one may adapt it for datasets that fall closely within the distribution of domains that the model has been trained on, as well as for datasets in completely different domains. Our pretraining corpus comprises a total of 7,965 different univariate time series of different lengths which, when put together, comprise a total of around 352 million data windows (tokens) for our model to train on. App. §A lists the datasets we use, along with their sources and properties, their respective domains, and the dataset split used in our experiments.

Note that the term "domain" used here is just a label used to group several datasets, and does not represent a common source or data distribution; each of the pretraining and test datasets possesses very different general characteristics (patterns, seasonalities), apart from having other distinct properties. We use the default prediction length of each dataset for evaluation and ensure that there is a wide variety of prediction horizons in our unseen corpus of datasets, to evaluate models on short-term, medium-term, and long-term forecasting setups. App. §A lists the different datasets used in this work, along with the sources and properties of each dataset. Sec. §7.1 analyses the diversity of our corpus of datasets.

5.2. Baselines

We compare the performance of Lag-Llama to that of a large set of baselines, including both standard statistical models and deep neural networks.

Through AutoGluon (Shchur et al., 2023), an AutoML framework for probabilistic time series forecasting, we benchmark against 5 well-known statistical time series forecasting models: AutoARIMA (Hyndman & Khandakar, 2008) and AutoETS (Hyndman & Khandakar, 2008), which are established statistical models that tune model parameters locally for each time series (Hyndman & Khandakar, 2008); CrostonSBA (Syntetos and Boylan Approximate) (Croston, 1972; Syntetos & Boylan, 2005), an intermittent demand forecasting model using Croston's model with the Syntetos-Boylan bias correction approach; DynOptTheta (the Dynamically Optimized Theta model) (Box & Jenkins, 1976), a statistical forecasting method that is based on the decomposition of the time series into trend, seasonality and noise; and NPTS (Non-Parametric Time Series Forecaster) (Shchur et al., 2023), a local forecasting method that assumes a non-parametric sampling distribution. We further compare with 3 strong deep-learning methods through the same AutoGluon framework: DeepAR (Salinas et al., 2020), an autoregressive RNN-based method that has been shown to be a strong contender for probabilistic forecasting (Alexandrov et al., 2020); PatchTST (Nie et al., 2023b), a univariate transformer-based method that uses patching to tokenize time series; and TFT (Temporal Fusion Transformer) (Lim et al., 2021), an attention-based architecture with recurrent and feature-selection layers.

We benchmark against 4 more deep learning models: N-BEATS (Oreshkin et al., 2020b), a neural network architecture that uses a recursive decomposition based on projecting residual signals on learned basis functions; Informer (Zhou et al., 2021c), an efficient autoregressive transformer-based method that uses a ProbSparse self-attention mechanism to handle extremely long sequences; AutoFormer (Wu et al., 2022), a transformer-based architecture with an Auto-Correlation mechanism based on the series periodicity; and ETSFormer (Woo et al., 2022b), a transformer that replaces self-attention with exponential smoothing attention and frequency attention. We finally benchmark against OneFitsAll (Zhou et al., 2023b), a method that leverages a pretrained large language model (LLM) (GPT-2 (Radford et al., 2019)) and finetunes the input and output layers for time series forecasting.

Note that all the methods are compared in the univariate setup, where, similar to Lag-Llama, each time series is treated and forecasted independently. All methods produced using AutoGluon support probabilistic forecasts. All the other models (N-BEATS, Informer, AutoFormer, ETSFormer, and OneFitsAll) were originally designed for point forecasting on clean, normalized data; we adapt them for probabilistic forecasting by using a distribution head at the output and endowing them with all the features of Lag-Llama, such as value scaling.

Table 1: CRPS of Lag-Llama zero-shot and on finetuning on the unseen datasets, compared to supervised baselines trained solely on the respective datasets. Lower is better. A mean or standard deviation of 0.0000 signifies that the first non-zero digit is beyond 3 decimal places. The best results are in bold, and the second best results are in brown.

Model | Weather | Ped-Counts | ETT-M2 | Platform-Delay | Requests | Beijing-PM2.5 | Exchange | Average Rank
Supervised
ETSFormer | 0.528±0.175 | 0.275±0.024 | 0.140±0.002 | 0.171±0.025 | 0.218±0.070 | 0.266±0.099 | 0.029±0.014 | 13.000
NPTS | 0.276±0.000 | 0.684±0.006 | 0.139±0.000 | 0.132±0.001 | 0.085±0.001 | 0.170±0.003 | 0.059±0.001 | 12.714
OFA | 0.265±0.006 | 0.605±0.023 | 0.130±0.006 | 0.213±0.011 | 0.121±0.011 | 0.130±0.009 | 0.015±0.001 | 11.357
AutoFormer | 0.240±0.021 | 0.247±0.011 | 0.088±0.014 | 0.152±0.030 | 0.301±0.178 | 0.151±0.002 | 0.037±0.025 | 11.000
CrostonSBA | 0.177±0.000 | 0.594±0.000 | 0.102±0.000 | 0.097±0.000 | 0.042±0.000 | 0.198±0.000 | 0.031±0.000 | 9.429
AutoARIMA | 0.213±0.000 | 0.755±0.000 | nan±nan | 0.112±0.000 | 0.076±0.000 | 0.110±0.000 | 0.009±0.000 | 8.333
AutoETS | 0.215±0.000 | 0.625±0.000 | 0.081±0.000 | 0.297±0.000 | 0.041±0.000 | 0.090±0.000 | 0.008±0.000 | 8.000
DynOptTheta | 0.217±0.000 | 1.817±0.000 | 0.049±0.000 | 0.118±0.000 | 0.055±0.000 | 0.108±0.000 | 0.008±0.000 | 7.857
Informer | 0.172±0.011 | 0.223±0.005 | 0.070±0.003 | 0.106±0.009 | 0.104±0.012 | 0.057±0.003 | 0.017±0.004 | 6.429
DeepAR | 0.148±0.004 | 0.239±0.002 | 0.068±0.003 | 0.068±0.003 | 0.045±0.009 | 0.154±0.000 | 0.012±0.000 | 5.714
PatchTST | 0.178±0.013 | 0.254±0.001 | 0.035±0.000 | 0.094±0.001 | 0.024±0.003 | 0.145±0.001 | 0.011±0.000 | 5.643
N-BEATS | 0.134±0.003 | 0.267±0.018 | 0.031±0.005 | 0.112±0.007 | 0.021±0.005 | 0.081±0.004 | 0.024±0.004 | 5.071
TFT | 0.151±0.016 | 0.268±0.009 | 0.030±0.000 | 0.099±0.001 | 0.015±0.003 | 0.156±0.000 | 0.008±0.000 | 5.000
Zero-shot
Lag-Llama | 0.164±0.001 | 0.285±0.033 | 0.063±0.002 | 0.091±0.002 | 0.090±0.015 | 0.130±0.009 | 0.011±0.001 | 6.714
Finetuned
Lag-Llama | 0.132±0.001 | 0.227±0.010 | 0.017±0.001 | 0.096±0.002 | 0.012±0.002 | 0.125±0.021 | 0.009±0.000 | 2.786
5.3. Hyperparameter Search and Model Training Setups

We perform a random search over 100 different hyperparameter configurations and use the validation loss on the pretraining corpus to select our model. We elaborate on our hyperparameter search and model selection in Appendix D. During pretraining, we use a batch size of 256 and a learning rate of $10^{-4}$. Each epoch consists of 100 randomly sampled windows, each of length $L + C$, as described in Sec. 4.1. We use an early stopping criterion of 50 epochs based on the average validation loss of the training datasets in our pretraining corpus. When fine-tuning on a specific dataset, we train our models with the same batch size and learning rate, and each epoch consists of 100 randomly sampled windows from the specific dataset, each of length $L + (C + P)$, where $P$ now is the prediction length of the specific dataset. Since our model is decoder-only and the prediction length is not fixed, the model can therefore work for any downstream prediction length. We use an early stopping criterion of 50 epochs during fine-tuning, based on the validation loss of the dataset being finetuned on. We elaborate on our training procedure in Appendix B. For all the models trained in this paper, we use a single Nvidia Tesla-P100 GPU with 12 GB of memory, 4 CPU cores, and 24 GB of RAM.

5.4. Inference and Model Evaluation

Inference for a specific dataset is performed by sampling from the Lag-Llama model autoregressively, starting by conditioning on the context of length $C$, up to the prediction length $P$ defined for the given dataset. We use the Continuous Ranked Probability Score (CRPS) (Gneiting & Raftery, 2007; Matheson & Winkler, 1976), a common metric in the probabilistic forecasting literature (Rasul et al., 2021b;a; Salinas et al., 2019a; Shchur et al., 2023), for evaluating our model's performance. We use 100 empirical samples and report the CRPS averaged over the prediction horizon and across all the time series of a dataset. We further assess how well each method we benchmark does as a general-purpose forecasting algorithm, rather than a dataset-specific one, by measuring the average rank of each method, with respect to all others, over all the datasets.
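For reference, a sample-based CRPS of this kind can be estimated directly from the empirical forecast samples using the standard identity CRPS(F, y) = E|X − y| − ½ E|X − X′|. The sketch below is an illustrative NumPy estimator of that quantity, averaged over the horizon; it is not the evaluation code used in the experiments, which in practice comes from forecasting libraries such as GluonTS.

```python
import numpy as np


def crps_from_samples(samples: np.ndarray, y: np.ndarray) -> float:
    """Estimate CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| from forecast samples.

    samples: array of shape (num_samples, horizon) drawn from the forecast distribution.
    y:       array of shape (horizon,) with the observed ground-truth values.
    Returns the CRPS averaged over the prediction horizon.
    """
    term_accuracy = np.mean(np.abs(samples - y[None, :]))                         # E|X - y|
    term_spread = np.mean(np.abs(samples[:, None, :] - samples[None, :, :]))      # E|X - X'|
    return float(term_accuracy - 0.5 * term_spread)


# Usage: 100 sampled trajectories over a 24-step horizon vs. the realized values.
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 2 * np.pi, 24))
forecast_samples = truth[None, :] + rng.normal(0, 0.2, size=(100, 24))
print(crps_from_samples(forecast_samples, truth))
```

The first term rewards accuracy of the samples against the realized values, while the second term rewards sharpness of the predictive distribution, which is why CRPS is a natural metric for probabilistic rather than point forecasts.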
6. Results

We first evaluate the zero-shot performance of our pretrained Lag-Llama on the unseen datasets (subsection 6.1), when no samples from the new downstream domain are available for possible fine-tuning of the model. Note that such zero-shot forecasting scenarios are common in the time series forecasting literature (see, for example, the cold-start problem (Wikipedia, 2024; Fatemi et al., 2023)). We then fine-tune our pretrained Lag-Llama on each unseen dataset and evaluate the model after fine-tuning, to study how our pretrained model adapts to different unseen datasets and domains when there is considerable history available in the dataset to train on. We then evaluate the few-shot adaptation performance of our foundation model — a well-known scenario in other modalities (e.g., text) where foundation models are expected to demonstrate strong generalization capabilities. We vary the amount of history available for fine-tuning on each dataset, and present the few-shot adaptation performance of our model at various levels of history (section 6.2).

Table 2: CRPS of Lag-Llama on few-shot adaptation on the unseen datasets with different amounts of data history being available, compared to supervised baselines trained solely on the respective datasets. Lower is better. A mean or standard deviation of 0.0000 signifies that the first non-zero digit is beyond 3 decimal places. The best results are in bold.

Data % | Model | Weather | Ped-Counts | Exchange-Rate | ETT-M2 | Platform-Delay | Requests | Beijing-PM2.5 | Average Rank
20% | DeepAR | 0.156±0.004 | 0.241±0.002 | 0.033±0.000 | 0.089±0.000 | 0.094±0.002 | 0.065±0.000 | 0.176±0.006 | 3.429
20% | PatchTST | 0.169±0.017 | 0.259±0.008 | 0.012±0.000 | 0.035±0.001 | 0.088±0.001 | 0.025±0.000 | 0.153±0.003 | 2.714
20% | TFT | 0.154±0.002 | 0.296±0.027 | 0.009±0.000 | 0.038±0.000 | 0.087±0.002 | 0.017±0.000 | 0.144±0.004 | 2.000
20% | Lag-Llama | 0.136±0.001 | 0.239±0.016 | 0.017±0.001 | 0.016±0.001 | 0.108±0.005 | 0.011±0.001 | 0.147±0.008 | 1.857
40% | DeepAR | 0.159±0.022 | 0.237±0.022 | 0.011±0.002 | 0.053±0.000 | 0.100±0.000 | 0.030±0.003 | 0.158±0.000 | 3.071
40% | PatchTST | 0.171±0.017 | 0.253±0.007 | 0.011±0.001 | 0.035±0.000 | 0.092±0.000 | 0.025±0.002 | 0.162±0.000 | 2.929
40% | TFT | 0.156±0.001 | 0.269±0.002 | 0.008±0.000 | 0.036±0.000 | 0.104±0.000 | 0.014±0.002 | 0.150±0.000 | 2.500
40% | Lag-Llama | 0.135±0.000 | 0.229±0.003 | 0.009±0.001 | 0.017±0.002 | 0.102±0.002 | 0.014±0.001 | 0.149±0.011 | 1.500
60% | DeepAR | 0.158±0.023 | 0.234±0.009 | 0.011±0.001 | 0.049±0.006 | 0.114±0.006 | 0.026±0.002 | 0.157±0.004 | 3.071
60% | PatchTST | 0.174±0.011 | 0.241±0.004 | 0.011±0.000 | 0.035±0.001 | 0.093±0.003 | 0.028±0.002 | 0.159±0.001 | 2.929
60% | TFT | 0.152±0.001 | 0.272±0.000 | 0.008±0.000 | 0.037±0.000 | 0.113±0.008 | 0.017±0.002 | 0.154±0.000 | 2.429
60% | Lag-Llama | 0.133±0.001 | 0.246±0.002 | 0.009±0.001 | 0.016±0.001 | 0.099±0.005 | 0.012±0.001 | 0.133±0.003 | 1.571
80% | DeepAR | 0.145±0.005 | 0.243±0.015 | 0.016±0.003 | 0.071±0.020 | 0.113±0.002 | 0.131±0.000 | 0.156±0.001 | 3.429
80% | PatchTST | 0.174±0.033 | 0.247±0.015 | 0.015±0.002 | 0.035±0.000 | 0.091±0.003 | 0.024±0.000 | 0.153±0.002 | 2.714
80% | TFT | 0.148±0.004 | 0.287±0.013 | 0.008±0.000 | 0.042±0.008 | 0.094±0.001 | 0.017±0.000 | 0.152±0.006 | 2.429
80% | Lag-Llama | 0.132±0.001 | 0.215±0.006 | 0.009±0.000 | 0.019±0.001 | 0.099±0.008 | 0.013±0.002 | 0.131±0.016 | 1.429
6.1. Zero-Shot & Finetuning Performance on New Data

Tab. 1 presents the results comparing the performance of supervised baselines trained on specific datasets to the pretrained Lag-Llama's zero-shot performance on the unseen datasets, and to Lag-Llama finetuned on the respective unseen datasets. In the zero-shot setting, Lag-Llama achieves comparable performance to all baselines, with an average rank of 6.714. On fine-tuning, Lag-Llama achieves state-of-the-art performance on 3 datasets, while performance increases significantly on all other datasets. Most importantly, on fine-tuning, Lag-Llama achieves the best average rank of 2.786, with a significant difference of 2 points over the best supervised model, which suggests that if one had to choose a method to use without prior knowledge of the data, Lag-Llama would be the best option. This clearly establishes Lag-Llama as a strong foundation model that can be used on a wide range of downstream datasets, without prior knowledge of their data distributions — a key property that a foundation model should satisfy.

We now take a deeper dive into Lag-Llama's performance. Evaluated zero-shot, Lag-Llama achieves strong performance, notably on the platform-delay and weather datasets, where it is especially close to the baselines. With fine-tuning, Lag-Llama consistently improves performance compared to inferring zero-shot. On 3 datasets, namely ETT-M2, weather, and requests, the finetuned version of Lag-Llama achieves a significantly lower error than all the baselines, becoming the state-of-the-art. On the exchange-rate dataset, which comes from an entirely new domain and exhibits a new unseen frequency, Lag-Llama has comparable zero-shot performance, and when finetuned achieves performance similar to the state-of-the-art. This establishes that Lag-Llama performs well across frequencies and domains for which the model may or may not have seen similar data during pretraining. Lag-Llama achieves a better average rank, both in the zero-shot and finetuned setups, than the Informer, AutoFormer, and ETSFormer models, all of which use complex inductive biases to model time series, compared to Lag-Llama, which uses a simple architecture, lags and covariates, along with large-scale pretraining. Our observations suggest that at scale, when used similarly to Lag-Llama, vanilla decoder-only transformers outperform other transformer architectures. We point out that similar results have been shown in the NLP community (Tay et al., 2022) studying the influence of inductive bias at scale; however, we emphasize that we are the first to point out such a result for time series, potentially opening doors to further studies in time series that analyse the influence of inductive bias at scale. Next, compared to the OneFitsAll model (Zhou et al., 2023b), which adapts a pretrained LLM for forecasting, Lag-Llama achieves significantly better performance on all datasets, except for the dataset beijing-pm2.5, where it performs similarly to the baseline, while achieving a much better average rank than this model. These results demonstrate the potential of foundation models trained from scratch on a large and diverse collection of time series datasets when compared to the adaptation of pretrained LLMs, as in the OneFitsAll model (Zhou et al., 2023b). A detailed investigation of the advantages and disadvantages of adapting LLMs versus training time series foundation models from scratch is left as a direction for future work. We further visualize the forecasts produced by Lag-Llama on the unseen datasets qualitatively in App. §E.

Lag-Llama produces forecasts that closely match the ground truth. Further, comparing the forecasts produced by the model in the zero-shot (Fig. 8) and fine-tuned (Fig. 11) settings, one can clearly see that the quality of the forecasts increases significantly when the model is fine-tuned.

6.2. Few-Shot Adaptation Performance on Unseen Data

We restrict the data to only the last K% of the history from the training set of the datasets, where we set K to 20, 40, 60, and 80 percent, respectively. We train the supervised methods from scratch on the available data, while we fine-tune Lag-Llama. Results are presented in Tab. 2. Across varying levels of history being available for adaptation, Lag-Llama achieves the best average rank at all levels, which establishes Lag-Llama as a model with strong adaptation capabilities across all levels of data. As the amount of available history increases, Lag-Llama achieves increasingly better performance across all datasets, and the gap between the rank of Lag-Llama and the baselines widens, as expected. Note, however, that Lag-Llama is most often outperformed by TFT on the exchange-rate dataset, which is from an entirely new domain and has a new unseen frequency. Our observation demonstrates that, in cases where the data is most dissimilar to the pretraining corpus, Lag-Llama requires increasing amounts of history to train on, and, when given enough history to adapt, performs comparably to the state-of-the-art (as discussed in subsection 6.1).

Overall, our empirical results demonstrate that Lag-Llama has strong few-shot adaptation capabilities, and that, based on the characteristics of the downstream dataset, Lag-Llama can adapt and generalize with the appropriate amount of data.

7. Analysis

7.1. Data Diversity

Although loss has been found to scale with pre-training dataset size (Kaplan et al., 2020), it remains unclear what other properties of pre-training datasets lead to desirable model behaviour, despite some initial research in this direction (Chan et al., 2022). Notably, diversity in the pre-training data has contributed to improved zero-shot performance and few-shot adaptation (Brown et al., 2020b), notwithstanding the absence of an adequate definition.

To quantify the diversity of the pretraining corpus, we analyze the properties of its datasets through the 22 CAnonical Time series CHaracteristics ("catch22" features), a set of quickly computable time series features selected for their classification ability (Lubba et al., 2019) from the features of the highly comparative time-series analysis (hctsa) library (Fulcher et al., 2013). To assess diversity across datasets, we apply PCA to the features averaged per-dataset and plot the top 2 components. We find that having multiple datasets within domains and across domains increases the diversity of catch22 features in the top 2-component space (see Figure 12 in the Appendix).

7.2. Scaling Analysis

Dataset size has been shown empirically to improve performance (Kaplan et al., 2020). Constructing neural scaling laws (Kaplan et al., 2020; Caballero et al., 2023) can help understand how the performance of the model scales with respect to different parameters, such as the amount of pretraining data, the number of parameters in the model, etc. Towards understanding these quantities for models such as Lag-Llama, we fit neural scaling laws (Caballero et al., 2023) to our model's validation loss and present in App. §F.1 the obtained scaling laws that describe the performance of our model with respect to the amount of pretraining data.

8. Discussion

We present Lag-Llama, a foundation model for univariate probabilistic time series forecasting based on a simple decoder-only transformer architecture. We show that Lag-Llama, when pretrained from scratch on a large corpus of datasets, has strong zero-shot generalization performance on unseen datasets, and performs comparably to dataset-specific models. Lag-Llama also demonstrates state-of-the-art performance across diverse datasets from different domains after finetuning, and emerges as the best general-purpose model without any knowledge of downstream datasets. Lag-Llama further demonstrates strong few-shot adaptation performance across varying amounts of available data history. Finally, we investigate the diversity of the pretraining corpus used to train Lag-Llama.

Our work opens up several potential directions for future work. For now, collecting and collating a large-scale corpus of open time series datasets would be of high value, since the largest time series dataset repositories (Godahewa et al., 2021) are themselves too small. Further, scaling up the models beyond the model sizes explored in this work, using different training strategies, constitutes an essential next step towards building even more powerful time series foundation models. Finally, expanding our work from univariate towards multivariate approaches, by capturing the complex multivariate dynamics of real-world datasets, also constitutes an important direction for future work.

9. Impact Statement

The goal of this work is to introduce general-purpose foundation models for time series forecasting. There are many potential societal consequences of such models, including positive impacts on optimizing processes via better decision-making, as well as possible negative impacts.

To the best of our knowledge, none of the datasets used contain or are linked to any individual or personally identifiable data, and they have been sourced from the referenced locations.

10. Contributions

Arjun organized, planned, and led the project overall; refined and improved the Lag-Llama architecture by refining key components (lags, sampling of the model) and the training strategies (such as dropout, early stopping, learning rate scheduling); iterated on the dataset choices for Lag-Llama and dataset splitting strategies; fixed issues with data window sampling; ran and iterated on all large-scale pretraining, fine-tuning and few-shot learning experiments for Lag-Llama; and wrote several main parts of the paper.

Kashif wrote the code for the Lag-Llama architecture and training strategies; conducted initial experiments for Lag-Llama and other lag-based architectures that were explored in the project; added the Monash time series repository datasets to Hugging Face datasets, as well as other datasets; implemented all (but one) transformer-based time series models; worked to merge fixes/features upstream to GluonTS; integrated code with Hugging Face for open-source release; and wrote several main parts of the paper.

Hena added support for time features; updated the alternative Lag-Transformer model for experiments; added support for the key-value cache for faster inference; compiled a list of all GluonTS datasets and their descriptions and contributed to dataset compilation efforts; added utilities to track per-dataset validation and training loss; and worked with the Informer, Autoformer and ETSFormer models for the large-scale experiments of the paper.

Andrew expanded the empirical design of the paper for the fine-tuning and downstream adaptation settings; ran experiments and contributed to the writing of the first version of the paper; wrote several key sections of the paper; adapted the air quality and Huawei datasets; integrated the robust scaler for data normalization; and worked on the ideation and codebase of the catch22 feature-based dataset analysis for the paper.

Rishika ran experiments and contributed to the writing of the first version of the paper; added all time series augmentations to the codebase of the paper; updated the alternative Lag-Transformer model for new experiments; adapted the ETT datasets and the Azure/Borg/Alibaba (cloud) datasets; added options for automatic batch size search and plotting forecasts; and integrated distribution heads such as IQN for experiments.

Arian ran experiments and contributed to the writing of the first version of the paper, worked with the OneFitsAll model's initial code and experiments, and worked with the experiments for the N-BEATS model.

Mohammad worked with all AutoGluon models and experiments, added the option to use Stochastic Weight Averaging (SWA), and brainstormed about early stopping techniques to use when pretraining.

George ran experiments and contributed to the writing of the first version of the paper, adapting the Electricity Household Consumption, M5, Walmart, Rossman, and Corporation and Restaurant datasets used in the experiments of the project and the paper.

Roland worked with the code and experiments of the OneFitsAll model for all large-scale experiments in the paper, and contributed to writing several sections of the paper.

Nadhir integrated the N-BEATS model and worked with it for all large-scale experiments in the paper.

Marin wrote the initial code for sampling windows for the pretraining set and provided feedback on the GluonTS code and experimental setups.

Sahil, Anderson, Nicolas, Alexandre, Valentina, and Yuriy advised the project as a whole, provided feedback on the experiments and the paper, and contributed to the writing of several sections of the paper.

Irina advised the project with feedback at several stages, contributing to the writing of the paper, the acquisition of funding for the project, and conceiving and pushing forward the research direction in the early stages of the project.

11. Acknowledgements

We are grateful to Viatcheslav Gurev for useful discussions during the course of the project. We acknowledge and thank the authors and contributors of all the open-source libraries that were used in this work, especially: GluonTS (Alexandrov et al., 2020), NumPy (Harris et al., 2020), Pandas (Pandas development team, 2020), Matplotlib (Hunter, 2007) and PyTorch (Paszke et al., 2019).

We acknowledge the support from the Canada CIFAR AI Chair Program and from the Canada Excellence Research Chairs (CERC) Program. This project used compute resources provided by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This project further used compute resources provided by ServiceNow, Mila, and Compute Canada.

References

Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., Maddix, D. C., Rangapuram, S., Salinas, D., Schulz, J., Stella, L., Türkmen, A. C., and Wang, Y. GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of Machine Learning Research, 21(116):1–6, 2020. URL http://jmlr.org/papers/v21/19-820.html.

Anonymous. TEMPO: Prompt-based generative pre-trained transformer for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=YH5w12OUuU.

Ashok, A., Marcotte, É., Zantedeschi, V., Chapados, N., and Drouin, A. TACTiS-2: Better, faster, simpler attentional copulas for multivariate time series, 2023.

Assimakopoulos, V. and Nikolopoulos, K. The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4):521–530, 2000. ISSN 0169-2070. doi: https://doi.org/10.1016/S0169-2070(00)00066-2. URL https://www.sciencedirect.com/science/article/pii/S0169207000000662. The M3-Competition.

Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, Y., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L., Aubet, F.-X., Callot, L., and Januschowski, T. Deep learning for time series forecasting: Tutorial and literature survey. ACM Computing Surveys, 55(6):1–36, 12 2022. doi: 10.1145/3533382. URL https://doi.org/10.1145%2F3533382.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M., Krishna, R., Kuditipudi, R., Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X. L., Li, X., Ma, T., Malik, A., Manning, C. D., Mirchandani, S., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A., Narayanan, D., Newman, B., Nie, A., Niebles, J. C., Nilforoshan, H., Nyarko, J., Ogut, G., Orr, L., Papadimitriou, I., Park, J. S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R., Ren, H., Rong, F., Roohani, Y., Ruiz, C., Ryan, J., Ré, C., Sadigh, D., Sagawa, S., Santhanam, K., Shih, A., Srinivasan, K., Tamkin, A., Taori, R., Thomas, A. W., Tramèr, F., Wang, R. E., Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S. M., Yasunaga, M., You, J., Zaharia, M., Zhang, M., Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., and Liang, P. On the opportunities and risks of foundation models, 2022.

Box, G. and Jenkins, G. Time Series Analysis: Forecasting and Control. Holden-Day series in time series analysis and digital processing. Holden-Day, 1976. ISBN 9780816211043. URL https://books.google.ca/books?id=1WVHAAAAMAAJ.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020b.

Caballero, E., Gupta, K., Rish, I., and Krueger, D. Broken neural scaling laws. In The Eleventh International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.14891.

Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., and Hill, F. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.

Chang, C., Peng, W.-C., and Chen, T.-F. LLM4TS: Two-stage fine-tuning for time-series forecasting with pre-trained LLMs, 2023.

Chen, M., Xu, Z., Zeng, A., and Xu, Q. FrAug: Frequency domain augmentation for time series forecasting, 2023. URL https://openreview.net/forum?id=j83rZLZgYBv.

Chen, S. Beijing PM2.5 Data. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5JS49.

Chen, S. Beijing Multi-Site Air-Quality Data. UCI Machine Learning Repository, 2019. DOI: https://doi.org/10.24432/C5RK5G.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scaling language modeling with pathways, 2022.

Croston, J. D. Forecasting and stock control for intermittent demands. Operational Research Quarterly (1970–1977), 23(3):289–303, 1972. ISSN 00303623. URL http://www.jstor.org/stable/3007885.

Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., and Meester, L. E. A Modern Introduction to Probability and Statistics: Understanding Why and How, volume 488. Springer, 2005.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Drouin, A., Marcotte, E., and Chapados, N. TACTiS: Transformer-attentional copulas for time series. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5447–5493. PMLR, 07 2022. URL https://proceedings.mlr.press/v162/drouin22a.html.

Fatemi, Z., Huynh, M.-T. T., Zheleva, E., Syed, Z., and Di, X. Mitigating cold-start forecasting using cold causal demand forecasting model. ArXiv, abs/2306.09261, 2023. URL https://api.semanticscholar.org/CorpusID:259164537.

FiveThirtyEight. Uber TLC FOIL response. https://github.com/fivethirtyeight/uber-tlc-foil-response.

Fulcher, B. D., Little, M. A., and Jones, N. S. Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83):20130048, 2013.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007. doi: 10.1198/016214506000001437. URL https://doi.org/10.1198/016214506000001437.

Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., and Montero-Manso, P. Monash time series forecasting archive. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.

Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. Conformer: Convolution-augmented transformer for speech recognition, 2020.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.

Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. OTexts, Australia, 2nd edition, 2018.

Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. OTexts, 2021. ISBN 978-0987507136.

Hyndman, R. J. and Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Soft., 27(3):1–22, 2008. ISSN 1548-7660. doi: 10.18637/jss.v027.i03. URL https://doi.org/10.18637/jss.v027.i03.

Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., Chen, P.-Y., Liang, Y., Li, Y.-F., Pan, S., and Wen, Q. Time-LLM: Time series forecasting by reprogramming large language models, 2023.

Joosen, A., Hassan, A., Asenov, M., Singh, R., Darlow, L., Wang, J., and Barker, A. How does it function? Characterizing long-term trends in production serverless workloads. In Proceedings of the 2023 ACM Symposium on Cloud Computing, SoCC '23, pp. 443–458, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400703874. doi: 10.1145/3620678.3624783. URL https://doi.org/10.1145/3620678.3624783.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/6775a0635c302542da2c32aa19d86be0-Paper.pdf.

Li, Z., Wang, P., Rao, Z., Pan, L., and Xu, Z. Ti-MAE: Self-supervised masked time series autoencoders, 2023. URL https://openreview.net/forum?id=9AuIMiZhkL2.

Lim, B., Arık, S. O., Loeff, N., and Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2021.03.012. URL https://www.sciencedirect.com/science/article/pii/S0169207021000637.

Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., and Dustdar, S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=0EXmFzUn5I.

Liu, X., Hu, J., Li, Y., Diao, S., Liang, Y., Hooi, B., and Zimmermann, R. UniTime: A language-empowered unified model for cross-domain time series forecasting, 2023.

Liu, Y., Wu, H., Wang, J., and Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=ucNDIDRNjjv.

Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., and Jones, N. S. catch22: Canonical time-series characteristics: Selected through highly comparative time-series analysis. Data Mining and Knowledge Discovery, 33(6):1821–1852, 2019.

Matheson, J. E. and Winkler, R. L. Scoring Rules for Continuous Probability Distributions. Management Science, 22(10):1087–1096, 1976.

Ni, Z., Yu, H., Liu, S., Li, J., and Lin, W. Basisformer: Attention-based time series forecasting with learnable and interpretable basis. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xx3qRKvG0T.

Nie, Y., H. Nguyen, N., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023a.

Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=Jbdc0vTOcol.

OpenAI. GPT-4 technical report, 2023.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=r1ecqn4YwB.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=r1ecqn4YwB.

The pandas development team. pandas-dev/pandas: Pandas, February 2020. URL https://doi.org/10.5281/zenodo.3509134.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8026–8037. Curran Associates, Inc., 2019.

Peterson, M. An Introduction to Decision Theory. Cambridge Introductions to Philosophy. Cambridge University Press, second edition, 2017. doi: 10.1017/9781316585061.


Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.

Rasul, K., Seward, C., Schuster, I., and Vollgraf, R. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8857–8868. PMLR, 18–24 Jul 2021a. URL https://proceedings.mlr.press/v139/rasul21a.html.

Rasul, K., Sheikh, A.-S., Schuster, I., Bergmann, U. M., and Vollgraf, R. Multivariate probabilistic time series forecasting via conditioned normalizing flows. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=WiGQBFuVRv.

Salinas, D., Bohlke-Schneider, M., Callot, L., Medico, R., and Gasthaus, J. High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems, volume 32, pp. 6827–6837, 2019a.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019b. ISSN 0169-2070. URL http://www.sciencedirect.com/science/article/pii/S0169207019301888.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.

Shchur, O., Turkmen, A. C., Erickson, N., Shen, H., Shirkov, A., Hu, T., and Wang, B. AutoGluon–TimeSeries: AutoML for probabilistic time series forecasting. In AutoML Conference 2023 (ABCD Track), 2023. URL https://openreview.net/forum?id=XHIY3cQ8Tew.

Student. The probable error of a mean. Biometrika, pp. 1–25, 1908.

Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding, 2021.

Syntetos, A. A. and Boylan, J. E. The accuracy of intermittent demand estimates. International Journal of Forecasting, 21(2):303–314, 2005. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2004.10.001. URL https://www.sciencedirect.com/science/article/pii/S0169207004000792.

Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., and Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022.

Thrun, S. and Pratt, L. Learning to Learn: Introduction and Overview, pp. 3–17. Kluwer Academic Publishers, USA, 1998. ISBN 0792380479.

Torres, J. F., Hadjout, D., Sebaa, A., Martínez-Álvarez, F., and Troncoso, A. Deep learning for time series forecasting: a survey. Big Data, 9(1):3–21, 2021.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Verkuil, R., Kabeli, O., Du, Y., et al. Language models generalize beyond natural proteins, 2022.

Vito, S. Air Quality. UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C59K5F.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., and Wei, F. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks, 2022.

Wen, R., Torkkola, K., Narayanaswamy, B., and Madeka, D. A multi-horizon quantile recurrent forecaster, 2018.

Wikipedia. Cold start (recommender systems) — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Cold%20start%20(recommender%20systems)&oldid=1172519745, 2024. [Online; accessed 01-February-2024].

Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=PilZY3omXV2.


Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. ETSformer: Exponential smoothing transformers for time-series forecasting, 2022b.

Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. ETSformer: Exponential smoothing transformers for time-series forecasting, 2023. URL https://openreview.net/forum?id=5m_3whfo483.

Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=J4gRj6d5Qm.

Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, 2022.

Wu, N., Green, B., Ben, X., and O'Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case, 2020a.

Wu, S., Xiao, X., Ding, Q., Zhao, P., Wei, Y., and Huang, J. Adversarial sparse transformer for time series forecasting. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17105–17115. Curran Associates, Inc., 2020b. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c6b8c8d762da15fa8dbbdfb6baf9e260-Paper.pdf.

Yeh, C.-C. M., Dai, X., Chen, H., Zheng, Y., Fan, Y., Der, A., Lai, V., Zhuang, Z., Wang, J., Wang, L., and Zhang, W. Toward a foundation model for time series data, 2023.

Zhang, B. and Sennrich, R. Root mean square layer normalization. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):11106–11115, May 2021a. URL https://ojs.aaai.org/index.php/AAAI/article/view/17325.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pp. 11106–11115. AAAI Press, 2021b.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting, 2021c.

Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 27268–27286. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/zhou22g.html.

Zhou, T., Niu, P., Wang, X., Sun, L., and Jin, R. One fits all: Power general time series analysis by pretrained LM, 2023a.

Zhou, T., Niu, P., Wang, X., Sun, L., and Jin, R. One fits all: Power general time series analysis by pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=gMS6FVZvmF.


A. Details of Datasets

We use the following datasets in our experiments; their statistics are given in Table 4 and their domains in Table 3. Table 3 further indicates whether each dataset was part of the pretraining corpus or of the downstream testing corpus in our work.

The Air Quality UC Irvine Repository dataset (UCI) contains 9358 instances of hourly averaged responses from 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device in a polluted area (Vito, 2016).

The Australian Electricity Demand dataset comprises five half-hourly time series of the electricity demand across five Australian states: Victoria, New South Wales, Queensland, Tasmania, and South Australia (Godahewa et al., 2021).

The Beijing PM2.5 dataset contains hourly data of PM2.5 levels recorded by the US Embassy in Beijing. The dataset also includes meteorological data from Beijing Capital International Airport (Chen, 2017).

The Beijing Multi-Site Air-Quality dataset comprises hourly measurements of six primary air pollutants and six corresponding meteorological variables at various locations in Beijing over a period of four years (Chen, 2019).

The Electricity Hourly dataset captures electricity usage for 321 clients measured at hourly intervals from 2012 to 2014 (Godahewa et al., 2021).

The ETTh1, ETTh2, ETTm1, ETTm2 datasets contain 2 years' worth of data obtained from two electricity transformers at hourly and 15-minute frequencies, curated to help predict whether an electrical transformer's oil is at a safe temperature (Zhou et al., 2021b).

The Exchange Rate compilation encompasses the daily exchange rates of eight foreign currencies, namely Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore, spanning the period from 1990 to 2016 (Godahewa et al., 2021).

The Huawei cloud datasets contain serverless traces (Joosen et al., 2023). We select 8 series containing metrics based on the minute-frequency occurrences of the top 10 functions by median occurrences over 141 days: function delay, platform delay, cpu usage, memory usage, cpu limit, memory limit, instances, and requests.

The London Smart Meters dataset focuses on electrical consumption readings from smart meters in 5,567 households that participated in the UK Power Networks Low Carbon London project between November 2011 and February 2014 (Godahewa et al., 2021).

The KDD Cup 2018 dataset comprises extensive hourly time series data reflecting air quality levels across 59 stations in Beijing and London from January 2017 to March 2018. Measurements include PM2.5, PM10, NO2, CO, O3, and SO2 (Godahewa et al., 2021).

The Pedestrian Counts dataset (referred to as ped-counts in parts of the text) encompasses hourly pedestrian counts recorded by 66 sensors within the city of Melbourne, commencing in May 2009 (Godahewa et al., 2021).

The Solar dataset comprises 6000 simulated time series for 5-minute solar power and hourly forecasts of photovoltaic power plants in the U.S. in 2006. It includes 137 time series reflecting solar power production every 10 minutes in Alabama during 2006 (Godahewa et al., 2021).

The Sunspot dataset comprises a single extensive daily time series of sunspot numbers spanning from January 1818 to May 2020 (Godahewa et al., 2021).

The Traffic dataset encompasses 862 hourly time series depicting road occupancy rates on the freeways in the San Francisco Bay area from 2015 to 2016 (Godahewa et al., 2021).

The Uber TLC Hourly dataset consists of data on 4.5 million Uber pickups in NYC (April–September 2014) and 14.3 million pickups (January–June 2015). It includes trip details for 10 other for-hire vehicle companies and aggregated data for 329 companies (FiveThirtyEight; Godahewa et al., 2021).

The Weather dataset includes time series of hourly climate data near Monash University, Clayton, Victoria, Australia, from January 2010 to May 2021. The data contains series for temperature, dewpoint temperature, wind speed, mean sea level pressure, relative humidity, surface solar radiation, surface thermal radiation, and total cloud cover (Godahewa et al., 2021).

The Wind Farms dataset contains minute-frequency time series data tracking the wind power production of 339 wind farms in Australia (Godahewa et al., 2021).

B. Protocol Details

For all datasets used in the paper, we have a training and a test split that are non-overlapping based on the timestamps, as defined in the dataset. During pretraining, for each such dataset, we exclude the 14 last overlapping windows of the train split and use them as the dataset's validation set. When pretraining, we train on a combined dataset formed out of the train split of each dataset; after every epoch, we obtain the validation loss on the validation sets of all datasets used in the pretraining corpus. We use the average validation loss as the early-stopping criterion (this is referred to as "validation loss" in the paper). When fine-tuning on a specific dataset, we exclude the single last window of the train split and use it as the dataset's validation set.
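To make this splitting scheme concrete, the sketch below shows one way the held-out validation windows and the averaged early-stopping criterion could be implemented. It is an illustrative simplification in plain NumPy (treating the held-out windows as one contiguous block), not the actual training code.

import numpy as np

def train_val_split(series: np.ndarray, prediction_length: int, n_windows: int):
    # Hold out the last n_windows prediction windows of the train split for validation:
    # n_windows = 14 during pretraining, 1 during fine-tuning (see the text above).
    # For simplicity the held-out windows are taken as one contiguous block here,
    # whereas the paper describes overlapping windows.
    cutoff = len(series) - n_windows * prediction_length
    return series[:cutoff], series[cutoff:]

def early_stopping_criterion(per_dataset_val_losses):
    # The averaged validation loss over all pretraining datasets ("validation loss").
    return float(np.mean(per_dataset_val_losses))

# Toy example: an hourly series with prediction length 24.
series = np.arange(24 * 100, dtype=float)
train_part, val_part = train_val_split(series, prediction_length=24, n_windows=14)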


Table 3: Datasets used in the pretraining corpus and the unseen datasets on which we evaluate, grouped by the domains they are labelled
against.

Pretraining datasets
- Transport & Tourism: San Francisco Traffic, Uber TLC Hourly
- Energy: Australian Electricity Demand, Electricity Hourly, London Smart Meters, Solar, Wind Farms, ETT H1, ETT H2, ETT M1
- Nature: KDD Cup 2018, Sunspot
- Air Quality: Beijing Multisite, UCI
- Cloud: CPU Limit Minute, CPU Usage Minute, Function Delay Minute, Instances Minute, Memory Limit Minute, Memory Usage Minute
- Banking & Econ: (none)

Unseen datasets
- Transport & Tourism: Pedestrian Counts
- Energy: ETT M2
- Nature: Weather
- Air Quality: Beijing PM2.5
- Cloud: Requests Minute, Platform Delay Minute
- Banking & Econ: Exchange Rate

Table 4: Statistics of all the datasets used in the paper. For frequencies, H stands for hourly, T for minute, D for daily, and B for business day. Tokens refers to the total number of windows of size 1 in the dataset, computed as the aggregate number of timesteps across all series in that dataset.

Dataset | Freq | Domain | Prediction Length | Timestamps | # Series | Tokens
(Timestamps, # Series, and Tokens are computed on the train split.)
Australian Electricity Demand 0.5H Energy 60 230676 5 1153380
Electricity Hourly H Energy 48 26256 321 8428176
London Smart Meters 0.5H Energy 60 23844 5560 132572640
Solar 10T Energy 60 52500 137 7192500
Wind Farms T Energy 60 526980 339 178646220
Pedestrian Counts H Transport 48 84283 66 5562678
Uber TLC Hourly H Transport 24 4254 262 1114548
Traffic H Transport 24 14036 862 12099032
KDD Cup 2018 H Nature 48 10850 270 2929500
Sunspot D Nature 30 73894 1 73894
Weather D Nature 30 695 3010 2091950
Exchange Rate 1B Economic 30 6071 8 48568
ETT H1 H Energy 24 8640 1 8640
ETT H2 H Energy 24 8640 1 8640
ETT M1 15T Energy 24 34560 1 34560
ETT M2 15T Energy 24 34560 1 34560
Requests Minute T Cloud 60 64800 10 648000
Function Delay Minute T Cloud 60 64800 10 648000
Platform Delay Minute T Cloud 60 64800 10 648000
CPU Usage Minute T Cloud 60 64800 10 648000
Memory Usage Minute T Cloud 60 64800 10 648000
CPU Limit Minute T Cloud 60 64800 10 648000
Memory Limit Minute T Cloud 60 64800 10 648000
Instances Minute T Cloud 60 64800 10 648000
UCI H Air Quality 24 9357 13 121641
Beijing PM2.5 H Air Quality 24 43824 8 350592
Beijing Multisite H Air Quality 24 35064 132 4628448


Table 5: Hyperparameter choices for Lag-Llama. The values with * represent the optimal values obtained by hyperparameter search.

† Note that this is just the consecutive context that is sampled for each window; in practice, we use a much larger context window due to the use of lags, as described in Sec. § 4.1.

Hyperparameter | Lag-Llama
Number of layers | 1, 2, 3, 4, 5, 6, 7, 8*, 9
Number of heads | 1, 2, 3, 4, 5, 6, 7, 8, 9*
Embedding dimensions per head | 16*, 32, 64, 128, 256, 512
Context length C† | 32*, 64, 128, 256, 512, 1024
Augmentation probability | 0, 0.25, 0.5*, 1.0
Frequency masking rate | 0, 0.25, 0.5*, 1.0
Frequency mixing rate | 0, 0.25*, 0.5, 1.0
Weight decay | 0*, 0.25, 0.5, 1.0
Dropout | 0*, 0.25, 0.5, 1.0
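As an illustration of the random search described in App. §D, a single draw over the grid in Table 5 could look like the sketch below (the hyperparameter names are shorthand for the rows above; this is not the actual search code).

import random

search_space = {
    "num_layers": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "num_heads": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "embedding_dims_per_head": [16, 32, 64, 128, 256, 512],
    "context_length": [32, 64, 128, 256, 512, 1024],
    "augmentation_probability": [0, 0.25, 0.5, 1.0],
    "frequency_masking_rate": [0, 0.25, 0.5, 1.0],
    "frequency_mixing_rate": [0, 0.25, 0.5, 1.0],
    "weight_decay": [0, 0.25, 0.5, 1.0],
    "dropout": [0, 0.25, 0.5, 1.0],
}

def sample_configuration(space):
    # One of the 100 random configurations: each value drawn uniformly from its grid.
    return {name: random.choice(values) for name, values in space.items()}

configuration = sample_configuration(search_space)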

We train on the train split of the dataset and use the validation split for early stopping. We use the same setup as for fine-tuning Lag-Llama for all supervised baselines for which we produce results in the paper. Following typical evaluation setups (Shchur et al., 2023), all results reported in the paper are on the last prediction window of the test splits defined in App. §A.

C. Additional Empirical Results

C.1. Results on the Pretraining Datasets

A strong foundation model should not only adapt well zero-shot and few-shot to unseen distributions of data, but should also perform well in-distribution, i.e. on the datasets that the model has been pretrained on. Therefore, apart from evaluating our model on unseen datasets, we also evaluate it on the datasets we use for pretraining. Results are given in Tab. 6, Tab. 7, and Tab. 8; results on average rank across all datasets are given in Tab. 9. The training budget of Lag-Llama was split among all the pretraining datasets, while the supervised models trained on a given dataset do not have that constraint. Consequently, Lag-Llama did not see as much data from each dataset as the other models and is therefore not expected to perform as well as each supervised model on its specific dataset. This is reflected in the results, as Lag-Llama is not the best-performing model on every dataset. Still, Lag-Llama achieves a comparable average rank and is among the models achieving the top average ranks.

D. Hyperparameters of Lag-Llama

We perform a random search over 100 different hyperparameter configurations and use the average validation loss over all datasets in the pretraining corpus to select our model. We list the possible hyperparameters of Lag-Llama and the optimal values obtained by our hyperparameter search in Tab. 5. Our final model obtained by the hyperparameter search contains 2,449,299 parameters.

E. Forecast Visualizations

We plot sample forecasts and highlight the median, the 50th (dark green), and the 90th (light green) prediction intervals, starting with datasets in the pretraining corpus: Electricity Hourly in Figure 3, ETT-H2 in Figure 4, and Traffic in Figure 5. Zero-shot forecasts of Lag-Llama on downstream unseen datasets are shown for ETT-M2 in Figure 6, Pedestrian Counts in Figure 7, and Requests Minute in Figure 8. Finally, forecasts after fine-tuning on these downstream unseen datasets are shown for ETT-M2 in Figure 9, Pedestrian Counts in Figure 10, and Requests Minute in Figure 11. Note in particular the different magnitudes of the sampled values depending on the dataset, produced by the same shared model.

F. Additional Visualizations

F.1. Neural Scaling Laws

The parameters of the neural scaling law (Caballero et al., 2023) fit in Figure 13 to the validation loss (y) with respect to the number of pretraining data epochs seen (x), where each epoch is 100 randomly sampled windows, are given below:

y = a + b x^{-c_0} \prod_{i=1}^{n} \left( 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i}

with n = 1 and the fitted parameter values

a = -6.1167, b = 8.01589, c_0 = 0.0155, c_1 = -0.1043, d_1 = 1.6423 \times 10^{-36}, f_1 = -36.4660.

With such a law, one can extrapolate the validation loss of the model and predict performance in larger dataset regimes (Figure 13). As efforts progress towards collating better data repositories for time series foundation model training, such laws can help quantify the relations between the data used and the performance of the model.
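For convenience, the fitted law above can be evaluated directly. The following is a minimal sketch in plain Python (not the code used in the paper), plugging in the fitted constants, where x counts pretraining data epochs of 100 randomly sampled windows as defined above.

# Broken-neural-scaling-law form from App. F.1 with n = 1 and the fitted constants above.
a, b = -6.1167, 8.01589
c0, c1 = 0.0155, -0.1043
d1, f1 = 1.6423e-36, -36.4660

def predicted_validation_loss(x: float) -> float:
    """Extrapolated validation loss (negative log-likelihood) after x data epochs (x > 0)."""
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

# Example: extrapolate to a regime larger than the fitted range, e.g. 500 epochs.
print(predicted_validation_loss(500))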


Figure 3: Forecasting examples on the Electricity Hourly dataset

Figure 4: Forecasting examples from the ETT-H2 dataset

Figure 5: Forecasting examples from the Traffic dataset

Figure 6: Zero-shot forecasting examples on the unseen downstream ETT-M2 dataset


Figure 7: Zero-shot forecasting examples on the unseen downstream Pedestrian Counts dataset

Figure 8: Zero-shot forecasting examples on the unseen downstream Requests Minute dataset

Figure 9: Lag-Llama fine-tuned forecasting examples on the downstream ETT-M2 dataset

Figure 10: Lag-Llama fine-tuned forecasting examples on the downstream Pedestrian Counts dataset


Figure 11: Lag-Llama fine-tuned forecasting examples on the downstream Requests Minute dataset
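The prediction-interval plots referenced in App. §E can be drawn directly from forecast samples. Below is a minimal matplotlib sketch with a synthetic samples array of shape (num_samples, prediction_length); it is illustrative only and not the plotting code used for the figures above.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic forecast samples standing in for the model's sampled trajectories.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.3, size=(100, 24)).cumsum(axis=1)
t = np.arange(samples.shape[1])

median = np.median(samples, axis=0)
lo50, hi50 = np.percentile(samples, [25, 75], axis=0)  # central 50% interval
lo90, hi90 = np.percentile(samples, [5, 95], axis=0)   # central 90% interval

plt.fill_between(t, lo90, hi90, color="lightgreen", alpha=0.6, label="90% interval")
plt.fill_between(t, lo50, hi50, color="green", alpha=0.6, label="50% interval")
plt.plot(t, median, color="darkgreen", label="median")
plt.legend()
plt.show()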

Figure 12: Principal Component Analysis (PCA) on the average catch22 features of each pretraining dataset. We take the average of the catch22 features for each dataset, standardize them, and then perform PCA on those points, such that each point corresponds to one dataset. We then visualize these points projected onto the top 2 components and color the name of each dataset according to its domain. The datasets are spread over both components, showing diversity among the average catch22 features of the different datasets, and datasets from different domains tend to be clustered together, which indicates that combining different domains increases pretraining data diversity. Under the assumption that diversity in the pretraining data is beneficial for foundation model pretraining (Brown et al., 2020b), pretraining a single time series model on such a diverse combination of datasets from multiple domains is beneficial to the foundation model's zero-shot and few-shot adaptation performance.
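A sketch of the construction described in this caption is given below. It assumes the per-dataset average catch22 feature vectors have already been computed (for example with the pycatch22 package, averaged over the series of each dataset); the array contents and dataset names here are placeholders.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One row per dataset, one column per catch22 feature (22 features),
# each row being the average feature vector over the series of that dataset.
dataset_names = ["traffic", "electricity_hourly", "solar"]      # illustrative subset
avg_catch22 = np.random.default_rng(0).normal(size=(3, 22))     # placeholder values

standardized = StandardScaler().fit_transform(avg_catch22)
top2 = PCA(n_components=2).fit_transform(standardized)          # project onto top 2 PCs

for name, (pc1, pc2) in zip(dataset_names, top2):
    print(f"{name}: PC1 = {pc1:.2f}, PC2 = {pc2:.2f}")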


Table 6: CRPS of Lag-Llama on 7/20 datasets in the pretraining corpus, compared to supervised baselines trained solely on the respective datasets. Lower is better. A mean or standard deviation of 0.000 signifies that the first non-zero digit is beyond 3 decimal places.

Model | Aus-Elec-Demand | Electricity | KDD-Cup | London-Smart-Meters | Solar | Sunspot | Traffic

AUTOARIMA 0.065±0.000 0.098±0.003 0.552±0.000 NAN ± NAN 0.558±0.000 77.862±0.000 0.277±0.000


AUTO ETS 0.160±0.000 0.104±0.000 2.350±0.000 NAN ± NAN 0.551±0.000 171.363±0.000 0.492±0.000
C ROSTON SBA 0.127±0.000 0.244±0.000 0.459±0.000 0.500±0.000 1.016±0.000 34.458±0.000 0.414±0.000
D EEPAR 0.043±0.000 0.085±0.005 0.327±0.014 0.409±0.000 0.446±0.002 1.390±0.000 0.100±0.000
DYNAMIC O PTIMIZE 0.043±0.000 0.203±0.000 0.550±0.000 0.681±0.000 1.580±0.000 181.350±0.000 0.383±0.000
NPTS 0.098±0.000 0.139±0.001 0.346±0.001 0.464±0.000 0.404±0.001 201.558±10.653 0.191±0.000
PATCH TST 0.056±0.000 0.088±0.001 0.432±0.043 0.375±0.000 0.734±0.002 3.083±0.000 0.153±0.001
T EMPORAL F USION T 0.041±0.000 0.100±0.008 0.411±0.023 0.343±0.000 0.443±0.003 25.675±0.000 0.108±0.001
NBEATS 0.032±0.002 0.072±0.000 0.435±0.080 0.453±0.000 0.655±0.000 20.089±20.404 0.116±0.000
OFA 0.112±0.003 0.286±0.040 0.491±0.034 0.285±0.046 3.786±0.234 38.119±1.536 0.446±0.009
I NFORMER 0.064±0.020 0.081±0.002 0.351±0.000 0.424±0.011 0.990±0.140 4.765±0.336 0.157±0.000
AUTO F ORMER 0.090±0.021 0.102±0.005 0.451±0.018 0.383±0.003 2.107±0.425 40.456±12.354 0.185±0.010
ETSF ORMER 0.105±0.011 0.191±0.026 0.692±0.071 0.460±0.009 1.271±0.086 58.708±17.080 0.188±0.008
L AG LL AMA 0.087±0.018 0.095±0.013 0.323±0.004 0.381±0.003 1.536±0.237 4.961±1.912 0.119±0.001

Table 7: CRPS of Lag-Llama on the 7/20 datasets in the pretraining corpus, compared to supervised baselines trained solely on the respective datasets. Lower is better. A mean or standard deviation of 0.000 signifies that the first non-zero digit is beyond 3 decimal places.

Model | Uber | WindFarms | ETT H1 | ETT H2 | ETT M1 | AirQualityUCI | BeijingMultisite

AUTOARIMA 0.322±0.000 0.084±0.000 0.120±0.000 0.095±0.000 NAN ± NAN 0.206±0.000 0.359±0.000


AUTO ETS 0.461±0.000 0.096±0.000 0.117±0.000 0.105±0.000 0.073±0.000 0.220±0.000 0.472±0.000
C ROSTON SBA 0.427±0.000 0.130±0.000 0.123±0.000 0.112±0.000 0.094±0.000 0.237±0.000 0.400±0.000
D EEPAR 0.170±0.003 0.070±0.000 0.105±0.002 0.082±0.010 0.074±0.007 0.195±0.006 0.282±0.032
DYNAMIC O PTIMIZE 0.433±0.000 0.060±0.000 0.117±0.000 0.085±0.000 0.070±0.000 0.216±0.000 0.394±0.000
NPTS 0.191±0.000 0.208±0.000 0.268±0.001 0.216±0.001 0.162±0.000 0.130±0.001 0.414±0.006
PATCH TST 0.219±0.007 0.057±0.000 0.099±0.001 0.067±0.001 0.063±0.001 0.189±0.003 0.304±0.016
T EMPORAL F USION T 0.197±0.012 0.055±0.000 0.082±0.006 0.049±0.001 0.058±0.000 0.227±0.026 0.410±0.019
NBEATS 0.352±0.000 0.117±0.000 0.013±0.001 0.010±0.001 0.009±0.000 0.156±0.004 0.340±0.016
OFA 0.424±0.006 0.190±0.010 0.172±0.002 0.148±0.002 0.146±0.006 0.201±0.016 0.362±0.040
I NFORMER 0.196±0.003 0.099±0.014 0.174±0.003 0.112±0.014 0.098±0.008 0.191±0.024 0.241±0.016
AUTO F ORMER 0.205±0.007 0.246±0.038 0.155±0.010 0.119±0.005 0.119±0.008 0.172±0.012 0.238±0.012
ETSF ORMER 0.313±0.011 0.588±0.331 0.142±0.004 0.102±0.005 0.108±0.003 0.197±0.021 0.481±0.084
L AG LL AMA 0.168±0.002 0.145±0.009 0.104±0.001 0.073±0.005 0.068±0.001 0.138±0.006 0.340±0.055

Table 8: CRPS of Lag-Llama on the 6/20 datasets in the pretraining corpus, compared to supervised baselines trained solely on the
respective datasets. Lower is better. A mean or standard deviation of 0.0000 signifies that the first non-zero digit is beyond 4 decimal
places.

Model | CPU Limit | CPU Usage | Function Delay | Instances | Memory Limit | Memory Usage

AUTOARIMA 0.2245±0.0000 0.0814±0.0000 0.0936±0.0000 0.0121±0.0000 0.2024±0.0000 0.0326±0.0000


AUTO ETS 0.0632±0.0000 0.0806±0.0000 NAN ± NAN 0.0128±0.0000 0.0632±0.0000 0.0664±0.0000
C ROSTON SBA 0.0278±0.0000 0.0826±0.0000 0.0756±0.0000 0.0318±0.0000 0.0278±0.0000 0.0346±0.0000
D EEPAR 0.0004±0.0001 0.1034±0.0016 0.1097±0.0039 0.0179±0.0054 0.0004±0.0000 0.0147±0.0016
DYNAMIC O PTIMIZE 0.0012±0.0000 0.0813±0.0000 0.0381±0.0000 0.0140±0.0000 0.0010±0.0000 0.0667±0.0000
NPTS 0.0001±0.0001 0.1010±0.0004 0.0808±0.0008 0.0158±0.0002 0.0001±0.0001 0.0164±0.0002
PATCH TST 0.0023±0.0005 0.0805±0.0026 0.0571±0.0000 0.0104±0.0021 0.0042±0.0012 0.0172±0.0042
T EMPORAL F USION T 0.0001±0.0001 0.0830±0.0062 0.0552±0.0030 0.0057±0.0010 0.0000±0.0000 0.0113±0.0013
NBEATS 0.0001±0.0000 0.0972±0.0018 0.0502±0.0030 0.0086±0.0012 0.0000±0.0000 0.0121±0.0009
OFA 0.0004±0.0003 0.1209±0.0082 0.1249±0.0170 0.0235±0.0019 0.0000±0.0000 0.0137±0.0012
I NFORMER 0.0001±0.0000 0.0986±0.0040 0.0843±0.0143 0.0164±0.0000 0.0000±0.0000 0.0110±0.0004
AUTO F ORMER 0.0392±0.0040 0.1040±0.0031 0.1652±0.0328 0.1311±0.0600 0.1489±0.0334 0.1301±0.1235
ETSF ORMER 0.0021±0.0015 0.1295±0.0061 0.2066±0.0774 0.5406±0.4268 1.2181±1.2744 0.0605±0.0120
L AG LL AMA 0.0001±0.0000 0.0897±0.0013 0.0590±0.0000 0.0062±0.0010 0.0000±0.0000 0.0127±0.0007
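For reference, the CRPS values reported in Tables 6–8 can be estimated from forecast samples. The sketch below shows the standard sample-based estimator for a single observation (Matheson & Winkler, 1976); the aggregate numbers in the tables additionally average and normalize over series and time steps, which is not shown here.

import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    # CRPS(F, y) ~= E|X - y| - 0.5 * E|X - X'| for X, X' drawn from the forecast F.
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# Example: 200 forecast samples scored against an observed value of 1.3.
rng = np.random.default_rng(0)
print(crps_from_samples(rng.normal(loc=1.0, scale=0.5, size=200), 1.3))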


Figure 13: A neural scaling law fit to the validation loss (negative log-likelihood) of our foundation model, averaged across 3 seeds. "fit" denotes points from the validation curve used for constructing the scaling law; "unseen" denotes points of the validation curve that are predicted with the constructed scaling law. We use a 60/20/20 train/val/test split to fit our scaling law.


Table 9: Average Rank Across all Pre-Training Datasets. Lower is better.

Model | Average Rank
ETSFormer | 10.900
AutoETS | 10.200
CrostonSBA | 10.000
OFA | 9.850
AutoFormer | 9.550
NPTS | 8.350
AutoARIMA | 8.333
DynamicOptimize | 8.300
Informer | 6.025
DeepAR | 5.125
PatchTST | 4.700
Lag-Llama | 4.625
NBEATS | 4.600
TemporalFusionT | 3.875
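The average ranks in Table 9 follow from ranking the models by CRPS within each dataset and then averaging the ranks across datasets. A small pandas sketch with an illustrative CRPS matrix (not the paper's numbers) is shown below.

import pandas as pd

# Rows: models, columns: datasets, values: CRPS (illustrative numbers only).
crps = pd.DataFrame(
    {"dataset_a": [0.10, 0.12, 0.08], "dataset_b": [0.30, 0.25, 0.27]},
    index=["ModelX", "ModelY", "ModelZ"],
)

# Rank within each dataset (1 = best, i.e. lowest CRPS), then average across datasets.
average_rank = crps.rank(axis=0, ascending=True).mean(axis=1).sort_values()
print(average_rank)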
