
Submitted to Journal of Machine Learning Research

Ensemble Modeling for Time Series Forecasting: an Adaptive Robust Optimization Approach

Dimitris Bertsimas dbertsim@mit.edu


Sloan School of Management and Operations Research Center
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Léonard Boussioux leobix@mit.edu (corresponding author)


Operations Research Center
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Abstract

Accurate time series forecasting is critical for a wide range of problems with temporal
data. Ensemble modeling is a well-established technique for leveraging multiple predictive
models to increase accuracy and robustness, as the performance of a single predictor can be
highly variable due to shifts in the underlying data distribution. This paper proposes a new
methodology for building robust ensembles of time series forecasting models. Our approach
utilizes Adaptive Robust Optimization (ARO) to construct a linear regression ensemble in
which the models’ weights can adapt over time. We demonstrate the effectiveness of our
method through a series of synthetic experiments and real-world applications, including
air pollution management, energy consumption forecasting, and tropical cyclone intensity
forecasting. Our results show that our adaptive ensembles outperform the best ensemble
member in hindsight by 16-26% in root mean square error and 14-28% in conditional value
at risk and improve over competitive ensemble techniques.

Keywords: time series forecasting, ensemble modeling, regression, robust and adaptive
optimization, sustainability

1. Introduction

Time series data capture sequential measurements and observations indexed by timestamps.
The unique characteristics of time series data, in which observations have a chronological
order, often make their analysis challenging. In particular, predicting future target values
through time series forecasting is a critical research topic in many real-world applications
with a temporal component, such as healthcare operations, finance, energy, climate, and

business. Organizations and professionals often use forecasting to inform their decision-making, anticipate uncertain scenarios, and take proactive measures.

Many popular forecasting methods exist, including auto-regressive processes (Hamilton, 2020; Hyndman and Athanasopoulos, 2021), exponential smoothing approaches (Gardner Jr., 1985), and neural networks (Salinas et al., 2020; Lara-Benítez et al., 2021). Another common approach converts a time series into a non-temporal format by concatenating several time steps of data so that classic machine learning models, such as gradient-boosted trees, can be applied (Makridakis et al., 2022). However, no forecasting method consistently outperforms the others, as modeling assumptions may have to change over time to account for the often complex mechanisms behind the time series generation (Chatfield, 2000). As a result, the accuracy of each forecasting method can vary significantly across time depending on model drift (Gama et al., 2014).

Ensemble modeling is a well-established technique to leverage the strengths and limitations of multiple models and benefit from their diversity (Dietterich, 2000; Brown and Kuncheva, 2010; Oliveira and Torgo, 2015). The principle is to combine the predictions of the available forecasting models to obtain a more accurate, stable, and robust predictor. This approach has gained widespread attention with a variety of methods, including the concepts of stacking (Wolpert, 1992), bagging (Breiman, 1996b), and boosting (Schapire, 1990; Breiman, 1996a; Chen and Guestrin, 2016).

When applied to time series forecasting, ensemble modeling can become a complex task,
as the temporal aspect of the data adds difficulty in determining which model is most
suitable at any given point in time (Oliveira and Torgo, 2015; Cerqueira et al., 2020).
A common approach is to weigh the ensemble members based on their performance in
recent or historical observations, assuming the data distribution will stay similar and models
behave consistently. Other approaches include meta-learning (Prudêncio and Ludermir,
2004; Gastinger et al., 2021) or regret minimization, for example, under the multi-armed
bandit setting (Cesa-Bianchi and Lugosi, 2006).

However, these methods do not necessarily enforce the robustness of the ensembles, which
can lead to critical prediction errors. In this study, we propose to explore ensemble modeling
for time series forecasting from a different perspective, using robust optimization. We
present a novel optimization scheme that dynamically selects the weights of an ensemble of
forecasts at each time step, taking into account the uncertainty and errors of the underlying
forecasts. Our approach is based on multi-stage robust optimization, where the weights are
optimized variables determined at each time step in response to newly revealed information
about the uncertain parameters.

Traditional optimization assumes that all optimization variables are “here and now” decisions, i.e., we have to determine their values now and cannot wait for more information on the uncertain parameters. However, in multi-stage (dynamic) decision problems, we can introduce “wait and see” variables that can be determined after the uncertain parameter values have been revealed (Bertsimas and den Hertog, 2022).


Although this modeling process is challenging to implement in practice, it offers powerful guarantees for robust dynamic decision-making under uncertainty.

The Adaptive Robust Optimization (ARO) framework is a methodology to address multi-stage problems by adjusting wait-and-see variables as a function of uncertain parameters. By utilizing ARO, we aim to demonstrate the viability of creating a new branch of ensemble methods for temporal tasks, leveraging the strengths of adaptive robust methods. The results of our work showcase the exciting potential of ARO for applications in the field of machine learning. Our contributions include:

1. A novel ensemble method that leverages the ARO framework to dynamically adjust
the weights of underlying models in real time, resulting in a robust and accurate
forecasting tool. Our approach, based on linear regression ensembles, is detailed in
Section 3.

2. An analysis of the robustness guarantees of our method, accompanied by equivalent formulations for easy and tractable implementation, as outlined in Section 3.4.

3. A comprehensive study of the impact of various hyperparameters on our adaptive ensemble method, as demonstrated through synthetic experiments in Section 4.

4. Empirical validation of our proposed ensemble method, showcasing its superiority over existing ensemble methods across multiple real-world forecasting applications, including wind speed forecasting for air pollution management, energy consumption forecasting, and tropical cyclone intensity forecasting (see Figure 1). Our method outperforms the best ensemble member in hindsight by 16-26% in terms of root mean square error and 14-28% in conditional value at risk with a 15% threshold, as demonstrated in Section 5.

2. Background

This section briefly reviews standard ensemble methods for time series forecasting before
introducing the field of robust and adaptive optimization.

2.1 Ensemble Methods for Time Series Forecasting

Linear Ensembles After the seminal work of Bates and Granger (1969), several combination methodologies were added to the forecaster toolbox (Clemen, 1989; Zou and Yang, 2004). In particular, weighted linear combinations of different ensemble members became popular due to their straightforward implementation for real-world deployment. Typical linear combination techniques include the simple average, trimmed average, winsorized average, or median of the ensemble members' forecasts. Other methodologies can weigh

[Figure 1 appears here: a schematic in which base models 1 through m each produce a prediction from the data, and an ensemble model combines them into a final prediction; the panels depict air pollution management, energy use from appliances, and hurricane intensity forecasting.]

Figure 1: Schematic of the principle of ensemble modeling. Several base learners make predictions from the original data and are combined into an ensemble. We represent the three real-world data applications included in this paper.

the models based on their past errors, respective performance, or variance. For instance,
one can train a regularized linear regression such as Ridge (Hoerl and Kennard, 1970) or
LASSO (Tibshirani, 1996) to ensemble the different forecasts. See Adhikari and Agrawal
(2012) for a thorough review of linear ensemble methods.

Model Drift In general, weighted ensembles are often designed and trained on historical observations and then deployed as fixed. While this is convenient for deployment purposes, this static setting can be limiting when there is model drift, i.e., potential degradation of a model's predictive power due to changes in the data. In this case, the ensemble members' performance can vary across time or different events due to “data drift” or “concept drift”. Data drift refers to the model drift scenario in which the properties of the independent variables' distribution change, for example, due to seasonality, a shift in consumer behaviors or company strategies, or unexpected events. On the other hand, concept drift corresponds to the scenario in which the properties of the dependent variable change, which can happen because of changes in the definition of the target value, the annotation methodology, or the target sensor.

Dynamic Ensemblers Another common approach to ensemble forecasting consists of dynamically changing the weights of the different models in response to model drift (Kuncheva, 2004; Gama et al., 2014). In particular, multi-armed bandit settings have been widely investigated (Cesa-Bianchi and Lugosi, 2006). Numerous methods with different assumptions and contexts exist to track the losses of the ensemble members and provide regret guarantees with respect to the best model of the ensemble: for instance, the Exp3, Upper-Confidence Bound, and Online Passive-Aggressive (Crammer et al., 2006) algorithms (see Lattimore and Szepesvári (2020) for a complete review).


These algorithms dynamically combine the forecasting models at each time step by considering the problem of minimizing the multi-armed bandit regret against the best ensemble member.

2.2 Robust Optimization

Robust Optimization (RO) seeks to immunize models and problem formulations against adversarial perturbations in the data by introducing an uncertainty set that captures deterministic assumptions about these perturbations. The objective is to find solutions that remain good or feasible under this level of uncertainty; such solutions are called robust against errors of the magnitude and type chosen by the practitioner.

RO has gained substantial traction in recent years due to the guarantees provided by properly designed formulations and the progress of techniques and software to solve min-max adversarial optimization formulations (Bertsimas and den Hertog, 2022).

Formalism We introduce some brief formalism to make the RO framework explicit. We define a decision vector x in a compact set of possible decisions X and a vector of parameters z ∈ ℝ^{n_z}, n_z ≥ 0, representing the information available in the problem. Our optimization problem aims to minimize some objective function f(z, x). We may also specify a set of constraints f_i(z, x) ≤ 0, i ∈ [p], where we assume each f_i(z, ·) is convex in x. Overall, we are interested in solving problems in this initial form:

min_{x∈X}  f(z, x)
s.t.  f_i(z, x) ≤ 0,  ∀i ∈ [p].

When z is readily available, we can solve this problem with classical optimization methods. However, z is uncertain and rarely explicitly known in real-world applications. The robust optimization approach consists of representing the set of possible values of z with an uncertainty set Z and solving for the decision x such that the constraints f_i are always satisfied. We then optimize f for the worst case, and the problem becomes:

min_{x∈X} max_{z∈Z}  f(z, x)
s.t.  f_i(z, x) ≤ 0,  ∀i ∈ [p], ∀z ∈ Z.

The adversarial nature of such a problem can make it very difficult to solve to optimality.
A key aspect of RO is to derive an equivalent reformulation of a robust problem with a
computationally tractable form.
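To make this concrete, here is a standard worked example of such a tractable reformulation (our illustration, not taken from this paper) for a robust linear constraint under a box uncertainty set:

```latex
% Robust linear constraint with box uncertainty (standard textbook example):
% enforce a^\top x \le b for every a = \bar{a} + Pz with \|z\|_\infty \le \rho.
% Since \max_{\|z\|_\infty \le \rho} z^\top (P^\top x) = \rho \|P^\top x\|_1
% (duality of the \ell_\infty and \ell_1 norms), the semi-infinite constraint
\[
(\bar{a} + Pz)^\top x \;\le\; b \qquad \forall z : \|z\|_\infty \le \rho
\]
% is equivalent to the single tractable constraint
\[
\bar{a}^\top x + \rho \,\| P^\top x \|_1 \;\le\; b .
\]
```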

2.3 Adaptive Robust Optimization

Adaptive robust optimization (ARO) expands the RO framework by separating the decision x into multiple stages. A multi-stage problem contains adaptive “wait-and-see” decisions in addition to urgent “here-and-now” decisions. With ARO, once the “here-and-now” decisions are made, we consider that some of the uncertain parameters become known before the “wait-and-see” decisions are determined.

This multi-stage setting encompasses many real-world scenarios, such as lot-sizing on a network, product inventory systems, unit commitment, and factory location problems, e.g., (Bertsimas et al., 2013; Bertsimas and Georghiou, 2015).

Formally, we denote the adaptive decisions as y(z) since they are selected only after some of the uncertain parameters z may be revealed. An ARO problem is typically written as:

min_{x∈X, y(z)} max_{z∈Z}  f(z, x, y(z))
s.t.  f_i(z, x, y(z)) ≤ 0,  ∀i ∈ [p].

Note that expressing y as a function of z allows the practitioner to choose the function y(z) before learning z; such a function is called a decision rule.

In general, problems that contain such constraints with decision rules are NP-hard (Ben-Tal et al., 2004). Therefore, to make them tractable, we may restrict y(z) to a given class of functions, even though this restriction can be sub-optimal. One such class is the affine decision rule, which conveniently expresses y(z) through an affine relationship with z: y(z) = x_0 + V · z, where the coefficients x_0 and V are to be determined and become “here-and-now” decision variables. Using decision rules is one of ARO's core ideas and underpins this paper's methodology.

3. The Problem: a Robust Linear Regression for Time Series Forecasting

We aim to formulate an adaptive version of a robust linear regression problem, such as Ridge
or LASSO, in ensemble modeling for time series forecasting. We show how to develop an
adaptive robust formulation that leverages the temporal aspect of the data and guarantees
specific protection against model drift.

3.1 Linear Regression for a Time Series Ensemble

Define a sample X_t = [x_t^1, . . . , x_t^m] ∈ ℝ^m as the forecasts made by m individual models at time t for the same fixed lead time. We call these m models ensemble members. We consider that we are given a historical time series of ensemble members' predictions (X, y) ∈ ℝ^{T×m} × ℝ^T, consisting of T regularly spaced samples X_t ∈ ℝ^m with corresponding ground-truth values y_t ∈ ℝ.

Since we are interested in a dynamic linear combination of the forecasts, we associate all ensemble member predictions with time-varying coefficients β_t = [β_t^1, . . . , β_t^m] ∈ ℝ^m. For compactness, we also define the vector of coefficients:

β := [β_1 | . . . | β_T] = [β_1^1, . . . , β_1^m, . . . , β_T^1, . . . , β_T^m] ∈ Ω ⊂ ℝ^{T·m},

where Ω is a user-designed set of constraints on the coefficients, and | denotes the concatenation operation.

To build an adaptive weighted ensemble that approximates the ground-truth values, we are interested in minimizing problems similar to the following ordinary least squares problem:

min_{β∈Ω} √( Σ_{t=1}^T (y_t − X_t^⊤ β_t)² ),    (1)

or the least absolute deviation problem written as follows:

min_{β∈Ω} Σ_{t=1}^T |y_t − X_t^⊤ β_t|.    (2)

Before proceeding, we make several remarks about the settings in this paper:

• We allow β_t to vary over time, as we want more flexibility to capture the fluctuations in the forecast skills. We define potential constraints on β ∈ Ω later.

• We consider complete observation of the forecasts and targets for a history of T time steps. We assume access to historical data that can serve as training and validation sets.

• We assume that the time steps are regularly spaced. However, the lead time for prediction is not necessarily one time step. For example, we can consider 6 hours between each time step and make forecasts for 24 hours later, i.e., for a given time step t, y_t is the ground-truth value 24 hours ahead (t + 4), and X_t contains all the values forecasted for t + 4 by each ensemble member.

• After optimizing the above weights on a training set, the goal is to use the learned rules about β to make predictions in the “future”, i.e., at times T + 1, T + 2, . . . .

For convenience and further developments, we rewrite problems (1) and (2) in a general compact form in which the temporal aspect does not appear, although it is still present in the underlying variables and parameters:

min_{β∈Ω} ‖y − X̃β‖,    (3)

where ‖·‖ is a given norm, and X̃ := diag(X_1^⊤, X_2^⊤, . . . , X_T^⊤) ∈ ℝ^{T×T·m} is the block-diagonal matrix with the row vectors X_t^⊤ on its diagonal.
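For concreteness, here is a minimal Julia sketch (our own illustration, not code from the paper) of how the block-diagonal matrix X̃ can be assembled from the T×m forecast matrix X:

```julia
using SparseArrays

# Build the T x (T*m) block-diagonal matrix X̃ whose t-th row carries
# the m forecasts X[t, :] in columns (t-1)*m+1 : t*m, so that
# (X̃ * β)[t] = X_t' * β_t for the stacked coefficient vector β.
function build_Xtilde(X::AbstractMatrix)
    T, m = size(X)
    rows = repeat(1:T, inner=m)        # row index t repeated for each member
    cols = collect(1:T*m)              # consecutive column blocks
    vals = vec(permutedims(X))         # X[1,:], X[2,:], ..., X[T,:]
    return sparse(rows, cols, vals, T, T * m)
end
```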

3.2 Robustification

The nominal formulation (3) is akin to the popular Sample Average Approximation (SAA) (Shapiro, 2003), which minimizes the empirical error on the observed historical data. However, such a formulation is prone to overfitting and poor out-of-sample performance (Smith and Winkler, 2006). This overfitting phenomenon typically originates in corruption of the data, such as noise or outliers, or simply from the fact that the available data is finite, so the empirical error does not approximate the out-of-sample error precisely (Bennouna and Van Parys, 2022). Hence, a natural approach to avoid overfitting is to robustify the nominal formulation (3) using an uncertainty set of possible perturbations ∆ ∈ U of the forecast matrix X̃. We therefore consider the robust formulation:

min_{β∈Ω} max_{∆∈U} ‖y − (X̃ + ∆)β(∆)‖,    (4)

where U is an uncertainty set that characterizes the user's belief about perturbations of the historical forecasts X at each time step t. Here, as explained in Section 2.3, the decision β becomes a variable β(∆) that adapts to the perturbation.

With a proper choice of uncertainty set U, the formulation (4) accounts for “adversarial noise” in the forecast matrix X̃ and protects the dynamic linear regression problem from structural uncertainty in the data.

Intuitively, this uncertainty matrix encapsulates some of the errors that all forecasting models naturally make at each time step due to flaws in the modeling process or the data.


Example 1 (Time-Varying Least Absolute Deviations) Using the ℓ1 norm and given uncertainty sets U_1, . . . , U_T associated with each time step, the compact problem (4) rewrites more specifically as:

min_{β_1,...,β_T} max_{∆_1∈U_1,...,∆_T∈U_T} Σ_{t=1}^T |y_t − (X_t + ∆_t)^⊤ β_t(∆_1, . . . , ∆_T)|.    (5)

In the rest of the paper, we consider that β_t only depends on the previous uncertainties ∆_1, . . . , ∆_{t−1} that have been revealed by time t. Therefore, the example problem (5) above becomes:

min_{β_1,...,β_T} max_{∆_1∈U_1,...,∆_T∈U_T} Σ_{t=1}^T |y_t − (X_t + ∆_t)^⊤ β_t(∆_1, . . . , ∆_{t−1})|.

With this consideration, we now proceed in the general case to derive decision rules for β_t(∆_1, . . . , ∆_{t−1}).

3.3 Adaptive Robust Formulation

As explained in Section 2.3, the multi-stage problem (4) can be cast in a more amenable formulation using an affine decision rule for β_t with respect to ∆_1, . . . , ∆_{t−1}.

We first fix a window size of information τ ≥ 1 such that:

β_t(∆_1, . . . , ∆_{t−1}) := β_t(∆_{t−τ}, . . . , ∆_{t−1}).

This fixed window of past information makes the problem more tractable and practical by restricting the parameter size and ensuring every β_t depends on the same number of historical time steps. Note that for the edge cases β_{τ−1}, . . . , β_1, we define without loss of generality some extra values ∆_0, . . . , ∆_{−τ+1}.

We next consider that at time t, the uncertainties (∆_s)_{1≤s<t} have been observed as (∆̂_s)_{1≤s<t}, and we use the proxy ∆̂_s ∼ X_s − y_s e, i.e., we take the forecast uncertainties to correspond to the previous forecast errors.

We now model β_t with an affine decision rule depending on the observed uncertainties:

β_t := β_0 + V_0 · Z_t,

where we define the variables β_0 ∈ ℝ^m and V_0 ∈ ℝ^{m×m·τ}, and the vector of parameters:

Z_t := [∆̂_{t−τ} | . . . | ∆̂_{t−1}] = [X_{t−τ} − y_{t−τ} e | . . . | X_{t−1} − y_{t−1} e] ∈ ℝ^{m·τ},

where | indicates the concatenation operation and e is the vector of ones.

Overall, the proposed formulation adapts the coefficients at each time step, following an
affine decision rule that can leverage the recent forecast error history.
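As an illustration, here is a minimal Julia sketch (ours, with hypothetical helper names) of how the parameter vector Z_t and the adaptive coefficients can be assembled from past forecasts and targets:

```julia
# Build Z_t = [X_{t-τ} - y_{t-τ}e | ... | X_{t-1} - y_{t-1}e] ∈ R^{m·τ}
# from the T x m forecast matrix X and the target vector y.
# Assumes t > τ so that all required past errors are observed.
function build_Zt(X::AbstractMatrix, y::AbstractVector, t::Int, τ::Int)
    m = size(X, 2)
    Zt = Vector{Float64}(undef, m * τ)
    for (j, s) in enumerate(t-τ:t-1)
        Zt[(j-1)*m+1 : j*m] = X[s, :] .- y[s]   # forecast errors at step s
    end
    return Zt
end

# The coefficients then follow the affine decision rule:
# β_t = β0 + V0 * Z_t, with β0 ∈ R^m and V0 ∈ R^{m × m·τ}.
adaptive_coefs(β0, V0, Zt) = β0 + V0 * Zt
```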

Example 1 (follow-up) With this modeling and decision rule, the previous adaptive robust formulation (5), using the ℓ1 norm and given uncertainty sets U_1, . . . , U_T, becomes:

min_{β_0,V_0} max_{∆_1∈U_1,...,∆_T∈U_T} Σ_{t=1}^T |y_t − (X_t + ∆_t)^⊤ (β_0 + V_0 · Z_t)|.    (6)

Compact form of the Adaptive Robust Linear Ensemble The rest of the paper now focuses on this general and more compact formulation:

min_{β∈Ω_adapt} max_{∆∈U} ‖y − (X̃ + ∆)β‖,    (7)

where Ω_adapt is defined as the set of all values of β that satisfy the affine decision rule:

Ω_adapt := Ω^{X,y,τ}_adapt := {β : β_t = β_0 + V_0 · Z_t ∀t, β_0 ∈ ℝ^m, V_0 ∈ ℝ^{m×m·τ}}, where
Z_t := [X_{t−τ} − y_{t−τ} e | . . . | X_{t−1} − y_{t−1} e] ∈ ℝ^{m·τ}.

Notice that Ω_adapt is parameterized by X, y, and τ, but we omit this dependency in the notation.

3.4 Equivalence to a Regularized Problem

Even with the affine decision rule, it is a priori unclear how to conveniently solve the compact robust optimization problem (7) due to its adversarial nature. Therefore, we now relate it to equivalent regularized regression problems for several relevant uncertainty sets.

Indeed, in some cases, a robust formulation that specifies the adversarial perturbations the model is protected against admits an equivalent regularized problem (Bertsimas and Copenhaver, 2018). This is convenient for the practitioner, as a min-max formulation can be converted into a minimization problem with a regularizer on the variables.


Let us consider a natural choice of uncertainty set:

U = {∆ ∈ ℝ^{T×T·m} : ‖∆‖ ≤ λ},

where ‖·‖ is some matrix norm or seminorm, and λ > 0. We assume λ is fixed for the remainder of the paper.

Example 2: Adaptive Ridge The choice of the 2-Frobenius norm ‖∆‖ := ‖∆‖_{F2} = √( Σ_{i=1}^T Σ_{j=1}^{T·m} ∆_{ij}² ) can be interpreted as a Euclidean ball centered at the observed data X with radius λ, modeling a global perturbation in the data.

In this case, using the ℓ2 norm and U_{F2} := {∆ ∈ ℝ^{T×T·m} : ‖∆‖_{F2} ≤ λ}, we can show that the problem:

min_{β∈Ω_adapt} max_{∆∈U_{F2}} ‖y − (X̃ + ∆)β‖_2    (8)

is equivalent to the following problem, which we name Adaptive Ridge:

min_{β∈Ω_adapt} ‖y − X̃β‖_2 + λ‖β‖_2.    (9)

We now state the more general equivalence properties into which this example falls.

Definition 1 For two norms g, h, we define the induced-norm uncertainty set:

U_{(h,g)} := {∆ ∈ ℝ^{T×T·m} : ‖∆‖_{(h,g)} ≤ λ},

where ‖∆‖_{(h,g)} = max_{β≠0} g(∆β) / h(β).

Definition 2 For p ∈ [1, +∞], we define the Frobenius uncertainty set:

U_{Fp} := {∆ ∈ ℝ^{T×T·m} : ‖∆‖_{Fp} ≤ λ},

where the Frobenius norm corresponds to

‖∆‖_{Fp} = ( Σ_{i=1}^T Σ_{j=1}^{T·m} |∆_{ij}|^p )^{1/p}.

Now, because we expressed our adaptive regression problems in a compact form, we can apply well-established equivalence results between robust min-max formulations and regularized regressions.

Theorem 1 For p, q ∈ [1, ∞],

min_{β∈Ω_adapt} max_{∆∈U_{(q,p)}} ‖y − (X̃ + ∆)β‖_p = min_{β∈Ω_adapt} ‖y − X̃β‖_p + λ‖β‖_q.

In particular, for p = q = 2, we recover what we call “Adaptive Ridge” in this paper as a robustification; likewise, for p = 2 and q = 1, we recover what we call the Adaptive Lasso.

Proof The result follows from the lemma below:

Lemma 1 (Bertsimas and Copenhaver, 2018) If g : ℝ^m → ℝ is a seminorm which is not identically zero and h : ℝ^n → ℝ is a norm, then for any z ∈ ℝ^m and β ∈ ℝ^n,

max_{∆∈U_{(h,g)}} g(z + ∆β) = g(z) + λh(β),

where U_{(h,g)} = {∆ : ‖∆‖_{(h,g)} ≤ λ}.

We reproduce and adapt the proof of the lemma for completeness.

The triangle inequality directly gives one side of the equality:

g(z + ∆β) ≤ g(z) + g(∆β) ≤ g(z) + λh(β) for any ∆ ∈ U := U_{(h,g)}.

We next show that there exists some ∆ ∈ U so that g(z + ∆β) = g(z) + λh(β). Let v ∈ ℝ^n with v ∈ argmax_{h*(v)=1} β^⊤v, where h* is the dual norm of h. Note in particular that β^⊤v = h(β) by the definition of the dual norm h*. For now, suppose that g(z) ≠ 0. Define the rank-one matrix ∆̂ := (λ/g(z)) zv^⊤. We observe that

g(z + ∆̂β) = g( z + (λh(β)/g(z)) z ) = g(z) · (g(z) + λh(β))/g(z) = g(z) + λh(β).

We next show that ∆̂ ∈ U. We remark that for any x ∈ ℝ^n,

g(∆̂x) = g( (λ v^⊤x / g(z)) z ) = λ|v^⊤x| ≤ λh(x)h*(v) = λh(x),

where the final inequality follows by the definition of the dual norm. Hence ∆̂ ∈ U, as desired.

We now consider the case when g(z) = 0.

Since g is not identically zero, there exists some u with g(u) > 0; by homogeneity of g, we can take u ∈ ℝ^m so that g(u) = 1. Let v be as before and define ∆̂ := λuv^⊤. We observe that

g(z + ∆̂β) = g(z + λ(v^⊤β)u) ≤ g(z) + λ|v^⊤β| g(u) = 0 + λ|v^⊤β| = λh(β).

Now, by the reverse triangle inequality,

g(z + ∆̂β) ≥ g(∆̂β) − g(z) = g(∆̂β) = λh(β),

and therefore g(z + ∆̂β) = λh(β) = g(z) + λh(β). The proof that ∆̂ ∈ U is identical to the case when g(z) ≠ 0. This completes the proof of the lemma.

Theorem 1 follows by applying Lemma 1 with the norm p as g and the norm q as h and passing the equality through the min.

We also propose the following equivalence result with another norm:

Corollary 1 For any p ∈ [1, ∞]:

min_{β∈Ω_adapt} max_{∆∈U_{Fp}} ‖y − (X̃ + ∆)β‖_p = min_{β∈Ω_adapt} ‖y − X̃β‖_p + λ‖β‖_{p*},

where 1/p + 1/p* = 1 and U_{Fp} := {∆ : ‖∆‖_{Fp} ≤ λ}.

Proof We obtain the result as a corollary of the theorem below:

Theorem 2 (Xu et al., 2008) Let p ≥ 1. Define U′ := {(δ_1, . . . , δ_m) : f_j(‖δ_1‖_p, . . . , ‖δ_m‖_p) ≤ 0, j = 1, . . . , k}, where the f_j(·) are convex functions and k, m ≥ 1.

If the set Z := {z ∈ ℝ^m : f_j(z) ≤ 0, j = 1, . . . , k; z ≥ 0} has a non-empty relative interior, then the robust regression problem

min_{β∈ℝ^m} max_{∆∈U′} ‖y − (X + ∆)β‖_p

is equivalent to the following regularized regression problem:

min_{λ∈ℝ^k_+, κ∈ℝ^m_+, β∈ℝ^m} { ‖y − Xβ‖_p + v(λ, κ, β) },

where v(λ, κ, β) := max_{c∈ℝ^m} [ (κ + |β|)^⊤ c − Σ_{j=1}^k λ_j f_j(c) ].

By taking

U′ = U_{Fp} = {(δ_1, . . . , δ_m) : ‖( ‖δ_1‖_p, . . . , ‖δ_m‖_p )‖_p ≤ λ}

and constraining β ∈ Ω_adapt, we recover our Corollary 1.

3.5 Predictions in Real Time

The adaptive ensemble method requires training and validation before deployment. We recommend separating the available forecast data into training and validation sets to select the two hyperparameters: the regularization factor λ and the window size τ of past data to include in the affine decision rule. Note that there must be no sample overlap between the data used to train the ensemble members and the data used to train the ensemble model, to avoid biases from potential overfitting and to ensure the ensemble can determine the models' behaviors on held-out data.

Once the Adaptive Robust Ensemble problem (7) has been solved to optimality using the equivalent formulations, one can make predictions at time T + k using the affine decision rule:

ŷ_{T+k} = X_{T+k}^⊤ β_{T+k} = X_{T+k}^⊤ (β_0* + V_0* Z_{T+k}), ∀k ≥ 1.
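A minimal Julia sketch of this prediction step (our illustration, reusing the hypothetical build_Zt helper from Section 3.3):

```julia
using LinearAlgebra

# One-step prediction at time t = T + k with the learned rule (β0*, V0*).
# X holds all forecasts observed so far (including row t); only rows
# t-τ : t-1 of the forecast errors enter Z_t, so y is needed up to t - 1.
function predict_adaptive(X, y, t, τ, β0, V0)
    Zt = build_Zt(X, y, t, τ)      # past forecast errors, length m*τ
    βt = β0 + V0 * Zt              # affine decision rule
    return dot(X[t, :], βt)        # ensemble prediction X_t' * β_t
end
```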

Below, we summarize the overall training, validation, and testing mechanism of the adaptive
ridge formulation in Algorithm 1.

4. Synthetic Experiments

Our synthetic experiments aim to identify the conditions under which our adaptive ensemble method has an edge over competitive methods and to provide guidance and intuition on the hyperparameters to use.

4.1 General Setup

We focus our experiments and the rest of the paper on the Adaptive Ridge problem:

min_{β∈Ω_adapt} ‖y − X̃β‖_2 + λ‖β‖_2,    (10)

where we recall that Ω_adapt corresponds to the previous affine decision rules on each β_t:

Ω_adapt := {β : β_t = β_0 + V_0 · Z_t ∀t, β_0 ∈ ℝ^m, V_0 ∈ ℝ^{m×m·τ}},
Z_t := [X_{t−τ} − y_{t−τ} e | . . . | X_{t−1} − y_{t−1} e] ∈ ℝ^{m·τ}.

We investigate the dynamics of the different ensembles' performances with respect to the following:

• The number of ensemble members m available,

• The number of samples available for model training T_train,

• The amount of model drift in the ensemble members,

• The number of past time steps τ the models can use in the decision rule.


Algorithm 1 Adaptive Ridge Pipeline

Input:
1: X — ensemble members' forecast data
2: y — targets
3: Λ := [λ_1, . . . , λ_p] — set of regularization parameters to validate with grid search
4: W := [w_1, . . . , w_T] — set of window sizes of past data included in the adaptive rule to validate with grid search
5: train_size, val_size — percentage of data for the training and validation sets (the test set is the rest)
6: M — validation error metric function
7: 𝓜 — set of evaluation metrics for the test set

Output:
1: β_0, V_0 — parameters for the adaptive rule
2: results — results on the test set for metrics 𝓜

Algorithm:
1: X_train-val, X_test, y_train-val, y_test ← train_test_split(X, y, train_size, shuffle = false)
2: X_train, X_val, y_train, y_val ← train_test_split(X_train-val, y_train-val, val_size, shuffle = false)
3: β_0^{i,w_t}, V_0^{i,w_t} ← argmin_{β∈Ω_adapt^{w_t}} ‖y_train − X̃_train β‖_2 + λ_i ‖β‖_2, ∀i ∈ (1, p), ∀t ∈ (1, T)
   ▷ Find the adaptive rule coefficients for all hyperparameter combinations.
4: β_val^{i,w_t} ← get_adaptive_coefs(X_val, y_val, β_0^{i,w_t}, V_0^{i,w_t})
   ▷ Compute the adaptive coefficients on the validation set.
5: i*, t* ← argmin_{i∈(1,p), t∈(1,T)} M(y_val, X̃_val β_val^{i,w_t})
   ▷ Determine the best hyperparameter combination based on validation performance.
6: β_0*, V_0* ← argmin_{β∈Ω_adapt^{w_{t*}}} ‖y_train-val − X̃_train-val β‖_2 + λ_{i*} ‖β‖_2
   ▷ Retrain the best model on the training and validation data combined.
7: β* ← get_adaptive_coefs(X_test, y_test, β_0*, V_0*)
   ▷ Compute the adaptive coefficients on the test set.
8: results ← 𝓜(y_test, X̃_test β*)
   ▷ Compute the results on the test set.
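As a complement to Algorithm 1, here is a hedged JuMP sketch (ours, not the authors' released code) of the core training step, solving the Adaptive Ridge problem (10) as a second-order cone program; build_Zt_padded is a hypothetical helper that zero-pads the unavailable errors before t = 1 (the paper's extra ∆_0, . . . , ∆_{−τ+1}):

```julia
using JuMP, Gurobi, LinearAlgebra

# Zero-pad unavailable forecast errors before t = 1.
function build_Zt_padded(X, y, t, τ)
    m = size(X, 2)
    Zt = zeros(m * τ)
    for (j, s) in enumerate(t-τ:t-1)
        s >= 1 && (Zt[(j-1)*m+1 : j*m] = X[s, :] .- y[s])
    end
    return Zt
end

# Solve min_{β ∈ Ω_adapt} ||y - X̃β||₂ + λ||β||₂ with β_t = β0 + V0 * Z_t.
# X is T x m (forecasts), y has length T, τ is the window, λ the penalty.
function train_adaptive_ridge(X, y, τ, λ)
    T, m = size(X)
    model = Model(Gurobi.Optimizer)
    set_silent(model)
    @variable(model, β0[1:m])
    @variable(model, V0[1:m, 1:m*τ])
    @variable(model, r >= 0)   # epigraph of the residual norm
    @variable(model, s >= 0)   # epigraph of the coefficient norm
    βt = [β0 + V0 * build_Zt_padded(X, y, t, τ) for t in 1:T]
    res = [y[t] - dot(X[t, :], βt[t]) for t in 1:T]
    @constraint(model, [r; res] in SecondOrderCone())
    @constraint(model, [s; vcat(βt...)] in SecondOrderCone())
    @objective(model, Min, r + λ * s)
    optimize!(model)
    return value.(β0), value.(V0)
end
```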


4.1.1 Metrics

Overall, we reproduced each experiment 30 times and evaluated the different ensemble methods with the average and standard deviation of the following metrics:

• to evaluate accuracy: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE),

• to evaluate robustness: Conditional Value at Risk 5% (CVaR 5), Conditional Value at Risk 15% (CVaR 15).

In high-stakes machine learning applications, performing well both on average and when restricted to challenging scenarios is crucial. Therefore, we evaluated the Conditional Value at Risk CVaR_α (Rockafellar and Uryasev, 2000), which measures the expected loss once the α Value at Risk (VaR) breakpoint has been breached.

Consider T forecasting cases, where the ground truth at time step t is y_t and the ensemble method's prediction is denoted ŷ_t. The metrics are calculated as follows:

MAE(y, ŷ) := (1/T) Σ_{t=1}^T |y_t − ŷ_t|,

RMSE(y, ŷ) := √( (1/T) Σ_{t=1}^T (y_t − ŷ_t)² ),

MAPE(y, ŷ) := (100/T) Σ_{t=1}^T |(y_t − ŷ_t)/y_t|,

CVaR_α(y, ŷ) := min_τ  τ + (1/(αT)) Σ_{t=1}^T max(0, |y_t − ŷ_t| − τ).

Notice that CVaR_α in its discrete form is a simple optimization problem that we solve using Julia.
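For reference, a small Julia sketch of these metrics (ours; the CVaR uses the sort-based solution of the discrete minimization above, where a (1 − α)-quantile of the absolute errors is an optimal τ):

```julia
using Statistics

mae(y, ŷ)  = mean(abs.(y .- ŷ))
rmse(y, ŷ) = sqrt(mean((y .- ŷ) .^ 2))
mape(y, ŷ) = 100 * mean(abs.((y .- ŷ) ./ y))

# Discrete CVaR: the ⌈(1-α)T⌉-th order statistic of |errors| minimizes the
# objective, so CVaR_α is that quantile plus the average excess in the α-tail.
function cvar(y, ŷ, α)
    e = sort(abs.(y .- ŷ))
    T = length(e)
    τ = e[clamp(ceil(Int, (1 - α) * T), 1, T)]
    return τ + sum(max.(0.0, e .- τ)) / (α * T)
end
```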


4.1.2 Data Generation Process

Ground Truth For all experiments, we used the same ground-truth data: a univariate time series of 4,000 time steps. We generated a periodic signal with normally distributed noise:

y(t) = sin( (2π/500) · t ) + ε(t), ∀t ∈ [4000],
ε(t) ∼ N(0, 0.1).

Ensemble Members' Forecasts To simulate the availability of several forecasts provided by different predictive models at each time step, we randomly chose the bias and standard deviation of the ensemble members' errors from a given range.

Formally, at each time step t, we generated the values X̃_k(t) of each ensemble member k ∈ [m] as:

X̃_k(t) = y(t) + ε_k(t), ∀t ∈ [4000],
ε_k(t) ∼ N(b_k, σ_k),

where we sampled b_1, . . . , b_m and σ_1, . . . , σ_m independently beforehand as:

b_k ∼ Uniform(−0.5, 0.5),
σ_k ∼ Uniform(0, 0.5).

Since we repeated experiments 30 times across different seeds, we resampled different b_k, σ_k for each experiment.
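A compact Julia sketch of this generation process (ours, directly transcribing the distributions stated above):

```julia
using Random, Distributions

# Generate the ground truth y and m ensemble members' forecasts for T steps.
function generate_synthetic(T=4000, m=10; rng=Random.default_rng())
    y = [sin(2π / 500 * t) + rand(rng, Normal(0, 0.1)) for t in 1:T]
    b = rand(rng, Uniform(-0.5, 0.5), m)   # per-member error biases
    σ = rand(rng, Uniform(0.0, 0.5), m)    # per-member error std devs
    X = [y[t] + rand(rng, Normal(b[k], σ[k])) for t in 1:T, k in 1:m]
    return X, y
end
```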

Adding Drift to the Ensemble Members' Forecasts To test the capacity of the ensembles to perform well against drifting forecasts, we additionally simulate a temporal change in the error distributions of each ensemble member.

Therefore, at each time step t, our final forecast values X_k(t) for each ensemble member k ∈ [m] are defined as:

X_k(t) = X̃_k(t) + (t/4000) · drift_k(t), ∀t ∈ [4000],
drift_k(t) ∼ N(b′_k, σ′_k),

where we selected the error biases b′_1, . . . , b′_m and error standard deviations σ′_1, . . . , σ′_m beforehand as:

b′_k ∼ N(0, σ_drift),
σ′_k ∼ Uniform(0, s_drift).

We conducted experiments with different σ_drift, s_drift, ranging from 0 to 1. Again, since we repeated experiments 30 times across different seeds, we resampled b′_k, σ′_k for each experiment.

Training, Validation, Test Splits We split the data chronologically into training (50%
of the data, i.e., t ∈ [1, 2000]), validation (25% of the data, i.e., t ∈ [2001, 3000]), and test
(25% of the data, i.e., t ∈ [3001, 4000]) sets.

Table 9 in the Appendix summarizes all the hyperparameters involved in the synthetic data experiments.

4.1.3 Other Methods Benchmarked

Along with the Adaptive Ridge method, we evaluated the following ensembles on the same
data:

• Best Model in Hindsight: we determine in hindsight what was the best ensemble
member on the test data with respect to the MAPE and report its performance for all
metrics. Notice that in real-time, it is impossible to know which model would be the
best on the overall test set, which means the best model in hindsight is a competitive
benchmark.

• Ensemble Mean: consists of weighing each model equally, predicting the average of
all ensemble members at each time step.

• Exp3 (Cesa-Bianchi and Lugosi, 2006): under the multi-armed bandit setting, Exp3 weighs the different models to minimize the regret compared to the best model so far. The update rule (see the sketch after this list) is given by:

β_{t+1}^i = exp(−η_t · Regret_t^i) / Σ_{j=1}^m exp(−η_t · Regret_t^j), with

Regret_t^i = Σ_{s=t−t₀}^t (y_s − X_s^i)², ∀i ∈ [1, m], and

η_t = √( 8 log(m) / t₀ ),

where the window size t₀ considered to determine the regularized leader is tuned.

• Passive-Aggressive (Crammer et al., 2006): a well-known margin-based online learning algorithm that updates the weights of its linear model according to:

β_{t+1} = β_t + sign(y_t − X_t^⊤ β_t) τ_t X_t,  with  τ_t = max(0, |X_t^⊤ β_t − y_t| − ε) / ‖X_t‖_2²,

where ε is a margin parameter to be tuned.

• Ridge (Hoerl and Kennard, 1970): consists of learning the best static linear combination of ensemble members by solving a ridge problem on the forecasts X_t:

min_β Σ_{t=1}^T (y_t − X_t^⊤ β)² + λ‖β‖_2²,

which gives the closed-form solution β = (X^⊤X + λI)^{−1} X^⊤ y.
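As an illustration of the dynamic baselines, here is a minimal Julia sketch (ours) of the Exp3-style weight update described above:

```julia
# One Exp3-style weight update: exponentially down-weight each member by its
# squared-error "regret" over the last t0 steps (indices clamped to 1).
function exp3_weights(X, y, t, t0)
    m = size(X, 2)
    η = sqrt(8 * log(m) / t0)
    regret = [sum((y[s] - X[s, i])^2 for s in max(1, t - t0):t) for i in 1:m]
    w = exp.(-η .* regret)
    return w ./ sum(w)      # normalized ensemble weights β_{t+1}
end
```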

We summarize the different ensemble methods evaluated in Table 1 below.

Table 1: Summary of all ensemble methods compared.

Ensemble method | Time-varying weights | Update rule | Hyperparameters to tune
Best Model in Hindsight | No | N/A | N/A
Ensemble Mean | No | β_t = (1/m) e | N/A
Exp3 | Yes | β_{t+1}^i = exp(−η_t·Regret_t^i) / Σ_j exp(−η_t·Regret_t^j) | Window of past data used to compute regrets
Passive-Aggressive | Yes | β_{t+1} = β_t + sign(y_t − X_t^⊤β_t) τ_t X_t | Margin parameter ε used in τ_t
Ridge | No | β = (X^⊤X + λI)^{−1} X^⊤ y | Regularization factor λ
Adaptive Ridge | Yes | β_t = β_0 + V_0 · Z_t | Regularization factor λ, window of past data in Z_t

4.1.4 Validation Mechanism

For each experiment, we performed a grid search over [10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1] on the validation set to tune the regularization factor λ of the adaptive ridge formulation, λ_ridge for the ridge formulation, and the margin ε_PA for the Passive-Aggressive algorithm. For each seed, we selected the parameter that led to the lowest average MAE. Then, we retrained the models on the training and validation sets combined, using the tuned values, and evaluated the different metrics on the held-out test set.
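A hedged Julia sketch of this validation loop for the adaptive ridge (ours, reusing the hypothetical train_adaptive_ridge, predict_adaptive, and mae helpers sketched earlier):

```julia
# Grid search over λ: train on the training split, score MAE on validation.
function tune_lambda(Xtr, ytr, Xval, yval, τ; grid=[1e-4, 1e-3, 1e-2, 1e-1, 1.0])
    best_λ, best_err = first(grid), Inf
    for λ in grid
        β0, V0 = train_adaptive_ridge(Xtr, ytr, τ, λ)
        ŷ = [predict_adaptive(vcat(Xtr, Xval), vcat(ytr, yval),
                              size(Xtr, 1) + t, τ, β0, V0)
             for t in 1:size(Xval, 1)]
        err = mae(yval, ŷ)
        err < best_err && ((best_λ, best_err) = (λ, err))
    end
    return best_λ
end
```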

4.2 Software and Computational Resources

We wrote all code in Julia 1.6 (Bezanson et al., 2017), using the JuMP package (Dunning
et al., 2017) to write optimization functions and Gurobi (Gurobi Optimization, LLC, 2022)
as the solver. We performed each individual experiment reported in this paper using 4 Intel
Xeon Platinum 8260 CPU cores from the Supercloud cluster (Reuther et al., 2018).

4.3 Evaluation of the Number of Ensemble Members m

These experiments compare the performance of the different ensemble techniques with respect to the number of ensemble members available.

We fixed the drift parameters of the error distributions to σ_drift = 0.5, s_drift = 0.5, and the window of past time steps to τ = 5.

[Figure 2 appears here: five panels (MAE, RMSE, MAPE, CVaR 5%, CVaR 15%) plotting each metric against the number of ensemble members for Adaptive Ridge, Passive-Aggressive, Ridge, Exp3, Best Model in Hindsight, and Ensemble Mean.]

Figure 2: Average performance across 30 seeds of the different ensemble methods with respect to the number of ensemble members. We added in light color the corresponding one-standard-deviation intervals around the average.

The results with all metrics in Figure 2 show the same trend. We make the following
conclusions:

• The performance increases when more models can be used by the ensemble, which
makes sense since there are more opportunities to unveil the underlying ground truth
data by learning how the different models make their errors.
• There is a rapid increase in performance going from 3 ensemble members to 10. However, after 15 ensemble members, the performance is stable and does not improve anymore or becomes slightly worse. This suggests that a high number of forecasters does not necessarily translate into better performance. Instead, a higher number of models can lead to overfitting the training set.
• Adaptive ridge is consistently the best method and outperforms the classic ridge
formulation, which is static. The PA model also performs well. These methods
clearly outperform the mean baseline and show that a weighted linear combination
can significantly improve upon the best model of the ensemble.
• The Exp3 algorithm indeed achieves lower errors than the best model but does not
compete with the PA algorithm and the ridge formulations.

4.4 Evaluation of the Ensemble Members’ Drift

These experiments aim to determine how well methods perform when the ensemble members
can significantly drift across time.

We fixed the number of ensemble members to m = 10 and the window size of past forecasts
the adaptive part can use to τ = 5.

Gaussian Drift We test several values of σdrift = sdrift ranging from 0 to 1.

Figure 3 shows the following:

• When there is no or very little drift, ridge and adaptive ridge perform very similarly,
which is expected since the adaptive part is mainly useful to leverage temporal trends.
• When the ensemble members can drift more than 0.2, ridge starts performing worse
than the PA algorithm, which is again expected since the trained weights may differ
from the ensemble members’ performance trend on the test set.
• On the other hand, adaptive ridge maintains its lead over PA since it can leverage recent trends.
• The baseline methods quickly start to perform terribly with high drift. At the same
time, PA and adaptive ridge metrics maintain a seemingly sublinear relationship with
respect to the amount of drift possible.

Discrete Gaussian Drift We also tested the models with a different type of noise, where the Gaussian drift is switched on or off according to a Bernoulli variable.

For this experiment only, our final forecast values X_k(t) of each ensemble member k ∈ [m] are defined as:

X_k(t) = X̃_k(t) + b_k(t) · drift_k(t), ∀t ∈ [4000],
b_k(t) ∼ Bernoulli(p_drift),
drift_k(t) ∼ N(b′_k, σ′_k),

where we selected the error biases b′_1, . . . , b′_m and error standard deviations σ′_1, . . . , σ′_m beforehand as:

b′_k ∼ N(0, 0.5),
σ′_k ∼ Uniform(0, 0.5).

[Figure 3 appears here: five panels (MAE, RMSE, MAPE, CVaR 5%, CVaR 15%) plotting each metric against the amount of Gaussian drift for the six ensemble methods.]

Figure 3: Average performance across 30 seeds of the different ensemble methods with respect to the ensemble members' possible drift. We added in light color the corresponding one-standard-deviation intervals around the average.

We tested several values of the Bernoulli parameter p_drift ranging from 0 to 1. A value of 0 means no drift is added, and the ensemble members' errors follow their initial distribution. A value of 1 means the drift is always added, shifting the ensemble members' errors to a new distribution. Any value in between means the errors oscillate between two error distributions.
[Figure 4 appears here: five panels (MAE, RMSE, MAPE, CVaR 5%, CVaR 15%) plotting each metric against the Bernoulli drift parameter p_drift for the six ensemble methods.]

Figure 4: Average performance across 30 seeds of the different ensemble methods with respect to the ensemble members' possible drift. We added in light color the corresponding one-standard-deviation intervals around the average.

The results shown by Figure 4 suggest the following:

• The ridge method cannot compensate for an additional discrete noise as much as the
adaptive method.

• Adaptive ridge has the most stable performance across the different regimes, which is
expected since it is a robust formulation.

• For all methods, it is more difficult to adjust to oscillating error distributions than to
a fixed error distribution, even if the errors are higher in the second distribution.

4.5 Evaluation of the Window of Past Forecasts that Can Be Used

These experiments aim to determine the performance of adaptive ridge with respect to the
window size of past forecasts it can use at each time step.

We fixed the number of ensemble members to m = 10 and the drift parameters of the error distributions to σ_drift = 0.5, s_drift = 0.5. We varied τ ∈ [1, 25].

Figure 5 shows the following:

• A large window size deteriorates the performance, as the adaptive ridge model overfits.

• Between a window size of 2 and 8 time steps, the performance is stable, with a sweet spot around 3-4 past time steps. This suggests the importance of tuning this value for the practitioner.

4.6 Evaluation of the Training Data Size

Contrary to all previous experiments, we varied the amount of training and validation data available, in particular to evaluate how the different ensemble methods cope with fewer samples. The test set remained the same as previously for all experiments. We split the training and validation data so that the test set immediately follows those samples. We varied the number of samples between 100 and 3000, fixing the validation data size to one-third of the available data (e.g., for a data size of 750, we get a training data size of 500 and a validation data size of 250).

We fixed the number of ensemble members to m = 10, the drift parameters of the error distributions to σ_drift = 0.5, s_drift = 0.5, and the window of past time steps to τ = 5.

We notice two interesting regimes in Figure 6:

• In the small data regime, with under 500 samples available for training, PA is the best
method as it does not rely on any training data. Ridge closely follows and bridges the
gap at around 300 samples. The adaptive ridge method suffers from the lack of data
as it overfits the training set.

• With more data available, the adaptive ridge method closes the gap at 500 - 750
samples and then outperforms all other methods.

• After 750 samples, the performance of all methods remains stable, with adaptive ridge
being the best, followed by PA, and then ridge.


[Figure 5 appears here: five panels (MAE, RMSE, MAPE, CVaR 5%, CVaR 15%) plotting each metric against the window size of past data for the six ensemble methods.]

Figure 5: Average performance across 30 seeds of the different ensemble methods with respect to the window size of past data that can be used. We added in light color the corresponding one-standard-deviation intervals around the average.

These experiments highlight that the adaptive framework is primarily suitable when sufficient data is available to determine adaptive coefficients that generalize well enough.

4.7 Conclusion of the Synthetic Experiments

We summarize our main findings:

[Figure 6 appears here: five panels (MAE, RMSE, MAPE, CVaR 5%, CVaR 15%) plotting each metric against the number of training samples for the six ensemble methods.]

Figure 6: Average performance across 30 seeds of the different ensemble methods with respect to the number of training samples available. We added in light color the corresponding one-standard-deviation intervals around the average.

• Adaptive ridge has a significant edge over ridge when sufficient data is available and
when ensemble members may suffer from performance drift.
• Adaptive ridge is not suitable when there is little data available. In that case, a
purely online method such as PA should be preferred, or a simpler static method such
as ridge.
• We highlight the importance of validating the different hyperparameters of the adaptive ridge method: the regularization factor λ and the window size τ of past forecasts to use in the adaptive term. Adaptive ridge, due to its additional variables, may be sensitive to overfitting in certain situations (e.g., large window size, small training data size).

• Ensemble methods such as the Ensemble Mean and Exp3 constrain the models' weights to sum to 1, while ridge and adaptive ridge can correct forecast biases with negative weights, offering greater accuracy and robustness than the best model in hindsight.

5. Real-World Case Studies

This section illustrates the benefits of our adaptive ensemble method with real-world data.
We investigate three different applications where the time series characteristics differ (see
Figure 7):
1. Air pollution management through wind speed forecasting: the time series exhibits a
daily cyclical behavior and a long-term seasonality.
2. Energy use forecasting: the time series exhibits bursts a few hours a week.
3. Tropical cyclone intensity forecasting: the time series are shorter and smoother, but
the underlying system is chaotic and complicated to predict.

5.1 Next-Hour Wind Speed Forecasting for Pollution Management

Motivation Air pollution is a pressing concern with far-reaching consequences for human health, the environment, and ecosystems. The emission of toxic substances from chemical factories poses a significant risk to nearby populations, particularly when meteorological conditions carry the pollution toward populated areas. Accurate weather prediction is crucial to inform real-time, effective pollution management strategies that mitigate this risk. This forms the basis of our first use case: predicting wind speed for air pollution management. Our data comes from a phosphate production site in Morocco, the largest chemical industry plant in the country. Previously, Bertsimas et al. (2023) demonstrated with this plant the effectiveness of a data-driven approach, including a weighted average ensemble model, in reducing the diffusion of air pollution from industrial plants into nearby cities, providing a valuable use case to show how our adaptive ensembles can further improve robustness and accuracy.

Air Pollution Management Pipeline The phosphate production site is located 10 km southwest of Safi city, Morocco, threatening the health and well-being of its 300,000 residents. With this population in close proximity, weather conditions play a critical role in determining air pollution dispersion. The site therefore implemented a comprehensive monitoring procedure that includes planning production rates and shutdowns based on 48-hour meteorological forecasts, as well as real-time wind monitoring systems to detect dangerous conditions and stop production. Timely and accurate prediction of the next hour's wind speed is crucial for the effective functioning of this procedure.

[Figure 7 appears here: three time series panels — wind speed over time (1-hour intervals, km/h), total energy use by appliances over time (1-hour intervals, Wh), and intensity over time for four different hurricanes (6-hour intervals, knots).]

Figure 7: Subsamples of the time series for our three real-world use cases: wind speed at the Safi factories, energy use by appliances, and tropical cyclone intensity.

Building on the success of Bertsimas et al. (2023)'s implementation of regularized linear regression models for wind speed forecasting, we explore the potential of the adaptive ridge method to improve accuracy and robustness and enhance the pollution management pipeline.

5.1.1 Data

Ensemble Members The Safi operations team receives the local official weather forecast
bulletins from the Moroccan weather agency daily around 6:00 am GMT, with an update
around 6:00 pm GMT. The wind speed forecasts are provided hourly for the next 48 hours.

Besides this operational forecast, the different models used by Bertsimas et al. (2023) are XGBoost, Decision Trees, Optimal Regression Trees, Lasso Regression, and Ridge Regression. These models took recent meteorological data observations as input.

We have access to the 8,494 historical one-hour lead time wind speed forecasts made by these 6 models between 2019 and 2022, one for every hour. The ground truth was measured by an on-site sensor that collected data every minute, which was then averaged hourly on the Safi platforms.

Experiments Protocol We split the data chronologically into training (50%), validation (20%), and test (30%) sets. We standardized all data (targets and features) by subtracting the mean of the training targets and dividing by their standard deviation.
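A small Julia sketch of this standardization step (ours; note that, as stated above, the training-target statistics are applied to both features and targets so that all quantities stay on the same wind-speed scale):

```julia
using Statistics

# Standardize features and targets with the training targets' statistics.
function standardize(Xtrain, ytrain, Xtest, ytest)
    μ, σ = mean(ytrain), std(ytrain)
    scale(A) = (A .- μ) ./ σ
    return scale(Xtrain), scale(ytrain), scale(Xtest), scale(ytest)
end
```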

After selecting the best hyperparameter combination for each ensemble method using the
validation set (see the Appendix for more details), we retrained the models on the training
and validation sets combined. We then evaluated them on the test set in the same way we
conducted the synthetic experiments.

5.1.2 Results

Table 2 compares the performance of the different ensemble methods with the same metrics
described previously.

Adaptive ridge consistently provides the best performance across all metrics, improving over
the best model in hindsight by 8% in MAE, 17% in RMSE, 7% in MAPE, 26% in CVaR
5%, and 14% in CVaR 15%. The adaptive ridge can substantially reduce the worst-case
errors, which is critical for the success of the pollution management pipeline.

In comparison, the other ensemble methods fail to outperform the best model in hindsight
consistently. As expected, Exp3 provides comparable performance to the best model in
hindsight: 0.510 vs. 0.507 in MAE, 0.690 vs. 0.756 in RMSE, and more robust results: 1.87
vs. 2.27 in CVaR 5%, 1.36 vs. 1.46 in CVaR 15%.

Adaptive ridge outperforms ridge substantially on all metrics, notably the robustness, by
17% in CVaR 5% and 12% in CVaR 15%, which is of substantial interest to the plant
operators to prepare better against incoming dangerous weather conditions.

5.1.3 Remarks on the adaptive coefficients

The ability of regression coefficients to adapt over time is demonstrated in Figure 8, which
plots a subset of these values over a period of two weeks. The rate of change for the
coefficients varies among models, with some displaying more significant shifts than others.
Furthermore, as depicted in Figure 9, the coefficients can take on both positive and negative

values. A daily cyclical pattern is also observed, which aligns with the cyclical nature of the forecast errors (due itself to the cyclical nature of the weather data).

Table 2: Results for each metric and ensemble method on the test set for the wind speed forecasting task on the Safi site. Results are given in km/h.

Ensemble Method | MAE | RMSE | MAPE (%) | CVaR 5% | CVaR 15%
Best Model in Hindsight | 0.507 | 0.756 | 15.9 | 2.27 | 1.46
Ensemble Mean | 0.553 | 0.783 | 17.4 | 2.27 | 1.52
Exp3 | 0.510 | 0.690 | 16.5 | 1.87 | 1.36
Passive-Aggressive | 0.643 | 0.868 | 20.1 | 2.35 | 1.72
Ridge | 0.521 | 0.718 | 16.7 | 2.03 | 1.42
Adaptive Ridge | 0.469 | 0.626 | 14.8 | 1.68 | 1.25

It is worth noting that the predictions of some models may compensate for one another, particularly when there is a high degree of correlation among them. In such cases, practitioners may want to consider selecting a subset of models before utilizing an adaptive ensemble, and potentially impose an additional constraint that the sum of their coefficients equals 1, depending on the specific application.

Figure 8: Hourly evolution of the regression coefficient values for each ensemble member.

5.2 Energy Consumption Forecasting

Due to fast urbanization, a rising population, and increased social needs, the energy demand in the building sector has expanded dramatically over the past few decades. In particular, appliances in residential buildings represent a substantial part of the electrical energy demand, approximately 30% (EIA, 2022). Numerous studies have investigated appliances' energy use in buildings and developed forecasting models of electrical energy consumption to guide multiple applications, such as detecting abnormal energy use patterns, determining adequate sizing of alternate energy sources and energy storage to diminish power flow into the grid, informing energy management systems for load control, demand-side management and demand-side response, and predicting electricity prices (e.g., Pham et al. (2020); Olu-Ajayi et al. (2022); Zaffran et al. (2022)). According to Colmenar-Santos et al. (2013), the availability of a building energy system with precise forecasting could save between 10% and 30% of all energy consumed. Overall, as accurate and robust forecasting can be impactful, we investigate how adaptive ridge compares with the other ensemble techniques using real-world data on appliance energy consumption provided by Candanedo et al. (2017).

[Figure 9 appears here: six scatter panels (Official Forecast, XGBoost, CART, Ridge, Lasso, ORT) plotting each ensemble member's coefficient value against its model errors.]

Figure 9: Coefficient values of each ensemble member with respect to their errors.

5.2.1 Data

The dataset is publicly available at the University of California Irvine (UCI) Machine Learn-
ing repository1 . First, we trained and selected our own forecasting models to constitute the
ensemble members before comparing the different ensemble methods.

Each observation in the dataset is a vector of measurements made by a wireless sensor
network in a low-energy building. The predictive features we use include the temperature
and humidity conditions in various rooms of the building, the weather conditions at the
nearest weather station, and the date and time. Our goal is to predict, at each time step,
the total energy consumption in Wh of the building's appliances within one hour. Mea-
surements are taken every 10 minutes over 4.5 months. For each forecasting case, we
aggregated the current features with the past 5 time steps of data to form a vector of 168
features used as predictors for standard machine learning models. We then trained the
forecasting models that constitute our ensemble members. We separated the data chrono-
logically into a training set (50%, 9858 samples), validation set 1 (25%, 4929 samples),
validation set 2 (10%, 1973 samples), and test set (15%, 2956 samples).
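As a minimal sketch of this preprocessing (the helper names are ours, and the raw feature count d = 28 is inferred from 168 = 28 x 6 rather than stated explicitly):

```python
import numpy as np

def make_lag_features(X, n_lags=5):
    """Concatenate each time step with its n_lags predecessors.

    X has shape (T, d); the output has shape (T - n_lags, d * (n_lags + 1)),
    e.g., d = 28 raw features and n_lags = 5 yield 168 predictors per sample.
    """
    T = X.shape[0]
    return np.hstack([X[lag : T - n_lags + lag] for lag in range(n_lags + 1)])

def chronological_split(Z, fractions=(0.50, 0.25, 0.10, 0.15)):
    """Split rows in time order into train / validation 1 / validation 2 / test."""
    cuts = np.cumsum([int(f * len(Z)) for f in fractions[:-1]])
    return np.split(Z, cuts)
```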

1. https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

Table 3: Results for each metric and ensemble method on the test set for the energy
consumption task. Results are given in Wh.

Ensemble Method            MAE    RMSE    MAPE (%)   CVaR 5%   CVaR 15%
Best Model in Hindsight    31.1   80.0    23.2       309.0     158.3
Ensemble Mean              33.3   79.4    27.6       304.3     157.6
Exp3                       32.9   78.8    27.1       301.9     156.4
Passive-Aggressive         52.2   149.1   45.0       476.7     249.8
Ridge                      35.3   76.4    33.7       287.5     145.0
Adaptive Ridge             26.3   59.4    23.5       224.9     113.8

We used the Python package Lazy Predict2 to conveniently train and cross-validate 36
different forecasting models using the training set only. The models trained included nu-
merous tree-based models, linear regressions, support vector machines, neural networks,
and various other methods. Among the models trained, we selected the 10 best with respect
to MAE on validation set 1 to constitute our final ensemble members: Ordinary Least
Squares (OLS), Ridge, Bayesian Ridge, ElasticNet, Huber Regressor, Least Angle Regres-
sion (LARS), Lasso, LassoLARS, Linear Support Vector Regression, and Orthogonal
Matching Pursuit.
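A condensed sketch of this selection step using plain scikit-learn; the candidate pool below is an illustrative stand-in for the full Lazy Predict pool, and the helper name is ours:

```python
from sklearn.linear_model import HuberRegressor, Lars, Lasso, Ridge
from sklearn.metrics import mean_absolute_error

# shortened illustrative pool; the paper trains 36 models with Lazy Predict
candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "huber": HuberRegressor(),
    "lars": Lars(),
}

def select_members(candidates, X_train, y_train, X_val1, y_val1, k=10):
    """Fit each candidate, rank by MAE on validation set 1, keep the best k."""
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_val1, model.predict(X_val1))
    best = sorted(scores, key=scores.get)[:k]
    return {name: candidates[name] for name in best}
```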

We collected the predictions of these different models on the validation sets and test sets
and used validation set 1 as training data for the ensembles, validation set 2 as validation
data for the ensembles, and the test set to evaluate the results.

5.2.2 Results

The results presented in Table 3 demonstrate the superiority of the adaptive ridge method
over the other ensemble techniques. It is the only ensemble method that outperformed
the best model in hindsight in MAE and RMSE, with a 15% improvement in MAE and
a 26% improvement in RMSE. The MAPE of adaptive ridge was similar to that of the
best ensemble member in hindsight (23.5% vs. 23.2%), while the next-best ensemble
method, Exp3, achieved a MAPE of 27.1%.

The energy consumption target values in this dataset displayed significant bursts, as shown
previously in Figure 7, making it difficult for all ensemble methods to accurately capture
such events, as evidenced by the poor CVaR scores. However, the adaptive ridge method
effectively leveraged the errors of each ensemble member, yielding more robust solutions
than any other method: compared to the best ensemble member, it improved CVaR 5%
by 27% and CVaR 15% by 28%, whereas ridge improved them by only 7% and 8%,
respectively.
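For reference, a minimal sketch of the tail metric, assuming CVaR α% here denotes the empirical average of the worst α-fraction of absolute errors (in the spirit of Rockafellar and Uryasev, 2000):

```python
import numpy as np

def empirical_cvar(errors, alpha=0.05):
    """Mean of the largest alpha-fraction of absolute errors.

    With alpha = 0.05, this averages the worst 5% of forecasts, so it
    captures exactly the bursty events that drive the scores in Table 3.
    """
    tail = np.sort(np.abs(errors))[::-1]          # largest errors first
    k = max(1, int(np.ceil(alpha * len(tail))))   # size of the tail
    return tail[:k].mean()
```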

2. https://lazypredict.readthedocs.io/


5.3 Tropical Cyclone Intensity Forecasting

Motivation Tropical cyclones are a considerable threat to communities, causing hundreds
of deaths and billions of dollars in damage each year. The US National Hurricane Center
(NHC) forecasts tropical cyclones' track, intensity, size, structure, storm surges, rainfall,
and tornadoes. Intensity prediction, i.e., forecasting the maximum sustained wind speed of
a storm over a 1-minute interval, is one of the most important and difficult tasks and
directly impacts decision-making. In this real-world use case, we evaluate the effectiveness
of the different ensemble methods for 24-hour lead time tropical cyclone intensity
forecasting, which is critical to undertake life-saving measures and mitigate the impact of
these devastating natural disasters.

Background on the Ensemble Members Current operational tropical cyclone (TC) fore-
casts used by the NHC can be classified into dynamical models, statistical models, and
statistical-dynamical models (Cangialosi, 2020). Dynamical models, also known as numer-
ical models, utilize powerful supercomputers to simulate the evolution of atmospheric
fields using dynamical and thermodynamical equations (Biswas et al., 2018; ECMWF,
2019). Statistical models approximate historical relationships between storm behavior and
storm-specific features and, in general, do not explicitly consider the physical process
(Aberson, 1998; Knaff et al., 2003). Statistical-dynamical models use statistical techniques
but further include atmospheric variables provided by dynamical models (DeMaria et al.,
2005). Lastly, the NHC employs consensus models (i.e., ensemble models) to combine
individual operational forecasts (Cangialosi, 2020; Cangialosi et al., 2020). In addition,
recent developments in deep learning have enabled machine learning models to process
and combine information from a wide range of sources and to create sophisticated archi-
tectures that model spatial-temporal relationships. In particular, Boussioux et al. (2022)
proposed a deep feature extraction methodology combined with boosted tree methods that
achieves performance comparable to operational forecasts.

5.3.1 Data

We collected the different operational and machine learning intensity forecasts provided by
Boussioux et al. (2022)3 . We used the forecasts made on their test set, corresponding to the
years 2016-2019. They obtained operational forecast data from the Automated Tropical
Cyclone Forecasting (ATCF) data set maintained by the NHC (Sampson and Schrader,
2000; National Hurricane Center, 2021). The ATCF data contains historical forecasts by
operational models used by the NHC for its official forecasting of tropical and subtropical
cyclones in the North Atlantic and Eastern Pacific basins.

Table 4 lists the 7 forecast models included as ensemble members and the 2 operational
benchmark models and summarizes the predictive methodologies employed.

3. https://github.com/leobix/hurricast

Table 4: Summary of all tropical cyclone intensity forecasting models used as ensemble
members or benchmark.
Model ID | Model name | Model type | Used as
HUML-(stat, xgb) | Hurricast (tabular data) | Gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/td) | Hurricast (tabular and vision data) | Feature extraction with tensor decomposition + gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/cnn/gru) | Hurricast (tabular and vision data) | Feature extraction with deep learning (CNN, GRU) + gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/cnn/transfo) | Hurricast (tabular and vision data) | Feature extraction with deep learning (CNN, Transformers) + gradient-boosted trees | Ensemble member
Decay-SHIPS | Decay Statistical Hurricane Intensity Prediction Scheme | Statistical-dynamical | Ensemble member
GFSO | Global Forecast System model | Multi-layer global dynamical | Ensemble member
HWRF | Hurricane Weather Research and Forecasting model | Multi-layer regional dynamical | Ensemble member
FSSE | Florida State Super Ensemble | Corrected consensus | Benchmark model
OFCL | Official NHC Forecast | Consensus | Benchmark model

Overall, our ensemble members comprise four deep learning models (Hurricast), one statistical-
dynamical model (Decay-SHIPS), and two dynamical models (GFSO, HWRF). We bench-
mark our ensemble methods against the forecasts made by FSSE and OFCL, which are
the highest-performing consensus models used by the NHC. The technical details of all the
models can be found in Cangialosi (2020) and Boussioux et al. (2022).

5.3.2 Experiments Protocol

Implementation Adjustments The use case of tropical cyclone intensity forecasting
requires two minor implementation adjustments. First, since we use forecasts issued every
6 hours with a 24-hour lead time, at a time step t we do not predict t + 1 but rather t + 4
(t + 24 h instead of t + 6 h). Therefore, for all methods, we only use the ground truth
data that would be available in real time to update the weights: to predict t + 4, we do
not use the information that becomes available at t + 1, t + 2, and t + 3, as sketched below.
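A small sketch of this bookkeeping with illustrative names (not from the paper's code):

```python
LEAD_STEPS = 4  # 24-hour horizon at a 6-hour forecast cadence

def observable_errors(errors_by_issue_step, t):
    """Keep only errors whose ground truth has verified by step t.

    A forecast issued at step s targets s + LEAD_STEPS, so its error is
    observable only once t >= s + LEAD_STEPS; the not-yet-verified steps
    t-3, ..., t are ignored when updating the weights used to predict
    t + LEAD_STEPS.
    """
    return {s: e for s, e in errors_by_issue_step.items() if s + LEAD_STEPS <= t}
```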

Second, tropical cyclones are temporary events: the forecast data comes as several time
series of different lengths instead of a single one. For the adaptive ridge method, we treat
all samples from each hurricane as independent samples in the training set and build Zt as
usual. For the bandit (Exp3) and passive-aggressive (PA) methods, for each new tropical
cyclone, we reuse the latest weights computed on the previous tropical cyclone.

Training, Validation, Test Splits The forecast data comprises 870 samples for the
North Atlantic Basin and 849 for the Eastern Pacific basin. We split the data chronologically
into training (50% of the data), validation (20% of the data), and test (30% of the data)
sets.


5.3.3 Results

Table 5 compares the different ensemble methods. The adaptive ridge and ridge methods
outperform all other ensembles, including the current operational models OFCL and FSSE,
which is of significant interest for operational use given the potential to improve the official
forecasts issued by the NHC. The strong CVaR performance of adaptive ridge is particularly
relevant for tropical cyclone forecasting, as it reduces the worst errors, which are the ones
leading to poor decision-making on the ground. For instance, adaptive ridge outperforms
the official forecast OFCL by 5% (resp. 21%) on the North Atlantic (resp. Eastern Pacific)
basin in CVaR 15% and by 12% (resp. 17%) in MAE.

The Ensemble Mean and Exp3 methods perform similarly to the operational consensus
FSSE and OFCL and generally improve slightly over the best model in hindsight. Adaptive
ridge has the best performance overall across basins and metrics, with only ridge slightly
surpassing it in CVaR on the North Atlantic basin. The adaptive component provides a
valuable performance boost compared to ridge only, such as an MAE gain of 4% on the
North Atlantic basin and 9% on the Eastern Pacific basin.

Table 5: Results in knots for each metric and ensemble method for the 24-hour lead time
intensity forecasting task. Bold values highlight the best performance for each metric.
                        North Atlantic Basin (261 cases)              Eastern Pacific Basin (254 cases)
Ensemble Method         MAE   RMSE  MAPE (%)  CVaR 5%  CVaR 15%      MAE   RMSE  MAPE (%)  CVaR 5%  CVaR 15%
Best Model in Hindsight 9.2 12.2 14.2 31.4 24.6 11.3 15.3 14.5 41.1 30.8
Ensemble Mean 9.4 12.8 14.0 35.2 25.6 10.9 14.6 14.2 38.2 29.2
Exp3 8.8 11.5 13.3 30.0 22.6 10.3 14.9 15.5 43.3 31.1
Passive-Aggressive 9.1 12.4 13.9 34.6 25.3 17.9 22.9 27.1 57.2 44.4
FSSE 8.8 11.6 13.8 31.0 23.2 10.0 14.4 15.5 41.3 28.9
OFCL 8.4 11.4 13.3 29.8 22.2 10.6 15.6 17.2 44.3 30.6
Ridge 7.7 10.3 11.7 27.2 21.0 9.7 13.6 15.0 39.2 27.8
Adaptive Ridge 7.4 10.2 11.6 28.6 21.1 8.8 12.1 13.4 34.1 24.1

5.4 Further Applications to Adaptive Feature Selection

Beyond ensemble modeling, our methodology can be employed for adaptive feature selec-
tion in complex multimodal tasks. For instance, in healthcare, the methods developed by
Soenksen et al. (2022); Bertsimas et al. (2022) merge modalities such as tabular data, text,
time series, and images by extracting features through deep learning, forming a unified
vector representation. This vector is then employed to generate forecasts using standard
machine learning algorithms. However, the weighting of these features remains static over
time. By incorporating adaptive feature selection, we could dynamically identify the most
relevant features for predicting patient outcomes, disease progression, or treatment re-
sponses. Additional applications of adaptive multimodal feature selection encompass nat-
ural disaster management (Zeng and Bertsimas, 2023), climate change, agriculture, finance
and economics, and many others.

6. Conclusion

Using an ARO approach, this paper presented a novel methodology for building robust
ensembles of time series forecasting models. Our technique is based on a linear ensemble
framework, where the weights of the ensemble members are adaptively adjusted over time
based on the latest errors. This approach robustly adapts to model drift using an affine
decision rule and a min-max problem formulation transformed into a tractable form with
equivalence theorems.

We have demonstrated the effectiveness of our adaptive ensemble mechanism through a
series of synthetic and real-world experiments. Our results show that our approach
outperforms several standard ensemble methods in terms of accuracy and robustness. We
have also applied our technique to three real-world use cases – weather forecasting for
air pollution management, energy consumption forecasting, and tropical cyclone intensity
forecasting – highlighting its merit for high-stakes tasks where robustness and risk
minimization are critical.

One of the advantages of our method is that it relies on a robust linear formulation, which
makes it valuable from a practical standpoint for ease of deployment and interpretability.
However, the method also requires a training set of sufficient size, a trade-off we exhibited
in our synthetic experiments. Therefore, we recommend using our adaptive ensemble
framework when a few hundred historical time series points are available for training.

Adaptive ensembles have the potential for many time series applications, including epidemi-
ological predictions, healthcare operations, logistics, supply chain, manufacturing, finance,
and product demand forecasting.

Acknowledgements

We thank Shuvomoy Das Gupta, Amine Bennouna, and Moïse Blanchard for useful discussions.
The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing
Center for providing high-performance computing resources that have contributed to the
research results reported within this paper.


References

Sim D. Aberson. Five-day tropical cyclone track forecasts in the North Atlantic basin.
Weather and Forecasting, 13(4):1005–1015, 1998.

Ratnadip Adhikari and R. K. Agrawal. Combining multiple time series models through a
robust weighted mechanism. In 2012 1st International Conference on Recent Advances
in Information Technology (RAIT), pages 455–460, 2012. 10.1109/RAIT.2012.6194621.

J. M. Bates and C. W. J. Granger. The combination of forecasts. OR, 20(4):451–468, 1969.

A. Ben-Tal, A. Goryashko, E. Guslitzer, and A. Nemirovski. Adjustable robust solutions of
uncertain linear programs. Mathematical Programming, 99(2):351–376, 2004.

Amine Bennouna and Bart Van Parys. Holistic robust data-driven decisions, 2022. URL
https://arxiv.org/abs/2207.09560.

Dimitris Bertsimas and Martin S. Copenhaver. Characterization of the equivalence of ro-
bustification and regularization in linear and matrix regression. European Journal of
Operational Research, 270(3):931–942, 2018.

Dimitris Bertsimas and Dick den Hertog. Adaptive and robust optimization. Dynamic
Ideas, 2022.

Dimitris Bertsimas and Angelos Georghiou. Design of near optimal decision rules in multi-
stage adaptive mixed-integer optimization. Operations Research, 63(3):610–627, 2015.

Dimitris Bertsimas, Eugene Litvinov, Xu Andy Sun, Jinye Zhao, and Tongxin Zheng. Adap-
tive robust optimization for the security constrained unit commitment problem. IEEE
Transactions on Power Systems, 28(1):52–63, 2013.

Dimitris Bertsimas, Kimberly Villalobos Carballo, Yu Ma, Liangyuan Na, Léonard Bous-
sioux, Cynthia Zeng, Luis R. Soenksen, and Ignacio Fuentes. Tabtext: a systematic
approach to aggregate knowledge across tabular data structures, 2022.

Dimitris Bertsimas, Leonard Boussioux, and Cynthia Zeng. Reducing air pollution through
machine learning, 2023. URL https://arxiv.org/abs/2303.12285.

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach
to numerical computing. SIAM review, 59(1):65–98, 2017.

Mrinal K. Biswas, Sergio Abarca, Ligia Bernardet, Isaac Ginis, Evelyn Grell, Michael
Iacono, Evan Kalina, Bin Liu, et al. Hurricane Weather Research and Forecasting (HWRF)
model: 2018 scientific documentation. Developmental Testbed Center, 2018.

Léonard Boussioux, Cynthia Zeng, Théo Guénais, and Dimitris Bertsimas. Hurricane fore-
casting: A novel multimodal machine learning framework. Weather and Forecasting, 37
(6), 2022. https://doi.org/10.1175/WAF-D-21-0091.1.

Leo Breiman. Bias, variance, and arcing classifiers. 1996a.

Leo Breiman. Heuristics of instability and stabilization in model selection. Annals of
Statistics, 24(6):2350–2383, 1996b.

Gavin Brown and Ludmila Kuncheva. “good” and “bad” diversity in majority vote ensem-
bles. pages 124–133, 2010.

Luis M. Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction
models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–
97, 2017.

John P. Cangialosi. National Hurricane Center forecast verification report. National
Hurricane Center, 2020. URL https://www.nhc.noaa.gov/verification/pdfs/Verification_2020.pdf.

John P. Cangialosi, Eric Blake, Mark DeMaria, Andrew Penny, Andrew Latto, Edward
Rappaport, and Vijay Tallapragada. Recent Progress in Tropical Cyclone Intensity Fore-
casting at the National Hurricane Center. Weather and Forecasting, pages 1–30, 2020.

Vitor Cerqueira, Luis Torgo, and Igor Mozetič. Evaluating time series forecasting models:
an empirical study on performance estimation methods. Machine Learning, 109(11):
1997–2028, 2020.

Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.

Chris Chatfield. Time-Series Forecasting. Chapman and Hall/CRC, New York, 2000.

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. CoRR,
abs/1603.02754, 2016. URL http://arxiv.org/abs/1603.02754.

Robert T. Clemen. Combining forecasts: A review and annotated bibliography. Interna-
tional Journal of Forecasting, 5(4), 1989. https://doi.org/10.1016/0169-2070(89)90012-5.

Antonio Colmenar-Santos, Lya Lober, David Borge-Diez, and Manuel Castro. Solutions to
reduce energy consumption in the management of large buildings. Energy and Buildings,
56:66–77, 01 2013. 10.1016/j.enbuild.2012.10.004.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online
passive-aggressive algorithms. Journal of Machine Learning Research, 7(19):551–585,
2006.

Mark DeMaria, Michelle Mainelli, Lynn K. Shay, John A. Knaff, and John Kaplan. Further
improvements to the statistical hurricane intensity prediction scheme (ships). Weather
and Forecasting, 20(4):531 – 543, 2005.

Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier
Systems, pages 1–15. Springer Berlin Heidelberg, 2000.

Iain Dunning, Joey Huchette, and Miles Lubin. Jump: A modeling language for mathe-
matical optimization. SIAM Review, 59(2):295–320, 2017. 10.1137/15M1020575.


ECMWF. Part III: Dynamics and numerical procedures. Number 3 in IFS Documenta-
tion. ECMWF, 2019. URL https://www.ecmwf.int/node/19307.

U.S. Energy Information Administration EIA. Annual energy outlook, 2022. URL
https://www.eia.gov/outlooks/aeo/index.php.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Hamid Bouchachia.
A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46, 04 2014.
10.1145/2523813.

Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting,
4(1):1–28, 1985.

Julia Gastinger, Sébastien Nicolas, Dušica Stepić, Mischa Schmidt, and Anett Schülke. A
study on ensemble learning for time series forecasting and the need for meta-learning.
Technical Report arXiv:2104.11475, arXiv, 2021. URL http://arxiv.org/abs/2104.11475.

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL
https://www.gurobi.com.

James Douglas Hamilton. Time Series Analysis. Princeton University Press, 2020.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):55–67, 1970. 10.1080/00401706.1970.10488634.

Rob Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts,
Australia, 3rd edition, 2021.

J.A. Knaff, M. DeMaria, C.R. Sampson, and J.M. Gross. Statistical 5-day tropical cyclone
intensity forecasts derived from climatology and persistence. Weather and Forecasting,
18:80 –92, 2003.

Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-
Interscience, USA, 2004.

Pedro Lara-Benítez, Manuel Carranza-García, and José C. Riquelme. An experimental
review on deep learning architectures for time series forecasting. CoRR, abs/2103.12057,
2021. URL https://arxiv.org/abs/2103.12057.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press,
2020. 10.1017/9781108571401.

Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. M5 accuracy com-
petition: Results, findings, and conclusions. International Journal of Forecasting, 2022.

National Hurricane Center. Automated tropical cyclone forecasting system (atcf), 2021.
URL https://ftp.nhc.noaa.gov/atcf/. Accessed: 2021-04-06.

Mariana Oliveira and Luis Torgo. Ensembles for time series forecasting. In Dinh Phung
and Hang Li, editors, Proceedings of the Sixth Asian Conference on Machine Learning,
volume 39 of Proceedings of Machine Learning Research, pages 360–370. PMLR, 26–28
Nov 2015.

Razak Olu-Ajayi, Hafiz Alaka, Ismail Sulaimon, Funlade Sunmola, and Saheed Ajayi.
Building energy consumption prediction for residential buildings using deep learning and
other machine learning techniques. Journal of Building Engineering, 45:103406, 2022.
https://doi.org/10.1016/j.jobe.2021.103406.

Anh-Duc Pham, Ngoc-Tri Ngo, Thi Thu Ha Truong, Nhat-To Huynh, and Ngoc-Son Truong.
Predicting energy consumption in multiple buildings using machine learning for improving
energy efficiency and sustainability. Journal of Cleaner Production, 260:121082, 2020.
https://doi.org/10.1016/j.jclepro.2020.121082.

Ricardo Prudêncio and Teresa Ludermir. Meta-learning approaches to selecting time series
models. Neurocomputing, 61:121–137, 2004. 10.1016/j.neucom.2004.03.008.

Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand, David
Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones,
Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles Yee,
and Peter Michaleas. Interactive supercomputing on 40,000 cores for machine learning and
data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC),
pages 1–6. IEEE, 2018.

R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk.
Journal of Risk, 2:21–41, 2000.

David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Prob-
abilistic forecasting with autoregressive recurrent networks. International Journal of
Forecasting, 36(3):1181–1191, 2020.

C. Sampson and Ann J. Schrader. The automated tropical cyclone forecasting system
(version 3.2). Bulletin of the American Meteorological Society, 81:1231–1240, 2000.

Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227,
June 1990. 10.1007/BF00116037.

Alexander Shapiro. Monte carlo sampling methods. Handbooks in operations research and
management science, 10:353–425, 2003.

James E Smith and Robert L Winkler. The optimizer’s curse: Skepticism and postdecision
surprise in decision analysis. Management Science, 52(3):311–322, 2006.

Luis R. Soenksen, Yu Ma, Cynthia Zeng, Leonard Boussioux, Kimberly Villalobos Car-
ballo, Liangyuan Na, Holly M. Wiberg, Michael L. Li, Ignacio Fuentes, and Dimitris
Bertsimas. Integrated multimodal artificial intelligence framework for healthcare appli-
cations. Nature npj Digital Medicine, 5(1):149, 2022. URL https://doi.org/10.1038/s41746-022-00689-4.


Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
https://doi.org/10.1016/S0893-6080(05)80023-1.

Huan Xu, Constantine Caramanis, and Shie Mannor. Robust Regression and Lasso. In
D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Infor-
mation Processing Systems, volume 21. Curran Associates, Inc., 2008.

Margaux Zaffran, Aymeric Dieuleveut, Olivier Féron, Yannig Goude, and Julie Josse. Adap-
tive conformal predictions for time series, 2022.

Cynthia Zeng and Dimitris Bertsimas. Global flood prediction: a multimodal machine
learning approach, 2023.

Hui Zou and Yuhong Yang. Combining time series models for forecasting. International
Journal of Forecasting, 20(1):69–84, 2004.

Appendix A. Additional Details on Data Experiments

A.1 Evaluation of the Training Time with Synthetic Data

We provide the computational time of adaptive ridge when we vary the training data size
(Table 6), the number of past time steps to use for the adaptive rule (Table 7), and the
number of ensemble members available (Table 8). Overall, the method remains tractable
with typical values of those hyperparameters.

Table 6: Average training time across 30 seeds of adaptive ridge with respect to the number
of samples used in the training set, when we fix the number of ensemble members to m = 10
and the number of past time steps to use to τ = 5.

Number of samples in training set Average time (s)


100 0.4
200 0.6
400 1.3
750 64
1500 92
3000 239

Table 7: Average training time across 30 seeds of adaptive ridge with respect to the number
of past time steps to use, when we fix the number of ensemble members to m = 10 and the
number of training samples to N = 3000.

Number of time steps Average time (s)


2 54
3 100
5 239
10 603
15 1079
25 1361

Table 8: Average training time across 30 seeds of adaptive ridge with respect to the number
of ensemble members to use, when we fix the number of past time steps to τ = 5 and the
number of training samples to N = 3000.

Number of ensemble members Average time (s)


3 7
6 51
10 239
15 236
25 236
50 250

A.2 Hyperparameter Tuning in Real-World Experiments

For all real-world experiments, we tuned the following hyperparameters (a minimal search
loop is sketched after the list):

• regularization factors for Ridge, Adaptive Ridge, and Passive-Aggressive, with values
λ, λPA ∈ {0, 1e-4, 1e-3, 1e-2, 1e-1, 1, 2};

• window size of past data to use: τ ∈ [1, 10].
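A minimal sketch of the corresponding validation search, where fit_and_score is a placeholder (our name, not the paper's code) that trains an ensemble with the given (λ, τ) and returns its validation error:

```python
from itertools import product

# illustrative grid mirroring the search space above
lambdas = [0, 1e-4, 1e-3, 1e-2, 1e-1, 1, 2]
taus = range(1, 11)  # window sizes 1..10

def tune(fit_and_score):
    """Return the (lambda, tau) pair minimizing the validation error."""
    return min(product(lambdas, taus), key=lambda pair: fit_and_score(*pair))
```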

A.3 Synthetic Data Hyperparameters

Table 9 summarizes the different hyperparameters involved in our synthetic experiments.


Table 9: List of hyperparameters used in our synthetic data experiments, with their fixed
value or the range over which they vary.

General data settings
  Total number of samples available (T): 4000 (always fixed)
  Number of samples in the training set: 2000; varies in [50, 2000] for the training data length experiment
  Number of samples in the validation set: 1000; varies in [25, 1000] for the training data length experiment
  Number of samples in the test set: 1000 (always fixed)
  Number of seeds to repeat standard experiments: 30 (always fixed)

Ground truth values
  Sine period: 500 (always fixed)
  Target data Gaussian noise mean: 0 (always fixed)
  Target data Gaussian noise standard deviation: 0.1 (always fixed)

Ensemble members
  Number of ensemble members (m): 10; varies in [2, 50] for the number of ensemble members experiments
  Lower bound for uniform sampling of the ensemble members' mean errors (b-): 0.5 (always fixed)
  Upper bound for uniform sampling of the ensemble members' mean errors (b+): 0.5 (always fixed)
  Lower bound for uniform sampling of the standard deviation of the ensemble members' errors (s-): 0 (always fixed)
  Upper bound for uniform sampling of the standard deviation of the ensemble members' errors (s+): 0.5 (always fixed)

Ensemble members' drift
  Mean of the Gaussian distribution used to sample the mean of each ensemble member's drift distribution: 0 (always fixed)
  Standard deviation of the Gaussian distribution used to sample the mean of each ensemble member's drift distribution (σdrift): 0.5; varies in [0, 1] for the model drift experiments
  Lower bound of the uniform distribution used to sample the standard deviation of each ensemble member's drift distribution: 0 (always fixed)
  Upper bound of the uniform distribution used to sample the standard deviation of each ensemble member's drift distribution (sdrift): 0.5; varies in [0, 1] for the model drift experiments
  Probability of adding an additional Gaussian noise at each time step (pdrift): varies in [0, 1] (only used for a specific data drift experiment)
