Abstract
Accurate time series forecasting is critical for a wide range of problems with temporal
data. Ensemble modeling is a well-established technique for leveraging multiple predictive
models to increase accuracy and robustness, as the performance of a single predictor can be
highly variable due to shifts in the underlying data distribution. This paper proposes a new
methodology for building robust ensembles of time series forecasting models. Our approach
utilizes Adaptive Robust Optimization (ARO) to construct a linear regression ensemble in
which the models’ weights can adapt over time. We demonstrate the effectiveness of our
method through a series of synthetic experiments and real-world applications, including
air pollution management, energy consumption forecasting, and tropical cyclone intensity
forecasting. Our results show that our adaptive ensembles outperform the best ensemble
member in hindsight by 16-26% in root mean square error and 14-28% in conditional value
at risk and improve over competitive ensemble techniques.
Keywords: time series forecasting, ensemble modeling, regression, robust and adaptive
optimization, sustainability
1. Introduction
Time series data capture sequential measurements and observations indexed by timestamps.
The unique characteristics of time series data, in which observations have a chronological
order, often make their analysis challenging. In particular, predicting future target values
through time series forecasting is a critical research topic in many real-world applications
with a temporal component, such as healthcare operations, finance, energy, climate, and
business. Organizations and professionals often use forecasting to inform their decision-
making, anticipate uncertain scenarios, and take proactive measures.
When applied to time series forecasting, ensemble modeling can become a complex task,
as the temporal aspect of the data adds difficulty in determining which model is most
suitable at any given point in time (Oliveira and Torgo, 2015; Cerqueira et al., 2020).
A common approach is to weigh the ensemble members based on their performance in
recent or historical observations, assuming the data distribution will stay similar and models
behave consistently. Other approaches include meta-learning (Prudêncio and Ludermir,
2004; Gastinger et al., 2021) or regret minimization, for example, under the multi-armed
bandit setting (Cesa-Bianchi and Lugosi, 2006).
However, these methods do not necessarily enforce the robustness of the ensembles, which
can lead to critical prediction errors. In this study, we propose to explore ensemble modeling
for time series forecasting from a different perspective, using robust optimization. We
present a novel optimization scheme that dynamically selects the weights of an ensemble of
forecasts at each time step, taking into account the uncertainty and errors of the underlying
forecasts. Our approach is based on multi-stage robust optimization, where the weights are
optimized variables determined at each time step in response to newly revealed information
about the uncertain parameters.
Traditional optimization assumes that all optimization variables are “here and now” deci-
sions, i.e., we have to determine the values for these optimization variables now and cannot
wait for more information on the uncertain parameters. However, in multi-stage (dynamic)
decision problems, we can introduce “wait and see” variables that can be determined af-
ter the uncertain parameter values have been revealed (Bertsimas and den Hertog, 2022).
This paper makes the following contribution:
1. A novel ensemble method that leverages the ARO framework to dynamically adjust
the weights of underlying models in real time, resulting in a robust and accurate
forecasting tool. Our approach, based on linear regression ensembles, is detailed in
Section 3.
2. Background
This section briefly reviews standard ensemble methods for time series forecasting before
introducing the field of robust and adaptive optimization.
Linear Ensembles After the seminal work of Bates and Granger (1969), several combi-
nation methodologies were added to the forecaster toolbox (Clemen, 1989; Zou and Yang,
2004). In particular, weighted linear combinations of different ensemble members became
popular due to the straightforward implementation for real-world deployment. Some typi-
cal linear combination techniques include the simple average, trimmed average, winsorized
average, or median of the ensemble members’ forecasts. Other methodologies can weigh
Figure 1: Schematic of the principle of ensemble modeling. Several base learners (base models 1 through m) make predictions from the original data and are combined by an ensemble model into a final prediction. We represent the three real-world data applications included in this paper: air pollution management, energy use from appliances, and hurricane intensity forecasting.
the models based on their past errors, respective performance, or variance. For instance,
one can train a regularized linear regression such as Ridge (Hoerl and Kennard, 1970) or
LASSO (Tibshirani, 1996) to ensemble the different forecasts. See Adhikari and Agrawal
(2012) for a thorough review of linear ensemble methods.
Model Drift In general, the weighted ensembles are often designed and trained using
some historical observations and then deployed as fixed. While this is convenient for de-
ployment purposes, this static setting may present limitations when there is model drift,
i.e., potential degradation of a model’s predictive power due to changes in the data. In
this case, the ensemble members’ performance can vary across time or different events due
to “data drift” or “concept drift”. Data drift refers to the model drift scenario when the
properties of the independent variables’ distribution change, for example, due to season-
ality, a shift in consumer behaviors or company strategies, or unexpected events. On the
other hand, concept drift corresponds to the scenario when the properties of the dependent
variable change, which can happen because of changes in the definition of the target value,
the annotation methodology, or the target sensor.
Regret-minimization algorithms dynamically combine the forecasting models at each time step by considering the problem of minimizing the multi-armed bandit regret against the best ensemble member.
Robust Optimization (RO) seeks to immunize models and problem formulations from adver-
sarial perturbations in the data by introducing, in general, an uncertainty set that captures
the deterministic assumptions about these perturbations. The objective is to find solutions
that remain good or feasible under a general level of uncertainty; such solutions are called robust against errors of the specific magnitude and type chosen by the practitioner.
RO has gained substantial traction in recent years due to the guarantees provided to prop-
erly designed formulations and the progress of techniques and software to solve min-max
adversarial optimization formulations (Bertsimas and den Hertog, 2022). Formally, consider a nominal optimization problem with uncertain parameters $z$:
$$\min_{x \in \mathcal{X}} \; f(z, x) \quad \text{s.t.} \quad f_i(z, x) \leq 0, \ \forall i \in [p].$$
When z is readily available, we can solve this problem with classical optimization methods.
However, z is uncertain and rarely explicitly known in real-world applications. The robust
optimization approach consists in representing the set of possible values of z with an uncer-
tainty set Z and solving for the decision x such that the constraints fi are always satisfied.
We now optimize $f$ for the worst case, and the problem becomes:
$$\min_{x \in \mathcal{X}} \; \max_{z \in \mathcal{Z}} \; f(z, x) \quad \text{s.t.} \quad f_i(z, x) \leq 0, \ \forall z \in \mathcal{Z}, \ \forall i \in [p].$$
The adversarial nature of such a problem can make it very difficult to solve to optimality.
A key aspect of RO is to derive an equivalent reformulation of a robust problem with a
computationally tractable form.
2.3 Adaptive Robust Optimization
Adaptive robust optimization (ARO) expands the RO framework by separating the decision
x into multiple stages. The principle of a multi-stage problem is to contain adaptive “wait-
and-see” decisions in addition to urgent “here-and-now” decisions. With ARO, once the
“here-and-now” decisions are made, we consider that some of the uncertain parameters will
become known before determining the “wait-and-see” decisions.
Formally, we denote the adaptive decisions as y(z) since they are selected only after some of the uncertain parameters z may be revealed. An ARO problem typically writes as:
$$\min_{x \in \mathcal{X}} \; \max_{z \in \mathcal{Z}} \; f(z, x, y(z)) \quad \text{s.t.} \quad f_i(z, x, y(z)) \leq 0, \ \forall z \in \mathcal{Z}, \ \forall i \in [p].$$
Note that expressing y as a function of z allows the practitioner to choose the function y(z)
arbitrarily before learning z, which is then called a decision rule.
In general, problems that contain such constraints with decision rules are NP-hard (Ben-Tal
et al., 2004). Therefore, to make them tractable, we may restrict y(z) to a given class of
functions, even though this restriction could be sub-optimal. One such class is the affine decision rule,
which conveniently expresses y(z) with an affine relationship with z: y(z) = x0 + V · z,
where the coefficients x0 and V are to be determined and become “here-and-now” decision
variables. Using decision rules is one of ARO’s core ideas that will serve this paper’s
methodology.
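To make the affine decision rule concrete, here is a toy two-stage illustration of ours (not from the paper), with hypothetical costs $c$ and $q$:

```latex
% Toy two-stage robust problem (illustration only): x is here-and-now,
% y(z) is wait-and-see, and z is the uncertain parameter.
\min_{x,\, y(\cdot)} \; \max_{z \in [0,1]} \; c\,x + q\,y(z)
\quad \text{s.t.} \quad x + y(z) \ge z, \quad \forall z \in [0,1].
% Restricting y(z) = v_0 + v_1 z yields a finite-dimensional problem in
% (x, v_0, v_1): the constraint x + v_0 + v_1 z \ge z is affine in z, so it
% holds for all z in [0,1] if and only if it holds at the endpoints z=0, z=1.
```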
3. Adaptive Robust Linear Ensembles
We aim to formulate an adaptive version of a robust linear regression problem, such as Ridge
or LASSO, in ensemble modeling for time series forecasting. We show how to develop an
adaptive robust formulation that leverages the temporal aspect of the data and guarantees
specific protection against model drift.
3.1 Setting
We consider that we are given a historical time series of ensemble members' predictions $(X, y) \in \mathbb{R}^{T \times m} \times \mathbb{R}^{T}$, with $T$ regularly spaced samples $X_t \in \mathbb{R}^m$ and the corresponding ground-truth values $y_t \in \mathbb{R}$.
Since we are interested in a dynamic linear combination of the forecasts, we associate all ensemble member predictions with time-varying coefficients $\boldsymbol{\beta}_t = [\beta_{t1}, \ldots, \beta_{tm}] \in \mathbb{R}^m$. For compactness, we also define the stacked vector of coefficients $\boldsymbol{\beta} := [\boldsymbol{\beta}_1^\top, \ldots, \boldsymbol{\beta}_T^\top]^\top \in \mathbb{R}^{T \cdot m}$.
To build an adaptive weighted ensemble to approximate the ground truth values, we are
interested in minimizing problems similar to the following ordinary least squares problem:
$$\min_{\boldsymbol{\beta} \in \Omega} \; \sqrt{\sum_{t=1}^{T} \left( y_t - X_t^\top \boldsymbol{\beta}_t \right)^2}, \tag{1}$$
or its least absolute deviation counterpart:
$$\min_{\boldsymbol{\beta} \in \Omega} \; \sum_{t=1}^{T} \left| y_t - X_t^\top \boldsymbol{\beta}_t \right|. \tag{2}$$
Before proceeding, we make several remarks about the settings in this paper:
• We allow β t to vary over time, as we want more flexibility to capture the fluctuations
in the forecast skills. We will define potential constraints on β ∈ Ω later.
• We consider the complete observation of the forecasts and targets for a history of
T time steps. We assume access to historical data that can serve as training and
validation sets.
• We assume that each time step is regularly spaced. However, the lead time for predic-
tion is not necessarily one time step. For example, we can consider 6 hours between
each time step, and make forecasts for 24 hours later, i.e., for a given time step t, yt
is the ground truth value in 24 hours (t + 4), and Xt contains all the values forecasted
for t + 4 by each ensemble member.
• After optimizing the above weights on a training set, the goal is to use the learned
rules about β to make predictions in the “future”, i.e., at time T + 1, T + 2, . . . .
For convenience and further developments, we rewrite problems (1) and (2) under a general
compact form where the temporal aspect does not appear although it is still present in the
underlying variables and parameters:
$$\min_{\boldsymbol{\beta} \in \Omega} \; \left\| y - \tilde{X} \boldsymbol{\beta} \right\|, \tag{3}$$
where $\|\cdot\|$ is a given norm, and
$$\tilde{X} := \begin{bmatrix} X_1^\top & 0 & \cdots & 0 \\ 0 & X_2^\top & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & X_T^\top \end{bmatrix} \in \mathbb{R}^{T \times T \cdot m}.$$
3.2 Robustification
The nominal formulation (3) is akin to the popular Sample Average Approximation (SAA) (Shapiro, 2003), minimizing the empirical error on the observed historical data. However, such a formulation is prone to overfitting and poor out-of-sample performance (Smith and Winkler, 2006). This overfitting phenomenon typically originates in corruption of the data, such as noise or outliers, or simply from the fact that the available data is finite and the empirical error does not precisely approximate the out-of-sample error (Bennouna and Van Parys, 2022). Hence, a natural approach to avoid overfitting is to robustify the nominal formulation (3) using an uncertainty set of possible perturbations $\Delta \in \mathcal{U}$ of the forecast matrix $\tilde{X}$. We consider, therefore, the robust formulation:
$$\min_{\boldsymbol{\beta}(\Delta) \in \Omega} \; \max_{\Delta \in \mathcal{U}} \; \left\| y - (\tilde{X} + \Delta)\,\boldsymbol{\beta}(\Delta) \right\|, \tag{4}$$
where U is an uncertainty set that characterizes the user’s belief about perturbations of the
historical forecasts $X_t$ at each time step $t$. Here, as explained in Section 2.3, the decision $\boldsymbol{\beta}$
becomes an adaptive variable β(∆) to the perturbation.
Using a proper choice of uncertainty set U, the formulation (4) accounts for “adversarial
noise” in the forecast matrix X̃ and seeks to protect the dynamic linear regression problem
from structural uncertainty in the data.
Intuitively, this uncertainty matrix encapsulates some of the errors that all forecasting models naturally make at each time step, due to imperfections in the modeling process or in the data.
$$\min_{\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_T} \; \max_{\Delta_1 \in \mathcal{U}_1, \ldots, \Delta_T \in \mathcal{U}_T} \; \sum_{t=1}^{T} \left| y_t - (X_t + \Delta_t)^\top \boldsymbol{\beta}_t(\Delta_1, \ldots, \Delta_T) \right|. \tag{5}$$
We will consider in the rest of the paper that β t only depends on previous uncertainties
∆1 , . . . , ∆t−1 that have been revealed so far at time t. Therefore, the above example
problem (5) would write as:
$$\min_{\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_T} \; \max_{\Delta_1 \in \mathcal{U}_1, \ldots, \Delta_T \in \mathcal{U}_T} \; \sum_{t=1}^{T} \left| y_t - (X_t + \Delta_t)^\top \boldsymbol{\beta}_t(\Delta_1, \ldots, \Delta_{t-1}) \right|.$$
With this consideration, we now proceed in the general case to derive decision rules for
β t (∆1 , . . . , ∆t−1 ).
As explained in Section 2.3, the multi-stage problem (4) can be cast in a more amenable
formulation using an affine decision rule for $\boldsymbol{\beta}_t$ with respect to $\Delta_1, \ldots, \Delta_{t-1}$. In practice, we restrict this dependence to a fixed window of the $\tau$ most recent uncertainties $\Delta_{t-\tau}, \ldots, \Delta_{t-1}$. This fixed window of past information makes the problem more tractable and practical
by restraining the parameter size and ensuring every β t depends on the same number of
historical time steps. Note that for the edge cases β τ −1 , . . . , β 1 , we define without loss of
generality some extra values ∆0 , . . . , ∆−τ +1 .
We next consider that at time $t$, the uncertainties $(\Delta_s)_{1 \leq s < t}$ have been observed as $(\hat{\Delta}_s)_{1 \leq s < t}$, and we use the proxy $\hat{\Delta}_s \approx X_s - y_s e$, i.e., we identify the forecast uncertainties with the previous forecast errors.
We now model $\boldsymbol{\beta}_t$ with an affine decision rule depending on the observed uncertainties:
$$\boldsymbol{\beta}_t := \boldsymbol{\beta}_0 + V_0 \cdot Z_t,$$
where we define the variables $\boldsymbol{\beta}_0 \in \mathbb{R}^m$ and $V_0 \in \mathbb{R}^{m \times m \cdot \tau}$, and the vector of parameters:
$$Z_t := \left[ X_{t-\tau} - y_{t-\tau} e \,\middle|\, \ldots \,\middle|\, X_{t-1} - y_{t-1} e \right] \in \mathbb{R}^{m \cdot \tau}.$$
Overall, the proposed formulation adapts the coefficients at each time step, following an
affine decision rule that can leverage the recent forecast error history.
Example 1 (follow-up) With this modeling and decision rule, the previous adaptive
robust formulation (5), using the $\ell_1$ norm and given uncertainty sets $\mathcal{U}_1, \ldots, \mathcal{U}_T$, becomes:
$$\min_{\boldsymbol{\beta}_0, V_0} \; \max_{\Delta_1 \in \mathcal{U}_1, \ldots, \Delta_T \in \mathcal{U}_T} \; \sum_{t=1}^{T} \left| y_t - (X_t + \Delta_t)^\top (\boldsymbol{\beta}_0 + V_0 \cdot Z_t) \right|. \tag{6}$$
Compact form of the Adaptive Robust Linear Ensemble The rest of the paper
now focuses on this general and more compact formulation:
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \max_{\Delta \in \mathcal{U}} \; \left\| y - (\tilde{X} + \Delta)\boldsymbol{\beta} \right\|, \tag{7}$$
where Ωadapt is defined as the set of all possible values for β that satisfy the affine decision
rule:
$$\Omega_{\text{adapt}} := \Omega_{\text{adapt}}^{X, y, \tau} := \left\{ \boldsymbol{\beta} \ \text{such that} \ \boldsymbol{\beta}_t = \boldsymbol{\beta}_0 + V_0 \cdot Z_t, \ \boldsymbol{\beta}_0 \in \mathbb{R}^m, \ V_0 \in \mathbb{R}^{m \times m \cdot \tau}, \ \forall t \right\},$$
where $Z_t := \left[ X_{t-\tau} - y_{t-\tau} e \,\middle|\, \ldots \,\middle|\, X_{t-1} - y_{t-1} e \right] \in \mathbb{R}^{m \cdot \tau}$.
Notice that $\Omega_{\text{adapt}}$ is parameterized by $X$, $y$, and $\tau$, but we omit this dependency in the notation.
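As an illustration of this data structure, below is a minimal Julia sketch (ours, not the authors' code) that builds $Z_t$ from a $T \times m$ matrix of member forecasts; the zero-padding of the edge cases $t \leq \tau$ is one arbitrary choice of the extra values $\Delta_0, \ldots, \Delta_{-\tau+1}$:

```julia
# Build Zt = [X_{t-τ} - y_{t-τ}e | ... | X_{t-1} - y_{t-1}e] ∈ R^{mτ}.
# X: T×m matrix of ensemble member forecasts; y: length-T ground truth.
function build_Zt(X::AbstractMatrix, y::AbstractVector, t::Int, τ::Int)
    m = size(X, 2)
    Zt = zeros(m * τ)
    for (k, s) in enumerate(t-τ:t-1)
        if s >= 1                       # zero-pad the edge cases t ≤ τ
            Zt[(k-1)*m+1 : k*m] = X[s, :] .- y[s]
        end
    end
    return Zt
end
```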
Even with the affine decision rule, it is a priori unclear how to conveniently solve the compact
robust optimization problem (7) due to its adversarial nature. Therefore, we now relate it
to equivalent regularized regression problems for several relevant uncertainty sets.
Indeed, in some cases, a robust formulation identifies the adversarial perturbations the
model is protected against with an equivalent regularized problem (Bertsimas and Copen-
haver, 2018). This is convenient for the practitioner as a min-max formulation can be
converted to a minimization problem with a regularizer on the variables.
We consider uncertainty sets of the form:
$$\mathcal{U} = \left\{ \Delta \in \mathbb{R}^{T \times T \cdot m} : \|\Delta\| \leq \lambda \right\},$$
where k·k is some matrix norm or seminorm, and λ > 0. We assume λ is fixed for the
remainder of the paper.
Example 2: Adaptive Ridge The choice of the 2-Frobenius norm $\|\Delta\| := \|\Delta\|_{F_2} = \sqrt{\sum_{i=1}^{T} \sum_{j=1}^{T \cdot m} \Delta_{ij}^2}$ can be interpreted as a Euclidean ball centered at the observed data $\tilde{X}$ with radius $\lambda$, which models a global perturbation in the data.
In this case, using the $\ell_2$ norm and $\mathcal{U}_{F_2} := \{\Delta \in \mathbb{R}^{T \times T \cdot m} : \|\Delta\|_{F_2} \leq \lambda\}$, we can show that the problem
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \max_{\Delta \in \mathcal{U}_{F_2}} \; \left\| y - (\tilde{X} + \Delta)\boldsymbol{\beta} \right\|_2 \tag{8}$$
is equivalent to the regularized regression problem
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \left\| y - \tilde{X}\boldsymbol{\beta} \right\|_2 + \lambda \|\boldsymbol{\beta}\|_2.$$
We now devise the more general equivalence properties in which this example falls.
For a seminorm $g$ and a norm $h$, we define the induced norm
$$\|\Delta\|_{(h,g)} := \max_{\boldsymbol{\beta} \neq 0} \frac{g(\Delta \boldsymbol{\beta})}{h(\boldsymbol{\beta})},$$
and the Frobenius $p$-norm
$$\|\Delta\|_{F_p} := \left( \sum_{i=1}^{T} \sum_{j=1}^{T \cdot m} |\Delta_{ij}|^p \right)^{1/p}.$$
Now, we can apply well-established equivalence results between robust min-max formula-
tions and regularized regressions because we expressed our adaptive regression problems in
a compact form.
Theorem 1 For $p, q \in [1, \infty]$, taking $h = \|\cdot\|_p$ and $g = \|\cdot\|_q$,
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \max_{\Delta \in \mathcal{U}_{(h,g)}} \; \left\| y - (\tilde{X} + \Delta)\boldsymbol{\beta} \right\|_q = \min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \left\| y - \tilde{X}\boldsymbol{\beta} \right\|_q + \lambda \|\boldsymbol{\beta}\|_p,$$
where $\mathcal{U}_{(h,g)} = \left\{ \Delta : \|\Delta\|_{(h,g)} \leq \lambda \right\}$.
The proof relies on the following lemma.
Lemma 1 For a seminorm $g$, a norm $h$, and $\mathcal{U} = \{\Delta : \|\Delta\|_{(h,g)} \leq \lambda\}$,
$$\max_{\Delta \in \mathcal{U}} \; g(z + \Delta \boldsymbol{\beta}) = g(z) + \lambda h(\boldsymbol{\beta}).$$
The triangle inequality directly gives the first side of the equality: for any $\Delta \in \mathcal{U}$, $g(z + \Delta \boldsymbol{\beta}) \leq g(z) + \|\Delta\|_{(h,g)}\, h(\boldsymbol{\beta}) \leq g(z) + \lambda h(\boldsymbol{\beta})$.
We next show that there exists some $\Delta \in \mathcal{U}$ so that $g(z + \Delta \boldsymbol{\beta}) = g(z) + \lambda h(\boldsymbol{\beta})$. Let $v \in \mathbb{R}^n$ so that $v \in \arg\max_{h^*(v) = 1} \boldsymbol{\beta}^\top v$, where $h^*$ is the dual norm of $h$. Note in particular that $\boldsymbol{\beta}^\top v = h(\boldsymbol{\beta})$ by the definition of the dual norm $h^*$. For now, suppose that $g(z) \neq 0$. Define the rank-one matrix $\hat{\Delta} := \frac{\lambda}{g(z)} z v^\top$. We observe that
$$g(z + \hat{\Delta} \boldsymbol{\beta}) = g\left( z + \frac{\lambda h(\boldsymbol{\beta})}{g(z)} z \right) = \frac{g(z) + \lambda h(\boldsymbol{\beta})}{g(z)} \, g(z) = g(z) + \lambda h(\boldsymbol{\beta}).$$
Moreover,
$$\|\hat{\Delta}\|_{(h,g)} = \max_{\boldsymbol{\beta}' \neq 0} \frac{g(\hat{\Delta} \boldsymbol{\beta}')}{h(\boldsymbol{\beta}')} = \lambda \max_{\boldsymbol{\beta}' \neq 0} \frac{|v^\top \boldsymbol{\beta}'|}{h(\boldsymbol{\beta}')} \leq \lambda,$$
where the final inequality follows by definition of the dual norm. Hence $\hat{\Delta} \in \mathcal{U}$, as desired.
Theorem 1 follows by applying Lemma 1 with the norm $\|\cdot\|_q$ as $g$ and the norm $\|\cdot\|_p$ as $h$, and passing the equality through the min.
Corollary 1 For $p \in [1, \infty]$,
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \max_{\Delta \in \mathcal{U}_{F_p}} \; \left\| y - (\tilde{X} + \Delta)\boldsymbol{\beta} \right\|_p = \min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \left\| y - \tilde{X}\boldsymbol{\beta} \right\|_p + \lambda \|\boldsymbol{\beta}\|_{p^*},$$
where $\frac{1}{p} + \frac{1}{p^*} = 1$ and $\mathcal{U}_{F_p} := \{\Delta : \|\Delta\|_{F_p} \leq \lambda\}$.
This corollary also relates to the general robust regression results of Xu et al. (2008), which, for uncertainty sets described by convex functions $f_1, \ldots, f_k$, yield regularizers of the form
$$v(\lambda, \kappa, \boldsymbol{\beta}) := \max_{c \in \mathbb{R}^m} \left[ (\kappa + |\boldsymbol{\beta}|)^\top c - \sum_{j=1}^{k} \lambda_j f_j(c) \right].$$
By taking
$$\mathcal{U}' = \mathcal{U}_{F_p} = \left\{ (\delta_1, \ldots, \delta_m) : \left\| \left( \|\delta_1\|_p, \ldots, \|\delta_m\|_p \right) \right\|_p \leq \lambda \right\}$$
and constraining $\boldsymbol{\beta} \in \Omega_{\text{adapt}}$, we recover our Corollary 1.
3.5 Predictions in Real-Time
The adaptive ensemble method requires training and validation before deployment. We
recommend separating the available forecast data into training and validation sets to select
the two hyperparameters: the regularization factor λ and the window size of past data τ
to include in the affine decision rule. Note that there must be no sample overlap between the data used to train the ensemble members and the data used to train the ensemble model, to avoid the biases of potential overfitting and to ensure the ensemble learns the models' behavior on held-out data.
Once the previous Adaptive Robust Ensemble problem (7) has been solved to optimality
using the equivalent formulations, one can make predictions at time T + k, using the affine
decision rule:
$$\hat{y}_{T+k} = X_{T+k}^\top \boldsymbol{\beta}_{T+k} = X_{T+k}^\top \left( \boldsymbol{\beta}_0^* + V_0^* Z_{T+k} \right), \quad \forall k \geq 1.$$
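As a small Julia sketch (ours) of this prediction step, reusing the hypothetical build_Zt helper from the sketch in Section 3:

```julia
using LinearAlgebra

# Predict ŷ_t with the learned affine rule (β0*, V0*): the rule only needs
# the τ most recent forecast errors, which are all observed by time t.
predict(X, y, t, τ, β0, V0) = dot(X[t, :], β0 + V0 * build_Zt(X, y, t, τ))
```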
Below, we summarize the overall training, validation, and testing mechanism of the adaptive
ridge formulation in Algorithm 1.
4. Synthetic Experiments
Our synthetic experiments aim to identify the suitable conditions where our adaptive en-
semble method can have the edge over competitive methods and provide guidance and
intuition on the hyperparameters to use.
We focus our experiments and the rest of the paper on the Adaptive Ridge problem:
$$\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}} \; \left\| y - \tilde{X}\boldsymbol{\beta} \right\|_2 + \lambda \|\boldsymbol{\beta}\|_2,$$
where we remind that $\Omega_{\text{adapt}}$ corresponds to the previous affine decision rules on each $\boldsymbol{\beta}_t$: $\boldsymbol{\beta}_t = \boldsymbol{\beta}_0 + V_0 \cdot Z_t, \ \forall t$.
We investigate the dynamics of the different ensembles’ performances with respect to the
following:
Algorithm 1: Training, validation, and testing of the adaptive ridge ensemble
1: $X_{\text{train-val}}, X_{\text{test}}, y_{\text{train-val}}, y_{\text{test}} \leftarrow \text{train\_test\_split}(X, y, \text{train\_size}, \text{shuffle} = \text{false})$
2: $X_{\text{train}}, X_{\text{val}}, y_{\text{train}}, y_{\text{val}} \leftarrow \text{train\_test\_split}(X_{\text{train-val}}, y_{\text{train-val}}, \text{val\_size}, \text{shuffle} = \text{false})$
3: $\boldsymbol{\beta}_{0_{i,w_t}}^*, V_{0_{i,w_t}}^* \leftarrow \arg\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}^{w_t}} \left\| y_{\text{train}} - \tilde{X}_{\text{train}} \boldsymbol{\beta} \right\|_2 + \lambda_i \|\boldsymbol{\beta}\|_2, \ \forall i \in (1, p), \forall t \in (1, T)$
   ▷ Find the adaptive rule coefficients for all hyperparameter combinations.
4: $\boldsymbol{\beta}_{i,w_t}^{\text{val}} \leftarrow \text{get\_adaptive\_coefs}(X_{\text{val}}, y_{\text{val}}, \boldsymbol{\beta}_{0_{i,w_t}}^*, V_{0_{i,w_t}}^*)$
   ▷ Compute the adaptive coefficients on the validation set.
5: $i^*, t^* \leftarrow \arg\min_{i \in (1,p),\, t \in (1,T)} M\left( y_{\text{val}}, \tilde{X}_{\text{val}} \boldsymbol{\beta}_{i,w_t}^{\text{val}} \right)$
   ▷ Determine the best hyperparameter combination based on validation performance.
6: $\boldsymbol{\beta}_0^*, V_0^* \leftarrow \arg\min_{\boldsymbol{\beta} \in \Omega_{\text{adapt}}^{w_{t^*}}} \left\| y_{\text{train-val}} - \tilde{X}_{\text{train-val}} \boldsymbol{\beta} \right\|_2 + \lambda_{i^*} \|\boldsymbol{\beta}\|_2$
   ▷ Retrain the best model on the training and validation data combined.
7: $\boldsymbol{\beta}^* \leftarrow \text{get\_adaptive\_coefs}(X_{\text{test}}, y_{\text{test}}, \boldsymbol{\beta}_0^*, V_0^*)$
   ▷ Compute the adaptive coefficients on the test set.
8: $\text{results} \leftarrow M\left( y_{\text{test}}, \tilde{X}_{\text{test}} \boldsymbol{\beta}^* \right)$
   ▷ Compute the results on the test set.
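As a complement to Algorithm 1, here is a hedged Julia sketch (ours, not the paper's code) of the fitting step. For simplicity it solves a squared-penalty variant that penalizes the decision-rule coefficients $\theta = (\boldsymbol{\beta}_0, \text{vec}(V_0))$ directly, which admits a closed-form solution; the exact non-squared robust equivalent would instead be solved as a conic program, e.g., with JuMP and Gurobi as in the paper. The sketch uses the identity $X_t^\top V_0 Z_t = \text{vec}(V_0)^\top (Z_t \otimes X_t)$ and the hypothetical build_Zt helper from above:

```julia
using LinearAlgebra

# Fit the affine decision rule β_t = β0 + V0*Zt by ridge regression on the
# augmented features w_t = [X_t; kron(Zt, X_t)], since the prediction
# X_t'*(β0 + V0*Zt) is linear in (β0, vec(V0)).
function fit_adaptive_ridge(X, y, τ, λ)
    T, m = size(X)
    W = zeros(T, m + m * m * τ)
    for t in 1:T
        Zt = build_Zt(X, y, t, τ)
        W[t, :] = vcat(X[t, :], kron(Zt, X[t, :]))
    end
    θ = (W' * W + λ * I) \ (W' * y)       # ridge normal equations
    β0 = θ[1:m]
    V0 = reshape(θ[m+1:end], m, m * τ)    # inverse of column-major vec
    return β0, V0
end
```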
• The number of samples available for model training Ttrain ,
• The number of past time steps τ the models can use in the decision rule.
4.1.1 Metrics
Overall, we reproduced each experiment 30 times and evaluated the different ensemble
methods with the average and standard deviation of the following metrics:
• To evaluate accuracy: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE).
In high-stakes machine learning applications, performing well both on average and in challenging scenarios is crucial. Therefore, we also evaluated the Conditional Value at Risk $\text{CVaR}_\alpha$ (Rockafellar and Uryasev, 2000), which measures the expected loss once the $\alpha$ Value at Risk (VaR) breakpoint has been breached.
Consider $T$ forecasting cases, where the ground truth at time step $t$ is $y_t$ and the ensemble method prediction is denoted $\hat{y}_t$. The metrics are calculated as follows:
$$\text{MAE}(y, \hat{y}) := \frac{1}{T} \sum_{t=1}^{T} |y_t - \hat{y}_t|, \qquad \text{RMSE}(y, \hat{y}) := \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2},$$
$$\text{MAPE}(y, \hat{y}) := \frac{100}{T} \sum_{t=1}^{T} \left| \frac{y_t - \hat{y}_t}{y_t} \right|, \qquad \text{CVaR}_\alpha(y, \hat{y}) := \min_{\tau} \; \tau + \frac{1}{\alpha T} \sum_{t=1}^{T} \max(0, |y_t - \hat{y}_t| - \tau).$$
Notice that CVaRα in its discrete form is a simple optimization problem that we solve using
Julia.
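A minimal Julia sketch of this computation (our illustration of one possible implementation; the closed-form minimizer sets $\tau^*$ at the empirical VaR, i.e., the $(1-\alpha)$-quantile of the absolute errors):

```julia
# Discrete CVaR_α of absolute errors (Rockafellar–Uryasev form): equals the
# average of the worst α-fraction of |y - ŷ|, with a fractional weight on the
# boundary scenario when αT is not an integer.
function cvar(y, ŷ, α)
    L = sort(abs.(y .- ŷ); rev=true)      # largest errors first
    n = α * length(L)
    k = floor(Int, n)
    tail = sum(L[1:k]) + (n - k) * (k < length(L) ? L[k+1] : 0.0)
    return tail / n
end
```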
Ground Truth For all experiments, we used the same ground truth data: a univari-
ate time series of 4,000 time steps. We generated a periodic signal with some normally
distributed noise:
$$y(t) = \sin\left( \frac{2\pi}{500} \cdot t \right) + \epsilon(t), \ \forall t \in [4000], \qquad \epsilon(t) \sim \mathcal{N}(0, 0.1).$$
Formally, at each time step $t$, we generated the values $\tilde{X}_k(t)$ of each ensemble member $k \in [m]$ as:
$$\tilde{X}_k(t) = y(t) + \epsilon_k(t), \ \forall t \in [4000], \qquad \epsilon_k(t) \sim \mathcal{N}(b_k, \sigma_k),$$
where we sampled $b_1, \ldots, b_m$ and $\sigma_1, \ldots, \sigma_m$ independently beforehand as:
$$b_k \sim \text{Uniform}(-0.5, 0.5), \qquad \sigma_k \sim \text{Uniform}(0, 0.5).$$
Adding Drift to the Ensemble Members’ Forecasts To test the capacity of the
ensembles to perform well against drifting forecasts, we additionally simulate a temporal
change in the error distributions of each ensemble member.
Therefore, at each time step t, our final forecast values Xk (t) of each ensemble member
k ∈ [m] are defined as:
$$X_k(t) = \tilde{X}_k(t) + \frac{t}{4000} \cdot \text{drift}_k(t), \ \forall t \in [4000], \qquad \text{drift}_k(t) \sim \mathcal{N}(b_k', \sigma_k'),$$
where we selected the error biases $b_1', \ldots, b_m'$ and error standard deviations $\sigma_1', \ldots, \sigma_m'$ beforehand as:
$$b_k' \sim \mathcal{N}(0, \sigma_{\text{drift}}), \qquad \sigma_k' \sim \text{Uniform}(0, s_{\text{drift}}).$$
We conducted experiments with different σdrift , sdrift , ranging from 0 to 1. Again, since we
repeated experiments 30 times across different seeds, we resampled b0k , σk0 for each experi-
ment.
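For reference, a compact Julia sketch of this generation process (our reconstruction; the function name, RNG seed, and default arguments are our own choices):

```julia
using Random, Distributions

# Generate the noisy sine ground truth and m drifting ensemble members,
# following the formulas above. Returns (X, y) with X a T×m forecast matrix.
function make_synthetic(m; T=4000, σ_drift=0.5, s_drift=0.5, rng=MersenneTwister(1))
    y = [sin(2π * t / 500) + 0.1 * randn(rng) for t in 1:T]   # ground truth
    b  = rand(rng, Uniform(-0.5, 0.5), m)                     # member biases b_k
    σ  = rand(rng, Uniform(0.0, 0.5), m)                      # member stds σ_k
    b′ = rand(rng, Normal(0.0, σ_drift), m)                   # drift biases b'_k
    σ′ = rand(rng, Uniform(0.0, s_drift), m)                  # drift stds σ'_k
    X = [y[t] + b[k] + σ[k] * randn(rng) +
         (t / T) * (b′[k] + σ′[k] * randn(rng)) for t in 1:T, k in 1:m]
    return X, y
end
```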
Training, Validation, Test Splits We split the data chronologically into training (50%
of the data, i.e., t ∈ [1, 2000]), validation (25% of the data, i.e., t ∈ [2001, 3000]), and test
(25% of the data, i.e., t ∈ [3001, 4000]) sets.
Table 9 in the Appendix summarizes all the different hyperparameters involved in the synthetic data experiments.
Along with the Adaptive Ridge method, we evaluated the following ensembles on the same
data:
• Best Model in Hindsight: we determine in hindsight what was the best ensemble
member on the test data with respect to the MAPE and report its performance for all
metrics. Notice that in real-time, it is impossible to know which model would be the
best on the overall test set, which means the best model in hindsight is a competitive
benchmark.
• Ensemble Mean: consists of weighing each model equally, predicting the average of
all ensemble members at each time step.
• Exp3 (Cesa-Bianchi and Lugosi, 2006): under the multi-armed bandit setting, Exp3 weighs the different models to minimize the regret compared to the best model so far (see the sketch after this list). The update rule is given by:
$$\beta_{t+1}^i = \frac{\exp\left( -\eta_t \cdot \text{Regret}_t^i \right)}{\sum_{i=1}^{m} \exp\left( -\eta_t \cdot \text{Regret}_t^i \right)}, \quad \text{with} \quad \text{Regret}_t^i = \sum_{s=t-t_0}^{t} (y_s - X_s^i)^2, \ \forall i \in [1, m], \quad \text{and} \quad \eta_t = \sqrt{\frac{8 \log(m)}{t_0}},$$
where the window size t0 considered to determine the regularized leader is tuned.
• Ridge (Hoerl and Kennard, 1970): consists in learning the best static linear combination of ensemble members by solving a ridge problem on the forecasts $X_t$:
$$\min_{\boldsymbol{\beta}} \; \sum_{t=1}^{T} \left( y_t - X_t^\top \boldsymbol{\beta} \right)^2 + \lambda \|\boldsymbol{\beta}\|_2^2.$$
• Passive-Aggressive (PA) (Crammer et al., 2006): an online learning algorithm that updates the linear combination weights after each new observation, with an aggressiveness parameter PA controlling the magnitude of the updates.
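Below is the Julia sketch (ours, not the paper's code) of the sliding-window exponential weighting described in the Exp3 bullet above:

```julia
# Sliding-window exponential weighting: weight each member by its recent
# squared-error "regret" over a window of size t0, as in the update rule above.
function exp3_weights(X, y, t, t0)
    m = size(X, 2)
    η = sqrt(8 * log(m) / t0)
    window = max(1, t - t0):t
    loss = [sum((y[s] - X[s, i])^2 for s in window) for i in 1:m]
    w = exp.(-η .* loss)
    return w ./ sum(w)                    # weights sum to one
end

# Usage: the ensemble prediction at time t+1 is the weighted member average,
# e.g., sum(exp3_weights(X, y, t, t0) .* X[t+1, :]).
```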
For each experiment, we performed a grid search in $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ on the validation set to tune the value of the regularization factor $\lambda$ of the adaptive ridge formulation, $\lambda_{\text{ridge}}$ for the ridge formulation, and PA for the Passive-Aggressive algorithm. For each seed,
we selected the parameter that led to the lowest average MAE. Then, we retrained the mod-
els on the training and validation sets combined, using the tuned values, and evaluated the
different metrics on the held-out test set.
We wrote all code in Julia 1.6 (Bezanson et al., 2017), using the JuMP package (Dunning
et al., 2017) to write optimization functions and Gurobi (Gurobi Optimization, LLC, 2022)
as the solver. We performed each individual experiment reported in this paper using 4 Intel
Xeon Platinum 8260 CPU cores from the Supercloud cluster (Reuther et al., 2018).
These experiments compare the performance of the different ensemble techniques with re-
spect to the number of ensemble members available.
We fixed the drift parameters of the error distributions to σdrift = 0.5, sdrift = 0.5, and the
number of past time steps to use to τ = 5.
Figure 2: Average performance across 30 seeds of the different ensemble methods with respect to the number of ensemble members (panels: MAE, RMSE, MAPE, and CVaR). We added in light color the corresponding one-standard-deviation intervals around the average.
The results with all metrics in Figure 2 show the same trend. We make the following
conclusions:
• The performance increases when more models can be used by the ensemble, which makes sense since there are more opportunities to uncover the underlying ground truth by learning how the different models make their errors.
• There is a rapid increase in performance going from 3 ensemble members to 10. However, after 15 ensemble members, the performance stabilizes and does not improve anymore, or becomes slightly worse. This suggests that a high number of forecasters does not necessarily translate into better performance; instead, a higher number of models can lead to overfitting the training set.
• Adaptive ridge is consistently the best method and outperforms the classic ridge
formulation, which is static. The PA model also performs well. These methods
clearly outperform the mean baseline and show that a weighted linear combination
can significantly improve upon the best model of the ensemble.
• The Exp3 algorithm indeed achieves lower errors than the best model but does not
compete with the PA algorithm and the ridge formulations.
These experiments aim to determine how well methods perform when the ensemble members
can significantly drift across time.
We fixed the number of ensemble members to m = 10 and the window size of past forecasts
the adaptive part can use to τ = 5.
• When there is no or very little drift, ridge and adaptive ridge perform very similarly,
which is expected since the adaptive part is mainly useful to leverage temporal trends.
• When the ensemble members can drift more than 0.2, ridge starts performing worse
than the PA algorithm, which is again expected since the trained weights may differ
from the ensemble members’ performance trend on the test set.
• On the other hand, adaptive ridge maintains its lead over PA since it can leverage recent trends.
• The baseline methods quickly start to perform terribly with high drift. At the same time, the PA and adaptive ridge metrics maintain a seemingly sublinear relationship with respect to the amount of drift.
Discrete Gaussian Drift We also tested the models with a different type of noise, where
the Gaussian drift happens or not according to a Bernoulli variable.
For this experiment only, our final forecast values Xk (t) of each ensemble member k ∈ [m]
are defined as:
$$X_k(t) = \tilde{X}_k(t) + b_k(t) \cdot \text{drift}_k(t), \ \forall t \in [4000], \qquad b_k(t) \sim \text{Bernoulli}(p_{\text{drift}}), \qquad \text{drift}_k(t) \sim \mathcal{N}(b_k', \sigma_k'),$$
Figure 3: Average performance across 30 seeds of the different ensemble methods with respect to the ensemble members' possible drift (panels: MAE, RMSE, MAPE, and CVaR). We added in light color the corresponding one-standard-deviation intervals around the average.
where we selected the error biases $b_1', \ldots, b_m'$ and error standard deviations $\sigma_1', \ldots, \sigma_m'$ beforehand as:
$$b_k' \sim \mathcal{N}(0, 0.5), \qquad \sigma_k' \sim \text{Uniform}(0, 0.5).$$
We tested several values of the Bernoulli parameter $p_{\text{drift}}$ ranging from 0 to 1. A value of 0 means no drift is added, and the ensemble members' errors follow their initial distribution; a value of 1 means the drift is always added, shifting the ensemble members' errors to a new distribution. Any value in between means the errors oscillate between two error distributions.
Figure 4: Average performance across 30 seeds of the different ensemble methods with respect to the ensemble members' possible drift (panels: MAE, RMSE, MAPE, and CVaR). We added in light color the corresponding one-standard-deviation intervals around the average.
• The ridge method cannot compensate for the additional discrete noise as well as the adaptive method does.
• Adaptive ridge has the most stable performance across the different regimes, which is
expected since it is a robust formulation.
• For all methods, it is more difficult to adjust to oscillating error distributions than to
a fixed error distribution, even if the errors are higher in the second distribution.
These experiments aim to determine the performance of adaptive ridge with respect to the
window size of past forecasts it can use at each time step.
We fixed the number of ensemble members to m = 10, the drift parameters of the error
distributions to σdrift = 0.5, sdrift = 0.5. We vary τ ∈ [1, 25].
• A large window size deteriorates the performance, as the adaptive ridge model overfits.
• Between a window size of 2 and 8 time steps, the performance is stable, with a sweet spot around 3-4 past time steps. This suggests the importance of tuning this value for the practitioner.
Contrary to all previous experiments, we varied the amount of training and validation data available, in particular to evaluate how the different ensemble methods cope with fewer samples. The test set remained the same as previously for all experiments. We split
the training and validation data so that the test set immediately follows those samples. We
varied the number of samples between 100 and 3000 by fixing the validation data size to
one-third of the data available (e.g., for data size = 750, we get training data size = 500,
validation data size = 250).
We fixed the number of ensemble members to m = 10, the drift parameters of the error
distributions to σdrift = 0.5, sdrift = 0.5, and the number of past time steps to use to τ = 5.
• In the small data regime, with under 500 samples available for training, PA is the best
method as it does not rely on any training data. Ridge closely follows and bridges the
gap at around 300 samples. The adaptive ridge method suffers from the lack of data
as it overfits the training set.
• With more data available, the adaptive ridge method closes the gap at 500 - 750
samples and then outperforms all other methods.
• After 750 samples, the performance of all methods remains stable, with adaptive ridge
being the best, followed by PA, and then ridge.
Figure 5: Average performance across 30 seeds of the different ensemble methods with
respect to the window size of past data that can be used. We added in light color the
corresponding one-standard deviation intervals around the average.
These experiments highlight that the adaptive framework is primarily suitable when suffi-
cient data is available to determine adaptive coefficients that generalize well enough.
Figure 6: Average performance across 30 seeds of the different ensemble methods with respect to the number of training samples available (panels: MAE, RMSE, MAPE, and CVaR). We added in light color the corresponding one-standard-deviation intervals around the average.
• Adaptive ridge has a significant edge over ridge when sufficient data is available and
when ensemble members may suffer from performance drift.
• Adaptive ridge is not suitable when there is little data available. In that case, a
purely online method such as PA should be preferred, or a simpler static method such
as ridge.
• We highlight the importance of validating the different hyperparameters of the adap-
tive ridge method: the regularization factor λ and the window size τ of past forecasts
to use in the adaptive term. Adaptive ridge, due to its additional variables, may be
sensitive to overfitting in certain situations (e.g., large window size, small training
data size).
• Ensemble methods such as the Ensemble Mean and Exp3 constrain the models' weights to sum to 1, while ridge and adaptive ridge can correct forecast biases with negative weights, offering greater accuracy and robustness than the best model in hindsight.
5. Real-World Applications
This section illustrates the benefits of our adaptive ensemble method with real-world data.
We investigate three different applications where the time series characteristics differ (see
Figure 7):
1. Air pollution management through wind speed forecasting: the time series exhibits a
daily cyclical behavior and a long-term seasonality.
2. Energy use forecasting: the time series exhibits bursts a few hours a week.
3. Tropical cyclone intensity forecasting: the time series are shorter and smoother, but
the underlying system is chaotic and complicated to predict.
5.1 Air Pollution Management through Wind Speed Forecasting
Motivation Air pollution is a pressing concern with far-reaching consequences for human
health, the environment, and ecosystems. The emission of toxic substances from chemical
factories poses a significant risk to nearby populations, mainly when meteorological condi-
tions carry the pollution toward populated areas. It is crucial to accurately predict weather
conditions to inform real-time, effective pollution management strategies to mitigate this
risk. This forms the basis of our first use case: predicting wind speed for air pollution
management. Our data comes from a phosphate production site in Morocco, the largest
chemical industry plant in the country. Previously, Bertsimas et al. (2023) demonstrated with this plant the effectiveness of a data-driven approach, including a weighted average ensemble model, in reducing the diffusion of air pollution from industrial plants into nearby cities, providing a valuable use case to show how our adaptive ensembles can further improve robustness and accuracy.
Air Pollution Management Pipeline The phosphate production site is located 10km
southwest of Safi city, Morocco, threatening the health and well-being of the 300,000 res-
idents. With this population in close proximity, weather conditions play a critical role in
determining air pollution dispersion. The site therefore implemented a comprehensive mon-
itoring procedure that includes planning production rates and shutdowns based on 48-hour
meteorological forecasts, as well as real-time wind monitoring systems to detect dangerous
conditions and stop production. The timely and accurate wind speed prediction in the next
hour is crucial for the effective functioning of this procedure.
Figure 7: Subsamples of the time series for our three real-world use cases: wind speed at the Safi factories (1-hour intervals, over 30 days), energy use by appliances (in Wh, over 30 days), and tropical cyclone intensity (in knots, over 10 days).
5.1.1 Data
Ensemble Members The Safi operations team receives the local official weather forecast
bulletins from the Moroccan weather agency daily around 6:00 am GMT, with an update
around 6:00 pm GMT. The wind speed forecasts are provided hourly for the next 48 hours.
Besides this operational forecast, the different models used by Bertsimas et al. (2023) are
XGBoost, Decision Trees, Optimal Regression Trees, Lasso Regression, and Ridge Regres-
sion. These models took as input the recent meteorological data observations.
We have access to the 8,494 historical one-hour lead time wind speed forecasts made by these 6 models for every hour between 2019 and 2022. The ground truth was measured by an on-site sensor that collected data every minute; measurements were then averaged hourly on the Safi platform.
Experiments Protocol We split the data chronologically into training (50%), validation
(20%), and test (30%) sets. We standardized all data (targets and features) by subtracting
the mean of the training targets and dividing by the standard deviation of the training
targets.
After selecting the best hyperparameter combination for each ensemble method using the
validation set (see the Appendix for more details), we retrained the models on the training
and validation sets combined. We then evaluated them on the test set in the same way we
conducted the synthetic experiments.
5.1.2 Results
Table 2 compares the performance of the different ensemble methods with the same metrics
described previously.
Adaptive ridge consistently provides the best performance across all metrics, improving over
the best model in hindsight by 8% in MAE, 17% in RMSE, 7% in MAPE, 26% in CVaR
5%, and 14% in CVaR 15%. The adaptive ridge can substantially reduce the worst-case
errors, which is critical for the success of the pollution management pipeline.
In comparison, the other ensemble methods fail to outperform the best model in hindsight
consistently. As expected, Exp3 provides comparable performance to the best model in
hindsight: 0.510 vs. 0.507 in MAE, 0.690 vs. 0.756 in RMSE, and more robust results: 1.87
vs. 2.27 in CVaR 5%, 1.36 vs. 1.46 in CVaR 15%.
Adaptive ridge outperforms ridge substantially on all metrics, notably the robustness, by
17% in CVaR 5% and 12% in CVaR 15%, which is of substantial interest to the plant
operators to prepare better against incoming dangerous weather conditions.
The ability of regression coefficients to adapt over time is demonstrated in Figure 8, which
plots a subset of these values over a period of two weeks. The rate of change for the
coefficients varies among models, with some displaying more significant shifts than others.
Furthermore, as depicted in Figure 9, the coefficients can take on both positive and negative values. A daily cyclical pattern is also observed, which aligns with the cyclical nature of the forecast errors (due itself to the cyclical nature of the weather data).

Table 2: Results for each metric and ensemble method on the test set for the wind speed forecasting task on the Safi site. Results are given in km/h.
It is worth noting that the predictions of some models may compensate for one another,
particularly when there is a high degree of correlation among them. In such cases, prac-
titioners may want to consider selecting a subset of models before utilizing an adaptive
ensemble, and potentially impose an additional constraint that the sum of their coefficients
is equal to 1, depending on the specific application.
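Under the affine decision rule, one way to impose such a normalization (our derivation, not stated in the paper) is to constrain the rule's coefficients directly:

```latex
% Sum-to-one weights at every time step under the affine decision rule:
% e'β_t = e'β_0 + (e'V_0) Z_t = 1 for all Z_t whenever
e^\top \boldsymbol{\beta}_0 = 1 \quad \text{and} \quad e^\top V_0 = 0,
% i.e., β_0 sums to one and each column of V_0 sums to zero.
```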
Figure 8: Evolution every hour of the regression coefficient values for each ensemble member.
5.2 Energy Consumption Forecasting
Due to fast urbanization, rising population, and increased social needs, the energy demand
in the building sector has expanded dramatically over the past few decades. In particu-
lar, appliances in residential buildings represent a substantial part of the electrical energy demand, approximately 30% (EIA, 2022). Numerous studies have investigated appliances' energy use in buildings and developed forecasting models of electrical energy consumption to guide multiple applications, such as detecting abnormal energy use patterns, determining adequate sizing of alternative energy sources and energy storage to diminish power flow into the grid, informing energy management systems for load control, demand-side management, and demand-side response, and predicting electricity prices (e.g., Pham et al. (2020); Olu-Ajayi et al. (2022); Zaffran et al. (2022)). According to Colmenar-Santos et al. (2013),
Figure 9: Coefficient values of each ensemble member with respect to their errors (one panel per ensemble member; x-axis: model errors, y-axis: coefficient value).
the availability of a building energy system with precise forecasting could save between 10%
and 30% of all energy consumed. Overall, as accurate and robust forecasting can be impact-
ful, we investigate how adaptive ridge compares with the other ensemble techniques using
real-world data on appliance energy consumption provided by Candanedo et al. (2017).
5.2.1 Data
The dataset is publicly available at the University of California Irvine (UCI) Machine Learn-
ing repository1 . First, we trained and selected our own forecasting models to constitute the
ensemble members before comparing the different ensemble methods.
1. https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction
Table 3: Results for each metric and ensemble method on the test set for the energy
consumption task. Results are given in Wh.
We used the Python package Lazy Predict2 to conveniently train and cross-validate 36 different forecasting models using the training set only. Models trained included numerous tree-based models, linear regressions, support vector machines, neural networks, and various other methods. Among the models trained, we selected the best 10 with respect to MAE on validation set 1 to constitute our final ensemble members. The best 10 models were Ordinary Least Squares (OLS), Ridge, Bayesian Ridge, ElasticNet, Huber Regressor, Least Angle Regression (LARS), Lasso, LassoLARS, Linear Support Vector Regression, and Orthogonal Matching Pursuit.
We collected the predictions of these different models on the validation sets and test sets
and used validation set 1 as training data for the ensembles, validation set 2 as validation
data for the ensembles, and the test set to evaluate the results.
5.2.2 Results
The results presented in Table 3 demonstrate the superiority of the adaptive ridge method
compared to other ensemble techniques. It is the only ensemble method that outperformed
the best model in hindsight in MAE and RMSE, showing a 15% improvement in MAE and
a 26% improvement in RMSE. The MAPE performance of adaptive ridge was similar to
the best ensemble member in hindsight (23.5% vs. 23.2%), while the second-best ensemble
method, Exp3, achieved a MAPE of 27.6%.
The energy consumption target values in this dataset displayed significant bursts, as shown
previously in Figure 7, making it difficult for all ensemble methods to accurately capture
such events, as evidenced by the poor CVaR scores. However, the adaptive ridge method
demonstrated its ability to effectively leverage the errors of each ensemble member, resulting
in more robust solutions than any other method. Compared to the best ensemble member, the improvement was 27% in CVaR 5% and 28% in CVaR 15%, while ridge showed an improvement of 7% in CVaR 5% and 8% in CVaR 15%.
2. https://lazypredict.readthedocs.io/
5.3 Tropical Cyclone Intensity Forecasting
5.3.1 Data
We collected the different operational and machine learning intensity forecasts provided by
Boussioux et al. (2022)3 . We used the forecasts made on their test set, corresponding to the
years 2016-2019. They obtained operational forecast data from the Automated Tropical
Cyclone Forecasting (ATCF) data set maintained by the NHC (Sampson and Schrader,
2000; National Hurricane Center, 2021). The ATCF data contains historical forecasts by
operational models used by the NHC for its official forecasting of tropical and subtropical
cyclones in the North Atlantic and Eastern Pacific basins.
Table 4 lists the 7 forecast models included as ensemble members and the 2 operational
benchmark models and summarizes the predictive methodologies employed.
3. https://github.com/leobix/hurricast
Table 4: Summary of all tropical cyclone intensity forecasting models used as ensemble members or benchmark.

Model ID | Model name | Model type | Used as
HUML-(stat, xgb) | Hurricast (tabular data) | Gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/td) | Hurricast (tabular and vision data) | Feature extraction with tensor decomposition + gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/cnn/gru) | Hurricast (tabular and vision data) | Feature extraction with deep learning (CNN, GRU) + gradient-boosted trees | Ensemble member
HUML-(stat/viz, xgb/cnn/transfo) | Hurricast (tabular and vision data) | Feature extraction with deep learning (CNN, Transformers) + gradient-boosted trees | Ensemble member
Decay-SHIPS | Decay Statistical Hurricane Intensity Prediction Scheme | Statistical-dynamical | Ensemble member
GFSO | Global Forecast System model | Multi-layer global dynamical | Ensemble member
HWRF | Hurricane Weather Research and Forecasting model | Multi-layer regional dynamical | Ensemble member
FSSE | Florida State Super Ensemble | Corrected consensus | Benchmark model
OFCL | Official NHC Forecast | Consensus | Benchmark model
Overall, our ensemble members comprise four deep learning models (Hurricast), one statistical-
dynamical model (Decay-SHIPS), and two dynamical models (GFSO, HWRF). We bench-
mark our ensemble methods against the forecasts made by FSSE and OFCL, which are
the highest-performing consensus models used by the NHC. The technical details of all the
models can be found in Cangialosi (2020) and Boussioux et al. (2022).
Tropical cyclones are temporary events: the forecast data comes as several time series of different lengths instead of only one. For the adaptive ridge method, we treat all samples from each hurricane as independent samples in the training set, and we build $Z_t$ as usual. For the bandit and PA methods, for each new tropical cyclone, we reuse the latest updated weights computed on the previous tropical cyclone.
Training, Validation, Test Splits The forecast data comprises 870 samples for the
North Atlantic Basin and 849 for the Eastern Pacific basin. We split the data chronologically
into training (50% of the data), validation (20% of the data), and test (30% of the data)
sets.
5.3.3 Results
Table 5 compares the different ensemble methods. The adaptive ridge and ridge methods
outperform all other ensembles, including the current operational models OFCL and FSSE,
which is of significant interest for operational use due to its potential to improve the official
forecasts issued by the NHC. Notably, the good performance of adaptive ridge in terms of CVaR is particularly relevant for tropical cyclone forecasting, as it may decrease the worst
errors, which are the ones leading to poor decision-making on the ground. For instance,
adaptive ridge outperforms the official forecast OFCL by 5% (resp. 21%) on the North
Atlantic (resp. Eastern Pacific) basin in CVaR 15% and by 12% (resp. 17%) in MAE.
The Ensemble Mean and Exp3 methods perform similarly to the operational consensus
FSSE and OFCL and generally improve slightly over the best model in hindsight. Adaptive
ridge has the best performance overall across basins and metrics, with only ridge slightly
surpassing it in CVaR on the North Atlantic basin. The adaptive component provides a
valuable performance boost compared to ridge only, such as an MAE gain of 4% on the
North Atlantic basin and 9% on the Eastern Pacific basin.
Table 5: Results in knots for each metric and ensemble method for the 24-hour lead time intensity forecasting task. Bold values highlight the best performance for each metric.

North Atlantic Basin (comparison on 261 cases):
Ensemble Method | MAE | RMSE | MAPE (%) | CVaR 5% | CVaR 15%
Best Model in Hindsight | 9.2 | 12.2 | 14.2 | 31.4 | 24.6
Ensemble Mean | 9.4 | 12.8 | 14.0 | 35.2 | 25.6
Exp3 | 8.8 | 11.5 | 13.3 | 30.0 | 22.6
Passive-Aggressive | 9.1 | 12.4 | 13.9 | 34.6 | 25.3
FSSE | 8.8 | 11.6 | 13.8 | 31.0 | 23.2
OFCL | 8.4 | 11.4 | 13.3 | 29.8 | 22.2
Ridge | 7.7 | 10.3 | 11.7 | 27.2 | 21.0
Adaptive Ridge | 7.4 | 10.2 | 11.6 | 28.6 | 21.1

Eastern Pacific Basin (comparison on 254 cases):
Ensemble Method | MAE | RMSE | MAPE (%) | CVaR 5% | CVaR 15%
Best Model in Hindsight | 11.3 | 15.3 | 14.5 | 41.1 | 30.8
Ensemble Mean | 10.9 | 14.6 | 14.2 | 38.2 | 29.2
Exp3 | 10.3 | 14.9 | 15.5 | 43.3 | 31.1
Passive-Aggressive | 17.9 | 22.9 | 27.1 | 57.2 | 44.4
FSSE | 10.0 | 14.4 | 15.5 | 41.3 | 28.9
OFCL | 10.6 | 15.6 | 17.2 | 44.3 | 30.6
Ridge | 9.7 | 13.6 | 15.0 | 39.2 | 27.8
Adaptive Ridge | 8.8 | 12.1 | 13.4 | 34.1 | 24.1
Beyond ensemble modeling, our methodology can be employed for adaptive feature selec-
tion in complex multimodal tasks. For instance, in healthcare, the methods developed by
Soenksen et al. (2022); Bertsimas et al. (2022) merge modalities such as tables, text, time series, and images by extracting features through deep learning, forming a unified vector
representation. This vector is then employed to generate forecasts using standard machine-
learning algorithms. However, the weighting of these features remains static over time. We
can dynamically identify the most relevant features for predicting patient outcomes, disease
progression, or treatment responses by incorporating adaptive feature selection. Additional
applications of adaptive multimodal feature selection encompass natural disaster manage-
ment (Zeng and Bertsimas, 2023), climate change, agriculture, finance and economics, and
many others.
6. Conclusion
Using an ARO approach, this paper presented a novel methodology for building robust
ensembles of time series forecasting models. Our technique is based on a linear ensemble
framework, where the weights of the ensemble members are adaptively adjusted over time
based on the latest errors. This approach robustly adapts to model drift using an affine
decision rule and a min-max problem formulation transformed into a tractable form with
equivalence theorems.
One of the advantages of our method is that it relies on a robust linear formulation, which
makes it valuable from a practical standpoint for ease of deployment and interpretability.
However, this method also requires the availability of a training set of some size, a trade-off
we exhibited with synthetic experiments. Therefore, we recommend using our adaptive
ensemble framework when a few hundred historical time series points are available for
training.
Adaptive ensembles have the potential for many time series applications, including epidemi-
ological predictions, healthcare operations, logistics, supply chain, manufacturing, finance,
and product demand forecasting.
Acknowledgements
We thank Shuvomoy Das Gupta, Amine Bennouna, and Moïse Blanchard for useful discussions.
The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing
Center for providing high-performance computing resources that have contributed to the
research results reported within this paper.
References
Sim D. Aberson. Five-day tropical cyclone track forecasts in the North Atlantic basin. Weather and Forecasting, 13(4):1005–1015, 1998.
Ratnadip Adhikari and R. K. Agrawal. Combining multiple time series models through a
robust weighted mechanism. In 2012 1st International Conference on Recent Advances
in Information Technology (RAIT), pages 455–460, 2012. 10.1109/RAIT.2012.6194621.
Amine Bennouna and Bart Van Parys. Holistic robust data-driven decisions, 2022. URL
https://arxiv.org/abs/2207.09560.
Dimitris Bertsimas and Dick den Hertog. Adaptive and robust optimization. Dynamic
Ideas, 2022.
Dimitris Bertsimas and Angelos Georghiou. Design of near optimal decision rules in multi-
stage adaptive mixed-integer optimization. Operations Research, 63(3):610–627, 2015.
Dimitris Bertsimas, Eugene Litvinov, Xu Andy Sun, Jinye Zhao, and Tongxin Zheng. Adap-
tive robust optimization for the security constrained unit commitment problem. IEEE
Transactions on Power Systems, 28(1):52–63, 2013.
Dimitris Bertsimas, Kimberly Villalobos Carballo, Yu Ma, Liangyuan Na, Léonard Bous-
sioux, Cynthia Zeng, Luis R. Soenksen, and Ignacio Fuentes. Tabtext: a systematic
approach to aggregate knowledge across tabular data structures, 2022.
Dimitris Bertsimas, Leonard Boussioux, and Cynthia Zeng. Reducing air pollution through
machine learning, 2023. URL https://arxiv.org/abs/2303.12285.
Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach
to numerical computing. SIAM review, 59(1):65–98, 2017.
Mrinal K. Biswas, Sergio Abarca, Ligia Bernardet, Isaac Ginis, Evelyn Grell, Michael Iacono, Evan Kalina, Bin Liu, et al. Hurricane Weather Research and Forecasting (HWRF) model: 2018 scientific documentation. Developmental Testbed Center, 2018.
Léonard Boussioux, Cynthia Zeng, Théo Guénais, and Dimitris Bertsimas. Hurricane fore-
casting: A novel multimodal machine learning framework. Weather and Forecasting, 37
(6), 2022. https://doi.org/10.1175/WAF-D-21-0091.1.
Leo Breiman. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350–2383, 1996.
Gavin Brown and Ludmila Kuncheva. “good” and “bad” diversity in majority vote ensem-
bles. pages 124–133, 2010.
Luis M. Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction
models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–
97, 2017.
John P. Cangialosi, Eric Blake, Mark DeMaria, Andrew Penny, Andrew Latto, Edward
Rappaport, and Vijay Tallapragada. Recent Progress in Tropical Cyclone Intensity Fore-
casting at the National Hurricane Center. Weather and Forecasting, pages 1–30, 2020.
Vitor Cerqueira, Luis Torgo, and Igor Mozetič. Evaluating time series forecasting models:
an empirical study on performance estimation methods. Machine Learning, 109(11):
1997–2028, 2020.
Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
Chris Chatfield. Time-Series Forecasting. Chapman and Hall/CRC, New York, 2000.
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. CoRR,
abs/1603.02754, 2016. URL http://arxiv.org/abs/1603.02754.
Antonio Colmenar-Santos, Lya Lober, David Borge-Diez, and Manuel Castro. Solutions to
reduce energy consumption in the management of large buildings. Energy and Buildings,
56:66–77, 01 2013. 10.1016/j.enbuild.2012.10.004.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online
passive-aggressive algorithms. Journal of Machine Learning Research, 7(19):551–585,
2006.
Mark DeMaria, Michelle Mainelli, Lynn K. Shay, John A. Knaff, and John Kaplan. Further
improvements to the statistical hurricane intensity prediction scheme (ships). Weather
and Forecasting, 20(4):531 – 543, 2005.
Iain Dunning, Joey Huchette, and Miles Lubin. Jump: A modeling language for mathe-
matical optimization. SIAM Review, 59(2):295–320, 2017. 10.1137/15M1020575.
ECMWF. Part III: Dynamics and Numerical Procedures. Number 3 in IFS Documentation. ECMWF, 2019. URL https://www.ecmwf.int/node/19307.
U.S. Energy Information Administration EIA. Annual energy outlook, 2022. URL https://
www.eia.gov/outlooks/aeo/index.php.
João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Hamid Bouchachia.
A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46, 04 2014.
10.1145/2523813.
Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting,
4(1):1–28, 1985.
Julia Gastinger, Sébastien Nicolas, Dušica Stepić, Mischa Schmidt, and Anett Schülke. A study on Ensemble Learning for Time Series Forecasting and the need for Meta-Learning. Technical Report arXiv:2104.11475, arXiv, 2021. URL http://arxiv.org/abs/2104.11475.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL https://
www.gurobi.com.
James Douglas Hamilton. Time Series Analysis. Princeton University Press, 2020.
Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):55–67, 1970. 10.1080/00401706.1970
.10488634.
J.A. Knaff, M. DeMaria, C.R. Sampson, and J.M. Gross. Statistical 5-day tropical cyclone intensity forecasts derived from climatology and persistence. Weather and Forecasting, 18:80–92, 2003.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press,
2020. 10.1017/9781108571401.
National Hurricane Center. Automated tropical cyclone forecasting system (atcf), 2021.
URL https://ftp.nhc.noaa.gov/atcf/. Accessed: 2021-04-06.
Mariana Oliveira and Luis Torgo. Ensembles for time series forecasting. In Dinh Phung
and Hang Li, editors, Proceedings of the Sixth Asian Conference on Machine Learning,
volume 39 of Proceedings of Machine Learning Research, pages 360–370. PMLR, 26–28
Nov 2015.
Razak Olu-Ajayi, Hafiz Alaka, Ismail Sulaimon, Funlade Sunmola, and Saheed Ajayi.
Building energy consumption prediction for residential buildings using deep learning and
other machine learning techniques. Journal of Building Engineering, 45:103406, 2022.
https://doi.org/10.1016/j.jobe.2021.103406.
Anh-Duc Pham, Ngoc-Tri Ngo, Thi Thu Ha Truong, Nhat-To Huynh, and Ngoc-Son Truong.
Predicting energy consumption in multiple buildings using machine learning for improving
energy efficiency and sustainability. Journal of Cleaner Production, 260:121082, 2020.
https://doi.org/10.1016/j.jclepro.2020.121082.
Ricardo Prudêncio and Teresa Ludermir. Meta-learning approaches to selecting time series
models. Neurocomputing, 61:121–137, 2004. 10.1016/j.neucom.2004.03.008.
Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand, David
Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones,
Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles Yee,
and Peter Michaleas. Interactive supercomputing on 40,000 cores for machine learning and
data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC),
pages 1–6. IEEE, 2018.
David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Prob-
abilistic forecasting with autoregressive recurrent networks. International Journal of
Forecasting, 36(3):1181–1191, 2020.
C. Sampson and Ann J. Schrader. The automated tropical cyclone forecasting system
(version 3.2). Bulletin of the American Meteorological Society, 81:1231–1240, 2000.
Alexander Shapiro. Monte carlo sampling methods. Handbooks in operations research and
management science, 10:353–425, 2003.
James E Smith and Robert L Winkler. The optimizer’s curse: Skepticism and postdecision
surprise in decision analysis. Management Science, 52(3):311–322, 2006.
Luis R. Soenksen, Yu Ma, Cynthia Zeng, Leonard Boussioux, Kimberly Villalobos Car-
ballo, Liangyuan Na, Holly M. Wiberg, Michael L. Li, Ignacio Fuentes, and Dimitris
Bertsimas. Integrated multimodal artificial intelligence framework for healthcare appli-
cations. Nature npj Digital Medicine, 5(1):149, 2022. URL https://doi.org/10.1038/
s41746-022-00689-4.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
Huan Xu, Constantine Caramanis, and Shie Mannor. Robust Regression and Lasso. In
D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Infor-
mation Processing Systems, volume 21. Curran Associates, Inc., 2008.
Margaux Zaffran, Aymeric Dieuleveut, Olivier Féron, Yannig Goude, and Julie Josse. Adap-
tive conformal predictions for time series, 2022.
Cynthia Zeng and Dimitris Bertsimas. Global flood prediction: a multimodal machine
learning approach, 2023.
Hui Zou and Yuhong Yang. Combining time series models for forecasting. International
Journal of Forecasting, 20(1):69–84, 2004.
Appendix
We provide the computational time of adaptive ridge when we vary the training data size
(Table 6), the number of past time steps to use for the adaptive rule (Table 7), and the
number of ensemble members available (Table 8). Overall, the method remains tractable
with typical values of those hyperparameters.
Table 6: Average training time across 30 seeds of adaptive ridge with respect to the number
of samples used in the training set, when we fix the number of ensemble members to m = 10
and the number of past time steps to use to τ = 5.
Table 7: Average training time across 30 seeds of adaptive ridge with respect to the number
of past time steps to use, when we fix the number of ensemble members to m = 10 and the
number of training samples to N = 3000.
Table 8: Average training time across 30 seeds of adaptive ridge with respect to the number
of ensemble members to use, when we fix the number of past time steps to τ = 5 and the
number of training samples to N = 3000.
• Regularization factor for Ridge, Adaptive Ridge, and Passive-Aggressive, with the values $\lambda, \text{PA} \in \{0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 2\}$.
Table 9: List of hyperparameters involved in the synthetic data experiments, with their fixed value or their range when they vary for a specific experiment.

Hyperparameter type | Synthetic data hyperparameter | Notation (if defined in paper) | Value when fixed | Range when varies for a specific experiment
General data settings | Total number of samples available | T | 4000 | Always fixed
General data settings | Number of samples in the training set | | 2000 | [50, 2000] (only varies for the training data length experiment)
General data settings | Number of samples in the validation set | | 1000 | [25, 1000] (only varies for the training data length experiment)
General data settings | Number of samples in the test set | | 1000 | Always fixed
General data settings | Number of seeds to repeat standard experiments | | 30 | Always fixed