Box-Jenkins

Box-Jenkins Modeling Strategy for Fitting ARMA(p, q) Models

1. Transform the data, if necessary, so that the assumption of covariance stationarity is a reasonable one

2. Make an initial guess for the values of p and q

3. Estimate the parameters of the proposed ARMA(p, q) model

4. Perform diagnostic analysis to confirm that the proposed model adequately describes the data (e.g. examine residuals from the fitted model)

Identification of Stationary ARMA(p, q) Processes

Intuition: The mean, variance, and autocorrelations define the properties of an ARMA(p, q) model. A natural way to identify an ARMA model is to match the pattern of the observed (sample) autocorrelations with the pattern of the theoretical autocorrelations of a particular ARMA(p, q) model.

Sample autocovariances/autocorrelations:

\hat{\gamma}_j = \frac{1}{T} \sum_{t=j+1}^{T} (y_t - \hat{\mu})(y_{t-j} - \hat{\mu}), \qquad \hat{\mu} = \frac{1}{T} \sum_{t=1}^{T} y_t

\hat{\rho}_j = \frac{\hat{\gamma}_j}{\hat{\gamma}_0}

Sample autocorrelation function (SACF)/correlogram: plot ρ̂_j vs. j
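A minimal Python sketch of these formulas (not part of the original notes): the helper name `sample_acf`, the simulated AR(1) series, and the lag range are all illustrative choices.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocovariances gamma_hat_j and autocorrelations rho_hat_j, j = 0, ..., max_lag."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    mu_hat = y.mean()                              # mu_hat = (1/T) * sum_t y_t
    gamma_hat = np.array([np.sum((y[j:] - mu_hat) * (y[:T - j] - mu_hat)) / T
                          for j in range(max_lag + 1)])
    rho_hat = gamma_hat / gamma_hat[0]             # rho_hat_j = gamma_hat_j / gamma_hat_0
    return gamma_hat, rho_hat

# Illustration on a simulated AR(1) series
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()
gamma_hat, rho_hat = sample_acf(y, max_lag=10)
print(rho_hat)                                     # plot rho_hat[1:] against the lag j for the correlogram
```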


Result: If Y_t ∼ WN(0, σ²) then ρ_j = 0 for all j and

\sqrt{T}\,\hat{\rho}_j \xrightarrow{d} N(0, 1)

so that

\mathrm{avar}(\hat{\rho}_j) = \frac{1}{T}

Therefore, a simple t-statistic for H_0: ρ_j = 0 is

\sqrt{T}\,\hat{\rho}_j

and we reject H_0: ρ_j = 0 if

|\hat{\rho}_j| > \frac{1.96}{\sqrt{T}}

Remark: ρ̂_1, ..., ρ̂_k are asymptotically independent:

\sqrt{T}\,\hat{\boldsymbol{\rho}}_k \xrightarrow{d} N(0, I_k)

where \hat{\boldsymbol{\rho}}_k = (\hat{\rho}_1, \ldots, \hat{\rho}_k)'.

Box-Pierce and Box-Ljung Q-statistics

The joint statistical significance of ρ̂_1, ..., ρ̂_k may be tested using the Box-Pierce portmanteau statistic

Q(k) = T \sum_{j=1}^{k} \hat{\rho}_j^2

Assuming Y_t ∼ WN(0, σ²), it is straightforward to show that

Q(k) \xrightarrow{d} \chi^2(k)

Reject H_0: ρ_1 = \cdots = ρ_k = 0 if

Q(k) > \chi^2_{0.95}(k)

Remark: Box and Ljung show that a simple degrees-of-freedom adjustment improves the finite sample performance:

Q^*(k) = T(T+2) \sum_{j=1}^{k} \frac{\hat{\rho}_j^2}{T - j}
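A self-contained sketch of the two portmanteau statistics (not from the original notes): the helper name `ljung_box`, the white-noise test series, and the use of scipy for the chi-square critical value are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def ljung_box(y, k):
    """Box-Pierce Q(k) and Box-Ljung Q*(k) from the first k sample autocorrelations of y."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    gamma0 = np.sum(d * d) / T
    rho = np.array([np.sum(d[j:] * d[:T - j]) / T / gamma0 for j in range(1, k + 1)])
    q_bp = T * np.sum(rho ** 2)                                        # Box-Pierce Q(k)
    q_lb = T * (T + 2) * np.sum(rho ** 2 / (T - np.arange(1, k + 1)))  # Box-Ljung Q*(k)
    return q_bp, q_lb

# Illustration on white noise: H0 (rho_1 = ... = rho_k = 0) should rarely be rejected
rng = np.random.default_rng(1)
wn = rng.normal(size=500)
k = 10
q_bp, q_lb = ljung_box(wn, k)
print(q_bp, q_lb, stats.chi2.ppf(0.95, df=k))      # reject if the statistic exceeds the critical value
```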
Partial Autocorrelation Function (PACF)

The kth order partial autocorrelation of X_t = Y_t − μ is the partial regression coefficient (in the population) φ_kk in the kth order autoregression

X_t = \phi_{1k} X_{t-1} + \phi_{2k} X_{t-2} + \cdots + \phi_{kk} X_{t-k} + \mathrm{error}_t

PACF: plot φ_kk vs. k

Sample PACF (SPACF): plot φ̂_kk vs. k, where φ̂_kk is estimated from an AR(k) model for Y_t.

Results:

1. AR(p) processes: φ_kk ≠ 0 for k ≤ p, φ_kk = 0 for k > p

2. MA(q) processes: φ_kk ≠ 0 for all k, and φ_kk → 0 as k → ∞

Example: AR(2)

Y_t - \mu = \phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \varepsilon_t

φ_11 ≠ 0, φ_22 = φ_2, φ_kk = 0 for k > 2

Example: MA(1)

Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}, \qquad |\theta| < 1 \text{ (invertible)}

Since |θ| < 1, Y_t has an AR(∞) representation

(Y_t - \mu) = \phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \cdots, \qquad \phi_j = -(-\theta)^j

Therefore,

φ_kk ≠ 0 for all k, and φ_kk → 0 as k → ∞
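To illustrate the SPACF defined above, a sketch (not from the original notes) that estimates φ̂_kk as the last OLS coefficient of an AR(k) regression; the helper name `sample_pacf` and the simulated AR(2) series are illustrative, and the printed SPACF should be close to zero for k > 2, as in the AR(2) example.

```python
import numpy as np

def sample_pacf(y, max_lag):
    """phi_hat_kk = last OLS coefficient from regressing y_t on (1, y_{t-1}, ..., y_{t-k})."""
    y = np.asarray(y, dtype=float)
    pacf = []
    for k in range(1, max_lag + 1):
        X = np.column_stack([np.ones(len(y) - k)] +
                            [y[k - j:len(y) - j] for j in range(1, k + 1)])  # lags 1, ..., k
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        pacf.append(coef[-1])                       # phi_hat_kk = coefficient on the kth lag
    return np.array(pacf)

# Illustration: simulated stationary AR(2); the SPACF should cut off after lag 2
rng = np.random.default_rng(0)
y = np.zeros(1000)
for t in range(2, 1000):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()
print(sample_pacf(y, max_lag=6))                    # plot vs. k for the SPACF
```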
Inspection of the SACF and SPACF to identify ARMA models is somewhat of an art rather than a science. A more rigorous procedure to identify an ARMA model is to use formal model selection criteria. The two most widely used criteria are the Akaike information criterion (AIC) and the Bayesian (Schwarz) criterion (BIC or SIC):

\mathrm{AIC}(p, q) = \ln(\hat{\sigma}^2) + \frac{2(p+q)}{T}

\mathrm{BIC}(p, q) = \ln(\hat{\sigma}^2) + \frac{\ln(T)(p+q)}{T}

where σ̂² is the estimate of σ² from the fitted ARMA(p, q) model.

Intuition: Think of adjusted R² in regression:

ln(σ̂²) measures overall fit

2(p+q)/T and ln(T)(p+q)/T are penalty terms for large models

Note: BIC penalizes larger models more than AIC.

How to use AIC, BIC to identify an ARMA(p, q) model

1. Set upper bounds P and Q for the AR and MA orders, respectively

2. Fit all possible ARMA(p, q) models for p ≤ P and q ≤ Q using a common sample size T (the sketch below illustrates this search)

3. The best models satisfy

\min_{p \le P,\, q \le Q} \mathrm{AIC}(p, q), \qquad \min_{p \le P,\, q \le Q} \mathrm{BIC}(p, q)

Result: If the true values of p and q satisfy p ≤ P and q ≤ Q, then

1. BIC picks the true model with probability 1 as T → ∞

2. AIC picks larger values of p and q with positive probability as T → ∞
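A sketch of this grid search (not from the original notes). It assumes the ARIMA estimator from statsmodels is available; note that statsmodels reports AIC/BIC computed from the maximized log-likelihood rather than from the ln(σ̂²) formulas above, but the resulting model ranking serves the same purpose. The simulated AR(1) data and the bounds P = Q = 3 are illustrative.

```python
import warnings
import numpy as np
from statsmodels.tsa.arima.model import ARIMA    # tooling assumption

# Illustrative data: an AR(1) series, so the true orders are p = 1, q = 0
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

P, Q = 3, 3                                      # upper bounds for the AR and MA orders
results = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")              # silence convergence warnings during the grid search
    for p in range(P + 1):
        for q in range(Q + 1):
            fit = ARIMA(y, order=(p, 0, q)).fit()        # fit ARMA(p, q) on the common sample
            results[(p, q)] = (fit.aic, fit.bic)

best_aic = min(results, key=lambda pq: results[pq][0])
best_bic = min(results, key=lambda pq: results[pq][1])
print("AIC selects", best_aic, "BIC selects", best_bic)
```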
Maximum Likelihood Estimation of ARMA Models

For iid data with marginal pdf f(y_t; θ), the joint pdf for a sample y = (y_1, ..., y_T) is

f(\mathbf{y}; \theta) = f(y_1, \ldots, y_T; \theta) = \prod_{t=1}^{T} f(y_t; \theta)

The likelihood function is this joint density treated as a function of the parameters θ given the data y:

L(\theta|\mathbf{y}) = L(\theta|y_1, \ldots, y_T) = \prod_{t=1}^{T} f(y_t; \theta)

The log-likelihood is

\ln L(\theta|\mathbf{y}) = \sum_{t=1}^{T} \ln f(y_t; \theta)

Problem: For a sample from a covariance stationary time series {y_t}, the construction of the log-likelihood given above doesn't work because the random variables in the sample (y_1, ..., y_T) are not iid.

One solution: Conditional factorization of the log-likelihood

Intuition: Consider the joint density of two adjacent observations, f(y_2, y_1; θ). The joint density can always be factored as the product of the conditional density of y_2 given y_1 and the marginal density of y_1:

f(y_2, y_1; \theta) = f(y_2|y_1; \theta) f(y_1; \theta)

For three observations, the factorization becomes

f(y_3, y_2, y_1; \theta) = f(y_3|y_2, y_1; \theta) f(y_2|y_1; \theta) f(y_1; \theta)
In general, the conditional-marginal factorization has the form

f(y_T, \ldots, y_1; \theta) = \left( \prod_{t=p+1}^{T} f(y_t|I_{t-1}; \theta) \right) \cdot f(y_p, \ldots, y_1; \theta)

I_t = \{y_t, \ldots, y_1\} = information available at time t

y_p, \ldots, y_1 = initial values

The exact log-likelihood function may then be expressed as

\ln L(\theta|\mathbf{y}) = \sum_{t=p+1}^{T} \ln f(y_t|I_{t-1}; \theta) + \ln f(y_p, \ldots, y_1; \theta)

The conditional log-likelihood is

\ln L(\theta|\mathbf{y}) = \sum_{t=p+1}^{T} \ln f(y_t|I_{t-1}; \theta)

Two types of maximum likelihood estimates (mles) may be computed. The first type is based on maximizing the conditional log-likelihood function. These estimates are called conditional mles and are defined by

\hat{\theta}_{cmle} = \arg\max_{\theta} \sum_{t=p+1}^{T} \ln f(y_t|I_{t-1}; \theta)

The second type is based on maximizing the exact log-likelihood function. These estimates are called exact mles and are defined by

\hat{\theta}_{mle} = \arg\max_{\theta} \left[ \sum_{t=p+1}^{T} \ln f(y_t|I_{t-1}; \theta) + \ln f(y_p, \ldots, y_1; \theta) \right]

Result: For stationary models, θ̂_cmle and θ̂_mle are consistent and have the same limiting normal distribution. In finite samples, however, θ̂_cmle and θ̂_mle are generally not equal and may differ by a substantial amount if the data are close to being non-stationary or non-invertible.
Example: MLE for a stationary AR(1)

y_t = c + \phi y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{iid } N(0, \sigma^2), \qquad t = 1, \ldots, T

\theta = (c, \phi, \sigma^2)', \qquad |\phi| < 1

Conditional on I_{t-1},

y_t|I_{t-1} \sim N(c + \phi y_{t-1}, \sigma^2), \qquad t = 2, \ldots, T

which only depends on y_{t-1}. The conditional density f(y_t|I_{t-1}; θ) is then

f(y_t|y_{t-1}; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2} (y_t - c - \phi y_{t-1})^2 \right), \qquad t = 2, \ldots, T

The conditional log-likelihood function is

\sum_{t=2}^{T} \ln f(y_t|y_{t-1}; \theta) = -\frac{T-1}{2}\ln(2\pi) - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2

Notice that the conditional log-likelihood function has the form of the log-likelihood function for a linear regression model with normal errors

y_t = c + \phi y_{t-1} + \varepsilon_t, \qquad t = 2, \ldots, T, \qquad \varepsilon_t \sim \text{iid } N(0, \sigma^2)

It follows that

\hat{c}_{cmle} = \hat{c}_{ols}, \qquad \hat{\phi}_{cmle} = \hat{\phi}_{ols}

\hat{\sigma}^2_{cmle} = (T-1)^{-1} \sum_{t=2}^{T} (y_t - \hat{c}_{cmle} - \hat{\phi}_{cmle}\, y_{t-1})^2

To determine the marginal density for the initial value y_1, recall that for a stationary AR(1) process

E[y_1] = \mu = \frac{c}{1-\phi}, \qquad \mathrm{var}(y_1) = \frac{\sigma^2}{1-\phi^2}

It follows that

y_1 \sim N\left( \frac{c}{1-\phi}, \frac{\sigma^2}{1-\phi^2} \right)

f(y_1; \theta) = \left( \frac{2\pi\sigma^2}{1-\phi^2} \right)^{-1/2} \exp\left( -\frac{1-\phi^2}{2\sigma^2}\left( y_1 - \frac{c}{1-\phi} \right)^2 \right)

The marginal log-likelihood for the initial value y_1 is

\ln f(y_1; \theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\left( \frac{\sigma^2}{1-\phi^2} \right) - \frac{1-\phi^2}{2\sigma^2}\left( y_1 - \frac{c}{1-\phi} \right)^2

The exact log-likelihood function is then

\ln L(\theta|\mathbf{y}) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\ln\left( \frac{\sigma^2}{1-\phi^2} \right) - \frac{1-\phi^2}{2\sigma^2}\left( y_1 - \frac{c}{1-\phi} \right)^2 - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2
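A sketch of the conditional mles for this AR(1) example, computed by OLS as stated above (not from the original notes); the simulated data and the parameter values (c, φ, σ) = (1.0, 0.6, 0.5) are illustrative.

```python
import numpy as np

# Simulate a stationary AR(1): y_t = c + phi*y_{t-1} + eps_t (illustrative parameter values)
rng = np.random.default_rng(2)
c, phi, sigma = 1.0, 0.6, 0.5
T = 1000
y = np.zeros(T)
y[0] = c / (1 - phi) + rng.normal(scale=sigma / np.sqrt(1 - phi**2))  # draw y_1 from its stationary distribution
for t in range(1, T):
    y[t] = c + phi * y[t - 1] + rng.normal(scale=sigma)

# Conditional mles = OLS of y_t on (1, y_{t-1}) for t = 2, ..., T
X = np.column_stack([np.ones(T - 1), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
c_cmle, phi_cmle = coef
resid = y[1:] - X @ coef
sigma2_cmle = np.sum(resid**2) / (T - 1)          # (T-1)^{-1} * sum of squared residuals
print(c_cmle, phi_cmle, sigma2_cmle)              # should be close to (1.0, 0.6, 0.25)
```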
Remarks

1. The exact log-likelihood function is a non-linear function of the parameters θ, and so there is no closed form solution for the exact mles.

2. The exact mles must be determined by numerically maximizing the exact log-likelihood function. Usually, a Newton-Raphson type algorithm is used for the maximization, which leads to the iterative scheme

\hat{\theta}_{mle,n} = \hat{\theta}_{mle,n-1} - \hat{H}(\hat{\theta}_{mle,n-1})^{-1}\, \hat{s}(\hat{\theta}_{mle,n-1})

where Ĥ(θ̂) is an estimate of the Hessian matrix (2nd derivative of the log-likelihood function), and ŝ(θ̂) is an estimate of the score vector (1st derivative of the log-likelihood function).

3. The estimates of the Hessian and score may be computed numerically (using numerical derivative routines) or they may be computed analytically (if analytic derivatives are known).

Prediction Error Decomposition of the Log-Likelihood

To illustrate this algorithm, consider the simple AR(1) model. Recall,

y_t|I_{t-1} \sim N(c + \phi y_{t-1}, \sigma^2), \qquad t = 2, \ldots, T

from which it follows that

E[y_t|I_{t-1}] = c + \phi y_{t-1}, \qquad \mathrm{var}(y_t|I_{t-1}) = \sigma^2

The 1-step ahead prediction errors may then be defined as

v_t = y_t - E[y_t|I_{t-1}] = y_t - c - \phi y_{t-1}, \qquad t = 2, \ldots, T

The variance of the prediction error at time t is

f_t = \mathrm{var}(v_t) = \mathrm{var}(\varepsilon_t) = \sigma^2, \qquad t = 2, \ldots, T

For the initial value, the first prediction error and its variance are

v_1 = y_1 - E[y_1] = y_1 - \frac{c}{1-\phi}, \qquad f_1 = \mathrm{var}(v_1) = \frac{\sigma^2}{1-\phi^2}
Using the prediction errors and the prediction error variances, the exact log-likelihood function may be re-expressed as

\ln L(\theta|\mathbf{y}) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln f_t - \frac{1}{2}\sum_{t=1}^{T}\frac{v_t^2}{f_t}

which is the prediction error decomposition.

Remarks

1. A further simplification may be achieved by writing

\mathrm{var}(v_t) = \sigma^2 f_t^* = \begin{cases} \sigma^2 \cdot \dfrac{1}{1-\phi^2} & \text{for } t = 1 \\ \sigma^2 \cdot 1 & \text{for } t > 1 \end{cases}

That is, f_t^* = 1/(1-φ²) for t = 1 and f_t^* = 1 for t > 1. Then the log-likelihood becomes

\ln L(\theta|\mathbf{y}) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2}\sum_{t=1}^{T}\ln f_t^* - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\frac{v_t^2}{f_t^*}

2. With the above simplification, σ² may be concentrated out of the log-likelihood. That is,

\frac{\partial \ln L(\theta|\mathbf{y})}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat{\sigma}^2_{mle}(c, \phi) = \frac{1}{T}\sum_{t=1}^{T}\frac{v_t^2}{f_t^*}

Substituting σ̂²_mle(c, φ) back into ln L(θ|y) gives the concentrated log-likelihood

\ln L_c(c, \phi|\mathbf{y}) = -\frac{T}{2}\left(\ln(2\pi) + 1\right) - \frac{T}{2}\ln\hat{\sigma}^2_{mle}(c, \phi) - \frac{1}{2}\sum_{t=1}^{T}\ln f_t^*

Maximizing ln L_c(c, φ|y) gives the mles for c and φ. Maximizing ln L_c(c, φ|y) is faster than maximizing ln L(θ|y) and is more numerically stable (a numerical sketch follows these remarks).

3. For general time series models, the prediction error decomposition may be conveniently computed as a by-product of the Kalman filter algorithm if the time series model can be cast in state space form.
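A sketch of remarks 1-2 for the AR(1) case (not from the original notes): it builds the prediction errors, concentrates out σ², and maximizes the concentrated log-likelihood numerically with scipy's general-purpose optimizer rather than a hand-coded Newton-Raphson. The simulated data, starting values, and the bound |φ| ≤ 0.99 are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative AR(1) data: y_t = c0 + phi0*y_{t-1} + eps_t
rng = np.random.default_rng(3)
c0, phi0, sigma0 = 1.0, 0.6, 0.5
T = 500
y = np.zeros(T)
y[0] = c0 / (1 - phi0) + rng.normal(scale=sigma0 / np.sqrt(1 - phi0**2))
for t in range(1, T):
    y[t] = c0 + phi0 * y[t - 1] + rng.normal(scale=sigma0)

def neg_concentrated_loglik(params, y):
    """-ln Lc(c, phi | y) via the prediction error decomposition for the AR(1) model."""
    c, phi = params
    T = len(y)
    v = np.empty(T)
    f_star = np.ones(T)
    v[0] = y[0] - c / (1 - phi)                    # v_1 = y_1 - E[y_1]
    f_star[0] = 1.0 / (1 - phi**2)                 # f*_1 = 1/(1 - phi^2)
    v[1:] = y[1:] - c - phi * y[:-1]               # v_t = y_t - c - phi*y_{t-1}
    sigma2_hat = np.mean(v**2 / f_star)            # sigma^2 concentrated out of the log-likelihood
    return 0.5 * (T * (np.log(2 * np.pi) + 1) + T * np.log(sigma2_hat) + np.sum(np.log(f_star)))

res = minimize(neg_concentrated_loglik, x0=[0.0, 0.0], args=(y,),
               bounds=[(None, None), (-0.99, 0.99)])   # keep |phi| < 1 for stationarity
print(res.x)   # exact mles of (c, phi); sigma2_hat follows from the concentration formula
```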
Diagnostics of a Fitted ARMA Model

1. Compare theoretical ACF, PACF with SACF, SPACF

2. Examine autocorrelation properties of residuals:

SACF, SPACF of residuals

LM test for serial correlation in residuals

Remarks:

1. If the Box-Ljung Q*-statistic is used on the residuals from a fitted ARMA(p, q) model, the degrees of freedom of the limiting chi-square distribution must be adjusted for the number of estimated parameters:

Q^*(k) \xrightarrow{d} \chi^2(k - (p+q))

Note that this test is only valid for k > p + q. In finite samples, it may perform badly if k is close to p + q.
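A hedged illustration of this degrees-of-freedom adjustment (not from the original notes): it assumes the statsmodels ARIMA estimator, as in the earlier AIC/BIC sketch, to obtain residuals from a fitted ARMA(1,1); the simulated series and the choice k = 12 are illustrative.

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.arima.model import ARIMA     # tooling assumption

# Simulate an ARMA(1,1) series and fit the matching model (illustrative)
rng = np.random.default_rng(4)
eps = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + eps[t] + 0.3 * eps[t - 1]
fit = ARIMA(y, order=(1, 0, 1)).fit()
resid = np.asarray(fit.resid)

# Box-Ljung Q*(k) on the residuals with the adjusted degrees of freedom k - (p + q)
T, k, p, q = len(resid), 12, 1, 1
d = resid - resid.mean()
gamma0 = np.sum(d * d) / T
rho = np.array([np.sum(d[j:] * d[:T - j]) / T / gamma0 for j in range(1, k + 1)])
q_lb = T * (T + 2) * np.sum(rho**2 / (T - np.arange(1, k + 1)))
print(q_lb, stats.chi2.ppf(0.95, df=k - (p + q)))  # residual autocorrelation if q_lb exceeds the critical value
```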
