
Probability and Statistics Cookbook

Version 0.2.4
14th May, 2017
http://statistics.zone/

Copyright © Matthias Vallentin, 2017
Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-Based Confidence Interval
   11.3 Empirical Distribution
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
   15.1 Credible Intervals
   15.2 Function of Parameters
   15.3 Priors
        15.3.1 Conjugate Priors
   15.4 Bayesian Testing
16 Sampling Methods
   16.1 Inverse Transform Sampling
   16.2 The Bootstrap
        16.2.1 Bootstrap Confidence Intervals
   16.3 Rejection Sampling
   16.4 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA Models
        21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Gamma Function
   22.2 Beta Function
   22.3 Series
   22.4 Combinatorics

This cookbook integrates various topics in probability theory and statistics, based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but also influenced by others [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.
1 Distribution Overview

1.1 Discrete Distributions

Uniform, X \sim Unif\{a, \ldots, b\}:
  F_X(x) = 0 for x < a;  (\lfloor x \rfloor - a + 1)/(b - a + 1) for a \le x \le b;  1 for x > b
  f_X(x) = I(a \le x \le b)/(b - a + 1)
  E[X] = (a + b)/2,   V[X] = ((b - a + 1)^2 - 1)/12
  M_X(s) = \dfrac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}

Bernoulli, X \sim Bern(p):
  f_X(x) = p^x (1 - p)^{1 - x},  x \in \{0, 1\}
  E[X] = p,   V[X] = p(1 - p),   M_X(s) = 1 - p + p e^{s}

Binomial, X \sim Bin(n, p):
  F_X(x) = I_{1-p}(n - x, x + 1)
  f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}
  E[X] = np,   V[X] = np(1 - p),   M_X(s) = (1 - p + p e^{s})^n

Multinomial, X \sim Mult(n, p), p = (p_1, \ldots, p_k):
  f_X(x) = \dfrac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k},  \sum_{i=1}^{k} x_i = n
  E[X_i] = n p_i,   V[X_i] = n p_i (1 - p_i),   Cov[X_i, X_j] = -n p_i p_j
  M_X(s) = \left( \sum_{i=1}^{k} p_i e^{s_i} \right)^{n}

Hypergeometric, X \sim Hyp(N, m, n):
  f_X(x) = \binom{m}{x} \binom{N - m}{n - x} / \binom{N}{n}
  E[X] = \dfrac{nm}{N},   V[X] = \dfrac{nm(N - n)(N - m)}{N^2 (N - 1)}

Negative Binomial, X \sim NBin(r, p):
  F_X(x) = I_p(r, x + 1)
  f_X(x) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x
  E[X] = r \dfrac{1 - p}{p},   V[X] = r \dfrac{1 - p}{p^2},   M_X(s) = \left( \dfrac{p}{1 - (1 - p) e^{s}} \right)^{r}

Geometric, X \sim Geo(p):
  F_X(x) = 1 - (1 - p)^x,  x \in \mathbb{N}^{+}
  f_X(x) = p (1 - p)^{x - 1},  x \in \mathbb{N}^{+}
  E[X] = \dfrac{1}{p},   V[X] = \dfrac{1 - p}{p^2},   M_X(s) = \dfrac{p e^{s}}{1 - (1 - p) e^{s}}

Poisson, X \sim Po(\lambda):
  F_X(x) = e^{-\lambda} \sum_{i=0}^{\lfloor x \rfloor} \dfrac{\lambda^i}{i!}
  f_X(x) = \dfrac{\lambda^x e^{-\lambda}}{x!}
  E[X] = \lambda,   V[X] = \lambda,   M_X(s) = e^{\lambda (e^{s} - 1)}

Note: We use the notation \gamma(s, x) and \Gamma(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x to refer to the Beta functions (see 22.2).
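The tabulated means and variances can be spot-checked numerically. A minimal sketch using scipy.stats follows; the distribution choices and parameter values are arbitrary examples, not taken from the text.

```python
# Sketch: numerically spot-check E[X] and V[X] from the discrete table with scipy.stats.
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)
print(binom.mean(), n * p)            # E[X] = np
print(binom.var(), n * p * (1 - p))   # V[X] = np(1-p)

geom = stats.geom(p)                  # support {1, 2, ...}, matching the table
print(geom.mean(), 1 / p)             # E[X] = 1/p
print(geom.var(), (1 - p) / p**2)     # V[X] = (1-p)/p^2

pois = stats.poisson(lam)
print(pois.mean(), pois.var())        # both equal lambda
```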
[Figure: PMF (top row) and CDF (bottom row) of the discrete distributions — Uniform (discrete), Binomial, Geometric, and Poisson — for several parameter settings (Binomial: n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9; Geometric: p = 0.2, 0.5, 0.8; Poisson: λ = 1, 4, 10).]
1.2 Continuous Distributions

Uniform, X \sim Unif(a, b):
  F_X(x) = 0 for x < a;  (x - a)/(b - a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b - a)
  E[X] = (a + b)/2,   V[X] = (b - a)^2/12,   M_X(s) = \dfrac{e^{sb} - e^{sa}}{s(b - a)}

Normal, X \sim N(\mu, \sigma^2):
  F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right), where \Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt
  f_X(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\left( -\dfrac{(x - \mu)^2}{2\sigma^2} \right)
  E[X] = \mu,   V[X] = \sigma^2,   M_X(s) = \exp\left( \mu s + \dfrac{\sigma^2 s^2}{2} \right)

Log-Normal, X \sim \ln N(\mu, \sigma^2):
  F_X(x) = \dfrac{1}{2} + \dfrac{1}{2} \operatorname{erf}\left( \dfrac{\ln x - \mu}{\sqrt{2}\,\sigma} \right)
  f_X(x) = \dfrac{1}{x \sigma \sqrt{2\pi}} \exp\left( -\dfrac{(\ln x - \mu)^2}{2\sigma^2} \right)
  E[X] = e^{\mu + \sigma^2/2},   V[X] = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}

Multivariate Normal, X \sim MVN(\mu, \Sigma):
  f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
  E[X] = \mu,   V[X] = \Sigma,   M_X(s) = \exp\left( \mu^T s + \tfrac{1}{2} s^T \Sigma s \right)

Student's t, X \sim Student(\nu):
  f_X(x) = \dfrac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\, \Gamma\left(\frac{\nu}{2}\right)} \left( 1 + \dfrac{x^2}{\nu} \right)^{-(\nu + 1)/2}
  E[X] = 0 (\nu > 1),   V[X] = \dfrac{\nu}{\nu - 2} (\nu > 2), \infty for 1 < \nu \le 2

Chi-square, X \sim \chi^2_k:
  F_X(x) = \dfrac{\gamma(k/2, x/2)}{\Gamma(k/2)},   f_X(x) = \dfrac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}
  E[X] = k,   V[X] = 2k,   M_X(s) = (1 - 2s)^{-k/2} for s < 1/2

F, X \sim F(d_1, d_2):
  F_X(x) = I_{d_1 x/(d_1 x + d_2)}(d_1/2, d_2/2)
  f_X(x) = \dfrac{1}{x\, B(d_1/2, d_2/2)} \sqrt{\dfrac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}
  E[X] = \dfrac{d_2}{d_2 - 2} (d_2 > 2),   V[X] = \dfrac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)} (d_2 > 4)

Exponential, X \sim Exp(\beta):
  F_X(x) = 1 - e^{-x/\beta},   f_X(x) = \dfrac{1}{\beta} e^{-x/\beta}
  E[X] = \beta,   V[X] = \beta^2,   M_X(s) = \dfrac{1}{1 - \beta s} (s < 1/\beta)

Gamma, X \sim Gamma(\alpha, \beta):
  F_X(x) = \dfrac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)},   f_X(x) = \dfrac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}
  E[X] = \alpha\beta,   V[X] = \alpha\beta^2,   M_X(s) = \left( \dfrac{1}{1 - \beta s} \right)^{\alpha} (s < 1/\beta)

Inverse Gamma, X \sim InvGamma(\alpha, \beta):
  F_X(x) = \dfrac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)},   f_X(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}
  E[X] = \dfrac{\beta}{\alpha - 1} (\alpha > 1),   V[X] = \dfrac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \dfrac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_{\alpha}\left( \sqrt{-4\beta s} \right)

Dirichlet, X \sim Dir(\alpha):
  f_X(x) = \dfrac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}
  E[X_i] = \dfrac{\alpha_i}{\sum_{j} \alpha_j},   V[X_i] = \dfrac{E[X_i] (1 - E[X_i])}{\sum_{j} \alpha_j + 1}

Beta, X \sim Beta(\alpha, \beta):
  F_X(x) = I_x(\alpha, \beta),   f_X(x) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}
  E[X] = \dfrac{\alpha}{\alpha + \beta},   V[X] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
  M_X(s) = 1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \dfrac{\alpha + r}{\alpha + \beta + r} \right) \dfrac{s^k}{k!}

Weibull, X \sim Weibull(\lambda, k):
  F_X(x) = 1 - e^{-(x/\lambda)^k},   f_X(x) = \dfrac{k}{\lambda} \left( \dfrac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}
  E[X] = \lambda \Gamma\left( 1 + \dfrac{1}{k} \right),   V[X] = \lambda^2 \Gamma\left( 1 + \dfrac{2}{k} \right) - E[X]^2
  M_X(s) = \sum_{n=0}^{\infty} \dfrac{s^n \lambda^n}{n!} \Gamma\left( 1 + \dfrac{n}{k} \right)

Pareto, X \sim Pareto(x_m, \alpha):
  F_X(x) = 1 - \left( \dfrac{x_m}{x} \right)^{\alpha},  x \ge x_m;   f_X(x) = \alpha \dfrac{x_m^{\alpha}}{x^{\alpha + 1}},  x \ge x_m
  E[X] = \dfrac{\alpha x_m}{\alpha - 1} (\alpha > 1),   V[X] = \dfrac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \alpha (-x_m s)^{\alpha} \Gamma(-\alpha, -x_m s) (s < 0)

Note: We use the rate parameterization, where \lambda = 1/\beta; some textbooks use \beta as the scale parameter instead [6].
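Two rows of the continuous table can be checked the same way; a minimal sketch with scipy.stats (the values of μ and σ are arbitrary; note that scipy parameterizes the Log-Normal by s = σ and scale = e^μ):

```python
# Sketch: spot-check the Normal and Log-Normal rows of the continuous table.
import numpy as np
from scipy import stats

mu, sigma = 0.5, 1.2

# Normal: E[X] = mu, V[X] = sigma^2
X = stats.norm(loc=mu, scale=sigma)
print(X.mean(), mu, X.var(), sigma**2)

# Log-Normal: E[X] = exp(mu + sigma^2/2), V[X] = (exp(sigma^2) - 1) exp(2 mu + sigma^2)
Y = stats.lognorm(s=sigma, scale=np.exp(mu))
print(Y.mean(), np.exp(mu + sigma**2 / 2))
print(Y.var(), (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2))
```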
[Figure: PDF (first two rows) and CDF (last two rows) of the continuous distributions — Uniform (continuous), Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto — for several parameter settings.]
2 Probability Theory

Definitions

- Sample space \Omega
- Outcome (point or element) \omega \in \Omega
- Event A \subseteq \Omega
- \sigma-algebra \mathcal{A}:
  1. \emptyset \in \mathcal{A}
  2. A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}
  3. A \in \mathcal{A} \implies A^c \in \mathcal{A}
- Probability distribution P:
  1. P[A] \ge 0 for every A
  2. P[\Omega] = 1
  3. P\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i] for disjoint A_i
- Probability space (\Omega, \mathcal{A}, P)

Properties

- P[\emptyset] = 0
- B = B \cap \Omega = B \cap (A \sqcup A^c) = (A \cap B) \sqcup (A^c \cap B)
- P[A^c] = 1 - P[A]
- P[B] = P[A \cap B] + P[A^c \cap B]
- P[\Omega] = 1,  P[\emptyset] = 0
- \left( \bigcup_n A_n \right)^c = \bigcap_n A_n^c and \left( \bigcap_n A_n \right)^c = \bigcup_n A_n^c  (DeMorgan)
- P\left[ \bigcup_n A_n \right] = 1 - P\left[ \bigcap_n A_n^c \right]
- P[A \cup B] = P[A] + P[B] - P[A \cap B] \implies P[A \cup B] \le P[A] + P[B]
- P[A \cup B] = P[A \cap B^c] + P[A^c \cap B] + P[A \cap B]
- P[A \cap B^c] = P[A] - P[A \cap B]

Continuity of Probabilities

- A_1 \subset A_2 \subset \cdots \implies \lim_{n \to \infty} P[A_n] = P[A], where A = \bigcup_{i=1}^{\infty} A_i
- A_1 \supset A_2 \supset \cdots \implies \lim_{n \to \infty} P[A_n] = P[A], where A = \bigcap_{i=1}^{\infty} A_i

Independence

A \perp B \iff P[A \cap B] = P[A]\, P[B]

Conditional Probability

P[A \mid B] = \dfrac{P[A \cap B]}{P[B]},   P[B] > 0

Law of Total Probability

P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i], where \Omega = \bigsqcup_{i=1}^{n} A_i

Bayes' Theorem

P[A_i \mid B] = \dfrac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}, where \Omega = \bigsqcup_{i=1}^{n} A_i

Inclusion-Exclusion Principle

\left| \bigcup_{i=1}^{n} A_i \right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i \le i_1 < \cdots < i_r \le n} \left| \bigcap_{j=1}^{r} A_{i_j} \right|

3 Random Variables

Random Variable (RV)

X : \Omega \to \mathbb{R}

Probability Mass Function (PMF)

f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]

Probability Density Function (PDF)

P[a \le X \le b] = \int_a^b f(x)\, dx

Cumulative Distribution Function (CDF)

F_X : \mathbb{R} \to [0, 1],   F_X(x) = P[X \le x]

1. Nondecreasing: x_1 < x_2 \implies F(x_1) \le F(x_2)
2. Normalized: \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to \infty} F(x) = 1
3. Right-continuous: \lim_{y \downarrow x} F(y) = F(x)

P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\, dy,   a \le b

f_{Y|X}(y \mid x) = \dfrac{f(x, y)}{f_X(x)}

Independence

1. P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
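The law of total probability and Bayes' theorem above can be illustrated with a tiny numeric sketch; the prior and conditional probabilities below are made up for illustration only.

```python
# Sketch: law of total probability and Bayes' theorem on a two-event partition.
priors = {"A1": 0.7, "A2": 0.3}            # P[A_i], a partition of the sample space
likelihood = {"A1": 0.1, "A2": 0.6}        # P[B | A_i]

# Law of total probability: P[B] = sum_i P[B | A_i] P[A_i]
p_b = sum(likelihood[a] * priors[a] for a in priors)

# Bayes' theorem: P[A_i | B] = P[B | A_i] P[A_i] / P[B]
posterior = {a: likelihood[a] * priors[a] / p_b for a in priors}
print(p_b, posterior)
```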
Z
3.1 Transformations E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
E [(Y )] 6= (E [X]) (cf. Jensen inequality)
Z = (X)
P [X Y ] = 1 = E [X] E [Y ]
Discrete P [X = Y ] = 1 = E [X] = E [Y ]
X
fZ (z) = P [(X) = z] = P [{x : (x) = z}] = P X 1 (z) =
 
fX (x)
X
E [X] = P [X x] X discrete
x1 (z) x=1

Continuous Sample mean


n
n = 1
Z X
X Xi
FZ (z) = P [(X) z] = f (x) dx with Az = {x : (x) z} n i=1
Az
Conditional expectation
Special case if strictly monotone Z

d

dx 1 E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (1 (z)) 1 (z) = fX (x) = fX (x)

dz dz |J| E [X] = E [E [X | Y ]]
Z
The Rule of the Lazy Statistician E(X,Y ) | X=x [=] (x, y)fY |X (y | x) dx

Z Z
E [Z] = (x) dFX (x) E [(Y, Z) | X = x] = (y, z)f(Y,Z)|X (y, z | x) dy dz

Z Z E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X A] E [(X)Y | X] = (X)E [Y | X]
A
E [Y | X] = c = Cov [X, Y ] = 0
Convolution
Z Z z
X,Y 0
Z := X + Y fZ (z) = fX,Y (x, z x) dx = fX,Y (x, z x) dx
0 5 Variance
Z
Z := |X Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z Z 2
    2
X V [X] = X = E (X E [X])2 = E X 2 E [X]
Z := fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx " n # n
Y X X X
V Xi = V [Xi ] + Cov [Xi , Xj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
V Xi = V [Xi ] if Xi
Xj
Definition and properties i=1 i=1
X

xfX (x) X discrete Standard deviation p
sd[X] = V [X] = X

Z x

E [X] = X = x dFX (x) = Covariance

Z
xfX (x) dx X continuous


Cov [X, Y ] = E [(X E [X])(Y E [Y ])] = E [XY ] E [X] E [Y ]
P [X = c] = 1 = E [X] = c Cov [X, a] = 0
E [cX] = c E [X] Cov [X, X] = V [X]
E [X + Y ] = E [X] + E [Y ] Cov [X, Y ] = Cov [Y, X]
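The monotone change-of-variables formula in Section 3.1 can be checked by simulation. A minimal sketch, assuming the illustrative choice Z = exp(X) with X standard normal (so the analytic density is f_X(ln z)/z); the grid and sample size are arbitrary.

```python
# Sketch: Monte Carlo check of f_Z(z) = f_X(phi^{-1}(z)) |d phi^{-1}(z)/dz| for Z = exp(X).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
z = np.exp(x)                                   # Z = phi(X)

grid = np.linspace(0.1, 4.0, 5)
analytic = stats.norm.pdf(np.log(grid)) / grid  # f_X(ln z) * |d ln z / dz|

# crude density estimate from the simulated Z for comparison
hist, edges = np.histogram(z, bins=200, range=(0, 8), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(grid, centers, hist)

print(np.round(analytic, 3))
print(np.round(empirical, 3))
```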
Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
Cov Xi , Yj = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
Xi Bern (p) = Xi Bin (n, p)
i=1
Correlation X Bin (n, p) , Y Bin (m, p) = X + Y Bin (n + m, p)
Cov [X, Y ]
[X, Y ] = p limn Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] limn Bin (n, p) = N (np, np(1 p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X
Y = [X, Y ] = 0 Cov [X, Y ] = 0 E [XY ] = E [X] E [Y ]
X NBin (1, p) = Geo (p)
Pr
Sample variance X NBin (r, p) = i=1 Geo (p)
n P P
1 X n )2 Xi NBin (ri , p) = Xi NBin ( ri , p)
S2 = (Xi X
n 1 i=1 X NBin (r, p) . Y Bin (s + r, p) = P [X s] = P [Y r]
Conditional variance Poisson
    2 n n
!
V [Y | X] = E (Y E [Y | X])2 | X = E Y 2 | X E [Y | X] X X
Xi Po (i ) Xi Xj = Xi Po i
V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1

n n
X X i
6 Inequalities Xi Po (i ) Xi Xj = Xi Xj Bin Xj , Pn
j=1 j=1 j=1 j

Cauchy-Schwarz
2 Exponential
E [XY ] E X 2 E Y 2
   
n
X
Markov Xi Exp () Xi
Xj = Xi Gamma (n, )
E [(X)]
P [(X) t] i=1
t Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X E [X]| t]
t2  
X

Chernoff X N , 2 = N (0, 1)

 
e 
X N , Z = aX + b = Z N a + b, a2 2
2

P [X (1 + )] > 1
(1 + )1+ 
Xi N i , i2 Xi Xj =
P
Xi N
P
i , i i2
P 
i i
Hoeffding  
P [a < X b] = b a


X1 , . . . , Xn independent P [Xi [ai , bi ]] = 1 1 i n (x) = 1 (x) 0 (x) = x(x) 00 (x) = (x2 1)(x)
1

E X t e2nt2 t > 0
   Upper quantile of N (0, 1): z = (1 )
P X
 2 2
 Gamma
E X | t 2 exp Pn 2n t
   
P |X 2
t>0
i=1 (bi ai ) X Gamma (, ) X/ Gamma (, 1)
P
Jensen Gamma (, ) i=1 Exp ()
P P
E [(X)] (E [X]) convex Xi Gamma (i , ) Xi
Xj = i Xi Gamma ( i i , )
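One of the distribution relationships listed above, that a sum of n iid exponentials with common scale is Gamma(n, β), can be verified empirically; a minimal sketch using a Kolmogorov-Smirnov test as a rough sanity check (n, β, and the number of replications are arbitrary examples):

```python
# Sketch: empirical check that sum of n iid Exponential(beta) draws is Gamma(n, beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, beta, reps = 5, 2.0, 50_000

sums = rng.exponential(scale=beta, size=(reps, n)).sum(axis=1)
print(stats.kstest(sums, stats.gamma(a=n, scale=beta).cdf))   # large p-value expected
```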
Z
() 9.2 Bivariate Normal
= x1 ex dx
0  
Let X N x , x2 and Y N y , y2 .
Beta  
1 ( + ) 1 1 z
x1 (1 x)1 = x (1 x)1 f (x, y) = exp
2(1 2 )
p
B(, ) ()() 2x y 1 2
  B( + k, ) +k1
E X k1
  " #
E Xk =
2 2
=
  
B(, ) ++k1 x x y y x x y y
z= + 2
Beta (1, 1) Unif (0, 1) x y x y
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] +
X
(Y E [Y ])
  Y
GX (t) = E tX |t| < 1 p
V [X | Y ] = X 1 2
" #  
X (Xt)i X E Xi
ti
 
MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
9.3 Multivariate Normal
P [X = 0] = GX (0)
P [X = 1] = G0X (0) Covariance matrix (Precision matrix 1 )
(i)
GX (0)
P [X = i] = V [X1 ] Cov [X1 , Xk ]
i! .. .. ..
=

E [X] = G0X (1 ) . . .
  (k)
E X k = MX (0) Cov [Xk , X1 ] V [Xk ]
 
X! (k) If X N (, ),
E = GX (1 )
(X k)!  
2 1
V [X] = G00X (1 ) + G0X (1 ) (G0X (1 )) fX (x) = (2) n/2
||
1/2
exp (x )T 1 (x )
d 2
GX (t) = GY (t) = X = Y
Properties
9 Multivariate Distributions Z N (0, 1) X = + 1/2 Z = X N (, )
X N (, ) = 1/2 (X ) N (0, 1)
9.1 Standard Bivariate Normal X N (, ) = AX N A, AAT

p 
Let X, Y N (0, 1) X
Z where Y = X + 1 2 Z X N (, ) kak = k = aT X N aT , aT a

Joint density
1 x2 + y 2 2xy
  10 Convergence
f (x, y) = exp
2(1 2 )
p
2 1 2 Let {X1 , X2 , . . .} be a sequence of rvs and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) N x, 1 2 (X | Y = y) N y, 1 2
 
and D
1. In distribution (weakly, in law): Xn X
Independence
X
Y = 0 lim Fn (t) = F (t) t where F continuous
n 11
P
2. In probability: Xn X
Xn n(Xn ) D
Zn := q   = Z where Z N (0, 1)
( > 0) lim P [|Xn X| > ] = 0 n
n V X
as
3. Almost surely (strongly): Xn X lim P [Zn z] = (z) zR
n
h i h i
P lim Xn = X = P : lim Xn () = X() = 1 CLT notations
n n

qm
Zn N (0, 1)
4. In quadratic mean (L2 ): Xn X  2

X n N ,
lim E (Xn X)2 = 0 n
 
n  2

X n N 0,
Relationships n

2

qm P D n(Xn ) N 0,
Xn X = Xn X = Xn X
as
Xn X = Xn X
P n(Xn )
N (0, 1)
D P
Xn X (c R) P [X = c] = 1 = Xn X
P P P
Xn X Yn Y = Xn + Yn X + Y
qm qm qm
Xn X Yn Y = Xn + Yn X + Y Continuity correction
P P P
Xn X Yn Y = Xn Yn XY
x + 12
P P
 
Xn X = (Xn ) (X) 
P Xn x


D
Xn X = (Xn ) (X)
D / n
qm
Xn b limn E [Xn ] = b limn V [Xn ] = 0
x 12
 
n
qm
n x 1
 
X1 , . . . , Xn iid E [X] = V [X] < X P X
/ n
Slutzkys Theorem Delta method
D P D
Xn X and Yn c = Xn + Yn X + c 
2
 
2 2

D P D
Xn X and Yn c = Xn Yn cX Yn N , = (Yn ) N (), (0 ())
n n
D D D
In general: Xn X and Yn Y =
6 Xn + Yn X + Y
11 Statistical Inference
10.1 Law of Large Numbers (LLN)
Let X_1, \ldots, X_n \overset{\text{iid}}{\sim} F, if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = .
Weak (WLLN)
n
X
P
n 11.1 Point Estimation
Strong (SLLN) Point estimator bn of is a rv: bn = g(X1 , . . . , Xn )
h i
n
X
as
n bias(bn ) = E bn
P
Consistency: bn
10.2 Central Limit Theorem (CLT)
Sampling distribution: F (bn )
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] = 2 .
r h i
Standard error: se(n ) = V bn
b
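The Central Limit Theorem of Section 10.2 is easy to see by simulation: standardized sample means of iid draws look approximately N(0, 1). A minimal sketch, with Exponential(1) data and arbitrary choices of n and the number of replications:

```python
# Sketch: CLT -- Z_n = sqrt(n)(X_bar - mu)/sigma is approximately N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0                          # mean and sd of Exponential(1)

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma

# compare a few quantiles with the standard normal
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.norm.ppf(q))
```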
h i h i
Mean squared error: mse = E (bn )2 = bias(bn )2 + V bn 11.4 Statistical Functionals
limn bias(bn ) = 0 limn se(bn ) = 0 = bn is consistent Statistical functional: T (F )
bn D Plug-in estimator of = (F ): bn = T (Fbn )
Asymptotic normality: N (0, 1) R
se Linear functional: T (F ) = (x) dFX (x)
Slutzkys Theorem often lets us replace se(bn ) by some (weakly) consis- Plug-in estimator for linear functional:
tent estimator
bn . Z n
1X
T (Fbn ) = (x) dFbn (x) = (Xi )
11.2 Normal-Based Confidence Interval n i=1
 
b 2 . Let z/2 = 1 (1 (/2)), i.e., P Z > z/2 = /2
 
Suppose bn N , se
 
  b 2 = T (Fbn ) z/2 se
Often: T (Fbn ) N T (F ), se b
and P z/2 < Z < z/2 = 1 where Z N (0, 1). Then
pth quantile: F 1 (p) = inf{x : F (x) p}
Cn = bn z/2 se
b b=X n
n
1 X n )2
b2 =
(Xi X
11.3 Empirical distribution n 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi b)3

b=
Pn
I(Xi x) Pb3
Fn (x) = i=1
b n
i=1 (Xi Xn )(Yi Yn )
n b = qP qP
n 2 n 2
i=1 (Xi Xn ) i=1 (Yi Yn )
(
1 Xi x
I(Xi x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
E Fbn = F (x)

Let F = f (x; ) : be a parametric model with parameter space Rk
h i F (x)(1 F (x)) and parameter = (1 , . . . , k ).
V Fbn =
n
F (x)(1 F (x)) D 12.1 Method of Moments
mse = 0
n
P j th moment
Fbn F (x) Z
j () = E X j = xj dFX (x)
 
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn F )
 
P sup F (x) Fn (x) > = 2e2n
b 2
j th sample moment
x n
1X j
Nonparametric 1 confidence band for F
bj = X
n i=1 i
L(x) = max{Fbn n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s   1 () =
b1
1 2
= log 2 () =
b2
2n
.. ..
.=.
P [L(x) F (x) U (x) x] 1 k () =
bk
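The empirical distribution function and the DKW nonparametric 1 − α confidence band described above can be computed in a few lines; a minimal sketch on simulated N(0, 1) data (sample size and α are illustrative):

```python
# Sketch: ECDF with the DKW confidence band L(x) = max(F_hat - eps, 0), U(x) = min(F_hat + eps, 1).
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.standard_normal(200))
n, alpha = len(x), 0.05

F_hat = np.arange(1, n + 1) / n                      # ECDF at the order statistics
eps = np.sqrt(np.log(2 / alpha) / (2 * n))           # DKW epsilon
lower = np.clip(F_hat - eps, 0, 1)
upper = np.clip(F_hat + eps, 0, 1)

print(eps)
print(lower[:5], upper[:5])
```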
Properties of the MoM estimator Equivariance: bn is the mle = (bn ) is the mle of ()
bn exists with probability tending to 1 Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
Consistency: bn ples. If en is any other estimator, the asymptotic relative efficiency is:
p
Asymptotic normality: 1. se 1/In ()
(bn ) D
D
n(b ) N (0, ) N (0, 1)
se
  q
where = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b 1/In (bn )
2. se
1
g = (g1 , . . . , gk ) and gj = j ()
(bn ) D
N (0, 1)
se
b
12.2 Maximum Likelihood Asymptotic optimality
Likelihood: Ln : [0, ) h i
V bn
n
Y are(en , bn ) = h i 1
Ln () = f (Xi ; ) V en
i=1
Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n () = log Ln () = log f (Xi ; )
i=1 b where is differentiable and 0 () 6= 0:
If = ()
Maximum likelihood estimator (mle)
n ) D
(b
N (0, 1)
Ln (bn ) = sup Ln () se(b
b )

where b = ()
b is the mle of and
Score function

s(X; ) = log f (X; ) b = 0 ()
se se(
b n )
b b

Fisher information
I() = V [s(X; )] 12.3 Multiparameter Models
In () = nI() Let = (1 , . . . , k ) and b = (b1 , . . . , bk ) be the mle.
Fisher information (exponential family)
2 `n 2 `n
  Hjj = Hjk =
2 j k
I() = E s(X; )
Fisher information matrix
Observed Fisher information
E [H11 ] E [H1k ]

n
In () = .. .. ..
2 X

. . .
Inobs () =

log f (Xi ; )
2 i=1 E [Hk1 ] E [Hkk ]

Properties of the mle Under appropriate regularity conditions


P
Consistency: bn (b ) N (0, Jn )
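Maximum likelihood as described in Section 12.2 can also be carried out by direct numerical optimization of the log-likelihood. A minimal sketch for an Exponential rate λ, comparing the numerical optimum with the closed form 1/x̄ and the Fisher-information standard error λ̂/√n (the data and the true rate are simulated examples):

```python
# Sketch: numerical MLE for an Exponential rate, plus the Fisher-information SE.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
true_rate = 2.0
x = rng.exponential(scale=1 / true_rate, size=500)
n = len(x)

def neg_loglik(rate):
    # l_n(rate) = n log(rate) - rate * sum(x)
    return -(n * np.log(rate) - rate * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")
mle_numeric = res.x
mle_closed = 1 / x.mean()
se_hat = mle_closed / np.sqrt(n)       # I(lambda) = 1/lambda^2 for the Exponential

print(mle_numeric, mle_closed, se_hat)
```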
with Jn () = In1 . Further, if bj is the j th component of , then Critical value c
Test statistic T
(bj j ) D Rejection region R = {x : T (x) > c}
N (0, 1)
se
bj Power function () = P [X R]
h i Power of a test: 1 P [Type II error] = 1 = inf ()
b 2j = Jn (j, j) and Cov bj , bk = Jn (j, k)
where se 1
Test size: = P [Type I error] = sup ()
0
12.3.1 Multiparameter delta method
Let = (1 , . . . , k ) and let the gradient of be Retain H0 Reject H0



H0 true Type
I Error ()
1 H1 true Type II Error () (power)
.
p-value
..
=

k

p-value = sup0 P [T (X) T (x)] = inf : T (x) R
P [T (X ? ) T (X)]

p-value = sup0 = inf : T (X) R
Suppose =b 6= 0 and b = ().
b Then, | {z }
1F (T (X)) since T (X ? )F
) D
(b
N (0, 1)
se(b
b )
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 0.05 strong evidence against H0
  
se(b
b ) =
b Jbn
b
0.05 0.1 weak evidence against H0
b and

b = b. > 0.1 little or no evidence against H0
and Jbn = Jn () =
Wald test
12.4 Parametric Bootstrap
Two-sided test
Sample from f (x; bn ) instead of from Fbn , where bn could be the mle or method
of moments estimator. b 0
Reject H0 when |W | > z/2 where W =
  se
b
P |W | > z/2
13 Hypothesis Testing p-value = P0 [|W | > |w|] P [|Z| > |w|] = 2(|w|)

H0 : 0 versus H1 : 1
Likelihood ratio test
Definitions

Null hypothesis H0 sup Ln () Ln (bn )


Alternative hypothesis H1 T (X) = =
sup0 Ln () Ln (bn,0 )
Simple hypothesis = 0 k
Composite hypothesis > 0 or < 0 iid
D
X
(X) = 2 log T (X) 2rq where Zi2 2k and Z1 , . . . , Zk N (0, 1)
Two-sided test: H0 : = 0 versus H1 : 6= 0
 i=1 
One-sided test: H0 : 0 versus H1 : > 0 p-value = P0 [(X) > (x)] P 2rq > (x)
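The two-sided Wald test above can be sketched for a Bernoulli proportion with H0: p = 0.5; the data are simulated and the sample size is arbitrary.

```python
# Sketch: two-sided Wald test, W = (p_hat - p0)/se_hat, p-value = 2 Phi(-|w|).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.55, size=400)
n, p0 = len(x), 0.5

p_hat = x.mean()
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)
W = (p_hat - p0) / se_hat
p_value = 2 * stats.norm.sf(abs(W))

print(W, p_value, abs(W) > stats.norm.ppf(0.975))   # reject H0 when |W| > z_{alpha/2}
```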
Multinomial LRT Natural form
 
X1 Xk
mle: pbn = ,..., fX (x | ) = h(x) exp { T(x) A()}
n n
k
Y  pbj Xj = h(x)g() exp { T(x)}
Ln (b
pn )
T (X) = = = h(x)g() exp T T(x)

Ln (p0 ) j=1
p0j
k  
X pbj D
(X) = 2 Xj log 2k1 15 Bayesian Inference
j=1
p 0j

The approximate size LRT rejects H0 when (X) 2k1, Bayes Theorem
Pearson Chi-square Test f (x | )f () f (x | )f ()
f ( | x) = =R Ln ()f ()
k f (xn ) f (x | )f () d
X (Xj E [Xj ])2
T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
T 2k1 X n = (X1 , . . . , Xn )
 
p-value = P 2k1 > T (x) xn = (x1 , . . . , xn )
D
2
Faster Xk1 than LRT, hence preferable for small n Prior density f ()
Likelihood f (xn | ): joint density of the data
Independence testing Yn
In particular, X n iid = f (xn | ) = f (xi | ) = Ln ()
I rows, J columns, X multinomial sample of size n = I J i=1
X
mles unconstrained: pbij = nij Posterior density f ( | xn )
X
Normalizing constant cn = f (xn ) = f (x | )f () d
R
mles under H0 : pb0ij = pbi pbj = Xni nj
Kernel: part of a density that dependsRon
 
PI PJ nX
LRT: = 2 i=1 j=1 Xij log Xi Xijj L ()f ()d
Posterior mean n = f ( | xn ) d = R n
R
PI PJ (X E[X ])2 Ln ()f () d
PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
LRT and Pearson 2k , where = (I 1)(J 1) 15.1 Credible Intervals
Posterior interval
14 Exponential Family Z b
P [ (a, b) | xn ] = f ( | xn ) d = 1
Scalar parameter a

fX (x | ) = h(x) exp {()T (x) A()} Equal-tail credible interval


= h(x)g() exp {()T (x)} Z a Z
f ( | xn ) d = f ( | xn ) d = /2
Vector parameter b

Highest posterior density (HPD) region Rn


( s
)
X
fX (x | ) = h(x) exp i ()Ti (x) A()
i=1 1. P [ Rn ] = 1
= h(x) exp {() T (x) A()} 2. Rn = { : f ( | xn ) > k} for some k
= h(x)g() exp {() T (x)} Rn is unimodal = Rn is an interval
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let = () and A = { : () }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for 
Unif (0, ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [() | x ] = f ( | x ) d Exp () Gamma (, ) + n, + xi
A
i=1
 Pn   
0 i=1 xi 1 n
2
 2

Posterior density N , c N 0 , 0 + / + 2 ,
2 2 02 c
 0 c1
1 n
h( | xn ) = H 0 ( | xn ) + 2
02 c
Pn
 02 + i=1 (xi )2
Bayesian delta method N c , 2 Scaled Inverse Chi- + n,
+n
square(, 02 )

+ nx n

| X n N (),
b seb 0 ()

N , 2
b
Normal- , + n, + ,
+n 2
scaled Inverse n 2
1X (
x )
Gamma(, , , ) + )2 +
(xi x
2 i=1 2(n + )
15.3 Priors 1
1 1
1 1

MVN(, c ) MVN(0 , 0 ) 0 + nc 0 0 + n x
,
1 1
1

Choice 0 + nc
Xn
MVN(c , ) Inverse- n + , + (xi c )(xi c )T
Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(, ) i=1
the researcher's a priori knowledge via prior elicitation
X xi
Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (, ) + n, + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 kn where k0 > kn
Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (c , ) Gamma (0 , 0 ) 0 + nc , 0 + xi
i=1

Types

Flat: f () constant
R
Proper: f () d = 1
R
Improper: f () d =
Jeffreys Prior (transformation-invariant):

p p
f () I() f () det(I())

Conjugate: f () and f ( | xn ) belong to the same parametric family


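A minimal sketch of one row of the conjugate-prior table above: a Normal likelihood with known variance σ_c² combined with a Normal(μ₀, σ₀²) prior, whose posterior mean and variance follow the formulas in the table (all numeric values are illustrative):

```python
# Sketch: Normal-likelihood / Normal-prior conjugate update with known data variance.
import numpy as np

rng = np.random.default_rng(6)
sigma_c = 1.0                                    # known data sd
mu0, sigma0 = 0.0, 2.0                           # prior N(mu0, sigma0^2)
x = rng.normal(1.5, sigma_c, size=30)
n = len(x)

precision = 1 / sigma0**2 + n / sigma_c**2       # posterior precision
post_var = 1 / precision
post_mean = post_var * (mu0 / sigma0**2 + x.sum() / sigma_c**2)

print(post_mean, post_var)
```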
Discrete likelihood Bayes factor
Likelihood Conjugate prior Posterior hyperparameters log10 BF10 BF10 evidence
n n 0 0.5 1 1.5 Weak
0.5 1 1.5 10 Moderate
X X
Bern (p) Beta (, ) + xi , + n xi
i=1 i=1
12 10 100 Strong
Xn n
X n
X >2 > 100 Decisive
Bin (p) Beta (, ) + xi , + Ni xi
p
i=1 i=1 i=1 1p BF10
n
X p = p where p = P [H1 ] and p = P [H1 | xn ]
NBin (p) Beta (, ) + rn, + xi 1 + 1p BF10
i=1
n
16 Sampling Methods
X
Po () Gamma (, ) + xi , + n
i=1
n
X 16.1 Inverse Transform Sampling
Multinomial(p) Dir () + x(i)
i=1 Setup
n
X
Geo (p) Beta (, ) + n, + xi U Unif (0, 1)
i=1 XF
F 1 (u) = inf{x | F (x) u}
15.4 Bayesian Testing Algorithm
1. Generate u Unif (0, 1)
If H0 : 0 :
2. Compute x = F 1 (u)
Z
Prior probability P [H0 ] = f () d
0 16.2 The Bootstrap
Z
Posterior probability P [H0 | xn ] = f ( | xn ) d Let Tn = g(X1 , . . . , Xn ) be a statistic.
0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:

Let H0 . . .Hk1 be k hypotheses. Suppose f ( | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ] i. Sample uniformly X1 , . . . , Xn Fbn .
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn = g(X1 , . . . , Xn ).
(b) Then
Marginal likelihood B B
!2
1 X 1 X
vboot = VFbn =
b Tn,b T
B B r=1 n,r
Z
n
f (x | Hi ) = f (xn | , Hi )f ( | Hi ) d b=1

16.2.1 Bootstrap Confidence Intervals
Posterior odds (of Hi relative to Hj )
Normal-based interval
n
P [Hi | x ] n
f (x | Hi ) P [Hi ] Tn z/2 se
b boot
=
P [Hj | xn ] f (xn | Hj ) P [Hj ] Pivotal interval
| {z } | {z }
Bayes Factor BFij prior odds 1. Location parameter = T (F )
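The bootstrap of Section 16.2, together with the normal-based interval T_n ± z_{α/2} se_boot, can be sketched for the sample median; the data, B, and α are illustrative choices.

```python
# Sketch: nonparametric bootstrap variance/SE for the sample median + normal interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=100)
B, alpha = 2000, 0.05

t_star = np.array([np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])
se_boot = t_star.std(ddof=1)
t_n = np.median(x)
z = stats.norm.ppf(1 - alpha / 2)

print(t_n, se_boot, (t_n - z * se_boot, t_n + z * se_boot))
```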
2. Pivot Rn = bn 2. Generate u Unif (0, 1)
3. Let H(r) = P [Rn r] be the cdf of Rn Ln (cand )

3. Accept cand if u
4. Let Rn,b = bn,b bn . Approximate H using bootstrap: Ln (bn )
B
1 X 16.4 Importance Sampling
H(r)
b = I(Rn,b r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q() | xn ]:
5. = sample quantile of (bn,1

, . . . , bn,B ) iid
1. Sample from the prior 1 , . . . , n f ()
6. r = beta sample quantile of (Rn,1

, . . . , Rn,B ), i.e., r = bn
Ln (i )
2. wi = PB i = 1, . . . , B
 
7. Approximate 1 confidence interval Cn = a , b where
i=1 Ln (i )
PB
3. E [q() | xn ] i=1 q(i )wi
b 1 1 =
 

a
= bn H bn r1/2 = 2bn 1/2
2

b = bn Hb 1
=
bn r/2 =
2bn /2 17 Decision Theory
2
Percentile interval   Definitions

Cn = /2 , 1/2 Unknown quantity affecting our decision:
Decision rule: synonymous for an estimator b
16.3 Rejection Sampling Action a A: possible value of the decision rule. In the estimation
context, the action is just an estimate of , (x).
b
Setup
Loss function L: consequences of taking action a when true state is or
We can easily sample from g() discrepancy between and , b L : A [k, ).
We want to sample from h(), but it is difficult Loss functions
k()
We know h() up to a proportional constant: h() = R Squared error loss: L(, a) = ( a)2
k() d (
Envelope condition: we can find M > 0 such that k() M g() K1 ( a) a < 0
Linear loss: L(, a) =
K2 (a ) a 0
Algorithm
Absolute error loss: L(, a) = | a| (linear loss with K1 = K2 )
1. Draw cand g() Lp loss: L(, a) = | a|p
2. Generate u Unif (0, 1)
(
0 a=
k(cand ) Zero-one loss: L(, a) =
3. Accept cand if u 1 a 6=
M g(cand )
4. Repeat until B values of cand have been accepted
17.1 Risk
Example
Posterior risk
We can easily sample from the prior g() = f ()
Z h i
r(b | x) = L(, (x))f
b ( | x) d = E|X L(, (x))
b
Target is the posterior h() k() = f (xn | )f ()
Envelope condition: f (xn | ) f (xn | bn ) = Ln (bn ) M (Frequentist) risk
Algorithm Z h i
1. Draw cand
f () R(, )
b = L(, (x))f
b (x | ) dx = EX| L(, (X))
b
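Rejection sampling as in Section 16.3 can be sketched with a Beta(2, 5) target and a Unif(0, 1) proposal; the target, the envelope constant M, and the number of accepted draws are illustrative choices.

```python
# Sketch: rejection sampling from a Beta(2, 5) target with a uniform proposal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
target = stats.beta(2, 5)
M = 2.5                                   # satisfies target.pdf(x) <= M * 1 on [0, 1]

samples = []
while len(samples) < 10_000:
    cand = rng.uniform()                  # draw from the proposal g = Unif(0, 1)
    u = rng.uniform()
    if u <= target.pdf(cand) / M:         # accept with probability k(cand)/(M g(cand))
        samples.append(cand)

samples = np.array(samples)
print(samples.mean(), target.mean())      # should be close
```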
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, )
b = L(, (x))f
b (x, ) dx d = E,X L(, (X))
b
Response variable Y
Covariate X (aka predictor variable or feature)
h h ii h i
r(f, )
b = E EX| L(, (X)
b = E R(, )
b

18.1 Simple Linear Regression


h h ii h i
r(f, )
b = EX E|X L(, (X)
b = EX r(b | X)
Model
17.2 Admissibility Yi = 0 + 1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = 2
Fitted line
b0 dominates b if
b0 rb(x) = b0 + b1 x
: R(, ) R(, )
b
Predicted (fitted) values
: R(, b0 ) < R(, )
b Ybi = rb(Xi )
b is inadmissible if there is at least one other estimator b0 that dominates Residuals  
it. Otherwise it is called admissible. i = Yi Ybi = Yi b0 + b1 Xi

Residual sums of squares (rss)


17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(b0 , b1 ) = 2i
i=1
r(f, )
b = inf e r(f, )

e
R Least square estimates
(x)
b = inf r(b | x) x = r(f, )
b = r(b | x)f (x) dx
bT = (b0 , b1 )T : min rss

b0 ,
b1
Theorems

Squared error loss: posterior mean b0 = Yn b1 Xn


Pn Pn
Absolute error loss: posterior median i=1 (Xi Xn )(Yi Yn ) i=1 Xi Yi nXY
1 =
b Pn = P n
Zero-one loss: posterior mode i=1 (Xi Xn )
2 2 2
i=1 Xi nX
 
0
h i
E b | X n =
17.4 Minimax Rules 1
2 n1 ni=1 Xi2 X n
h i  P 
Maximum risk V b | X n = 2
)
R( b = sup R(, )
b
R(a) = sup R(, a) nsX X n 1
r Pn
2
i=1 Xi

b
Minimax rule se(
b b0 ) =
) sX n n
sup R(, )
b = inf R( e = inf sup R(, )
e
e e

b
se(
b b1 ) =
sX n
b = Bayes rule c : R(, )
b =c Pn Pn 2
where s2X = n1 i=1 (Xi X n )2 and 1
b2 = n2 i=1 
i (unbiased estimate).
Least favorable prior Further properties:
P P
bf = Bayes rule R(, bf ) r(f, bf ) Consistency: b0 0 and b1 1
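The simple-linear-regression formulas of Section 18.1 (least squares estimates, the unbiased σ̂², and the standard errors) can be applied directly to simulated data; the true coefficients, noise level, and sample size below are arbitrary examples.

```python
# Sketch: closed-form least squares for simple linear regression with standard errors.
import numpy as np

rng = np.random.default_rng(9)
n, b0, b1, sigma = 200, 1.0, 2.5, 1.0
x = rng.uniform(0, 3, n)
y = b0 + b1 * x + rng.normal(0, sigma, n)

xbar, ybar = x.mean(), y.mean()
b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0_hat = ybar - b1_hat * xbar

resid = y - (b0_hat + b1_hat * x)
sigma2_hat = np.sum(resid**2) / (n - 2)            # unbiased estimate of sigma^2
s2x = np.mean((x - xbar) ** 2)
se_b1 = np.sqrt(sigma2_hat / (n * s2x))
se_b0 = np.sqrt(sigma2_hat * np.mean(x**2) / (n * s2x))

print(b0_hat, b1_hat, se_b0, se_b1)
```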
Asymptotic normality: 18.3 Multiple Regression
b0 0 D b1 1 D Y = X + 
N (0, 1) and N (0, 1)
se(
b b0 ) se(
b b1 )
where
Approximate 1 confidence intervals for 0 and 1 :
X11 X1k 1 1
.. .. = ...
.. ..
b0 z/2 se( and b1 z/2 se( X= . =.

b b0 ) b b1 ) . .
Xn1 Xnk k n
Wald test for H0 : 1 = 0 vs. H1 : 1 6= 0: reject H0 if |W | > z/2 where
W = b1 /se(
b b1 ). Likelihood
 
1
R2 L(, ) = (2 2 )n/2 exp 2 rss
Pn b 2
Pn 2 2
i=1 (Yi Y )  rss
2
R = Pn 2
= 1 Pn i=1 i 2 = 1
i=1 (Yi Y ) i=1 (Yi Y )
tss
N
X
Likelihood rss = (y X)T (y X) = kY Xk2 = (Yi xTi )2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) fY |X (Yi | Xi ) = L1 L2
i=1 i=1 i=1 If the (k k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) b = (X T X)1 X T Y
i=1 h i
V b | X n = 2 (X T X)1
n
( )
Y 1 X 2
n
L2 = fY |X (Yi | Xi ) exp 2 Yi (0 1 Xi )
2 i b N , 2 (X T X)1

i=1

Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
 rb(x) = bj xj
n i=1 i j=1

18.2 Prediction Unbiased estimate for 2


Observe X = x of the covariate and want to predict their outcome Y . n
1 X 2
b2 =
  = X b Y
Yb = b0 + b1 x n k i=1 i
h i h i h i h i
V Yb = V b0 + x2 V b1 + 2x Cov b0 , b1 mle
nk 2
Prediction interval
b=X b2 =

 Pn 2
 n
2 2 i=1 (Xi X )
n =
b b P 2j + 1
n i (Xi X) 1 Confidence interval
Yb z/2 bn bj z/2 se(
b bj )
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y for covariates X and let S J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) k
AIC(S) = `n (bS ,
Issues
Bayesian Information Criterion (BIC)
Underfitting: too few covariates yields high bias
Overfitting: too many covariates yields high variance k
bS2 )
BIC(S) = `n (bS , log n
Procedure 2

1. Assign a score to each model Validation and training


2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi (S) Yi )2 m = |{validation data}|, often or
i=1
4 2
H0 : j = 0 vs. H1 : j 6= 0 j J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi Ybi (S)
mspe = E (Yb (S) Y )2 R
bCV (S) = (Yi Yb(i) )2 =
i=1 i=1
1 Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )1 XS (hat matrix)
X X
R(S) = mspei = E (Ybi (S) Yi )2
i=1 i=1

Training error
n
R
btr (S) =
X
(Ybi (S) Yi )2
19 Non-parametric Function Estimation
i=1
2 19.1 Density Estimation
R Pn b 2
R i=1 (Yi (S) Y )
rss(S) btr (S) R
R2 (S) = 1 =1 =1 Estimate f (x), where f (x) = P [X A] = A
f (x) dx.
P n 2
i=1 (Yi Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z  2 Z
h i L(f, fbn ) = f (x) fn (x) dx = J(h) + f 2 (x) dx
b
E R btr (S) < R(S)

h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) R(S) = 2
b b Cov Ybi , Yi
i=1
h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n 1 rss
R2 (S) = 1
n k tss h i
Mallows Cp statistic b(x) = E fbn (x) f (x)
h i
R(S)
b =R 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
19.1.1 Histograms KDE
n  
Definitions 1X1 x Xi
fbn (x) = K
n i=1 h h
Number of bins m
Z Z
1 4 00 2 1
1 R(f, fn ) (hK )
b (f (x)) dx + K 2 (x) dx
Binwidth h = m 4 nh
Bin Bj has j observations c
2/5 1/5 1/5
c2 c3
Z Z
h = 1 c = 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
Define pbj = j /n and pj = Bj f (u) du n1/5
1 K 2 3

Z 4/5 Z 1/5
c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (K ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x Bj )
j=1
h Epanechnikov Kernel
h i pj
E fbn (x) = (
3

h
4 5(1x2 /5)
|x| < 5
h i p (1 p ) K(x) =
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
1 6 n n n  
1 X X Xi Xj
Z

h = 1/3 R 2Xb 2
2 du JbCV (h) = fbn2 (x) dx f(i) (Xi ) K + K(0)
n (f 0 (u)) n i=1 hn2 i=1 j=1 h nh
 2/3 Z 1/3
b C 3 0 2
R (fn , f ) 2/3 C= (f (u)) du
n 4 Z
K (x) = K (2) (x) 2K(x) K (2) (x) = K(x y)K(y) dy
Cross-validation estimate of E [J(h)]

Z
2Xb
n
2 n+1 X 2
m 19.2 Non-parametric Regression
JbCV (h) = fbn2 (x) dx f(i) (Xi ) = pb
n i=1 (n 1)h (n 1)h j=1 j Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by

Yi = r(xi ) + i
19.1.2 Kernel Density Estimator (KDE)
E [i ] = 0
Kernel K V [i ] = 2

K(x) 0 k-nearest Neighbor Estimator


R
K(x) dx = 1

R
xK(x) dx = 0 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}

R 2 2
x K(x) dx K >0 k
i:xi Nk (x)
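The kernel density estimator of Section 19.1.2 can be written out directly; a minimal sketch with a Gaussian kernel and a hand-picked bandwidth (data and h are illustrative, and no bandwidth selection is performed):

```python
# Sketch: f_hat(t) = (1/n) sum_i (1/h) K((t - X_i)/h) with a Gaussian kernel.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(0, 1, 500)
h = 0.3                                   # bandwidth

def kde(t, data, h):
    return np.mean(stats.norm.pdf((t - data[:, None]) / h), axis=0) / h

grid = np.linspace(-3, 3, 7)
print(np.round(kde(grid, x, h), 3))
print(np.round(stats.norm.pdf(grid), 3))  # true density for comparison
```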
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
xxi

K {0, 1, . . . } = Z discrete
wi (x) = h [0, 1] {Xt : t T } T =
[0, )

Pn
K
xxj continuous
j=1 h
4 Z  2
h4 f 0 (x)
Z
2 2 00 0 Notations Xt , X(t)
R(brn , r) x K (x) dx r (x) + 2r (x) dx
4 f (x) State space X
Z 2R 2
K (x) dx Index set T
+ dx
nhf (x)
c1
h 1/5 20.1 Markov Chains
n
c2
R (b
rn , r) 4/5 Markov chain
n

P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ] n T, x X


Cross-validation estimate of E [J(h)]
n
X n
X (Yi rb(xi ))2 Transition probabilities
JbCV (h) = (Yi rb(i) (xi ))2 = !2
i=1 i=1 K(0) pij P [Xn+1 = j | Xn = i]
1 Pn  xx 
j
K
j=1 h pij (n) P [Xm+n = j | Xm = i] n-step

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )


Approximation
J (i, j) element is pij
X X
r(x) = j j (x) j j (x) pij > 0
P
j=1 j=1 i pij = 1
Multivariate regression
Y = + Chapman-Kolmogorov

0 (x1 ) J (x1 ) X
.. .. .. pij (m + n) = pij (m)pkj (n)
where i = i and = . . . k
0 (xn ) J (xn )
Least squares estimator Pm+n = Pm Pn
b = (T )1 T Y
Pn = P P = Pn
1
T Y (for equally spaced observations only)
n Marginal probability
Cross-validation estimate of E [J(h)]
2 n = (n (1), . . . , n (N )) where i (i) = P [Xn = i]
n J
R
bCV (J) =
X
Yi
X
j (xi )bj,(i) 0 , initial distribution
i=1 j=1 n = 0 Pn
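The Nadaraya-Watson kernel regression estimator described above is a weighted average of the responses; a minimal sketch with a Gaussian kernel (the regression function, noise level, and bandwidth are illustrative):

```python
# Sketch: Nadaraya-Watson estimator r_hat(x0) = sum_i w_i(x0) Y_i, w_i proportional to K((x0 - x_i)/h).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(0, 0.3, x.size)
h = 0.4

def nw(x0, x, y, h):
    w = stats.norm.pdf((x0 - x) / h)      # kernel weights
    return np.sum(w * y) / np.sum(w)

grid = np.array([0.5, 1.5, 3.0, 4.5])
print([round(nw(g, x, y, h), 3) for g in grid])
print(np.round(np.sin(grid), 3))          # true r(x) for comparison
```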
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] (s, t)
(s, t) = p =p
{Xt : t [0, )} = number of events up to and including time t V [xs ] V [xt ] (s, s)(t, t)
X0 = 0
Independent increments: Cross-covariance function (CCV)
t0 < < tn : Xt1 Xt0
Xtn Xtn1
xy (s, t) = E [(xs xs )(yt yt )]
Intensity function (t)
P [Xt+h Xt = 1] = (t)h + o(h) Cross-correlation function (CCF)
P [Xt+h Xt = 2] = o(h)
xy (s, t)
Xs+t Xs Po (m(s + t) m(s)) where m(t) =
Rt
(s) ds xy (s, t) = p
0 x (s, s)y (t, t)
Homogeneous Poisson process
Backshift operator
(t) = Xt Po (t) >0
B k (xt ) = xtk
Waiting times
Wt := time at which Xt occurs Difference operator
 
1 d = (1 B)d
Wt Gamma t,

Interarrival times White noise
St = Wt+1 Wt
2
 
1 wt wn(0, w )
St Exp iid 2

Gaussian: wt N 0, w
E [wt ] = 0 t T
St V [wt ] = 2 t T
w (s, t) = 0 s 6= t s, t T
Wt1 Wt t

Random walk
21 Time Series
Drift
Pt
Mean function Z
xt = t + j=1 wj
xt = E [xt ] = xft (x) dx E [xt ] = t

Autocovariance function Symmetric moving average

x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t k


X k
X
mt = aj xtj where aj = aj 0 and aj = 1
x (t, t) = E (xt t )2 = V [xt ]
 
j=k j=k
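A homogeneous Poisson process as in Section 20.2 can be simulated from its iid exponential interarrival times; a minimal sketch checking E[X_t] = λt (the rate, horizon, and number of replications are illustrative):

```python
# Sketch: simulate a rate-lambda Poisson process and check E[X_t] = lambda * t.
import numpy as np

rng = np.random.default_rng(12)
lam, t_max, reps = 3.0, 10.0, 2000

counts = []
for _ in range(reps):
    # interarrival times are iid Exponential with mean 1/lambda
    arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=int(5 * lam * t_max)))
    counts.append(np.searchsorted(arrivals, t_max))   # number of events up to t_max

print(np.mean(counts), lam * t_max)       # should be close
```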
21.1 Stationary Time Series Sample variance
n  
Strictly stationary 1 X |h|
V [
x] = 1 x (h)
n n
P [xt1 c1 , . . . , xtk ck ] = P [xt1 +h c1 , . . . , xtk +h ck ] h=n

k N, tk , ck , h Z Sample autocovariance function

Weakly stationary nh
1 X
 
b(h) = (xt+h x
)(xt x
)
E x2t < t Z n t=1
 2
E xt = m t Z
x (s, t) = x (s + r, t + r) r, s, t Z Sample autocorrelation function
Autocovariance function

b(h)
b(h) =
(h) = E [(xt+h )(xt )] h Z
b(0)
 
(0) = E (xt )2
(0) 0 Sample cross-variance function
(0) |(h)|
nh
(h) = (h) 1 X

bxy (h) = (xt+h x
)(yt y)
n t=1
Autocorrelation function (ACF)

Cov [xt+h , xt ] (t + h, t) (h) Sample cross-correlation function


x (h) = p =p =
V [xt+h ] V [xt ] (t + h, t + h)(t, t) (0)

bxy (h)
Jointly stationary time series bxy (h) = p
bx (0)b
y (0)
xy (h) = E [(xt+h x )(yt y )]
Properties
xy (h)
xy (h) = p 1
x (0)y (h) bx (h) = if xt is white noise
n
Linear process 1
bxy (h) = if xt or yt is white noise

X
X n
xt = + j wtj where |j | <
j= j=


21.3 Non-Stationary Time Series
X
2
(h) = w j+h j Classical decomposition model
j=

xt = t + st + wt
21.2 Estimation of Correlation
Sample mean t = trend
n
1X st = seasonal component
x
= xt
n t=1 wt = random noise term
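The sample autocovariance and autocorrelation defined in Section 21.2 are straightforward to compute; a minimal sketch on simulated white noise, for which ρ̂(h) should be near 0 for h ≠ 0 with standard deviation about 1/√n:

```python
# Sketch: sample autocovariance gamma_hat(h) and autocorrelation rho_hat(h).
import numpy as np

rng = np.random.default_rng(13)
x = rng.normal(0, 1, 500)
n, xbar = len(x), x.mean()

def gamma_hat(h):
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n

rho_hat = [gamma_hat(h) / gamma_hat(0) for h in range(4)]
print(np.round(rho_hat, 3))               # approximately [1, 0, 0, 0]
```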
21.3.1 Detrending Moving average polynomial
Least squares (z) = 1 + 1 z + + q zq z C q 6= 0
2
1. Choose trend model, e.g., t = 0 + 1 t + 2 t
Moving average operator
2. Minimize rss to obtain trend estimate bt = b0 + b1 t + b2 t2
3. Residuals , noise wt (B) = 1 + 1 B + + p B p
Moving average MA (q) (moving average model order q)
1
The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + 1 wt1 + + q wtq xt = (B)wt
k q
1 X X
vt = xt1 E [xt ] = j E [wtj ] = 0
2k + 1
i=k j=0
Pk ( Pqh
1 2
If 2k+1 i=k wtj 0, a linear trend function t = 0 + 1 t passes
w j=0 j j+h 0hq
(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + wt1
t = 0 + 1 t = xt = 1
2 2
(1 + )w h = 0

2
21.4 ARIMA models (h) = w h=1

0 h>1

Autoregressive polynomial
(

(z) = 1 1 z p zp z C p 6= 0 2 h=1
(h) = (1+ )
0 h>1
Autoregressive operator
ARMA (p, q)
(B) = 1 1 B p B p
xt = 1 xt1 + + p xtp + wt + 1 wt1 + + q wtq
Autoregressive model order p, AR (p)
(B)xt = (B)wt
xt = 1 xt1 + + p xtp + wt (B)xt = wt
Partial autocorrelation function (PACF)
AR (1) xih1 , regression of xi on {xh1 , xh2 , . . . , x1 }
k1 hh = corr(xh xh1
h , x0 xh1
0 ) h2
X k,||<1 X
xt = k (xtk ) + j (wtj ) = j (wtj ) E.g., 11 = corr(x1 , x0 ) = (1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P j
d xt = (1 B)d xt is ARMA (p, q)
E [xt ] = j=0 (E [wtj ]) = 0
2 h
w (B)(1 B)d xt = (B)wt
(h) = Cov [xt+h , xt ] = 12
(h) Exponentially Weighted Moving Average (EWMA)
(h) = (0) = h
(h) = (h 1) h = 1, 2, . . . xt = xt1 + wt wt1

X Frequency index (cycles per unit time), period 1/
xt = (1 )j1 xtj + wt when || < 1
j=1
Amplitude A
Phase
n+1 = (1 )xn +
x xn
U1 = A cos and U2 = A sin often normally distributed rvs
Seasonal ARIMA
Periodic mixture
Denoted by ARIMA (p, d, q) (P, D, Q)s
q
P (B s )(B)D d s
s xt = + Q (B )(B)wt X
xt = (Uk1 cos(2k t) + Uk2 sin(2k t))
k=1
21.4.1 Causality and Invertibility
P Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2
ARMA (p, q) is causal (future-independent) {j } : j=0 j < such that Pq
(h) = k=1 k2 cos(2k h)
  Pq

X (0) = E x2t = k=1 k2
xt = wtj = (B)wt
j=0 Spectral representation of a periodic process
P
ARMA (p, q) is invertible {j } : j=0 j < such that (h) = 2 cos(20 h)
2 2i0 h 2 2i0 h
X = e + e
(B)xt = Xtj = wt 2 2
Z 1/2
j=0
= e2ih dF ()
Properties 1/2

ARMA (p, q) causal roots of (z) lie outside the unit circle Spectral distribution function


X (z)
j 0
< 0
(z) = j z = |z| 1
(z) F () = 2 /2 < 0
j=0
2
0
ARMA (p, q) invertible roots of (z) lie outside the unit circle
F () = F (1/2) = 0

X (z) F () = F (1/2) = (0)
(z) = j z j = |z| 1
j=0
(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models
X 1 1
AR (p) MA (q) ARMA (p, q) f () = (h)e2ih
2 2
h=
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P R 1/2
Needs h= |(h)| < = (h) = 1/2
e2ih f () d h = 0, 1, . . .
21.5 Spectral Analysis f () 0
f () = f ()
Periodic process f () = f (1 )
R 1/2
xt = A cos(2t + ) (0) = V [xt ] = 1/2 f () d
2
= U1 cos(2t) + U2 sin(2t) White noise: fw () = w
ARMA (p, q) , (B)xt = (B)wt : 22.2 Beta Function
Z 1
(x)(y)
|(e2i )|2
2 Ordinary: B(x, y) = B(y, x) = tx1 (1 t)y1 dt =
fx () = w 0 (x + y)
|(e2i )|2 Z x
a1 b1
Pp Pq Incomplete: B(x; a, b) = t (1 t) dt
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k 0
Regularized incomplete:
Discrete Fourier Transform (DFT) a+b1
B(x; a, b) a,bN X (a + b 1)!
Ix (a, b) = = xj (1 x)a+b1j
n
X B(a, b) j=a
j!(a + b 1 j)!
d(j ) = n1/2 xt e2ij t
I0 (a, b) = 0 I1 (a, b) = 1
i=1
Ix (a, b) = 1 I1x (b, a)
Fourier/Fundamental frequencies
22.3 Series
j = j/n
Finite Binomial
Inverse DFT n n  
n1 X n(n + 1) X n
= 2n
X
xt = n 1/2
d(j )e 2ij t k=
2 k
j=0 k=1 k=0
n n    
X X r+k r+n+1
Periodogram (2k 1) = n2 =
I(j/n) = |d(j/n)|2 k n
k=1 k=0
n n    
Scaled Periodogram
X n(n + 1)(2n + 1) X k n+1
k2 = =
6 m m+1
k=1 k=0
4 n
P (j/n) = I(j/n) X 
n(n + 1)
2 Vandermondes Identity:
n k3 = r  
m n
 
m+n

2
!2 !2 X
n n k=1 =
2X 2X n k rk r
= xt cos(2tj/n + xt sin(2tj/n cn+1 1 k=0
n t=1 n t=1
X
ck = c 6= 1 Binomial Theorem:
c1 n  
n nk k
k=0
X
a b = (a + b)n
22 Math k
k=0

22.1 Gamma Function Infinite


Z

Ordinary: (s) = ts1 et dt X 1 X p
0 pk = , pk = |p| < 1
Z 1p 1p
k=0 k=1
Upper incomplete: (s, x) = ts1 et dt
!  
X d X d 1 1
Z xx kpk1 = pk
= = |p| < 1
dp dp 1 p (1 p)2
Lower incomplete: (s, x) = ts1 et dt k=0 k=0
0  
X r+k1 k
( + 1) = () >1 x = (1 x)r r N+
k
(n) = (n 1)! nN k=0
 
(0) = (1) = X k
p = (1 + p) |p| < 1 , C
(1/2) = k
k=0
(1/2) = 2(1/2)
22.4 Combinatorics

Sampling (k out of n)

                 w/o replacement                                          w/ replacement
ordered          n^{\underline{k}} = \prod_{i=0}^{k-1} (n - i) = \dfrac{n!}{(n - k)!}      n^k
unordered        \binom{n}{k} = \dfrac{n!}{k!(n - k)!}                    \binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}

Stirling numbers, 2nd kind

\left\{ {n \atop k} \right\} = k \left\{ {n - 1 \atop k} \right\} + \left\{ {n - 1 \atop k - 1} \right\},  1 \le k \le n
\left\{ {n \atop 0} \right\} = 1 if n = 0, and 0 otherwise

Partitions

P_{n+k,\,k} = \sum_{i=1}^{n} P_{n,i};   P_{n,k} = 0 for k > n;   P_{n,0} = 0 for n \ge 1;   P_{0,0} = 1

Balls and Urns (f : B \to U with |B| = n, |U| = m; D = distinguishable, \neg D = indistinguishable)

                      f arbitrary                                    f injective                  f surjective                               f bijective
B: D,  U: D           m^n                                            m^{\underline{n}}            m! \left\{ {n \atop m} \right\}            n! if m = n, else 0
B: \neg D, U: D       \binom{m + n - 1}{n}                           \binom{m}{n}                 \binom{n - 1}{m - 1}                       1 if m = n, else 0
B: D,  U: \neg D      \sum_{k=1}^{m} \left\{ {n \atop k} \right\}    1 if m \ge n, else 0         \left\{ {n \atop m} \right\}               1 if m = n, else 0
B: \neg D, U: \neg D  \sum_{k=1}^{m} P_{n,k}                         1 if m \ge n, else 0         P_{n,m}                                    1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
Univariate distribution relationships, courtesy Leemis and McQueston [2].