Generalised Linear Models and Bayesian Statistics
Clare Walker
Components:
Generalized linear models
Decision theory and Bayesian statistics
2. explanatory variables: X = (x_1^T, …, x_N^T)^T, an N×p matrix with rows x_i^T = (x_i1, …, x_ip), and a set of parameters β_1, …, β_p (p < N)
3. Link function: g represents the transformation of μ_i that will be modeled as a linear function of the explanatory variables. Must be monotone and differentiable. Examples:
o identity link: η_i = μ_i (z = y in this case)
o log link: η_i = log(μ_i)
o logit link: η_i = log( μ_i / (1 − μ_i) )
o inverse link: η_i = 1/μ_i
o complementary log-log link: η_i = log( −log(1 − μ_i) )
Basic set up:
Expectation as a function of the underlying distribution: μ_i = E[Y_i] = function of θ
Expectation as a function of the explanatory variables: η_i = g(μ_i) = x_i^T β = linear function of the β_j
Fitting distribution of response
Choice of distribution depends on: discrete (categorical, count, binary) vs continuous, symmetrical vs
asymmetric (right, left), natural bounds, likelihood of observing extreme values
non-central chi-square: ∑_i (Z_i + μ_i)² ~ χ²_(n, λ) where Z_i ~ N(0, 1)
gamma (R parameterisation: α = shape, β = rate): f_Y(y) = β^α y^(α−1) e^(−βy) / Γ(α), E[Y] = α/β, Var[Y] = α/β²; skewed to the right, non-negative; e.g. claim size, annual income, lab assays
inverse Gaussian: f_Y(y) = (2π y³ σ²)^(−1/2) exp( −(y − μ)² / (2 y μ² σ²) ), E[Y] = μ, Var[Y] = μ³σ²; named for its inverse relationship to the Gaussian; very skewed to the right, non-negative; for extremely skew data
weibull (λ = shape, θ = scale): λ = 1 gives the exponential; very skewed to the right, non-negative; e.g. lifetimes, failure times
Exponential family: f(y; θ) = exp( a(y) b(θ) + c(θ) + d(y) )
Properties:
1. E[a(Y)] = −c′(θ)/b′(θ)    use: (∂/∂θ) ∫ f(y) dy = 0 = ∫ (∂/∂θ) f(y) dy
2. Var[a(Y)] = ( b″(θ)c′(θ) − b′(θ)c″(θ) ) / b′(θ)³    use: (∂²/∂θ²) ∫ f(y) dy = 0 = ∫ (∂²/∂θ²) f(y) dy
3. U = dl/dθ = a(y) b′(θ) + c′(θ)
4. J = Var[U] = b″(θ)c′(θ)/b′(θ) − c″(θ)
5. J = E[U²] = −E[U′]
o interpretation: the entries in J represent the curvature of l around θ^ . low curvature ⇒ U
is near 0 for many values of θ ⇒ variance of θ^ is large
Notes:
canonical form: a(y) = y. in this case b(θ) is called the natural parameter.
canonical link: g ( μi ) = b(θ)
Estimation
Set up: Y 1 ,… , Y N share same distribution but parameter values may vary ⇒ PDF = f ( y i ; θi )
Estimating θ: method of moments, exact maximum likelihood, maximum likelihood using the method of scoring
Method of maximum likelihood: find θ̂ such that l(θ) is maximized ⇒ U(θ̂) = 0
joint PDF: f(y_1, …, y_N; θ_1, …, θ_N) = exp( ∑ a(y_i) b(θ_i) + ∑ c(θ_i) + ∑ d(y_i) )    look at data given parameters
likelihood: L(θ_1, …, θ_N; y_1, …, y_N) = exp( ∑ a(y_i) b(θ_i) + ∑ c(θ_i) + ∑ d(y_i) )    look at parameters given data
Score function (now assume θ_i = θ): U = dl/dθ = ∑_i dl_i/dθ = ∑_i [ a(y_i) b′(θ) + c′(θ) ] = 0
o solve analytically for θ̂
o solve iteratively using: θ^(m) = θ^(m−1) − U^(m−1)/U′^(m−1) = θ^(m−1) + U^(m−1)/J^(m−1)
Variance-covariance: J =Var [ U ]
Estimating β: maximum likelihood estimates (MLE) using iteratively re-weighted least squares (IRLS)
Score function: U_j = dl/dβ_j = ∑_i (dl_i/dθ_i)(dθ_i/dμ_i)(dμ_i/dβ_j) = ∑_i [ (y_i − μ_i)/Var(Y_i) ] x_ij (dμ_i/dη_i)
Variance-covariance: J_jk = E[U_j U_k] = ∑_i [ x_ij x_ik / Var(Y_i) ] (dμ_i/dη_i)²
Variance-covariance matrix: J = E[U U^T] = X^T W X where W is diagonal with w_ii = (1/Var(Y_i)) (dμ_i/dη_i)²
transformed response variable: z_i = ∑_k x_ik b_k^(m−1) + (y_i − μ_i)(dη_i/dμ_i)
Solve for β iteratively using: X^T W X b^(m) = X^T W z    use: b^(m) = b^(m−1) + J^(−1) U^(m−1)
o b^(m) = (X^T W X)^(−1) X^T W z
o Y_i ~ N(μ_i, σ²) and identity link used ⇒ W = (1/σ²) I and z = Y ⇒ b = (X^T X)^(−1) X^T Y
use observed means as initial estimates
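A minimal R sketch of the IRLS recursion above for a Poisson GLM with log link (the design matrix X, response y and function name are illustrative; glm(y ~ x, family = poisson) performs the same fit):

# IRLS for a Poisson GLM with log link: mu = exp(X b), W = diag(mu), z = eta + (y - mu)/mu
irls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
  b <- rep(0, ncol(X))
  b[1] <- log(mean(y))                      # start from the observed mean
  for (m in seq_len(maxit)) {
    eta <- drop(X %*% b)
    mu  <- exp(eta)                         # inverse link
    W   <- diag(mu)                         # w_ii = (1/Var(Y_i)) (dmu/deta)^2 = mu_i
    z   <- eta + (y - mu) / mu              # transformed response
    b_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
    if (max(abs(b_new - b)) < tol) { b <- b_new; break }
    b <- b_new
  }
  drop(b)
}

# example with simulated data; the coefficients should agree with glm()
set.seed(1)
x <- runif(100)
X <- cbind(1, x)
y <- rpois(100, exp(0.5 + 1.2 * x))
irls_poisson(X, y)
coef(glm(y ~ x, family = poisson))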
Inference
Confidence interval:
o what: a random interval which contains the true parameter A% of the time
o uses: answers questions about the relationship between response and explanatory variables with more accuracy; the width gives a measure of precision
o requires: sampling distribution of the parameter estimates
Hypothesis test:
o what: compares how well two related models fit the data based on a "goodness of fit" statistic (related: same pdf, same link, different explanatory variables)
o uses: find the simplest model that accurately describes the data
o requires: sampling distribution of the goodness-of-fit statistic
Standardizing results: let S = (s_1, …, s_p)^T be a sample statistic with covariance matrix J:
S − E[S] ⩪ MVN(0, J)
(S − E[S])^T J^(−1) (S − E[S]) ⩪ χ²_p
o if S has an underlying normal distribution (linear combination of normal rv), results are
exact
o if not, results hold under CLT (regularity conditions satisfied by all exponential family
distributions)
Score statistic: U^T J^(−1) U ⩪ χ²_p
related result: U ~ MVN(0, J)
Derivation: E[U] = 0, Var[U] = J and apply standardizing result
uses: typically for inference about distribution parameters
Maximum likelihood estimators: b ⩪ MVN(β, J^(−1))
∴ J^(−1) = [ var(b_1) ⋯ cov(b_1, b_p); ⋮ ⋱ ⋮; cov(b_1, b_p) ⋯ var(b_p) ] = variance-covariance matrix of the estimates
uses: inference about distribution and beta parameters, confidence intervals
confidence interval for a slope including an interaction term: e.g. Y = β_0 + β_1X_1 + β_2X_2 + β_3X_1X_2. Let β* = slope of X_1 (when X_2 = 1):
o b* = b_1 + b_3 and var(b*) = var(b_1) + var(b_3) + 2cov(b_1, b_3)
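A small R sketch of this calculation on simulated data (variable names are illustrative); the estimate and standard error of b* come straight from coef() and vcov():

# 95% CI for beta* = beta1 + beta3 in a model y ~ x1 * x2
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rbinom(50, 1, 0.5))
dat$y <- 1 + 2 * dat$x1 + 0.5 * dat$x2 + 1.5 * dat$x1 * dat$x2 + rnorm(50)
fit <- lm(y ~ x1 * x2, data = dat)
V   <- vcov(fit)                               # estimated covariance matrix of the coefficients
b_star  <- coef(fit)["x1"] + coef(fit)["x1:x2"]
se_star <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2 * V["x1", "x1:x2"])
b_star + c(-1, 1) * qnorm(0.975) * se_star     # approximate 95% confidence interval for beta*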
Examples:
Binomial (θ = π): (y − nπ)² / ( nπ(1 − π) ) ⩪ χ²_1
Poisson (θ = λ): (y − λ)² / λ ⩪ χ²_1
Normal (θ = μ): (ȳ − μ)² / (σ²/N) ⩪ χ²_1
Log-likelihood ratio (deviance) statistic: D = 2( l(b_max; y) − l(b; y) ) ⩪ χ²(m − p, ν)
maximal/full/saturated model: no summarizing has taken place ⇒ N parameters (one for each observation); l(b_max; y) will always be the maximum possible log-likelihood for the assumed pdf and link, as it provides the most complete information
large values of D suggest that the fitted model is a poor description of the data relative to the maximal model
Derivation: Taylor expansion (to the quadratic term) of l about b ⇒ 2( l(b) − l(β) ) = (b − β)^T J (b − β) ⩪ χ²_p
D = 2( l(b_max) − l(β_max) ) − 2( l(b) − l(β) ) + 2( l(β_max) − l(β) ) = χ²_m − χ²_p + ν (positive constant)
conditioning on / subtracting the log-likelihood at the true parameter values ensures independence
uses: hypothesis testing for nested models when choosing the simplest but best fitting model; goodness of fit of a single model by comparing it to the maximal model (only if n_i > 1)
Hypothesis testing: (best when distribution does not have any nuisance parameters)
o H0: β = β_0 = [β_1 ⋯ β_q]^T corresponding to M0, which has deviance D0
o H1: β = β_1 = [β_1 ⋯ β_p]^T corresponding to M1, which has deviance D1, with q < p < N
o test statistic: ΔD = D0 − D1 = 2( l(b_1) − l(b_0) ) ⩪ χ²_(p−q) if both models hold
o interpretation: if ΔD is consistent with χ²_(p−q), choose M0 as it is simpler
  if ΔD is in the critical region of χ²_(p−q), reject M0 as M1 describes the data significantly better
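A hedged R illustration of this nested-model comparison on simulated count data (formulas and values are made up); anova() reports exactly ΔD = D0 − D1 against χ²_(p−q):

# compare nested Poisson models M0 (q parameters) and M1 (p parameters)
set.seed(2)
dat <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
dat$y <- rpois(80, exp(0.3 + 0.6 * dat$x1))
m0 <- glm(y ~ x1,      family = poisson, data = dat)   # simpler model
m1 <- glm(y ~ x1 + x2, family = poisson, data = dat)   # larger model
anova(m0, m1, test = "Chisq")   # Delta D = D0 - D1 compared with chi-square on p - q df
# for Normal responses, use test = "F" to account for the estimation of sigma^2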
Examples:
Binomial: D = 2 ∑ [ y_i log( y_i / ŷ_i ) + (n_i − y_i) log( (n_i − y_i) / (n_i − ŷ_i) ) ] ⩪ χ²_(n−p) where ŷ_i = n_i π̂_i
Poisson: D = 2 ∑ [ y_i log( y_i / ŷ_i ) − (y_i − ŷ_i) ] ⩪ χ²_(n−p) where ŷ_i = λ̂_i; ≈ 2 ∑ y_i log( y_i / ŷ_i ) when ∑ y_i = ∑ ŷ_i ⇒ 2 ∑ o_i log( o_i / ê_i )
Normal: D = (1/σ²) ∑ ( y_i − μ̂_i )² ⩪ χ²_(n−p)
dispersion in R output = scaled deviance / degrees of freedom = ∑ ( y_i − μ̂_i )² / (N − p)
scaled deviance: D_1 = σ² D = ∑ ( y_i − μ̂_i )² (in R: the null and residual deviance)
Chi-square test: Q = (D0 − D1) / (dispersion of the largest model) ⩪ χ²_(p−q) (does not account for the estimation of σ²)
F test: to account for the estimation of σ² we use F = [ (D0 − D1)/(p − q) ] / [ D1/(N − p) ] ~ F_(p−q, N−p) under H0
Goodness of fit statistics
p = k + 1 = number of unknown parameters,
m = number of covariate patterns (binary/logistic)
Log-likelihood ratio statistic: D = 2( l(b_max; y) − l(b; y) ) ⩪ χ²(m − p) = ∑_{k=1}^{m} d_k²
Pearson chi-square statistic: X² = ∑_{k=1}^{m} X_k²
chi-square residual: X_k = ( y_k − E(y_k) ) / √var(y_k) (moments of y_k are expressed in terms of fitted parameters)
standardized X_k: r_Pk = X_k / √(1 − h_k)
X² is asymptotically equal to the deviance; the chi-square approximation is poor when there are small expected frequencies
assesses: how well the model fits the data ⇒ a small value of the statistic (large p-value) indicates that the model fits the data well
The Hosmer-Lemeshow statistic: χ²_HL ⩪ χ²_(g−2)
groups the predicted probabilities into a number of groups (g) with approximately equal numbers of observations in each group. The observed numbers of successes and failures in each of the g groups are summarized and the Pearson chi-square statistic for the g×2 contingency table is calculated as a measure of fit.
assesses: how well the model fits the data ⇒ a small value of the statistic (large p-value) indicates that the model fits the data well
Akaike information criterion: AIC = −2 l(b; y) + 2p
assesses: how well the model fits the data, penalized for the number of parameters ⇒ a smaller value indicates a better model
use for: comparing two models that are not nested
Bayesian information criterion: BIC = −2 l(b; y) + p ln(number of observations)
Measures of influence
Leverage: h_k is the kth diagonal element of the hat matrix H = W^(1/2) X (X^T W X)^(−1) X^T W^(1/2)
measure of the leverage of covariate pattern k: how "far" the covariate pattern lies from the average covariate pattern and thus how much of an effect it has on the estimated model
Cook's distance: D_i = (1/p) ( h_i / (1 − h_i) ) r²_Pi
Model checking
1. The form of the linear predictor, i.e. is the linear component of the model adequate?
   scatterplot of r_Dk vs x_k^T b, scatterplot of r_Pk vs x_k^T b (linear predictor, in R: predict(model))
   a systematic pattern indicates the model is incorrect
2. The adequacy of the link function, i.e. is the logit transformation appropriate?
3. Outlying and/or influential observations (rather identify these visually than use cut-off values; state the properties of the observation)
   outliers: |r_Pk| > 2
   influence of the ith covariate pattern on the estimated model: h_i > 2p/n
   influence of the ith covariate pattern on the parameter vector: D_i
   influence of the ith covariate pattern on parameter b_j: dfbeta_ij
4. Is Var(y_i) correct or is there over-dispersion – indicated by residual deviance/df ≥ 2
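A short R sketch of these checks for a fitted glm (the Poisson fit and simulated data are only placeholders for any fitted model):

# diagnostics for a fitted glm
set.seed(8)
dat <- data.frame(x = rnorm(100))
dat$y <- rpois(100, exp(0.2 + 0.5 * dat$x))
model <- glm(y ~ x, family = poisson, data = dat)

eta <- predict(model)                        # linear predictor x_k^T b
rP  <- residuals(model, type = "pearson")    # Pearson residuals
rD  <- residuals(model, type = "deviance")   # deviance residuals
h   <- hatvalues(model)                      # leverages h_k
D   <- cooks.distance(model)                 # Cook's distances

plot(eta, rD); abline(h = 0)                 # look for a systematic pattern
which(abs(rP / sqrt(1 - h)) > 2)             # possible outliers
which(h > 2 * length(coef(model)) / nrow(dat))   # high-leverage observations
deviance(model) / df.residual(model)         # rough check for over-dispersion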
GLMs for discrete response
covariate pattern: group of response variables described by the same explanatory variables
in grouped data: each group is a covariate pattern (ni is number in each group)
in ungrouped data: response variables with the same values for all explanatory variables.
Increasing number of explanatory variables, especially continuous ones, increases the number of
distinct covariate patterns (ni is number sharing these values)
binomial response – interested in: relative odds and cumulative incidence of success; response: cumulative count data for a particular covariate pattern; explanatory variables: categorical and continuous
poisson response – interested in: incidence per exposure unit, incidence rate for a given period of time; response: count data per unit exposure for a particular covariate pattern; explanatory variables: categorical and continuous; exposure: varying amounts
log-linear model – interested in: associations between various explanatory variables; response: cell frequency in a cross-classified table; explanatory variables: all categorical, treated identically; exposure: must be constant
we can aggregate observations in the ith group so that π_ij = π_i: Y_i = ∑_{j=1}^{n_i} Z_ij ~ Bin(n_i, π_i) and has E[Y_i] = n_i π_i
we can summarize information by modeling π_i as a function of the covariates: g(π_i) = x_i^T β ⇒ π̂_i = g^(−1)(x_i^T β̂)
to ensure π ∈ [0, 1], it is often modeled as π_i = ∫_{−∞}^{d_i} f_S(s) ds where f_S(s) is a PDF called the tolerance distribution
experimental units have different tolerance levels (the level at which the outcome changes) to the conditions we place them under
from observing the levels we can construct a density f_S(s) describing the trend in tolerance levels
for a particular unit with tolerance level s, π_i = P(success) = P(S ≤ d_i) = ∫_{−∞}^{d_i} f_S(s) ds
d_i is a particular level of exposure (single explanatory variable)
common link functions and associated tolerance dbns:
Logistic (logit) link: log( π_i / (1 − π_i) ) = x_i^T β
o symmetric, linear between 0.2 and 0.8, underlying tolerance dbn is the logistic dbn, commonly used in the medical field
o fitted values are the logarithm of the "odds"
Probit link: π_i = Φ( (t_i − μ)/σ ) ⇒ Φ^(−1)(π_i) = x_i^T β*
Fitting the logistic model:
ŷ_i = n_i π̂_i = n_i e^(x_i^T b) / ( 1 + e^(x_i^T b) )
w_ii = n_i π_i (1 − π_i)
z_i = η_i + ( y_i − n_i π_i ) / ( n_i π_i (1 − π_i) )
D = 2 ∑ [ y_i log( y_i / ŷ_i ) + (n_i − y_i) log( (n_i − y_i) / (n_i − ŷ_i) ) ] ⩪ χ²_(n−p) where ŷ_i = n_i π̂_i
Model checking:
X_i = ( y_i − n_i π̂_i ) / √( n_i π̂_i (1 − π̂_i) )
d_i = sign( y_i − n_i π̂_i ) × √( 2[ y_i log( y_i / (n_i π̂_i) ) + (n_i − y_i) log( (n_i − y_i) / (n_i − n_i π̂_i) ) ] )
when n_i = 1 and we are dealing with ungrouped data, the chi-square approximation does NOT hold and it is safest to look at the change in deviance for nested models rather than using the deviance as a goodness-of-fit statistic for a given model
Interpretation:
e^(η_i): odds of success for covariate pattern i
e^(η_i) / e^(η_j): odds ratio of success for covariate pattern i relative to covariate pattern j
e^(β_i): interpret on the exponential scale, any calculation done on the coefficient scale
o dichotomous categorical: OR of success for presence of i relative to absence
o polychotomous categorical: OR of success for presence of i relative to the reference category
o continuous: estimated change in the OR corresponding to a one unit increase in X_j
  e^(r β_i) is the estimated change in the OR corresponding to an r unit increase in X_j
β_i: logarithm of the relevant odds ratio / estimated change in the log of the relevant odds ratio
(writing A = e^(β_i))
0 < e^(β_i) < 1: (1 − A)×100% decrease in the relative odds of success
1 < e^(β_i) < 2: (A − 1)×100% increase in the relative odds of success
e^(β_i) > 2: (A − 1)×100% increase in the relative odds of success / A-fold increase in the relative odds of success
notes:
predict(model) = log( π_i / (1 − π_i) ) = x_i^T b
fitted(model) = π̂_i
in R: cbind(y, n − y) ~ explanatory variables, family = binomial
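A hedged R sketch of a grouped logistic regression illustrating the quantities above (dose–response data simulated; names illustrative):

# grouped binomial data: y successes out of n at each dose
set.seed(3)
dose <- seq(1, 2, by = 0.1)
n    <- rep(50, length(dose))
y    <- rbinom(length(dose), n, plogis(-8 + 5 * dose))

fit <- glm(cbind(y, n - y) ~ dose, family = binomial)   # logit link by default
summary(fit)
exp(coef(fit)["dose"])                  # estimated change in the odds for a one unit increase in dose
exp(confint.default(fit))["dose", ]     # CI on the odds-ratio scale (calculated on the coefficient scale)
predict(fit)                            # log-odds x_i^T b
fitted(fit)                             # estimated probabilities pi_hat_i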
Poisson regression
Yi :number of successes from exposure ni for covariate pattern i, i = 1, …, N
μi = niθi: expected number of events
ni: number of units / exposure ⇒ log ( ni ) is offset parameter depicting varying exposure
θi: per unit incidence / success rate
model: log(μ_i) = log(n_i) + x_i^T β = log(n_i) + η_i, i.e. μ_i = n_i e^(η_i), θ_i = e^(η_i)
ŷ_i = μ̂_i = n_i e^(x_i^T b)
w_ii = μ_i
z_i = log(μ_i) + ( y_i − μ_i ) / μ_i
l(μ; y) = ∑ ( y_i log(μ_i) − μ_i − log(y_i!) )
D = 2 ∑ [ y_i log( y_i / ŷ_i ) − ( y_i − ŷ_i ) ] ⩪ χ²_(n−p) ≈ 2 ∑ o_i log( o_i / ê_i )
Model checking: let o_i = y_i, e_i = n_i e^(x_i^T b)
X_i = ( o_i − e_i ) / √e_i
d_i = sign( o_i − e_i ) × √( 2[ o_i log( o_i / e_i ) − (o_i − e_i) ] )
Interpretation:
e^(η_i): rate of success for covariate pattern i
e^(η_i) / e^(η_j): rate ratio of success of covariate pattern i relative to covariate pattern j
e^(β_j): interpret on the exponential scale, manipulate in linear form (forming CIs)
o dichotomous categorical: rate ratio of success for presence vs absence
o polychotomous categorical: rate ratio of success for presence of j relative to the reference category
o continuous: multiplicative increase in E(Y_i) associated with a one unit increase in x_j
  e^(r β_i) is the estimated change in the incidence rate of success corresponding to an r unit increase in X_j
β_i: difference in log(μ_i) for one group compared to another group
(writing A = e^(β_i))
0 < e^(β_i) < 1: (1 − A)×100% decrease in the incidence rate of success
1 < e^(β_i) < 2: (A − 1)×100% increase in the incidence rate of success
e^(β_i) > 2: (A − 1)×100% increase in the incidence rate of success / A-fold increase in the incidence rate of success
notes:
predict(model) = log(n_i) + x_i^T b
fitted(model) = n_i e^(x_i^T b)
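A short R sketch of Poisson regression with an offset for varying exposure (data and names simulated for illustration):

# counts y over exposure n (e.g. person-years), with the per-unit rate depending on a covariate x
set.seed(4)
n <- runif(60, 100, 1000)                     # exposure for each covariate pattern
x <- rbinom(60, 1, 0.5)
y <- rpois(60, n * exp(-4 + 0.7 * x))         # true per-unit rate exp(-4 + 0.7 x)

fit <- glm(y ~ x + offset(log(n)), family = poisson)
exp(coef(fit)["x"])     # estimated rate ratio for x = 1 vs x = 0
predict(fit)            # log(n_i) + x_i^T b
fitted(fit)             # n_i * exp(x_i^T b)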
Let Y = (Y_1, …, Y_N)^T where the Y_i ~ P(λ_i) are independent and ∑Y_i = n ~ P(λ_1 + … + λ_N); then Y | n ~ Multinomial(n, π_1, …, π_N) where π_i = λ_i / ∑_k λ_k
Observe independent variables, say A(with I levels), B(with J levels) ,C(with K levels) and D(with L levels):
1. Independent Poisson sampling (no constraints)
Y_ijkl = the number of observations in each cell of the cross-classification table, with N = I×J×K×L cells and Y_ijkl ~ P(μ_ijkl)
f(y; μ) = ∏_{p=1}^{N} μ_p^(y_p) e^(−μ_p) / y_p!
saturated model:
log(μ_ijkl) = μ + λ_i^A + λ_j^B + λ_k^C + λ_l^D + λ_ij^AB + λ_ik^AC + λ_il^AD + λ_jk^BC + λ_jl^BD + λ_kl^CD + λ_ijk^ABC + λ_ijl^ABD + λ_ikl^ACD + λ_jkl^BCD + λ_ijkl^ABCD
2. Multinomial sampling (grand total fixed)
also conditional Poisson sampling with a fixed n
observational studies
Y_ijkl = the number of observations in each cell of the cross-classification table with Y_ijkl ~ P(μ_ijkl), subject to ∑_ijkl y_ijkl = n
f(y; μ | n) = n! ∏_{p=1}^{N} θ_p^(y_p) / y_p!
o joint PDF is multinomial, but we can apply the GLM exponential-family results as the marginals are Poisson
E(Y_ijkl) = μ_ijkl = n θ_ijkl is the expected number of observed counts
θ_ijkl = probability of being in the ijkl cell of the cross-classification table
saturated model: E(Y_ijkl) = n θ_ijkl
log(μ_ijkl) = μ + λ_i^A + λ_j^B + λ_k^C + λ_l^D + λ_ij^AB + λ_ik^AC + λ_il^AD + λ_jk^BC + λ_jl^BD + λ_kl^CD + λ_ijk^ABC + λ_ijl^ABD + λ_ikl^ACD + λ_jkl^BCD + λ_ijkl^ABCD
log(μ_ijkl) = log(n) + log(θ_ijkl)
o where i = 1, …, I, j = 1, …, J, k = 1, …, K, l = 1, …, L (over-parameterized)
corner-point parameterization:
log(μ_ijkl) = log(μ_1111) + λ_i^A + λ_j^B + λ_k^C + λ_l^D + λ_ij^AB + λ_ik^AC + λ_il^AD + λ_jk^BC + λ_jl^BD + λ_kl^CD + λ_ijk^ABC + λ_ijl^ABD + λ_ikl^ACD + λ_jkl^BCD + λ_ijkl^ABCD
o where the indices i, j, k, l run from 2 upward (resolves the over-parameterization)
additive model: E(Y_ijkl) = n θ_{i···} θ_{·j··} θ_{··k·} θ_{···l} (all variables are independent)
log(μ_ijkl) = μ + λ_i^A + λ_j^B + λ_k^C + λ_l^D
minimal model: E(Y_ijkl) = n/(I×J×K×L), i.e. observations equally distributed across all cells (the variables do not influence the cell frequencies)
log(μ_ijkl) = μ
e^(x_i^T β) = μ_ijkl is the expected cell frequency
significant interaction terms imply there is association between relevant variables (they are not
independent)
Notes:
models are hierarchical, i.e., if a higher order interaction is included, all related lower- order terms
are also included.
The hypotheses of interest mostly centre around the inclusion of interaction terms.
Fitting the saturated model that contains all possible interactions is equivalent to reproducing the
data. The aim is usually to summarize the data with a parsimonious model, one that contains as few
terms as possible
Goodness of fit statistics as described for Poisson regression models are valid and hypothesis
testing is based on comparing these goodness of fit statistics for nested models.
Pearson and deviance residuals can be calculated
fitted(model) = μ̂_ijkl = expected frequency
predict(model) = log(μ̂_ijkl)
in R: frequency ~ categories, family = poisson
xtabs(frequency/fitted ~ row heading +column heading + tables split by)
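A small R sketch of a log-linear analysis for a 2×3 cross-classified table (counts are made up); the deviance comparison tests the association between A and B:

# cross-classified counts for two categorical variables A and B
dat <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2", "b3"))
dat$freq <- c(20, 35, 14, 28, 9, 44)

m_add <- glm(freq ~ A + B, family = poisson, data = dat)   # additive (independence) model
m_sat <- glm(freq ~ A * B, family = poisson, data = dat)   # saturated model reproduces the data
anova(m_add, m_sat, test = "Chisq")   # significant interaction => association between A and B

dat$fit <- fitted(m_add)
xtabs(freq ~ A + B, data = dat)       # observed frequencies
xtabs(fit  ~ A + B, data = dat)       # fitted frequencies under independence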
Decision Theory and Bayesian statistics
Bayesian philosophy:
inferential / subjective probability: measure of belief in the assertion that an outcome will occur
acquired by systematic but subjective evaluation e.g. probability wheel
must be:
o internally logically consistent
o any “objective” data must be correctly integrated with the prior subjective evaluations
o totality of information is properly incorporated into the decision making process.
Game Theory
player I chooses a_I ∈ A_I, player II chooses a_II ∈ A_II; player I gets payoff P_(a_I a_II), player II gets payoff Q_(a_I a_II)
mixed strategies: player I plays action i with probability x_i (player II with probability y_i); expected payoffs take the form ∑_i x_i P_ij and ∑_i y_i P_ij
f(x | θ_1) π(θ_1), f(x | θ_2) π(θ_2): joint probabilities
m(x) = ∑_i f(x | θ_i) π(θ_i): predictive probabilities
π(θ_1 | x), π(θ_2 | x): posterior probabilities
a_1, a_2: for each action, the posterior expected regret/loss E_{θ|X}[ L(a, θ) | x ]
optimal action: the action minimizing E_{θ|X}[ L(a, θ) | x ]
Binomial sampling (n trials) with a Beta(α, β) prior:
posterior mean: E[θ | x] = (α + x)/(α + β + n) = [ (α + β)/(α + β + n) ] × prior mean + [ n/(α + β + n) ] × sample mean
E[ θ̂(X) | θ ] = (α + nθ)/(α + β + n)
Var[ θ̂(X) | θ ] = nθ(1 − θ)/(α + β + n)²
specifying α, β:
1. specify the likelihood of a range of θ (e.g. 95% sure θ is between 0.2 and 0.4)
2. specify the mean (m) and variance (v) of θ: solve m = α/(α + β), v = αβ/( (α + β)²(α + β + 1) ) (see the sketch after this list)
3. specify the mean (m) and a sample size (n) big enough to place equal confidence in sample and prior: n = α + β, m = α/(α + β)
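A short R sketch of method 2: choosing α and β from a specified prior mean and variance via the closed-form solution of the two moment equations (the target values m = 0.3, v = 0.01 are made up):

# choose Beta(alpha, beta) prior parameters from a specified prior mean m and variance v
beta_from_moments <- function(m, v) {
  s <- m * (1 - m) / v - 1            # s = alpha + beta, since v = m(1 - m)/(alpha + beta + 1)
  c(alpha = m * s, beta = (1 - m) * s)
}

beta_from_moments(m = 0.3, v = 0.01)  # gives alpha = 6, beta = 14
qbeta(c(0.025, 0.975), 6, 14)         # implied 95% prior interval for theta as a sanity check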
Poisson sampling (sample size = n): X | θ ~ P(θ)
θ ~ Ga(α, λ)
θ | X ~ Ga( α + ∑ x_i, λ + n )
Normal sampling (sample size = n):
π(μ, τ | x) ∝ π(μ, τ) × τ^(n/2) exp( −τ S_xx / 2 ) × exp( −τ n (μ − x̄)² / 2 )    where S_xx = ∑_i (x_i − x̄)²
τ_MLE = n / S_xx and S_xx is sufficient for τ
μ_MLE = x̄ and x̄ is sufficient for μ
ideally perform inference about μ, τ at same time using prior joint dbn
evaluation of joint posterior is computationally tedious ⇒ perform inference one at a time
Case 1: τ known, μ unknown
X | μ ~ N( μ, 1/τ )
μ ~ N( m, 1/ϕ )
μ | X ~ N( (ϕm + nτx̄)/(ϕ + nτ), 1/(ϕ + nτ) )
E[μ | X] = μ̂ = (ϕm + nτx̄)/(ϕ + nτ) = [ ϕ/(ϕ + nτ) ] m + [ nτ/(ϕ + nτ) ] x̄ = prior weight × prior mean + sample weight × sample mean
as n → ∞, the relative quality of the prior information decreases ⇒ the prior weight ϕ/(ϕ + nτ) → 0
Credibility intervals: using posterior normal dbn
Case 2: μ known, τ unknown
X | τ ~ Ga( n/2 + 1, n s²/2 )
τ ~ Ga(α, λ)
τ | X ~ Ga( α + n/2, λ + n s²/2 )
E[τ | X] = τ̂ = (n + 2α)/(n s² + 2λ)
E[σ² | X] = E[1/τ | X] = (n s² + 2λ)/(n + 2α − 2) = [ (α − 1)/(α − 1 + n/2) ] × λ/(α − 1) + [ (n/2)/(α − 1 + n/2) ] × s²
Credibility interval: let u = (2λ + n s²) τ. Then u | X ~ Ga( (2α + n)/2, 1/2 ) = χ²_(n + 2α)
Bayesian Regression
Set up: y = Xβ + e where e ~ MVN( 0, (1/τ) I ) ⇒ y ~ MVN( Xβ, (1/τ) I )
Least squares estimate: β̂ = (X^T X)^(−1) X^T y    use: X^T y = (X^T X) β̂
Residual sum of squares: S_R = ê^T ê = ( y − Xβ̂ )^T ( y − Xβ̂ ) = y^T y − β̂^T X^T X β̂
General case: τ, β unknown
L(β, τ; y, X) ∝ τ^(n/2) exp( −(τ/2)( y − Xβ )^T ( y − Xβ ) ) = τ^(n/2) exp( −(τ/2)( β − β̂ )^T X^T X ( β − β̂ ) ) × exp( −(τ/2) S_R )
y | β ~ MVN( Xβ, (1/τ) I )
β ~ N( b_0, T_0^(−1) )
β | y, X ~ N( b_1, T_1^(−1) )
where T_1 = T_0 + τ X^T X and b_1 = T_1^(−1) ( T_0 b_0 + τ X^T X β̂ )
E[β | y, X] = b_1
Case 2: b_0 = 0, T_0 = kτI for k > 0
Ridge regression estimator: E[β | y, X] = b_1 = ( kI + X^T X )^(−1) X^T y
prior assumptions imply all betas are independent, all equally likely to be positive or negative
generalization of the frequentist approach (recovered when k = 0 ⇔ prior variance is infinite)
admissible under squared error loss for k > 0, not for k = 0
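A minimal R sketch of the ridge estimator above on simulated data (the choice k = 5 is arbitrary); k = 0 recovers least squares:

# ridge estimator (k I + X^T X)^{-1} X^T y as the posterior mean under b0 = 0, T0 = k tau I
set.seed(5)
X <- cbind(1, matrix(rnorm(200), ncol = 4))
y <- drop(X %*% c(1, 2, 0, -1, 0.5)) + rnorm(50)

ridge <- function(X, y, k) solve(k * diag(ncol(X)) + t(X) %*% X, t(X) %*% y)

ridge(X, y, k = 5)    # shrunk towards the prior mean 0
ridge(X, y, k = 0)    # k = 0 gives the least squares estimate (X^T X)^{-1} X^T y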
Prediction: given a future observation x_0, we want information about the predicted value Y_0 = x_0^T β with posterior dbn Y_0 | y, X
we know: Y_0 | β ~ N( x_0^T β, 1/τ ) and β | y, X ~ N( b_1, T_1^(−1) )
⇒ (Y_0, β) | y, X is jointly normal and Y_0 | y, X has the functional form of a normal dbn
o θ^T A θ − 2 θ^T B + terms independent of θ ∝ ( θ − A^(−1)B )^T A ( θ − A^(−1)B ) ⇒ functional form of N( A^(−1)B, A^(−1) )
Monte Carlo simulation
Use: calculation of E[θ | X], Var[θ | X], quantiles, and integrals of the form ∫_Θ r(θ) π(θ | x) dθ when these cannot be evaluated analytically. Generate θ_1, …, θ_N from π(θ | x); then
3. ∫_Θ r(θ) π(θ | x) dθ ≈ (1/N) ∑_i r(θ_i)
Probability integral transform
1. Generate U ~ U(0, 1)
2. Define θ = ψ^(−1)(U) where ψ is the target CDF
CDF of θ = P(θ ≤ α) = P(U ≤ ψ(α)) = ψ(α) ⇒ θ is an observation from the dbn with CDF ψ
Probability integral transform with accept/reject algorithm
1. choose a standard (or CDF invertible) dbn h(θ) s.t. h(θ) > 0 ⇔ π ( θ| x ) > 0
more similar h(θ) and π ( θ| x ) , the higher acceptance rate
useful technique: h(θ) = π(θ)
joint prior: π(a, b) = π(a|b)π(b) ⇒ generate bi from π(b), ai from π(a|bi)
2. Define C = max_{θ : h(θ) > 0} [ π(θ) g(t, θ) / h(θ) ]
3. Define γ(θ) = π(θ) g(t, θ) / ( C h(θ) )
4. Generate θ_i from h(θ)
5. Independently, generate U_i from U(0, 1)
6. Accept θ_i if U_i ≤ γ(θ_i) and reject otherwise. If θ_i is accepted:
PDF of θ_i = h(θ_i) P( U_i ≤ γ(θ_i) | θ_i ) / P( U_i ≤ γ(θ_i) ) ∝ π(θ_i) g(t, θ_i) ∝ π(θ_i | x)
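A compact R sketch of steps 1–6 using the "useful technique" h(θ) = π(θ), for made-up binomial data (t successes out of n) with a Beta(2, 2) prior; all numbers are illustrative:

# accept/reject sampling from the posterior pi(theta | t) with proposal h(theta) = prior
set.seed(6)
n <- 20; t <- 14                                    # observed data: t successes out of n
g <- function(theta) theta^t * (1 - theta)^(n - t)  # likelihood g(t, theta)
C <- g(t / n)                                       # C = max of pi * g / h = max of g, at theta = t/n
N <- 10000
theta <- rbeta(N, 2, 2)                             # step 4: draws from h(theta) = Beta(2, 2) prior
u     <- runif(N)                                   # step 5
keep  <- u <= g(theta) / C                          # step 6: gamma(theta) = pi g / (C h) = g / C here
post  <- theta[keep]
mean(post)                                          # Monte Carlo estimate of E[theta | t]
(2 + t) / (2 + 2 + n)                               # exact conjugate answer for comparison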
Notes
results depend on an explicitly stated prior ⇒ problematic when objectivity is sought (legal, scientific). Counter-arguments:
o existence of non-informative priors
o contributions of data and prior can often be clearly distinguished
o subjective choices in frequentist approach: choice of model, stopping rule in sampling,
choice of hypotheses
o situations in which real prior information does exist and cannot be ignored
Conjugate prior: posterior dbn has same functional form as prior (calculate both to prove
something is a conjugate prior)
o in this case we can write Bayesian estimate as weighted sum of data and prior information
(credibility formula)
precision: spread of data relative to each other
accuracy: distance of estimate to true value
as n → ∞, Bayesian estimate → maximum likelihood estimate in value (as prior information is
counted less and less relative to sample information) but still differs in interpretation
as n → ∞, credibility interval → conventional confidence interval from sampling theory in value of
bounds, but still differs in interpretation
functional form: PDF stripped of all terms that do not depend on the RV
mode: value of X that maximizes the pdf
Non-Informative priors
Empirical Bayes (sequential sampling)
Set up: have sequence of parameters θ1 , … ,θ n and measurements X1 , … , Xn and there is no
systematic trend, draw inference on θn +1
Ideally we would use the empirical dbn of the θ_i, but we only have the empirical dbn of the X_i. Proceed as follows:
if we suppose θ_i ~ G(θ | η), we can use π(θ) = G(θ | η̂) where η̂ is found by maximizing L(x; η) = ∏ f(x_i | η)
EBCT model 1:
Definitions:
X_i1, …, X_in are the claim history (claim size or number of claims) for n periods for group/risk i, i = 1, …, m
X_it follow a marginal dbn which depends on a parameter θ_i (dim(θ) ≥ 1)
E[X_it | θ_i] = m(θ_i)
Var[X_it | θ_i] = s²(θ_i)
Aim: estimate unconditional mean of marginal dbn, namely m(θi )
Method: solve for coefficients of linear estimator that minimize the posterior square error loss
Assumptions:
θi ,θ j are i.i.d ⇒ prior moments independent of i: Eθ [ m(θi) ] = E [ m(θ) ] for all i
X kt , X iu are independent for k ≠ i (even if t = u)
X it ∨θi are i.i.d
no systematic trend over time
exposure to risk is constant over all groups and time periods
Notation:
Global mean: X̄̄ = ∑_i X̄_i / m
Group mean: X̄_i = ∑_t X_it / n
Within groups SS: SSE = ∑_i ∑_t ( X_it − X̄_i )²
Between groups SS: SST = ∑_i ( X̄_i − X̄̄ )²
EBCT estimate for m(θ_i): m̂(θ_i) = Z X̄_i + (1 − Z) E[m(θ)] where Z = n / ( n + E[s²(θ)] / Var[m(θ)] )
linear rule for easy interpretation, credibility formula irrespective of prior dbn, no need for
conjugate prior
Unbiased estimates for prior parameters (using data from other risks ⇒ empirical Bayes approach)
parameter: unbiased estimate*
E[m(θ)]: X̄̄
E[s²(θ)]: SSE / ( m(n − 1) )
Var[m(θ)]: max{ 0 ; SST/(m − 1) − SSE/( mn(n − 1) ) }
*prove by taking expectations, using E[X] = E[E[X|θ]] and (n − 1)s²/σ² ~ χ²_(n−1)
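A small R sketch of the EBCT Model 1 calculations above for a simulated claims matrix (m risks in rows, n periods in columns; all values are made up):

# EBCT model 1: credibility estimate Z * group mean + (1 - Z) * overall mean
set.seed(7)
m <- 5; n <- 8
X <- matrix(rgamma(m * n, shape = rep(3:7, times = n), rate = 0.02), nrow = m)  # claim history

xbar_i <- rowMeans(X)                     # group means
xbar   <- mean(xbar_i)                    # overall mean, estimate of E[m(theta)]
SSE    <- sum((X - xbar_i)^2)             # within-groups SS
SST    <- sum((xbar_i - xbar)^2)          # between-groups SS

E_s2   <- SSE / (m * (n - 1))             # estimate of E[s^2(theta)]
V_m    <- max(0, SST / (m - 1) - SSE / (m * n * (n - 1)))  # estimate of Var[m(theta)]
Z      <- n / (n + E_s2 / V_m)            # credibility factor
Z * xbar_i + (1 - Z) * xbar               # EBCT estimates of m(theta_i)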
EBCT model 2:
Definitions:
Y_i1, …, Y_in are the claim history (claim size or number of claims) for n periods for group/risk i, i = 1, …, m
P_i1, …, P_in are the corresponding exposure to risk / risk volume
X_it = Y_it / P_it are the claims per unit risk volume
X_it follow a marginal dbn which depends on a parameter θ_i (dim(θ) ≥ 1)
E[X_it | θ_i] = m(θ_i) (implies Y_it ∝ P_it)
Var[X_it | θ_i] = s²(θ_i) / P_it (implies Var[Y_it | θ_i] = P_it s²(θ_i))
Assumptions:
θi ,θ j are i.i.d ⇒ prior moments independent of i: Eθ [ m(θi) ] = E [ m(θ) ] for all i
X it ∨θi are independent (not identical due to dependence of variance on Pit ¿
no systematic trend over time
Notation:
P̄̄ = ∑_i P̄_i
P̄_i = ∑_t P_it
P* = ( 1/(mn − 1) ) ∑_i ∑_t P_it ( 1 − P_it / P̄̄ )
X̄̄ = ∑_i ∑_t Y_it / P̄̄ = ∑_i X̄_i P̄_i / P̄̄
X̄_i = ∑_t Y_it / P̄_i = ∑_t X_it P_it / P̄_i
SSE = ∑_i ∑_t P_it ( X_it − X̄_i )²
SSQ = ∑_i ∑_t P_it ( X_it − X̄̄ )²
EBCT estimate for X_{i,n+1}: m̂(θ_i) = Z_i X̄_i + (1 − Z_i) E[m(θ)] where Z_i = P̄_i / ( P̄_i + E[s²(θ)] / Var[m(θ)] )
linear rule for easy interpretation, credibility formula irrespective of prior dbn, no need for a conjugate prior
expectation of Y_{i,n+1}: P_{i,n+1} m̂(θ_i)
estimate of the standard deviation of Y_{i,n+1}: √( P_{i,n+1} E[s²(θ)] )
Unbiased estimates for prior parameters (using data from other risks ⇒ empirical Bayes approach)
parameter: unbiased estimate*
E[m(θ)]: X̄̄
E[s²(θ)]: SSE / ( m(n − 1) )
Var[m(θ)]: max{ 0 ; (1/P*) ( SSQ/(mn − 1) − SSE/( m(n − 1) ) ) }
*prove by taking expectations, using E[X] = E[E[X|θ]] and (n − 1)s²/σ² ~ χ²_(n−1)
Chapter 8: More about loss distributions
Properties of extreme event dbns:
extremely long tails: lim_{x→∞} P( X > x + t | X > x ) = 1
generally non-negative
Ways of deriving dbns: specify hazard rate, EVT, mixture dbn
1. Hazard rate / failure rate / force of mortality: r(x) = f(x) / ( 1 − F(x) ) ⇒ r(x)·h ≈ P( x ≤ X ≤ x + h | X ≥ x )
r(x) increasing: shorter tail than exponential, e.g. r(x) = cɣx^(ɣ−1) for ɣ > 1
r(x) constant: exponential dbn, e.g. r(x) = cɣx^(ɣ−1) for ɣ = 1
r(x) decreasing: longer tail than exponential, e.g. r(x) = cɣx^(ɣ−1) for ɣ < 1 (Weibull), r(x) = α/(λ + x) (Pareto)
Power transformations:
Y = X Ɣ ~ Exp(λ) ⇒ X ~ Weibull(λ, ɣ)
Y = X Ɣ ~ Pareto(α, λ) ⇒ X ~ Burr(α, λ, ɣ) = 3 parameter Pareto
2. Extreme value theory
Set up: random sample X 1 , … , X n with maximum Y n
if for sequences a_n, b_n: lim_{n→∞} P( (Y_n − b_n)/a_n ≤ z ) = lim_{n→∞} F(a_n z + b_n)^n = G(z) for each z where G(z) is proper
multivariate normal: f(x) = (2π)^(−d/2) |Σ|^(−1/2) exp( −0.5 (x − μ)^T Σ^(−1) (x − μ) )
Not examinable in 2016:
Non-Informative priors
1. Min max criterion: composite hypotheses and constant risk rules
recall: MIN MAX RISK decision rule: the δ ∈ D that minimizes r_M(δ) = max_{θ∈Θ} R(δ, θ)
Composite hypotheses (decision is choice of a hypothesis)
Loss (in regrets so that one equals 0):
θ ≤ θ0 θ > θ0
accept H0 0 c01
accept H1 c10 0
all admissible rules will be of form: reject H 0 ⇔ x > k (direction of inequality depends on
hypotheses)
r_M(δ) = c10 P( X > k | θ ) for θ ≤ θ0, and c01 P( X ≤ k | θ ) for θ > θ0
minimax k: Φ( (k − θ0)/σ ) = c10 / (c10 + c01) (the two risks are equal at θ = θ0)
non-informative prior: π(θ) ∝ |∂ϕ/∂θ|
o reason: ϕ = h(θ) is a location parameter ⇒ π(ϕ) = k (constant)
o example: binomial sampling: ϕ = arcsin √θ (found by trial and error)
Example: Normal sampling with τ, μ unknown: π(μ, τ) = π(μ|τ)π(τ) ∝ 1/τ ⇒ τ|x ~ (scaled) χ²_(n−1) and μ|x ~ (scaled, shifted) t_(n−1)
improper prior π(θ): prior that is not a valid dbn ⇒ Can use it as long as π(θ|x) is proper