Generalised Linear Models and Bayesian Statistics


Statistics III

Clare Walker
Components:
 Generalized linear models
 Decision theory and Bayesian statistics

Generalized linear models


exploratory data analysis: getting to know data
 distributions of response and each explanatory variable
o continuous: histograms, means, standard deviations, extreme values
o categorical: frequencies, relative frequencies
 simple association between response and explanatory variables separately
o continuous: scatterplot, correlation analysis
o continuous and categorical: boxplots of continuous for different subgroups
o categorical: contingency tables
model formulation: specifying GLM
parameter estimation: Method of moments, maximum likelihood, least squares
inference: compare models to find simplest model that best describes the data
model validation
 check general form of model is valid
 check underlying assumptions are satisfied: graphical analysis of residuals, studentised residuals, standardized residuals
o heteroskedasticity, non-normality*, multi-collinearity, autocorrelation (time series)
*residuals should be normal even if underlying model is not
 outliers (not well fitted by model): graphical analysis of residuals, studentised residuals, standardized residuals (> ±2)
 influential point (too much influence on parameter values): Cook’s D, dfbetas
Interpretation:
 Interpretation of coefficients when using identity link:
o Coefficient of continuous variable: mean change in response for a one unit change in Xk while
holding the other predictors in the model constant
o Coefficient of categorical variable: captures the effect of being in the ith category compared to
being in the baseline category (intercept captures the effect of the baseline) but depends on how they
are coded
o coefficients of interaction terms measure the modification of effect size
 p-value: the probability of observing a sample statistic this large or larger assuming H0 is true
 confidence interval:
 Quote CIs rather than p-values: they incorporate sample size by giving an idea of the precision of the
estimate (especially when the estimate is a linear combination of estimated betas)
Reporting: presenting results so that a non-statistician can understand

Generalized Linear Model:


1. response variables: Y1, …, YN share the same distribution from the exponential family in canonical
form
2. explanatory variables: X = [x1ᵀ; …; xNᵀ], an N × p matrix with rows xiᵀ = (xi1, …, xip), and a set of
parameters β1, …, βp (p < N)
3. Link function: g represents the transformation of μi that will be modeled as a linear function
of the explanatory variables. Must be monotone and differentiable. Examples:
o identity link: ηi = μi (z = y in this case)
o log link: ηi = log(μi)
o logit link: ηi = log( μi / (1 − μi) )
o inverse link: ηi = 1/μi
o complementary log-log link: ηi = log( −log(1 − μi) )
Basic set up:
Expectation as function of underlying distribution: μi = E[Yi] = function of θ
Expectation as function of explanatory variables: ηi = g(μi) = xiᵀβ = linear function of the βj
Fitting distribution of response
Choice of distribution depends on: discrete (categorical, count, binary) vs continuous, symmetrical vs
asymmetric (right, left), natural bounds, likelihood of observing extreme values

distribution | description | used for | shape

discrete
 binomial: E[Y] = np, Var[Y] = npq
o description: counts of successes in a given number of trials with a specified probability of success in each trial
o used for: proportions that arise from a binary outcome, e.g. probability of a claim, probability of disease
o shape: p < 50% skewed right, p = 50% symmetric, p > 50% skewed left
 Poisson: E[Y] = λ, Var[Y] = λ
o description: a number of events occurring within a given interval (time, length, area, volume, mass unit)
o used for: counts in a given period of exposure when the probability of an event occurring in a very small time (or space) is low and the events occur independently, e.g. # of claims in given periods of membership
o shape: skewed right, skewness ↓ as λ ↑
 negative binomial: E[Y] = rq/p, Var[Y] = rq/p²
o description: the number of failures before the specified number (r) of successes
o used for: counts in a period of exposure where the data are skew and over/under dispersed (Poisson is not applicable)
continuous
 normal: E[Y] = μ, Var[Y] = σ²
o description: 1. strong tendency for the data to take on a central value, 2. positive and negative deviations are equally likely, 3. the frequency of the deviations falls off rapidly for values further away from the central value; often log-transformed data is normal
o shape: symmetric with thin tails
 Cauchy
o shape: similar to the normal distribution but with fatter tails
 chi-square (χ²n): Y = Σi Zi² with Zi ~ N(0, 1); E[Y] = n, Var[Y] = 2n
o used for: testing and comparing model fit
o shape: skewed to the right, non-negative
 non-central chi-square (χ²n,λ): Y = Σi (Zi + μi)² with Zi ~ N(0, 1) and λ = Σi μi²; E[Y] = n + λ, Var[Y] = 2n + 4λ
o used for: testing and comparing model fit
o shape: skewed to the right, non-negative
 gamma: E[Y] = αβ, Var[Y] = αβ² with α = shape, β = scale (R’s dgamma uses rate = 1/β)
o fY(y) = y^(α−1) e^(−y/β) / (β^α Γ(α))
o used for: e.g. claim size, annual income, lab assay expressions
o shape: skewed to the right, non-negative
 inverse Gaussian: E[Y] = μ, Var[Y] = μ³σ²
o fY(y) = exp( −(y − μ)² / (2 y μ² σ²) ) / √(2π y³ σ²)
o description: CDF of the inverse Gaussian is the inverse of the CDF of the Gaussian
o used for: extremely skew data
o shape: very skewed to the right, non-negative
 Weibull: λ = shape, θ = scale (λ = 1 gives the exponential)
o used for: very skew data, e.g. lifetimes, failure times
o shape: very skewed to the right, non-negative
Exponential family: f(y; θ) = exp[ a(y)b(θ) + c(θ) + d(y) ]
Properties:
1. E[a(Y)] = −c′(θ)/b′(θ)   use: (∂/∂θ)∫f(y)dy = 0 = ∫(∂/∂θ)f(y)dy
2. Var[a(Y)] = [ b″(θ)c′(θ) − b′(θ)c″(θ) ] / b′(θ)³   use: (∂²/∂θ²)∫f(y)dy = 0 = ∫(∂²/∂θ²)f(y)dy
3. U = dl/dθ = a(y)b′(θ) + c′(θ)
4. J = Var[U] = b″(θ)c′(θ)/b′(θ) − c″(θ)
5. J = E[U²] = −E[U′]
o interpretation: the entries in J represent the curvature of l around θ̂. Low curvature ⇒ U
is near 0 for many values of θ ⇒ the variance of θ̂ is large
Notes:
 canonical form: a(y) = y. In this case b(θ) is called the natural parameter.
 canonical link: g(μi) = b(θ)
Estimation
 Set up: Y1, …, YN share the same distribution but parameter values may vary ⇒ PDF = f(yi; θi)
Estimating θ: method of moments, exact maximum likelihood, maximum likelihood using the method of
scoring
Method of maximum likelihood: find θ such that l(θ) is maximized ⇒ U(θ̂) = 0
 joint PDF: f(y1, …, yN; θ1, …, θN) = exp[ Σ a(yi)b(θi) + Σ c(θi) + Σ d(yi) ]  (look at the data given the parameters)
 likelihood: L(θ1, …, θN; y1, …, yN) = exp[ Σ a(yi)b(θi) + Σ c(θi) + Σ d(yi) ]  (look at the parameters given the data)
 Score function: U = dl/dθ = Σi dli/dθ = Σi [ a(yi)b′(θi) + c′(θi) ] = 0  (now assume θi = θ)
o solve analytically for θ̂
o or solve iteratively using: θ(m) = θ(m−1) − U(m−1)/U′(m−1) = θ(m−1) + U(m−1)/J(m−1)
 Variance-covariance: J = Var[U]

Estimating β: maximum likelihood estimates (MLE) using iteratively re-weighted least squares (IRLS)
 Score function: Uj = dl/dβj = Σi (dli/dθi)(dθi/dμi)(dμi/dβj) = Σi [ (yi − μi)/Var(yi) ] xij (dμi/dηi)
 Variance-covariance: Jjk = E[Uj Uk] = Σi [ xij xik / Var(yi) ] (dμi/dηi)²
 Variance-covariance matrix: J = E[U Uᵀ] = XᵀWX where W is the N × N diagonal matrix with
wii = [ 1/Var(yi) ] (dμi/dηi)²
 transformed (working) response variable: zi = Σk xik bk(m−1) + (yi − μi)(dηi/dμi)
 Solve for β iteratively using: XᵀWX b(m) = XᵀWz, i.e. b(m) = (XᵀWX)⁻¹XᵀWz
o equivalent to b(m) = b(m−1) + J⁻¹U(m−1)
o Yi ~ N(μi, σ²) and identity link ⇒ W = (1/σ²)I and z = Y ⇒ b = (XᵀX)⁻¹XᵀY
 use observed means as initial estimates
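As an illustration of the scheme above, here is a minimal IRLS sketch in R for a Poisson GLM with log link (simulated data; the variable names are illustrative, and in practice glm(y ~ x, family = poisson) performs this fit):
# Minimal IRLS sketch for a Poisson GLM with log link (illustrative only).
set.seed(1)
x <- runif(50); X <- cbind(1, x)               # design matrix with intercept
y <- rpois(50, exp(0.5 + 1.2 * x))             # simulated response
b <- c(log(mean(y)), 0)                        # initial estimate from the observed mean
for (m in 1:25) {
  eta <- as.vector(X %*% b)                    # linear predictor
  mu  <- exp(eta)                              # inverse link
  W   <- diag(mu)                              # w_ii = (1/Var(y_i)) (dmu/deta)^2 = mu_i for Poisson
  z   <- eta + (y - mu) / mu                   # working response z_i
  b   <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
}
b                                              # compare with coef(glm(y ~ x, family = poisson))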
Inference
confidence intervals vs hypothesis tests
 what: a confidence interval is a random interval which contains the true parameter A% of the time; a
hypothesis test compares how well two related models fit the data based on a “goodness of fit”
statistic (related: same pdf, same link, different explanatory variables)
 uses: a confidence interval answers questions about the relationship between response and
explanatory variables with more accuracy, and its width gives a measure of precision; a hypothesis
test finds the simplest model that accurately describes the data
 requires: the sampling distribution of the parameter estimates (CI) / the sampling distribution of the
goodness of fit statistic (test)

Standardizing results: let S = (s1, …, sp)ᵀ be a sample statistic with covariance matrix J:
 S − E[S] ⩪ MVN(0, J)
 (S − E[S])ᵀ J⁻¹ (S − E[S]) ⩪ χ²p
o if S has an underlying normal distribution (linear combination of normal rv), the results are
exact
o if not, the results hold under the CLT (regularity conditions satisfied by all exponential family
distributions)

Score statistic: Uᵀ J⁻¹ U ⩪ χ²p
 related result: U ~ MVN(0, J)
 Derivation: E[U] = 0, Var[U] = J; apply the standardizing result
 uses: typically for inference about distribution parameters

Wald statistic: (b − β)ᵀ J (b − β) ⩪ χ²p where b = β̂, the MLE
 related result: b ~ MVN(β, J⁻¹); use this result to construct Wald confidence intervals
 Derivation: first-order Taylor expansion of U around β ⇒ b − β ≈ J⁻¹U; assume J is constant, so E[b]
= β and Var[b] = J⁻¹
 ∴ J⁻¹ is the matrix with diagonal entries var(b1), …, var(bp) and off-diagonal entries cov(bj, bk)
 uses: inference about distribution and beta parameters, confidence intervals
 confidence interval for a slope including an interaction term: e.g. Y = β0 + β1X1 + β2X2 + β3X1X2; let β*
= slope of X1:
o b* = b1 + b3 and var(b*) = var(b1) + var(b3) + 2cov(b1, b3)
Examples:
Binomial (β = π): (y − nπ)² / (nπ(1 − π)) ⩪ χ²1
Poisson (β = λ): (y − λ)² / λ ⩪ χ²1
Normal (β = μ): (ȳ − μ)² / (σ²/N) ~ χ²1 (exact)
Log-likelihood ratio statistic: D = 2( l(bmax; y) − l(b; y) ) ~ χ²(m − p, ν)
 maximal/full/saturated model: no summarizing has taken place ⇒ N parameters (one for each
observation); l(bmax; y) will always be the maximum possible log-likelihood for the assumed pdf and
link as it provides the most complete information
 large values of D suggest the proposed model is a poor description of the data relative to the maximal
model
 Derivation: second-order Taylor expansion of l about b ⇒ 2( l(b) − l(β) ) = (b − β)ᵀJ(b − β) ~ χ²p
D = 2[ ( l(bmax) − l(βmax) ) − ( l(b) − l(β) ) + ( l(βmax) − l(β) ) ] = χ²m − χ²p + ν (positive
constant)
conditioning on/subtracting the log-likelihood at the true parameter values ensures
independence
 uses: hypothesis testing for nested models when choosing the simplest but best fitting model;
goodness of fit of a single model, comparing it only to the maximal model (only if ni > 1)
 Hypothesis testing: (best when the distribution does not have any nuisance parameters)
o H0: β = β0 = [β1 ⋯ βq]ᵀ corresponding to M0 with deviance D0
o H1: β = β1 = [β1 ⋯ βp]ᵀ corresponding to M1 with deviance D1, and q < p < N
o test stat: ΔD = D0 − D1 = 2( l(b1) − l(b0) ) ~ χ²p−q if both models hold
o interpretation: if ΔD is consistent with χ²p−q then choose M0 as it is simpler;
if ΔD is in the critical region for χ²p−q, reject M0 as M1 describes the data
significantly better
Examples:
Binomial: D = 2Σ[ yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) ] ⩪ χ²n−p where ŷi = ni π̂i
Poisson: D = 2Σ[ yi log(yi/ŷi) − (yi − ŷi) ] ⩪ χ²n−p where ŷi = λ̂i; when Σyi = Σŷi this reduces to
D ≈ 2Σ oi log(oi/êi)
Normal: D = (1/σ²) Σ (yi − μ̂i)² ~ χ²n−p
 dispersion in R output: scaled deviance / degrees of freedom = Σ(yi − μ̂i)² / (N − p)
 scaled deviance: σ²D = Σ(yi − μ̂i)² (in R: null and residual deviance)
 Chi-square test: Q = D0 − D1 ⩪ χ²p−q, with σ² taken from the largest model (does not account for the
estimation of σ²)
 F test: to account for the estimation of σ² we use F = [ (D0 − D1)/(p − q) ] / [ D1/(N − p) ] ~ Fp−q, N−p under H0
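In R these nested-model comparisons can be obtained with anova() on glm fits; a small hedged sketch with simulated data (variable names are illustrative):
# Chi-square test when the dispersion is known (binomial, Poisson);
# F test when sigma^2 must be estimated (e.g. gaussian family).
set.seed(4)
dat <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
dat$y <- rpois(60, exp(0.2 + 0.5 * dat$x1))            # x2 has no real effect here
m0 <- glm(y ~ x1,      family = poisson, data = dat)   # simpler model M0
m1 <- glm(y ~ x1 + x2, family = poisson, data = dat)   # larger model M1
anova(m0, m1, test = "Chisq")   # Delta D = D0 - D1 compared with chi-square(p - q)
g0 <- glm(y ~ x1,      family = gaussian, data = dat)
g1 <- glm(y ~ x1 + x2, family = gaussian, data = dat)
anova(g0, g1, test = "F")       # accounts for estimating sigma^2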
Goodness of fit statistics
 p = k + 1 = number of unknown parameters
 m = number of covariate patterns (binary/logistic)
Log-likelihood ratio statistic: D = 2( l(bmax; y) − l(b; y) ) ~ χ²(m − p) = Σk dk²
 deviance residuals: dk = √( 2( l(bmax; yk) − l(b; yk) ) ) = √( 2( l(yk) − l(ŷk) ) ) for k = 1, …, m
 standardized dk: rDk = dk / √(1 − hk)
Likelihood ratio chi-squared statistic: 2( l(b; y) − l(bmin; y) ) ~ χ²(p − 1)
 all parameters are equal in the minimal model
pseudo R²: R² = ( l(bmin; y) − l(b; y) ) / l(bmin; y)
 measures the proportional improvement in the log-likelihood function due to the explanatory variables
included in the current model, compared to the minimal (null) model
Pearson/chi-squared statistic: Q = Σk ( yk − E(yk) )² / var(yk) ~ χ²(m − p) = Σk Xk²
 chi-square residual: Xk = ( yk − E(yk) ) / √var(yk) (the moments of yk are expressed in terms of the
fitted parameters)
 standardized Xk: rPk = Xk / √(1 − hk)
 asymptotically equal to the deviance; the chi-square approximation is poor when there are small
expected frequencies
 assesses: how well the model fits the data ⇒ a small value of the statistic (large p-value) indicates that the
model fits the data well
The Hosmer-Lemeshow statistic: χ²HL ⩪ χ²(g − 2)
 groups the predicted probabilities into a number of groups (g) with approximately equal numbers of
observations in each group. The observed numbers of successes and failures in each of the g groups
are summarized and the Pearson chi-square statistic for a g×2 contingency table is calculated as a
measure of fit
 assesses: how well the model fits the data ⇒ a small value of the statistic (large p-value) indicates that the
model fits the data well
Akaike information criterion: AIC = −2 l(ŷ) + 2p
 assesses: how well the model fits the data, penalized for complexity ⇒ a smaller value indicates a better model
 use for: comparing two models that are not nested
Bayesian information criterion: BIC = −2 l(ŷ) + p ln(# observations)
Measures of influence
Leverage: hk is the kth diagonal element of the hat matrix W^(1/2) X (XᵀWX)⁻¹ Xᵀ W^(1/2)
 a measure of the leverage of covariate pattern i: how “far” does the covariate pattern lie from the
average covariate pattern and thus how much of an effect does it have on the estimated model
Cook’s distance: Di = (1/p) [ hi / (1 − hi) ] r²Pi
dfbeta: dfbetaij = Δβ̂j = bj − bj(i)
 change in the estimated parameters when the ith observation is omitted from the data
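These influence measures can be extracted directly from a fitted glm object in R; a hedged sketch with simulated grouped binomial data (names are illustrative):
set.seed(5)
dat <- data.frame(x = rnorm(30), n = 20)
dat$y <- rbinom(30, size = dat$n, prob = plogis(-0.5 + dat$x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)
h  <- hatvalues(fit)                      # leverages h_k
rP <- rstandard(fit, type = "pearson")    # standardized Pearson residuals r_Pk
rD <- rstandard(fit, type = "deviance")   # standardized deviance residuals r_Dk
cd <- cooks.distance(fit)                 # Cook's distance D_i
db <- dfbeta(fit)                         # change in each b_j when pattern i is dropped
which(abs(rP) > 2)                        # flag outliers
which(h > 2 * length(coef(fit)) / nrow(dat))   # flag high-leverage patterns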

Model checking
1. The form of the linear predictor, i.e., is the linear component of the model adequate
 scatterplot of rDk vs xkᵀb, scatterplot of rPk vs xkᵀb (linear predictor, in R: predict(model))
 a systematic pattern indicates the model is incorrect
2. The adequacy of the link function, i.e., is the logit transformation appropriate
3. Outlying and/or influential observations (rather identify these visually than using cut-off values; state the
properties of the observation)
 outliers: |rPk| > 2
 influence of the ith covariate pattern on the estimated model: hi > 2p/n
 influence of the ith covariate pattern on the parameter vector: Di
 influence of the ith covariate pattern on parameter bj: dfbetaij
4. Is Var(yi) correct or is there over-dispersion – indicated by residual deviance/df ≥ 2
GLMs for discrete response
covariate pattern: group of response variables described by the same explanatory variables
 in grouped data: each group is a covariate pattern (ni is number in each group)
 in ungrouped data: response variables with the same values for all explanatory variables.
Increasing number of explanatory variables, especially continuous ones, increases the number of
distinct covariate patterns (ni is number sharing these values)
 binomial response: interested in relative odds and the cumulative incidence of success; the response is
the cumulative count for a particular covariate pattern; explanatory variables are categorical and
continuous
 poisson response: interested in the incidence per exposure unit, the incidence rate for a given period of
time; the response is count data per unit exposure for a particular covariate pattern; explanatory
variables are categorical and continuous; exposure may come in varying amounts
 log-linear model: interested in associations between various explanatory variables; the response is the
cell frequency in a cross-classified table; all variables are treated identically and are all categorical;
exposure must be constant

Association between exposure to a given factor and binary response:

                               response
                               yes (1)   no (0)   total
exposed to factor   yes (1)    a         b        a + b
                    no (0)     c         d        c + d
                    total      a + c     b + d
relative risk (RR): likelihood of a successful response in the exposed group relative to those who are not
exposed
 RR = proportion of successes if exposed / proportion of successes if unexposed = [a/(a + b)] / [c/(c + d)]
 RR < 1: response = 1 is less likely in the exposed group (negative association between exposure and
response)
 RR = 1: no association between exposure and response
 RR > 1: response = 1 is more likely in the exposed group (positive association between exposure and
response)
odds ratio (OR): odds of a successful response if the factor is present relative to the odds of a successful
response if the factor is absent
 OR = odds of success given exposed / odds of success given unexposed
= [P(response = 1 | exposed = 1) / P(response = 0 | exposed = 1)] / [P(response = 1 | exposed = 0) / P(response = 0 | exposed = 0)]
= (a/b) / (c/d) = ad/bc
Binary Variables
consider the binary outcome Zij = 1 with probability πij and Zij = 0 with probability 1 − πij,
which is one of the ni observations in the ith group.
 we can aggregate the observations in the ith group, so that πij = πi: Yi = Σj Zij ~ Bin(ni, πi) with E[Yi] = ni πi.
 we can summarize the information by modeling πi as a function of the covariates: g(πi) = xiᵀβ ⇒ π̂i =
g⁻¹(xiᵀβ̂)
 to ensure πi ∈ [0, 1], it is often modeled as πi = ∫_{−∞}^{di} fS(s) ds, where fS(s) is a PDF called the
tolerance distribution
 experimental units have different tolerance levels (the level at which the outcome changes) to the conditions
we place them under
 from observing the levels we can construct a density fS(s) describing the trend in tolerance levels
 for a particular unit with tolerance level s, πi = P(success) = P(s ≤ di) = ∫_{−∞}^{di} fS(s) ds
 di is a particular level of exposure (single explanatory variable)
common link functions and associated tolerance dbns:
 Logistic (logit) link: log( πi / (1 − πi) ) = xiᵀβ
o symmetric, linear between 0.2 and 0.8, underlying tolerance dbn is the logistic dbn, commonly
used in the medical field
o fitted values are the logarithm of the “odds”
 Probit link: πi = Φ( (di − μ)/σ ) ⇒ Φ⁻¹(πi) = xiᵀβ*
o symmetric, similar to logit, tolerance dbn is Normal(μ, σ²)
 Complementary log-log link: log( −log(1 − πi) ) = xiᵀβ
o asymmetric, tolerance dbn is the Gumbel (extreme value) dbn
Logistic regression
nice explanation of maximum likelihood estimation:
 maximum likelihood estimation involves choosing that value of p, say p̂, that maximises L(p).
Intuitively, this means that we have chosen a value for p that maximises the probability of
occurrence of the sample results.
 p depends on the b’s and we can thus express the likelihood as a function of the b’s and use maximum
likelihood estimation to determine estimates for the b’s that will maximise the likelihood.
Mathematically this involves differentiating the log likelihood with respect to the k+1 b parameters
and solving the resulting k+1 non-linear equations in the unknown b’s using Fisher’s method of
scoring – an iteratively re-weighted least squares procedure.
 This will yield estimates for the b’s which in turn will give us estimates of the unknown p’s
Model: ηi = logit(πi) = log( πi / (1 − πi) ) = xiᵀβ
 ŷi = ni π̂i = ni exp(xiᵀb) / (1 + exp(xiᵀb))
 wii = ni πi(1 − πi)
 zi = ηi + (yi − ni πi) / (ni πi(1 − πi))
 l(π; y) = Σ[ yi log(πi) + (ni − yi) log(1 − πi) + log(ni choose yi) ]
 D = 2Σ[ yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) ] ⩪ χ²n−p where ŷi = ni π̂i

Model checking:
 Xi = (yi − ni π̂i) / √( ni π̂i(1 − π̂i) )
 di = sign(yi − ni π̂i) × √( 2[ yi log(yi / (ni π̂i)) + (ni − yi) log((ni − yi) / (ni − ni π̂i)) ] )
 when ni = 1 and we are dealing with ungrouped data, the chi-square approximation does NOT hold
and it is safest to look at the change in deviance for nested models rather than using the deviance
as a goodness-of-fit statistic for a given model
Interpretation:
 e^ηi: odds of success for covariate pattern i
 e^ηi / e^ηj: odds ratio of success for covariate pattern i relative to covariate pattern j
 e^βi: interpret on the exponential scale; any calculations are done on the coefficient scale
 dichotomous categorical: OR of success for presence of i relative to absence
 polychotomous categorical: OR of success for presence of i relative to the reference category
 continuous: estimated change in the OR corresponding to a one unit increase in Xj
o e^(r βi) is the estimated change in the OR corresponding to an r unit increase in Xj
 βi: logarithm of the relevant odds ratio, i.e. the estimated change in the relevant log-odds
0 < e^βi < 1: (1 – A)% decrease in the relative odds of success
1 < e^βi < 2: (A – 1)% increase in the relative odds of success
e^βi ≥ 2: (A – 1)% increase in the relative odds of success / A-fold increase in the relative odds of success
notes:
 predict(model) = log( π̂i / (1 − π̂i) ) = xiᵀb
 fitted(model) = π̂i
 in R: cbind(y, n - y) ~ explanatory variables, family = binomial
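A minimal grouped-data logistic regression sketch in R illustrating the quantities above (the data values are hypothetical):
# Grouped binomial data: y successes out of n trials for each covariate pattern.
dat <- data.frame(x = c(0, 1, 2, 3, 4),
                  n = c(50, 50, 50, 50, 50),
                  y = c(2, 5, 12, 24, 38))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)
summary(fit)                      # b_j on the log-odds scale
exp(coef(fit))                    # odds ratios e^(b_j)
exp(confint.default(fit))         # Wald CIs exponentiated to the OR scale
predict(fit)                      # linear predictor x_i' b (log-odds)
fitted(fit)                       # pi_hat_i
deviance(fit); df.residual(fit)   # goodness of fit vs the saturated model (n_i > 1 here)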
Poisson regression
Yi: number of successes from exposure ni for covariate pattern i, i = 1, …, N
μi = ni θi: expected number of events
ni: number of units / exposure ⇒ log(ni) is an offset depicting the varying exposure
θi: per unit incidence / success rate
model: log(μi) = log(ni) + xiᵀβ = log(ni) + ηi, i.e. μi = ni e^ηi, θi = e^ηi
 ŷi = μ̂i = ni exp(xiᵀb)
 wii = μi
 zi = log(μi) + (yi − μi)/μi
 l(μ; y) = Σ[ yi log(μi) − μi − log(yi!) ]
 D = 2Σ[ yi log(yi/ŷi) − (yi − ŷi) ] ⩪ χ²n−p ≈ 2Σ oi log(oi/êi)
Model checking: let oi = yi and ei = ni exp(xiᵀb)
 Xi = (oi − ei) / √ei
 di = sign(oi − ei) × √( 2[ oi log(oi/ei) − (oi − ei) ] )
Interpretation:
 e^ηi: rate of success for covariate pattern i
 e^ηi / e^ηj: rate ratio of success for covariate pattern i relative to covariate pattern j
 e^βj: interpret on the exponential scale, manipulate in linear form (forming CIs)
 dichotomous categorical: rate ratio of success for presence vs absence
 polychotomous categorical: rate ratio of success for presence of j relative to the reference category
 continuous: multiplicative increase in E(Yi) associated with a one unit increase in Xj
o e^(r βj) is the estimated change in the incidence rate of success corresponding to an r unit
increase in Xj
 βi: difference in log(μi) for one group compared to another group
0 < e^βi < 1: (1 – A)% decrease in the incidence rate of success
1 < e^βi < 2: (A – 1)% increase in the incidence rate of success
e^βi ≥ 2: (A – 1)% increase in the incidence rate of success / A-fold increase in the incidence rate of success
notes:
 predict(model) = log(ni) + xiᵀb
 fitted(model) = ni exp(xiᵀb)
 in R: y ~ explanatory variables + offset(log(exposure)), family = poisson
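A corresponding Poisson-regression-with-offset sketch in R (the data values are hypothetical):
# Claim counts y over varying exposure (e.g. person-years) for each covariate pattern.
dat <- data.frame(age   = c("young", "young", "old", "old"),
                  expos = c(120, 80, 200, 150),
                  y     = c(6, 3, 28, 19))
fit <- glm(y ~ age + offset(log(expos)), family = poisson, data = dat)
exp(coef(fit))   # incidence rate and rate ratios
predict(fit)     # log(n_i) + x_i' b
fitted(fit)      # n_i * exp(x_i' b): expected counts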


Log-linear models
preliminary result:
Y = (Y1, …, YN)ᵀ where the Yi ~ Poisson(λi) are independent and ΣYi = n ~ Poisson(λ1 + … + λN); then
Y | n ~ Multinomial(n, π1, …, πN) where πi = λi / Σk λk
Observe categorical variables, say A (with I levels), B (with J levels), C (with K levels) and D (with L levels):
1. Independent Poisson sampling (no constraints)
Yijkl = the number of observations in each cell of the cross-classification table, N = I×J×K×L cells and
Yijkl ~ Poisson(μijkl)
 f(y; μ) = Π_p [ μp^(yp) e^(−μp) / yp! ]
saturated model:
 log(μijkl) = μ + λiA + λjB + λkC + λlD + λijAB + λikAC + λilAD + λjkBC + λjlBD + λklCD + λijkABC + λijlABD + λiklACD + λjklBCD + λijklABCD
2. Multinomial sampling (grand total fixed)
 also conditional Poisson sampling with a fixed n
 observational studies
Yijkl = the number of observations in each cell of the cross-classification table with Yijkl ~ Poisson(μijkl)
subject to Σijkl yijkl = n
 f(y; μ | n) = n! Π_p [ θp^(yp) / yp! ]
o the joint PDF is multinomial, but we can apply the GLM exponential-family results as the marginals are Poisson
E(Yijkl) = μijkl = n θijkl is the expected number of observed counts
θijkl = probability of being in the ijkl cell of the cross-classification table
saturated model: E(Yijkl) = n θijkl
 log(μijkl) = μ + λiA + λjB + λkC + λlD + λijAB + … + λijklABCD
 log(μijkl) = log(n) + log(θijkl)
o where i = 1, …, I, j = 1, …, J, k = 1, …, K, l = 1, …, L (over-parameterized)
 log(μijkl) = log(μ1111) + λiA + λjB + λkC + λlD + λijAB + … + λijklABCD (corner point)
o where i, j, k, l start at 2 (resolves the over-parameterization)
additive model: E(Yijkl) = n θi··· θ·j·· θ··k· θ···l (all variables are independent)
 log(μijkl) = μ + λiA + λjB + λkC + λlD
minimal model: E(Yijkl) = n/N; the n observations are equally distributed across all cells (the variables do not
influence the cell frequencies)
 log(μijkl) = μ

3. Product multinomial sampling (row totals fixed)
 prospective experiment with treatments randomly assigned, or an observational study with a fixed amount
from each group
Yijkl = the number of observations in each cell of the cross-classification table with Yijkl ~ Poisson(μijkl)
subject to Σjkl yijkl = yi···
 f(y; μ | y1···, …, yI···) = Πi [ yi···! Πjkl θijkl^(yijkl) / yijkl! ]
E(Yijkl) = μijkl = yi··· θijkl is the expected number of observed counts
θijkl = probability of being in the jkl cell of the cross-classification table given you are at level i
saturated model: E(Yijkl) = yi··· θijkl
 log(μijkl) = μ + λiA + λjB + λkC + λlD + λijAB + … + λijklABCD
o where i = 1, …, I, etc. (over-parameterized)
 log(μijkl) = μ1111 + λiA + λjB + λkC + λlD + λijAB + … + λijklABCD (corner point)
o where i, j, k, l start at 2 (resolves the over-parameterization)
additive model: E(Yijkl) = yi··· θ·j·· θ··k· θ···l
 log(μijkl) = μ + λiA + λjB + λkC + λlD
minimal model: E(Yijkl) = yi···/(JKL); the yi··· observations allocated to level i are equally distributed across the
other variables
 log(μijkl) = μ + λiA
Interpretation:
 exp(xiᵀβ) = μijkl is the expected cell frequency
 significant interaction terms imply there is association between the relevant variables (they are not
independent)
Notes:
 models are hierarchical, i.e., if a higher order interaction is included, all related lower-order terms
are also included.
 The hypotheses of interest mostly centre around the inclusion of interaction terms.
 Fitting the saturated model that contains all possible interactions is equivalent to reproducing the
data. The aim is usually to summarize the data with a parsimonious model, one that contains as few
terms as possible.
 Goodness of fit statistics as described for Poisson regression models are valid and hypothesis
testing is based on comparing these goodness of fit statistics for nested models.
 Pearson and deviance residuals can be calculated.
 fitted(model) = μ̂ijkl = expected frequency
 predict(model) = log(μ̂ijkl)
 in R: frequency ~ categories, family = poisson
 xtabs(frequency/fitted ~ row heading + column heading + tables split by)
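A small hedged sketch of fitting log-linear models to a two-way table in R (the counts are hypothetical):
# Cell frequencies for a 2x2 cross-classification.
dat <- expand.grid(exposure = c("no", "yes"), disease = c("no", "yes"))
dat$freq <- c(200, 50, 40, 30)
add <- glm(freq ~ exposure + disease, family = poisson, data = dat)  # independence (additive) model
sat <- glm(freq ~ exposure * disease, family = poisson, data = dat)  # saturated model
anova(add, sat, test = "Chisq")        # tests the exposure:disease association
dat$fit <- fitted(add)
xtabs(fit ~ exposure + disease, data = dat)   # fitted frequencies under independence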
Decision Theory and Bayesian statistics
Bayesian philosophy:
inferential / subjective probability: measure of belief in the assertion that an outcome will occur
 acquired by systematic but subjective evaluation e.g. probability wheel
 must be:
o internally logically consistent
o any “objective” data must be correctly integrated with the prior subjective evaluations
o totality of information is properly incorporated into the decision making process.

2. Decision theory (choosing a ∈ A)


purpose: provide alternative statistical paradigm, one which can give advice and guidance to decision
makers (DM), as to which of a set of alternative courses of action will best satisfy the goals of the decision
maker, given a situation of imperfect information and/or uncertain outcomes
Decision problem components:
1. actions a ∈ A: decision maker must select a ∈ A. (A can be discrete or represent continuum of
alternatives)
2. unknown state of nature θ ∈ Θ: state space Θ represents set of unknown future or present
conditions which will affect outcome of decision (Θ can be discrete or represent continuum of
alternatives)
3. consequences: loss, L ( a ,θ ) , or payoff V ( a ,θ ) experienced if action a ∈ A is taken and the
state is θ
 L ( a ,θ ) = −V ( a , θ ) by definition
Pareto optimality principle: ai dominates aj ⇔ L ( ai ,θ ) ≤ L ( aj , θ) ∀ θ ∈ Θ with strict
inequality for at least one θ
 must be dominated by one decision, not a combination
 in rational decision making, we restrict A to set of non-dominated decisions

Game Theory
player I chooses aI ∈ AI, player II chooses aII ∈ AII; player I gets payoff P(aI, aII), player II gets
payoff Q(aI, aII)
AI, AII discrete: (assume further)
 player I chooses i ∈ {1, …, m} and gets payoff Pij, player II chooses j ∈ {1, …, n} and gets
payoff Qij
 can represent as m × n payoff matrices P = [Pij] and Q = [Qij]
let the game be a zero-sum game: Pij = −Qij for all i, j ⇒ what one player wins the other loses (assume
further)
 can now summarize in one matrix P where Pij = payoff to I (I wants to maximize) and loss to
II (II wants to minimize)
let vI = max_i min_j Pij, attained when I chooses i*, and vII = min_j max_i Pij, attained when II chooses j*
 vI is I’s guaranteed minimum payoff, vII is II’s guaranteed maximum loss
 an equilibrium solution exists ⇔ vI = vII
 an equilibrium solution does not exist ⇒ vI ≤ vII
Mixed strategy:
let the choices of aI and aII be randomized with P(aI = i) = xi, P(aII = j) = yj and Σi xi = Σj yj = 1
 I’s expected payoff = II’s expected loss = xᵀPy
 vI = max_x min_y xᵀPy = max_x min_y Σj yj (Σi xi Pij) = max_x min_j Σi xi Pij,
since the min over y is achieved by putting yj = 1 on the column with the smallest Σi xi Pij
o maximize: v
o subject to: v − Σi xi Pij ≤ 0 for all j (so that v is the smallest Σi xi Pij)
Σi xi = 1 and xi ≥ 0 for all i (so that the x are probabilities)
 vII = min_y max_x xᵀPy
o minimize: w
o subject to: w − Σj yj Pij ≥ 0 for all i (so that w is the largest Σj yj Pij)
Σj yj = 1 and yj ≥ 0 for all j (so that the y are probabilities)
 if a finite solution exists, vI = vII always


Special case: Prisoner’s dilemma (non-zero-sum game) ⇒ the equilibrium (saddle point) is that both prisoners turn state witness
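The mixed-strategy LP formulated above can be solved numerically; a hedged sketch using R’s lpSolve package (the package, the payoff matrix and the non-negative game value are assumptions, not part of the notes):
# Solve max_x min_j sum_i x_i P_ij as an LP (variables: x_1..x_m and v).
library(lpSolve)                           # assumes the lpSolve package is installed
P <- rbind(c(2, -1), c(-1, 1))             # hypothetical 2x2 zero-sum payoff matrix for player I
m <- nrow(P); n <- ncol(P)
obj <- c(rep(0, m), 1)                     # maximize v
con <- rbind(cbind(t(P), -1),              # sum_i x_i P_ij - v >= 0 for every column j
             c(rep(1, m), 0))              # sum_i x_i = 1
dir <- c(rep(">=", n), "=")
rhs <- c(rep(0, n), 1)
sol <- lp("max", obj, con, dir, rhs)       # note: lp() forces v >= 0; shift P by a constant if the value could be negative
x <- sol$solution[1:m]; v <- sol$solution[m + 1]
x; v                                       # optimal mixed strategy for I and the game value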
Games against nature
DM plays first and selects a ∈ A based on assumptions about nature:
 MIN-MAX criterion ⇒ a* = a ∈ A to minimize max over θ ∈ Θ of L(a, θ) (upper bound on loss)
 MIN-MIN criterion ⇒ a* = a ∈ A to minimize min over θ ∈ Θ of L(a, θ) (nature is co-operative)
 Hurwicz criterion ⇒ a* = a ∈ A to minimize
β min over θ ∈ Θ of L(a, θ) + (1 − β) max over θ ∈ Θ of L(a, θ), for 0 < β < 1
 Problems: factors external to the decision problem affect the result (very detrimental state)
 Solution: think of losses relative to the best achieved with perfect foresight ⇒ define regrets
Losses defined as regrets: “the best case for each state (minimum for each column) is always 0 and other
losses are relative to the best case”
Regret when a is taken and the state is θ: R(a, θ) = L(a, θ) − min over α ∈ A of L(α, θ) =
max over α ∈ A of V(α, θ) − V(a, θ) ≥ 0
 MIN-MAX REGRET criterion ⇒ a* = a ∈ A to minimize max over θ ∈ Θ of R(a, θ)
Expected Value Theory
PMF or PDF of θ: π(θ), where π(θ) is the subjective probability that the true state is θ
 expected loss for a ∈ A: E[L(a, θ)] = ∫ L(a, θ)π(θ)dθ or E[L(a, θ)] = Σr L(a, θr)π(θr)
 EXPECTED VALUE criterion ⇒ a* = a ∈ A to minimize E[L(a, θ)] or E[R(a, θ)] (same
decision)
 violated when the DM is not risk neutral
Utility value function: increasing monotone function u(z) where z is the payoff. Must satisfy the following
axioms:
 completeness: defined for all payoffs
 transitivity: if A > B and B > C then A > C
 independence:
 continuity:
 EXPECTED UTILITY criterion ⇒ a* = a ∈ A to minimize E[u(L(a, θ))] or E[u(R(a, θ))] (same decision)
 Problem: estimating utility functions is tedious.
 Solution: use an approximation if the DM is risk averse & the losses/payoffs follow a normal dbn or u(z) is
quadratic over the range of interest
o E[u(L(a, θ))] is minimized at a ∈ A ⇔ E[L(a, θ)] + K·Var[L(a, θ)] is minimized at a ∈ A
o E[u(V(a, θ))] is maximized at a ∈ A ⇔ E[V(a, θ)] − K·Var[V(a, θ)] is maximized at a ∈ A
o choice of K

Use and Value of Information
 Benefits of information: reduce uncertainty ⇒ improve the quality of the decision ⇒ gain value
 Costs of information: collection can be costly (surveys, consulting experts) ⇒ loss of value
EXPECTED VALUE OF PERFECT INFORMATION (EVPI):
 perfect information allows us to know with certainty which state will occur
o EVPI = min over α ∈ A of Eθ[L(α, θ)] − Eθ[ min over α ∈ A of L(α, θ) ] ≥ 0
o EVPI = Eθ[ max over α ∈ A of V(α, θ) ] − max over α ∈ A of Eθ[V(α, θ)] ≥ 0
 gaining information is only justified if its cost ≤ EVPI
EXPECTED VALUE OF SAMPLE INFORMATION (EVSI):
 imperfect information gives us more insight into the value of θ but does not actually reveal the value of θ
 outcomes of the survey: x1, …, xn (realizations of a random variable X)
 similar previous situations allow us to assess the conditional probabilities of the outcomes for each
state: f(xi | θj)
“Calculate the posterior probabilities and the optimal action for each possible X” (here 2 sample outcomes, 2
states, 2 actions). For each value x of X, tabulate:
 joint probabilities: f(x | θ1)π(θ1) and f(x | θ2)π(θ2)
 predictive probability: m(x)
 posterior probabilities: π(θ1 | x) and π(θ2 | x)
 posterior expected regrets/losses Eθ|X[L(a, θ) | x] for a1 and a2
 optimal action: the action minimizing Eθ|X[L(a, θ) | x]
 optimal action under the new information: a ∈ A minimizing Eθ|X[L(a, θ) | xi] = posterior expected loss
 EVSI = prior expected loss (before sampling) – weighted average of the posterior expected losses
 EVSI = min over α ∈ A of Eπ[R(α, θ)] − [ min over α ∈ A of E[R(α, θ) | x1] × m(x1) + … + min over α ∈ A of E[R(α, θ) | xn] × m(xn) ]
 EVSI = min over α ∈ A of Eθ[L(α, θ)] − EX[ min over α ∈ A of Eθ|X[L(α, θ) | X] ]
 the above are equivalent
Efficiency of the information process: 0 ≤ EVSI/EVPI ≤ 1
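A hedged numerical sketch of the EVPI/EVSI recipe in R for a 2-state, 2-action, 2-outcome problem (the regrets, prior and likelihoods are made-up numbers):
# Regret table R[a, theta], prior pi(theta), and survey likelihoods f(x | theta).
R     <- rbind(a1 = c(0, 10), a2 = c(4, 0))          # regrets: the best action per state has regret 0
prior <- c(0.6, 0.4)
f     <- rbind(x1 = c(0.8, 0.3), x2 = c(0.2, 0.7))   # columns correspond to theta1, theta2
prior.regret <- as.vector(R %*% prior)               # E[R(a, theta)] per action
EVPI  <- min(prior.regret)                           # E_theta[min_a R] = 0 when losses are regrets
joint <- sweep(f, 2, prior, "*")                     # f(x | theta) * pi(theta)
m     <- rowSums(joint)                              # predictive probabilities m(x)
post  <- joint / m                                   # posterior pi(theta | x), one row per x
post.regret <- apply(post, 1, function(p) min(R %*% p))   # min_a posterior expected regret per x
EVSI  <- min(prior.regret) - sum(m * post.regret)
c(EVPI = EVPI, EVSI = EVSI, efficiency = EVSI / EVPI)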
3. Decision Rules and Risks (choosing δ ∈ D)
Purpose: analytically select the best δ ∈ D, which can then be implemented mechanically
Decision rule δ: X → A : x → δ(x), i.e. if we observe X = x, then we take action δ(x) = a ∈ A
 e.g. accept/reject decision, estimator selected
Risk: measure of the consequence of using the rule δ for a particular value of θ
 R(δ, θ) = EX|θ[ L(δ(X), θ) | θ ] = ∫ L(δ(x), θ) f(x | θ) dx or Σr L(δ(xr), θ) f(xr | θ)
 losses can be defined in regret terms
Pareto optimality: δi dominates δj ⇔ R(δi, θ) ≤ R(δj, θ) ∀ θ ∈ Θ with strict inequality for at
least one θ ⇒ δj is inadmissible
 in rational decision making, we restrict D to the set of non-dominated/admissible rules
Game theory principle
 MIN-MAX RISK decision rule: δ ∈ D that minimizes rM(δ) = max over θ ∈ Θ of R(δ, θ)
Expected value principle
 BAYES RULE: δ ∈ D that minimizes rE(δ) = Eθ[R(δ, θ)]
 admissible rule ⇒ Bayes rule for some probability dbn on θ
 Bayes rule with a dbn on θ that gives positive mass to all subsets of Θ ⇒ the rule is admissible
 the Bayes rule minimizes the posterior expected loss (Eθ|X[L(a, θ) | x]) ⇒ use the latter to find the Bayes
rule
EVSI = minimum expected risk using a constant rule – expected risk of the Bayes rule
= min over α ∈ A of Eθ[L(α, θ)] − Eθ[R(δBayes, θ)]
4. Inference and Decision theory
 probability: Bayesian – subjective or inferential measure of uncertainty; Frequentist – relative long-run
frequency in hypothetical trials
 parameters: Bayesian – random variables; Frequentist – unknown constants
 point estimation: Bayesian – posterior summary; Frequentist – estimator with desired properties
 interval estimation: Bayesian – credibility interval: A% chance θ lies in the interval given X (posterior
probability); Frequentist – confidence interval: A% of sampled intervals contain the true value of the
parameter
 hypothesis testing: Bayesian – posterior probability of a hypothesis; Frequentist – assume the hypothesis
is true and measure the probability of observing the estimate using the estimator’s distribution
Set up: perform inference on a random variable θ after taking a random sample X = (X1, …, Xn) whose
dbn depends on θ:
 conditional joint dbn of x: f(x | θ) = f(x1 | θ) … f(xn | θ) = L(θ; x)
 prior dbn of θ = “measure of subjective uncertainty”: π(θ)
 joint dbn of θ and x: L(θ; x)π(θ)
 marginal/predictive dbn of x: m(x) = ∫_Θ L(θ; x)π(θ)dθ, which is constant w.r.t. θ
 posterior dbn of θ given x: π(θ | x) = L(θ; x)π(θ)/m(x) ∝ L(θ; x)π(θ) ∝ g(t, θ)π(θ)
o Reason: terms that don’t depend on θ and appear in both the numerator and denominator cancel
(just drop them)
 if T is a sufficient statistic for θ and t = T(x), then by the factorization theorem
L(θ; x) = g(t, θ)h(x) where h(x) doesn’t depend on θ ⇒ we can factorize and
cancel
 m(x) doesn’t depend on θ. It can be calculated to make the dbn a proper PDF or
PMF (normalising term)
 Choose the action (hypothesis or estimate) which minimizes the posterior expected loss

Simple hypothesis testing:
states: θ = θ0 or θ = θ1
actions: accept H0: θ = θ0, accept H1: θ = θ1
prior dbn: π(θ0) = π0
posterior odds ratio: π(θ1 | x)/π(θ0 | x) = [(1 − π0)/π0] × [L(θ1; x)/L(θ0; x)] = prior odds ratio × Bayes factor
Loss (in regrets so that one entry equals 0):
            θ = θ0   θ = θ1   posterior expected loss
accept H0   0        c01      π(θ1 | x) c01
accept H1   c10      0        π(θ0 | x) c10
Optimal decision: reject H0 (accept H1) ⇔ π(θ0 | x) c10 < π(θ1 | x) c01 ⇔ L(θ1; x)/L(θ0; x) > (π0 c10)/((1 − π0) c01)
 same form as Neyman-Pearson ⇒ most powerful for any “significance level”
 here the rejection region is determined by the losses and prior likelihoods as opposed to an arbitrary
significance level
Bayesian Estimation:
states: possible values of θ
action: estimate of θ (prior estimate based on π(θ), posterior estimate based on π(θ | x))
decision: choice of estimator θ̂ = δ(x)
loss function: L(θ̂, θ) is the loss/regret when the estimate is θ̂ and the true value is θ
L(θ, θ) = 0; L(θ̂, θ) > 0 for any θ̂ ≠ θ; L′(θ, θ) = 0 as the loss is minimized at the true value
L(θ̂, θ) ≈ k(θ)(θ̂ − θ)² by a second order Taylor expansion
optimal decision: θ̂ is the function of x that minimizes Eθ|x[L(θ̂, θ) | X = x]
 L(θ̂, θ) = k(θ̂ − θ)² ⇒ θ̂ = E[θ | X = x] = posterior expectation of θ given x (most common);
the minimized value is Eθ|x[L(θ̂, θ) | X = x] evaluated at θ̂ = E[θ | X] = k Var(θ | X = x)
 L(θ̂, θ) = k|θ̂ − θ| ⇒ θ̂ is the median of the dbn of θ
 L(θ̂, θ) = k ∀ θ̂ ≠ θ ⇒ θ̂ is the mode of the dbn of θ
properties of estimators used to compare:
 EX|θ[θ̂(X) | θ] measures bias
 VarX|θ[θ̂(X) | θ]
 EX|θ[ (θ̂(X) − θ)² | θ ] = Var[θ̂(X) | θ] + (θ − E[θ̂(X) | θ])² measures risk
(1 – α)% Credibility interval
Let θL and θU be the α/2 and 1 − α/2 percentiles of the posterior dbn of θ, so that P(θL ≤ θ ≤ θU | x) = 1 − α
⇒ the (1 – α)% credibility interval is given by (θL, θU), where the bounds are percentiles of the posterior dbn of θ
Examples with conjugate priors:
Binomial sampling: X | θ ~ Bin(n, θ)
θ ~ Beta(α, β)
θ | X ~ Beta(α + x, β + n − x)
 E[θ | X] = θ̂ = (α + x)/(α + β + n) = [(α + β)/(α + β + n)] × (α/(α + β)) + [n/(α + β + n)] × (x/n)
= [(α + β)/(α + β + n)] × prior mean + [n/(α + β + n)] × sample mean
 E[θ̂(X) | θ] = (α + nθ)/(α + β + n)
 Var[θ̂(X) | θ] = nθ(1 − θ)/(α + β + n)²
specifying α, β:
1. specify the likelihood of a range of θ (e.g. 95% sure θ is between 0.2 and 0.4)
2. specify the mean (m) and variance (v) of θ: solve m = α/(α + β), v = αβ/[(α + β)²(α + β + 1)]
3. specify the mean (m) and a sample size (n) big enough to place equal confidence in sample and prior:
n = α + β, m = α/(α + β)
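A hedged R sketch of method 2 above (matching an elicited prior mean and variance) followed by the Beta posterior update (the numbers are hypothetical):
# Choose a Beta(alpha, beta) prior from an elicited mean and variance, then update
# with binomial data: x successes out of n trials.
m <- 0.3; v <- 0.01
alpha <- m * (m * (1 - m) / v - 1)        # solves m = a/(a+b), v = ab/((a+b)^2 (a+b+1))
beta  <- (1 - m) * (m * (1 - m) / v - 1)
x <- 12; n <- 50
post.mean <- (alpha + x) / (alpha + beta + n)          # credibility-weighted estimate
ci <- qbeta(c(0.025, 0.975), alpha + x, beta + n - x)  # 95% credibility interval
c(alpha = alpha, beta = beta, post.mean = post.mean); ci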
Poisson sampling (sample size = n): X | θ ~ Poisson(θ)
θ ~ Gamma(α, λ)
θ | X ~ Gamma(α + Σxi, λ + n)
Gamma sampling (order k, sample size = n): X | θ ~ Gamma(k, θ)
θ ~ Gamma(α, λ)
θ | X ~ Gamma(α + nk, λ + Σxi)
Uniform sampling (sample size = n): X | θ ~ U(0, θ)
θ ~ Pareto(α, β)
θ | X ~ Pareto(α + n, max{β, x1, …, xn})
Normal sampling: X | μ, τ ~ N(μ, 1/τ)
 instead of the variance, τ = 1/σ² is used as a measure of precision
 π(μ, τ | x) ∝ π(μ, τ) × τ^(n/2) exp(−τ ns²/2), where ns² = Σi (xi − μ)²
 π(μ, τ | x) ∝ π(μ, τ) × τ^(n/2) exp(−τ Sxx/2) × exp(−τ n(μ − x̄)²/2), where Sxx = Σi (xi − x̄)²
 τMLE = n/Sxx and Sxx is sufficient for τ
 μMLE = x̄ and x̄ is sufficient for μ
 ideally we would perform inference about μ and τ at the same time using a joint prior dbn
 evaluation of the joint posterior is computationally tedious ⇒ perform inference one parameter at a time
Case 1: τ known, μ unknown
X | μ ~ N(μ, 1/τ)
μ ~ N(m, 1/ϕ)
μ | X ~ N( (ϕm + nτx̄)/(ϕ + nτ), 1/(ϕ + nτ) )
E[μ | X] = μ̂ = (ϕm + nτx̄)/(ϕ + nτ) = [ϕ/(ϕ + nτ)] × prior mean + [nτ/(ϕ + nτ)] × sample mean
 as n → ∞, the relative quality of the prior information decreases ⇒ the prior weight ϕ/(ϕ + nτ) → 0
Credibility intervals: use the posterior normal dbn
Case 2: μ known, τ unknown
X | τ: L(τ; x) ∝ τ^(n/2) exp(−τ ns²/2), the kernel of a Ga(n/2 + 1, ns²/2) dbn in τ
τ ~ Ga(α, λ)
τ | X ~ Ga(α + n/2, λ + ns²/2)
E[τ | X] = τ̂ = (n + 2α)/(ns² + 2λ)
E[σ² | X] = E[1/τ | X] = (ns² + 2λ)/(n + 2α − 2) = [(α − 1)/(α − 1 + n/2)] × λ/(α − 1) + [(n/2)/(α − 1 + n/2)] × s²
Credibility interval: let u = (2λ + ns²)τ. Then u | X ~ Ga( (2α + n)/2, 1/2 ) = χ²n+2α
Bayesian Regression
Set up: y = Xβ + e where e ~ MVN(0, (1/τ)I) ⇒ y ~ MVN(Xβ, (1/τ)I)
Least squares estimate: β̂ = (XᵀX)⁻¹Xᵀy   use: Xᵀy = (XᵀX)β̂
Residual sum of squares: SR = êᵀê = (y − Xβ̂)ᵀ(y − Xβ̂) = yᵀy − β̂ᵀXᵀXβ̂
General case: τ, β unknown
L(β, τ; y, X) ∝ τ^(n/2) exp( −(τ/2)(y − Xβ)ᵀ(y − Xβ) ) = τ^(n/2) exp( −(τ/2)(β − β̂)ᵀXᵀX(β − β̂) ) × exp( −(τ/2)SR )

Case 1: τ known, β unknown
L(β; y, X) ∝ τ^(n/2) exp( −(τ/2)(β − β̂)ᵀXᵀX(β − β̂) )
y | β ~ MVN(Xβ, (1/τ)I)
β ~ N(b0, T0⁻¹)
β | y, X ~ N(b1, T1⁻¹) where T1 = T0 + τXᵀX and b1 = T1⁻¹(T0 b0 + τXᵀXβ̂)
E[β | y, X] = b1
Case 2: b0 = 0, T0 = kτI for k > 0
Ridge regression estimator: E[β | y, X] = b1 = (kI + XᵀX)⁻¹Xᵀy
 the prior assumptions imply all betas are independent and all equally likely to be positive or negative
 generalization of the frequentist approach (recovered when k = 0 ⇔ the prior variance is infinite)
 admissible under squared error loss for k > 0, not for k = 0
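A hedged base-R sketch of the ridge estimator above on simulated data (k is an arbitrary choice):
# Ridge (Bayes posterior mean with b0 = 0, T0 = k*tau*I) versus least squares.
set.seed(2)
n <- 40
X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept
y <- X %*% c(1, 2, -1) + rnorm(n)
k <- 5                                     # prior precision multiple (arbitrary)
b.ls    <- solve(t(X) %*% X, t(X) %*% y)                      # k = 0: ordinary least squares
b.ridge <- solve(k * diag(ncol(X)) + t(X) %*% X, t(X) %*% y)  # (kI + X'X)^(-1) X'y
cbind(ls = b.ls, ridge = b.ridge)          # ridge estimates are shrunk towards 0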
Prediction: given a future observation x0, we want information about the predicted value Y0 = x0ᵀβ with
posterior dbn Y0 | y, X
we know: Y0 | β ~ N(x0ᵀβ, 1/τ) and β | y, X ~ N(b1, T1⁻¹) ⇒ (Y0, β) | y, X is normal and,
integrating over β, Y0 | y, X is normal
Determine the parameters of the posterior:
 E[Y0 | y, X] = x0ᵀb1   using E[X | Y] = EZ[ E[X | Y, Z] | Y ]
 Var[Y0 | y, X] = 1/τ + x0ᵀT1⁻¹x0   using Var(X) = E(Var(X | Y)) + Var(E(X | Y))
Y0 | y, X ~ N( x0ᵀb1, 1/τ + x0ᵀT1⁻¹x0 )
notes:
 to find the dbn of a normal posterior, it is often necessary to rearrange and complete the square in the exponent. If
the parameter of interest is θ:
o aθ² − 2θb + terms independent of θ ∝ a(θ − b/a)² ⇒ f(θ) ∝ exp( −(θ − b/a)² / (2(1/a)) ),
which is the functional form of N(b/a, 1/a)
o θᵀAθ − 2θᵀB + terms independent of θ ∝ (θ − A⁻¹B)ᵀA(θ − A⁻¹B) ⇒ the functional
form of N(A⁻¹B, A⁻¹)
Monte Carlo simulation
Use: calculation of E[θ | X], Var[θ | X], quantiles, and integrals of the form ∫_Θ r(θ)π(θ | x)dθ when
π(θ | x) is non-standard (with ∫_Θ π(θ | x)dθ = 1)
1. simulate a large number of (pseudo) random observations θ1, …, θN from π(θ | x)
 if π(θ | x) is standard or its CDF ψ(θ) is invertible: probability integral transform
 if not: probability integral transform with an accept/reject algorithm
2. use the sample moments as estimates of the theoretical moments
3. ∫_Θ r(θ)π(θ | x)dθ ≈ Σi r(θi)/N
Probability integral transform
1. Generate U ~ U(0, 1)
2. Define θ = ψ⁻¹(U) where ψ is the CDF
CDF of θ = P(θ ≤ α) = P(U ≤ ψ(α)) = ψ(α) ⇒ θ is an observation from the dbn with CDF ψ
Probability integral transform with accept/reject algorithm
1. choose a standard (or CDF-invertible) dbn h(θ) s.t. h(θ) > 0 ⇔ π(θ | x) > 0
 the more similar h(θ) and π(θ | x), the higher the acceptance rate
 useful technique: h(θ) = π(θ)
 joint prior: π(a, b) = π(a | b)π(b) ⇒ generate bi from π(b), then ai from π(a | bi)
2. Define C = max over {θ : h(θ) > 0} of π(θ)g(t, θ)/h(θ)
3. Define γ(θ) = π(θ)g(t, θ) / (C h(θ))
4. Generate θi from h(θ)
5. Independently, generate Ui from U(0, 1)
6. accept θi if Ui ≤ γ(θi) and reject otherwise. If θi is accepted,
PDF of θi = h(θi) P(Ui ≤ γ(θi) | θi) / P(Ui ≤ γ(θi)) ∝ h(θi)γ(θi) ∝ π(θi)g(t, θi) ∝ π(θi | x)
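A hedged R sketch of the accept/reject step for binomial sampling with a Beta(2, 2) prior, taking h(θ) = π(θ) so that γ(θ) = g(t, θ)/C (the numbers are illustrative; the exact posterior is Beta, so the output can be checked):
# Accept/reject sampling from pi(theta | x) proportional to pi(theta) * g(t, theta),
# for x = 7 successes in n = 10 trials, prior Beta(2, 2), proposal h = prior.
set.seed(3)
n <- 10; x <- 7
g      <- function(th) th^x * (1 - th)^(n - x)      # likelihood kernel g(t, theta)
C      <- g(x / n)                                  # max of g over (0, 1) occurs at the MLE x/n
N      <- 1e5
theta  <- rbeta(N, 2, 2)                            # draws from h(theta) = pi(theta)
u      <- runif(N)
sample <- theta[u <= g(theta) / C]                  # keep draws with U <= gamma(theta)
mean(sample)                                        # compare with the exact posterior mean (2 + x)/(4 + n)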
Notes
 results depend on an explicitly stated prior ⇒ problematic when objectivity is sought (legal, scientific).
Counter-arguments:
o existence of non-informative priors
o contributions of the data and the prior can often be clearly distinguished
o subjective choices in the frequentist approach: choice of model, stopping rule in sampling,
choice of hypotheses
o situations in which real prior information does exist and cannot be ignored
 Conjugate prior: the posterior dbn has the same functional form as the prior (calculate both to prove
something is a conjugate prior)
o in this case we can write the Bayesian estimate as a weighted sum of data and prior information
(credibility formula)
 precision: spread of the data relative to each other
 accuracy: distance of the estimate from the true value
 as n → ∞, the Bayesian estimate → maximum likelihood estimate in value (as prior information is
counted less and less relative to sample information) but still differs in interpretation
 as n → ∞, the credibility interval → conventional confidence interval from sampling theory in the value of
the bounds, but still differs in interpretation
 functional form: PDF stripped of all terms that do not depend on the RV
 mode: value of X that maximizes the pdf
Non-Informative priors
Empirical Bayes (sequential sampling)
Set up: we have a sequence of parameters θ1, …, θn and measurements X1, …, Xn with no
systematic trend; draw inference on θn+1
Ideally we would use the empirical dbn of the θi, but we only have the empirical dbn of the Xi. Proceed as follows:
 if we suppose θi ~ G(θ | η), we can use π(θ) = G(θ | η̂) where η̂ is found by maximizing
L(x; η) = Π f(xi | η)
 where f(xi | η) = ∫_Θ f(xi | θi) g(θi | η) dθi
Beta-Binomial: Xi | θi ~ Bin(m, θi), θi | α, β ~ Beta(α, β); f(xi | α, β) is the beta-binomial pmf ⇒ π(θ) = Beta(α̂, β̂)
Normal: Xi | θi ~ N(θi, 1), θi | μ, σ² ~ N(μ, σ²), xi | μ, σ² ~ N(μ, σ² + 1) ⇒ π(θ) = N(μ̂, σ̂²)
Notes:
 generally we need: G(θ | η) to be a conjugate prior for f(x | θ), and f(xi | η) to be analytically tractable
 Empirical Bayes estimate vs MLE: the former is adjusted towards the long term average

Chapter 7: Credibility factors
Suppose θ̂ = E[θ | X] = Z × sample estimate + (1 − Z) × prior estimate; then Z ∈ (0, 1) is the
credibility factor
 Z = 0: no credibility given to the sample, Z = 1: total credibility given to the sample
 this form is desirable due to the intuitive interpretation of the estimate and the prior parameters (as opposed to
numerical methods)
This form of credibility formula is ensured when conjugate priors are used:
 Binomial-Beta: E[θ | X] = Z(x/n) + (1 − Z)(α/(α + β)) where Z = n/(n + α + β)
 Normal (known σ²X): E[θ | X] = Z X̄ + (1 − Z)μ where Z = n/(n + σ²X/σ²p)
 Poisson-Gamma: E[θ | X] = Z X̄ + (1 − Z)(α/λ) where Z = n/(n + λ)
We would like to achieve these properties without restriction to conjugate priors. Formally we want an estimator
(decision rule) that is:
a) linear in terms of the observations
b) depends on a prior that can be specified in terms of its mean and variance

EBCT model 1:
Definitions:
 Xi1, …, Xin are the claim history (claim size or # claims) for n periods for group/risk i, i = 1, …, m
 the Xit follow a marginal dbn which depends on a parameter θi (dim(θ) ≥ 1)
 E[Xit | θi] = m(θi)
 Var[Xit | θi] = s²(θi)
Aim: estimate the unconditional mean of the marginal dbn, namely m(θi)
Method: solve for the coefficients of the linear estimator that minimize the posterior squared error loss
Assumptions:
 θi, θj are i.i.d. ⇒ prior moments are independent of i: Eθ[m(θi)] = E[m(θ)] for all i
 Xkt, Xiu are independent for k ≠ i (even if t = u)
 Xit | θi are i.i.d.
 no systematic trend over time
 exposure to risk is constant over all groups and time periods
Notation:
 Global mean: X̿ = Σi X̄i / m
 Group mean: X̄i = Σt Xit / n
 Within groups SS: SSE = Σi Σt (Xit − X̄i)²
 Between groups SS: SST = Σi (X̄i − X̿)²
EBCT estimate for m(θi): m̂(θi) = Z X̄i + (1 − Z) E[m(θ)] where Z = n / ( n + E[s²(θ)]/Var[m(θ)] )
 linear rule for easy interpretation, credibility formula irrespective of the prior dbn, no need for a
conjugate prior
Unbiased estimates for the prior parameters (using data from the other risks ⇒ empirical Bayes approach)*:
 E[m(θ)]: X̿
 E[s²(θ)]: SSE / (m(n − 1))
 Var[m(θ)]: max{ 0; SST/(m − 1) − SSE/(mn(n − 1)) }
*prove unbiasedness by taking expectations, using E[X] = E[E[X | θ]] and (n − 1)s²/σ² ~ χ²n−1
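A hedged R sketch of the EBCT model 1 calculations for an m × n claims matrix (the claim figures are made up):
# EBCT model 1: credibility estimate Z * group mean + (1 - Z) * overall mean.
# X is an m x n matrix: row i = claim history of risk i over n periods.
X <- rbind(c(102, 98, 110, 95), c(140, 155, 132, 148), c(80, 87, 75, 90))
m <- nrow(X); n <- ncol(X)
Xbar.i <- rowMeans(X)                        # group means
Xbar   <- mean(Xbar.i)                       # overall mean (estimate of E[m(theta)])
SSE    <- sum((X - Xbar.i)^2)                # within-groups SS
SST    <- sum((Xbar.i - Xbar)^2)             # between-groups SS
Es2    <- SSE / (m * (n - 1))                # estimate of E[s^2(theta)]
Varm   <- max(0, SST / (m - 1) - SSE / (m * n * (n - 1)))  # estimate of Var[m(theta)]
Z      <- n / (n + Es2 / Varm)               # credibility factor
Z * Xbar.i + (1 - Z) * Xbar                  # credibility premium for each risk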
EBCT model 2:
Definitions:
 Yi1, …, Yin are the claim history (claim size or # claims) for n periods for group/risk i, i = 1, …, m
 Pi1, …, Pin are the corresponding exposure to risk / risk volume
 Xit = Yit / Pit are the claims per unit risk volume
 the Xit follow a marginal dbn which depends on a parameter θi (dim(θ) ≥ 1)
 E[Xit | θi] = m(θi) (implies E[Yit] ∝ Pit)
 Var[Xit | θi] = s²(θi)/Pit (implies Var[Yit | θi] = Pit s²(θi))
Assumptions:
 θi, θj are i.i.d. ⇒ prior moments are independent of i: Eθ[m(θi)] = E[m(θ)] for all i
 Xit | θi are independent (not identically distributed, due to the dependence of the variance on Pit)
 no systematic trend over time
Notation:
 P̄ = Σi P̄i
 P̄i = Σt Pit
 P* = Σi P̄i(1 − P̄i/P̄)² / (mn − 1)
 X̿ = Σi Σt Yit / P̄ = Σi X̄i P̄i / P̄
 X̄i = Σt Yit / P̄i = Σt Xit Pit / P̄i
 SSE = Σi Σt Pit (Xit − X̄i)²
 SSQ = Σi Σt Pit (Xit − X̿)²
EBCT estimate for Xi,n+1: m̂(θi) = Z X̄i + (1 − Z) E[m(θ)] where Z = P̄i / ( P̄i + E[s²(θ)]/Var[m(θ)] )
 linear rule for easy interpretation, credibility formula irrespective of the prior dbn, no need for a
conjugate prior
 expectation of Yi,n+1: Pi,n+1 m̂(θi)
 expectation of the standard deviation of Yi,n+1: √( Pi,n+1 E[s²(θ)] )
Unbiased estimates for the prior parameters (using data from the other risks ⇒ empirical Bayes approach)*:
 E[m(θ)]: X̿
 E[s²(θ)]: SSE / (m(n − 1))
 Var[m(θ)]: max{ 0; (P*)⁻¹ ( SSQ/(mn − 1) − SSE/(m(n − 1)) ) }
*prove unbiasedness by taking expectations, using E[X] = E[E[X | θ]] and (n − 1)s²/σ² ~ χ²n−1
Chapter 8: More about loss distributions
Properties of extreme event dbns:
 extremely long tails: lim as x → ∞ of P(X > x + t | X > x) = 1
 generally non-negative
Ways of deriving dbns: specify the hazard rate, extreme value theory, mixture dbns
1. Hazard rate / failure rate / force of mortality: r(x) = f(x)/(1 − F(x)) ⇒ r(x)h ≈ P(x ≤ X ≤ x + h | X ≥ x)
 This implies: F(x) = 1 − exp( −∫ from 0 to x of r(s)ds ) (specify the dbn by specifying the hazard rate)
 r(x) increasing: shorter tail than the exponential, e.g. r(x) = cɣx^(ɣ−1) for ɣ > 1
 r(x) constant: exponential dbn, e.g. r(x) = cɣx^(ɣ−1) for ɣ = 1
 r(x) decreasing: longer tail than the exponential, e.g. r(x) = cɣx^(ɣ−1) for ɣ < 1 (Weibull), r(x) = α/(λ + x) (Pareto)
Power transformations:
 Y = X^ɣ ~ Exp(λ) ⇒ X ~ Weibull(λ, ɣ)
 Y = X^ɣ ~ Pareto(α, λ) ⇒ X ~ Burr(α, λ, ɣ) = 3-parameter Pareto
2. Extreme value theory
Set up: random sample X1, …, Xn with maximum Yn
if, for sequences an, bn, lim as n → ∞ of P( (Yn − bn)/an ≤ z ) = lim as n → ∞ of F(an z + bn)ⁿ = G(z) for each z,
where G(z) is a proper dbn, then we have:
 EVT(I): G(z) = exp[ −e^(−z) ], −∞ < z < ∞ (Gumbel)
 EVT(II): G(z) = exp[ −z^(−α) ], z > 0, α > 0 (Fréchet)
 EVT(III): G(z) = exp[ −(−z)^α ], z < 0, α > 0 (Reversed Weibull)
3. Mixture dbn: X follows different dbns for different states
 predictive dbn: f(x) = Σi f(x | θi)P(θ = θi) or f(x) = ∫_Θ f(x | θ)h(θ)dθ
 basic stochastic process generating x in a given state: f(x | θi)
 prior probabilities for the states: P(θ = θi) and h(θ)
4. censored and truncated observations
censored: only recorded up to a threshold level M ⇒ all we know is x > M w.p. P(X > M) = 1 − F(M | θ) (adjust the
likelihood accordingly)
truncated: do not observe values less than the threshold M. Two cases:
 case 1: only observe values of Z = X − M conditional on X > M, where X ~ f(x | θ) (no information on
observations below M)
o FZ(z) = P(Z ≤ z) = P(X ≤ z + M | X > M) = [ FX(z + M | θ) − FX(M | θ) ] / [ 1 − FX(M | θ) ]
o fZ(z | θ) = fX(z + M | θ) / [ 1 − FX(M | θ) ]
 case 2: told there are n observations less than M but only given exact values for the m observations
greater than M
o n truncated: all we know is x < M w.p. P(X < M) = F(M | θ) (adjust the likelihood accordingly)
o m observed: treat normally
Notes on some distributions:
Beta distribution
f(x) = k x^(α−1)(1 − x)^(β−1) for 0 < x < 1 (and 0 otherwise), with k = Γ(α + β)/(Γ(α)Γ(β)) = 1 / ∫ from 0 to 1 of x^(α−1)(1 − x)^(β−1)dx, α > 0, β > 0
 E(X) = α/(α + β)
 Var(X) = αβ / [ (α + β)²(α + β + 1) ]
 can change shape easily by adjusting the values of α and β
 if α = β = 1: X ~ U(0, 1)
Gamma(α, λ)
 E[Xʳ] = Γ(α + r) / (Γ(α)λʳ)
Multivariate normal dbn
 f(x) = exp( −0.5(x − μ)ᵀΣ⁻¹(x − μ) ) / ( (2π)^(p/2) |Σ|^(0.5) ), where p = dim(x)
Not examinable in 2016:
Non-Informative priors
1. Min-max criterion: composite hypotheses and constant risk rules
 recall: MIN-MAX RISK decision rule: δ ∈ D that minimizes rM(δ) = max over θ ∈ Θ of R(δ, θ)
Composite hypotheses (the decision is the choice of a hypothesis)
Loss (in regrets so that one entry equals 0): accept H0 has loss 0 if θ ≤ θ0 and c01 if θ > θ0;
accept H1 has loss c10 if θ ≤ θ0 and 0 if θ > θ0
all admissible rules will be of the form: reject H0 ⇔ x > k (the direction of the inequality depends on the
hypotheses)
rM(δ) = max{ c10 P(X > k | θ) over θ ≤ θ0 ; c01 P(X ≤ k | θ) over θ > θ0 }
For X | θ ~ N(θ, σ),
rM(δ) = max{ c10(1 − Φ((k − θ0)/σ)), c01 Φ((k − θ0)/σ) } ⇒ minimized when
Φ((k − θ0)/σ) = c10/(c10 + c01) (at equality)
Constant Risk rules: (the decision is the choice of an estimate)
 principle: R(δ, θ) = r (constant) and δ is admissible ⇒ δ is a min-max rule
 establish admissible rules by noting the form of the Bayes rule for arbitrary parameters: Bayes
optimal ⇒ admissible
Binomial sampling:
 θ̂ = E[θ | X] = (α + X)/(α + β + n) ⇒ admissible rules have the form δ(x) = ax + b
 solve for a, b so that R(δ, θ) = EX|θ[ L(θ̂, θ) | θ ] does not depend on θ (smallest solution)
 for L(θ̂, θ) = (θ̂ − θ)², this occurs when δ(x) = (x + √n/2) / (n + √n)
2. Prior dbn with minimal information
Θ discrete: Θ = {θ1, …, θM} ⇒ π(θi) = 1/M
Θ continuous:
 θ is a location parameter: π(θ) ∝ k (uniform dbn)
o reason: θ is a location parameter ⇔ L(θ; x) is data translated ⇒ π(θ) = π(θ – θ0) = k
 θ is a scale parameter: π(θ) ∝ 1/θ
o reason: ϕ = log(θ) is a location parameter ⇒ π(ϕ) = k ⇒ π(θ) = k|∂ϕ/∂θ| = k/θ
 θ is not purely scale or location:
Conjugate prior: θ̂ = E[θ | X] = weighted average of prior and sample estimates. Set the prior parameters
so that the weighting of the prior → 0
Transformation whose likelihood is data translated: ϕ = h(θ) s.t. L(ϕ; x) is data translated ⇒ π(θ) ∝ |∂ϕ/∂θ|
o reason: ϕ = h(θ) is a location parameter ⇒ π(ϕ) = k
o example: binomial sampling: ϕ = arcsin√θ (found by trial and error)
 Example: Normal sampling with τ, μ unknown: π(μ, τ) = π(μ | τ)π(τ) ∝ 1/τ ⇒ τ | x ~ χ²n−1 and μ | x ~ tn−1
improper prior π(θ): a prior that is not a valid dbn ⇒ can be used as long as π(θ | x) is proper
