
Hierarchical

modelling
Omar Sosa Rodriguez
June 2021
Faculty exists to
make AI real

2
Faculty is a world leader in applied AI
We configure, deploy and maintain high-performance AI systems to solve our customers’ most important problems.

Predicting COVID hospitalisations
Our forecasting technology is helping the NHS to manage critical resources through COVID-19.

Preventing radicalisation
Our online harm detection technology is helping security agencies to identify malicious content.

Improving asset performance
Our computer vision technology automates the inspection of physical infrastructure.

Optimising marketing spend
Our uplift modelling technology is helping a fashion retailer to optimise a $200m marketing spend.
3
We have delivered to 230 data science customers
We work in 13 countries
We cover 23 sectors

4
Contents
01 The challenge of multilevel structures

02 Recap: Bayesian inference

Bayesian Hierarchical Modelling


03 Traditional approaches

04 Hierarchical modelling

05 Group level predictors

06 Q&A

5
01
Multilevel
structures
01.1 Example: Radon in Minnesota

6
Very often the data that we are trying to model will have some categorical features.

Other Examples
01. Geography
02. Demographics
03. Culture
04. Dates*
05. Names / IDs
06. Species
etc.

7

8
Radon concentration was measured in houses across many counties in Minnesota.
Sometimes in the basement, sometimes in the ground floor.

9
Some counties have a very limited number of observations

10

11
Only a few counties have enough data

12
02
Bayesian
inference

13
We start with a likelihood function which encodes our assumption about the data generating process. It
tells us which values of the parameters are more “consistent” with the data we have.

14
Informative data results in a tight likelihood function.

15
Uninformative data yields a diffuse likelihood.

16
Bayesian inference also quantifies domain expertise through a prior probability distribution.

17
A tight prior distribution represents high confidence in our domain expertise.

18
A diffuse prior represents high uncertainty or poor domain expertise.

19
Bayesian inference is the result of a rigorous combination of prior knowledge and observations.

20
Our posterior knowledge is obtained via Bayes’ rule.
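This update can be sketched numerically with a grid approximation: the posterior is proportional to the prior times the likelihood. A minimal numpy illustration for the mean of Gaussian data (the numbers are hypothetical, not the radon data):

```python
import numpy as np

theta = np.linspace(-5, 5, 1001)            # grid over the unknown mean
data = np.array([0.8, 1.1, 0.4, 1.3])       # made-up observations

prior = np.exp(-0.5 * (theta / 2.0) ** 2)   # Normal(0, 2) prior, unnormalised
# Likelihood: product over observations of Normal(theta, 1) densities
loglik = -0.5 * ((data[:, None] - theta[None, :]) ** 2).sum(axis=0)

posterior = prior * np.exp(loglik)          # Bayes' rule, up to a constant
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalise on the grid

posterior_mean = (theta * posterior).sum() * (theta[1] - theta[0])
```

For this conjugate Gaussian setup the grid answer matches the closed form: the posterior mean is the precision-weighted compromise between the prior mean and the data mean.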

21

22
Poor domain expertise and non-informative data are a sure recipe for poor inferences.

23
The functional form of the posterior is not
the final goal. We want to use it to compute
expectation values, which requires solving
complicated integrals.

Questions are functions.

Answers are expectations.

24
Any expectation can be approximated with
samples, and MCMC is the gold standard for
sampling from an arbitrary distribution.
Efficient MCMC is implemented in many
Python libraries.

25
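To make "answers are expectations" concrete, here is a hedged sketch where draws from a known Normal stand in for MCMC samples (so the exact answers are available for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=200_000)  # stand-in for MCMC draws

# Question: P(θ > 1.5)?  Answer: the expectation of an indicator function.
prob = np.mean(samples > 1.5)          # ≈ 1 - Φ(1) ≈ 0.159

# Question: E[θ²]?  Answer: the average of a transformation of the samples.
second_moment = np.mean(samples ** 2)  # ≈ μ² + σ² = 1.25
```

Any downstream question about the posterior is answered the same way: apply the function to the samples and average.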
03
Traditional
approaches
03.1 Complete pooling

03.2 No pooling

26
03
Traditional
approaches
03.1 Complete pooling
03.2 No pooling

27
Complete pooling ignores the group-level information and considers all data as belonging to
the same category. All groups are described with the same model.

...

y1 y2 y3 y4 y5 y6 y7 ... yN

Group 1 2 3 J

28
Note: we will use the index on the data variables to label group membership,
as opposed to the usual practice of labelling individual observations.

...

y1 y2 y3 ... yJ

1 2 3 J

29
Example

Data for all counties in Minnesota is modelled with the same linear model.

Likelihood: p(y| θ); priors: p( θ)

α captures the average level at the ground floor.

β is the average increase from basement to ground floor.

σ measures simultaneously the within-county variation and the variation across counties.

34
Complete pooling - Numpyro

...
from numpyro import sample, deterministic
from numpyro.distributions import Normal, InverseGamma

def complete_pooling(floor, log_radon=None):
    α = sample("α", Normal(0, 5))
    β = sample("β", Normal(0, 1))
    μ = deterministic("μ", α + β * floor)
    σ = sample("σ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, σ), obs=log_radon)
35
Complete pooling - Numpyro

Numpyro provides some helper functions to build models:

➔ All of our model is defined in a function.

➔ sample statements contribute to the total posterior density.

➔ Some statements correspond to the priors.

➔ Some to the likelihood. Notice the obs keyword argument.

➔ We can also sample transformations of parameters via deterministic.

41
Complete pooling - Numpyro

Obtaining the posterior samples is really easy with Numpyro:

from numpyro.infer import MCMC, NUTS

settings = dict(num_samples=2000, num_warmup=2000, num_chains=4)
mcmc = MCMC(NUTS(complete_pooling), **settings)
data = dict(
    floor=[-1, -1, 0, -1, 0, -1, ...],
    log_radon=[0.78, 1.06, -0.35, 0.40, ...],
)
mcmc.run(seed, **data)
samples = mcmc.get_samples()

42
Complete pooling - Numpyro

➔ Construct your MCMC instance from a kernel (NUTS) and the sampler settings.

➔ Run it with a random seed and the data, then retrieve the draws with get_samples().

45
Posterior estimates

Running MCMC provides the posterior samples for each parameter specified in the model.
The samples can then be used to provide probabilistic predictions.

47
The complete pooling model does a good job at capturing the overall radon levels in Minnesota.
But “county-level” estimates can be virtually useless.

49
03
Traditional
approaches
03.1 Complete pooling

03.2 No pooling

50
No pooling treats each group independently, fitting a different model for each group.

θ1 θ2 θ3 ... θJ

y1 y2 y3 ... yJ

51
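The contrast between the two extremes fits in a few lines of numpy. In this toy sketch (hypothetical numbers, not the Minnesota data), the per-group estimate is simply each county's own sample mean:

```python
import numpy as np

# Hypothetical data: log-radon readings tagged with a county index
county = np.array([0, 0, 0, 1, 1, 2])
log_radon = np.array([1.0, 1.2, 0.8, 0.3, 0.5, 2.0])

# "No pooling": an independent estimate per county, here just the group mean
no_pool = np.array([log_radon[county == j].mean() for j in range(3)])

# "Complete pooling": one estimate shared by every county
complete_pool = log_radon.mean()
```

County 2's no-pooling estimate rests on a single observation, which is exactly the fragility the following slides illustrate.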
Example

For j = 1, 2, …, J:

αj captures each county’s level at the ground floor.

βj is each county’s increase from basement to ground floor.

τ measures the average within-county variation only.

56
No pooling - Numpyro implementation

...
from numpyro import plate

def no_pooling(county, floor, log_radon=None):
    # N_COUNTIES was defined in global scope
    with plate("counties", N_COUNTIES):
        α = sample("α", Normal(0, 5))
        β = sample("β", Normal(0, 1))

    μ = deterministic("μ", α[county] + β[county] * floor)
    τ = sample("τ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, τ), obs=log_radon)

57
No pooling - Numpyro implementation

➔ We must now also provide the county as input.

➔ Use a plate for automatic broadcasting.

➔ Each observation is modelled with its corresponding parameters.

60
We obtain separate estimates for each county, although in many cases the posterior is very wide.
The plots below show only 15 selected counties out of the 80+ available.

62
We are able to make bespoke predictions, but counties with uninformative data end up
with very uncertain and/or extreme estimates.

63
04
Hierarchical
modelling
04.1 Partial pooling

04.2 Hierarchical modelling

64

65
We can understand the no pooling and complete pooling approaches as the two extremes of
a continuous spectrum of modelling approaches:

Complete pooling: all groups have the same parameter.
No pooling: all groups have independent parameters.

[Figure: posterior mean of the group parameter vs. the independence allowed among group parameters]

67
Mathematically, we obtain partial pooling by introducing a latent model that
describes the distribution of the group parameters.

θ1 θ2 θ3 ... θJ

y1 y2 y3 ... yJ

68

69
A popular choice is to use a latent Gaussian model, parametrized in terms of its mean and variance.

μ, σ

θ1 θ2 θ3 ... θJ

y1 y2 y3 ... yJ

70
A high latent variance imposes weak correlations and allows the parameters to vary independently.

θ1 θ2 θ3 ... θJ

y1 y2 y3 ... yJ

71
A small variance imposes strong correlations and forces the parameters to be similar.

θ1 θ2 θ3 ... θJ

y1 y2 y3 ... yJ

72
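The pull toward the common mean can be made concrete. For a Gaussian latent model with known σα and within-group noise τ, the conditional posterior mean of each group intercept is a precision-weighted average of that group's sample mean and the latent mean μα. A minimal numpy sketch (all numbers hypothetical):

```python
import numpy as np

def partial_pool_mean(group_mean, n, mu_alpha, sigma_alpha, tau):
    """Precision-weighted compromise between a group's sample mean and the
    latent mean, for a Gaussian latent model with known variances."""
    w_data = n / tau**2               # precision of the group's data
    w_latent = 1 / sigma_alpha**2     # precision of the latent model
    return (w_data * group_mean + w_latent * mu_alpha) / (w_data + w_latent)

# A county with 2 observations is pulled strongly toward μ_α = 1.2 ...
small = partial_pool_mean(group_mean=3.0, n=2, mu_alpha=1.2, sigma_alpha=0.5, tau=1.0)
# ... while a county with 200 observations barely moves.
large = partial_pool_mean(group_mean=3.0, n=200, mu_alpha=1.2, sigma_alpha=0.5, tau=1.0)
```

Taking σα to 0 recovers complete pooling (every group gets μα); taking it to infinity recovers no pooling (every group keeps its own mean), matching the spectrum above.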
Example

For j = 1, 2, …, J:

αj captures each county’s level at the ground floor.

βj is each county’s increase from basement to ground floor.

τ measures the within-county variation only.

μα captures the average level at the ground floor in Minnesota.

σα controls how similar the county-level α’s are.

76
Partial pooling - Numpyro implementation

...

def partial_pooling(σ_α, county, floor, log_radon=None):
    μ_α = sample("μ_α", Normal(0, 5))

    with plate("counties", N_COUNTIES):
        α = sample("α", Normal(μ_α, σ_α))
        β = sample("β", Normal(0, 1))

    μ = deterministic("μ", α[county] + β[county] * floor)
    τ = sample("τ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, τ), obs=log_radon)

...

77
Partial pooling - Numpyro implementation

➔ The amount of pooling required, σα, will be provided alongside the data.

➔ The intercepts are given a model instead of just a prior.

➔ We provide hyper-priors for the parameters of the latent model.

80
We can verify that partial pooling can recover the complete pooling and no pooling extremes.

81
At zero allowed variation, all group estimates have to be the same.

82
At “infinite” allowed variation, all group estimates can be completely uncorrelated.

83
Other values of σα result in a middle ground between the two extremes.

84
We can visualize the full spectrum of models, from complete pooling to no pooling.

89
How much pooling should we use?

90
04
Hierarchical
modelling
04.1 Partial pooling

04.2 Hierarchical modelling

91
Example

In a hierarchical model, the right amount of pooling is also estimated from the data.

For j = 1, 2, …, J:

αj captures each county’s level at the ground floor.

βj is each county’s increase from basement to ground floor.

τ measures the within-county variation only.

μα captures the average level at the ground floor in Minnesota.

σα controls how similar the county-level α’s are, and now receives its own prior p(σα).
93
Hierarchical model - Numpyro implementation

...
from numpyro.distributions import Exponential

def hierarchical(county, floor, log_radon=None):
    μ_α = sample("μ_α", Normal(0, 5))
    σ_α = sample("σ_α", Exponential(1))

    with plate("counties", N_COUNTIES):
        α = sample("α", Normal(μ_α, σ_α))
        β = sample("β", Normal(0, 1))

    μ = deterministic("μ", α[county] + β[county] * floor)
    τ = sample("τ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, τ), obs=log_radon)

...

94
Hierarchical model - Numpyro implementation

➔ We no longer provide a hard-coded amount of pooling.

➔ We simply specify a hyper-prior for σ_α.

96
The posterior estimates exhibit a natural amount of pooling

97

98
[Figure: county-level estimates under complete pooling, no pooling, and hierarchical models]

99

101
05
Group-level
predictors

102
Sometimes, we might have some further information about the different groups.
This information does not directly relate to the data, but rather to the groups.

103
We incorporate the group-level information by turning our second model into a regression:

For j = 1, 2, …, J:

αj captures each county’s level at the ground floor.

βj is each county’s increase from basement to ground floor.

τ measures the within-county variation only.

μα captures the average level at the ground floor in Minnesota.

γα captures the relationship between uranium and radon.

σα controls how similar the county-level α’s are.

106
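As a sketch of this generative structure (sampling forward from the model rather than doing inference, with hypothetical hyper-parameter values and made-up uranium readings), the county intercepts become a regression on the county-level uranium measurement:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 5                                      # number of counties (hypothetical)
uranium = rng.normal(size=J)               # one group-level predictor per county

# Hyper-parameters (these would carry priors in the full Bayesian model)
mu_alpha, gamma_alpha, sigma_alpha = 1.3, 0.7, 0.3

# Group-level regression: each county's intercept is centred on a line in uranium
alpha = mu_alpha + gamma_alpha * uranium + sigma_alpha * rng.normal(size=J)

# Individual-level model for one (county, floor) observation, as before
beta, tau = -0.6, 0.7
def sample_log_radon(county, floor):
    return rng.normal(alpha[county] + beta * floor, tau)
```

The predictor enters at the group level, not the observation level: uranium explains variation between counties, so the residual σα only has to absorb what uranium cannot.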
No group variation

Uncontrolled variation

Controlled variation

Controlled & Explained

107

110
Key points
➔ Categorical data is a recurrent challenge in data science.

➔ Traditional approaches usually account for either the similarity across groups or the differences between them, never both.

➔ Hierarchical modelling provides a rigorous framework for combining both approaches.

➔ Hierarchical models incorporate the different data sources at the right level.

➔ Hierarchical modelling works by splitting the sources of uncertainty.

➔ It offers a straightforward (and elegant) way to make predictions for unseen groups.

111
Warnings!
➔ Be careful with non-Bayesian approaches to hierarchical models -- prepackaged software will make assumptions on your behalf.

➔ With the need for a Bayesian approach comes a need for patience.

➔ The complexity of hierarchical models can quickly get out of hand, even with seemingly innocent data.

➔ They will often require you to parametrize your model in very specific ways.

➔ You will have to tune your MCMC very deliberately.
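One of the "very specific ways" alluded to is the non-centered parametrization: instead of sampling each group effect directly from Normal(μα, σα), sample a standardized variable and shift-and-scale it, which often removes the funnel geometry that troubles MCMC in hierarchical models. The two forms define the same distribution, as this illustrative numpy check shows (not the slides' code):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_alpha, sigma_alpha, n = 1.5, 0.4, 500_000

# Centred form: sample the group effect directly
centred = rng.normal(mu_alpha, sigma_alpha, size=n)

# Non-centred form: sample a standard normal, then shift and scale.
# In a PPL this decouples the group effects from (μ_α, σ_α),
# which usually makes the posterior far easier to sample.
alpha_raw = rng.normal(0.0, 1.0, size=n)
non_centred = mu_alpha + sigma_alpha * alpha_raw
```

Same distribution, different geometry for the sampler: that trade-off is the whole point of the reparametrization.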

112
06
Q&A

113
