Hierarchical modelling
Omar Sosa Rodriguez
June 2021
Faculty exists to make AI real
Faculty is a world leader in applied AI
Predicting COVID hospitalisations: our forecasting technology is helping the NHS to manage critical resources through COVID-19.
Our online harm detection technology is helping security agencies to identify malicious content.
Our uplift modelling technology is helping a fashion retailer to optimise a $200m marketing spend.
We have delivered for 230 data science customers.
We work in 13 countries.
We cover 23 sectors.
Contents
01 The challenge of multilevel structures
02 Bayesian inference
03 Traditional approaches
04 Hierarchical modelling
05 Group-level predictors
06 Q&A
01
Multilevel
structures
01.1 Example: Radon in Minnesota
Very often, the data we are trying to model has categorical features.
Other Examples
01. Geography
02. Demographics
03. Culture
04. Dates*
05. Names / IDs
06. Species
etc.
Radon concentration was measured in houses across many counties in Minnesota, sometimes in the basement and sometimes on the ground floor.
Some counties have a very limited number of observations; only a few counties have enough data.
02
Bayesian
inference
We start with a likelihood function, which encodes our assumptions about the data-generating process. It tells us which values of the parameters are more “consistent” with the data we have.
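In symbols: for data $y$ and parameters $\theta$, the likelihood is $\mathcal{L}(\theta) = p(y \mid \theta)$, viewed as a function of $\theta$ with the data held fixed.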
Informative data results in a tight likelihood function.
Uninformative data yields a diffuse likelihood.
Bayesian inference also quantifies domain expertise through a prior probability distribution.
A tight prior distribution represents high confidence in our domain expertise.
A diffuse prior represents high uncertainty or poor domain expertise.
Bayesian inference is the result of a rigorous combination of prior knowledge and observations.
Our posterior knowledge is obtained via Bayes’ rule.
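For reference, Bayes’ rule in this notation:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}, \qquad p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$$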
Poor domain expertise and uninformative data is the perfect recipe for cooking poor inferences.
The functional form of the posterior is not the final goal. We want to use it to compute expectation values, which requires solving complicated integrals.
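Concretely, for any quantity of interest $f$:

$$\mathbb{E}[f] = \int f(\theta)\, p(\theta \mid y)\, d\theta$$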
Any expectation can be approximated with samples, and MCMC is the gold standard for sampling from an arbitrary distribution. Efficient MCMC is implemented in many Python libraries.
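The Monte Carlo estimate replaces the integral with an average over posterior samples:

$$\mathbb{E}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f\!\left(\theta^{(n)}\right), \qquad \theta^{(n)} \sim p(\theta \mid y)$$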
03
Traditional
approaches
03.1 Complete pooling
03.2 No pooling
Complete pooling ignores the group-level information and considers all data as belonging to the same category. All groups are described with the same model.
[Diagram: a single parameter θ generates all observations y1, y2, …, yN across groups 1, 2, …, J]
Note: we will reserve the index on the data variables to label group membership, as opposed to the usual practice of labelling individual observations.
[Diagram: one representative observation per group, y1, y2, …, yJ for groups 1, 2, …, J]
Example
Data for all counties in Minnesota is modelled with the same linear model.
Likelihood: p(y | θ)
Priors: p(θ)
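Written out, with σ as the scale of the Gaussian (this matches the NumPyro code on the next slide):

$$\log(\text{radon})_i \sim \mathcal{N}\!\left(\alpha + \beta \cdot \text{floor}_i,\ \sigma\right), \qquad \alpha \sim \mathcal{N}(0, 5), \quad \beta \sim \mathcal{N}(0, 1), \quad \sigma \sim \text{InvGamma}(1, 0.5)$$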
Complete pooling - Numpyro
from numpyro import sample, deterministic
from numpyro.distributions import Normal, InverseGamma

# All of our model is defined in a function.
def complete_pooling(floor, log_radon=None):
    # Priors over the regression parameters.
    α = sample("α", Normal(0, 5))
    β = sample("β", Normal(0, 1))
    μ = deterministic("μ", α + β * floor)
    σ = sample("σ", InverseGamma(1, 0.5))
    # The data is bound to the likelihood via the obs keyword argument.
    sample("log_radon", Normal(μ, σ), obs=log_radon)
Complete pooling - Numpyro
Obtaining the posterior samples is really easy with Numpyro:
from numpyro.infer import MCMC, NUTS
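The sampling calls themselves were elided from this extraction; a minimal sketch (the warmup and sample counts are assumptions):

from jax import random

# NUTS is an efficient, adaptive MCMC kernel; MCMC drives the sampling loop.
mcmc = MCMC(NUTS(complete_pooling), num_warmup=1000, num_samples=1000)
mcmc.run(random.PRNGKey(0), floor, log_radon=log_radon)
posterior_samples = mcmc.get_samples()  # dict mapping parameter names to sample arrays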
Posterior estimates
Running MCMC provides posterior samples for each parameter specified in the model.
The samples can then be used to provide probabilistic predictions.
The complete pooling model does a good job at capturing the overall radon levels in Minnesota.
But “county level” estimates can be virtually useless.
03
Traditional
approaches
03.1 Complete pooling
03.2 No pooling
No pooling treats each group independently, fitting a different model for each group.
[Diagram: separate parameters θ1, θ2, …, θJ generate the observations y1, y2, …, yJ]
Example
For j = 1, 2, …, J:
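The per-county model was elided from this extraction; a reconstruction consistent with the NumPyro code below:

$$\log(\text{radon})_j \sim \mathcal{N}\!\left(\alpha_j + \beta_j \cdot \text{floor}_j,\ \sigma\right), \qquad \alpha_j \sim \mathcal{N}(0, 5), \quad \beta_j \sim \mathcal{N}(0, 1)$$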
No pooling - Numpyro implementation
from numpyro import plate

# We must now also provide the county as input.
def no_pooling(county, floor, log_radon=None):
    # N_COUNTIES was defined in global scope
    with plate("counties", N_COUNTIES):
        # One intercept and one slope per county.
        α = sample("α", Normal(0, 5))
        β = sample("β", Normal(0, 1))
    # Pick out each observation's county-specific parameters.
    μ = deterministic("μ", α[county] + β[county] * floor)
    σ = sample("σ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, σ), obs=log_radon)
We obtain separate estimates for each county, although in many cases the posterior is very wide.
The plots below show only 15 selected counties out of the 80+ available.
We are able to make bespoke predictions, but counties with uninformative data end up with very uncertain and/or extreme estimates.
04
Hierarchical
modelling
04.1 Partial pooling
04.2 Hierarchical modelling
We can understand the no pooling and complete pooling approaches as the two extremes of a continuous spectrum of modelling approaches:
Mathematically, we obtain partial pooling by introducing a latent model that describes the distribution of the group parameters.
A popular choice is a latent Gaussian model, parametrized in terms of its mean and variance.
[Diagram: latent parameters μ, σ generate the group parameters θ1, θ2, …, θJ, which in turn generate the observations y1, y2, …, yJ]
A high latent variance imposes weak correlations and allows the parameters to vary independently.
A small latent variance imposes strong correlations and forces the parameters to be similar.
Example
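A sketch of the partial pooling model, assuming (as in the code that follows) that only the intercepts vary by county and that σα is held fixed:

$$\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha), \qquad \log(\text{radon})_i \sim \mathcal{N}\!\left(\alpha_{j[i]} + \beta \cdot \text{floor}_i,\ \sigma\right)$$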
Partial pooling - Numpyro implementation
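The implementation was elided from this extraction; a minimal sketch, reusing the priors from the earlier models and taking the fixed σ_α as an argument (both choices are assumptions):

def partial_pooling(county, floor, σ_α, log_radon=None):
    # Latent mean of the county intercepts.
    μ_α = sample("μ_α", Normal(0, 5))
    with plate("counties", N_COUNTIES):
        # County intercepts drawn from the latent Gaussian; σ_α controls the pooling.
        α = sample("α", Normal(μ_α, σ_α))
    β = sample("β", Normal(0, 1))
    μ = deterministic("μ", α[county] + β * floor)
    σ = sample("σ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, σ), obs=log_radon)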
We can verify that partial pooling recovers the complete pooling and no pooling extremes.
At zero allowed variation (σα → 0), all group estimates have to be the same.
At “infinite” allowed variation (σα → ∞), all group estimates can be completely uncorrelated.
Other values of σα result in a middle ground between the two extremes.
We can visualize the full spectrum of models, from complete pooling to no pooling.
How much pooling should we use?
04
Hierarchical
modelling
04.1 Partial pooling
04.2 Hierarchical modelling
Example
In a hierarchical model, the right amount of pooling is also estimated from the data, by placing a prior p(σα) on the allowed variation σα.
[Plot: a prior distribution p(σα) over σα]
Hierarchical model - Numpyro implementation
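Again the code was elided; a sketch of the natural extension, where σ_α becomes a random variable (the HalfNormal hyperprior is an assumption):

from numpyro.distributions import HalfNormal

def hierarchical(county, floor, log_radon=None):
    μ_α = sample("μ_α", Normal(0, 5))
    # σ_α now has a prior: the right amount of pooling is learned from the data.
    σ_α = sample("σ_α", HalfNormal(1))
    with plate("counties", N_COUNTIES):
        α = sample("α", Normal(μ_α, σ_α))
    β = sample("β", Normal(0, 1))
    μ = deterministic("μ", α[county] + β * floor)
    σ = sample("σ", InverseGamma(1, 0.5))
    sample("log_radon", Normal(μ, σ), obs=log_radon)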
The posterior estimates exhibit a natural amount of pooling.
[Plots: county-level estimates under Complete pooling, No pooling, and Hierarchical models]
05
Group-level
predictors
Sometimes, we might have some further information about the different groups. This information does not directly relate to the data, but rather to the groups.
We incorporate the group-level information by turning our second model into a regression:
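A sketch, where $u_j$ stands for a hypothetical county-level predictor (the slides’ own choice of predictor was not recoverable from this extraction):

$$\alpha_j \sim \mathcal{N}\!\left(\gamma_0 + \gamma_1 u_j,\ \sigma_\alpha\right)$$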
[Plots: group estimates under three regimes: no group variation, uncontrolled variation, controlled variation]
Key points
➔ Categorical data is a recurrent challenge in data science
Warnings!
➔ Careful with non-Bayesian approaches to hierarchical models -- prepackaged software will make assumptions on your behalf.
06
Q&A