Formula to compute BIC in sklearn.mixture.GaussianMixture wrong #23443

Closed
thomasmooon opened this issue May 23, 2022 · 6 comments
@thomasmooon

Describe the bug

The formula is wrong:
return -2 * self.score(X) * X.shape[0] + self._n_parameters() * np.log(X.shape[0])

Proposed fix:

Replace the formula with return self._n_parameters() * np.log(X.shape[0]) - 2 * self.score(X)
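For context, the two expressions differ only in whether the per-sample score is multiplied by the sample size. A minimal sketch of the numerical difference (the 2-component fit on two iris features is an illustrative choice, not taken from the report):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# illustrative fit: 2 components on two iris petal features
X = load_iris(as_frame=True).data[["petal width (cm)", "petal length (cm)"]]
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

n = X.shape[0]
current = -2 * gm.score(X) * n + gm._n_parameters() * np.log(n)  # current bic()
proposed = gm._n_parameters() * np.log(n) - 2 * gm.score(X)      # proposed fix
# the two differ by -2 * (n - 1) * gm.score(X)
print(current, proposed)
```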

Steps/Code to Reproduce

# imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture as GM

# load the iris features as a DataFrame
X = load_iris(as_frame=True).data

clusters = list(range(2, 8))
random_state = 42
n_init = 10

# train one model per cluster size
cols = ["petal width (cm)", "petal length (cm)"]
models = [GM(c, n_init=n_init, init_params="random", random_state=random_state)
          .fit(X[cols]) for c in clusters]

# get best model from BIC
bic = [model.bic(X[cols]) for model in models]
metrics = pd.DataFrame({"clusters": clusters, "bic": bic})
metrics["best"] = metrics["bic"] == min(metrics["bic"])
metrics

Running the above outputs the BIC from the built-in model.bic() method as follows:

clusters bic best
2 364.579644 False
3 353.501255 False
4 366.865550 False
5 189.449838 True
6 212.580815 False
7 400.169872 False

Using the correct BIC formula (see e.g. https://en.wikipedia.org/wiki/Bayesian_information_criterion)

bics = [model._n_parameters() * np.log(X.shape[0]) - 2 * model.score(X[cols]) for model in models]
bics

leads to the following BICs:
[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]

Expected Results

[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]

Actual Results

[364.579644,
353.501255,
366.865550,
189.449838,
212.580815,
400.169872]

Versions

System:
    python: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.6
        scipy: 1.4.1
       Cython: 0.29.30
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@thomasmooon thomasmooon added Bug Needs Triage Issue requires triage labels May 23, 2022
@glemaitre
Member

I recall having looked at this formula some time ago (#21481) and reformulating the mathematical section: https://scikit-learn.org/stable/modules/linear_model.html#lasso-lars-ic

The n constant comes from assuming a Gaussian model and plugging it into the BIC formula. I don't recall it being an issue there.

@thomasmooon
Author

The current formula is:
return -2 * self.score(X) * X.shape[0] + self._n_parameters() * np.log(X.shape[0])

The proposed correction is:
return self._n_parameters() * np.log(X.shape[0]) -2 * self.score(X)

The general formula of the BIC is
BIC = k*ln(n)-2ln(L)

Now let

k = self._n_parameters()
n = X.shape[0]
ln(L) = self.score(X)

Then

BIC = k*ln(n)-2ln(L)
    = self._n_parameters() * np.log(X.shape[0]) -2 * self.score(X)

Which in turn is the proposed correction.

Example

When one computes silhouette scores (to evaluate the best number of clusters) instead of the BIC, one retrieves:

For n_clusters = 2 The average silhouette_score is : 0.7669465622459366
For n_clusters = 3 The average silhouette_score is : 0.41538251879888394
For n_clusters = 4 The average silhouette_score is : 0.36208909622898705
For n_clusters = 5 The average silhouette_score is : 0.06036439944496741
For n_clusters = 6 The average silhouette_score is : 0.08875521208115827
For n_clusters = 7 The average silhouette_score is : 0.13079963966127484
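The exact code behind these numbers is not included in the issue; a sketch along the following lines (model settings mirroring the reproduction code above, with hard labels from predict fed to silhouette_score) should give comparable values:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X = load_iris(as_frame=True).data[["petal width (cm)", "petal length (cm)"]]

for n_clusters in range(2, 8):
    gm = GaussianMixture(n_clusters, n_init=10, init_params="random",
                         random_state=42).fit(X)
    labels = gm.predict(X)  # hard cluster assignments for the silhouette
    score = silhouette_score(X, labels)
    print(f"For n_clusters = {n_clusters} The average silhouette_score is : {score}")
```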

So in conclusion, the silhouette indicates that 2 clusters are optimal.
As a reminder from above: GMM with BIC suggests 5 clusters.
Now have a look at the clustering for n=2 and n=5.

[Image: scatter plot of the GMM clustering with n=2 clusters]

[Image: scatter plot of the GMM clustering with n=5 clusters]

For convenience, here are the BICs for 2, ..., 7 clusters again. The score for n=2 clusters is the best here and is consistent with the visual clustering and the silhouette.

[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]

@Micky774
Contributor

Micky774 commented May 24, 2022

My current understanding is that since GaussianMixture.score is defined to:

Compute the per-sample average log-likelihood of the given data X.

then the total log-likelihood (the quantity BIC actually uses) is obtained as self.score(X) * X.shape[0], as is currently the case.
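This is straightforward to verify: score_samples gives per-sample log-likelihoods, so summing them (or multiplying the score mean by n) recovers the total log-likelihood ln(L), and the built-in bic then matches the textbook k*ln(n) - 2*ln(L). A quick check (the 2-component fit on two iris features is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris(as_frame=True).data[["petal width (cm)", "petal length (cm)"]]
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

n = X.shape[0]
total_ll = gm.score_samples(X).sum()          # ln(L): total log-likelihood
assert np.isclose(gm.score(X) * n, total_ll)  # score() is the per-sample mean

# the built-in bic() equals the textbook k*ln(n) - 2*ln(L)
manual_bic = gm._n_parameters() * np.log(n) - 2 * total_ll
assert np.isclose(gm.bic(X), manual_bic)
```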

@thomasmooon
Author

@Micky774 agreed. Then I have absolutely no idea what's going wrong here. In any case, here is the colab notebook for reproducibility.

@Micky774
Contributor

Micky774 commented May 24, 2022

@Micky774 agreed. Then I have absolutely no idea what's going wrong here. In any case, here is the colab notebook for reproducibility.

I'm not sure there even is anything wrong here -- AIC/BIC and the silhouette score are all heuristics, and it isn't altogether surprising to me that they don't agree in this case. These are good rule-of-thumb heuristics, of course, but none is robust enough that this divergence of opinion raises a red flag for me.

@lesteve
Member

lesteve commented May 25, 2022

I am going to close this one, since it seems there is no issue in scikit-learn and this turned into a more generic machine learning discussion.

Thanks @Micky774 for the input!

@lesteve lesteve closed this as completed May 25, 2022
@lesteve lesteve removed Bug Needs Triage Issue requires triage labels May 25, 2022