Formula to compute BIC in sklearn.mixture.GaussianMixture wrong #23443

Closed
@thomasmooon

Description

Describe the bug

The formula is wrong:
return -2 * self.score(X) * X.shape[0] + self._n_parameters() * np.log(X.shape[0])

Proposed fix:

Replace the formula with:
return self._n_parameters() * np.log(X.shape[0]) - 2 * self.score(X)
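
For reference when comparing the two expressions, the definition from the Wikipedia article can be written out as below. Here k corresponds to self._n_parameters() and n to X.shape[0]; the symbol \bar{\ell} is introduced only for this sketch and denotes a per-sample mean log-likelihood. Which of the two forms applies depends on whether self.score(X) returns the total or the mean log-likelihood; the scikit-learn documentation describes it as the per-sample average, so that is worth checking before swapping formulas.

```latex
\begin{aligned}
\mathrm{BIC} &= k \ln n - 2 \ln \hat{L}, \\
\ln \hat{L} &= \sum_{i=1}^{n} \ln p(x_i \mid \hat{\theta}), \\
\bar{\ell} &= \frac{1}{n} \ln \hat{L}
\quad\Longrightarrow\quad
\mathrm{BIC} = k \ln n - 2\, n\, \bar{\ell}.
\end{aligned}
```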

Steps/Code to Reproduce

# imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture as GM

# X was not defined in the original snippet; assumed to be the iris
# feature DataFrame, since the columns below are selected by name
X = load_iris(as_frame=True).data

clusters = list(range(2, 8))
random_state = 42
n_init = 10

# train one model per cluster size
cols = ["petal width (cm)", "petal length (cm)"]
models = [
    GM(c, n_init=n_init, init_params="random", random_state=random_state).fit(X[cols])
    for c in clusters
]

# get best model from BIC (lowest value)
bic = [model.bic(X[cols]) for model in models]
metrics = pd.DataFrame({"clusters": clusters, "bic": bic})
metrics["best"] = metrics["bic"] == metrics["bic"].min()
metrics

The above outputs the BIC values computed by the built-in method model.bic():

clusters  bic         best
2         364.579644  False
3         353.501255  False
4         366.865550  False
5         189.449838  True
6         212.580815  False
7         400.169872  False

Using the correct BIC formula (see e.g. https://en.wikipedia.org/wiki/Bayesian_information_criterion)

bics = [
    model._n_parameters() * np.log(X.shape[0]) - 2 * model.score(X[cols])
    for model in models
]
bics

leads to the following BIC values:
[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]
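
One way to see why the two sets of numbers differ is to compare model.score(X) against the summed output of model.score_samples(X). The sketch below is only a sanity check, not part of the original report: it fits a single 3-component model on the two petal columns (selected by position from the NumPy array rather than by name) and assumes, as the scikit-learn documentation states, that score is the per-sample average log-likelihood. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Petal width (column 3) and petal length (column 2) of the iris data.
X = load_iris().data[:, [3, 2]]
n_samples = X.shape[0]

model = GaussianMixture(n_components=3, random_state=42).fit(X)

# score_samples returns one log-likelihood per row;
# score is documented as the mean of those values.
per_sample = model.score_samples(X)
mean_ll = model.score(X)
total_ll = per_sample.sum()

# _n_parameters is a private helper, used here only because the issue uses it.
k = model._n_parameters()

# BIC built from the total log-likelihood: k * ln(n) - 2 * ln(L).
bic_from_total = k * np.log(n_samples) - 2 * total_ll
# Variant proposed in this issue, which plugs in the mean instead.
bic_from_mean = k * np.log(n_samples) - 2 * mean_ll

print("mean * n equals total:", np.isclose(mean_ll * n_samples, total_ll))
print("bic_from_total equals model.bic:", np.isclose(bic_from_total, model.bic(X)))
print("bic_from_mean:", bic_from_mean)
```

If both comparisons print True, the factor X.shape[0] in the library expression is converting the mean log-likelihood back into the total that the textbook definition expects.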

Expected Results

[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]

Actual Results

[364.579644,
353.501255,
366.865550,
189.449838,
212.580815,
400.169872]

Versions

System:
    python: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.6
        scipy: 1.4.1
       Cython: 0.29.30
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
