Formula to compute BIC in sklearn.mixture.GaussianMixture wrong #23443
Comments
I recall having looked at this formula some time ago (#21481) and reformulating the mathematical section: https://scikit-learn.org/stable/modules/linear_model.html#lasso-lars-ic The
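The current formula is:
-2 * self.score(X) * X.shape[0] + self._n_parameters() * np.log(X.shape[0])
The proposed correction is:
self._n_parameters() * np.log(X.shape[0]) - 2 * self.score(X)
The general formula of the BIC is
$$\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})$$
Now let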
Then
Which in turn is the proposed correction.
Example
When one computes the silhouette scores (to evaluate the best number of clusters) in contrast to the BIC, one will retrieve
So in conclusion, silhouette indices that
For convenience, here are the BICs for 2,...,7 clusters again. The score for n=2 clusters is the best here and is consistent with the visual clustering and the silhouette score.
My current understanding is that since score(X) returns the per-sample average log-likelihood, the total log-likelihood (the actual factor BIC uses) is attained as self.score(X) * X.shape[0].
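The point made in this comment can be checked with a small standalone sketch. It does not use scikit-learn. Instead it fits a single 1-D Gaussian by hand, and the helper names (`gaussian_logpdf`, `bic_from_total`, `bic_from_mean`) are hypothetical, introduced only for illustration. It assumes, as the comment does, that a `score`-style method returns the per-sample average log-likelihood, so the total log-likelihood is that average times the number of samples.

```python
import math
import random

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a 1-D normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def bic_from_total(total_loglik, n_samples, n_params):
    """Textbook BIC: k * ln(n) - 2 * ln(L_hat)."""
    return n_params * math.log(n_samples) - 2 * total_loglik

def bic_from_mean(mean_loglik, n_samples, n_params):
    """Same quantity, starting from a per-sample average log-likelihood."""
    return -2 * mean_loglik * n_samples + n_params * math.log(n_samples)

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(200)]
n = len(data)

# Maximum-likelihood estimates for a single 1-D Gaussian (2 free parameters).
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

per_sample = [gaussian_logpdf(x, mu_hat, sigma_hat) for x in data]
total_loglik = sum(per_sample)   # ln(L_hat), the quantity BIC is defined on
mean_loglik = total_loglik / n   # what a per-sample "score" would return

bic_a = bic_from_total(total_loglik, n, n_params=2)
bic_b = bic_from_mean(mean_loglik, n, n_params=2)

# Variant that plugs the per-sample average in without rescaling by n.
bic_unscaled = 2 * math.log(n) - 2 * mean_loglik

print(bic_a, bic_b, bic_unscaled)
```

Under these assumptions the first two values agree, while the unscaled variant is much smaller because the likelihood term is divided by n. In that variant the penalty term would be expected to dominate, so the value grows roughly in step with the number of parameters.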
@Micky774 agreed. Then I have absolutely no idea what's going wrong here. In any case, here is the colab notebook for reproducibility.
I'm not sure there is even anything wrong here: AIC/BIC and the silhouette score are all heuristics, and it isn't altogether surprising to me that they don't agree in this case. These are good rule-of-thumb heuristics, of course, but none is robust enough for this divergence to raise a red flag for me.
I am going to close this one, since it seems there is no issue in scikit-learn and this turned into a more generic machine learning discussion. Thanks @Micky774 for the input!
Describe the bug
The formula is wrong:
return -2 * self.score(X) * X.shape[0] + self._n_parameters() * np.log(X.shape[0])
Proposed fix:
Replace formula with
return self._n_parameters() * np.log(X.shape[0]) - 2 * self.score(X)
Steps/Code to Reproduce
The above will output the BIC using the built-in method model.bic() as follows:
Using the correct BIC formula (see e.g. https://en.wikipedia.org/wiki/Bayesian_information_criterion)
leads to the following BICs:
[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]
Expected Results
[57.180072607292026,
86.96960303077948,
116.92208468419805,
145.6026996264713,
175.62029248920572,
206.73427255638083]
Actual Results
[364.579644,
353.501255,
366.865550,
189.449838,
212.580815,
400.169872]
Versions