Fix LinearSVC handling of sample weights under class_weight="balanced" #30057
Conversation
Thanks for the PR. Please sync again with the main branch. Please also add a changelog entry for v1.6.
We need to review how compute_class_weight is used in scikit-learn. I suspect that this problem is not just for LinearSVC but for any model that accepts class_weight="balanced". As a result, we should probably change compute_class_weight to accept an additional sample_weight parameter and implement the fix within compute_class_weight itself rather than in _fit_liblinear only.
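The proposed change could be sketched as follows in plain NumPy. This is an illustrative sketch, not the actual scikit-learn implementation; the helper name balanced_class_weights is made up for illustration.

```python
import numpy as np

def balanced_class_weights(y, sample_weight=None):
    # Sketch of the proposed behaviour: the "balanced" heuristic
    # computed from *weighted* class counts. With uniform weights this
    # reduces to the classic n_samples / (n_classes * np.bincount(y)).
    classes, y_ind = np.unique(y, return_inverse=True)
    if sample_weight is None:
        sample_weight = np.ones(len(y_ind))
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
    return sample_weight.sum() / (len(classes) * weighted_class_counts)

# Weighting a sample by 2 gives the same result as repeating it twice:
print(balanced_class_weights([0, 0, 1], sample_weight=[1, 1, 2]))  # [1. 1.]
print(balanced_class_weights([0, 0, 1, 1]))                        # [1. 1.]
```

With this formulation the sample_weight-vs-repetition equivalence checked by the common estimator tests holds by construction.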
I have merged the main branch into this PR and ran the check_sample_weight_equivalence test, with the following output. Perhaps I need to dig in a bit more; it seems like the fix hasn't fully solved the issue.
I also noticed other failures related to the liblinear solver. We need to investigate whether those are caused by fundamental limitations of the liblinear solver, an inappropriate expectation on the numerical precision of the solver on particularly challenging test data with many repeated data points (a "bug" in the sample_weight estimator check itself), or the way we (mis-)use liblinear. So I think we can keep the xfail marker for this estimator and merge the fix from this PR independently of the resolution of the problem described in the previous paragraph.
sklearn/utils/class_weight.py (outdated)

    @@ -69,7 +72,12 @@ def compute_class_weight(class_weight, *, classes, y):
        if not all(np.isin(classes, le.classes_)):
            raise ValueError("classes should have valid labels that are in y")

        recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
        n_effective_samples = np.bincount(y_ind, weights=sample_weight).sum()
The result of np.bincount(y_ind, weights=sample_weight) should be stored in a local variable to avoid computing it twice in a row.
The CI is still failing: since the new parameter is added to a function that is part of the scikit-learn public API, it needs to be documented in the docstring. We should also use the .. versionadded:: 1.6 directive on this part of the docstring. Look for other examples of usage of versionadded in our codebase as a reference.
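For illustration, such a docstring entry might look like the following sketch (the wording and parameter description are an assumption, not the final docstring):

```python
def compute_class_weight(class_weight, *, classes, y, sample_weight=None):
    """Estimate class weights for unbalanced datasets.

    Parameters
    ----------
    sample_weight : array-like of shape (n_samples,), default=None
        Array of weights that are assigned to individual samples.
        Only used when class_weight="balanced".

        .. versionadded:: 1.6
    """
    ...
```

The versionadded directive is rendered by Sphinx in the API reference, so readers can see in which release the parameter appeared.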
And since we are adding a new parameter to a public function, we also need to document this part of the PR as an enhancement in a dedicated file entry under the doc/whats_new/upcoming_changes/sklearn.utils folder. And we need a fix changelog entry for the actual fix for the sklearn.svm.LinearSVC estimator impacted by this change.
Oh no, I think I did something wrong in my git pull origin --rebase, and it has pulled in all the changes of the past few days. I will try to revert it.
…imator that is known to handle sample_weight properly
(force-pushed from 1f59cc6 to defba7b)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@snath-xoc since we merged #30149, we need to move the xfail markers accordingly. This explains the remaining conflicts on this PR.
Oh great, will move the xfails then, thanks for the pointer.
assert_array_equal should be reserved for comparing arrays with non-continuous dtypes (e.g. integers, str, object...). To compare floating-point arrays, we should almost always use assert_allclose (or keep assert_array_almost_equal in older code that already uses that function). Similarly, to compare Python scalar float values, it's better to use assert a == pytest.approx(b) instead of assert a == b, to avoid failures related to rounding errors. Other than that and the other small suggestions below, LGTM.
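A small illustration of why exact equality is fragile for floats, using the standard NumPy testing helpers (pytest.approx plays the same role for scalars in pytest-based tests):

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

# Exact comparison is fine for non-continuous dtypes such as integers:
assert_array_equal(np.array([1, 2, 3]), np.array([1, 2, 3]))

# Floating-point arithmetic accumulates rounding error, so exact
# equality can fail even when two values are "morally" equal:
a = 0.1 + 0.2
assert a != 0.3  # a is 0.30000000000000004

# assert_allclose tolerates tiny relative differences:
assert_allclose(np.array([a]), np.array([0.3]))

# For Python scalar floats inside pytest tests, the analogous idiom is:
#     assert a == pytest.approx(0.3)
```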
    assert_array_almost_equal(cw, [2.0 / 3, 2.0, 1.0])
    assert len(cw_rep) == len(classes)

    class_counts = np.bincount(y + 2, weights=sw)
Maybe rename this to class_counts_weighted and the other one to class_counts_repeated, to make the assertions easier to follow? Similarly for cw and cw_rep.
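The invariant behind these assertions can be checked directly: weighting a sample by an integer k is equivalent to repeating it k times. This is an illustrative snippet following the suggested naming, not the test from the PR:

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.randint(0, 3, size=20)
sw = rng.randint(1, 4, size=20)  # integer weights so repetition is exact

# Weighted counts on the original samples:
class_counts_weighted = np.bincount(y, weights=sw)

# Counts after materializing each sample sw[i] times:
class_counts_repeated = np.bincount(np.repeat(y, sw))

assert np.allclose(class_counts_weighted, class_counts_repeated)
```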
doc/whats_new/upcoming_changes/sklearn.linear_model/30057.fix.rst
Maybe we could update the docstring to reflect the changes (scikit-learn/sklearn/utils/class_weight.py, lines 26 to 31 in c4e1819). Maybe in plain English, something like:
There are a few estimators supporting class_weight="balanced":

- scikit-learn/sklearn/ensemble/_forest.py, line 879 in c4e1819
- scikit-learn/sklearn/linear_model/_stochastic_gradient.py, lines 617 to 620 in c4e1819

Idem for scikit-learn/sklearn/svm/_base.py, line 748 in c4e1819.
Thank you for that @antoinebaker. I can start looking into these next week. I recall looking into BaseSVC as well, and the fix should be somewhat similar(-ish).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
    @@ -69,7 +75,10 @@ def compute_class_weight(class_weight, *, classes, y):
        if not all(np.isin(classes, le.classes_)):
            raise ValueError("classes should have valid labels that are in y")

        recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
        weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
We need to validate sample_weight first using _check_sample_weight.
@snath-xoc, this PR is almost ready for merge; this is the last thing. Something like:

    sample_weight = _check_sample_weight(sample_weight, y)
    weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Fixes #30056 and cross-linked to the meta-issue #16298.

Under the class_weight='balanced' strategy, class weights are calculated as:

    n_samples / (n_classes * np.bincount(y))

Sample weights were previously not passed through under this strategy, leading to different class weights when calculated on weighted vs. repeated samples.
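The discrepancy can be reproduced with plain NumPy. This is an illustrative sketch of the formulas involved, not the scikit-learn code itself:

```python
import numpy as np

y = np.array([0, 0, 0, 1])
sw = np.array([1.0, 1.0, 1.0, 3.0])   # last sample weighted 3x
y_rep = np.array([0, 0, 0, 1, 1, 1])  # same data, last sample repeated 3x

# Previous behaviour: sample weights ignored by the "balanced" formula.
unweighted = len(y) / (2 * np.bincount(y).astype(np.float64))

# Reference: the same formula applied to the repeated dataset.
repeated = len(y_rep) / (2 * np.bincount(y_rep).astype(np.float64))

# Weight-aware variant: weighted class counts instead of raw counts.
weighted = sw.sum() / (2 * np.bincount(y, weights=sw))

print(unweighted)  # [0.66666667 2.        ] -- disagrees with `repeated`
print(repeated)    # [1. 1.]
print(weighted)    # [1. 1.]                 -- matches `repeated`
```

The weight-aware formula makes the weighted and repeated datasets yield identical class weights, which is exactly the equivalence the common sample_weight check asserts.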
TO DO: