Fix linear svc handling sample weights under class_weight="balanced" #30057


Merged — 26 commits merged into scikit-learn:main on Feb 11, 2025

Conversation


@snath-xoc snath-xoc commented Oct 13, 2024

Fixes #30056; cross-linked to meta-issue #16298.

Under the class_weight='balanced' strategy, class weights are calculated as:

n_samples / (n_classes * np.bincount(y))

Previously, sample weights were not passed through under this strategy, leading to different class weights when computed on weighted versus repeated samples.
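The fix can be sketched in plain NumPy (a hypothetical helper, not the scikit-learn implementation): weighting the class counts makes an integer sample weight equivalent to repeating that sample.

```python
import numpy as np

def balanced_class_weights(y, sample_weight=None):
    """Hypothetical sketch: one weight per class, inversely proportional
    to the (weighted) class frequencies."""
    classes, y_ind = np.unique(y, return_inverse=True)
    if sample_weight is None:
        sample_weight = np.ones_like(y_ind, dtype=np.float64)
    # Weighted class counts instead of plain np.bincount(y_ind)
    weighted_counts = np.bincount(y_ind, weights=sample_weight)
    n_effective = weighted_counts.sum()
    return n_effective / (len(classes) * weighted_counts)

y = np.array([0, 0, 1])
sw = np.array([1.0, 2.0, 1.0])
y_rep = np.array([0, 0, 0, 1])  # the weight-2 sample repeated twice

print(balanced_class_weights(y, sw))    # ≈ [2/3, 2.0]
print(balanced_class_weights(y_rep))    # same result
```

With the weighted counts, fitting on `(y, sw)` and on the repeated `y_rep` yields identical class weights, which is exactly the equivalence the estimator check asserts.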

TO DO:

  • Add tests for compute_class_weight

github-actions bot commented Oct 13, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: d31272d.

@adrinjalali adrinjalali changed the title from “Fix linear svc” to “Fix linear svc handling sample weights under class_weight="balanced"” Oct 14, 2024

ogrisel commented Oct 16, 2024

Thanks for the PR.

Please sync again with main to make sure that check_sample_weight_equivalence can work on your branch, and then add an entry for LinearSVC with class_weight="balanced" to sklearn.utils._test_common.instance_generator.PER_ESTIMATOR_CHECK_PARAMS.

Please also add a changelog entry for v1.6.

@ogrisel ogrisel left a comment

We need to review how compute_class_weight is used in scikit-learn. I suspect that this problem is not just for LinearSVC but for any model that accepts class_weight="balanced". As a result we should probably change compute_class_weight to accept an additional sample_weight parameter and implement the fix within compute_class_weight itself rather than in _fit_liblinear only.

@snath-xoc

I have merged the latest main into this branch and ran the check_sample_weight_equivalence test, with the following output:

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=1e-09
Comparing the output of LinearSVC.decision_function revealed that fitting with `sample_weight` is not equivalent to fitting with removed or repeated data points.
Mismatched elements: 15 / 45 (33.3%)
Max absolute difference among violations: 4.05525969e-07
Max relative difference among violations: 4.24634322e-05
 ACTUAL: array([[ 0.938949, -0.948518, -0.873634],
       [-1.505082,  0.967837, -0.951662],
       [-1.042943, -0.954705,  0.934979],...
 DESIRED: array([[ 0.938949, -0.948518, -0.873634],
       [-1.505082,  0.967837, -0.951661],
       [-1.042943, -0.954705,  0.934979],...

Perhaps I need to dig in a bit more; it seems the fix hasn't fully solved the issue.


ogrisel commented Nov 4, 2024

I also noticed other failures related to the liblinear solver used in LinearSVC as part of #30143 with violation of the numerical assertion on the same scale as what you observe:

Max absolute difference among violations: 4.05525969e-07
Max relative difference among violations: 4.24634322e-05

We need to investigate whether those are caused by fundamental limitations of the liblinear solver, an inappropriate expectation on the numerical precision of the solver on particularly challenging test data with many repeated data points (a "bug" in the sample_weight estimator check itself), or the way we (mis-)use liblinear in LinearSVC (a bug in the scikit-learn estimator code).

So I think we can keep the xfail marker for this estimator and merge the fix from this PR independently of the resolution of the problem described in the previous paragraph.

@@ -69,7 +72,12 @@ def compute_class_weight(class_weight, *, classes, y):
    if not all(np.isin(classes, le.classes_)):
        raise ValueError("classes should have valid labels that are in y")

    recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
    n_effective_samples = np.bincount(y_ind, weights=sample_weight).sum()

The result np.bincount(y_ind, weights=sample_weight) should be stored in a local variable to avoid computing it twice in a row.
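The suggested refactor can be sketched with illustrative inputs (the label and weight values here are made up, not taken from the PR's tests):

```python
import numpy as np

# Illustrative encoded labels and sample weights (hypothetical values).
y_ind = np.array([0, 0, 1, 2])
sample_weight = np.array([1.0, 2.0, 1.0, 1.0])

# Compute the weighted counts once and reuse them, instead of calling
# np.bincount(y_ind, weights=sample_weight) twice in a row.
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
n_effective_samples = weighted_class_counts.sum()

print(weighted_class_counts)   # [3. 1. 1.]
print(n_effective_samples)     # 5.0
```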

@ogrisel ogrisel left a comment

The CI is still failing: since the new parameter is added to a function that is part of the scikit-learn public API, it needs to be documented in the docstring. We should also use the .. versionadded:: 1.6 directive on this part of the docstring. Look for other examples of versionadded usage in our codebase as reference.

Since we are adding a new parameter to a public function, we also need to document this part of the PR as an enhancement in a dedicated file entry under the doc/whats_new/upcoming_changes/sklearn.utils folder.

We also need a fix changelog entry for the sklearn.svm.LinearSVC estimator impacted by this change.

@snath-xoc

Oh no, I think I did something wrong in my git pull origin --rebase, and it has pulled in all the changes of the past few days; I will try to revert it.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel commented Nov 14, 2024

@snath-xoc since we merged #30149, we need to move _xfail_checks to a dictionary entry in PER_ESTIMATOR_XFAIL_CHECKS (or even in _get_expected_failed_checks when needed).

This explains the remaining conflicts on this PR.
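For reference, an entry in that registry has roughly the following shape (a hypothetical sketch; the authoritative layout lives in sklearn.utils._test_common.instance_generator):

```python
# Hypothetical sketch of the xfail registry shape discussed above:
# estimator name -> {check name: reason the check is expected to fail}.
# Check instance_generator.py for the real layout and key types.
PER_ESTIMATOR_XFAIL_CHECKS = {
    "LinearSVC": {
        "check_sample_weight_equivalence": (
            "liblinear's numerical precision on repeated samples (see #30143)"
        ),
    },
}

print(sorted(PER_ESTIMATOR_XFAIL_CHECKS["LinearSVC"]))
```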

@snath-xoc

Oh great, will move the xfails then, thanks for the pointer

@snath-xoc snath-xoc marked this pull request as ready for review November 18, 2024 12:40
@ogrisel ogrisel left a comment

assert_array_equal should be reserved for comparing arrays with non-continuous dtypes (e.g. integers, str, object...).

To compare floating-point arrays, we should almost always use assert_allclose (or keep assert_array_almost_equal in older code that already uses that function).

Similarly, to compare Python scalar float values, it's better to use assert a == pytest.approx(b) instead of assert a == b to avoid failures related to rounding errors.

Other than that and the other small suggestions below, LGTM.
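A small illustration of this advice, using the stdlib math.isclose as a stand-in for pytest.approx:

```python
import math
import numpy as np
from numpy.testing import assert_allclose

# Tolerant elementwise comparison for float arrays.
a = np.array([0.1, 0.2]) + np.array([0.2, 0.1])
assert_allclose(a, [0.3, 0.3])

# Exact equality on scalar floats fails due to rounding...
x = 0.1 + 0.2
assert x != 0.3
# ...while a tolerant comparison (pytest.approx plays this role
# in the test suite) passes.
assert math.isclose(x, 0.3)
print("ok")
```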

assert_array_almost_equal(cw, [2.0 / 3, 2.0, 1.0])
assert len(cw_rep) == len(classes)

class_counts = np.bincount(y + 2, weights=sw)

Maybe rename this to class_counts_weighted and the other one to class_counts_repeated to make the assertions easier to follow?

Similarly, for cw and cw_rep.

@antoinebaker

Maybe we could update the docstring to reflect the changes

class_weight : dict, "balanced" or None
    If "balanced", class weights will be given by
    `n_samples / (n_classes * np.bincount(y))`.
    If a dictionary is given, keys are classes and values are
    corresponding class weights.
    If `None` is given, the class weights will be uniform.

Maybe in plain English, something like:

    If "balanced", class weights will be inversely proportional to the class frequencies.

or maybe add a comment:

    If "balanced", class weights will be given by `n_samples / (n_classes * np.bincount(y))`, or their weighted equivalent if `sample_weight` is provided.
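Putting both suggestions together with the earlier .. versionadded:: note, the updated docstring entries might read roughly like this (a sketch of possible wording, not the merged text):

```rst
class_weight : dict, "balanced" or None
    If "balanced", class weights will be given by
    ``n_samples / (n_classes * np.bincount(y))``, or their weighted
    equivalent if ``sample_weight`` is provided.
    If a dictionary is given, keys are classes and values are the
    corresponding class weights.
    If ``None`` is given, the class weights will be uniform.

sample_weight : array-like of shape (n_samples,), default=None
    Weights associated with individual samples.

    .. versionadded:: 1.6
```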

@antoinebaker

There are a few estimators supporting class_weight="balanced" and sample_weight for which I have some doubts.

ForestClassifier seems to rely on compute_sample_weight to compute the class weights:

expanded_class_weight = compute_sample_weight(class_weight, y_original)

BaseSGDClassifier does not take sample_weight into account:

# Allocate datastructures from input arguments
self._expanded_class_weight = compute_class_weight(
    self.class_weight, classes=self.classes_, y=y
)

Idem for BaseSVC

self.class_weight_ = compute_class_weight(self.class_weight, classes=cls, y=y_)

@snath-xoc


Thank you for that @antoinebaker, I can start looking into these next week; I recall looking into BaseSVC as well and the fix should be somewhat similar.

@@ -69,7 +75,10 @@ def compute_class_weight(class_weight, *, classes, y):
    if not all(np.isin(classes, le.classes_)):
        raise ValueError("classes should have valid labels that are in y")

    recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
    weighted_class_counts = np.bincount(y_ind, weights=sample_weight)

We need to validate sample_weight first using _check_sample_weight


@snath-xoc, this PR is almost ready for merge. This is the last thing. Something like:

sample_weight = _check_sample_weight(sample_weight, y)
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)

snath-xoc and others added 2 commits January 24, 2025 16:26
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
@jeremiedbb jeremiedbb enabled auto-merge (squash) February 11, 2025 17:32
@jeremiedbb jeremiedbb merged commit 9523006 into scikit-learn:main Feb 11, 2025
31 checks passed
Successfully merging this pull request may close these issues.

LinearSVC does not correctly handle sample_weight under class_weight strategy 'balanced'