
[MRG] FIX sample_weight invariance for linear models #19616


Merged
merged 14 commits into scikit-learn:main from fix-ridge-sample-weight-normalize-true on Jun 12, 2021

Conversation

@ogrisel (Member) commented Mar 4, 2021

This is a tentative fix for #17444, but doing so breaks a test added as part of #19426.

The new normalize handling in the base model uses sample_weight.sum() instead of X.shape[0] to scale X_var. This effectively fixes the problem of #17444, as can be seen in the new test sklearn/linear_model/tests/test_ridge.py::test_sample_weight_invariance.
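
Roughly, the change amounts to something like the following (a minimal sketch of the weighted scaling, not the actual _preprocess_data diff; the helper name is made up):

```python
import numpy as np

# Minimal sketch: derive each column's scale from the *weighted*
# variance and the sum of the weights rather than from X.shape[0],
# so that repeating a sample is equivalent to doubling its weight.
def weighted_center_scale(X, sample_weight):
    X_mean = np.average(X, axis=0, weights=sample_weight)
    X_centered = X - X_mean
    X_var = np.average(X_centered ** 2, axis=0, weights=sample_weight)
    X_scale = np.sqrt(X_var * sample_weight.sum())  # was X_var * X.shape[0]
    X_scale[X_scale == 0] = 1.0  # avoid dividing constant columns by zero
    return X_centered / X_scale, X_mean, X_scale
```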

However, that requires updating _scale_alpha_inplace to make test_linear_model_sample_weights_normalize_in_pipeline pass for Ridge.

However, we now get a test failure for Lasso / ElasticNet. I suspect this is because of this dark magic:

```python
# simplify things by rescaling sw to sum up to n_samples
# => np.average(x, weights=sw) = np.mean(sw * x)
sample_weight = sample_weight * (n_samples / np.sum(sample_weight))
# Objective function is:
# 1/2 * np.average(squared error, weights=sw) + alpha * penalty
# but coordinate descent minimizes:
# 1/2 * sum(squared error) + alpha * penalty
# enet_path therefore sets alpha = n_samples * alpha
# With sw, enet_path should set alpha = sum(sw) * alpha
# Therefore, we rescale alpha = sum(sw) / n_samples * alpha
# Note: As we rescaled sample_weights to sum up to n_samples,
# we don't need this
# alpha *= np.sum(sample_weight) / n_samples
```

but I am not sure what the correct fix is.

/cc @maikia @agramfort @rth.
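
For reference, a toy walk-through of that bookkeeping as I read it (my own sketch, not the actual enet_path code; the weights and alpha below are made up):

```python
import numpy as np

n_samples = 4
sw = np.array([1.0, 2.0, 3.0, 2.0])
alpha = 0.1

# With raw weights, the weighted-average objective would require an
# extra factor sw.sum() / n_samples on alpha ...
extra_factor = sw.sum() / n_samples          # 2.0 here, i.e. != 1 in general

# ... but after rescaling sw to sum to n_samples that factor becomes 1,
# so the commented-out `alpha *= np.sum(sample_weight) / n_samples`
# is indeed a no-op.
sw_rescaled = sw * (n_samples / sw.sum())
assert np.isclose(sw_rescaled.sum(), n_samples)
assert np.isclose(sw_rescaled.sum() / n_samples, 1.0)
```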

@ogrisel (Member, Author) commented Mar 5, 2021

I pushed more tests to highlight all the remaining problems. In particular, I added test_enet_ridge_consistency, which passes on main both with normalize=True and normalize=False but is now broken in this branch with normalize=True (with the change to sklearn/linear_model/_base.py). I think this reveals a problem in our ElasticNet model.

I would need more time to write down the weighted objective functions for all models (Ridge, Lasso, ElasticNet) and the expected equivalences between their alpha parameters (which do not have the same meaning), then extend the tests and fix the code accordingly.
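
For reference, a sketch of the kind of alpha equivalence such a consistency test relies on, based on my reading of the documented objectives (the data and alpha below are made up):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

# With l1_ratio=0, ElasticNet minimizes
#     1 / (2 * n_samples) * ||y - Xw||^2 + alpha / 2 * ||w||^2,
# while Ridge minimizes
#     ||y - Xw||^2 + alpha * ||w||^2,
# so the two fits should match when alpha_ridge = alpha_enet * n_samples.
rng = np.random.RandomState(42)
n_samples, n_features = 50, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

alpha = 0.5
enet = ElasticNet(alpha=alpha, l1_ratio=0, tol=1e-10, max_iter=100_000).fit(X, y)
ridge = Ridge(alpha=alpha * n_samples).fit(X, y)
print(np.allclose(enet.coef_, ridge.coef_, atol=1e-5))
```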

But feel free to give it a try without waiting for me.

@ogrisel (Member, Author) commented Mar 5, 2021

Also note that the two test_sample_weight_invariance functions in this PR (one in test_ridge.py and the other in test_coordinate_descent.py) should in the long term be replaced by the common tests of #17441, which revealed #17444, the issue this PR is attempting to fix.

@lorentzenchr (Member)

> I suspect this is because of this dark magic:

@ogrisel As opposed to benchmarking, there is no dark magic involved in this: the objective function of ElasticNet is invariant to a rescaling of the sample weights. We use this freedom: by a convenient choice we end up with sum(sw_rescaled) = n_samples, which then simplifies the call to enet_path.
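
For illustration, a quick empirical check of that invariance on made-up data (a sketch assuming ElasticNet.fit's sample_weight support; a global rescaling of the weights should leave the fit unchanged):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = rng.randn(30)
sw = rng.uniform(0.5, 2.0, size=30)

# The objective only depends on the *relative* weights, so multiplying
# sample_weight by any positive constant should not change the fit.
est_a = ElasticNet(alpha=0.1).fit(X, y, sample_weight=sw)
est_b = ElasticNet(alpha=0.1).fit(X, y, sample_weight=3.0 * sw)
print(np.allclose(est_a.coef_, est_b.coef_))
```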

@agramfort (Member)

@lorentzenchr do you have a bit of bandwidth to take over here?

@lorentzenchr (Member)

> @lorentzenchr do you have a bit of bandwidth to take over here?

I'll try.

@lorentzenchr (Member)

This whole normalize=True thing makes me doubt my mental health status. I think I got most things right. I had to restrict test_sample_weight_invariance in coordinate descent to normalize=False.

@lorentzenchr changed the title from "[WIP] trying to fix normalize=True and sample_weight invariance for linear models" to "[MRG] trying to fix normalize=True and sample_weight invariance for linear models" on May 7, 2021
@lorentzenchr (Member)

@ogrisel @agramfort What do you think? Does this solve the issue? Or is 599a63f a show stopper for you? I think it's ok and we want to get rid of normalize anyway.

@agramfort (Member) left a comment


thx @lorentzenchr for cleaning up this mess 🙏

@lorentzenchr changed the title from "[MRG] trying to fix normalize=True and sample_weight invariance for linear models" to "[MRG] FIX sample_weight invariance for Ridge regression" on Jun 9, 2021
@lorentzenchr changed the title from "[MRG] FIX sample_weight invariance for Ridge regression" to "[MRG] FIX sample_weight invariance for linear models" on Jun 9, 2021
@lorentzenchr (Member)

@ogrisel Are you fine with the current status? If so, let's count each of our reviews as +1/2, which adds up to +1 => merge 😏

@lorentzenchr (Member) left a comment


LGTM (may count as a half review 😏)

@rth (Member) left a comment


Thanks a lot @ogrisel @lorentzenchr !

@rth merged commit 5f3bfcc into scikit-learn:main on Jun 12, 2021
@agramfort (Member)

🍻 @ogrisel @lorentzenchr 👏

@ogrisel deleted the fix-ridge-sample-weight-normalize-true branch on June 15, 2021
@ogrisel (Member, Author) commented Jun 15, 2021

> This whole normalize=True thing makes me doubt my mental health status. I think I got most things right. I had to restrict test_sample_weight_invariance in coordinate descent to normalize=False.
>
> @ogrisel @agramfort What do you think? Does this solve the issue? Or is 599a63f a show stopper for you? I think it's ok and we want to get rid of normalize anyway.

Sorry for the lack of reply. I guess I unconsciously ignored the notifications related to this cursed PR. We probably still have bugs with normalize=True and sample_weight, but so be it. This automatic normalization is a maintenance nightmare; let's allocate more time and effort to non-deprecated features in scikit-learn :)
