
Fix elasticnet cv sample weight #29308


Closed

Conversation

Contributor

@snath-xoc snath-xoc commented Jun 19, 2024

Fixes #22914

What does this implement/fix? Explain your changes.

Adapted from pull request #23045 by s-banach.

Modifies the _alpha_grid function in linear_model._coordinate_descent to accept a sample_weight argument and makes it compatible with _preprocess_data.
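
For illustration, a minimal sketch of the weighted alpha grid idea (hypothetical simplified code, not the PR's implementation; the real _alpha_grid also handles sparse input, a precomputed Xy, and fit_intercept via _preprocess_data):

import numpy as np

def _alpha_grid_sketch(X, y, sample_weight=None, l1_ratio=1.0, eps=1e-3, n_alphas=100):
    # Illustrative only: centering below mimics what _preprocess_data does
    # when sample weights are provided.
    n_samples = X.shape[0]
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    X_centered = X - np.average(X, axis=0, weights=sample_weight)
    y_centered = y - np.average(y, weights=sample_weight)
    # Smallest alpha that forces all coefficients to exactly zero: the
    # denominator uses the total weight instead of n_samples.
    alpha_max = np.max(
        np.abs(X_centered.T @ (sample_weight * y_centered))
    ) / (sample_weight.sum() * l1_ratio)
    # Log-spaced grid in decreasing order, as _alpha_grid returns it.
    return np.logspace(
        np.log10(alpha_max * eps), np.log10(alpha_max), num=n_alphas
    )[::-1]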

TODO

  • see if we can merge the new test with the existing sample_weight tests or update them if necessary.

s-banach and others added 10 commits April 3, 2022 22:11
It seems like this single call to _preprocess_data suffices in all cases.
This tiny example was given in scikit-learn#22914.
The test merely asserts that alpha_max is large enough to force the coefficient to 0.
As per reviewer's suggestions:
(1) Clarify eps=1.
(2) Parameterize `fit_intercept`.
(1) Give the name `n_samples` to the quantity `X.shape[0]`.
(2) Clarify that `y_offset` and `X_scale` are not used, since these are already applied to the data by `_preprocess_data`.
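
For context, the property those test commits assert can be sketched roughly as follows (illustrative data, not the actual test; it assumes the fix from this PR is in place):

import numpy as np
from sklearn.linear_model import ElasticNet, ElasticNetCV

rng = np.random.RandomState(42)
X, y = rng.randn(10, 3), rng.randn(10)
sw = rng.randint(1, 5, size=10).astype(float)

reg = ElasticNetCV(n_alphas=10, cv=3).fit(X, y, sample_weight=sw)
# alphas_ is stored in decreasing order, so alphas_[0] is alpha_max; with a
# weight-aware alpha grid, refitting at alpha_max with the same weights must
# drive every coefficient to exactly zero.
pinned = ElasticNet(alpha=reg.alphas_[0]).fit(X, y, sample_weight=sw)
assert np.allclose(pinned.coef_, 0.0)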

github-actions bot commented Jun 19, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bf62b35.

snath-xoc and others added 2 commits June 20, 2024 22:10
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…ied alpha_grid_ to accommodate the MultitaskCV y shape
@snath-xoc snath-xoc marked this pull request as ready for review June 21, 2024 14:51
Member

ogrisel commented Jun 27, 2024

@snath-xoc I merged main into this PR's branch to make the test_check_inplace_ensure_writeable failure go away as I think it was fixed there.

Don't forget to git pull origin fix_elasticnet_cv_sample_weight --rebase before adding new commits to this branch.

Member

@ogrisel ogrisel left a comment


Thanks for the follow-up.

Aside from the formatting issue reported by the linter and the missing changelog entry, here are a few more suggestions:

Member

ogrisel commented Jun 27, 2024

Feel free to ignore the linter reports on files that are not changed in this PR. This will be fixed in #29359.

Member

ogrisel commented Jun 28, 2024

Merging main again to silence the unrelated linting problems.

@snath-xoc snath-xoc closed this Jul 5, 2024
@snath-xoc snath-xoc deleted the fix_elasticnet_cv_sample_weight branch July 5, 2024 10:34
@snath-xoc snath-xoc restored the fix_elasticnet_cv_sample_weight branch July 5, 2024 10:34
Member

ogrisel commented Jul 5, 2024

@snath-xoc why did you close this PR?

@snath-xoc snath-xoc reopened this Jul 5, 2024
@@ -979,7 +979,6 @@ def fit(self, X, y, sample_weight=None, check_input=True):
accept_sparse="csc",
order="F",
dtype=[np.float64, np.float32],
force_writeable=True,
Member


I think the removal of this line and the other force_writeable=True lines below were not intentional (maybe when resolving conflicts with main)?

I think this is what causes the test failures on the CI.


# We weight the first fold 2 times more.
sw[:n_samples] = 2
# We weight the first fold n times more.
Member


Suggested change
# We weight the first fold n times more.
# We re-weight the first cross-validation group with random integer weights.
# The samples in the other groups are left with unit weights.

reg_sw.fit(X, y, sample_weight=sw)

# We repeat the first fold 2 times and provide splits ourselves
if sparse_container is not None:
X = X.toarray()
X = np.r_[X[:n_samples], X]
X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
Member


Suggested change
##Need to know number of repitions made in total
# Inspect the total number of random repetitions so as to adjust the size of
# the first cross-validation group accordingly.

Member


Actually, I think that computing the number of repetitions is not needed, see the other suggestions below.

X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
n_reps = X_rep.shape[0] - X.shape[0]
X = X_rep
Member


I would rather not change the X variable, to keep the code easier to follow.

Maybe you could instead name the variables X_with_weights, y_with_weights, groups_with_weights on one hand and X_with_repetitions, y_with_repetitions and groups_with_repetitions on the other hand.

Member


And similarly for the names of the two cross-validation splitters (if you adapt the code to use metadata routing), or for the results of their splits if you prefer to precompute them instead of leveraging metadata routing.
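
For example, the precomputed-splits variant might look like this (a sketch only: the splitter choice and n_splits are assumptions, and the variable names follow the naming suggested above):

from sklearn.model_selection import GroupKFold

# Precompute matching splits for the weighted and the repeated datasets.
splits_with_weights = list(
    GroupKFold(n_splits=3).split(X_with_weights, y_with_weights, groups_with_weights)
)
splits_with_repetitions = list(
    GroupKFold(n_splits=3).split(X_with_repetitions, y_with_repetitions, groups_with_repetitions)
)
reg_sw = ElasticNetCV(alphas=alphas, cv=splits_with_weights)
reg = ElasticNetCV(alphas=alphas, cv=splits_with_repetitions)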

groups = np.r_[
np.full(2 * n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
Member


Instead of using n_reps, you could use:

groups_with_repetitions = np.repeat(groups_with_weights, sw.astype(int), axis=0)

as is done for X and y.
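
For instance, with illustrative values:

import numpy as np

groups_with_weights = np.array([0, 0, 1, 2])
sw = np.array([3.0, 2.0, 1.0, 1.0])
np.repeat(groups_with_weights, sw.astype(int), axis=0)
# -> array([0, 0, 0, 0, 0, 1, 2]): labels repeated in lockstep with the rows of X and y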

reg.fit(X, y)

# ensure that we chose meaningful alphas, i.e. not boundaries
assert alphas[0] < reg.alpha_ < alphas[-1]
assert_allclose(reg_sw.alphas_, reg.alphas_)
assert reg_sw.alpha_ == reg.alpha_
Member


Please also compare the values of the mse_path_ attributes prior to comparing the coef_ values.
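
That is, something along these lines (a sketch reusing the estimator names from the test):

# Weighted and repeated fits should agree along the whole CV path,
# not only on the selected alpha.
assert_allclose(reg_sw.mse_path_, reg.mse_path_)
assert_allclose(reg_sw.coef_, reg.coef_)
assert_allclose(reg_sw.intercept_, reg.intercept_)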

@snath-xoc snath-xoc mentioned this pull request Jul 9, 2024
Member

ogrisel commented Aug 7, 2024

Closing as most review comments have been addressed in #29442.

@ogrisel ogrisel closed this Aug 7, 2024
Development

Successfully merging this pull request may close these issues.

Calculation of alphas in ElasticNetCV doesn't use sample_weight