Fix elasticnet cv sample weight #29308
Conversation
It seems like this single call to _preprocess_data suffices in all cases.
This tiny example was given in scikit-learn#22914. The test merely asserts that alpha_max is large enough to force the coefficient to 0.
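A minimal sketch of the idea behind such a test (the exact data and assertions used in the PR may differ):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Tiny dataset with a single feature; with eps=1 the alpha grid collapses
# onto alpha_max, the smallest penalty that zeroes out all coefficients.
X = np.array([[3.0], [2.0], [1.0]])
y = np.array([1.0, 2.0, 3.0])

reg = ElasticNetCV(eps=1, n_alphas=3, cv=3).fit(X, y)

# The selected alpha is alpha_max, so the single coefficient must be (numerically) zero.
np.testing.assert_allclose(reg.coef_, 0.0, atol=1e-12)
```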
As per the reviewer's suggestions: (1) Clarify eps=1. (2) Parameterize `fit_intercept`.
(1) Give the name `n_samples` to the quantity `X.shape[0]`. (2) Clarify that `y_offset` and `X_scale` are not used, since these are already applied to the data by `_preprocess_data`.
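For context, a rough sketch of that preprocessing step (`_preprocess_data` is a private scikit-learn helper, so its exact signature may differ between versions): `X` and `y` come back already centered, which is why `y_offset` and `X_scale` are not needed afterwards.

```python
import numpy as np
from sklearn.linear_model._base import _preprocess_data  # private helper

rng = np.random.RandomState(0)
X, y = rng.randn(20, 3), rng.randn(20)
sample_weight = rng.rand(20)

# X and y are returned already centered (by weighted means), so only the
# preprocessed arrays and n_samples are used afterwards.
X, y, X_offset, y_offset, X_scale = _preprocess_data(
    X, y, fit_intercept=True, copy=True, sample_weight=sample_weight
)
n_samples = X.shape[0]
```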
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…ied alpha_grid_ to accommodate for MultitaskCV y shape
@snath-xoc I merged. Don't forget to …
Thanks for the follow-up.
Aside from the formatting issue reported by the linter and the missing changelog entry, here are a few more suggestions:
Feel free to ignore the linter reports on files that are not changed in this PR. This will be fixed in #29359.
Merging …
…ear_model/tests/test_coordinate_descent/test_enet_cv_sample_weight_correctness
@snath-xoc why did you close this PR?
@@ -979,7 +979,6 @@ def fit(self, X, y, sample_weight=None, check_input=True):
     accept_sparse="csc",
     order="F",
     dtype=[np.float64, np.float32],
-    force_writeable=True,
I think the removal of this line and of the other `force_writeable=True` lines below was not intentional (maybe when resolving conflicts with main)? I think this is what causes the test failures on the CI.
-# We weight the first fold 2 times more.
-sw[:n_samples] = 2
+# We weight the first fold n times more.
-# We weight the first fold n times more.
+# We re-weight the first cross-validation group with random integer weights.
+# The samples in the other groups are left with unit weights.
reg_sw.fit(X, y, sample_weight=sw)
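A short sketch of the weight setup that suggestion describes (names, sizes and the RNG are hypothetical): integer weights for the samples of the first cross-validation group, unit weights everywhere else.

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples = 10

# Unit weights for all samples, random integer weights for the first CV group.
sw = np.ones(3 * n_samples)
sw[:n_samples] = rng.randint(1, 5, size=n_samples)
```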
# We repeat the first fold 2 times and provide splits ourselves
if sparse_container is not None:
    X = X.toarray()
X = np.r_[X[:n_samples], X]
X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
-##Need to know number of repitions made in total
+# Inspect the total number of random repetitions so as to adjust the size of
+# the first cross-validation group accordingly.
Actually, I think that computing the number of repetitions is not needed, see the other suggestions below.
X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
n_reps = X_rep.shape[0] - X.shape[0]
X = X_rep
I would rather not rename the `X` variable, to keep the code easier to follow. Maybe you could instead name the variables `X_with_weights`, `y_with_weights`, `groups_with_weights` on one hand, and `X_with_repetitions`, `y_with_repetitions` and `groups_with_repetitions` on the other hand.
And similarly for the names of the 2 cross-validation splitters (if you adapt the code to use metadata routing) or the results of their splits if you prefer to precompute them instead of leveraging metadata routing.
groups = np.r_[
-    np.full(2 * n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
+    np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
Instead of using `n_reps`, you could use:

groups_with_repetitions = np.repeat(groups_with_weights, sw.astype(int), axis=0)

as is done for `X` and `y`.
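Putting the two suggestions together, a possible sketch of the test setup (the variable names follow the reviewer's proposal; the sizes, the RNG and the use of `LeaveOneGroupOut` are assumptions, not the actual PR code):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.RandomState(42)
n_samples, n_features = 10, 5

X_with_weights = rng.randn(3 * n_samples, n_features)
y_with_weights = rng.randn(3 * n_samples)
groups_with_weights = np.repeat([0, 1, 2], n_samples)

# Integer weights for the first group, unit weights for the other groups.
sw = np.ones(3 * n_samples)
sw[:n_samples] = rng.randint(1, 5, size=n_samples)

# Repeat each sample according to its weight; the groups follow the same repetition.
X_with_repetitions = np.repeat(X_with_weights, sw.astype(int), axis=0)
y_with_repetitions = np.repeat(y_with_weights, sw.astype(int), axis=0)
groups_with_repetitions = np.repeat(groups_with_weights, sw.astype(int), axis=0)

# Precompute the two lists of (train, test) splits instead of relying on
# metadata routing; they can be passed directly as `cv` to ElasticNetCV.
cv = LeaveOneGroupOut()
splits_with_weights = list(cv.split(X_with_weights, groups=groups_with_weights))
splits_with_repetitions = list(
    cv.split(X_with_repetitions, groups=groups_with_repetitions)
)
```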
reg.fit(X, y)
# ensure that we chose meaningful alphas, i.e. not boundaries
assert alphas[0] < reg.alpha_ < alphas[-1]
assert_allclose(reg_sw.alphas_, reg.alphas_)
assert reg_sw.alpha_ == reg.alpha_
Please also compare the values of the `mse_path_` attributes prior to comparing the `coef_` values.
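For example, the end of the test could look like the following sketch (reusing the `reg_sw` and `reg` names from the diff above; the exact tolerances are an assumption):

```python
from numpy.testing import assert_allclose

# Compare the CV error paths first, then the selected alpha and the coefficients.
assert_allclose(reg_sw.mse_path_, reg.mse_path_)
assert_allclose(reg_sw.alphas_, reg.alphas_)
assert reg_sw.alpha_ == reg.alpha_
assert_allclose(reg_sw.coef_, reg.coef_)
assert_allclose(reg_sw.intercept_, reg.intercept_)
```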
Closing as most review comments have been addressed in #29442.
Fixes #22914
What does this implement/fix? Explain your changes.
Adapted from the pull request #23045 by s-banach
Modifies the `_alpha_grid` function in `linear_model._coordinate_descent` to accept a `sample_weight` argument, and implements changes to be compatible with `_preprocess_data`.
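For reference, a sample-weighted `alpha_max` amounts to something like the sketch below (the actual `_alpha_grid` code goes through `_preprocess_data` and may normalize the weights differently):

```python
import numpy as np

def weighted_alpha_max(X, y, sample_weight, l1_ratio=0.5, fit_intercept=True):
    """Smallest penalty for which an ElasticNet fit has all-zero coefficients."""
    sw = np.asarray(sample_weight, dtype=float)
    if fit_intercept:
        # Center X and y with weighted means, as _preprocess_data would.
        X = X - np.average(X, axis=0, weights=sw)
        y = y - np.average(y, weights=sw)
    Xy = X.T @ (sw * y)
    return np.abs(Xy).max() / (sw.sum() * l1_ratio)
```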
TODO