[MRG] Sample weights for ElasticNetCV #16449


Merged: 61 commits into scikit-learn:main on Jun 26, 2021

Conversation

@lorentzenchr (Member) commented on Feb 16, 2020:

Reference Issues/PRs

Partially solves #3702: adds sample_weight to ElasticNetCV and LassoCV, but only for a dense feature array X (a usage sketch follows below).
It is a follow-up of PR #15436.

Any other comments?

DO NOT MERGE BEFORE #15436 as it is based on that branch.
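
For illustration, here is a minimal usage sketch of what this PR enables; the data, weights, and cv value below are made up for the example, not taken from the PR:

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.RandomState(42)
    X = rng.rand(50, 3)  # dense X only; sparse input is out of scope here
    y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(50)
    sw = rng.uniform(low=0.5, high=2.0, size=50)

    # With this PR, fit accepts sample_weight for dense X.
    reg = ElasticNetCV(cv=3)
    reg.fit(X, y, sample_weight=sw)
    print(reg.alpha_, reg.coef_)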

@lorentzenchr (Member, Author):

@agramfort How would you like to calculate the cross-validated mean squared error: weighted by sample_weight or unweighted? I think this was discussed somewhere else. I would prefer the weighted version.

Furthermore, just as a side note: the current approach of rescaling X by sqrt(sw) may require more memory copies than the unweighted version.
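
For context, here is a minimal sketch of that sqrt(sw) rescaling trick (illustrative data, not the PR's actual code): weighted least squares on (X, y) is equivalent to unweighted least squares on (sqrt(sw) * X, sqrt(sw) * y), and the rescaled copies of X and y are where the extra memory goes.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.rand(30, 2)
    y = rng.rand(30)
    sw = rng.uniform(low=0.5, high=2.0, size=30)

    # fit_intercept=False keeps the equivalence exact in this sketch;
    # with an intercept, the centering would need to be weighted as well.
    weighted = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=sw)
    rescaled = LinearRegression(fit_intercept=False).fit(
        X * np.sqrt(sw)[:, None],  # rescaled copy of X
        y * np.sqrt(sw),           # rescaled copy of y
    )
    np.testing.assert_allclose(weighted.coef_, rescaled.coef_)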

@rth (Member) commented on Feb 16, 2020:

The diff on GitHub doesn't seem to take the merged #15436 PR into account; maybe merging master would help?

> How would you like to calculate the cross-validated mean squared error: weighted by sample_weight or unweighted? I think this was discussed somewhere else. I would prefer the weighted version.

I couldn't find the corresponding issue, maybe it could be worth opening one. There was #15651 but it's a different topic.

In the short term, I think we want to be consistent with GridSearchCV(ElasticNet()).fit(X, y, sample_weight). As far as I can tell from

    out = parallel(delayed(_fit_and_score)(clone(base_estimator),

and the corresponding _fit_and_score function, sample_weight is not used for scoring there. We could discuss what the right thing to do is (or whether there should be an option to take sample weights into account for scoring) in a separate issue.
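
To illustrate, here is a rough, hypothetical simplification of that behaviour (not scikit-learn's actual code): sample_weight is routed to fit, but each test-fold score is computed unweighted.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def fit_and_score(estimator, X, y, sw, train, test):
        est = clone(estimator).fit(X[train], y[train], sample_weight=sw[train])
        # No sample_weight here: the fold score is a plain, unweighted MSE.
        return mean_squared_error(y[test], est.predict(X[test]))

    rng = np.random.RandomState(0)
    X, y = rng.rand(40, 2), rng.rand(40)
    sw = rng.uniform(low=0.5, high=2.0, size=40)
    scores = [fit_and_score(ElasticNet(alpha=0.1), X, y, sw, train, test)
              for train, test in KFold(n_splits=5).split(X)]
    print(np.mean(scores))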

@lorentzenchr (Member, Author):

@jjerphan If you feel comfortable, I'd very much appreciate your review approval. :smirk:

@jjerphan (Member) left a comment:

Hi @lorentzenchr,

Here are a few last suggestions before approval. 🙂

@jjerphan (Member) left a comment:

LGTM, thanks @lorentzenchr!

@lorentzenchr (Member, Author) commented on May 26, 2021:

Decision in yesterday's dev meeting (2021-05-25): remove use_weights_in_cv. This functionality will (hopefully) come with SLEP006, and removing it makes this PR non-controversial.

@lorentzenchr (Member, Author):

To have it easily referenced, here is the code that would be needed for use_weights_in_cv:

  • LinearModelCV
    """
    use_weights_in_cv : bool, default=False
        If `True`, the MSE over test folds is calculated as a weighted average,
        weighted by the sum of `sample_weight` of each test fold. Here,
        `sample_weight=None` acts like `sample_weight=1`, which means the sum
        of weights in the test fold is the number of observations.
    """
  • fit
    if not self.use_weights_in_cv:
        # The mean is computed over folds.
        mean_mse = np.mean(mse_paths, axis=1)
    else:
        if sample_weight is None:
            # Note that both len(test) and sample_weight[test].sum() can
            # have different values for different folds.
            sw_paths = [len(test) for train, test in folds]
        else:
            sw_paths = [sample_weight[test].sum() for train, test in folds]
        # The average is computed over folds.
        mean_mse = np.average(mse_paths, axis=1, weights=sw_paths)
  • test_enet_cv_sample_weight_correctness
    @pytest.mark.parametrize("use_weights_in_cv", [True, False])
    @pytest.mark.parametrize("fit_intercept", [True, False])
    def test_enet_cv_sample_weight_correctness(use_weights_in_cv, fit_intercept):
        """Test that ElasticNetCV with sample weights gives correct results."""
        ...
        if use_weights_in_cv:
            # Note: The order of groups from LeaveOneGroupOut is the same for both.
            assert_allclose(reg_sw.mse_path_, reg.mse_path_)

@agramfort (Member) left a comment:

Besides that, LGTM.

Thanks a lot @lorentzenchr for taking the time to dive deep into this.

@lorentzenchr (Member, Author):

@rth Do you want to merge? @agramfort and @jjerphan have already approved. This PR only adds sample_weight to some fit methods, nothing more API-wise. test_enet_cv_grid_search ensures equivalence with GridSearchCV.
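
For reference, a sketch of the kind of check test_enet_cv_grid_search performs (the data and grid below are illustrative, not the actual test): with the same folds, the same alpha grid, and MSE-based scoring, both estimators should select the same alpha.

    import numpy as np
    from sklearn.linear_model import ElasticNet, ElasticNetCV
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(42)
    X = rng.rand(60, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(60)
    sw = rng.uniform(low=0.5, high=2.0, size=60)

    reg = ElasticNetCV(cv=5).fit(X, y, sample_weight=sw)
    grid = GridSearchCV(
        ElasticNet(),
        param_grid={"alpha": reg.alphas_},  # reuse ElasticNetCV's alpha grid
        cv=5,
        scoring="neg_mean_squared_error",
    ).fit(X, y, sample_weight=sw)
    assert np.isclose(grid.best_params_["alpha"], reg.alpha_)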

@agramfort (Member):

@lorentzenchr You need to rebase this one.

@rth (Member) left a comment:

Thanks a lot! LGTM. Please merge main in to resolve conflicts.

@lorentzenchr (Member, Author):

@rth Main merged and all CI green.

@rth merged commit 0b1070c into scikit-learn:main on Jun 26, 2021.
@lorentzenchr deleted the enet_cv_sw branch on June 26, 2021 at 11:47.
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request on Nov 30, 2021:
Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>