
Fix for passing a pandas Series as sample_weights in RidgeCV leading to an error (#5606) #6307


Closed
wants to merge 3 commits

Conversation

kaichogami
Contributor

Currently there seem to be no tests checking pandas compatibility, and as @jnothman mentioned in the issue, tests covering the various compatibility cases probably need to be added.
@jakevdp @amueller @jnothman reviews please, and let me know what should be done.

@jakevdp
Member

jakevdp commented Feb 8, 2016

This will work for this particular case, but it only fixes one of the many cases where this needs to be done. I think a more comprehensive fix would be this:

  • Add a validation utility that accepts sample_weight: e.g. check_X_y_sample_weight(), or perhaps add an optional sample_weight parameter to check_X_y (this should handle the None case automatically).
  • Use this function throughout sklearn wherever sample_weight needs to be validated.
  • Add a new test in sklearn.utils.estimator_checks to make sure sample weights are correctly handled in any estimator that accepts them.

I realize that's a much bigger job than just fixing Ridge, but the piecemeal approach is bound to miss some things.
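
A rough sketch of what such a helper could look like (hypothetical name and details, not an existing sklearn API; it only illustrates the idea):

import numpy as np
from sklearn.utils.validation import check_X_y, check_array, check_consistent_length

def check_X_y_sample_weight(X, y, sample_weight=None, **check_params):
    # Hypothetical helper: validate X and y as usual, then sample_weight if given.
    X, y = check_X_y(X, y, **check_params)
    if sample_weight is not None:
        # Coerce to a 1-d float array (this also handles a pandas Series)
        # and make sure there is one weight per sample.
        sample_weight = check_array(sample_weight, ensure_2d=False, dtype=np.float64)
        if sample_weight.ndim != 1:
            raise ValueError("sample_weight must be 1-dimensional")
        check_consistent_length(X, sample_weight)
    return X, y, sample_weight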

@MechCoder
Member

As far as common tests for sample_weight are concerned, there is some work going on here: #5515.

@kaichogami
Contributor Author

@jakevdp That sounds like a good idea. I would like to work on it, although it will take quite some time before I produce anything substantial, as I still have a lot to learn about the code base and sample_weights. I will let you know once I come up with something.
@MechCoder Thanks for sharing that.

Used the validation change in linear_model/ridge
@kaichogami kaichogami reopened this Feb 13, 2016
@kaichogami
Contributor Author

@jakevdp I have added validation for sample_weight by adding an optional parameter to the existing check_X_y function. I have set the default value to False so that:

  • We don't have to change check_X_y calls everywhere to expect three return values in places where sample_weights do not exist.
  • It checks the None case automatically.

If you think the current structure is fine, I can go ahead and use it in other estimators. Also, do you think we need to check for anything other than check_consistent_length and ensuring it is 1-d? Please let me know what you think.
Thank you!
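
Roughly, the shape of the change is this (a simplified sketch of the idea, not the exact code in this PR):

from sklearn.utils.validation import check_array, check_consistent_length

def check_X_y(X, y, sample_weight=False):
    # (the existing X/y validation would run here; elided for brevity)
    if sample_weight is False:
        return X, y              # default: callers keep the old two-value return
    if sample_weight is None:
        return X, y, None        # None is accepted and passed through unchanged
    sample_weight = check_array(sample_weight, ensure_2d=False)
    check_consistent_length(X, sample_weight)
    return X, y, sample_weight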

@jakevdp
Member

jakevdp commented Feb 17, 2016

Hmm... I'm not certain about this. The default value of False seems a bit problematic.

Maybe it would be cleaner to have a separate function for the case of checking X, y, and sample_weights?

@kaichogami
Contributor Author

@jakevdp I agree that sample_weight=False looks disorganized. I have now added a new function, check_sample_weight, which performs the same checks as before, although it does not automatically handle the None case. Please let me know what you think of it now.
I have run into merge conflicts, which I will resolve later in the evening.
Thank you!
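
For reference, the new helper is along these lines (a simplified sketch; the details in the PR may differ):

from sklearn.utils.validation import check_array, check_consistent_length

def check_sample_weight(X, sample_weight):
    # Sketch: coerce sample_weight to a 1-d array and check that it has
    # one entry per sample of X. None is not handled here; the caller has
    # to check for it before calling.
    sample_weight = check_array(sample_weight, ensure_2d=False)
    check_consistent_length(X, sample_weight)
    return sample_weight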

@jakevdp
Member

jakevdp commented Feb 18, 2016

I'm not certain a separate function just for sample_weights makes sense. I was thinking something along the lines of

X, y, sample_weights = check_X_y_sample_weights(X, y, sample_weights)

But now that I write this out, it does look quite ugly.

The other thing I thought of is that we could use the object() default trick in your original approach, e.g.

NOT_SPECIFIED = object()  # module-level sentinel; no user-supplied value can be identical to it

def check_X_y(X, y, sample_weights=NOT_SPECIFIED):
    # validate X and y as check_X_y already does
    if sample_weights is NOT_SPECIFIED:
        # caller did not pass sample_weights: keep the existing two-value return
        return X, y
    else:
        # validate sample_weights (1-d, same length as X)
        return X, y, sample_weights
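
In an estimator's fit, usage would then look something like this (hypothetical sketch):

# estimators that don't take weights are unaffected
X, y = check_X_y(X, y)

# estimators that do take weights get all three back, validated
X, y, sample_weights = check_X_y(X, y, sample_weights=sample_weights)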

@kaichogami
Contributor Author

@jakevdp Is there any specific reason why we are using object() and not NOT_SPECIFIED = True? Is it related to your earlier comment where you said using False would be problematic?
Also, in the above case, I think check_array will fail when sample_weights is None. We can avoid this by checking whether sample_weight is not None at the call site, before the function is called. Is this what should be done (i.e. not handling the None case automatically)?

@jakevdp
Member

jakevdp commented Feb 18, 2016

We want a sentinel value that the user could never accidentally pass. If we use True, False, None, or something like that, I can imagine situations in which the user might pass any of these to an estimator (e.g. clf.fit(X, y, sample_weights=True)) at which point they'd get a confusing error about sizes of iterables not matching because check_X_y would only return X, y rather than X, y, sample_weights.

Doing a sample_weights is NOT_SPECIFIED identity check means there's virtually no possible way for the user to accidentally pass the sentinel value.
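
A small illustration of why the identity check is safe (sketch; assumes the NOT_SPECIFIED sentinel from the snippet above):

NOT_SPECIFIED = object()  # fresh, module-private instance

# No value a caller could realistically construct compares identical to it,
# so only a genuinely omitted argument takes the two-value return path.
for value in (True, False, None, 0, 1.0, [], object()):
    assert value is not NOT_SPECIFIED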

@kaichogami
Contributor Author

@jakevdp Thanks for explaining. I now understand what you meant in the previous comment.
I have updated my original approach: instead of the default value being False, it now uses the NOT_SPECIFIED = object() trick you mentioned above.
Please let me know what you think, and then we can begin introducing this function in other parts of sklearn.

@amueller
Member

amueller commented Oct 7, 2016

This looks like a good fix. Could you please rebase and add a test? There should be a test in estimator_checks.py to make sure that this works whenever an estimator takes sample_weights.
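
Such a common test could look roughly like this (a hypothetical sketch with a made-up check name; it only illustrates the intent):

import numpy as np
import pandas as pd
from sklearn.base import clone

def check_sample_weights_pandas_series(name, estimator_orig):
    # Hypothetical common check: fitting with a pandas Series as sample_weight
    # should work and give the same predictions as fitting with a plain ndarray.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([0, 1, 0, 1])
    weights = np.array([1.0, 2.0, 1.0, 2.0])

    est_array = clone(estimator_orig).fit(X, y, sample_weight=weights)
    est_series = clone(estimator_orig).fit(X, y, sample_weight=pd.Series(weights))

    np.testing.assert_allclose(est_array.predict(X), est_series.predict(X))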

@amueller amueller added this to the 0.19 milestone Oct 7, 2016
@amueller amueller added the Bug label Oct 7, 2016
@jnothman
Member

jnothman commented Oct 8, 2016

I think #5642 should be preferred as more general.

@amueller
Member

amueller commented Oct 8, 2016

@jnothman I agree. Sorry, I'm way less on top of things than you ;)

@kaichogami
Contributor Author

Alright. Should I close this PR in favor of #5642?


@amueller
Member

amueller commented Oct 8, 2016

Usually we wait until one is merged, but whatever you like.

@jnothman
Member

Fixed by #7825

@jnothman jnothman closed this Dec 20, 2016