
Fix for passing a pandas Series as sample_weights in RidgeCV leading to an error (#5606) #6307


Closed
wants to merge 3 commits

Conversation

kaichogami
Contributor

Currently there seem to be no tests checking pandas compatibility, and as @jnothman mentioned in the issue, tests covering the various compatibility cases probably need to be added.
@jakevdp @amueller @jnothman reviews please, and let me know what should be done.

@jakevdp
Member

jakevdp commented Feb 8, 2016

This will work for this particular case, but it only fixes one of the many cases where this needs to be done. I think a more comprehensive fix would be this:

  • Add a validation utility that accepts sample_weight: e.g. check_X_y_sample_weight(), or perhaps add an optional sample_weight parameter to check_X_y (this should handle the None case automatically).
  • Use this function throughout sklearn wherever sample_weight needs to be validated.
  • Add a new test in sklearn.utils.estimator_checks to make sure sample weights are correctly handled in any estimator that accepts them.

I realize that's a much bigger job than just fixing Ridge, but the piecemeal approach is bound to miss some things.
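
A rough sketch of what such a helper could look like (hypothetical name and details, not an existing sklearn API; it only illustrates the idea):

import numpy as np
from sklearn.utils.validation import check_X_y, check_array, check_consistent_length

def check_X_y_sample_weight(X, y, sample_weight=None, **check_params):
    # Hypothetical helper: validate X and y as usual, then sample_weight if given.
    X, y = check_X_y(X, y, **check_params)
    if sample_weight is not None:
        # Coerce to a 1-d float array (this also handles a pandas Series)
        # and make sure there is one weight per sample.
        sample_weight = check_array(sample_weight, ensure_2d=False, dtype=np.float64)
        if sample_weight.ndim != 1:
            raise ValueError("sample_weight must be 1-dimensional")
        check_consistent_length(X, sample_weight)
    return X, y, sample_weight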

@MechCoder
Member

As far as common tests for sample_weight are concerned, there is some work going on here: #5515.

@kaichogami
Contributor Author

@jakevdp That sounds like a good idea. I would like to work on it, although it will take quite some time before I produce anything substantial, as I still have a lot to learn about the code base and sample_weights. I will let you know once I come up with something.
@MechCoder Thanks for sharing that.

Used the validation change in linear_model/ridge
@kaichogami kaichogami reopened this Feb 13, 2016
@kaichogami
Contributor Author

@jakevdp I have added validation for sample_weight by adding an optional parameter to the existing check_X_y function. I have set the default value to False so that:

  • We don't have to change check_X_y calls everywhere to expect three return values in places where sample_weights do not exist.
  • It checks the None case automatically.

If you think the current structure is fine, I can go ahead and use it in other estimators. Also, do you think we need to check for anything other than check_consistent_length and ensuring it is 1-d? Please let me know what you think.
Thank you!
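
Roughly, the shape of the change is this (a simplified sketch of the idea, not the exact code in this PR):

from sklearn.utils.validation import check_array, check_consistent_length

def check_X_y(X, y, sample_weight=False):
    # (the existing X/y validation would run here; elided for brevity)
    if sample_weight is False:
        return X, y              # default: callers keep the old two-value return
    if sample_weight is None:
        return X, y, None        # None is accepted and passed through unchanged
    sample_weight = check_array(sample_weight, ensure_2d=False)
    check_consistent_length(X, sample_weight)
    return X, y, sample_weight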

@jakevdp
Member

jakevdp commented Feb 17, 2016

Hmm... I'm not certain about this. The default value of False seems a bit problematic.

Maybe it would be cleaner to have a separate function for the case of checking X, y, and sample_weights?

@kaichogami
Contributor Author

@jakevdp I agree that sample_weight=False looks disorganized. I have now added a new function, check_sample_weight, which performs the same checks as before, although it does not automatically handle the None case. Please let me know what you think of it now.
I have run into merge conflicts, which I will resolve later in the evening.
Thank you!
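
For reference, the new helper is along these lines (a simplified sketch; the details in the PR may differ):

from sklearn.utils.validation import check_array, check_consistent_length

def check_sample_weight(X, sample_weight):
    # Sketch: coerce sample_weight to a 1-d array and check that it has
    # one entry per sample of X. None is not handled here; the caller has
    # to check for it before calling.
    sample_weight = check_array(sample_weight, ensure_2d=False)
    check_consistent_length(X, sample_weight)
    return sample_weight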

@jakevdp
Member

jakevdp commented Feb 18, 2016

I'm not certain a separate function just for sample_weights makes sense. I was thinking something along the lines of

X, y, sample_weights = check_X_y_sample_weights(X, y, sample_weights)

But now that I write this out, it does look quite ugly.

The other thing I thought of is that we could use the object() default trick in your original approach, e.g.

NOT_SPECIFIED = object()  # module-level sentinel; no user-supplied value can be identical to it

def check_X_y(X, y, sample_weights=NOT_SPECIFIED):
    # validate X and y as check_X_y already does
    if sample_weights is NOT_SPECIFIED:
        # caller did not pass sample_weights: keep the existing two-value return
        return X, y
    else:
        # validate sample_weights (1-d, same length as X)
        return X, y, sample_weights
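
In an estimator's fit, usage would then look something like this (hypothetical sketch):

# estimators that don't take weights are unaffected
X, y = check_X_y(X, y)

# estimators that do take weights get all three back, validated
X, y, sample_weights = check_X_y(X, y, sample_weights=sample_weights)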

@kaichogami
Contributor Author

@jakevdp Is there any specific reason why we are using object() and not NOT_SPECIFIED = True? Is it related to your earlier comment where you said using False would be problematic?
Also, in the above case, I think check_array will fail when sample_weights is None. We can avoid this by checking whether sample_weight is not None at the call site, before the function is called. Is this what should be done (i.e. not handling the None case automatically)?

@jakevdp
Member

jakevdp commented Feb 18, 2016

We want a sentinel value that the user could never accidentally pass. If we use True, False, None, or something like that, I can imagine situations in which the user might pass any of these to an estimator (e.g. clf.fit(X, y, sample_weights=True)) at which point they'd get a confusing error about sizes of iterables not matching because check_X_y would only return X, y rather than X, y, sample_weights.

Doing a sample_weights is NOT_SPECIFIED identity check means there's virtually no possible way for the user to accidentally pass the sentinel value.
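
A small illustration of why the identity check is safe (sketch; assumes the NOT_SPECIFIED sentinel from the snippet above):

NOT_SPECIFIED = object()  # fresh, module-private instance

# No value a caller could realistically construct compares identical to it,
# so only a genuinely omitted argument takes the two-value return path.
for value in (True, False, None, 0, 1.0, [], object()):
    assert value is not NOT_SPECIFIED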

@kaichogami
Contributor Author

@jakevdp Thanks for explaining. I now understand what you meant in the previous comment.
I have updated my original approach: instead of the default value being False, it now uses the NOT_SPECIFIED = object() trick you mentioned above.
Please let me know what you think, and then we can begin introducing this function in other parts of sklearn.

@amueller
Member

amueller commented Oct 7, 2016

This looks like a good fix. Could you please rebase and add a test? There should be a test in estimator_checks.py to make sure that this works whenever an estimator takes sample_weights.
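
Such a common test could look roughly like this (a hypothetical sketch with a made-up check name; it only illustrates the intent):

import numpy as np
import pandas as pd
from sklearn.base import clone

def check_sample_weights_pandas_series(name, estimator_orig):
    # Hypothetical common check: fitting with a pandas Series as sample_weight
    # should work and give the same predictions as fitting with a plain ndarray.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([0, 1, 0, 1])
    weights = np.array([1.0, 2.0, 1.0, 2.0])

    est_array = clone(estimator_orig).fit(X, y, sample_weight=weights)
    est_series = clone(estimator_orig).fit(X, y, sample_weight=pd.Series(weights))

    np.testing.assert_allclose(est_array.predict(X), est_series.predict(X))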

@amueller amueller added this to the 0.19 milestone Oct 7, 2016
@amueller amueller added the Bug label Oct 7, 2016
@jnothman
Member

jnothman commented Oct 8, 2016

I think #5642 should be preferred as more general.

@amueller
Member

amueller commented Oct 8, 2016

@jnothman I agree. Sorry, I'm way less on top of things than you ;)

@kaichogami
Contributor Author

Alright. Should I close this PR in favor of #5642?


@amueller
Member

amueller commented Oct 8, 2016

Usually we wait until one is merged, but whatever you like.

@jnothman
Member

Fixed by #7825

@jnothman jnothman closed this Dec 20, 2016