Skip to content

[MRG + 1] Fix for OvR partial_fit bug #7786

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 5, 2016

Conversation

srivatsan-ramesh
Copy link
Contributor

Reference Issue

PR #6239 and issue #6203

What does this implement/fix? Explain your changes.

partial_fit() of OvR was not working properly when the mini-batches did not contain all the classes.
The LabelBinarizer.fit() should be called with classes_ parameter and not the y parameter of partial_fit()
And added tests to check the partial_fit() function.

Any other comments?

The PR #6239 did not seem to be the correct fix.

@srivatsan-ramesh srivatsan-ramesh changed the title Fix for OvR partial_fit bug [MRG] Fix for OvR partial_fit bug Oct 29, 2016
@amueller amueller added this to the 0.19 milestone Nov 1, 2016
@amueller amueller added the Bug label Nov 1, 2016
@amueller amueller modified the milestones: 0.18.1, 0.19 Nov 2, 2016
@amueller
Copy link
Member

amueller commented Nov 2, 2016

LGTM, thanks :)

@amueller amueller changed the title [MRG] Fix for OvR partial_fit bug [MRG + 1] Fix for OvR partial_fit bug Nov 2, 2016
# A new class value which was not in the first call of partial_fit
# It should raise ValueError
y1 = [5] + y[7:-1]
assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use assert_raises_regex to check the error message as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lesteve Done!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but then you can drop this line.

@raghavrv
Copy link
Member

raghavrv commented Nov 5, 2016

Travis seems to fail... Could you fix that?

@srivatsan-ramesh
Copy link
Contributor Author

@raghavrv It says HTTPError : 409 Client Error What do I do?

@raghavrv
Copy link
Member

raghavrv commented Nov 5, 2016

I've not looked into it, but I've restarted the build for you... Let's see if it is some spurious error.

@srivatsan-ramesh
Copy link
Contributor Author

Ok 👍

@srivatsan-ramesh
Copy link
Contributor Author

@raghavrv Thank you, Travis ran successfully.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

self.label_binarizer_ = LabelBinarizer(sparse_output=True)
self.label_binarizer_.fit(self.classes_)

if not set(self.classes_).issuperset(y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for large y, iteration may be much slower than using np.setdiff1d..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes np.setdiff1d seems to be faster

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

due to the use of np.setdiff1d it is not possible to partial_fit with a sparse y

assert_equal(np.mean(pred == y), np.mean(pred1 == y))

# Test when mini batches doesn't have all classes
# with SGDClassifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reader of the test, it's not clear why you would want to test with both these base estimators.

Also: can we do it in a loop for clarity that they're the same test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous PR #6239 , the author of the PR was able to create a fix which worked with MultinomialNB but not with SGDClassifier. That's why I added SGDClassifier also. But i think only testing with SGDClassifier is enough, @jnothman ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, happy for it to just be SGDClassifier

# A new class value which was not in the first call of partial_fit
# It should raise ValueError
y1 = [5] + y[7:-1]
assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but then you can drop this line.

@jnothman jnothman merged commit 9fd70a8 into scikit-learn:master Nov 5, 2016
@jnothman
Copy link
Member

jnothman commented Nov 5, 2016

Thanks, @srivatsan-ramesh

@jnothman
Copy link
Member

jnothman commented Nov 5, 2016

Sorry, I forgot to have you write an entry in doc/whats_new.rst Could you please submit a quick PR describing the fix?

@srivatsan-ramesh srivatsan-ramesh deleted the dev branch November 6, 2016 06:34
srivatsan-ramesh added a commit to srivatsan-ramesh/scikit-learn that referenced this pull request Nov 6, 2016
@srivatsan-ramesh
Copy link
Contributor Author

@jnothman Created a PR!

jnothman pushed a commit that referenced this pull request Nov 6, 2016
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
@elcombato
Copy link

Due to np.setdiff1d in this line of partial_fit, it is not possible to pass a sparse matrix for y.

Both the commit ed08d38 by @srivatsan-ramesh and the reviewed version by @jnothman 5fd925a throw an error with a sparse y.

Or is there an alternative, if I use partial_fit to perform multilabel classification and pass a sparse matrix of y which is binarized with MultiLabelBinarizer?

if y is sparse check with:

if np.setdiff1d(y.indices, self.classes_)

@lesteve
Copy link
Member

lesteve commented Nov 15, 2016

Hmmm the docstring does say that sparse matrices are accepted for y. I am not sure what the scikit-learn convention is in general about whether a sparse y should be accepted in fit and similar functions.

I guess it is not really worth using a sparse matrix for y in general. A work-around is to do y = y.todense() before calling partial_fit.

@jnothman
Copy link
Member

sparse y may be used to represent multilabel problems. It could be accepted
wherever multilabel inputs are, in theory.

On 15 November 2016 at 21:42, Loïc Estève notifications@github.com wrote:

Hmmm the docstring does say that sparse matrices are accepted for y. I am
not sure what the scikit-learn convention is in general about whether a
sparse y should be accepted in fit and similar functions.

I guess it is not really worth using a sparse matrix for y in general. A
work-around is to do y = y.todense() before calling partial_fit.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7786 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz66v7s3jkkIhDLEKLSTkjoaDXpj6_ks5q-YyAgaJpZM4KkLcl
.

@lesteve
Copy link
Member

lesteve commented Nov 15, 2016

sparse y may be used to represent multilabel problems. It could be accepted wherever multilabel inputs are, in theory.

I see, it does look like we are missing tests for this then ... it would be good to do something similar to #7590 for multilabel problems, i.e. test that sparse and dense y are giving results that are close to each other.

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants