-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MRG added classes
parameter to LabelBinarizer
#1643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think you need a couple of tests :) Please don't forget to test the exceptions, including the actual message value for the one that displays the class difference. |
Definitely will do ;) |
Done :) |
Renamed to MRG |
# will be ignored but a warning will be shown | ||
lb = LabelBinarizer(classes=[1, 2]) | ||
with warnings.catch_warnings(record=True) as w: | ||
transformed = lb.fit_transform([0, 1, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for testing this :)
Looks good to me. Could you please rebase on top of master to fix travis. +1 for merging when travis is green. |
seems like I broke multilabel stuff.. no idea how. will fix now. |
Ok, fixed. cc @mblondel ;) |
This is not fully backward compatible now :-/ there was an attribute |
You should specify what type of representation it will lead if you specify the What is the use case of the |
Misunderstood the doc. Don't take this comment into account. |
I'm not sure I understand the multilabel question. It is for multi-label input with prespecified classes. |
I think that you need to add a constructor argument for the |
|
Should be good now :) |
Ok, travis is happy now.... |
I don't think you treat the case Instead of booleans, it may be better to use strings? For instance: Otherwise, it looks good ! |
There is a test here. Do you think it is sufficient? |
Sorry, but I messed when writing the preceeding sentence. I mean
From my point of view, the string format in the attribute should reduce the number of nested Note that it would be cool to support sparse binary indicator matrix which would bring many more sparse format (csc, csr, ...). Honestly, do as you think it's the best. |
|
||
multilabel : bool or None (default) | ||
Whether or not data will be multilabel. | ||
If None, it will be inferred during fitting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you just infer it on transform? I wanted to match the current behavior when fit
is called. Currently it is enforced that the data in transform is of the same kind (multilable or not) then during fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think inferring it in transform would be enough. But I don't really see why users would want to fit multiclass data and transform multilabel data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then I would also change the behavior when using fit
to have it unified. I'm not sure who wrote this code in the first place. I guess either you or @larsmans? I also don't see much of a reason for the current behavior actually...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum... so what do I do if someone calls inverse_transform
before calling either fit
or transform
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I'm giving up.
Don't give up !! this pr is very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I'm giving up.
I gave my opinion. Now as always in scikit-learn, the majority votes. If
more people agree with you, I will respect the majority's opinion.
As far as I see it, there are 2 different issues. Whether to use
constructor or method parameter, and whether to use one string parameter or
two boolean parameters.
I think at least we agree on the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Tue, Feb 12, 2013 at 5:56 PM, Andreas Mueller
notifications@github.comwrote:
My issue is more that I don't know what to do with the input validation in
transform if I do a method parameter.
If the string takes the "auto" value by default, doesn't the same problem
exist for the constructor parameter too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it is implemented currently is that if fit
is not called, there is a fall-back to multiclass. I.e. if you don't specify multi-label but pass a multi-label y
to transform, an error is raised.
I replaced the two boolean arguments by a single string argument. I didn't adjust the tests yet. I did not replace the boolean attributes. Do you think I should do that, too? Replacing the I just also tried replacing the properties with strings and that led to one more level of @arjoly @mblondel which part did you expect to get simpler with string arguments? |
hm maybe I can fix it somehow... |
Ok, I think it is actually nicer now with strings :) |
@arjoly I introduced a new function |
Ok so I didn't deprecate the property |
This adds another property to the interface... |
Value with which positive labels must be encoded. | ||
|
||
classes : ndarray if int or None (default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this line.
Can you add a test to check that:
Can you also
|
Is there any other class except It is probably best if we do respect the ordering of the labels, as long as that is not incompatible with code in other places. |
The current code does a |
if label_type == "multilabel-indicator": | ||
# nothing to do as y is already a label indicator matrix | ||
return y | ||
elif label_type == "multilabel-list" or len(self.classes_) > 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the case len(self.classes_) > 2
used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because for two classes, the output is 1d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
Any news on this pr? |
sorry I'm swamped. I actually forgot the state of this. I'm trying to catch up currently. |
If I understand correctly, out-of- |
Also, I think parallel functionality should (as a rule) be available in |
classes : ndarray of int or None (default) | ||
Array of possible classes. | ||
|
||
label_type : string, default="auto" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this target_type
?
Adds a
classes
parameter toLabelBinarizer
. I wanted that to addpartial_fit
to the naive Bayes models.If
classes
is specified, there is no need to callfit
.This PR still needs tests.