
MRG added classes parameter to LabelBinarizer #1643


Closed
wants to merge 17 commits

Conversation

amueller
Member

Adds a classes parameter to LabelBinarizer. I wanted that in order to add partial_fit to the naive Bayes models.
If classes is specified, there is no need to call fit.
This PR still needs tests.
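
For context, a minimal sketch of how the proposed parameter would be used. The classes constructor argument shown here is the one added by this PR (it is not part of the released LabelBinarizer API), and the output shown in the comments is illustrative:

from sklearn.preprocessing import LabelBinarizer

# With the proposed `classes` parameter the binarizer knows the full label
# set up front, so transform() can be used without a prior call to fit().
lb = LabelBinarizer(classes=[0, 1, 2])
Y = lb.transform([2, 0])
# Expected (illustrative): one indicator column per entry of `classes`
# [[0 0 1]
#  [1 0 0]]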

@ogrisel
Member

ogrisel commented Jan 31, 2013

I think you need a couple of tests :) Please don't forget to test the exceptions, including the actual message value for the one that displays the class difference.

@amueller
Member Author

Definitely will do ;)

@amueller
Member Author

amueller commented Feb 2, 2013

Done :)

@amueller
Member Author

amueller commented Feb 2, 2013

Renamed to MRG

# will be ignored but a warning will be shown
lb = LabelBinarizer(classes=[1, 2])
with warnings.catch_warnings(record=True) as w:
    transformed = lb.fit_transform([0, 1, 2])
Member

Thanks for testing this :)

@ogrisel
Member

ogrisel commented Feb 3, 2013

Looks good to me. Could you please rebase on top of master to fix Travis? +1 for merging when Travis is green.

@amueller
Member Author

amueller commented Feb 3, 2013

Seems like I broke the multilabel stuff... no idea how. Will fix now.

@amueller
Member Author

amueller commented Feb 3, 2013

Ok, fixed. cc @mblondel ;)

@amueller
Member Author

amueller commented Feb 3, 2013

This is not fully backward compatible now :-/ there was an attribute multilabel that should have been called multilabel_, so now the semantics of LabelBinarizer.multilabel have changed :-(

@arjoly
Member

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

What is the use case of the multilabel argument?

@arjoly
Member

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

Misunderstood the doc. Don't take this comment into account.

@amueller
Member Author

amueller commented Feb 4, 2013

I'm not sure I understand the multilabel question. It is for multi-label input with prespecified classes.

@arjoly
Member

arjoly commented Feb 5, 2013

I think that you need to add a constructor argument for the self.indicator_matrix_ attribute.

@amueller
Member Author

amueller commented Feb 5, 2013

@arjoly there is no such attribute in master, right? That is only in your branch afaik.
Never mind. I'll add it and also a test.

@amueller
Member Author

amueller commented Feb 8, 2013

Should be good now :)

@amueller
Member Author

amueller commented Feb 8, 2013

Ok, travis is happy now....

@arjoly
Member

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Instead of booleans, it may be better to use strings? For instance: "multiclass", "indicator_matrix" and "list_tuple_labels"?

Otherwise, it looks good !

@amueller
Member Author

amueller commented Feb 9, 2013

There is a test here. Do you think it is sufficient?
We could do a string argument. Would you then also make the fitted attributes a string?

@arjoly
Member

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Sorry, I messed up when writing the preceding sentence. I meant multilabel=False and indicator_matrix=True. (You say it is not a multilabel format, but impose a multilabel format.)

Would you then also make the fitted attributes a string?

From my point of view, the string format in the attribute should reduce the number of nested ifs. However, with 3 supported formats, this might not be necessary.

Note that it would be cool to support sparse binary indicator matrices, which would bring many more sparse formats (CSC, CSR, ...).

Honestly, do whatever you think is best.


multilabel : bool or None (default)
    Whether or not data will be multilabel.
    If None, it will be inferred during fitting.
Member

Why do you need this?

Member Author

Would you just infer it on transform? I wanted to match the current behavior when fit is called. Currently it is enforced that the data in transform is of the same kind (multilabel or not) as during fit.

Member

Yes, I think inferring it in transform would be enough. But I don't really see why users would want to fit multiclass data and transform multilabel data.

Member Author

Ok, then I would also change the behavior when using fit to have it unified. I'm not sure who wrote this code in the first place. I guess either you or @larsmans? I also don't see much of a reason for the current behavior actually...

Member Author

Hum... so what do I do if someone calls inverse_transform before calling either fit or transform?

Member

Ok, I think I'm giving up.

Don't give up!! This PR is very useful.

Member

Ok, I think I'm giving up.

I gave my opinion. Now, as always in scikit-learn, the majority votes. If more people agree with you, I will respect the majority's opinion.

As far as I see it, there are 2 different issues: whether to use a constructor or a method parameter, and whether to use one string parameter or two boolean parameters. I think at least we agree on the latter.

Member Author

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

Member

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

If the string takes the "auto" value by default, doesn't the same problem exist for the constructor parameter too?

Member Author

The way it is implemented currently is that if fit is not called, there is a fall-back to multiclass. I.e. if you don't specify multi-label but pass a multi-label y to transform, an error is raised.

@amueller
Member Author

I replaced the two boolean arguments with a single string argument. I haven't adjusted the tests yet. I did not replace the boolean attributes. Do you think I should do that, too?

Replacing the __init__ parameter did not result in getting rid of any nested ifs and I don't know why it should.
The code definitely got more convoluted now :-/

I just also tried replacing the properties with strings and that led to one more level of ifs and a lot of label_type in ["..", ".."].

@arjoly @mblondel which part did you expect to get simpler with string arguments?
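
For readers following the thread, a rough sketch of the kind of single-string dispatch being discussed. The label_type strings are the ones that appear in this PR's review comments; the helper below is hypothetical, not the PR's actual code:

import numpy as np

def _binarize_by_label_type(y, classes, label_type):
    # Hypothetical dispatcher on a single string parameter; not the PR's code.
    column = {c: i for i, c in enumerate(classes)}
    if label_type == "multilabel-indicator":
        return np.asarray(y)                    # already an indicator matrix
    out = np.zeros((len(y), len(classes)), dtype=int)
    if label_type == "multilabel-list":
        for i, labels in enumerate(y):          # y is a list of label lists/tuples
            for label in labels:
                out[i, column[label]] = 1
    else:                                       # "multiclass"
        for i, label in enumerate(y):
            out[i, column[label]] = 1
    return out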

@amueller
Member Author

hm maybe I can fix it somehow...

@amueller
Member Author

Ok, I think it is actually nicer now with strings :)

@amueller
Member Author

@arjoly I introduced a new function _get_label_type that should probably also move to the new utils module. I guess it depends on which PR gets merged first.
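
Since _get_label_type is only mentioned by name here, a hedged sketch of what such a helper might look like; the PR's actual implementation may differ:

import numpy as np

def _get_label_type(y):
    # Hypothetical label-type detector returning the strings used above;
    # not the PR's actual code.
    if isinstance(y, np.ndarray) and y.ndim == 2:
        return "multilabel-indicator"    # binary indicator matrix
    if len(y) > 0 and isinstance(y[0], (list, tuple, set)):
        return "multilabel-list"         # sequence of label collections
    return "multiclass"                  # flat sequence of single labels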

@amueller
Member Author

Ok so I didn't deprecate the property multilabel_ as it is used in the multiclass module.

@amueller
Member Author

This adds another property to the interface...

@amueller
Member Author

Travis should be happy now. Are the other devs, too? ping @arjoly @ogrisel (@mblondel said he is busy).

Value with which positive labels must be encoded.

classes : ndarray if int or None (default)
Member

I don't understand this line.

@arjoly
Member

arjoly commented Feb 21, 2013

Can you add a test to check that:

  • the order of labels in the classes argument is properly taken into account (LabelBinarizer(classes=[1, 2, 3]) is not equal to LabelBinarizer(classes=[3, 1, 2]))?
  • redundant labels in the classes argument are handled in some way (LabelBinarizer(classes=[1, 2, 3, 3]))?
  • a call to the transform method with redundant labels and labels in a shuffled order works properly for the list-of-lists-of-labels format?
  • _get_label_type works? Can you also add some documentation? What should happen when y is not properly formatted?

Can you also

  • update the narrative doc and add one or more examples to illustrate the new features?
  • update whats_new.rst?

@amueller
Member Author

amueller commented Mar 2, 2013

Is there any other class besides SGDClassifier and related ones that has a classes parameter somewhere?
I just checked: SGDClassifier's partial_fit will break if the entries in classes are not sorted or not unique, if I understand the code correctly.

It is probably best if we do respect the ordering of the labels, as long as that is not incompatible with code in other places.

@amueller
Member Author

amueller commented Mar 2, 2013

The current code does a unique on the labels, so redundant and shuffled classes will all lead to the same result.
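
A quick illustration of why that happens, using np.unique:

import numpy as np

# np.unique sorts and deduplicates, so [3, 1, 2], [1, 2, 3] and [1, 2, 3, 3]
# all produce the same class array.
print(np.unique([3, 1, 2]))      # [1 2 3]
print(np.unique([1, 2, 3, 3]))   # [1 2 3]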

if label_type == "multilabel-indicator":
    # nothing to do as y is already a label indicator matrix
    return y
elif label_type == "multilabel-list" or len(self.classes_) > 2:
Member

Why is the case len(self.classes_) > 2 used here?

Member Author

Because for two classes, the output is 1d.

Member

Thanks !

@arjoly
Member

arjoly commented May 2, 2013

Any news on this pr?

@amueller
Member Author

amueller commented May 6, 2013

Sorry, I'm swamped. I actually forgot the state of this. I'm currently trying to catch up.

@jnothman
Member

If I understand correctly, out-of-classes labels will be ignored with a warning at fit time, but will throw an error at transform time. Does this make sense? (Why?) If so, should this be documented?

@jnothman
Member

Also, I think parallel functionality should (as a rule) be available in LabelEncoder and LabelBinarizer. Indeed, LabelBinarizer should use a LabelEncoder, after which its job is simple.
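
A minimal sketch of the composition suggested here, assuming a dense output; this is illustrative, not scikit-learn's actual implementation:

import numpy as np
from sklearn.preprocessing import LabelEncoder

def binarize_via_encoder(y):
    # Encode labels to column indices with LabelEncoder, then build the
    # indicator matrix from those indices.
    le = LabelEncoder()
    idx = le.fit_transform(y)            # one column index per sample
    out = np.zeros((len(y), len(le.classes_)), dtype=int)
    out[np.arange(len(y)), idx] = 1
    return out, le.classes_

Y, classes = binarize_via_encoder(["spam", "ham", "spam", "eggs"])
# classes -> ['eggs' 'ham' 'spam']; Y has one indicator column per class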

classes : ndarray of int or None (default)
    Array of possible classes.

label_type : string, default="auto"
Member

Can we call this target_type?
