
MRG added classes parameter to LabelBinarizer #1643


Closed
wants to merge 17 commits

Conversation

amueller
Member

Adds a classes parameter to LabelBinarizer. I wanted that in order to add partial_fit to the naive Bayes models.
If classes is specified, there is no need to call fit.
This PR still needs tests.
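
For context, a minimal sketch of how the proposed parameter would be used. The classes constructor argument shown here is the one added by this PR (it is not part of the released LabelBinarizer API), and the output shown in the comments is illustrative:

from sklearn.preprocessing import LabelBinarizer

# With the proposed `classes` parameter the binarizer knows the full label
# set up front, so transform() can be used without a prior call to fit().
lb = LabelBinarizer(classes=[0, 1, 2])
Y = lb.transform([2, 0])
# Expected (illustrative): one indicator column per entry of `classes`
# [[0 0 1]
#  [1 0 0]]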

@ogrisel
Member

ogrisel commented Jan 31, 2013

I think you need a couple of tests :) Please don't forget to test the exceptions, including the actual message value for the one that displays the class difference.

@amueller
Member Author

Definitely will do ;)

@amueller
Member Author

amueller commented Feb 2, 2013

Done :)

@amueller
Member Author

amueller commented Feb 2, 2013

Renamed to MRG

# will be ignored but a warning will be shown
lb = LabelBinarizer(classes=[1, 2])
with warnings.catch_warnings(record=True) as w:
    transformed = lb.fit_transform([0, 1, 2])
Member

Thanks for testing this :)

@ogrisel
Member

ogrisel commented Feb 3, 2013

Looks good to me. Could you please rebase on top of master to fix Travis? +1 for merging when Travis is green.

@amueller
Member Author

amueller commented Feb 3, 2013

Seems like I broke the multilabel stuff... no idea how. Will fix now.

@amueller
Member Author

amueller commented Feb 3, 2013

Ok, fixed. cc @mblondel ;)

@amueller
Member Author

amueller commented Feb 3, 2013

This is not fully backward compatible now :-/ there was an attribute multilabel that should have been called multilabel_, so now the semantics of LabelBinarizer.multilabel have changed :-(

@arjoly
Member

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

What is the use case of the multilabel argument?

@arjoly
Member

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

Misunderstood the doc. Don't take this comment into account.

@amueller
Member Author

amueller commented Feb 4, 2013

I'm not sure I understand the multilabel question. It is for multi-label input with prespecified classes.

@arjoly
Member

arjoly commented Feb 5, 2013

I think that you need to add a constructor argument for the self.indicator_matrix_ attribute.

@amueller
Member Author

amueller commented Feb 5, 2013

@arjoly there is no such attribute in master, right? That is only in your branch afaik.
Never mind. I'll add it and also a test.

@amueller
Member Author

amueller commented Feb 8, 2013

Should be good now :)

@amueller
Member Author

amueller commented Feb 8, 2013

Ok, travis is happy now....

@arjoly
Member

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Instead of booleans, it may be better to use strings? For instance: "multiclass", "indicator_matrix" and "list_tuple_labels"?

Otherwise, it looks good !

@amueller
Member Author

amueller commented Feb 9, 2013

There is a test here. Do you think it is sufficient?
We could do a string argument. Would you then also make the fitted attributes a string?

@arjoly
Member

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Sorry, I messed up when writing the preceding sentence. I meant multilabel=False and indicator_matrix=True. (You say it is not a multilabel format, but impose a multilabel format.)

Would you then also make the fitted attributes a string?

From my point of view, the string format in the attribute should reduce the number of nested ifs. However, with 3 supported formats, this might not be necessary.

Note that it would be cool to support sparse binary indicator matrices, which would bring many more sparse formats (CSC, CSR, ...).

Honestly, do whatever you think is best.


multilabel : bool or None (default)
    Whether or not data will be multilabel.
    If None, it will be inferred during fitting.
Member

Why do you need this?

Member Author

Would you just infer it on transform? I wanted to match the current behavior when fit is called. Currently it is enforced that the data in transform is of the same kind (multilabel or not) as during fit.

Member

Yes, I think inferring it in transform would be enough. But I don't really see why users would want to fit multiclass data and transform multilabel data.

Member Author

Ok, then I would also change the behavior when using fit to have it unified. I'm not sure who wrote this code in the first place. I guess either you or @larsmans? I also don't see much of a reason for the current behavior actually...

Member Author

Hum... so what do I do if someone calls inverse_transform before calling either fit or transform?

Member

Ok, I think I'm giving up.

Don't give up!! This PR is very useful.

Member

Ok, I think I'm giving up.

I gave my opinion. Now, as always in scikit-learn, the majority votes. If more people agree with you, I will respect the majority's opinion.

As far as I see it, there are 2 different issues: whether to use a constructor or a method parameter, and whether to use one string parameter or two boolean parameters. I think at least we agree on the latter.

Member Author

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

Member

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

If the string takes the "auto" value by default, doesn't the same problem exist for the constructor parameter too?

Member Author

The way it is implemented currently is that if fit is not called, there is a fall-back to multiclass. I.e. if you don't specify multi-label but pass a multi-label y to transform, an error is raised.

@amueller
Member Author

I replaced the two boolean arguments with a single string argument. I haven't adjusted the tests yet. I did not replace the boolean attributes. Do you think I should do that, too?

Replacing the __init__ parameter did not result in getting rid of any nested ifs and I don't know why it should.
The code definitely got more convoluted now :-/

I just also tried replacing the properties with strings and that led to one more level of ifs and a lot of label_type in ["..", ".."].

@arjoly @mblondel which part did you expect to get simpler with string arguments?
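
For readers following the thread, a rough sketch of the kind of single-string dispatch being discussed. The label_type strings are the ones that appear in this PR's review comments; the helper below is hypothetical, not the PR's actual code:

import numpy as np

def _binarize_by_label_type(y, classes, label_type):
    # Hypothetical dispatcher on a single string parameter; not the PR's code.
    column = {c: i for i, c in enumerate(classes)}
    if label_type == "multilabel-indicator":
        return np.asarray(y)                    # already an indicator matrix
    out = np.zeros((len(y), len(classes)), dtype=int)
    if label_type == "multilabel-list":
        for i, labels in enumerate(y):          # y is a list of label lists/tuples
            for label in labels:
                out[i, column[label]] = 1
    else:                                       # "multiclass"
        for i, label in enumerate(y):
            out[i, column[label]] = 1
    return out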

@amueller
Member Author

hm maybe I can fix it somehow...

@amueller
Member Author

Ok, I think it is actually nicer now with strings :)

@amueller
Member Author

@arjoly I introduced a new function _get_label_type that should probably also move to the new utils module. I guess it depends on which PR gets merged first.
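
Since _get_label_type is only mentioned by name here, a hedged sketch of what such a helper might look like; the PR's actual implementation may differ:

import numpy as np

def _get_label_type(y):
    # Hypothetical label-type detector returning the strings used above;
    # not the PR's actual code.
    if isinstance(y, np.ndarray) and y.ndim == 2:
        return "multilabel-indicator"    # binary indicator matrix
    if len(y) > 0 and isinstance(y[0], (list, tuple, set)):
        return "multilabel-list"         # sequence of label collections
    return "multiclass"                  # flat sequence of single labels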

@amueller
Member Author

Ok so I didn't deprecate the property multilabel_ as it is used in the multiclass module.

@amueller
Member Author

This adds another property to the interface...

@amueller
Member Author

Travis should be happy now. Are the other devs, too? ping @arjoly @ogrisel (@mblondel said he is busy).

Value with which positive labels must be encoded.

classes : ndarray if int or None (default)
Member

I don't understand this line.

@arjoly
Member

arjoly commented Feb 21, 2013

Can you add a test to check that:

  • the order of labels in the classes argument is properly taken into account (LabelBinarizer(classes=[1, 2, 3]) is not equal to LabelBinarizer(classes=[3, 1, 2]))?
  • redundant labels in the classes argument are handled in some way (LabelBinarizer(classes=[1, 2, 3, 3]))?
  • a call to the transform method with redundant labels and labels in a shuffled order works properly for the list-of-lists-of-labels format?
  • _get_label_type works? Can you also add some documentation? What should happen when y is not properly formatted?

Can you also

  • update the narrative doc and add one or more examples to illustrate the new features?
  • update whats_new.rst?

@amueller
Member Author

amueller commented Mar 2, 2013

Is there any other class besides SGDClassifier and related ones that has a classes parameter somewhere?
I just checked: SGDClassifier's partial_fit will break if the entries in classes are not sorted or not unique, if I understand the code correctly.

It is probably best if we do respect the ordering of the labels, as long as that is not incompatible with code in other places.

@amueller
Member Author

amueller commented Mar 2, 2013

The current code does a unique on the labels, so redundant and shuffled classes will all lead to the same result.
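
A quick illustration of why that happens, using np.unique:

import numpy as np

# np.unique sorts and deduplicates, so [3, 1, 2], [1, 2, 3] and [1, 2, 3, 3]
# all produce the same class array.
print(np.unique([3, 1, 2]))      # [1 2 3]
print(np.unique([1, 2, 3, 3]))   # [1 2 3]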

if label_type == "multilabel-indicator":
    # nothing to do as y is already a label indicator matrix
    return y
elif label_type == "multilabel-list" or len(self.classes_) > 2:
Member

Why is the case len(self.classes_) > 2 used here?

Member Author

Because for two classes, the output is 1d.

Member

Thanks !

@arjoly
Member

arjoly commented May 2, 2013

Any news on this pr?

@amueller
Member Author

amueller commented May 6, 2013

Sorry, I'm swamped. I actually forgot the state of this. I'm currently trying to catch up.

@jnothman
Member

If I understand correctly, out-of-classes labels will be ignored with a warning at fit time, but will throw an error at transform time. Does this make sense? (Why?) If so, should this be documented?

@jnothman
Member

Also, I think parallel functionality should (as a rule) be available in LabelEncoder and LabelBinarizer. Indeed, LabelBinarizer should use a LabelEncoder, after which its job is simple.
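
A minimal sketch of the composition suggested here, assuming a dense output; this is illustrative, not scikit-learn's actual implementation:

import numpy as np
from sklearn.preprocessing import LabelEncoder

def binarize_via_encoder(y):
    # Encode labels to column indices with LabelEncoder, then build the
    # indicator matrix from those indices.
    le = LabelEncoder()
    idx = le.fit_transform(y)            # one column index per sample
    out = np.zeros((len(y), len(le.classes_)), dtype=int)
    out[np.arange(len(y)), idx] = 1
    return out, le.classes_

Y, classes = binarize_via_encoder(["spam", "ham", "spam", "eggs"])
# classes -> ['eggs' 'ham' 'spam']; Y has one indicator column per class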

classes : ndarray of int or None (default)
    Array of possible classes.

label_type : string, default="auto"
Member

Can we call this target_type?
