
[MRG] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range #7327


Closed
wants to merge 20 commits

Conversation

vighneshbirodkar
Contributor

@vighneshbirodkar commented Sep 1, 2016

  • Added the "error-strict" option to error on any unknown values
  • Make underlying label encoders private
  • Add attributes to map input features to output features and vice-versa.
  • Decide how to handle non-integer arrays for range checking

for i in range(n_features):
    le = self.label_encoders_[i]

    self._max_values[i] = np.max(X[:, i])
Contributor Author

@vighneshbirodkar Sep 1, 2016


@jnothman @amueller
I am casting X to np.object above because, if the user passes strings in the input, the casting (to int) would fail. If the user passes handle_unknown='error-strict' and then passes a string input, this would lead to weird results.

edit: Any suggestions on how to handle this?

Member


We shouldn't generally cast to object. Is there not a way to use check_array to allow either integer or object?
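
For context, a minimal sketch of what check_array already allows here (illustrative, not code from the PR): passing dtype=None validates the input while preserving its dtype, so integer input stays integer and string input stays object.

    import numpy as np
    from sklearn.utils import check_array

    # dtype=None asks check_array to validate without converting the dtype.
    X_int = check_array(np.array([[0, 1], [2, 3]]), dtype=None)
    X_str = check_array(np.array([['a', 'b'], ['c', 'd']], dtype=object), dtype=None)
    print(X_int.dtype, X_str.dtype)  # e.g. int64 object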

Contributor Author


I can try casting to int and then cast to object if that fails. But string values like "123" are going to be successfully cast to int. Are we OK with that?
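
A minimal sketch of the fallback described above, with the "123" caveat made explicit; the helper name is hypothetical and not from the PR.

    import numpy as np

    def _coerce_categories(X):
        # Try an integer array first; fall back to object if the cast fails.
        try:
            return np.asarray(X, dtype=int)
        except (ValueError, TypeError):
            return np.asarray(X, dtype=object)

    _coerce_categories([['a', 'b']]).dtype    # object: the int cast fails
    _coerce_categories([['123', '4']]).dtype  # int: the ambiguous "123" case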

@vighneshbirodkar
Contributor Author

I have added two new attributes: one_hot_feature_index_ and feature_index_range_. Please refer to the documentation for more details.

@vighneshbirodkar changed the title [WIP] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range [MRG] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range Sep 2, 2016
@vighneshbirodkar
Contributor Author

@amueller @jnothman @GaelVaroquaux
I think this is ready for review. I won't be able to spend time on this on Sunday, but I can resume my work on Monday

- array : ``n_values[i]`` is the number of categorical values in
  ``X[:, i]``. Each feature value should be in ``range(n_values[i])``.

values : 'auto', 'seen', int, list of ints, or list of lists of objects
Member


'seen'?

@amueller
Member

amueller commented Sep 6, 2016

Just letting you know that I'm very angry at past @amueller for this mess.

@vighneshbirodkar
Contributor Author

@jnothman Could you take another look at this?

    le.fit(np.arange(self.values, dtype=np.int))
elif isinstance(self.values, list):
    if len(self.values) != X.shape[1]:
        raise ValueError("Shape mismatch: if n_values is a list,"
Member


"n_values" -> "values"

@jnothman
Member

jnothman commented Sep 8, 2016

@vighneshbirodkar, I think this is a large and intricate PR and I'm now disinclined to have it in for 0.18. I think your code can be simpler in general, and I think we would benefit from leaving this in as elegant a state as possible.

We need to be careful to test behaviour when values are out of range and out of the specified set, both at fit time and at transform time. I've admittedly not looked at your tests, or those already present, yet, but I thought this was worth highlighting. We should also ensure that the behaviour meets reasonable use cases: should we support the case where a user wants some values of a feature left unencoded?

In short, thank you for continuing to work on this, but I think you should expect to work on it and receive a more complete review after the release. (I hope that's okay @amueller)
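
A rough sketch (not taken from the PR's test suite) of the out-of-set cases described above, assuming the values and handle_unknown='error-strict' parameters shown in the diff excerpts are available on this branch:

    import numpy as np
    import pytest
    from sklearn.preprocessing import OneHotEncoder

    def test_error_strict_rejects_unknown_at_transform():
        # Categories are declared explicitly; 3 is neither declared nor seen.
        enc = OneHotEncoder(values=[[0, 1, 2]], handle_unknown='error-strict')
        enc.fit(np.array([[0], [1], [2]]))
        with pytest.raises(ValueError):
            enc.transform(np.array([[3]]))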

@amueller
Member

amueller commented Sep 8, 2016

@jnothman sad but fair enough ;) Let's go for a quick 0.19 then ;)

@vighneshbirodkar
Contributor Author

@amueller @jnothman
I am assigning values and n_values to the internal attribute _values to remove the redundant checking as @jnothman pointed out.
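
A hypothetical sketch of that folding (not the PR's actual code): whichever of the public values / n_values parameters was supplied ends up in a single private attribute, so later code only consults self._values.

    def _resolve_values(values, n_values):
        # Prefer the newer `values` spelling; fall back to the older
        # `n_values` if that is what the user passed. The PR stores the
        # result on self._values during fit.
        if n_values is not None:
            return n_values
        return values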

@vighneshbirodkar
Contributor Author

@jnothman The existing implementation also contains an np.max call. What could be a possible workaround?

@jnothman
Member

if X has a numeric dtype, use max?

@vighneshbirodkar
Contributor Author

@jnothman, you mean use the numpy array's max method?

@jnothman
Member

Perhaps I've misunderstood: what's the problem with np.max? (sorry, it's been a while)

@vighneshbirodkar
Contributor Author

The problem is that we won't support sparse input if we use max
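
For reference, a minimal sketch of one way to keep sparse support (illustrative, not code from the PR): scipy sparse matrices expose a .max() method even though np.max cannot handle them.

    import numpy as np
    import scipy.sparse as sp

    def _column_max(X):
        # Per-column maxima for dense arrays or scipy sparse matrices.
        if sp.issparse(X):
            return np.asarray(X.max(axis=0).toarray()).ravel()
        return np.max(X, axis=0)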

@jnothman
Member

jnothman commented Dec 29, 2016 via email

@Arturus

Arturus commented Apr 19, 2017

Do we have a chance to get string support in 0.19?

@stephen-hoover
Contributor

What's left to do here? Is there anything I can do to help this PR? I also would like to see string support. :-)

@jnothman
Member

@vighneshbirodkar would you like someone else to take this over? As a likely reviewer, I'd really appreciate it if the code could be made more readable, though I appreciate that it's become a surprisingly complicated transformer.

@stephen-hoover
Contributor

I spent some time looking at the new OneHotEncoder this morning. It will be great to have this available!

I saw a couple of ways that the code could be tweaked to be more memory efficient. It looks like it's possible to avoid the need for any copies in either the fit or transform step. (Except for the final output of the transform, of course.)

I'm also interested in being able to accommodate pandas.DataFrame objects without needing to copy them to numpy object arrays. I think it's possible to do this by putting

if hasattr(X, 'iloc'):  # duck-typed check for a pandas DataFrame
    X = X.iloc

in a couple of places. You'd also need to bypass the check_array call if the input is a DataFrame, perhaps also with a hasattr(X, 'iloc') check. Would that be reasonable? It seems like a DataFrame would be a common way to get string inputs, and it'd be nice to avoid copying the whole thing before inspecting it in fit or creating the output arrays in transform.

@jnothman
Member

jnothman commented Apr 22, 2017 via email

@vighneshbirodkar
Contributor Author

vighneshbirodkar commented Apr 23, 2017 via email

@jnothman
Member

jnothman commented Apr 23, 2017 via email
