
[MRG] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range #7327


Closed
wants to merge 20 commits

Conversation

vighneshbirodkar
Contributor

@vighneshbirodkar commented Sep 1, 2016

  • Added the "error-strict" option to error on any unknown values
  • Make underlying label encoders private
  • Add attributes to map input features to output features and vice-versa.
  • Decide how to handle non-integer arrays for range checking

for i in range(n_features):
    le = self.label_encoders_[i]

    self._max_values[i] = np.max(X[:, i])
Contributor Author

@vighneshbirodkar Sep 1, 2016


@jnothman @amueller
I am casting X to np.object above because, if the user passes strings in the input, the casting (to int) would fail. If the user passes handle_unknown='error-strict' and then passes a string input, this would lead to weird results.

edit: Any suggestions on how to handle this?

Member


We shouldn't generally cast to object. Is there not a way to use check_array to allow either integer or object?
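
For context, a minimal sketch of what check_array already allows here (illustrative, not code from the PR): passing dtype=None validates the input while preserving its dtype, so integer input stays integer and string input stays object.

    import numpy as np
    from sklearn.utils import check_array

    # dtype=None asks check_array to validate without converting the dtype.
    X_int = check_array(np.array([[0, 1], [2, 3]]), dtype=None)
    X_str = check_array(np.array([['a', 'b'], ['c', 'd']], dtype=object), dtype=None)
    print(X_int.dtype, X_str.dtype)  # e.g. int64 object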

Contributor Author


I can try casting to int and then cast to object if that fails. But string values like "123" are going to be successfully cast to int. Are we OK with that?
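
A minimal sketch of the fallback described above, with the "123" caveat made explicit; the helper name is hypothetical and not from the PR.

    import numpy as np

    def _coerce_categories(X):
        # Try an integer array first; fall back to object if the cast fails.
        try:
            return np.asarray(X, dtype=int)
        except (ValueError, TypeError):
            return np.asarray(X, dtype=object)

    _coerce_categories([['a', 'b']]).dtype    # object: the int cast fails
    _coerce_categories([['123', '4']]).dtype  # int: the ambiguous "123" case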

@vighneshbirodkar
Contributor Author

I have added two new attributes: one_hot_feature_index_ and feature_index_range_. Please refer to the documentation for more details.

@vighneshbirodkar changed the title [WIP] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range [MRG] Added the "error-strict" option to OneHotEncoder and start deprecating unknown values in range Sep 2, 2016
@vighneshbirodkar
Contributor Author

@amueller @jnothman @GaelVaroquaux
I think this is ready for review. I won't be able to spend time on this on Sunday, but I can resume my work on Monday

- array : ``n_values[i]`` is the number of categorical values in
  ``X[:, i]``. Each feature value should be in ``range(n_values[i])``.

values : 'auto', 'seen', int, list of ints, or list of lists of objects
Member


'seen'?

@amueller
Member

amueller commented Sep 6, 2016

Just letting you know that I'm very angry at past @amueller for this mess.

@vighneshbirodkar
Contributor Author

@jnothman Could you take another look at this?

    le.fit(np.arange(self.values, dtype=np.int))
elif isinstance(self.values, list):
    if len(self.values) != X.shape[1]:
        raise ValueError("Shape mismatch: if n_values is a list,"
Member


"n_values" -> "values"

@jnothman
Member

jnothman commented Sep 8, 2016

@vighneshbirodkar, I think this is a large and intricate PR and I'm now disinclined to have it in for 0.18. I think your code can be simpler in general, and I think we would benefit from leaving this in as elegant a state as possible.

We need to be careful to test behaviour when values are out of range and out of the specified set, both at fit time and at transform time. I've admittedly not looked at your tests, or those already present, yet, but I thought this was worth highlighting. We should also ensure that the behaviour meets reasonable use cases: should we support the case where a user wants some values of a feature left unencoded?

In short, thank you for continuing to work on this, but I think you should expect to work on it and receive a more complete review after the release. (I hope that's okay @amueller)
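
A rough sketch (not taken from the PR's test suite) of the out-of-set cases described above, assuming the values and handle_unknown='error-strict' parameters shown in the diff excerpts are available on this branch:

    import numpy as np
    import pytest
    from sklearn.preprocessing import OneHotEncoder

    def test_error_strict_rejects_unknown_at_transform():
        # Categories are declared explicitly; 3 is neither declared nor seen.
        enc = OneHotEncoder(values=[[0, 1, 2]], handle_unknown='error-strict')
        enc.fit(np.array([[0], [1], [2]]))
        with pytest.raises(ValueError):
            enc.transform(np.array([[3]]))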

@amueller
Member

amueller commented Sep 8, 2016

@jnothman sad but fair enough ;) Let's go for a quick 0.19 then ;)

@vighneshbirodkar
Contributor Author

@amueller @jnothman
I am assigning values and n_values to the internal attribute _values to remove the redundant checking as @jnothman pointed out.
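
A hypothetical sketch of that folding (not the PR's actual code): whichever of the public values / n_values parameters was supplied ends up in a single private attribute, so later code only consults self._values.

    def _resolve_values(values, n_values):
        # Prefer the newer `values` spelling; fall back to the older
        # `n_values` if that is what the user passed. The PR stores the
        # result on self._values during fit.
        if n_values is not None:
            return n_values
        return values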

@vighneshbirodkar
Contributor Author

@jnothman The existing implementation also contains an np.max call. What could be a possible workaround?

@jnothman
Member

if X has a numeric dtype, use max?

@vighneshbirodkar
Contributor Author

@jnothman, you mean use the numpy array's max method?

@jnothman
Member

Perhaps I've misunderstood: what's the problem with np.max? (sorry, it's been a while)

@vighneshbirodkar
Contributor Author

The problem is that we won't support sparse input if we use max
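
For reference, a minimal sketch of one way to keep sparse support (illustrative, not code from the PR): scipy sparse matrices expose a .max() method even though np.max cannot handle them.

    import numpy as np
    import scipy.sparse as sp

    def _column_max(X):
        # Per-column maxima for dense arrays or scipy sparse matrices.
        if sp.issparse(X):
            return np.asarray(X.max(axis=0).toarray()).ravel()
        return np.max(X, axis=0)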

@jnothman
Member

jnothman commented Dec 29, 2016 via email

@Arturus

Arturus commented Apr 19, 2017

Do we have a chance to get string support in 0.19?

@stephen-hoover
Contributor

What's left to do here? Is there anything I can do to help this PR? I also would like to see string support. :-)

@jnothman
Member

@vighneshbirodkar would you like someone else to take this over? As a likely reviewer, I'd really appreciate it if the code could be made more readable, though I appreciate that it's become a surprisingly complicated transformer.

@stephen-hoover
Contributor

I spent some time looking at the new OneHotEncoder this morning. It will be great to have this available!

I saw a couple of ways that the code could be tweaked to be more memory efficient. It looks like it's possible to avoid the need for any copies in either the fit or transform step. (Except for the final output of the transform, of course.)

I'm also interested in being able to accommodate pandas.DataFrame objects without needing to copy them to numpy object arrays. I think it's possible to do this by putting

if hasattr(X, 'iloc'):  # duck-typed check for a pandas DataFrame
    X = X.iloc

in a couple of places. You'd also need to bypass the check_array call if the input is a DataFrame, perhaps also with a hasattr(X, 'iloc') check. Would that be reasonable? It seems like a DataFrame would be a common way to get string inputs, and it'd be nice to avoid copying the whole thing before inspecting it in fit or creating the output arrays in transform.

@jnothman
Member

jnothman commented Apr 22, 2017 via email

@vighneshbirodkar
Contributor Author

vighneshbirodkar commented Apr 23, 2017 via email

@jnothman
Member

jnothman commented Apr 23, 2017 via email
