[MRG] Added the "error-strict"
option to OneHotEncoder and start deprecating unknown values in range
#7327
Conversation
vighneshbirodkar commented Sep 1, 2016 (edited)
- Added '"error-strict"` option to error on any unknown values
- Make underlying label encoders private
- Add attributes to map input features to output features and vice-versa.
- Decide how to handle non-integer arrays for range checking
```python
for i in range(n_features):
    le = self.label_encoders_[i]
    # ...
    self._max_values[i] = np.max(X[:, i])
```
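For context, a minimal sketch (not the PR's actual code) of how a per-feature maximum recorded at fit time, as in the snippet above, could be used to range-check new data at transform time:

```python
import numpy as np

def check_fitted_range(X, max_values):
    """Hypothetical helper: raise if any column of X contains a value larger
    than the maximum recorded for that column during fit (max_values[i] is
    assumed to hold np.max(X_fit[:, i]))."""
    X = np.asarray(X)
    for i in range(X.shape[1]):
        if np.any(X[:, i] > max_values[i]):
            raise ValueError("Column %d contains values larger than the "
                             "maximum seen during fit (%r)"
                             % (i, max_values[i]))
```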
We shouldn't generally cast to object. Is there not a way to use check_array
to allow either integer or object?
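For reference, check_array can take a list of dtypes: if the input's dtype is already in the list it is kept, otherwise the array is converted to the first entry. A minimal sketch with illustrative toy inputs of how integer and object data could both be allowed without a blanket cast:

```python
import numpy as np
from sklearn.utils import check_array

# Integer input keeps its integer dtype; object (e.g. string) input stays object.
X_int = check_array(np.array([[0, 1], [2, 3]]), dtype=[np.int64, object])
X_obj = check_array(np.array([['a', 'b'], ['c', 'd']], dtype=object),
                    dtype=[np.int64, object])
```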
I can try casting to int and then cast to an object if it fails. But string values like "123" are going to be successfully cast to int. Are we OK with that?
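A minimal sketch of the casting strategy described here (a hypothetical helper, not the PR's code), including the caveat that strings such as "123" slip through:

```python
import numpy as np

def coerce_column(values):
    """Try to interpret a column as integers; fall back to dtype=object on failure.
    Caveat from the discussion: string values like "123" are happily converted
    to int by numpy, so they are not caught by this approach."""
    try:
        return np.asarray(values, dtype=int)
    except (ValueError, TypeError):
        return np.asarray(values, dtype=object)
```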
I have added 2 new attributes: …
"error-strict"
option to OneHotEncoder and start deprecating unknown values in range"error-strict"
option to OneHotEncoder and start deprecating unknown values in range
@amueller @jnothman @GaelVaroquaux
```
- array : ``n_values[i]`` is the number of categorical values in
  ``X[:, i]``. Each feature value should be
  in ``range(n_values[i])``

values : 'auto', 'seen', int, list of ints, or list of lists of objects
```
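For orientation, the forms the proposed `values` parameter was meant to accept, restated as plain Python (a sketch of this PR's in-progress API, which was never released in this form):

```python
values = 'auto'                 # infer each column's admissible values at fit time
values = 'seen'                 # presumably only values seen during fit (questioned below)
values = 5                      # every column takes integer values in range(5)
values = [3, 2, 4]              # column i takes integer values in range(values[i])
values = [['a', 'b', 'c'],      # column i may take exactly the listed
          ['x', 'y']]           # (possibly non-integer) values
```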
'seen'?
Just letting you know that I'm very angry at past @amueller for this mess.
@jnothman Could you take another look at this?
```python
    le.fit(np.arange(self.values, dtype=np.int))
elif isinstance(self.values, list):
    if len(self.values) != X.shape[1]:
        raise ValueError("Shape mismatch: if n_values is a list,"
```
"n_values" -> "values"
@vighneshbirodkar, I think this is a large and intricate PR and I'm now disinclined to have it in for 0.18. I think your code can be simpler in general, and I think we would benefit from leaving this in as elegant a state as possible. We need to be careful to test behaviour when values are out of range and out of the specified set; at fit time, and at transform time. I've admittedly not looked at your tests, or those already present, yet, but I thought this is worth highlighting. We should also ensure that that behaviour meets reasonable use-cases: should we support the case where a user wants some values of a feature left unencoded? In short, thank you for continuing to work on this, but I think you should expect to work on it and receive a more complete review after the release. (I hope that's okay @amueller)
@jnothman sad but fair enough ;) Let's go for a quick 0.19 then ;)
@jnothman The existing implementation also contains an …
if X has a numeric dtype, use max?
@jnothman, you mean use the numpy array's max method?
Perhaps I've misunderstood: what's the problem with …
The problem is that we won't support sparse input if we use max.
See sklearn.utils.fixes.sparse_min_max
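For reference, a minimal sketch of a per-column maximum that also handles sparse input (the idea behind the sparse_min_max shim; recent scipy versions expose max(axis=...) on CSR/CSC directly, and sklearn.utils.sparsefuncs.min_max_axis is an alternative):

```python
import numpy as np
import scipy.sparse as sp

def column_max(X):
    """Per-column maximum for dense arrays and sparse matrices alike.
    On scipy versions that support it, sparse .max(axis=0) returns a sparse
    (1, n_features) matrix, which is densified here."""
    if sp.issparse(X):
        return X.max(axis=0).toarray().ravel()
    return np.max(X, axis=0)
```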
Do we have a chance to get string support in 0.19?
What's left to do here? Is there anything I can do to help this PR? I also would like to see string support. :-)
@vighneshbirodkar would you like someone else to take this over? As a likely reviewer, I'd really appreciate it if the code could be made more readable, though I appreciate that it's become a surprisingly complicated transformer.
I spent some time looking at the new OneHotEncoder this morning. It will be great to have this available!

I saw a couple of ways that the code could be tweaked to be more memory efficient. It looks like it's possible to avoid the need for any copies in either the fit or transform step. (Except for the final output of the transform, of course.)

I'm also interested in being able to accommodate pandas.DataFrame objects without needing to copy them to numpy object arrays. I think it's possible to do this by putting

```python
if hasattr(X, 'iloc'):
    X = X.iloc
```

in a couple of places. You'd also need to bypass the check_array call if the input is a DataFrame, perhaps also with a hasattr(X, 'iloc') check. Would that be reasonable? It seems like a DataFrame would be a common way to get string inputs, and it'd be nice to avoid copying the whole thing before inspecting it in fit or creating the output arrays in transform.
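A minimal sketch of the DataFrame-friendly bypass suggested here (a hypothetical helper; where it would actually plug into the PR is an assumption):

```python
from sklearn.utils import check_array

def validate_input(X):
    """Use positional .iloc indexing for pandas objects so that columns are not
    copied into a numpy object array; run check_array on everything else."""
    if hasattr(X, 'iloc'):
        # pandas DataFrame/Series: skip check_array to avoid a full copy
        return X.iloc
    return check_array(X, dtype=None)
```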
I'd be interested in a dataframe-friendly version of this, but wouldn't make it first priority. First get the API and tests right, and implementation legible, then make it go beyond.
Hey
I won't mind if someone else takes over.
Thanks.
Thanks for your work, Vighnesh.