Skip to content

[MRG] Added CategoricalEncoder class deprecating OneHotEncoder #6559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

vighneshbirodkar
Copy link
Contributor

Following the discussion with @amueller and @MechCoder

New Features in CategoricalEncoder

  • Support for strings, negative integers ( and anything that can be put in an object array )
  • Specify discrete values using the classes parameter instead of n_values.
  • Uses a LabelEncoder instance for each column.

Changes

  • Changes _transform_selected to _apply_selected giving it the ability to optionally not return transformed values.
  • _apply_selected can no longer accept lists. It has to be given a np.array object. This is done because the input can be of np.object type and it cannot be always cast as a whole to np.int or np.float type. The transformed and non-transformed parts of the array are converted to the specified type before returning.

@vighneshbirodkar
Copy link
Contributor Author

@MechCoder @amueller
np.in1d is not supported on older Numpy versions because merge sort was not supported for all data types prior to version 1.6. Should I write a custom implementation on in1d or raise a ValueError on object arrays when a lower numpy version is installed ?

@vighneshbirodkar
Copy link
Contributor Author

@ogrisel @jnothman Any suggestions ?

@vighneshbirodkar
Copy link
Contributor Author

All the tests pass now. I would be glad if someone could review this.

@raghavrv
Copy link
Member

Apologies if this comment is not helpful. Isn't the name OneHotEncoder pretty famous in the data domain. And it would be good to retain that? (Maybe in addition to the newly introduced CategoricalEncoder?)

@amueller
Copy link
Member

I don't think it's "famous". I think most people don't know what it means. People from R are very confused that you need to do anything to work with categorical data.

@vighneshbirodkar
Copy link
Contributor Author

@rvraghav93 Can you do a review of the code ?

sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.

handle_unknown : str, 'error' or 'ignore'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default?

@jnothman
Copy link
Member

Somewhat along the lines of @rvraghav93 (and why reviewing the whole thing might not be so interesting until this is resolved): It seems as if this is functionally a superset of the current OneHotEncoder. It is unclear to me what benefit lies in the name change. It's great that the interface and mechanism will be simplified, but I don't see why we should confuse or upset existing OneHotEncoder (and especially OneHotEncoder('auto')) users for the sake of a changing data structure.

@vighneshbirodkar
Copy link
Contributor Author

@jnothman Because the attributes of OneHotEncoder like n_values_ and active_features_ don't hold any meaning for this implementation and it would be unnecessarily complex to support them exactly how they behave now.

@jnothman
Copy link
Member

The point is that the vast majority of users of OneHotEncoder will not actually access those attributes, so your "unnecessarily complex" is actually just a few lines of code in a property getter that will disappear after a deprecation period. Deprecating the whole class, on the other hand, makes life hard for many users. I don't think it's hard for us as the developers to accept that burden rather than passing one onto a large set of user-code maintainers.

@vighneshbirodkar
Copy link
Contributor Author

@jnothman What will those property getters return ?

@jnothman
Copy link
Member

@jnothman What will those property getters return ?

In the cases where the new functionality is identical to the former, they will provide n_values_ and active_features_ as formerly expected.

@vighneshbirodkar
Copy link
Contributor Author

Closing in favor of #6602

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants