[MRG + 1] ENH: new CategoricalEncoder class #9151
Merged
37 commits
70d8165 Added CategoricalEncoder class - deprecating OneHotEncoder (vighneshbirodkar)
bea23a5 First round of updates (jorisvandenbossche)
fda6d27 fix + test specifying of categories (jorisvandenbossche)
5f2b403 further clean-up + tests (jorisvandenbossche)
e175e4c fix skipping pandas test (jorisvandenbossche)
dfaa9c0 Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
4f64648 feedback andy (jorisvandenbossche)
01c3bd4 add encoding keyword to support ordinal encoding (jorisvandenbossche)
dcef19c Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
2ed91e8 remove y from transform signature (jorisvandenbossche)
a589dd9 Remove sparse keyword in favor of encoding='onehot-dense' (jorisvandenbossche)
17e5e69 Let encoding='ordinal' follow dtype keyword (jorisvandenbossche)
47a88dd add categories_ attribute (jorisvandenbossche)
7b5b476 expand docs on ordinal + feedback (jorisvandenbossche)
5f26bdc Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
3dcc07f feedback Andy (jorisvandenbossche)
5f5934f add whatsnew note (jorisvandenbossche)
c6a5d30 for now raise on unsorted passed categories (jorisvandenbossche)
ad5fdc7 Implement inverse_transform (jorisvandenbossche)
eb2f4b8 fix example to have sorted categories (jorisvandenbossche)
ce82c28 backport scipy sparse argmax (jorisvandenbossche)
64aeff5 check handle_unknown before computation in fit (jorisvandenbossche)
4f8efcf Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
a1c0982 make scipy backport private (jorisvandenbossche)
85cf315 Directly construct CSR matrix (jorisvandenbossche)
b40bd8e try to preserve original dtype if resulting dtype is not string (jorisvandenbossche)
2d9b4dd Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
a31bb2a Remove copying of data, only copy when needed in transform + add test (jorisvandenbossche)
2ef5fb9 add test for input dtypes / categories_ dtypes (jorisvandenbossche)
937446e doc updates based on feedback (jorisvandenbossche)
a83102c fix docstring example for python 2 (jorisvandenbossche)
fbe9ea7 Merge remote-tracking branch 'upstream/master' into pr/6559 (jorisvandenbossche)
21d9c0c add checking of shape of X in inverse_transform (jorisvandenbossche)
929362f loopify dtype tests (jorisvandenbossche)
a6d55d1 reword example on unknown categories (jorisvandenbossche)
9aeeb6d clarify docs (jorisvandenbossche)
c39aa0c remove repeated one (jorisvandenbossche)
Really? If a category doesn't exist in training, how can it produce a zero column?? I thought this was the description for when specifying categories.
OK, I agree this can be confusing, but you can read it both ways: also in the case of `handle_unknown='ignore'` you end up with all zeros, just fewer zeros than with the manually specified categories. Note that this is also how it is explained in the class docstring's explanation of `handle_unknown`.
On second thought: IMO it is correct as I stated it, and I don't think it can be read both ways: with `handle_unknown='ignore'`, if you encounter a category that didn't exist in training, you get all zeros (so e.g. [0, 0, 0]). So for me the above explanation is correct. It is not that easy to clearly word the idea of "a zero for each dummy column for that specific feature in that row", so if you have a better wording than "the resulting one-hot encoded columns for this feature will be all zeros", it is always welcome.
Or if the above is not clear, can you try to clarify your confusion with the text?
You're right. I misread. Thanks for the explanation.
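For readers following the thread, the behaviour under discussion can be illustrated with a small pure-Python sketch. This is not the code from this PR: `fit_categories` and `one_hot` are hypothetical helpers written only for illustration, and they assume a single feature whose categories are learned from the training data and kept in sorted order.

```python
def fit_categories(column):
    """Return the sorted unique values seen in one training feature."""
    return sorted(set(column))


def one_hot(value, categories, handle_unknown="error"):
    """One-hot encode a single value against the learned categories.

    Hypothetical helper: with handle_unknown="ignore", a value that was
    not seen during training yields a zero in every dummy column of this
    feature, and no extra column is created.
    """
    if value not in categories:
        if handle_unknown == "ignore":
            return [0] * len(categories)
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Three categories are seen in training: 'cat', 'dog', 'fish'.
categories = fit_categories(["cat", "dog", "fish", "dog"])

known = one_hot("dog", categories, handle_unknown="ignore")
unknown = one_hot("bird", categories, handle_unknown="ignore")

print(known)    # [0, 1, 0]
print(unknown)  # [0, 0, 0]
```

The unknown value `'bird'` does not add a column; the row simply gets a zero in each of the three existing dummy columns for that feature, which is the reading the thread settles on.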