[MRG] Use hashtable (python sets/dicts) for object dtype data in Encoders #10209
Conversation
This pull request introduces 1 alert (view on lgtm.com). Comment posted by lgtm.com.
Throw us a quick benchmark for strings, ints, few categories, many categories? Also, a comment on the unsorted case (not that I'm sure we want that in LabelEncoder, really...)?
There are quite extensive benchmarks in the notebook I linked to in the issue: http://nbviewer.jupyter.org/gist/jorisvandenbossche/f399d43499785534f018a4f0c16d24dd Two summary plots (averaged over different numbers of categories; more plots are available in the notebook, but the trend is more or less the same): [figure: factorizing step] [figure: encoding step]
I updated the figures above a bit, as they turned out not to be fully correct for the transform step (the LabelEncoder does more than just encoding with numpy, which is what I wanted to measure; it also checks for unseen categories. That is something I did not time for the other methods, so I added a separate numpy variant that skips that check as well). The conclusion could also be that the numpy method is quite OK, except for object dtype. So just special-casing object dtype with a set/dict-based implementation might be enough (and that one can then be used in the CategoricalEncoder when the categories are not sorted).
Why do we need the table if we don't have the unsorted case?
If we only do numpy & python, we would use numpy for strings? So this wouldn't really impact CategoricalEncoder in most cases, right?
Because it is much faster for object dtype, and we want to support the unsorted case in CategoricalEncoder.
Does unsorted mean they are in a different order, or that they are not sortable?
Yes, I suppose most people with strings will actually have object dtype instead of numpy string dtype (= all people using pandas to store strings).
For the CategoricalEncoder, I think ideally we want to allow user-specified categories that do not need to be sorted (currently sortedness is imposed and an error is raised if that is not the case). I think normally they will be sortable though, so it is mainly about being in a different order.
On the other hand, users providing object arrays with strings will very often be doing that through Pandas. But yes, I agree with you that we need the Python implementation anyway. My only remaining hesitation about not using pandas is that this encoding is done all over the place (i.e. in every classifier), even repeatedly encoding the same data, and we may be wasting seconds of predict time by not adopting Pandas. Btw, you may be catching a Pandas fast path for your int tests by making the set of classes 0..n-1 rather than something gappy or starting above/below 0. It means that the transform is a no-op. |
Then we need a what's new entry, and the factorize seems not to work with read-only memmaps.
def test_factorize_encode_utils(engine, values, expected):
    # test that all different encoders are equivalent

    if engine == 'numpy':
You can parametrize this part as well and mark one of the solutions as skipif with pytest, can't you?
sklearn/preprocessing/label.py
Outdated
y = column_or_1d(y, warn=True)
self.classes_, y = np.unique(y, return_inverse=True)
return y
# def fit_transform(self, y):
clean up
sklearn/preprocessing/label.py
Outdated
return _PANDAS_INSTALLED


# def _encode_numpy(values, uniques=None, encode=True):
clean up
I think I'd like to see all LabelEncoder tests run on both number and object dtypes...
Is the intention of putting `_encode` directly into `CategoricalEncoder`, rather than `LabelEncoder`, in order to allow for unordered mappings?
sklearn/preprocessing/label.py
Outdated
Returns
-------
uniques
    If decode=False
decode isn't a thing.
I think I have now parametrized all LabelEncoder tests that are useful to parametrize for both int and object.
Yes, see the second-to-last commit I added.
Let's merge the CategoricalEncoder rewrite first then. Can you go fix that 0.19dev to 0.20dev? I'm on the move.
I added an extra function to check unknown values with numpy/python for non-object/object dtypes. I will probably have to parametrize some CategoricalEncoder tests for this to ensure both paths are properly tested, but will only do that when the CategoricalEncoder rewrite is merged (will update that now).
cc @ogrisel This should be ready enough now to review (I still need to check whether I need to parametrize more OneHotEncoder tests to ensure they use both numerical and object dtype data, but will only get to that tomorrow).
Couple of comments. Otherwise it looks good.
sklearn/preprocessing/label.py
Outdated
Returns
-------
uniques
    If encode=False
Backticks and full stop.
sklearn/preprocessing/label.py
Outdated
uniques
    If encode=False
(uniques, encoded)
    If encode=True
Backticks and full stop.
sklearn/preprocessing/label.py
Outdated
uniques : array, optional
    If passed, uniques are not determined from passed values (this
    can be because the user specified categories, or because they
    already have been determined in fit)
full stop.
sklearn/preprocessing/label.py
Outdated
    can be because the user specified categories, or because they
    already have been determined in fit)
encode : bool, default False
    If True, also encode the values into integer codes based on `uniques`
Full stop.
sklearn/preprocessing/label.py
Outdated
-------
diff : list
    The unique values present in `values` and not in `uniques` (the
    unknown values).If encode=False
Missing space after full stop.
and backticks
np.array([[10, 1, 55], [5, 2, 55]]),
np.array([['b', 'A', 'cat'], ['a', 'B', 'cat']], dtype=object)
], ids=['mixed', 'numeric', 'object'])
def test_one_hot_encoder(X):
    X = [['abc', 1, 55], ['def', 2, 55]]
Uhm. I think that this should be removed.
# when specifying categories manually, unknown categories should already
# raise when fitting
enc = OneHotEncoder(categories=cats)
assert_raises(ValueError, enc.fit, X2)
pytest.raises
?
X = np.array([[1, 2]]).T
enc = OneHotEncoder(categories=[[2, 1, 3]])
msg = re.escape('Unsorted categories are not supported')
assert_raises_regex(ValueError, msg, enc.fit_transform, X)
pytest.raises
we could also match the error message.
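As suggested, `pytest.raises` can check the message via its `match` parameter (a regex searched against the exception text, hence the `re.escape`). A minimal illustration with a hypothetical stand-in for the encoder call:

```python
import re
import pytest

def fit_transform_stub(X):
    # Hypothetical stand-in for enc.fit_transform on unsorted categories.
    raise ValueError('Unsorted categories are not supported')

# match= asserts both the exception type and its message in one go.
with pytest.raises(ValueError,
                   match=re.escape('Unsorted categories are not supported')):
    fit_transform_stub(None)
```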
assert_array_equal(ret, [1, 0, 2, 0, 2])

msg = "unseen labels"
assert_raise_message(ValueError, msg, le.transform, unknown)
pytest.raises
return _encode_numpy(values, uniques, encode)


def _encode_check_unknown(values, uniques, return_mask=False):
I am not sure that we should start with `_`, since this is used in a different file. We are not really consistent across the code base. @jnothman, what do you recommend?
No problem having private internal utils
Force-pushed from e34e297 to ea189ce (compare)
Updated the PR with master; this should be ready for final review / merging.
Tests failing.
Whoops, sorry, rebasing error, should be fixed now.
Nitpicks. I've not checked tests, sorry.
sklearn/preprocessing/label.py
Outdated
    encoded = np.searchsorted(uniques, values)
    return uniques, encoded
else:
    return uniques
Should "Unsorted categories are not supported for numerical categories" be validated and raised here?
Otherwise please note in the `_encode` docstring that this is not ensured, but is an assumption.
It could be validated here (and it might be easier to follow), but in principle this would add (a little bit) more overhead, as it is not needed on each transform (encoding) if it is ensured in fit.
So for now I am making this clearer in the docstring.
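For context, the reason the numpy path has to assume sorted uniques is that `np.searchsorted` is a binary search: with unsorted uniques it silently returns wrong codes rather than raising. A small illustration (not code from this PR):

```python
import numpy as np

values = np.array([2, 1, 3])

# With sorted uniques, searchsorted yields the correct integer codes.
print(np.searchsorted(np.array([1, 2, 3]), values))  # [1 0 2]

# With unsorted uniques the binary-search precondition is violated and
# the result is silently wrong -- no error is raised.
print(np.searchsorted(np.array([2, 1, 3]), values))  # not the intended [0 1 2]
```

This is why it is cheaper to validate sortedness once in fit and merely document the assumption in the encoding helper.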
@@ -37,6 +37,125 @@
]


def _encode_numpy(values, uniques=None, encode=False):
Add a note that this should be accessed through `_encode`, where the parameters are described.
sklearn/preprocessing/label.py
Outdated

def _encode(values, uniques=None, encode=False):
    """
    Helper function to factorize (find uniques) and encode values.
our convention usually puts this summary on the previous line.
sklearn/preprocessing/label.py
Outdated
Uses pure python method for object dtype, and numpy method for
all other dtypes.
The numpy method has the limitation that the `uniques` need to
be sorted.
clarify that this is not validated, but assumed for all non-object input.
sklearn/preprocessing/label.py
Outdated
Returns
-------
uniques
    If ``encode=False``.
note: Sorted if uniques parameter was None.
sklearn/preprocessing/label.py
Outdated
diff = _encode_check_unknown(values, uniques)
if diff:
    raise ValueError(
        "y contains previously unseen labels: %s" % str(diff))
our style tends to prefer:
raise ValueError("y contains previously unseen labels: %s" %
                 str(diff))
sklearn/preprocessing/label.py
Outdated
    return diff
else:
    unique_values = np.unique(values)
    diff = list(np.setdiff1d(unique_values, uniques))
assume_unique=True
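For reference, `assume_unique=True` tells `np.setdiff1d` to skip its internal deduplication, which is safe here because `unique_values` comes from `np.unique` and `uniques` already holds the known categories:

```python
import numpy as np

uniques = np.array([1, 2, 3])            # known categories (already unique)
unique_values = np.array([1, 2, 4, 5])   # output of np.unique, also unique

# Values present in the data but not among the known categories;
# assume_unique=True skips a redundant sort/dedup pass inside setdiff1d.
diff = np.setdiff1d(unique_values, uniques, assume_unique=True)
print(list(diff))  # [4, 5]
```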
@@ -195,8 +194,8 @@ class OneHotEncoder(_BaseEncoder):

- 'auto' : Determine categories automatically from the training data.
- list : ``categories[i]`` holds the categories expected in the ith
    column. The passed categories must be sorted and should not mix
    strings and numeric values.
    column. The passed categories should not mix strings and numeric
clarify that this is "within a single feature".
Is there a test for ohe with specified, unsorted cats?
Yes, there is. See:
Ah of course. Sorry I missed that.
Tests are passing now.
@glemaitre your review here has a cross.
Is it worth making an entry in the what's new enhancement section, though it does not change anything for the user?
I don't think a separate whatsnew entry is necessarily needed. I would see it as part of the enhancement to OneHotEncoder to support string data.
LGTM as well, merged!
cool!!
Thanks!
Reference Issues/PRs
Fixes #7432, see also #9151
What does this implement/fix?
This implements the encoding in the LabelEncoder in different ways (the original numpy/searchsorted one, one based on set/dict lookups (much faster for object dtype), and one using pandas), dispatching to one of those methods depending on the dtype and on whether pandas is available.
This is WIP, mainly to get feedback on the idea. I am not happy yet with the implementation. Also, for `fit_transform` there are shortcuts available for the numpy and pandas ways, which I didn't add yet.