[MRG] Use hashtable (python sets/dicts) for object dtype data in Encoders #10209

jorisvandenbossche · 2017-11-27T09:14:59Z

Reference Issues/PRs

Fixes #7432, see also #9151

What does this implement/fix?

This implements the encoding in LabelEncoder in three different ways: the original numpy/searchsorted one, one based on set/dict lookup (much faster for object dtype), and one using pandas. It dispatches to one of those methods depending on the dtype and on whether pandas is available.

This is WIP, mainly to get feedback on the idea. I am not happy yet with the implementation. Also for fit_transform there are shortcuts available for the numpy and pandas way, which I didn't add yet.
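A minimal sketch of the dispatch idea described above, covering only the numpy and set/dict paths. The function names (`encode_numpy`, `encode_python`, `encode`) are illustrative assumptions, not the names used in the PR:

```python
import numpy as np


def encode_numpy(values):
    # np.unique returns the sorted uniques and, with return_inverse=True,
    # the integer code of each value.
    uniques, codes = np.unique(values, return_inverse=True)
    return uniques, codes


def encode_python(values):
    # Build the sorted uniques from a set, then look codes up in a dict.
    uniques = sorted(set(values))
    table = {val: i for i, val in enumerate(uniques)}
    codes = np.array([table[v] for v in values])
    return np.array(uniques, dtype=values.dtype), codes


def encode(values):
    # Dispatch on dtype: hashing for object arrays, numpy otherwise.
    if values.dtype == object:
        return encode_python(values)
    return encode_numpy(values)
```

Both paths return the same `(uniques, codes)` pair, so callers do not need to know which one ran.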

jnothman · 2017-11-27T10:01:22Z

This pull request introduces 1 alert - view on lgtm.com

new alerts:

1 for Wrong number of arguments in a call

Comment posted by lgtm.com

jnothman · 2017-11-27T11:37:07Z

Throw us a quick benchmark for strings, ints, few categories, many categories? Also, a comment on the unsorted case (not that I'm sure we want that in LabelEncoder, really...)?

jorisvandenbossche · 2017-11-27T12:50:26Z

There are quite extensive benchmarks in the notebook I linked to in the issue: http://nbviewer.jupyter.org/gist/jorisvandenbossche/f399d43499785534f018a4f0c16d24dd

Two summary plots (averaged for different number of categories, in the notebook more plots are available, but the trend is more or less the same):

[Plot: factorizing step (fit), time in seconds on 10 million elements]

[Plot: encoding step (transform), time in seconds on 10 million elements]

jorisvandenbossche · 2017-11-27T14:54:25Z

I updated the figures above a bit, as they turned out to be not fully correct for the transform step: the LabelEncoder does more than just encode with numpy (which is what I wanted to measure); it also checks whether there are unseen categories. That is something I did not time for the other methods, so I added a separate numpy one that also skips that check.
Will probably need to add this checking for unseen labels to the benchmarks as well.

The conclusion can also be that the numpy method is quite OK, except for object dtype. So just special-casing object dtype with a set/dict-based one might be enough (and this one can then be used in the CategoricalEncoder when the categories are not sorted).
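To make the difference between the two timed numpy variants concrete, here is a rough sketch of a numpy transform with and without the unseen-label check. `transform_numpy` and its `check_unseen` flag are hypothetical names for illustration only:

```python
import numpy as np


def transform_numpy(values, classes, check_unseen=True):
    # `classes` is assumed to be sorted, as np.unique would return it.
    if check_unseen:
        unseen = np.setdiff1d(values, classes)
        if len(unseen):
            raise ValueError(
                "y contains previously unseen labels: %s" % str(unseen))
    # The encoding itself is a single binary search per value.
    return np.searchsorted(classes, values)
```

With `check_unseen=False` only the `np.searchsorted` call is measured, which is the part comparable to the other methods.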

amueller · 2017-11-27T20:17:32Z

why do we need the table if we don't have the unsorted case?

amueller · 2017-11-27T20:20:45Z

If we only do numpy & python, we would use numpy for strings? So this wouldn't really impact CategoricalEncoder in most cases, right?

jorisvandenbossche · 2017-11-27T20:20:54Z

why do we need the table if we don't have the unsorted case?

Because it is much faster for object dtype, and we want to support the unsorted case in CategoricalEncoder.

amueller · 2017-11-27T20:21:26Z

Does unsorted mean they are in a different order or they are not sortable?

jorisvandenbossche · 2017-11-27T20:22:03Z

If we only do numpy & python, we would use numpy for strings? So this wouldn't really impact CategoricalEncoder in most cases, right?

Yes, I suppose most people with strings will actually have object dtype instead of numpy string dtype (= all people using pandas to store strings)

jorisvandenbossche · 2017-11-27T20:26:19Z

For the CategoricalEncoder, I think ideally we want to allow user-specified categories that do not need to be sorted (now the sortedness is imposed and an error is raised if it is not the case). I think they will normally be sortable, though, so it is about being in a different order.
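As an illustration of why a dict lookup helps here (a sketch with made-up category names, not code from the PR): a dict keeps whatever order the user passed, while `np.searchsorted` only gives meaningful positions when its first argument is sorted.

```python
import numpy as np

# User-specified order, deliberately not alphabetical.
categories = np.array(['low', 'medium', 'high'], dtype=object)
values = np.array(['high', 'low', 'medium'], dtype=object)

# Dict lookup: each code is the position in the user's order.
table = {cat: i for i, cat in enumerate(categories)}
codes = np.array([table[v] for v in values])

# searchsorted needs sorted categories, so the codes follow the
# alphabetical order ('high', 'low', 'medium') instead.
sorted_codes = np.searchsorted(np.sort(categories), values)

print(codes.tolist())         # [2, 0, 1]
print(sorted_codes.tolist())  # [0, 1, 2]
```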

jnothman · 2017-11-27T23:05:41Z

The conclusion can also be that the numpy method is quite OK, except for object dtype. So just special-casing object dtype with a set/dict-based one might be enough (and this one can then be used in the CategoricalEncoder when the categories are not sorted).

On the other hand, users providing object arrays with strings will very often be doing that through Pandas. But yes, I agree with you that we need the Python implementation anyway.

My only remaining hesitation about not using pandas is that this encoding is done all over the place (i.e. in every classifier), even repeatedly encoding the same data, and we may be wasting seconds of predict time by not adopting Pandas.

Btw, you may be catching a Pandas fast path for your int tests by making the set of classes 0..n-1 rather than something gappy or starting above/below 0. It means that the transform is a no-op.
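A small sketch of the point about benchmark data, using numpy only (the sizes and the `make_labels` helper are arbitrary assumptions). When the classes are exactly 0..n-1, the integer codes equal the input, so an implementation can skip the work; gappy classes avoid that shortcut:

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_samples = 100, 10000


def make_labels(classes):
    # Include every class at least once, then fill up with random draws.
    extra = rng.choice(classes, size=n_samples - len(classes))
    y = np.concatenate([classes, extra])
    rng.shuffle(y)
    return y


y_contiguous = make_labels(np.arange(n_classes))        # classes 0..n-1
y_gappy = make_labels(np.arange(n_classes) * 7 + 13)    # gaps, starts above 0

codes_contiguous = np.unique(y_contiguous, return_inverse=True)[1]
codes_gappy = np.unique(y_gappy, return_inverse=True)[1]

# For 0..n-1 the codes are identical to the input (a no-op encoding);
# for the gappy classes they are not.
print(np.array_equal(codes_contiguous, y_contiguous))
print(np.array_equal(codes_gappy, y_gappy))
```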

glemaitre

Then we need a what's new and the factorize seems to not work with read-only memmap.

glemaitre · 2018-06-07T11:11:50Z

sklearn/preprocessing/tests/test_label.py

+def test_factorize_encode_utils(engine, values, expected):
+    # test that all different encoders are equivalent
+
+    if engine == 'numpy':


You can parametrize this part as well and mark one of the solutions as skipif with pytest, can't you?
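One way this suggestion could look, as a sketch: `pytest.param` accepts a `marks` argument, so a single parametrized case can be skipped when pandas is missing. The two `factorize_*` helpers and the test name are hypothetical stand-ins, not the functions from the PR:

```python
import numpy as np
import pytest

try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False


def factorize_numpy(values):
    # Sorted unique values via numpy.
    return np.unique(values)


def factorize_pandas(values):
    # pd.unique keeps order of appearance, so sort to match numpy.
    return np.sort(pd.unique(values))


@pytest.mark.parametrize("factorize", [
    factorize_numpy,
    pytest.param(
        factorize_pandas,
        marks=pytest.mark.skipif(not HAS_PANDAS,
                                 reason="pandas is not installed")),
])
def test_factorize_engines(factorize):
    values = np.array([3, 1, 3, 2])
    np.testing.assert_array_equal(factorize(values), [1, 2, 3])
```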

glemaitre · 2018-06-07T11:11:57Z

sklearn/preprocessing/label.py

-        y = column_or_1d(y, warn=True)
-        self.classes_, y = np.unique(y, return_inverse=True)
-        return y
+    # def fit_transform(self, y):


glemaitre · 2018-06-07T11:12:05Z

sklearn/preprocessing/label.py

+    return _PANDAS_INSTALLED
+
+
+# def _encode_numpy(values, uniques=None, encode=True):


jnothman

I think I'd like to see all LabelEncoder tests run on both number and object dtypes...

jnothman

Is the intention of putting _encode directly into CategoricalEncoder, rather than LabelEncoder, in order to allow for unordered mappings?

jnothman · 2018-06-21T07:25:18Z

sklearn/preprocessing/label.py

+    Returns
+    -------
+    uniques
+        If decode=False


decode isn't a thing.

jorisvandenbossche · 2018-06-21T07:38:46Z

I think I have now parametrized all LabelEncoder tests that are useful to parametrize for both int and object.

jorisvandenbossche · 2018-06-21T07:40:16Z

Is the intention of putting _encode directly into CategoricalEncoder, rather than LabelEncoder, in order to allow for unordered mappings?

Yes, see the second-to-last commit I added.
For now I just replaced LabelEncoder with _encode, but the idea is indeed that in that way I can relax the restriction on sorted categories (at least for object dtype). Working on that next (but first need to fix handling of unknown categories in CategoricalEncoder as well)

jnothman · 2018-06-21T08:02:58Z

let's merge the CategoricalEncoder rewrite first then. can you go fix that 0.19dev to 0.20dev? I'm on the move.

jorisvandenbossche · 2018-06-21T08:40:48Z

I added an extra function to check unknown values with numpy/python for non-object/object dtypes. I will probably have to parametrize some CategoricalEncoder tests for this to ensure both paths are properly tested, but will only do that when the CategoricalEncoder rewrite is merged (will update that now).
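A rough sketch of what such a check could look like, assuming a set difference for object dtype and `np.setdiff1d` otherwise. `check_unknown` is an illustrative name; the function actually added in the PR is `_encode_check_unknown`, whose body is not shown in this thread:

```python
import numpy as np


def check_unknown(values, uniques):
    # Return the values that do not occur in `uniques`.
    if values.dtype == object:
        # Pure-python path: hash-based set difference.
        return list(set(values) - set(uniques))
    # Numpy path: both inputs are unique, so assume_unique=True is safe.
    unique_values = np.unique(values)
    return list(np.setdiff1d(unique_values, uniques, assume_unique=True))
```

An empty list means every value is known.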

jorisvandenbossche · 2018-06-21T12:38:56Z

cc @ogrisel This should be ready enough to review now (I still need to check if I need to parametrize more OneHotEncoder tests to ensure they use both numerical and object dtype data, but will only get to that tomorrow)

glemaitre

Couple of comments. Otherwise it looks good.

glemaitre · 2018-06-26T14:42:27Z

sklearn/preprocessing/label.py

+    Returns
+    -------
+    uniques
+        If encode=False


backticks and full stop

glemaitre · 2018-06-26T14:42:33Z

sklearn/preprocessing/label.py

+    uniques
+        If encode=False
+    (uniques, encoded)
+        If encode=True


backticks and full stop

glemaitre · 2018-06-26T14:42:54Z

sklearn/preprocessing/label.py

+    uniques : array, optional
+        If passed, uniques are not determined from passed values (this
+        can be because the user specified categories, or because they
+        already have been determined in fit)


glemaitre · 2018-06-26T14:43:05Z

sklearn/preprocessing/label.py

+        can be because the user specified categories, or because they
+        already have been determined in fit)
+    encode : bool, default False
+        If True, also encode the values into integer codes based on `uniques`


glemaitre · 2018-06-26T14:43:34Z

sklearn/preprocessing/label.py

+    -------
+    diff : list
+        The unique values present in `values` and not in `uniques` (the
+        unknown values).If encode=False


Missing space after full stop.

and backticks

glemaitre · 2018-06-26T15:04:41Z

sklearn/preprocessing/tests/test_encoders.py

+    np.array([[10, 1, 55], [5, 2, 55]]),
+    np.array([['b', 'A', 'cat'], ['a', 'B', 'cat']], dtype=object)
+    ], ids=['mixed', 'numeric', 'object'])
+def test_one_hot_encoder(X):
    X = [['abc', 1, 55], ['def', 2, 55]]


Uhm. I think that this should be removed.

glemaitre · 2018-06-26T15:09:25Z

sklearn/preprocessing/tests/test_encoders.py

+    # when specifying categories manually, unknown categories should already
+    # raise when fitting
+    enc = OneHotEncoder(categories=cats)
+    assert_raises(ValueError, enc.fit, X2)


pytest.raises

glemaitre · 2018-06-26T15:10:22Z

sklearn/preprocessing/tests/test_encoders.py

+    X = np.array([[1, 2]]).T
+    enc = OneHotEncoder(categories=[[2, 1, 3]])
+    msg = re.escape('Unsorted categories are not supported')
+    assert_raises_regex(ValueError, msg, enc.fit_transform, X)


pytest.raises; we could also match the error message.
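For reference, a sketch of the `pytest.raises` form with message matching. `fit_categories` is a hypothetical stand-in for the validation done in the encoder's `fit`, used only so the example is self-contained:

```python
import re

import pytest


def fit_categories(categories):
    # Hypothetical stand-in: reject categories that are not sorted.
    if list(categories) != sorted(categories):
        raise ValueError("Unsorted categories are not supported")
    return list(categories)


def test_unsorted_categories_raise():
    # `match` is applied with re.search, so escape the literal message.
    msg = re.escape("Unsorted categories are not supported")
    with pytest.raises(ValueError, match=msg):
        fit_categories([2, 1, 3])
```

This replaces both `assert_raises` and `assert_raises_regex` with one construct.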

glemaitre · 2018-06-26T15:12:41Z

sklearn/preprocessing/tests/test_label.py

+    assert_array_equal(ret, [1, 0, 2, 0, 2])
+
+    msg = "unseen labels"
+    assert_raise_message(ValueError, msg, le.transform, unknown)


pytest.raises

glemaitre · 2018-06-26T15:21:17Z

sklearn/preprocessing/label.py

+        return _encode_numpy(values, uniques, encode)
+
+
+def _encode_check_unknown(values, uniques, return_mask=False):


I am not sure that we should start with _, since this is used in a different file. Checking the code base, we are not really consistent. @jnothman what do you recommend?

No problem having private internal utils

jorisvandenbossche · 2018-06-29T07:48:03Z

Updated the PR with master; this should be ready for final review / merging.

jnothman · 2018-06-30T12:15:48Z

Test failing.

jorisvandenbossche · 2018-07-01T09:17:25Z

Whoops, sorry, rebasing error, should be fixed now.

jnothman

Nitpicks. I've not checked tests, sorry.

jnothman · 2018-07-01T12:12:26Z

sklearn/preprocessing/label.py

+        encoded = np.searchsorted(uniques, values)
+        return uniques, encoded
+    else:
+        return uniques


Should "Unsorted categories are not supported for numerical categories" be validated and raised here?

Otherwise please note in the _encode docstring that this is not ensured, but is an assumption.

It could be validated here (and it might be easier to follow), but in principle this will give (a little bit) more overhead as it is not needed to do that on each transform (encoding) if it is ensured in fit.
So for now making this more clear in the docstring.
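The trade-off discussed here can be sketched as follows: sortedness is validated once when fitting, and each later transform relies on that assumption. `SortedEncoderSketch` is an illustrative class, not part of scikit-learn:

```python
import numpy as np


class SortedEncoderSketch:
    """Validate sortedness once in fit; assume it in transform."""

    def fit(self, categories):
        cats = np.asarray(categories)
        # Only numerical (non-object) categories must be sorted.
        if cats.dtype != object and np.any(cats[:-1] > cats[1:]):
            raise ValueError("Unsorted categories are not supported "
                             "for numerical categories")
        self.categories_ = cats
        return self

    def transform(self, values):
        # No re-validation here: searchsorted silently assumes that
        # self.categories_ is sorted.
        return np.searchsorted(self.categories_, values)
```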

jnothman · 2018-07-01T12:13:30Z

sklearn/preprocessing/label.py

@@ -37,6 +37,125 @@
 ]


+def _encode_numpy(values, uniques=None, encode=False):


Add a note that this should be accessed through _encode, where parameters are described.

jnothman · 2018-07-01T12:15:07Z

sklearn/preprocessing/label.py

+
+def _encode(values, uniques=None, encode=False):
+    """
+    Helper function to factorize (find uniques) and encode values.


our convention usually puts this summary on the previous line.

jnothman · 2018-07-01T12:15:36Z

sklearn/preprocessing/label.py

+    Uses pure python method for object dtype, and numpy method for
+    all other dtypes.
+    The numpy method has the limitation that the `uniques` need to
+    be sorted.


clarify that this is not validated, but assumed for all non-object input.

jnothman · 2018-07-01T12:16:18Z

sklearn/preprocessing/label.py

+    Returns
+    -------
+    uniques
+        If ``encode=False``.


note: Sorted if uniques parameter was None.

jnothman · 2018-07-01T12:18:01Z

sklearn/preprocessing/label.py

+        diff = _encode_check_unknown(values, uniques)
+        if diff:
+            raise ValueError(
+                "y contains previously unseen labels: %s" % str(diff))


our style tends to prefer:

raise ValueError("y contains previously unseen labels: %s" % str(diff))

jnothman · 2018-07-01T12:19:22Z

sklearn/preprocessing/label.py

+            return diff
+    else:
+        unique_values = np.unique(values)
+        diff = list(np.setdiff1d(unique_values, uniques))


assume_unique=True
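A short sketch of the suggested change, with made-up input values. Since `np.unique` has already deduplicated `values`, and `uniques` is unique by construction, `np.setdiff1d` can be told to skip its own deduplication:

```python
import numpy as np

values = np.array([3, 1, 3, 7, 1])
uniques = np.array([1, 2, 3])

unique_values = np.unique(values)
# Both inputs are already unique, so assume_unique=True is safe here.
diff = list(np.setdiff1d(unique_values, uniques, assume_unique=True))
print(diff)
```

Here `diff` contains the single unknown value 7.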

jnothman · 2018-07-01T12:20:44Z

sklearn/preprocessing/_encoders.py

@@ -195,8 +194,8 @@ class OneHotEncoder(_BaseEncoder):

        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
-          column. The passed categories must be sorted and should not mix
-          strings and numeric values.
+          column. The passed categories should not mix strings and numeric


clarify that this is "within a single feature".

jnothman

Is there a test for ohe with specified, unsorted cats?

jorisvandenbossche · 2018-07-02T06:48:58Z

Is there a test for ohe with specified, unsorted cats?

Yes, there is. See test_one_hot_encoder_unsorted_categories

jnothman · 2018-07-02T06:50:33Z

Yes, there is. See test_one_hot_encoder_unsorted_categories

Ah of course. Sorry I missed that.

jorisvandenbossche · 2018-07-02T12:04:14Z

Tests are passing now.

jnothman · 2018-07-02T22:05:41Z

@glemaitre your review here has a cross

glemaitre · 2018-07-02T22:14:47Z

Is it worth making an entry in the enhancement section of the what's new, though it does not change anything for the user?

jorisvandenbossche · 2018-07-03T07:35:52Z

I don't think a separate whatsnew entry is necessarily needed. I would see it as part of the enhancement to OneHotEncoder to support string data

ogrisel · 2018-07-03T07:51:20Z

LGTM as well, merged!

jnothman · 2018-07-03T07:56:27Z

cool!!

jorisvandenbossche · 2018-07-03T08:08:58Z

Thanks!

Optionally use hashtable or pandas for LabelEncoder

081e784

jorisvandenbossche mentioned this pull request Nov 27, 2017

Make LabelEncoder use a hash table #7432

Closed

jorisvandenbossche added 2 commits June 6, 2018 14:35

clean-up

391f007

Merge remote-tracking branch 'upstream/master' into labelencoder-sets

c2efafe

glemaitre requested changes Jun 7, 2018

jorisvandenbossche added 2 commits June 20, 2018 18:25

remove pandas for now; simplify; document

9131c46

move detection unseen labels into encode function

ed64b91

jnothman reviewed Jun 20, 2018

use new _encode instead of LabelEncoder in CategoricalEncoder

7268e6b

jnothman reviewed Jun 21, 2018

parametrize more tests

787d617

jorisvandenbossche added 2 commits June 21, 2018 10:34

also properly handle unknown categories in CategoricalEncoder

7502e5e

reuse check function for LabelEncoder as well

f7e195a

jorisvandenbossche added 2 commits June 21, 2018 14:24

Merge remote-tracking branch 'upstream/master' into labelencoder-sets

bb15f13

allow unsorted categories passed by user for object dtype

3d1281c

parametrize some OneHotEncoder tests for object/int dtypes

626d217

glemaitre reviewed Jun 26, 2018

feedback guillaume

662da67

jorisvandenbossche added this to the 0.20 milestone Jun 29, 2018

Merge remote-tracking branch 'upstream/master' into labelencoder-sets

ea189ce

jorisvandenbossche force-pushed the labelencoder-sets branch from e34e297 to ea189ce Compare June 29, 2018 07:46

fixup merge master

0bbbe82

jnothman reviewed Jul 1, 2018

feedback Joel

4365536

jorisvandenbossche mentioned this pull request Jul 2, 2018

OneHotEncoder doesn't handle columns with mix of string and int #11379

Open

jnothman approved these changes Jul 2, 2018

glemaitre approved these changes Jul 2, 2018

ogrisel merged commit 0b0bd9b into scikit-learn:master Jul 3, 2018

jorisvandenbossche deleted the labelencoder-sets branch July 3, 2018 08:07

jorisvandenbossche mentioned this pull request Jul 20, 2018

ENH: LabelEncoder supports pandas Categorical dask/dask-ml#310

Merged

stsievert mentioned this pull request Sep 22, 2020

Add param based compile adriangb/scikeras#66

Merged
