Skip to content

[MRG] Use hashtable (python sets/dicts) for object dtype data in Encoders #10209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 3, 2018

Conversation

jorisvandenbossche
Copy link
Member

Reference Issues/PRs

Fixes #7432, see also #9151

What does this implement/fix?

This implements the encoding in the LabelEncoder in the different ways (the original numpy/searchsorted one, one based on sets/dict lookup (much faster for object dtype) and one using pandas), and dispatching to one of those methods depending on the dtype / depending on whether pandas is available.

This is WIP, mainly to get feedback on the idea. I am not happy yet with the implementation. Also for fit_transform there are shortcuts available for the numpy and pandas way, which I didn't add yet.

@jnothman
Copy link
Member

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Wrong number of arguments in a call

Comment posted by lgtm.com

@jnothman
Copy link
Member

Throw us a quick benchmark for strings, ints, few categories, many categories? Also, a comment on the unsorted case (not that I'm sure we want that in LabelEncoder, really...)?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 27, 2017

There are quite extensive benchmarks in the notebook I linked to in the issue: http://nbviewer.jupyter.org/gist/jorisvandenbossche/f399d43499785534f018a4f0c16d24dd

Two summary plots (averaged for different number of categories, in the notebook more plots are available, but the trend is more or less the same):

Factorizing step (fit) (time in seconds on 10 million elements):

index

Encoding step (transform) (time in seconds on 10 million elements):

index2

@jorisvandenbossche
Copy link
Member Author

I updated the figures above a bit as they turned out to be not fully correct for the transform step (the LabelEncoder does more than just encoding with numpy (which I wanted to measure), it also checks if there are unseen categories. Which is something I did not time for the other methods, so added a separate numpy one that also does not do that.
Will probably need to add this checking for unseen labels to the benchmarks as well.

The conclusion can also that the numpy method is quite OK, except for object dtype. So just special casing object dtype with a set/dict based one might be enough (and this one can then be used in the CategoricalEncoder when the categories are not sorted).

@amueller
Copy link
Member

why do we need the table if we don't have the unsorted case?

@amueller
Copy link
Member

If we only do numpy & python, we would use numpy for strings? So this wouldn't really impact CategoricalEncoder in most cases, right?

@jorisvandenbossche
Copy link
Member Author

why do we need the table if we don't have the unsorted case?

Because it is much faster for object dtype + we want to support unsorted case in CategoricalEncoder

@amueller
Copy link
Member

Does unsorted mean they are in a different order or they are not sortable?

@jorisvandenbossche
Copy link
Member Author

If we only do numpy & python, we would use numpy for strings? So this wouldn't really impact CategoricalEncoder in most cases, right?

Yes, I suppose most people with strings will actually have object dtype instead of numpy string dtype (= all people using pandas to store strings)

@jorisvandenbossche
Copy link
Member Author

For the CategoricalEncoder, I think ideally we want to allow user-specified categories that do not need to be sorted (now the sortedness is imposed and an error is raised if it is not the case). I think normally they will be sortable though, but so it is about being in a different order.

@jnothman
Copy link
Member

The conclusion can also that the numpy method is quite OK, except for object dtype. So just special casing object dtype with a set/dict based one might be enough (and this one can then be used in the CategoricalEncoder when the categories are not sorted).

On the other hand, users providing object arrays with strings will very often be doing that through Pandas. But yes, I agree with you that we need the Python implementation anyway.

My only remaining hesitation about not using pandas is that this encoding is done all over the place (i.e. in every classifier), even repeatedly encoding the same data, and we may be wasting seconds of predict time by not adopting Pandas.

Btw, you may be catching a Pandas fast path for your int tests by making the set of classes 0..n-1 rather than something gappy or starting above/below 0. It means that the transform is a no-op.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need a what's new and the factorize seems to not work with read-only memmap.

def test_factorize_encode_utils(engine, values, expected):
# test that all different encoders are equivalent

if engine == 'numpy':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can parametrize this part as well and mark one of the solution as skipif with pytest, isn't it?

y = column_or_1d(y, warn=True)
self.classes_, y = np.unique(y, return_inverse=True)
return y
# def fit_transform(self, y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean up

return _PANDAS_INSTALLED


# def _encode_numpy(values, uniques=None, encode=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean up

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to see all LabelEncoder tests run on both number and object dtypes...

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention of putting _encode directly into CategoricalEncoder, rather than LabelEncoder, in order to allow for unordered mappings?

Returns
-------
uniques
If decode=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode isn't a thing.

@jorisvandenbossche
Copy link
Member Author

I think I parametrized now all LabelEncoder tests that are useful to parametrize for both int and object.

@jorisvandenbossche
Copy link
Member Author

Is the intention of putting _encode directly into CategoricalEncoder, rather than LabelEncoder, in order to allow for unordered mappings?

Yes, see the one but last commit I added.
For now I just replaced LabelEncoder with _encode, but the idea is indeed that in that way I can relax the restriction on sorted categories (at least for object dtype). Working on that next (but first need to fix handling of unknown categories in CategoricalEncoder as well)

@jnothman
Copy link
Member

jnothman commented Jun 21, 2018 via email

@jorisvandenbossche
Copy link
Member Author

I added an extra function to check unknown values with numpy/python for non-object/object dtypes. I will probably have to parametrize some CategoricalEncoder tests for this to ensure both paths are properly tested, but will only do that when the CategoricalEncoder rewrite is merged (will update that now).

@jorisvandenbossche
Copy link
Member Author

cc @ogrisel This should be ready enough now to review (I still need to check if I need to parametrize more OneHotEncoder tests to ensure they both use numerical and object dtype data, but will only get to that tomorrow)

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments. Otherwise it looks good.

Returns
-------
uniques
If encode=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backsticks and full stop

uniques
If encode=False
(uniques, encoded)
If encode=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backsticks and full stop

uniques : array, optional
If passed, uniques are not determined from passed values (this
can be because the user specified categories, or because they
already have been determined in fit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full stop.

can be because the user specified categories, or because they
already have been determined in fit)
encode : bool, default False
If True, also encode the values into integer codes based on `uniques`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full stop.

-------
diff : list
The unique values present in `values` and not in `uniques` (the
unknown values).If encode=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after full stop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and backsticks

np.array([[10, 1, 55], [5, 2, 55]]),
np.array([['b', 'A', 'cat'], ['a', 'B', 'cat']], dtype=object)
], ids=['mixed', 'numeric', 'object'])
def test_one_hot_encoder(X):
X = [['abc', 1, 55], ['def', 2, 55]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm. I think that this should be removed.

# when specifying categories manually, unknown categories should already
# raise when fitting
enc = OneHotEncoder(categories=cats)
assert_raises(ValueError, enc.fit, X2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest.raises

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

X = np.array([[1, 2]]).T
enc = OneHotEncoder(categories=[[2, 1, 3]])
msg = re.escape('Unsorted categories are not supported')
assert_raises_regex(ValueError, msg, enc.fit_transform, X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest.raises we could also match the error message.

assert_array_equal(ret, [1, 0, 2, 0, 2])

msg = "unseen labels"
assert_raise_message(ValueError, msg, le.transform, unknown)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest.raises

return _encode_numpy(values, uniques, encode)


def _encode_check_unknown(values, uniques, return_mask=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that we should start with _ since that this is used in different file. We are not really consistent by checking the code base. @jnothman what do you recommend?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem having private internal utils

@jorisvandenbossche jorisvandenbossche added this to the 0.20 milestone Jun 29, 2018
@jorisvandenbossche
Copy link
Member Author

Updated the PR with master; this should be ready for final review / merging.

@jnothman
Copy link
Member

Test failing.

@jorisvandenbossche
Copy link
Member Author

Whoops, sorry, rebasing error, should be fixed now.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks. I've not checked tests, sorry.

encoded = np.searchsorted(uniques, values)
return uniques, encoded
else:
return uniques
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "Unsorted categories are not supported for numerical categories" be validated and raised here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise please note in the _encode docstring that this is not ensured, but is an assumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be validated here (and it might be easier to follow), but in principle this will give (a little bit) more overhead as it is not needed to do that on each transform (encoding) if it is ensured in fit.
So for now making this more clear in the docstring.

@@ -37,6 +37,125 @@
]


def _encode_numpy(values, uniques=None, encode=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note that this should be accessed through _encode, where parameters are described.


def _encode(values, uniques=None, encode=False):
"""
Helper function to factorize (find uniques) and encode values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our convention usually puts this summary on the previous line.

Uses pure python method for object dtype, and numpy method for
all other dtypes.
The numpy method has the limitation that the `uniques` need to
be sorted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarify that this is not validated, but assumed for all non-object input.

Returns
-------
uniques
If ``encode=False``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: Sorted if uniques parameter was None.

diff = _encode_check_unknown(values, uniques)
if diff:
raise ValueError(
"y contains previously unseen labels: %s" % str(diff))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our style tends to prefer:

            raise ValueError("y contains previously unseen labels: %s" %
                             str(diff))

return diff
else:
unique_values = np.unique(values)
diff = list(np.setdiff1d(unique_values, uniques))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume_unique=True

@@ -195,8 +194,8 @@ class OneHotEncoder(_BaseEncoder):

- 'auto' : Determine categories automatically from the training data.
- list : ``categories[i]`` holds the categories expected in the ith
column. The passed categories must be sorted and should not mix
strings and numeric values.
column. The passed categories should not mix strings and numeric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarify that this is "within a single feature".

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test for ohe with specified, unsorted cats?

@jorisvandenbossche
Copy link
Member Author

Is there a test for ohe with specified, unsorted cats?

Yes, there is. See test_one_hot_encoder_unsorted_categories

@jnothman
Copy link
Member

jnothman commented Jul 2, 2018

Yes, there is. See test_one_hot_encoder_unsorted_categories

Ah of course. Sorry I missed that.

@jorisvandenbossche
Copy link
Member Author

Tests are passing now.

@jnothman
Copy link
Member

jnothman commented Jul 2, 2018

@glemaitre your review here has a cross

@glemaitre
Copy link
Member

Is it worth to make an entry in the what's new in the enhancement section, thought it does not change anything for the user?

@jorisvandenbossche
Copy link
Member Author

I don't think a separate whatsnew entry is necessarily needed. I would see it as part of the enhancement to OneHotEncoder to support string data

@ogrisel ogrisel merged commit 0b0bd9b into scikit-learn:master Jul 3, 2018
@ogrisel
Copy link
Member

ogrisel commented Jul 3, 2018

LGTM as well, merged!

@jnothman
Copy link
Member

jnothman commented Jul 3, 2018 via email

@jorisvandenbossche jorisvandenbossche deleted the labelencoder-sets branch July 3, 2018 08:07
@jorisvandenbossche
Copy link
Member Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make LabelEncoder use a hash table
5 participants