Attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series #27911
Conversation
…mpy 1.24" This reverts commit bd86ed4.
This is adding quite a bit of complexity for something which I think should be mostly in the corresponding libraries and not scikit-learn. I wonder why @ogrisel thinks it needs to be here.
Because the numpy.unique has
@jeromedockes could you please run a quick benchmark of this PR vs
EDIT: I just opened the details of the description and indeed there is already a nice 10x speed-up at 1M samples. This should grow significantly beyond.
@jeromedockes can you please check at 10M for polars to see whether the speed-up grows or not? If not, maybe we can indeed remove the polars specialization for the sake of maintainability.
My feeling is that ideally we would be relying on the array API unique_values method instead.
Here we are modifying the default behavior of unique from pandas and polars, and this seems a bit inconsistent in the sense that we don't necessarily do the same for other methods (?)
The issue with np.unique is also not unique (pun intended) to encoders, and we have another PR (#26820) circumventing the inefficiency of np.unique in a different way.
In an IRL discussion with @glemaitre we concluded that a better way forward is to improve np.unique itself, which would solve all the issues here in scikit-learn and elsewhere.
So I plan to work on that, which would in turn remove the need for this PR, or the other PR.
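As a rough illustration of the direction discussed here (a sketch, not scikit-learn's or numpy's actual code): np.unique sorts the full array, while a hash-based pass only needs to sort the small set of distinct values. `hash_based_unique` is a hypothetical helper name:

```python
import numpy as np

def hash_based_unique(values):
    # One pass over the data using a hash table (dict preserves insertion
    # order), then a single sort of the small unique set. This mirrors the
    # hash-based strategy of pandas' Series.unique, as opposed to np.unique
    # which sorts the full array first.
    seen = dict.fromkeys(values)
    return np.array(sorted(seen))

rng = np.random.default_rng(0)
data = rng.choice(["a", "b", "c"], size=1_000)
print(hash_based_unique(data))  # ['a' 'b' 'c']
```

The asymptotic win is that the sort runs over the number of distinct values rather than the number of samples.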
sklearn/preprocessing/_encoders.py (Outdated)
if is_pandas:
    values = X.iloc[:, i]
elif is_polars:
    values = X[X.columns[i]]
else:
    values = Xi
should we use _safe_indexing instead?
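For context, a library-agnostic column getter in the spirit of this branch could look like the sketch below (hypothetical; this is not sklearn's `_safe_indexing` implementation, and `get_column` is an assumed helper name):

```python
import numpy as np
import pandas as pd

def get_column(X, i):
    # Return the i-th column of a container, dispatching on its type,
    # mirroring the is_pandas / is_polars / numpy branches in the diff.
    if isinstance(X, pd.DataFrame):
        return X.iloc[:, i]
    if hasattr(X, "columns") and hasattr(X, "get_column"):
        # polars-like DataFrame: look up the column by name
        return X.get_column(X.columns[i])
    return np.asarray(X)[:, i]

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(list(get_column(df, 1)))  # [3, 4]
```

Centralizing the dispatch in one helper is exactly what the reviewer's `_safe_indexing` suggestion would achieve.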
here are some plots for

it seems polars categorical series store a metadata bit indicating whether the unique values in the series are the same set as all the categories possible in that series. That is probably a frequent enough case, as it occurs when a categorical series is built by inferring the categories from the data. When that is the case (and the series is in a single chunk), the unique values can be obtained by looking only at the metadata (the mapping from the categorical type to string values) and not the data, which is why it is so fast in the example above. Therefore:

>>> s.dtype, s.shape
(Categorical, (1000000,))
>>> s1 = s[:-1]
>>> s1.dtype, s1.shape
(Categorical, (999999,))
>>> %timeit s.unique()
24.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s1.unique()
5.29 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

so we would most likely almost never benefit from that optimization if this PR were merged. For other cases I agree that improving
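A similar metadata-versus-data distinction can be shown with pandas (offered as an analogy, not the polars mechanism itself): `.cat.categories` is dtype metadata and needs no scan of the values, but after filtering it can still list categories that are no longer present, whereas `unique()` must scan the data:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "b", "a", "b"], dtype="category")
s1 = s[s != "c"]  # "c" is no longer present in the data

# dtype metadata: instant, but may include unused categories
print(list(s1.cat.categories))  # ['a', 'b', 'c']
# actual unique values: requires scanning the data
print(list(s1.unique()))        # ['a', 'b']
```

This is why a metadata-only fast path is only correct when the library can guarantee the metadata matches the data, as the polars flag described above does.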
Given the benchmark results, the analysis above, and the limited complexity of the dataframe-specific code, I would be in favor of including the optimizations suggested in this PR (while contributing improvements to numpy in parallel).
Once the numpy implementation has been optimized, we can consider removing some of the dataframe-specific code in scikit-learn, but probably not all of it, to still avoid sorting columns for nothing when the result has already been computed as part of the dataframe dtype metadata.
doc/whats_new/v1.4.rst (Outdated)
:class:`preprocessing.OneHotEncoder` and :class:`preprocessing.TargetEncoder`
can be faster on pandas DataFrames by using a more efficient way of finding
unique values in the categorical columns. :pr:`27911` by :user:`Jérôme Dockès
<jeromedockes>`.
This will have to be moved to target 1.5 instead of 1.4.
Thanks for the PR!
import pandas as pd

if parse_version("1.4.0") <= parse_version(pd.__version__):
    return _unique_pandas(values, return_counts=return_counts)
At this point, if the pandas version is too old, then we return None.
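A sketch of the fix this review comment suggests (hypothetical helper and version check, not the PR's actual code): gate the fast path on the pandas version, but fall through to a generic implementation instead of implicitly returning None:

```python
import numpy as np
import pandas as pd

def unique_with_fallback(values):
    # Fast path: delegate to pandas' hash-based unique on recent pandas.
    # Crucially, an old pandas falls through to the generic branch below
    # rather than returning None.
    major_minor = tuple(int(p) for p in pd.__version__.split(".")[:2])
    if isinstance(values, pd.Series) and major_minor >= (1, 4):
        return values.unique()
    # Generic fallback for any input (or an unsupported pandas version).
    return np.unique(np.asarray(values))

print(sorted(unique_with_fallback(pd.Series([2, 1, 2, 3]))))  # [1, 2, 3]
```

The structure guarantees every input path produces a result.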
if _is_polars_series(values):
    # polars unique, arg_sort not supported for polars.Object dtype.
    if str(values.dtype) != "Object":
        return _unique_polars(values, return_counts=return_counts)
Same here regarding returning None.
# unique returns a NumpyExtensionArray for extension dtypes and a numpy
# array for other dtypes
Is this comment true? For the categorical dtype the output is neither a NumpyExtensionArray nor an np.ndarray:

import pandas as pd
from pandas.arrays import NumpyExtensionArray
import numpy as np

x = pd.Series(["a", "b", "c", "b", "a", "b"], dtype="category")
uniques = x.unique()
assert not isinstance(uniques, NumpyExtensionArray)
assert not isinstance(uniques, np.ndarray)
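Indeed, for a categorical dtype, Series.unique returns a pandas Categorical; converting it explicitly sidesteps the ambiguity the comment glosses over (a small illustration, not the PR's code):

```python
import pandas as pd
import numpy as np

x = pd.Series(["a", "b", "c", "b"], dtype="category")
u = x.unique()
print(type(u).__name__)  # Categorical, not NumpyExtensionArray / ndarray

# Materialize as a plain numpy array of the category values
arr = np.asarray(u)
print(arr, arr.dtype)
```

So the comment above the code would need a third case for categorical inputs.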
try:
    unique = unique.reorder_categories(unique.categories.sort_values())
I do not see the need for reorder_categories here if we end up running unique.sort_values() afterwards. Can you provide an example where this is required?
    return unique.sort_values().to_numpy()
if unique.dtype != object:
    return np.sort(unique)
return _unique_python(unique, return_counts=False, return_inverse=False)
This code is not covered according to codecov.
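The object-dtype fallback in the diff above exists because np.sort can fail on mixed object arrays; an illustration of the failure mode (not the PR's code, which handles this in a `_unique_python`-style helper):

```python
import numpy as np

u = np.array(["b", "a", None], dtype=object)
try:
    np.sort(u)  # comparing str and None raises TypeError
    result = None
except TypeError:
    # Python-level fallback: sort the comparable values and keep the
    # missing value at the end, roughly what the fallback path must do.
    result = sorted(v for v in u if v is not None) + [None]
print(result)  # ['a', 'b', None]
```

This is the kind of input a test would need to exercise the uncovered branch.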
    return values
else:
    return values, counts
values = values[:-1]
According to codecov, everything below this line is not covered.
Related to this PR, there is also this draft PR by @adrinjalali as an attempt to make the
Reference Issues/PRs

Address this comment in a follow-up PR to #27835

What does this implement/fix? Explain your changes.

This relies on the pandas or polars Series unique method rather than numpy.unique to identify categories, as they can be faster.

Any other comments?
so far I am seeing a speedup for pandas but not really for polars; unless I can make it faster for polars it is probably not worth the added complexity
toy benchmark
https://gist.github.com/jeromedockes/d2e1fc147b7ad0a6dfd686318cc9da57
results

If we change the type of the categorical column to contain integers rather than categories, we see a small speedup for polars but an almost 10x speedup for the OrdinalEncoder on pandas.
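One subtlety behind these numbers (an observation about the libraries, not a claim about the PR's internals): pandas' Series.unique returns values in order of first appearance, while np.unique returns them sorted, so an encoder using the pandas fast path still needs a final sort of the (small) unique set:

```python
import numpy as np
import pandas as pd

s = pd.Series(["b", "a", "b", "c"])

print(list(s.unique()))               # ['b', 'a', 'c'] (appearance order)
print(list(np.unique(s.to_numpy())))  # ['a', 'b', 'c'] (sorted)

# Encoders need deterministic, sorted categories, so the fast path
# sorts the unique set afterwards -- cheap, since it is small.
print(sorted(s.unique()))             # ['a', 'b', 'c']
```

The sort cost then scales with the number of categories, not the number of samples, which is where the speedup comes from.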