ENH detect categorical polars columns in HistGradientBoosting #27835
Conversation
categorical_columns_mask = np.asarray(
    [
        c.dtype[0].name == "CATEGORICAL"
        and c.describe_categorical["is_dictionary"]
ATM in polars, pandas and pyarrow, from_dataframe raises a NotImplementedError for categorical columns for which is_dictionary is false; columns produced by pandas and polars always have is_dictionary True.
What is the meaning of is_dictionary? Maybe you could add an educational inline comment to explain why we expect this to be True for our use case.
columns produced by pandas and polars always have is_dictionary True
What about pyarrow?
The protocol's description for is_dictionary is: "whether a mapping of categorical values to other objects exists".
When is_dictionary is True, the categorical column's attribute describe_categorical, which is a dict, has a value for "categories", which is a Column providing an implicit mapping from int (the indices) to categories:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": pd.Series(["b", "a"], dtype="category")})
>>> df["A"].cat.categories
Index(['a', 'b'], dtype='object')
>>> int_col = df.__dataframe__().get_column(0)
>>> int_col.describe_categorical['is_dictionary']
True
>>> int_col.describe_categorical['categories']._col
0 a
1 b
dtype: object
So in that case (which seems to be always for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].
I suppose when is_dictionary is False the category values are stored directly in the column (and describe_categorical["categories"] is None), but I don't have an example.
In that case we could get the categories by loading the column, filtering nulls and getting unique values.
That is what is done for non-dataframe inputs, but the code that does it assumes the values are numbers so we may need to adapt it.
that is my impression but I may be misunderstanding the interchange protocol 😅 as I have no experience with it
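For illustration, a minimal sketch of that fallback, assuming a pandas input (the column name and data here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"f0": pd.Series(["b", None, "a", "b"], dtype="category")})

# Load the raw column values, filter out nulls, then keep unique values.
values = np.asarray(df["f0"], dtype=object)
values = values[~pd.isna(values)]
categories = np.unique(values)
print(categories)  # ['a' 'b']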
so in that case (which seems to be always for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].
I agree with your analysis but unfortunately I did not find an easy way to re-construct the list of unique categories as a numpy array (e.g. using numpy.from_dlpack or even a Python list from the public dataframe interchange API only).
I couldn't either (without using pyarrow). Then I guess we should give up on using the interchange protocol to get the categories, and use it only to identify categorical columns. We can then get the categories by loading the column from the original dataframe into a numpy array, as is done in main. In that case we don't care about the value of "is_dictionary".
(As I mentioned I have the impression the easiest way to do it is to let the OrdinalEncoder find the categories.)
I agree. Then we can later improve the implementation of category introspection in OrdinalEncoder in a separate, independent PR.
X.__dataframe__()
.get_column(f_idx)
.describe_categorical["categories"]
._col
Relying on the fact that both pandas and polars happen to store the original Series in ._col is definitely not great. It is the strategy used in pandas.from_dataframe for categorical columns. pyarrow has a buffers_to_array function that can actually load the buffers from a dataframe api interchange Column.
Another option would be to not rely on the dataframe interchange protocol to find the categories, but only to discover the categorical columns. Those can then be loaded from X with np.asarray, and then we can remove missing values and duplicates in a way similar to what is done below when X is not a dataframe.
The code would be different than what is done for non-dataframe X, because that only works for numbers (e.g. it relies on np.isnan) but polars categories are always strings.
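A minimal sketch of that discovery step, using only the public interchange API (the dataframe here is made up; CATEGORICAL is 23 among the protocol's DTypeKind values):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0],
    "cat": pd.Series(["a", "b"], dtype="category"),
})

# dtype[0] is the column's DTypeKind in the interchange protocol;
# DTypeKind.CATEGORICAL == 23.
df_xchg = df.__dataframe__()
categorical_mask = np.asarray(
    [col.dtype[0] == 23 for col in df_xchg.get_columns()]
)
print(categorical_mask)  # [False  True]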
Indeed, this is quite suboptimal. Decoding the categories array from the raw buffers seems non-trivial with numpy.
Maybe @jorisvandenbossche or @MarcoGorelli have a better suggestion?
Or maybe we could use the OrdinalEncoder with categories="auto", and check that the number of categories is smaller than max_bins after doing the preprocessing, by checking its categories_ fitted attribute?
otherwise I have the impression we implement the logic for finding categories and handling missing values twice, once here and once in the OrdinalEncoder.
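For example, a rough sketch of that check (the data and the 255 limit here are illustrative, not the PR's code):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

max_bins = 255
X = pd.DataFrame({"f0": pd.Series(["a", "b", "a"], dtype="category")})

# Let the encoder discover the categories, then validate their count.
enc = OrdinalEncoder(categories="auto").fit(X)
n_categories = [len(cats) for cats in enc.categories_]
assert all(n <= max_bins for n in n_categories)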
In this case here we would just need the interchange protocol to discover which columns are categorical. And rather than implementing it in the gradient boosting, maybe compose.make_column_selector could be generalized a bit to allow selecting by interchange DTypeKind, so that the preprocessing done in the gradient boosting would rely more on ColumnTransformer, OrdinalEncoder and make_column_selector.
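For reference, the pandas-only version of that selection already works today via dtype_include; the suggestion is to also allow keying off the interchange protocol's DTypeKind for non-pandas dataframes:

import pandas as pd
from sklearn.compose import make_column_selector

X = pd.DataFrame({
    "num": [1.0, 2.0],
    "cat": pd.Series(["a", "b"], dtype="category"),
})

# Current behavior: selection keyed off pandas dtypes only.
selector = make_column_selector(dtype_include="category")
print(selector(X))  # ['cat']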
That sounds like a good plan to me. @thomasjpfan any opinion?
Improving make_column_selector to select the categories makes sense. I would open another PR to see what it looks like in code.

maybe we could use the OrdinalEncoder with categories="auto",

This would most likely work. Originally, I used pandas's unique function to find the unique categories, because it was faster than having OrdinalEncoder automatically do it. In the end, I wanted to update OrdinalEncoder to have a fast path for pandas series.
Note, the issue with grabbing the categories from the dtype is that they are a superset of the actual categories. For example, the input data could consist of 2 unique categories, but the categorical dtype can consist of 300 categories. OrdinalEncoder's current behavior is to discover only the categories that are actually in the data.
from pandas.api.types import CategoricalDtype
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
X = pd.DataFrame(
    {
        "f1": pd.Series(
            ["c1", "c2"],
            dtype=CategoricalDtype(categories=[f"c{i}" for i in range(300)]),
        )
    }
)
enc = OrdinalEncoder().fit(X)
print(enc.categories_)
# [array(['c1', 'c2'], dtype=object)]
Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.
that's a good point and another reason for not relying on the dtype but on the actual column values.
Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.
Indeed that makes more sense. A similar issue I noticed is that in the main branch, if a categorical column contains values greater than max_bins (before re-encoding them with the OrdinalEncoder), this check raises a ValueError. However, as they get re-encoded later, I think what matters is the number of actual categories (the number of unique values) rather than the max of the column.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
X = np.asarray([0, 1, 1000])[:, None]
y = [0., 0., 0.]
gb = HistGradientBoostingRegressor(categorical_features=[0])
gb.fit(X, y)
# ValueError: Categorical feature at index 0 is expected to be encoded with values < 255 but the largest value for the encoded categories is 1000.
If I remove the check, fit and predict work as expected.
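A quick sketch of why the maximum value stops mattering once the column is re-encoded (illustrative, not the PR's code):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.asarray([0, 1, 1000])[:, None]

# After ordinal re-encoding the codes are contiguous, so what max_bins
# constrains is the number of unique values (3), not the max (1000).
print(OrdinalEncoder().fit_transform(X).ravel())  # [0. 1. 2.]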
that's a good point and another reason for not relying on the dtype but on the actual column values.
At the same time, I find that having the dimensions of model parameters, or the shape of the output of transform or predict_proba, depend on the observed values in X or y is a bit annoying, especially when trying to consolidate models fit on resampled datasets with rare categories in features or classes in the target. I recently faced the y-variant of this issue when experimenting with a new kind of meta estimator for quantile regression: in the predict_quantiles method, I have to post-process the output of predict_proba to deal with unobserved classes.
When the scikit-learn API was designed, we only had numpy arrays in mind. The expected number of classes (and later categories in features) was never present in numpy dtypes, hence we naturally had to infer it implicitly from the training set values.
But now that data science is moving more towards dataframe with richer dtype metadata, maybe this design decision is counter productive.
For classes / predict_proba, I think we don't want to change the current behavior, at least not in the short term.
But when adding dataframe support for categorical feature aware transformers / estimators, maybe we should have an option to collapse categories to the observed ones or preserve all categories specified in the dtype, even when unused.
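To make that option concrete, a small sketch contrasting the two behaviors being discussed (observed categories vs. all categories declared by the dtype):

import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["c1", "c2"], dtype=CategoricalDtype([f"c{i}" for i in range(300)]))

# Collapse to the observed categories...
print(s.unique().dropna().to_numpy())  # ['c1' 'c2']
# ...or preserve every category declared by the dtype, even when unused.
print(len(s.dtype.categories))  # 300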
Indeed it can be annoying to need to perform this kind of post-processing. Although for the specific case of categorical features (not outputs) for the gradient boosting, I don't see which public attributes or method outputs would be changed by taking into account categories that do not show up in the training data?
Indeed, my reflection was rather for the categorical transformers and maybe later for categorical-encoded y for classifiers (which we do not handle in any special way as of now).
For categorical features in HGBDT, the PR is good the way it is (delegating to OrdinalEncoder and only using the categories observed in the training set).
I think I like this PR the way it is now.
I think the optimization categories = X.iloc[:, f_idx].unique().dropna().to_numpy() for pandas dataframes should be re-added to OrdinalEncoder in a follow-up PR if categorical dtype inspection is either too challenging to implement correctly with the DF interchange API or too disruptive in terms of change of behavior (as discussed in the threads above).
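For context, a small demo of that pandas fast path (data made up; the speed claim is the one reported above, not re-measured here):

import pandas as pd

X = pd.DataFrame({"f0": pd.Series(["b", None, "a", "b"], dtype="category")})
f_idx = 0

# unique() on a categorical Series can work off the integer codes,
# which is why it was reported faster than OrdinalEncoder's generic path.
categories = X.iloc[:, f_idx].unique().dropna().to_numpy()
print(categories)  # ['b' 'a']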
Before considering a merge of this PR, I would just like to add pyarrow support in the tests as detailed in the comments below:
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py (three resolved review threads)
Also @jeromedockes please remove the WIP/Draft markers of this PR unless you still have TODOs in mind, in which case please state them explicitly in a bullet list in the description of the PR.
Thanks for the follow-up. Just a final suggestion for the test utility.
It looks good. Just a couple of small comments.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM otherwise.
Thanks @jeromedockes
Reference Issues/PRs
This extends #26411 so that the option categorical_features="from_dtype" in HistGradientBoosting also works when the input is a polars dataframe.
What does this implement/fix? Explain your changes.
This is a POC after (oral) discussion of the feasibility of generalizing #26411 with the dataframe interchange protocol.
Any other comments?
Meta-issue for polars / pyarrow support: #25896.