Description
We now support polars as an output for our transformers and ColumnTransformer, but our examples use pd.DataFrame.
Our datasets module also returns either a numpy array or a pandas DataFrame.
I'm suggesting that we enable datasets to return a polars dataframe, and to switch our examples from pandas to polars.
This has a few benefits:
- Polars takes advantage of multi-core CPUs, which pretty much all users have these days, so it's considerably faster in most cases.
- Pandas is dealing with issues related to Arrow, and even if they don't make Arrow a required dependency, behavior will differ depending on whether Arrow is installed, at least for the string dtype.
Another thing is the (in)stability of the pandas API, where we need to deal with deprecation warnings very often. That said, I'm not sure how stable the polars API is in the places where we would touch it (maybe cc @MarcoGorelli).
WDYT @scikit-learn/core-devs @scikit-learn/documentation-team @scikit-learn/contributor-experience-team?
This is not really a core part of our library, but what we put in our examples and the defaults we choose affect people, since many learn from our examples.