RFC switch to Polars as the default dataframe lib in our examples #28341

@adrinjalali

We now support polars as an output for our transformers and ColumnTransformer, but our examples use pd.DataFrame.

Our datasets module also returns either a numpy array or a pandas DataFrame.

I'm suggesting that we enable datasets to return a polars dataframe, and to switch our examples from pandas to polars.

This has a few benefits:

  • Polars takes advantage of multi-core CPUs, which pretty much all users have these days, so it is noticeably faster in most cases.
  • Pandas is dealing with issues related to Arrow: even if Arrow does not become a required dependency, behavior will differ depending on whether Arrow is installed, at least for the string dtype.

Another point is the (in)stability of the pandas API, where we need to deal with deprecation warnings very often. That said, I'm not sure how stable the polars API is in the places where we would touch it (maybe cc @MarcoGorelli).

WDYT @scikit-learn/core-devs @scikit-learn/documentation-team @scikit-learn/contributor-experience-team

This is not really a core part of our library, but what we put in our examples and our default choices affect people since many learn from our examples.
