Support polars parser in fetch_openml #28586

Closed

Conversation

deanm0000

What does this implement/fix? Explain your changes.

It allows polars to be selected as a parser in fetch_openml:

from sklearn.datasets import fetch_openml
bike_sharing = fetch_openml(
    "Bike_Sharing_Demand", version=2, as_frame=True, parser="polars"
)
bike_sharing.frame
shape: (17_379, 13)
┌────────┬──────┬───────┬──────┬───┬───────────┬──────────┬───────────┬───────┐
│ season ┆ year ┆ month ┆ hour ┆ … ┆ feel_temp ┆ humidity ┆ windspeed ┆ count │
│ ---    ┆ ---  ┆ ---   ┆ ---  ┆   ┆ ---       ┆ ---      ┆ ---       ┆ ---   │
│ enum   ┆ f64  ┆ f64   ┆ f64  ┆   ┆ f64       ┆ f64      ┆ f64       ┆ f64   │
╞════════╪══════╪═══════╪══════╪═══╪═══════════╪══════════╪═══════════╪═══════╡
│ spring ┆ 0.0  ┆ 1.0   ┆ 0.0  ┆ … ┆ 14.395    ┆ 0.81     ┆ 0.0       ┆ 16.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 1.0  ┆ … ┆ 13.635    ┆ 0.8      ┆ 0.0       ┆ 40.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 2.0  ┆ … ┆ 13.635    ┆ 0.8      ┆ 0.0       ┆ 32.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 3.0  ┆ … ┆ 14.395    ┆ 0.75     ┆ 0.0       ┆ 13.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 4.0  ┆ … ┆ 14.395    ┆ 0.75     ┆ 0.0       ┆ 1.0   │
│ …      ┆ …    ┆ …     ┆ …    ┆ … ┆ …         ┆ …        ┆ …         ┆ …     │
│ spring ┆ 1.0  ┆ 12.0  ┆ 19.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 119.0 │
│ spring ┆ 1.0  ┆ 12.0  ┆ 20.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 89.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 21.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 90.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 22.0 ┆ … ┆ 13.635    ┆ 0.56     ┆ 8.9981    ┆ 61.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 23.0 ┆ … ┆ 13.635    ┆ 0.65     ┆ 8.9981    ┆ 49.0  │
└────────┴──────┴───────┴──────┴───┴───────────┴──────────┴───────────┴───────┘

@glemaitre
Member

It would be interesting to be able to return a polars frame.

That said, I think we should evaluate whether this is worth implementing as a new parser or whether we should simply convert the dataframe once it has been read by the pandas parser. Here are the points and questions that I think could help decide which way to go:

  • OpenML now stores the datasets in Parquet format, and I think we should switch from ARFF to the Parquet files (cf. Enable loading parquet file from OpenML #23361);
  • A pain point between the current liac-arff and pandas parsers is consistency regarding the data types. If we add a new parser, we would need to make sure it is consistent with the pandas dataframe to avoid introducing new inconsistencies. I don't know yet whether there are corner cases.
  • So before writing a new parser for ARFF files, I'm wondering whether we would actually get a performance gain (memory or speed) from using polars here. If the parsing itself does not improve drastically (considering the datasets available on OpenML), an alternative would be to keep parser="pandas" and add a parameter that triggers a pl.from_pandas conversion (see the sketch after this list). Repurposing as_frame to accept a string, "pandas" or "polars", to select the type of dataframe could be one way to do it.
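
For illustration, a minimal sketch of that conversion route, assuming the existing pandas parser is kept and polars is only used at the end (pl.from_pandas is the existing polars API; the as_frame="polars" spelling discussed above is hypothetical and not part of the current API):

import polars as pl
from sklearn.datasets import fetch_openml

# Read with the existing pandas parser, then convert; the conversion goes through Arrow.
bunch = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
pl_frame = pl.from_pandas(bunch.frame)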

When it comes to implementing a parser for the Parquet file, it is another story: the fact that polars has its own implementation for reading the file and does not depend on pyarrow would be an advantage.
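
As a rough illustration of that route, polars can read a Parquet file with its own native reader and no pyarrow dependency (the file path below is a placeholder, not an actual OpenML download):

import polars as pl

# "dataset.pq" stands in for a locally cached OpenML Parquet file (placeholder path).
frame = pl.read_parquet("dataset.pq")  # polars' native Rust reader; no pyarrow required
print(frame.schema)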

@deanm0000
Author

The polars CSV reader is orders of magnitude faster than pandas's CSV reader; here's just one source saying as much. I don't really use pandas, so I'm not well versed in the interoperability between polars and pandas, but I don't see why there would be issues going from polars to pandas for the few types available.
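
For instance, going from polars to pandas is a one-liner with the existing API (a sketch; to_pandas needs pandas, and by default pyarrow, to be installed):

import polars as pl

pl_df = pl.DataFrame({"season": ["spring", "summer"], "count": [16.0, 40.0]})
pd_df = pl_df.to_pandas()  # conversion goes through Arrow
print(pd_df.dtypes)        # season -> object, count -> float64 on the pandas side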

I'm with you in terms of not refining this and implementing the Parquet download instead (which I didn't know existed), but then the case against pandas only becomes stronger, since it is pyarrow that reads Parquet, not pandas. I think it would be fine to just return a pyarrow Table, or, for convenience, to have a parameter that picks the output: the loading would be done by pyarrow, and the parameter would dictate whether the Table is returned directly, converted to pandas, or converted to polars.
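
A minimal sketch of that shape, using only existing pyarrow/polars/pandas conversion calls (the file path is a placeholder and the output parameter is hypothetical, not an existing fetch_openml argument):

import polars as pl
import pyarrow.parquet as pq

table = pq.read_table("dataset.pq")  # placeholder path; pyarrow does the loading

output = "polars"  # hypothetical user-facing parameter
if output == "pyarrow":
    result = table
elif output == "pandas":
    result = table.to_pandas()
elif output == "polars":
    result = pl.from_arrow(table)  # usually zero-copy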

@adrinjalali
Member

I would say that if the user changes the parser, they should expect some inconsistency in dtypes. Such cases should be very rare, and we can try to make the parsers consistent, but different parsers mean different parsing and therefore, sometimes, different dtypes.

I also think it would be a pity for the polars parser to use pandas under the hood. If users want to do data frame stuff and not have pandas in their environment, they should be able to do so.

I do agree we should start reading the Parquet files, though with a very minimal dependency footprint there. We definitely don't want to require pyarrow to read Parquet files.

@glemaitre
Member

Basically, my feeling is therefore that we should not implement another ARFF parser, but rather write a proper parser for the Parquet file using polars.

We can still use the current ARFF/pandas parser when a pandas dataframe is requested. On that side, if pandas really does impose pyarrow as a dependency, then we can switch to Parquet without questioning it. If it ends up being a "light" Arrow version, then we have to check whether we can do so (I have not recently checked the thread on what is going on over there).

@deanm0000 deanm0000 closed this Mar 7, 2024
@adrinjalali
Member

Cc @MarcoGorelli

@MarcoGorelli
Contributor

> Cc @MarcoGorelli

Thanks for the ping!

I think it'd be amazing if fetch_openml could return a LazyFrame for Polars, so that Polars can apply predicate and projection pushdown for the IO part.
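
As a sketch of what that could enable (the Parquet path is a placeholder, and fetch_openml does not currently return a LazyFrame):

import polars as pl

# scan_parquet is lazy: only the requested columns and matching rows are read.
lazy = (
    pl.scan_parquet("dataset.pq")              # placeholder for a cached OpenML file
    .filter(pl.col("season") == "spring")      # predicate pushdown
    .select(["season", "feel_temp", "count"])  # projection pushdown
)
frame = lazy.collect()  # IO happens here, already pruned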

But reading with pandas and then converting to Polars for preprocessing (as has been done in #28601) is already really nice!
