Support polars parser in fetch_openml #28586

Closed

Conversation

deanm0000

What does this implement/fix? Explain your changes.

It allows polars to be selected as a parser in fetch_openml:

from sklearn.datasets import fetch_openml
bike_sharing = fetch_openml(
    "Bike_Sharing_Demand", version=2, as_frame=True, parser="polars"
)
bike_sharing.frame
shape: (17_379, 13)
┌────────┬──────┬───────┬──────┬───┬───────────┬──────────┬───────────┬───────┐
│ season ┆ year ┆ month ┆ hour ┆ … ┆ feel_temp ┆ humidity ┆ windspeed ┆ count │
│ ---    ┆ ---  ┆ ---   ┆ ---  ┆   ┆ ---       ┆ ---      ┆ ---       ┆ ---   │
│ enum   ┆ f64  ┆ f64   ┆ f64  ┆   ┆ f64       ┆ f64      ┆ f64       ┆ f64   │
╞════════╪══════╪═══════╪══════╪═══╪═══════════╪══════════╪═══════════╪═══════╡
│ spring ┆ 0.0  ┆ 1.0   ┆ 0.0  ┆ … ┆ 14.395    ┆ 0.81     ┆ 0.0       ┆ 16.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 1.0  ┆ … ┆ 13.635    ┆ 0.8      ┆ 0.0       ┆ 40.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 2.0  ┆ … ┆ 13.635    ┆ 0.8      ┆ 0.0       ┆ 32.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 3.0  ┆ … ┆ 14.395    ┆ 0.75     ┆ 0.0       ┆ 13.0  │
│ spring ┆ 0.0  ┆ 1.0   ┆ 4.0  ┆ … ┆ 14.395    ┆ 0.75     ┆ 0.0       ┆ 1.0   │
│ …      ┆ …    ┆ …     ┆ …    ┆ … ┆ …         ┆ …        ┆ …         ┆ …     │
│ spring ┆ 1.0  ┆ 12.0  ┆ 19.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 119.0 │
│ spring ┆ 1.0  ┆ 12.0  ┆ 20.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 89.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 21.0 ┆ … ┆ 12.88     ┆ 0.6      ┆ 11.0014   ┆ 90.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 22.0 ┆ … ┆ 13.635    ┆ 0.56     ┆ 8.9981    ┆ 61.0  │
│ spring ┆ 1.0  ┆ 12.0  ┆ 23.0 ┆ … ┆ 13.635    ┆ 0.65     ┆ 8.9981    ┆ 49.0  │
└────────┴──────┴───────┴──────┴───┴───────────┴──────────┴───────────┴───────┘

@glemaitre
Member

It would be interesting to be able to return a polars frame.

That said, I think we should evaluate whether this is worth implementing as a new parser or whether we should simply convert the dataframe once it has been read by the pandas parser. Here are the points and questions that I think could help decide which way to go:

  • OpenML now stores the datasets in Parquet format, and I think we should switch from ARFF to the Parquet files (cf. Enable loading parquet file from OpenML #23361);
  • A pain point between the current liac-arff and pandas parsers is consistency regarding the data types. If we add a new parser, we would need to make sure it is consistent with the pandas dataframe to avoid introducing new inconsistencies. I don't know yet whether there are corner cases.
  • So before writing a new parser for ARFF files, I'm wondering whether we would actually get a performance gain (memory or speed) from using polars here. If the parsing itself does not improve drastically (considering the datasets available on OpenML), an alternative would be to keep parser="pandas" and add a parameter that triggers a pl.from_pandas conversion (see the sketch after this list). Repurposing as_frame to accept a string, "pandas" or "polars", to select the type of dataframe could be one way to do it.
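
For illustration, a minimal sketch of that conversion route, assuming the existing pandas parser is kept and polars is only used at the end (pl.from_pandas is the existing polars API; the as_frame="polars" spelling discussed above is hypothetical and not part of the current API):

import polars as pl
from sklearn.datasets import fetch_openml

# Read with the existing pandas parser, then convert; the conversion goes through Arrow.
bunch = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
pl_frame = pl.from_pandas(bunch.frame)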

When it comes to implementing a parser for the Parquet file, it is another story: the fact that polars has its own implementation for reading the file and does not depend on pyarrow would be an advantage.
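
As a rough illustration of that route, polars can read a Parquet file with its own native reader and no pyarrow dependency (the file path below is a placeholder, not an actual OpenML download):

import polars as pl

# "dataset.pq" stands in for a locally cached OpenML Parquet file (placeholder path).
frame = pl.read_parquet("dataset.pq")  # polars' native Rust reader; no pyarrow required
print(frame.schema)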

@deanm0000
Author

The polars CSV reader is orders of magnitude faster than pandas's CSV reader; here's just one source saying as much. I don't really use pandas, so I'm not well versed in the interoperability between polars and pandas, but I don't see why there would be issues going from polars to pandas for the few types available.
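
For instance, going from polars to pandas is a one-liner with the existing API (a sketch; to_pandas needs pandas, and by default pyarrow, to be installed):

import polars as pl

pl_df = pl.DataFrame({"season": ["spring", "summer"], "count": [16.0, 40.0]})
pd_df = pl_df.to_pandas()  # conversion goes through Arrow
print(pd_df.dtypes)        # season -> object, count -> float64 on the pandas side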

I'm with you in terms of not refining this and implementing the Parquet download instead (which I didn't know existed), but then the case against pandas only becomes stronger, since it is pyarrow that reads Parquet, not pandas. I think it would be fine to just return a pyarrow Table, or, for convenience, to have a parameter that picks the output: the loading would be done by pyarrow, and the parameter would dictate whether the Table is returned directly, converted to pandas, or converted to polars.
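
A minimal sketch of that shape, using only existing pyarrow/polars/pandas conversion calls (the file path is a placeholder and the output parameter is hypothetical, not an existing fetch_openml argument):

import polars as pl
import pyarrow.parquet as pq

table = pq.read_table("dataset.pq")  # placeholder path; pyarrow does the loading

output = "polars"  # hypothetical user-facing parameter
if output == "pyarrow":
    result = table
elif output == "pandas":
    result = table.to_pandas()
elif output == "polars":
    result = pl.from_arrow(table)  # usually zero-copy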

@adrinjalali
Member

I would say that if the user changes the parser, they should expect some inconsistency in dtypes. Such cases should be very rare, and we can try to make the parsers consistent, but different parsers mean different parsing and therefore, sometimes, different dtypes.

I also think it would be a pity for the polars parser to use pandas under the hood. If users want to do data frame stuff and not have pandas in their environment, they should be able to do so.

I do agree we should start reading the Parquet files, though with a very minimal dependency footprint there. We definitely don't want to require pyarrow to read Parquet files.

@glemaitre
Member

Basically, my feeling is therefore that we should not implement another ARFF parser, but rather write a proper parser for the Parquet file using polars.

We can still use the current ARFF/pandas parser when a pandas dataframe is requested. On that side, if pandas really does impose pyarrow as a dependency, then we can switch to Parquet without questioning it. If it ends up being a "light" Arrow version, then we have to check whether we can do so (I have not recently checked the thread on what is going on over there).

@deanm0000 deanm0000 closed this Mar 7, 2024
@adrinjalali
Member

Cc @MarcoGorelli

@MarcoGorelli
Contributor

> Cc @MarcoGorelli

Thanks for the ping!

I think it'd be amazing if fetch_openml could return a LazyFrame for Polars, so that Polars can apply predicate and projection pushdown for the IO part.
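
As a sketch of what that could enable (the Parquet path is a placeholder, and fetch_openml does not currently return a LazyFrame):

import polars as pl

# scan_parquet is lazy: only the requested columns and matching rows are read.
lazy = (
    pl.scan_parquet("dataset.pq")              # placeholder for a cached OpenML file
    .filter(pl.col("season") == "spring")      # predicate pushdown
    .select(["season", "feel_temp", "count"])  # projection pushdown
)
frame = lazy.collect()  # IO happens here, already pruned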

But reading with pandas and then converting to Polars for preprocessing (as has been done in #28601) is already really nice!
