Support polars parser in fetch_openml #28586
It would be interesting to be able to return some That said, I think we should evaluate whether it is worth implementing a new parser or only converting the dataframe once it has been read by the pandas parser. So I have the following points or questions that I think could help decide which way to go:
When it comes to implementing a parser to read the parquet file, this is another story where having
The polars csv reader is many orders of magnitude faster than pandas' csv reader. Here's just one source saying as much. I don't really use pandas, so I'm not well versed in interoperability between polars and pandas, but I don't see why there would be issues going from polars to pandas for the few types available. I'm with you on not refining this and implementing the parquet download instead (which I didn't know existed), but then the case against pandas only becomes stronger, since it is pyarrow that reads parquet files, not pandas. I think it'd be fine to just return a pyarrow Table or, for convenience, to have a parameter to pick the output: the loading would be done by pyarrow, and the parameter would dictate whether the result is returned directly, turned into pandas, or turned into polars.
I would say, if the user changes the I also think it would be a pity for the I do agree we should start reading the parquet files though, but with a very minimal dependency there. Definitely don't want to require pyarrow to read parquet files.
Basically, my feeling is therefore to not implement another ARFF parser but to write a proper parser for the parquet file using We can still use the current ARFF-pandas parser when a pandas dataframe is requested. On this side, if pandas really imposes pyarrow as a dependency, then we can switch to parquet without hesitation. If this is a "light" arrow version, then we have to check that we can do so (I have not recently checked the thread on what is going on on that side).
Thanks for the ping! I think it'd be amazing if But reading with pandas and then converting to Polars for preprocessing (as has been done in #28601) is already really nice!
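For reference, the "read with pandas, then convert" route could look roughly like the sketch below. The wrapper name `fetch_openml_as_polars` is hypothetical; it assumes scikit-learn, pandas, polars, and pyarrow (which `pl.from_pandas` relies on) are installed, and calling it needs network access to OpenML.

```python
def fetch_openml_as_polars(name, version=1):
    """Fetch an OpenML dataset via the pandas parser and convert it to Polars."""
    # Local imports keep the optional dependencies out of module import time.
    import polars as pl
    from sklearn.datasets import fetch_openml

    # Existing behaviour: parse the dataset into a pandas DataFrame.
    bunch = fetch_openml(name, version=version, as_frame=True, parser="pandas")

    # Conversion step: hand the pandas frame (features and target) to Polars.
    return pl.from_pandas(bunch.frame)
```

The data is parsed once by pandas and copied once on conversion, so this is a convenience rather than a speed-up.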
What does this implement/fix? Explain your changes.
It allows polars to be selected as a `parser` in `fetch_openml`.