-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DOC replace pandas with Polars in examples/gaussian_process/plot_gpr_co2.py #28804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
40e30a1
to
96a6ae9
Compare
96a6ae9
to
77bdbd8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
co2_data = co2.frame | ||
co2_data["date"] = pd.to_datetime(co2_data[["year", "month", "day"]]) | ||
co2_data = co2_data[["date", "co2"]].set_index("date") | ||
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a round trip here? pl.DataFrame(pd.DataFrame)
doesn't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review. Converting from/to pandas requires pyarrow as a dependency, and because scikit-learn doesn't have it, the following error will occur when trying to convert a pandas dataframe to Polars:
ModuleNotFoundError: pa.array requires 'pyarrow' module to be installed
There is an old related issue converting pl.to_pandas
pola-rs/polars#3398
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened an issue on the polars side: pola-rs/polars#15845
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we can do:
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns}) | |
co2_data = pl.from_dataframe(co2.frame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like pl.from_dataframe
still needs pyarrow for older Polars versions and the checks fail for the minimum Polars version 0.19.12,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't really depend on polars and it's only used in our CI for docs, I'd be in favor of moving the min version to something that supports this.
WDYT @ogrisel @thomasjpfan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that pl.from_dataframe
works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that'd be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that pl.from_dataframe works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.
I am okay with this option.
Use `pl.from_dataframe()` instead if converting to `numpy` Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…into polars_co2_example
…version from 0.19.12 to 0.20.23
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this PR now uses pola-rs/polars#15933 and directly calls pl.DataFrame
which doesn't require PyArrow anymore for the types in the input here.
However, that means minimum required version is the one released yesterday (28.04.2024). I'm personally okay with this. But I'd rather have another set of eyes on this.
@thomasjpfan WDYT here now? |
Let's merge this one then. |
Really nice. Thanks @raisadz |
…co2.py (scikit-learn#28804) Co-authored-by: raisa <> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…co2.py (#28804) Co-authored-by: raisa <> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Reference Issues/PRs
Related to #28341
What does this implement/fix? Explain your changes.
Replacing pandas with Polars allows to avoid checking pandas version when resampling monthly CO2.