Skip to content

DOC replace pandas with Polars in examples/gaussian_process/plot_gpr_co2.py #28804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 17, 2024

Conversation

raisadz
Copy link
Contributor

@raisadz raisadz commented Apr 10, 2024

Reference Issues/PRs

Related to #28341

What does this implement/fix? Explain your changes.

Replacing pandas with Polars allows to avoid checking pandas version when resampling monthly CO2.

Copy link

github-actions bot commented Apr 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b2b6aef. Link to the linter CI: here

@raisadz raisadz force-pushed the polars_co2_example branch from 40e30a1 to 96a6ae9 Compare April 10, 2024 14:13
@raisadz raisadz force-pushed the polars_co2_example branch from 96a6ae9 to 77bdbd8 Compare April 10, 2024 14:16
@raisadz raisadz changed the title DOC replacing pandas with Polars in examples/gaussian_process/plot_gpr_co2.py DOC replace pandas with Polars in examples/gaussian_process/plot_gpr_co2.py Apr 10, 2024
@raisadz raisadz marked this pull request as ready for review April 10, 2024 14:47
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

co2_data = co2.frame
co2_data["date"] = pd.to_datetime(co2_data[["year", "month", "day"]])
co2_data = co2_data[["date", "co2"]].set_index("date")
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a round trip here? pl.DataFrame(pd.DataFrame) doesn't work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. Converting from/to pandas requires pyarrow as a dependency, and because scikit-learn doesn't have it, the following error will occur when trying to convert a pandas dataframe to Polars:

ModuleNotFoundError: pa.array requires 'pyarrow' module to be installed

There is an old related issue converting pl.to_pandas pola-rs/polars#3398

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue on the polars side: pola-rs/polars#15845

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we can do:

Suggested change
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})
co2_data = pl.from_dataframe(co2.frame)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like pl.from_dataframe still needs pyarrow for older Polars versions and the checks fail for the minimum Polars version 0.19.12,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't really depend on polars and it's only used in our CI for docs, I'd be in favor of moving the min version to something that supports this.

WDYT @ogrisel @thomasjpfan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that pl.from_dataframe works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that pl.from_dataframe works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.

I am okay with this option.

raisadz and others added 5 commits April 25, 2024 15:15
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this PR now uses pola-rs/polars#15933 and directly calls pl.DataFrame which doesn't require PyArrow anymore for the types in the input here.

However, that means minimum required version is the one released yesterday (28.04.2024). I'm personally okay with this. But I'd rather have another set of eyes on this.

@adrinjalali
Copy link
Member

@thomasjpfan WDYT here now?

@adrinjalali
Copy link
Member

Let's merge this one then.

@adrinjalali adrinjalali merged commit e825502 into scikit-learn:main May 17, 2024
30 checks passed
@glemaitre
Copy link
Member

Really nice. Thanks @raisadz

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request May 20, 2024
…co2.py (scikit-learn#28804)

Co-authored-by: raisa <>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@jeremiedbb jeremiedbb mentioned this pull request May 20, 2024
14 tasks
jeremiedbb pushed a commit that referenced this pull request May 21, 2024
…co2.py (#28804)

Co-authored-by: raisa <>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants