
DOC use polars instead of pandas in plot_time_series_lagged_features #28601


Merged
merged 3 commits into scikit-learn:main on Apr 8, 2024

Conversation

@raisadz (Contributor) commented Mar 9, 2024

Reference Issues/PRs

Related to #28341

What does this implement/fix? Explain your changes.

Any other comments?

github-actions bot commented Mar 9, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 07a646a.


import polars as pl

df = pl.DataFrame({col: df[col].to_numpy() for col in df.columns})
@raisadz (Contributor, Author) commented Mar 9, 2024

I am doing this instead of pl.from_pandas() to avoid installing pyarrow.

@raisadz raisadz marked this pull request as ready for review March 9, 2024 18:17
@adrinjalali (Member)

Somehow I don't really love the idea of having to use both pandas and polars in the same script. It kinda defeats the purpose.

@MarcoGorelli (Contributor) commented Mar 11, 2024

Hey,

@glemaitre had written in the Polars Discord

I think it would be worth adding a pl.from_pandas in the example. This would easily be accepted, as it would be visible in the documentation and in the next release for sure. Modifying fetch_openml might require more discussion regarding the type of files to read (ARFF vs. parquet), directly accessing the HTTP endpoint or caching the file, etc. So that process will be slower.

https://discord.com/channels/908022250106667068/1214872002238742529/1215228420292870164

which is why I encouraged @raisadz to try this out


It kinda defeats the purpose.

Anecdotally, I've seen several colleagues and clients take a predominantly pandas pipeline, rewrite the most memory-intensive part to Polars (whilst keeping the rest in pandas), and find that that's already enough to avoid out-of-memory issues.

@glemaitre (Member)

I would be fine with the conversion from pandas to polars for the moment. I think we should make it explicit that the conversion is temporary until fetch_openml natively loads a polars frame. Once we have this feature in, the example will be perfect.

It at least shows that scikit-learn accepts such input.

It is definitely not perfect, but it is better than not having a single example shown while awaiting the fetch_openml upgrade, which might not happen for the next release.

@adrinjalali WDYT?
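
As an illustration of "accepts such input": a minimal sketch (not part of the PR) of fitting a scikit-learn estimator directly on a Polars DataFrame, assuming a recent scikit-learn and Polars installation. The data and column names are made up; the estimator mirrors the one used in the example.

import polars as pl
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy data, for illustration only; the point is that `fit` and `predict`
# are called directly on a Polars DataFrame.
X = pl.DataFrame(
    {"lagged_count_1h": [3.0, 5.0, 8.0, 4.0, 9.0], "hour": [0, 1, 2, 3, 4]}
)
y = pl.Series("count", [5.0, 8.0, 4.0, 9.0, 7.0])

model = HistGradientBoostingRegressor().fit(X, y)
print(model.predict(X))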

@adrinjalali (Member)

@MarcoGorelli what's the status of being able to read parquet with polars w/o depending on pyarrow right now?

@MarcoGorelli (Contributor) commented Mar 11, 2024

pyarrow isn't required by Polars for reading Parquet files

Here's an example of using Polars to read the greatest data science dataset in the world (no pyarrow installed!):

In [1]: import polars as pl

In [2]: pl.read_parquet('iris.parquet').head()
Out[2]:
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

In [3]: import pyarrow
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 1
----> 1 import pyarrow

ModuleNotFoundError: No module named 'pyarrow'

@raisadz raisadz force-pushed the polars_features branch 2 times, most recently from 21276cc to fe1e557 Compare March 13, 2024 21:16
    loc.body(col, extract_numeric(col) == list_min(extract_numeric(col)))
    for col in styled_df.columns
    if col != "loss"
],
Contributor Author

Replacing the pandas styling with great-tables, as it has native support for Polars. Apart from fetch_openml, this completely replaces pandas.
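
For readers unfamiliar with great-tables, a minimal sketch of the styling being described: bolding the minimum of each metric column of a Polars DataFrame. The data and column names are illustrative, not the PR's actual code.

import polars as pl
from great_tables import GT, loc, style

# Illustrative scores table; the real one is built from cross-validation results.
scores_df = pl.DataFrame(
    {
        "loss": ["squared_error", "absolute_error"],
        "RMSE": [60.6, 62.3],
        "MAPE": [0.322, 0.283],
    }
)

# great-tables accepts Polars DataFrames directly, and Polars expressions
# can be used to select the rows to style (here, the minimum of each metric).
gt_styled_df = GT(scores_df)
for col in ["RMSE", "MAPE"]:
    gt_styled_df = gt_styled_df.tab_style(
        style=style.text(weight="bold"),
        locations=loc.body(columns=col, rows=pl.col(col) == pl.col(col).min()),
    )
gt_styled_df  # renders as an HTML table in a notebook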

@raisadz (Contributor, Author) commented Mar 13, 2024

Thank you for your reviews. I added a comment noting that fetch_openml loads a pandas dataframe and that it will be replaced with Polars once Polars is natively supported. I replaced the pandas styling with great-tables, as it has Polars support; apart from fetch_openml, this should replace pandas in this notebook.

@raisadz (Contributor, Author) commented Mar 14, 2024

I updated the lock files to add great-tables by running build_tools/update_environments_and_lock_files.py and added a Polars version check to handle the deprecated syntax.

@raisadz (Contributor, Author) commented Mar 19, 2024

I removed the Polars version check and replaced it with the list.get function instead of list.gather (and list.take for earlier versions). Please let me know if you are fine with adding the great-tables dependency to replace the pandas styler. Otherwise, I can roll back to the previous version, or I can just print Polars dataframes directly. Please let me know which option you would prefer.
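
For context, a small sketch (toy data, illustrative only) of the list.get approach: it extracts one element per list and keeps the same name across the Polars versions discussed here, whereas list.take was renamed to list.gather, which is what required the version check.

import polars as pl

df = pl.DataFrame({"score": ["0.044 ± 0.003", "0.675 ± 0.003"]})

# Split the "mean ± std" strings and pull out single elements with `list.get`,
# avoiding `list.gather`/`list.take`, whose name differs between Polars versions.
parts = pl.col("score").str.split(" ")
print(
    df.with_columns(
        mean=parts.list.get(0).cast(pl.Float64),
        std=parts.list.get(2).cast(pl.Float64),
    )
)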

for col in cols_to_convert:
    mask[col] = df[col].apply(
        lambda x: "font-weight: bold" if x == min_values[col] else ""
gt_styled_df = GT(styled_df)
Member

For sure, we don't want the extra dependency on great-tables. We should instead have a way to do the same with the currently installed libraries.

Contributor Author

It seems that there is no way to print something in bold in Polars, which is why I added great-tables, which has this support.

Should I then convert the dataframe back to pandas when there is a styled output? Though I thought you don't want to have both pandas and Polars dataframes (#28601 (comment))?

Or do you know about some other way to print that table with final metrics with bold styling?

Alternatively, should I display the dataframe without styling? I can add the minimums as a separate dataframe below.

Member

I guess we would rather lose the bold styling than add the extra dependency or jump back and forth from pandas. Personally, I vote for the last option then.

Contributor Author

Thanks for your reviews. I removed the styling with great-tables and added a dataframe displaying the losses that minimise each metric.

Comment on lines 25 to 34
# We convert to Polars for feature engineering, as it presents several advantages
# here:
#
# - It parallelises the expressions within the `with_columns` statement.
# - It automatically caches common subexpressions which are reused in multiple
# expressions (like `pl.col("count").shift(1)` below). See
# https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.
# - It is very memory-efficient.
# - It is stricter about data types and schemas, allowing us to catch errors earlier
# in our development workflow.
Member

I don't think that this is our place to emphasize why one should use polars. I think that the example should only concentrate on the "consumer" side for a scikit-learn user, meaning that the underlying model will deal with a polars dataframe.

I think here the note regarding the caveats for fetch_openml is enough.

Contributor

Hey 👋

I just wanted to flag that when I opened #28576, one of the comments was

maybe we can highlight the benefit with a comment in the code or as narrative text

So, just looping in @ArturoAmorQ to make sure that scikit-learn devs are aligned here about highlighting benefits in examples, and that you're on the same page about the direction this is going in 🤗

Member

I agree with @glemaitre that comments regarding the benefits should remain as minimalistic as possible to keep the focus on the example's main narrative. Maybe in this case in particular the most relevant reason is the caching of common subexpressions?

Contributor Author

I left only the caching of common subexpressions and removed other reasons for using Polars.
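
As an aside, a miniature sketch of the `with_columns` pattern and the repeated `pl.col("count").shift(1)` subexpression referred to above; the column names and data are illustrative, and the common-subexpression caching applies to the lazy engine.

import polars as pl

df = pl.DataFrame({"count": [3.0, 5.0, 8.0, 4.0, 9.0, 7.0, 2.0, 6.0]})

# The subexpression pl.col("count").shift(1) appears in several output columns;
# in lazy mode Polars can cache it instead of recomputing it for each column.
lagged_df = (
    df.lazy()
    .with_columns(
        lagged_count_1h=pl.col("count").shift(1),
        lagged_count_2h=pl.col("count").shift(2),
        lagged_mean_3h=pl.col("count").shift(1).rolling_mean(3),
        lagged_max_3h=pl.col("count").shift(1).rolling_max(3),
    )
    .collect()
)
print(lagged_df)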

@raisadz raisadz force-pushed the polars_features branch 6 times, most recently from 4395874 to a7b5a9b Compare March 30, 2024 19:10
@ArturoAmorQ (Member) left a comment

Thanks for implementing the changes @raisadz, here's another batch of comments.

# Even if the score distributions overlap due to the variance in the dataset,
# it is true that the average RMSE is lower when `loss="squared_error"`, whereas
# the average MAPE is lower when `loss="absolute_error"` as expected
# (the same is true for the quantile 50). That is
Member

"the same is true for the quantile 50" would mean, given the context, that pinball_loss_50 is minimized by "quantile 50", which is not the case.

# We start by loading the data from the OpenML repository.
# We start by loading the data from the OpenML repository
# as a pandas dataframe. This will be replaced with Polars
# after `fetch_openml` will add a native support for it.
Member

Suggested change
# after `fetch_openml` will add a native support for it.
# once `fetch_openml` adds a native support for it.

Comment on lines +270 to +271
col_split.list.get(0).cast(pl.Float64),
col_split.list.get(2).cast(pl.Float64),
Member

Honest question: why does the casting here change the results so drastically? What is the dtype otherwise?

Contributor Author

Thank you for your reviews @ArturoAmorQ. It looks like the fit times are slightly lower for Polars (apart from the squared loss), but otherwise there are no differences in the final table. Before casting, the dtype is string (pl.Utf8). In the extract_numeric function that was used before, the mean and std were also cast to float before finding the minimum:

def extract_numeric(value):
    parts = value.split("±")
    mean_value = float(parts[0])
    std_value = float(parts[1].split()[0])

    return mean_value, std_value

Comment on lines +275 to +279
scores_df.select(
    pl.col("loss").get(min_arg(col_name)).alias(col_name)
    for col_name in scores_df.columns
    if col_name != "loss"
)
Member

This is a clever solution, though :)
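
To make the quoted snippet self-contained, here is one way the pieces could fit together on toy data. The min_arg helper below is a reconstruction from the two diff excerpts above (split the "mean ± std" strings, then sort by mean and std), not necessarily the PR's exact code, and the scores are made up.

import polars as pl

# Illustrative "mean ± std" scores per loss function.
scores_df = pl.DataFrame(
    {
        "loss": ["squared_error", "absolute_error"],
        "MAPE": ["0.322 ± 0.005", "0.283 ± 0.004"],
        "RMSE": ["60.6 ± 1.1", "62.3 ± 1.2"],
    }
)


def min_arg(col_name):
    # Index of the row with the smallest mean (ties broken by the std).
    col_split = pl.col(col_name).str.split(" ")
    return pl.arg_sort_by(
        col_split.list.get(0).cast(pl.Float64),
        col_split.list.get(2).cast(pl.Float64),
    ).first()


# For each metric column, report the loss that minimises it.
print(
    scores_df.select(
        pl.col("loss").get(min_arg(col_name)).alias(col_name)
        for col_name in scores_df.columns
        if col_name != "loss"
    )
)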

@ArturoAmorQ (Member) left a comment

LGTM! Any further comments on this one @glemaitre?

@glemaitre (Member)

Let me have another, more in-depth look, but it looks good at first glance.

@glemaitre glemaitre self-requested a review April 8, 2024 12:58
@glemaitre (Member) left a comment

LGTM. We can further improve when we make a good OpenML fetcher using the parquet files.

@glemaitre glemaitre changed the title from "DOC: replace pandas with polars for feature engineering in examples/applications/plot_time_series_lagged_features.py" to "DOC use polars instead of pandas in plot_time_series_lagged_features" on Apr 8, 2024
@glemaitre glemaitre enabled auto-merge (squash) April 8, 2024 19:10
@glemaitre glemaitre merged commit d1d1596 into scikit-learn:main Apr 8, 2024
32 checks passed