
DOC use polars instead of pandas in plot_time_series_lagged_features #28601


Merged
merged 3 commits into scikit-learn:main on Apr 8, 2024

Conversation

@raisadz (Contributor) commented Mar 9, 2024

Reference Issues/PRs

Related to #28341

What does this implement/fix? Explain your changes.

Any other comments?

github-actions bot commented Mar 9, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 07a646a.


import polars as pl

df = pl.DataFrame({col: df[col].to_numpy() for col in df.columns})
@raisadz (Contributor, Author) commented Mar 9, 2024

I am doing this instead of pl.from_pandas() to avoid installing pyarrow.

@raisadz raisadz marked this pull request as ready for review March 9, 2024 18:17
@adrinjalali (Member)

Somehow I don't really love the idea of having to use both pandas and polars in the same script. It kinda defeats the purpose.

@MarcoGorelli (Contributor) commented Mar 11, 2024

Hey,

@glemaitre had written in the Polars Discord

I think it would be worth adding a pl.from_pandas in the example. This would easily be accepted, as it would be visible in the documentation and in the next release for sure. Modifying fetch_openml might require more discussion regarding the type of files to read (ARFF vs. parquet), directly accessing the HTTP endpoint or caching the file, etc. So that process will be slower.

https://discord.com/channels/908022250106667068/1214872002238742529/1215228420292870164

which is why I encouraged @raisadz to try this out


It kinda defeats the purpose.

Anecdotally, I've seen several colleagues and clients take a predominantly pandas pipeline, rewrite the most memory-intensive part to Polars (whilst keeping the rest in pandas), and find that that's already enough to avoid out-of-memory issues.

@glemaitre (Member)

I would be fine with the conversion from pandas to polars for the moment. I think we should make it explicit that the conversion is temporary until fetch_openml natively loads a polars frame. Once we have this feature in, the example will be perfect.

It at least shows that scikit-learn accepts such input.

It is definitely not perfect, but it is better than not having a single example shown while awaiting the fetch_openml upgrade, which might not happen for the next release.

@adrinjalali WDYT?
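
As an illustration of "accepts such input": a minimal sketch (not part of the PR) of fitting a scikit-learn estimator directly on a Polars DataFrame, assuming a recent scikit-learn and Polars installation. The data and column names are made up; the estimator mirrors the one used in the example.

import polars as pl
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy data, for illustration only; the point is that `fit` and `predict`
# are called directly on a Polars DataFrame.
X = pl.DataFrame(
    {"lagged_count_1h": [3.0, 5.0, 8.0, 4.0, 9.0], "hour": [0, 1, 2, 3, 4]}
)
y = pl.Series("count", [5.0, 8.0, 4.0, 9.0, 7.0])

model = HistGradientBoostingRegressor().fit(X, y)
print(model.predict(X))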

@adrinjalali (Member)

@MarcoGorelli what's the status of being able to read parquet with polars w/o depending on pyarrow right now?

@MarcoGorelli (Contributor) commented Mar 11, 2024

pyarrow isn't required by Polars for reading Parquet files

Here's an example of using Polars to read the greatest data science dataset in the world (no pyarrow installed!):

In [1]: import polars as pl

In [2]: pl.read_parquet('iris.parquet').head()
Out[2]:
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

In [3]: import pyarrow
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 1
----> 1 import pyarrow

ModuleNotFoundError: No module named 'pyarrow'

@raisadz raisadz force-pushed the polars_features branch 2 times, most recently from 21276cc to fe1e557 Compare March 13, 2024 21:16
    loc.body(col, extract_numeric(col) == list_min(extract_numeric(col)))
    for col in styled_df.columns
    if col != "loss"
],
Contributor Author

Replacing the pandas styling with great-tables, as it has native support for Polars. Apart from fetch_openml, this completely replaces pandas.
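
For readers unfamiliar with great-tables, a minimal sketch of the styling being described: bolding the minimum of each metric column of a Polars DataFrame. The data and column names are illustrative, not the PR's actual code.

import polars as pl
from great_tables import GT, loc, style

# Illustrative scores table; the real one is built from cross-validation results.
scores_df = pl.DataFrame(
    {
        "loss": ["squared_error", "absolute_error"],
        "RMSE": [60.6, 62.3],
        "MAPE": [0.322, 0.283],
    }
)

# great-tables accepts Polars DataFrames directly, and Polars expressions
# can be used to select the rows to style (here, the minimum of each metric).
gt_styled_df = GT(scores_df)
for col in ["RMSE", "MAPE"]:
    gt_styled_df = gt_styled_df.tab_style(
        style=style.text(weight="bold"),
        locations=loc.body(columns=col, rows=pl.col(col) == pl.col(col).min()),
    )
gt_styled_df  # renders as an HTML table in a notebook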

@raisadz (Contributor, Author) commented Mar 13, 2024

Thank you for your reviews. I added a comment noting that fetch_openml loads a pandas dataframe and that it will be replaced with Polars once Polars is natively supported. I replaced the pandas styling with great-tables, as it has Polars support; apart from fetch_openml, this should replace pandas in this notebook.

@raisadz (Contributor, Author) commented Mar 14, 2024

I updated the lock files to add great-tables by running build_tools/update_environments_and_lock_files.py and added a Polars version check to handle the deprecated syntax.

@raisadz (Contributor, Author) commented Mar 19, 2024

I removed the Polars version check and replaced it with the list.get function instead of list.gather (and list.take for earlier versions). Please let me know if you are fine with adding the great-tables dependency to replace the pandas styler. Otherwise, I can roll back to the previous version, or I can just print Polars dataframes directly. Please let me know which option you would prefer.
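
For context, a small sketch (toy data, illustrative only) of the list.get approach: it extracts one element per list and keeps the same name across the Polars versions discussed here, whereas list.take was renamed to list.gather, which is what required the version check.

import polars as pl

df = pl.DataFrame({"score": ["0.044 ± 0.003", "0.675 ± 0.003"]})

# Split the "mean ± std" strings and pull out single elements with `list.get`,
# avoiding `list.gather`/`list.take`, whose name differs between Polars versions.
parts = pl.col("score").str.split(" ")
print(
    df.with_columns(
        mean=parts.list.get(0).cast(pl.Float64),
        std=parts.list.get(2).cast(pl.Float64),
    )
)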

for col in cols_to_convert:
    mask[col] = df[col].apply(
        lambda x: "font-weight: bold" if x == min_values[col] else ""
gt_styled_df = GT(styled_df)
Member

For sure, we don't want the extra dependency on great-tables. We should instead have a way to do the same with the currently installed libraries.

Contributor Author

It seems that there is no way to print something in bold in Polars, which is why I added great-tables, which has this support.

Should I then convert the dataframe back to pandas when there is a styled output? Though I thought you don't want to have both pandas and Polars dataframes (#28601 (comment))?

Or do you know about some other way to print that table with final metrics with bold styling?

Alternatively, should I display the dataframe without styling? I can add the minimums as a separate dataframe below.

Member

I guess we would rather lose the bold styling than add the extra dependency or jump back and forth from pandas. Personally, I vote for the last option then.

Contributor Author

Thanks for your reviews. I removed the styling with great-tables and added a dataframe displaying the losses that minimise each metric.

Comment on lines 25 to 34
# We convert to Polars for feature engineering, as it presents several advantages
# here:
#
# - It parallelises the expressions within the `with_columns` statement.
# - It automatically caches common subexpressions which are reused in multiple
# expressions (like `pl.col("count").shift(1)` below). See
# https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.
# - It is very memory-efficient.
# - It is stricter about data types and schemas, allowing us to catch errors earlier
# in our development workflow.
Member

I don't think that this is our place to emphasize why one should use polars. I think that the example should only concentrate on the "consumer" side for a scikit-learn user, meaning that the underlying model will deal with a polars dataframe.

I think here the note regarding the caveats for fetch_openml is enough.

Contributor

Hey 👋

I just wanted to flag that when I opened #28576, one of the comments was

maybe we can highlight the benefit with a comment in the code or as narrative text

So, just looping in @ArturoAmorQ to make sure that scikit-learn devs are aligned here about highlighting benefits in examples, and that you're on the same page about the direction this is going in 🤗

Member

I agree with @glemaitre that comments regarding the benefits should remain as minimalistic as possible to keep the focus on the example's main narrative. Maybe in this case in particular the most relevant reason is the caching of common subexpressions?

Contributor Author

I left only the caching of common subexpressions and removed other reasons for using Polars.
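
As an aside, a miniature sketch of the `with_columns` pattern and the repeated `pl.col("count").shift(1)` subexpression referred to above; the column names and data are illustrative, and the common-subexpression caching applies to the lazy engine.

import polars as pl

df = pl.DataFrame({"count": [3.0, 5.0, 8.0, 4.0, 9.0, 7.0, 2.0, 6.0]})

# The subexpression pl.col("count").shift(1) appears in several output columns;
# in lazy mode Polars can cache it instead of recomputing it for each column.
lagged_df = (
    df.lazy()
    .with_columns(
        lagged_count_1h=pl.col("count").shift(1),
        lagged_count_2h=pl.col("count").shift(2),
        lagged_mean_3h=pl.col("count").shift(1).rolling_mean(3),
        lagged_max_3h=pl.col("count").shift(1).rolling_max(3),
    )
    .collect()
)
print(lagged_df)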

@raisadz raisadz force-pushed the polars_features branch 6 times, most recently from 4395874 to a7b5a9b Compare March 30, 2024 19:10
@ArturoAmorQ (Member) left a comment

Thanks for implementing the changes @raisadz, here's another batch of comments.

# Even if the score distributions overlap due to the variance in the dataset,
# it is true that the average RMSE is lower when `loss="squared_error"`, whereas
# the average MAPE is lower when `loss="absolute_error"` as expected
# (the same is true for the quantile 50). That is
Member

"the same is true for the quantile 50" would mean, given the context, that pinball_loss_50 is minimized by "quantile 50", which is not the case.

# We start by loading the data from the OpenML repository.
# We start by loading the data from the OpenML repository
# as a pandas dataframe. This will be replaced with Polars
# after `fetch_openml` will add a native support for it.
Member

Suggested change
# after `fetch_openml` will add a native support for it.
# once `fetch_openml` adds a native support for it.

Comment on lines +270 to +271
col_split.list.get(0).cast(pl.Float64),
col_split.list.get(2).cast(pl.Float64),
Member

Honest question: why does the casting here change the results so drastically? What is the dtype otherwise?

Contributor Author

Thank you for your reviews @ArturoAmorQ. It looks like the fit times are slightly lower for Polars (apart from the squared loss), but otherwise there are no differences in the final table. Before casting, the dtype is string (pl.Utf8). In the extract_numeric function that was used before, the mean and std were also cast to float before finding the minimum:

def extract_numeric(value):
    parts = value.split("±")
    mean_value = float(parts[0])
    std_value = float(parts[1].split()[0])

    return mean_value, std_value

Comment on lines +275 to +279
scores_df.select(
    pl.col("loss").get(min_arg(col_name)).alias(col_name)
    for col_name in scores_df.columns
    if col_name != "loss"
)
Member

This is a clever solution, though :)
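
To make the quoted snippet self-contained, here is one way the pieces could fit together on toy data. The min_arg helper below is a reconstruction from the two diff excerpts above (split the "mean ± std" strings, then sort by mean and std), not necessarily the PR's exact code, and the scores are made up.

import polars as pl

# Illustrative "mean ± std" scores per loss function.
scores_df = pl.DataFrame(
    {
        "loss": ["squared_error", "absolute_error"],
        "MAPE": ["0.322 ± 0.005", "0.283 ± 0.004"],
        "RMSE": ["60.6 ± 1.1", "62.3 ± 1.2"],
    }
)


def min_arg(col_name):
    # Index of the row with the smallest mean (ties broken by the std).
    col_split = pl.col(col_name).str.split(" ")
    return pl.arg_sort_by(
        col_split.list.get(0).cast(pl.Float64),
        col_split.list.get(2).cast(pl.Float64),
    ).first()


# For each metric column, report the loss that minimises it.
print(
    scores_df.select(
        pl.col("loss").get(min_arg(col_name)).alias(col_name)
        for col_name in scores_df.columns
        if col_name != "loss"
    )
)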

@ArturoAmorQ (Member) left a comment

LGTM! Any further comments on this one @glemaitre?

@glemaitre (Member)

Let me have another, more in-depth look, but it looks good at first glance.

@glemaitre glemaitre self-requested a review April 8, 2024 12:58
@glemaitre (Member) left a comment

LGTM. We can further improve when we make a good OpenML fetcher using the parquet files.

@glemaitre glemaitre changed the title from "DOC: replace pandas with polars for feature engineering in examples/applications/plot_time_series_lagged_features.py" to "DOC use polars instead of pandas in plot_time_series_lagged_features" on Apr 8, 2024
@glemaitre glemaitre enabled auto-merge (squash) April 8, 2024 19:10
@glemaitre glemaitre merged commit d1d1596 into scikit-learn:main Apr 8, 2024
32 checks passed