DOC use polars instead of pandas in plot_time_series_lagged_features #28601
Conversation
import polars as pl

df = pl.DataFrame({col: df[col].to_numpy() for col in df.columns})
I am doing this instead of `pl.from_pandas()` to avoid installing pyarrow.

Somehow I don't really love the idea of having to use both pandas and Polars in the same script. It kinda defeats the purpose.
Hey, @glemaitre had written about this in the Polars Discord (https://discord.com/channels/908022250106667068/1214872002238742529/1215228420292870164), which is why I encouraged @raisadz to try this out.

Anecdotally, I've seen several colleagues and clients take a predominantly pandas pipeline, rewrite the most memory-intensive part to Polars (whilst keeping the rest in pandas), and find that that's already enough to avoid out-of-memory issues.
I would be fine with the conversion from pandas to polars for the moment. I think that we should make it explicit that the conversion is temporary. It at least shows that scikit-learn accepts such input. It is definitely not perfect, but better than not having a single example while we wait. @adrinjalali WDYT?
@MarcoGorelli what's the status of being able to read parquet with polars w/o depending on pyarrow right now?
Here's an example of using Polars to read the greatest data science dataset in the world (no pyarrow installed!):

In [1]: import polars as pl
In [2]: pl.read_parquet('iris.parquet').head()
Out[2]:
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
In [3]: import pyarrow
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[3], line 1
----> 1 import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
        loc.body(col, extract_numeric(col) == list_min(extract_numeric(col)))
        for col in styled_df.columns
        if col != "loss"
    ],
Replacing pandas styling with great-tables, as it has native support for Polars. Apart from `fetch_openml`, this completely replaces pandas.

Thank you for your reviews. I added a comment about
I updated the lock files to add great-tables by running
I removed checking Polars version and replaced it with using |
for col in cols_to_convert:
    mask[col] = df[col].apply(
        lambda x: "font-weight: bold" if x == min_values[col] else ""
    )
gt_styled_df = GT(styled_df)
For sure, we don't want the extra dependency on great-tables. We should instead have a way to do the same with the currently installed libraries.
It seems that there is no way to print something in bold in Polars, which is why I added great-tables, which has this support.
Should I then convert the dataframe back to pandas when there is a styled output? I thought you don't want to have both pandas and Polars dataframes (#28601 (comment))?
Or do you know of some other way to print that table with the final metrics in bold styling?
Alternatively, should I display the dataframe without styling? I can add the minimums as a separate dataframe below.
I guess we'd rather lose the bold styling than add the extra dependency or jump back and forth from pandas. Personally I vote for the last option then.
Thanks for your reviews. I removed the styling with great tables and added a dataframe displaying the losses that minimise each metric.
# We convert to Polars for feature engineering, as it presents several advantages
# here:
#
# - It parallelises the expressions within the `with_columns` statement.
# - It automatically caches common subexpressions which are reused in multiple
#   expressions (like `pl.col("count").shift(1)` below). See
#   https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.
# - It is very memory-efficient.
# - It is stricter about data types and schemas, allowing us to catch errors
#   earlier in our development workflow.
I don't think that it is our place here to emphasize why one should use polars. I think that the example should only concentrate on the "consumer" part for a scikit-learn user, meaning that the underlying model will deal with a polars dataframe.

I think here the note regarding the caveats for `fetch_openml` is enough.
Hey 👋
I just wanted to flag that when I opened #28576, one of the comments was:

> maybe we can highlight the benefit with a comment in the code or as narrative text

So, just looping in @ArturoAmorQ to make sure that scikit-learn devs are aligned here about highlighting benefits in examples, and that you're on the same page about the direction this is going in 🤗
I agree with @glemaitre that comments regarding the benefits should remain as minimalistic as possible to keep the focus on the example's main narrative. Maybe in this case in particular the most relevant reason is the caching of common subexpressions?
I left only the caching of common subexpressions and removed other reasons for using Polars.
Thanks for implementing the changes @raisadz, here's another batch of comments.
# Even if the score distributions overlap due to the variance in the dataset,
# it is true that the average RMSE is lower when `loss="squared_error"`, whereas
# the average MAPE is lower when `loss="absolute_error"` as expected
# (the same is true for the quantile 50). That is
"the same is true for the quantile 50" would mean, given the context, that pinball_loss_50
is minimized by "quantile 50", which is not the case.
# We start by loading the data from the OpenML repository.
# We start by loading the data from the OpenML repository
# as a pandas dataframe. This will be replaced with Polars
# after `fetch_openml` will add a native support for it.
# after `fetch_openml` will add a native support for it.
# once `fetch_openml` adds a native support for it.
col_split.list.get(0).cast(pl.Float64),
col_split.list.get(2).cast(pl.Float64),
Honest question: why does the casting here change the results so drastically? What is the dtype otherwise?
Thank you for your reviews @ArturoAmorQ. It looks like the fit times are slightly lower for Polars (apart from the squared loss), but otherwise there are no differences in the final table. Before casting, the dtype is string (`pl.Utf8`). In the `extract_numeric` function that was used, the mean and std were also cast to float before finding the minimum:
def extract_numeric(value):
    parts = value.split("±")
    mean_value = float(parts[0])
    std_value = float(parts[1].split()[0])
    return mean_value, std_value
scores_df.select(
    pl.col("loss").get(min_arg(col_name)).alias(col_name)
    for col_name in scores_df.columns
    if col_name != "loss"
)
This is a clever solution, though :)
LGTM! Any further comments on this one @glemaitre?

Let me have another look more in depth, but it looks good at first glance.
LGTM. We can further improve when we make a good OpenML fetcher using the parquet files.
Reference Issues/PRs
Related to #28341
What does this implement/fix? Explain your changes.
Any other comments?