
[MRG] ENH speed up plot_poisson_regression_non_normal_loss.py #21787


Closed
wants to merge 5 commits

Conversation

@PSSF23 (Contributor) commented Nov 25, 2021

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

  • Speed up ../examples/linear_model/plot_poisson_regression_non_normal_loss.py by storing the test predictions and reusing them in the figures.
  • Fix the train/test X by removing the y columns.

Any other comments?

The only way to really reduce the running time would be using a smaller dataset...

@PSSF23 mentioned this pull request on Nov 25, 2021
@ogrisel (Member) left a comment

Interesting; however, the prediction time seems to be negligible.

Maybe we could try to reduce the training time by using a smaller training set while keeping a larger test set to properly assess the calibration results:

df_train, df_test = train_test_split(df, test_size=0.5, random_state=0)

This would require checking that all the outputs and the text of the analysis are still relevant, though.

Finally, if this change makes test time a non-negligible computational step, then the following suggestions would make the code slightly lighter to read:

@PSSF23 (Contributor, Author) left a comment

@ogrisel Thanks! I will commit your suggestions and explore a different split. BTW, where can I see how much time the example takes on your side? I noticed some timing differences on my local machine, but it's pretty hard to estimate.

@PSSF23 (Contributor, Author) left a comment

@ogrisel Changing split ratios doesn't seem to change anything, but I noticed another potential issue:

ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])

Is it okay to directly input df_train as X when it contains the y column df_train["Frequency"]? The same thing happens for all the predictions. I think df_train.drop(columns=["Frequency"]) should be the correct X to use?
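
For concreteness, the change being asked about would make the snippet above read roughly as follows. This is a sketch only, reusing the example's Pipeline, Ridge, linear_model_preprocessor, and df_train; the actual change proposed by this PR appears in the review diff further down.

ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(
    # Pass only the feature columns as X by dropping the target column.
    df_train.drop(columns=["Frequency"]),
    df_train["Frequency"],
    regressor__sample_weight=df_train["Exposure"],
)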

@adrinjalali changed the title from "[MRG] ENH speed up poisson example by storing test predictions" to "[MRG] ENH speed up plot_poisson_regression_non_normal_loss.py" on Nov 29, 2021
@PSSF23 (Contributor, Author) left a comment

@ogrisel I have modified the X sets by removing y columns. This step had no effect on running time.

Another potential speed-up is in the data preprocessing (transforming the data only once for all 3 linear models). However, that would make the use of Pipeline unnecessary, so I'm hesitant about this change. What do you think?

@ogrisel (Member) left a comment

Not sure how to proceed here. Based on the CircleCI timing, reusing the predictions seems to bring a small reduction in running time (around 10%). Maybe mutualizing the preprocessing could help a bit further with both the runtime and the explicitness of the code, as detailed in the comment below.

But maybe we can also live with the current runtime. It's less than 1 minute, so this is fine by me.

Maybe @lorentzenchr has other suggestions?

@@ -145,7 +145,11 @@
         ("preprocessor", linear_model_preprocessor),
         ("regressor", DummyRegressor(strategy="mean")),
     ]
-).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])
+).fit(
+    df_train.drop(columns=["Frequency"]),
@ogrisel (Member) commented Dec 7, 2021

> Is it okay to directly input df_train as X when it contains the y column df_train["Frequency"]? The same thing happens for all the predictions. I think df_train.drop(columns=["Frequency"]) should be the correct X to use?

Indeed, the fact that the "Frequency" column is always dropped implicitly by the linear_model_preprocessor is a bit confusing.

We could make this more explicit in one of the following ways:

  • dropping the "Frequency" column each time df_train and df_test are passed as inputs to the pipeline (but I find this quite verbose);
  • adding an inline comment in the preprocessor to emphasize that "Frequency" is always dropped by linear_model_preprocessor;
  • preprocessing df_train and df_test only once at the beginning, then highlighting that the result of the preprocessing does not contain any feature derived from the "Frequency" column (see the sketch after this comment). This might also have a beneficial impact on the overall runtime of the example, but I am not sure how large the effect would be. Also, I kind of liked using this example as an opportunity to advocate for the use of pipelines.

Maybe you can try to experiment with the last suggestion locally and report how much of a speed improvement it brings?
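
For reference, here is a minimal sketch of what the third option could look like. It assumes linear_model_preprocessor, df_train, and df_test are defined as in the example; the DummyRegressor and Ridge settings match the snippets quoted in this thread, while the PoissonRegressor settings are placeholders rather than the example's actual values.

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import PoissonRegressor, Ridge

# Preprocess once; the preprocessor only selects feature columns, so the
# transformed matrices contain nothing derived from "Frequency".
X_train = linear_model_preprocessor.fit_transform(df_train)
X_test = linear_model_preprocessor.transform(df_test)
y_train, y_test = df_train["Frequency"], df_test["Frequency"]

# Fit the three regressors directly on the preprocessed arrays instead of
# wrapping each one in a Pipeline.
dummy = DummyRegressor(strategy="mean").fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)
ridge_glm = Ridge(alpha=1e-6).fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)
poisson_glm = PoissonRegressor(alpha=1e-12).fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)

The figures would then call predict on X_test directly, at the cost of dropping the pipelines that the published example also uses to demonstrate good practice.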

@PSSF23 (Contributor, Author) replied

Just for confirmation, you want to make linear_model_preprocessor preprocess df_train & df_test only once for all 3 linear models (Dummy, Ridge, Poisson) and drop Pipeline, right? I will try the changes locally.

@lorentzenchr (Member) replied

I would use

y_train = df_train.pop("Frequency")
y_test = df_test.pop("Frequency")

because I find it clean and it avoids data leakage from the beginning.
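
To illustrate, a fit call after the pop could look like the sketch below (ridge_glm is the pipeline quoted earlier in this thread; refitting it here is purely illustrative):

y_train = df_train.pop("Frequency")
y_test = df_test.pop("Frequency")

# After the pop, df_train holds only feature columns, so the full frame can
# be passed as X without leaking the target into the preprocessing.
ridge_glm.fit(df_train, y_train, regressor__sample_weight=df_train["Exposure"])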

@PSSF23 (Contributor, Author) commented Dec 7, 2021

@lorentzenchr Thanks for the suggestion. I reverted to the original setup, which lets the preprocessor deal with the columns, but I'm definitely open to implementing improvements if needed.

@PSSF23 (Contributor, Author) left a comment

@ogrisel Thanks. I reverted the explicit dropping for now and will experiment with preprocessing.

@adrinjalali (Member) commented

@PSSF23 did you get anywhere with this?

@PSSF23 (Contributor, Author) commented Feb 8, 2022

@adrinjalali Unfortunately, no. I haven't achieved any improvements yet.

@lorentzenchr (Member) commented

Do you know where the main bottleneck is?

@PSSF23 (Contributor, Author) commented Feb 8, 2022

@lorentzenchr the dataset is just large. I don't think any other change could affect the running time much.

@adrinjalali (Member) commented

We could upload a smaller version of the dataset to openml and use that one instead.

@ogrisel (Member) commented May 10, 2022

As discussed in #23314, it might be possible to improve the solver.

@jeremiedbb (Member) commented

The new pandas parser (#21938) gave this example the necessary speed-up to be in line with the other examples. Thanks for the investigations, though.

@jeremiedbb closed this on Mar 2, 2023