
[MRG] ENH speed up plot_poisson_regression_non_normal_loss.py #21787


Closed
wants to merge 5 commits

Conversation

@PSSF23 (Contributor) commented Nov 25, 2021

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

  • Speed up ../examples/linear_model/plot_poisson_regression_non_normal_loss.py by storing the test predictions and reusing them in the figures.
  • Fix the train/test X by removing the y columns.

Any other comments?

The only way to really reduce the running time would be using a smaller dataset...

@PSSF23 mentioned this pull request on Nov 25, 2021
@ogrisel (Member) left a comment

Interesting; however, the prediction time seems to be negligible.

Maybe we could try to reduce the training time by using a smaller training set while keeping a larger test set to properly assess the calibration results:

df_train, df_test = train_test_split(df, test_size=0.5, random_state=0)

This would require checking that all the outputs and the text of the analysis are still relevant, though.

Finally, if this change makes test time a non-negligible computational step, then the following suggestions would make the code slightly lighter to read:

@PSSF23 (Contributor, Author) left a comment

@ogrisel Thanks! I will commit your suggestions and explore a different split. BTW, where can I see how much time the example takes on your side? I noticed some timing differences on my local machine, but it's pretty hard to estimate.

@PSSF23 (Contributor, Author) left a comment

@ogrisel Changing split ratios doesn't seem to change anything, but I noticed another potential issue:

ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])

Is it okay to directly input df_train as X when it contains the y column df_train["Frequency"]? The same thing happens for all the predictions. I think df_train.drop(columns=["Frequency"]) should be the correct X to use?
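
For concreteness, the change being asked about would make the snippet above read roughly as follows. This is a sketch only, reusing the example's Pipeline, Ridge, linear_model_preprocessor, and df_train; the actual change proposed by this PR appears in the review diff further down.

ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(
    # Pass only the feature columns as X by dropping the target column.
    df_train.drop(columns=["Frequency"]),
    df_train["Frequency"],
    regressor__sample_weight=df_train["Exposure"],
)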

@adrinjalali changed the title from "[MRG] ENH speed up poisson example by storing test predictions" to "[MRG] ENH speed up plot_poisson_regression_non_normal_loss.py" on Nov 29, 2021
@PSSF23 (Contributor, Author) left a comment

@ogrisel I have modified the X sets by removing y columns. This step had no effect on running time.

Another potential speed-up is in the data preprocessing (transforming the data only once for all 3 linear models). However, that would make the use of Pipeline unnecessary, so I'm hesitant about this change. What do you think?

@ogrisel (Member) left a comment

Not sure how to proceed here. Based on the CircleCI timing, reusing the predictions seems to bring a small reduction in running time (around 10%). Maybe mutualizing the preprocessing could help a bit further with both the runtime and the explicitness of the code, as detailed in the comment below.

But maybe we can also live with the current runtime. It's less than 1 minute, so this is fine by me.

Maybe @lorentzenchr has other suggestions?

@@ -145,7 +145,11 @@
         ("preprocessor", linear_model_preprocessor),
         ("regressor", DummyRegressor(strategy="mean")),
     ]
-).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])
+).fit(
+    df_train.drop(columns=["Frequency"]),
@ogrisel (Member) commented Dec 7, 2021

> Is it okay to directly input df_train as X when it contains the y column df_train["Frequency"]? The same thing happens for all the predictions. I think df_train.drop(columns=["Frequency"]) should be the correct X to use?

Indeed, the fact that the "Frequency" column is always dropped implicitly by the linear_model_preprocessor is a bit confusing.

We could make this more explicit in one of the following ways:

  • dropping the "Frequency" column each time df_train and df_test are passed as inputs to the pipeline (but I find this quite verbose);
  • adding an inline comment in the preprocessor to emphasize that "Frequency" is always dropped by linear_model_preprocessor;
  • preprocessing df_train and df_test only once at the beginning, then highlighting that the result of the preprocessing does not contain any feature derived from the "Frequency" column (see the sketch after this comment). This might also have a beneficial impact on the overall runtime of the example, but I am not sure how large the effect would be. Also, I kind of liked using this example as an opportunity to advocate for the use of pipelines.

Maybe you can try to experiment with the last suggestion locally and report how much of a speed improvement it brings?
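
For reference, here is a minimal sketch of what the third option could look like. It assumes linear_model_preprocessor, df_train, and df_test are defined as in the example; the DummyRegressor and Ridge settings match the snippets quoted in this thread, while the PoissonRegressor settings are placeholders rather than the example's actual values.

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import PoissonRegressor, Ridge

# Preprocess once; the preprocessor only selects feature columns, so the
# transformed matrices contain nothing derived from "Frequency".
X_train = linear_model_preprocessor.fit_transform(df_train)
X_test = linear_model_preprocessor.transform(df_test)
y_train, y_test = df_train["Frequency"], df_test["Frequency"]

# Fit the three regressors directly on the preprocessed arrays instead of
# wrapping each one in a Pipeline.
dummy = DummyRegressor(strategy="mean").fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)
ridge_glm = Ridge(alpha=1e-6).fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)
poisson_glm = PoissonRegressor(alpha=1e-12).fit(
    X_train, y_train, sample_weight=df_train["Exposure"]
)

The figures would then call predict on X_test directly, at the cost of dropping the pipelines that the published example also uses to demonstrate good practice.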

@PSSF23 (Contributor, Author) replied

Just for confirmation, you want to make linear_model_preprocessor preprocess df_train & df_test only once for all 3 linear models (Dummy, Ridge, Poisson) and drop Pipeline, right? I will try the changes locally.

@lorentzenchr (Member) replied

I would use

y_train = df_train.pop("Frequency")
y_test = df_test.pop("Frequency")

because I find it clean and it avoids data leakage from the beginning.
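
To illustrate, a fit call after the pop could look like the sketch below (ridge_glm is the pipeline quoted earlier in this thread; refitting it here is purely illustrative):

y_train = df_train.pop("Frequency")
y_test = df_test.pop("Frequency")

# After the pop, df_train holds only feature columns, so the full frame can
# be passed as X without leaking the target into the preprocessing.
ridge_glm.fit(df_train, y_train, regressor__sample_weight=df_train["Exposure"])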

@PSSF23 (Contributor, Author) commented Dec 7, 2021

@lorentzenchr Thanks for the suggestion. I reverted to the original setup, which lets the preprocessor deal with the columns, but I'm definitely open to implementing improvements if needed.

@PSSF23 (Contributor, Author) left a comment

@ogrisel Thanks. I reverted the explicit dropping for now and will experiment with preprocessing.

@adrinjalali (Member) commented

@PSSF23 did you get anywhere with this?

@PSSF23 (Contributor, Author) commented Feb 8, 2022

@adrinjalali Unfortunately, no. I haven't achieved any improvements yet.

@lorentzenchr (Member) commented

Do you know where the main bottleneck is?

@PSSF23 (Contributor, Author) commented Feb 8, 2022

@lorentzenchr the dataset is just large. I don't think any other change could affect the running time much.

@adrinjalali (Member) commented

We could upload a smaller version of the dataset to openml and use that one instead.

@ogrisel (Member) commented May 10, 2022

As discussed in #23314, it might be possible to improve the solver.

@jeremiedbb (Member) commented

The new pandas parser (#21938) gave this example the necessary speed-up to be in line with the other examples. Thanks for the investigations, though.

@jeremiedbb closed this on Mar 2, 2023