[MRG] ENH speed up plot_poisson_regression_non_normal_loss.py #21787
Conversation
Interesting; however, prediction time seems to be negligible.

Maybe we could try to reduce training time by using a smaller training set while keeping a larger test set to properly assess the calibration results:

```python
df_train, df_test = train_test_split(df, test_size=0.5, random_state=0)
```

This would require checking that all the outputs and the text of the analysis are still relevant, though.
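To make the direction concrete, here is a minimal sketch of a more aggressive split; the `train_size=0.1` ratio is purely illustrative, not a value proposed in this review, and `df` is assumed to be the dataframe loaded earlier in the example:

```python
from sklearn.model_selection import train_test_split

# Hypothetical ratio: keep only a small fraction of rows for training to cut
# fitting time, while the large remaining test set is used to assess
# calibration of the models.
df_train, df_test = train_test_split(df, train_size=0.1, random_state=0)
```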
Finally, if this change makes test time become a non-negligible computational step, then the following suggestions would make the code slightly lighter to read:
@ogrisel Thanks! I will commit your suggestions and explore a different split. BTW, where can I see how much time the example would take on your side? I noticed some differences on my local machine with `time`, but it's pretty hard to estimate.
@ogrisel Changing split ratios doesn't seem to change anything, but I noticed another potential issue:
```python
ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])
```
Is it okay to directly input `df_train` as `X` when it contains the `y` column `df_train["Frequency"]`? The same thing happens for all the predictions. I think `df_train.drop(columns=["Frequency"])` should be the correct `X` to use?
@ogrisel I have modified the `X` sets by removing the `y` columns. This step had no effect on running time.

Another potential speed-up is in the data preprocessing (transforming the data only once for all 3 linear models). However, that would make the use of `Pipeline` unnecessary, so I'm hesitant about this change. What do you think?
Not sure how to proceed here. Based on the CircleCI timing, the reuse of the predictions seems to bring a small reduction of the running time (around 10%). Maybe mutualizing the preprocessing could help a bit further with the runtime and the explicitness of the code, as detailed in the comment below.

But maybe we can also live with the current runtime. It's less than 1 minute, so this is fine by me.
Maybe @lorentzenchr has other suggestions?
```diff
@@ -145,7 +145,11 @@
         ("preprocessor", linear_model_preprocessor),
         ("regressor", DummyRegressor(strategy="mean")),
     ]
-).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])
+).fit(
+    df_train.drop(columns=["Frequency"]),
```
> Is it okay to directly input `df_train` as `X` when it contains the `y` column `df_train["Frequency"]`? The same thing happens for all the predictions. I think `df_train.drop(columns=["Frequency"])` should be the correct `X` to use?

The fact that the `"Frequency"` column is always dropped implicitly by the `linear_model_preprocessor` is a bit confusing indeed.
We could make it more explicit by either:
- dropping the `"Frequency"` column each time `df_train` and `df_test` are passed as inputs to the pipeline (but I find this quite verbose);
- adding an inline comment in the preprocessor to emphasize that `"Frequency"` is always dropped by `linear_model_preprocessor`;
- preprocessing `df_train` and `df_test` only once at the beginning, then highlighting that the result of the preprocessing does not contain any feature derived from the `"Frequency"` column. This might also have a beneficial impact on the overall runtime of the example, but I am not sure how impactful it will be. Also, I kind of liked using this example as an opportunity to advocate for the use of pipelines.
Maybe you can try to experiment with the last suggestion locally and report how much of a speed improvement it brings?
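For reference, a minimal sketch of what that third option could look like, assuming the `linear_model_preprocessor`, `df_train` and `df_test` objects already defined in the example (an illustration only, not code from this PR):

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

# Run the shared preprocessor once and reuse the transformed matrices for all
# models instead of re-running it inside each Pipeline.
X_train = linear_model_preprocessor.fit_transform(df_train)
X_test = linear_model_preprocessor.transform(df_test)
y_train, weights = df_train["Frequency"], df_train["Exposure"]

dummy = DummyRegressor(strategy="mean").fit(X_train, y_train, sample_weight=weights)
ridge = Ridge(alpha=1e-6).fit(X_train, y_train, sample_weight=weights)
# ... likewise for the Poisson regressor, then compute the test predictions
# once and reuse them in the figures.
dummy_pred, ridge_pred = dummy.predict(X_test), ridge.predict(X_test)
```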
Just for confirmation: you want to make `linear_model_preprocessor` preprocess `df_train` & `df_test` only once for all 3 linear models (Dummy, Ridge, Poisson) and drop `Pipeline`, right? I will try the changes locally.
I would use

```python
y_train = df_train.pop("Frequency")
y_test = df_test.pop("Frequency")
```

because I find it clean and it avoids data leakage from the beginning.
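For illustration, this could plug into the existing pipeline fit roughly as follows (a sketch assuming the example's `df_train` and `ridge_glm` objects; not code from this PR):

```python
# pop() removes "Frequency" from the frame in place and returns it, so the
# remaining columns of df_train can be passed directly as X with no risk of
# the target leaking into the features.
y_train = df_train.pop("Frequency")

ridge_glm.fit(df_train, y_train, regressor__sample_weight=df_train["Exposure"])
```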
@lorentzenchr Thanks for the suggestion. I reverted to the original setup, which lets the preprocessor deal with the columns, but I'm definitely open to implementing improvements if needed.
This reverts commit 39a6780.
@ogrisel Thanks. I reverted the explicit dropping for now and will experiment with preprocessing.
@PSSF23 did you get anywhere with this?

@adrinjalali unfortunately no, I didn't achieve any improvements yet.

Do you know where the main bottleneck is?

@lorentzenchr the dataset is just large. I don't think any other change could affect the running time much.

We could upload a smaller version of the dataset to openml and use that one instead.

As discussed in #23314, it might be possible to improve the solver.

The new pandas parser (#21938) gave this example the necessary speed-up to be in line with the other examples. Thanks for the investigations though.
Reference Issues/PRs
#21598
What does this implement/fix? Explain your changes.
- Speed up `../examples/linear_model/plot_poisson_regression_non_normal_loss.py` by storing test predictions and reusing them in the figures.
- Modify the `X` sets by removing the `y` columns.

Any other comments?
The only way to really reduce the running time would be using a smaller dataset...