Benchmark bug in SGDRegressor.fit on sparse data #26095
Comments
/cc @OmarManzoor.
Hi @ogrisel,
But then the dense case is not impacted, and that uses float32 as well. So this might not be the case.
Indeed, it could be. Can you check the final n_iter_ value in both cases?
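For illustration, a minimal sketch of such a check (the dataset generation, shapes, and tol below are assumptions for the example, not the benchmark's actual _synth_regression_sparse_dataset or settings):

```python
# Sketch only: compare the final n_iter_ of SGDRegressor on sparse data
# when fitting in float64 vs float32. Shapes, density and tol are
# illustrative assumptions, not the benchmark's actual values.
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = sp.random(5000, 500, density=0.01, format="csr", random_state=rng)
y = rng.randn(5000)

for dtype in (np.float64, np.float32):
    est = SGDRegressor(max_iter=1000, tol=1e-16, random_state=0)
    est.fit(X.astype(dtype), y.astype(dtype))
    print(dtype.__name__, "n_iter_:", est.n_iter_)
```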
I removed the caching part from _synth_regression_sparse_dataset and used about 10 repetitions to account for randomness. This is the code I used: https://gist.github.com/OmarManzoor/ba85c684991ba132fbf544f0433c2bfa
Sparse, float64 - Total number of iterations: [85, 84, 57, 76, 84, 65, 96, 81, 81, 93]
Ouch. This does not look good. I haven't looked at the script yet, but it shows that we never converge most of the time. It might be interesting to have a look at the loss over time.
Oh, I see that you put a tolerance
OK, I think the issue is our benchmark here:
The tolerance is too small for the 32-bit implementation. I don't know if
I see. So should we change this value for the 32-bit case?
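One possible way to make the tolerance depend on the dtype (a sketch of the idea only, using a hypothetical tol_for_dtype helper and an assumed base tolerance of 1e-16, not a proposed patch to the benchmark):

```python
# Sketch: in float32 arithmetic the loss can hardly improve by amounts much
# smaller than ~1e-7 in relative terms, so a stopping tolerance of 1e-16 is
# effectively unreachable; scaling it with np.finfo(dtype).eps avoids that.
import numpy as np

def tol_for_dtype(dtype, base_tol=1e-16):
    # hypothetical helper: never ask for a tolerance tighter than the
    # dtype can meaningfully resolve
    return max(base_tol, 10 * np.finfo(dtype).eps)

print(tol_for_dtype(np.float64))  # ~2.2e-15
print(tol_for_dtype(np.float32))  # ~1.2e-06
```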
I would ask for advice from @jeremiedbb and @ogrisel to know what they think. I assume that
Understood.
Thanks for the investigation, @OmarManzoor and @glemaitre! I think we could set
I wonder why it works with a low
Could it have to do with the size of the data? The dense dataset only has 200 features.
There's a confusion here. If the stopping criterion of SGD was … In the benchmarks, the goal is to always perform the
and catch
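My reading of that intent, as a rough sketch (the parameter values and the warning filter here are assumptions, not the benchmark's actual code): keep the tolerance so small that the stopping criterion never triggers, let the fit run for the full max_iter epochs, and silence the resulting ConvergenceWarning so the timing stays comparable:

```python
# Sketch of the described intent: always run the full number of epochs and
# suppress the ConvergenceWarning raised when the (deliberately tiny)
# tolerance is never reached. Values are illustrative.
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = rng.randn(1000)

est = SGDRegressor(max_iter=100, tol=1e-16, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    est.fit(X, y)
print(est.n_iter_)  # equals max_iter when the stopping criterion never triggers
```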
There is a 10x increase in fit duration as reported here:
https://scikit-learn.org/scikit-learn-benchmarks/#linear_model.SGDRegressorBenchmark.time_fit?commits=e6b46675-b4afbeee&p-representation='sparse'
This happened between e6b4667 (still fast) and b4afbee (slow).
Note that the same estimator with dense data was not impacted.
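For reference, a rough local timing sketch comparing the two representations (not the ASV benchmark itself; shapes, density, and tol are made up for illustration):

```python
# Rough local comparison of fit duration on dense vs sparse input.
# This is only a sketch; the real numbers come from the ASV benchmarks.
from time import perf_counter
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_sparse = sp.random(10000, 500, density=0.01, format="csr", random_state=rng)
X_dense = X_sparse.toarray()
y = rng.randn(10000)

for name, X in [("dense", X_dense), ("sparse", X_sparse)]:
    est = SGDRegressor(max_iter=1000, tol=1e-16, random_state=0)
    tic = perf_counter()
    est.fit(X, y)
    print(f"{name}: {perf_counter() - tic:.3f}s, n_iter_={est.n_iter_}")
```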
I have not investigated the cause myself. Just spotted it when reviewing the benchmarks.
Looking at the commit messages in the output of `git log e6b46675..b4afbeee`, it could be the case that 1a669d8 introduced the regression, but this needs confirmation.
EDIT: this is actually a bug in the design of the benchmark rather than a performance regression. See the discussion above.