Benchmark bug in SGDRegressor.fit on sparse data #26095

Closed
ogrisel opened this issue Apr 5, 2023 · 16 comments · Fixed by #26146


ogrisel (Member) commented Apr 5, 2023

There is a 10x increase in fit duration as reported here:

https://scikit-learn.org/scikit-learn-benchmarks/#linear_model.SGDRegressorBenchmark.time_fit?commits=e6b46675-b4afbeee&p-representation='sparse'

This happened between e6b4667 (still fast) and b4afbee (slow).

Note that the same estimator with dense data was not impacted.

I have not investigated the cause myself. Just spotted it when reviewing the benchmarks.

Looking at the commit messages in the output of git log e6b46675..b4afbeee, it could be the case that 1a669d8 introduced the regression, but this needs confirmation.

EDIT: this is actually a bug in the design of the benchmark rather than a performance regression. See the discussion below.

ogrisel (Member, Author) commented Apr 5, 2023

/cc @OmarManzoor.

OmarManzoor (Contributor) commented:

Hi @ogrisel,
I was having a look at this and it seems that the benchmark uses a dtype of float32 by default. Is it possible that with float32 more iterations are taken to converge to a similar result as with float64?

OmarManzoor (Contributor) commented:

> Hi @ogrisel, I was having a look at this and it seems that the benchmark uses a dtype of float32 by default. Is it possible that with float32 more iterations are taken to converge to a similar result as with float64?

But then the dense case is not impacted, and that uses float32 as well. So this might not be the case.

ogrisel (Member, Author) commented Apr 6, 2023

Indeed it could be. Can you check the final n_iter_ value in both cases?

OmarManzoor (Contributor) commented Apr 6, 2023

I removed the caching part from _synth_regression_sparse_dataset and used about 10 iterations to account for randomness. This is the code I used: https://gist.github.com/OmarManzoor/ba85c684991ba132fbf544f0433c2bfa

Sparse, float64 - Total number of iterations: [85, 84, 57, 76, 84, 65, 96, 81, 81, 93]
Sparse, float32 - Total number of iterations: [1000, 960, 1000, 1000, 835, 1000, 839, 1000, 1000, 1000]
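The dtype effect can be reproduced with a rough sketch along these lines (this is not the gist above; the dataset shape, density, and seeds here are arbitrary assumptions, not the benchmark's _synth_regression_sparse_dataset):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDRegressor

def n_iter_for(dtype, seed=0):
    # Hypothetical stand-in for the benchmark's sparse dataset:
    # 1000 samples, 500 features, ~1% non-zeros.
    rng = np.random.RandomState(seed)
    X = sp.random(1000, 500, density=0.01, format="csr",
                  random_state=rng, dtype=dtype)
    y = rng.randn(1000).astype(dtype)
    # Same estimator settings as the benchmark.
    est = SGDRegressor(max_iter=1000, tol=1e-16, random_state=0)
    est.fit(X, y)
    return est.n_iter_

print("float64:", n_iter_for(np.float64))
print("float32:", n_iter_for(np.float32))
```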

glemaitre (Member) commented:

Ouch. This does not look good. I didn't look at the script yet, but it shows that we never converge most of the time. It might be interesting to have a look at the loss over time.

glemaitre (Member) commented:

Oh, I see that you put a tolerance tol=1e-16. This cannot work with 32 bits because the smallest tolerance would be around 1e-7.

glemaitre (Member) commented:

OK I think the issue is our benchmark here:

estimator = SGDRegressor(max_iter=1000, tol=1e-16, random_state=0)

The tolerance is too small for the 32 bits implementation. I don't know if tol=1e-7 would be a good default there.
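The ~1e-7 figure corresponds to float32 machine epsilon. A quick check (NumPy used for illustration):

```python
import numpy as np

# float32 machine epsilon: relative spacing between consecutive
# representable values around 1.0, about 1.19e-7 (i.e. 2**-23).
eps32 = np.finfo(np.float32).eps
print(eps32)

# Around a score of 1.0, an increment well below eps32 is rounded
# away, while one on the order of eps32 survives:
print(np.float32(1.0) + np.float32(5e-8) == np.float32(1.0))  # True: absorbed
print(np.float32(1.0) + np.float32(1e-7) == np.float32(1.0))  # False: survives
```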

OmarManzoor (Contributor) commented:

> OK I think the issue is our benchmark here:
>
> estimator = SGDRegressor(max_iter=1000, tol=1e-16, random_state=0)
>
> The tolerance is too small for the 32 bits implementation. I don't know if tol=1e-7 would be a good default there.

I see. So should we change this value for the 32 bit case?

glemaitre (Member) commented:

I would ask for advice from @jeremiedbb and @ogrisel to know what they think. I assume that 1e-16 was to get a model that converged. So I don't know if we should have a tolerance that depends on the dtype or a common tolerance that works for both dtypes.

OmarManzoor (Contributor) commented:

> I would ask for advice from @jeremiedbb and @ogrisel to know what they think. I assume that 1e-16 was to get a model that converged. So I don't know if we should have a tolerance that depends on the dtype or a common tolerance that works for both dtypes.

Understood.

ogrisel (Member, Author) commented Apr 6, 2023

Thanks for the investigation @OmarManzoor and @glemaitre!

I think we could set tol to 1e-7 for both dtypes. WDYT @jeremiedbb?

@ogrisel ogrisel changed the title Performance regression on SGDRegressor.fit on sparse data Benchmark bug in SGDRegressor.fit on sparse data Apr 6, 2023
ogrisel (Member, Author) commented Apr 6, 2023

I wonder why it works with a low tol value for dense data though.

OmarManzoor (Contributor) commented:

> I wonder why it works with a low tol value for dense data though.

Could it have to do with the size of the data? The dense dataset only has 200 features.

jeremiedbb (Member) commented:

> Oh, I see that you put a tolerance tol=1e-16. This cannot work with 32 bits because the smallest tolerance would be around 1e-7.

There's a confusion here. 1e-16 is representable in float32, and you can check that a number is smaller than that. Machine precision represents the maximum relative difference between two consecutive representable numbers.

If the stopping criterion of SGD were if (new_score / prev_score) < tol, 1e-16 would be fine. However, the current stopping criterion is if score < best_score + tol. In that case, as long as tol < best_score * float32.eps, the tol is rounded out and the expression becomes if score < best_score.
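The distinction can be checked directly (NumPy float32 used for illustration; best_score is a hypothetical running best objective value):

```python
import numpy as np

tol = np.float32(1e-16)
print(tol > 0)  # True: 1e-16 is perfectly representable in float32

best_score = np.float32(2.5)  # hypothetical running best objective value
# tol < best_score * eps(float32) (~3e-7 here), so it is rounded away in
# the sum and the criterion `score < best_score + tol` degenerates to
# `score < best_score`:
print(best_score + tol == best_score)  # True
```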

In the benchmarks the goal is to always perform the max_iter iterations, so setting tol=None would probably be more appropriate.

ogrisel (Member, Author) commented Apr 6, 2023

And catch ConvergenceWarning and assert that model.n_iter_ == model.max_iter after calling fit, to make that explicit.
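That pattern might look roughly like the sketch below. The dataset and settings are arbitrary assumptions; a small finite tol and a tiny max_iter are used so the fit is guaranteed to hit the iteration cap and emit the warning:

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = X @ rng.randn(10)

# max_iter deliberately tiny so the run cannot stop early: SGD needs
# n_iter_no_change (default 5) epochs without improvement to stop.
model = SGDRegressor(max_iter=5, tol=1e-3, random_state=0)
with warnings.catch_warnings():
    # Silence the expected "maximum number of iterations reached" warning...
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    model.fit(X, y)

# ...and assert non-convergence explicitly instead.
assert model.n_iter_ == model.max_iter
```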
