Sparse data representations results in worse models than dense data for some classifiers #25198
Comments
Thanks for the bug report. Indeed, the data format should not affect the model. In your example, this issue is due to a combination of having a stochastic solver, a small sparse dataset, and a relatively large intercept. Explanations:
Related issues:
Quick fix 1: Use the "lbfgs" solver on small datasets, add a warning when using "sag" or "saga" with sparse data and
Quick fix 2: Use
Slow fix: To properly solve this issue, we could consider one of these options:
Thanks Tom - this was very helpful. Quick fix 2 does remove the differences between models trained on sparse vs. dense data.
@TomDLT Regarding slow fix B:
Can the intercept decay be dependent on the sparsity of the data in
That would make perfect sense. It has been proposed before, and the answer was #612 (comment):
For reference, the intercept decay comes from Léon Bottou:
Can I work on this issue?
Describe the bug
Using scipy sparse matrices with sklearn LogisticRegression greatly improves speed and therefore is desirable in many scenarios.
However, it appears that sparse versus dense data representations yield different (worse) results for some sklearn classifiers.
My perhaps naive assumption is that sparse versus dense is just a method of representing the data and operations performed on the sparse or dense data (including model training) should yield identical or nearly identical results.
A notebook gist looking at sparse versus dense results for nine solvers can be found here: https://gist.github.com/mmulthaup/db619d8b5ea4baf4a00153b055a7e9a8
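A minimal sketch of the kind of comparison described above, not the code from the gist: it fits `LogisticRegression` on the same matrix in dense and CSR form and returns both models so their parameters can be compared. The synthetic data, the helper names `make_toy_data` and `fit_dense_and_sparse`, and the chosen `tol`/`max_iter` values are all assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression


def make_toy_data(n_samples=200, n_features=20, density=0.1, seed=0):
    """Hypothetical synthetic data: mostly-zero features plus a constant offset."""
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, n_features))
    X[rng.uniform(size=X.shape) > density] = 0.0  # keep roughly `density` of the entries
    w = rng.normal(size=n_features)
    # The constant shifts the decision boundary, so the fitted intercept is not near zero.
    y = (X @ w + 1.0 > 0).astype(int)
    return X, y


def fit_dense_and_sparse(solver, X, y):
    """Fit the same model on a dense array and on its CSR representation."""
    params = dict(solver=solver, max_iter=5000, tol=1e-8, random_state=0)
    dense_model = LogisticRegression(**params).fit(X, y)
    sparse_model = LogisticRegression(**params).fit(sparse.csr_matrix(X), y)
    return dense_model, sparse_model


if __name__ == "__main__":
    X, y = make_toy_data()
    for solver in ("lbfgs", "saga"):
        dense_model, sparse_model = fit_dense_and_sparse(solver, X, y)
        coef_gap = np.max(np.abs(dense_model.coef_ - sparse_model.coef_))
        print(solver, "max coef gap:", coef_gap,
              "intercepts:", dense_model.intercept_, sparse_model.intercept_)
```

With a deterministic solver such as "lbfgs" the two fits should agree closely. Any gap reported for "saga" depends on the data and the scikit-learn version, so the script prints it rather than assuming a value.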
Steps/Code to Reproduce
Expected Results
Dense AUC: 1.0
Sparse AUC: 1.0
Actual Results
Dense AUC: 1.0
Sparse AUC: 0.584
Versions