[MRG] Fix OverflowError on DictVectorizer #15463
Conversation
…into feature/overflowerror_dictvectorizer
Thanks for the pull-request!
The problem is that this only happens for sparse arrays that have more than 2**31 elements, which corresponds to tens of GB of RAM. I think it's OK to merge it without a unit test given that it's identical to what we did earlier for …
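For illustration only (this snippet is not from the PR): the 2**31 limit comes from appending past the 32-bit range of a C-typed buffer such as the one presumably used for indptr before this change, while a plain Python list has no such limit.

from array import array

indptr = array("i", [0])      # 32-bit signed C integers on most platforms
try:
    indptr.append(2**31)      # first value past the int32 range
except OverflowError as exc:
    print("array('i'):", exc)

indptr = [0]
indptr.append(2**31)          # Python ints are arbitrary precision
print("list:", indptr)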
@norvan Could you please run a small performance benchmark before and after this PR to make sure performance is not degraded? It shouldn't be, but better to double check. For instance, running the evaluation code in Jupyter with …
Ran the following code on both stable sklearn and with my changes applied.
Conclusion is that the change does not impair performance in a statistically meaningful way, considering the tolerance.

import numpy as np  # needed for np.uint16 below
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from numpy.random import random

n_samples = 1000000
data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]
vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)
%timeit vectorized_data = vectorizer.transform(data)
Minor comment, otherwise LGTM.
Thanks for the benchmarks, @norvan!
Fixes #13045
DictVectorizer's transform step raises an OverflowError when the input data is large enough that the resulting sparse matrix has more than 2**31 stored elements.
Replacing indptr with a Python list resolves this issue without impairing performance.
A similar solution was proposed by @rth in #9147.
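As a rough illustration of the idea (not the actual scikit-learn code; the names below are made up for the sketch): accumulate indptr in a plain Python list, whose integers are arbitrary precision, and let scipy pick an index dtype wide enough when the CSR matrix is built.

import scipy.sparse as sp

# Toy stand-in for the dict input and the fitted vocabulary.
rows = [{"a": 1.0, "b": 2.0}, {"b": 3.0}]
vocab = {"a": 0, "b": 1}

indices, values, indptr = [], [], [0]
for row in rows:
    for name, val in row.items():
        indices.append(vocab[name])
        values.append(val)
    indptr.append(len(indices))  # plain Python int, never overflows

# scipy converts the lists and chooses an index dtype (int32 or int64)
# that fits the offsets.
X = sp.csr_matrix((values, indices, indptr), shape=(len(rows), len(vocab)))
print(X.toarray())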
Performance checks were run locally to confirm there was no increase in the duration of the transform step:
Duration was:
- before the change: 2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- after the change: 2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Code:
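This is presumably the same benchmark posted in the conversation above; it is reproduced here with import numpy as np added so the snippet runs on its own.

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from numpy.random import random

n_samples = 1000000
data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]
vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)
%timeit vectorized_data = vectorizer.transform(data)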