[MRG] Fix OverflowError on DictVectorizer #15463

norvan · 2019-11-02T17:39:15Z

DictVectorizer's transform step raises an OverflowError if the input data has too many samples.
Replacing the indptr array with a plain Python list resolves this issue without impairing performance.

A similar solution was proposed by @rth in #9147.

A performance check was run locally to confirm that the duration of the transform step did not increase:
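As a rough illustration of why switching indptr to a list avoids the error: the sketch below assumes that indptr was previously a typed C-int array (`array("i")`), which is an assumption and is not stated in this thread. The helper name `append_offset` is hypothetical and only serves the demonstration.

```python
from array import array


def append_offset(container, offset):
    """Append a row offset; return True on success, False on OverflowError."""
    try:
        container.append(offset)
        return True
    except OverflowError:
        return False


# Assumption: the old indptr was a typed array of C ints.
typed_indptr = array("i", [0])
# The proposed replacement: a plain list of Python ints, which cannot overflow.
list_indptr = [0]

# Smallest offset that no longer fits in a signed C int on this platform.
too_big = 2 ** (8 * typed_indptr.itemsize - 1)

typed_ok = append_offset(typed_indptr, too_big)  # the typed array rejects it
list_ok = append_offset(list_indptr, too_big)    # the list accepts it
```

The typed array fails as soon as the cumulative number of stored entries passes the C int limit, while the list simply stores a larger Python integer and leaves the final dtype decision to the sparse matrix constructor.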

Duration was:

2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Code:

import numpy as np
from numpy.random import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

n_samples = 1000000

# One dict per sample: a unique integer plus a random integer in [0, 100).
data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

# IPython/Jupyter magic: times only the transform step.
%timeit vectorized_data = vectorizer.transform(data)


TomDLT · 2019-11-02T17:48:12Z

Thanks for the pull request!
Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py?

rth · 2019-11-02T17:57:20Z

> Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py ?

The problem is that this only happens for sparse arrays with more than 2**31 elements, which corresponds to tens of GB of RAM. I think it's OK to merge it without a unit test, given that it's identical to what we did earlier for CountVectorizer in #9147.
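For a sense of scale, here is a back-of-the-envelope estimate of the memory needed just to hold a sparse matrix at that threshold. It assumes 4-byte column indices and ignores indptr and the Python-side input dicts, so the real footprint would be larger. The function name `csr_payload_gib` is hypothetical.

```python
NNZ = 2 ** 31        # number of stored elements at the overflow threshold
INDEX_BYTES = 4      # assumption: one 4-byte column index per stored element


def csr_payload_gib(value_bytes, nnz=NNZ):
    """Approximate size in GiB of the data and indices arrays of a CSR matrix."""
    return nnz * (value_bytes + INDEX_BYTES) / 2 ** 30


float64_gib = csr_payload_gib(8)  # 8-byte values
uint16_gib = csr_payload_gib(2)   # 2-byte values, as in the benchmark dtype
```

Under these assumptions the matrix alone needs roughly 12 to 24 GiB, which is why a regression test that actually triggers the overflow is impractical in CI.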

rth · 2019-11-02T18:00:20Z

@norvan Could you please run a small performance benchmark before and after this PR to make sure performance is not degraded? It shouldn't be, but better to double check.

For instance, timing DictVectorizer with %timeit in Jupyter should be enough. Thanks!

norvan · 2019-11-02T18:47:06Z

Ran the following code on both stable sklearn and after my changes.
Duration was:

2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Conclusion: given the run-to-run variation, the change does not impair performance in a statistically meaningful way.

import numpy as np
from numpy.random import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

n_samples = 1000000

# One dict per sample: a unique integer plus a random integer in [0, 100).
data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

# IPython/Jupyter magic: times only the transform step.
%timeit vectorized_data = vectorizer.transform(data)

rth

Minor comment, otherwise LGTM.

Thanks for the benchmarks, @norvan!

doc/whats_new/v0.22.rst

norvan added 2 commits November 2, 2019 17:21

Prevent OverflowError by instantiating indptr as a list

c5a0051

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

14f2c64

…into feature/overflowerror_dictvectorizer

norvan changed the title ~~Fix OverflowError on DictVectorizer~~ [WIP] Fix OverflowError on DictVectorizer Nov 2, 2019

Add What's New entry for OverflowError on DictVectorizer

2c9455c

norvan changed the title ~~[WIP] Fix OverflowError on DictVectorizer~~ [MRG] Fix OverflowError on DictVectorizer Nov 2, 2019

rth approved these changes Nov 2, 2019


Add missing PR number on What's New

dff8ae2

jnothman approved these changes Nov 2, 2019


jnothman merged commit 77aec1f into scikit-learn:master Nov 2, 2019
