more memory-efficient word count calculation #4968
Conversation
travis is unhappy
@aupiff Is this still being worked on?
@jmschrei Yes! I forgot about this for a little while. I'm fixing a few failing tests now.
@jmschrei all tests passing now! sorry this took so long to fix.
Excellent! Would it be possible for you to provide some simple benchmarks, and the code which runs those benchmarks as well? It would make it easier for us to ensure the quality of your code.
 for feature in analyze(doc):
     try:
-        j_indices.append(vocabulary[feature])
+        vocab_idx = vocabulary[feature]
+        word_occurences = j_idx_to_count_dict.get(vocab_idx, 0)
Why not use a `Counter`?
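For illustration, a `Counter`-based version of this per-document counting loop might look like the following minimal sketch. Here `analyze` and `vocabulary` are simplified stand-ins for CountVectorizer's analyzer callable and term-to-column mapping, not code taken from this PR:

```python
from collections import Counter

def count_one_document(doc, analyze, vocabulary):
    # Sketch only: `analyze` is an analyzer callable and `vocabulary`
    # a term -> column-index mapping; both are assumed here.
    counts = Counter()
    for feature in analyze(doc):
        if feature in vocabulary:
            counts[vocabulary[feature]] += 1
    # One (column index, count) pair per distinct term in the document,
    # so no duplicate column indices are ever produced.
    return counts

vocab = {"memory": 0, "count": 1}
print(count_one_document("memory memory count", str.split, vocab))
# Counter({0: 2, 1: 1})
```

Whether `Counter` or a plain dict with `.get` is used, the point is the same: each document contributes at most one entry per distinct term rather than one entry per token.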
👍 for some benchmarks :)
I've added this vectorizer to my benchmark script for #5122 and added a line to the results table.
Superseded by #7272
(SPLITTING UP PR #4941)
Previously, line 760 in `text.py` would cause memory issues because `len(j_indices)` equals the entire corpus' word count. I ran into this with a dataset of 200,000 documents of ~4000 words each, where many gigabytes would be allocated temporarily. I've eliminated the need for that line and for the `X.sum_duplicates()` calculation, without a perceptible performance hit.
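To make the memory argument concrete, here is a rough sketch of the two strategies (not this PR's actual code; `analyze`, `vocabulary`, and the function names are assumptions for illustration). The old path grows `j_indices` by one entry per token and then needs `X.sum_duplicates()`, while the counting path, mirroring the `j_idx_to_count_dict` lines in the diff above, holds only one entry per distinct term per document and builds a duplicate-free CSR matrix directly:

```python
import numpy as np
import scipy.sparse as sp


def per_token_counts(docs, analyze, vocabulary):
    # Old strategy (sketch): len(j_indices) ends up equal to the total
    # corpus word count, and scipy must merge the duplicates afterwards.
    j_indices, indptr = [], [0]
    for doc in docs:
        for feature in analyze(doc):
            if feature in vocabulary:
                j_indices.append(vocabulary[feature])
        indptr.append(len(j_indices))
    data = np.ones(len(j_indices), dtype=np.intp)
    X = sp.csr_matrix((data, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)))
    X.sum_duplicates()  # the temporary-allocation-heavy step being avoided
    return X


def per_document_counts(docs, analyze, vocabulary):
    # New strategy (sketch): count within each document first, so the CSR
    # arrays never contain duplicate column indices and sum_duplicates()
    # is unnecessary.
    j_indices, values, indptr = [], [], [0]
    for doc in docs:
        counts = {}
        for feature in analyze(doc):
            if feature in vocabulary:
                idx = vocabulary[feature]
                counts[idx] = counts.get(idx, 0) + 1
        j_indices.extend(counts.keys())
        values.extend(counts.values())
        indptr.append(len(j_indices))
    return sp.csr_matrix((values, j_indices, indptr),
                         shape=(len(indptr) - 1, len(vocabulary)))
```

Given the same `analyze` and `vocabulary`, both functions produce the same counts; the difference is peak memory, since the first keeps one list entry per token in the corpus before merging duplicates.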