more memory-efficient word count calculation #4968
Conversation
travis is unhappy
@aupiff Is this still being worked on?
@jmschrei Yes! I forgot about this for a little while. I'm fixing a few failing tests now.
@jmschrei all tests passing now! sorry this took so long to fix.
Excellent! Would it be possible for you to provide some simple benchmarks, and the code which runs those benchmarks as well? It would make it easier for us to ensure the quality of your code.
 for feature in analyze(doc):
     try:
-        j_indices.append(vocabulary[feature])
+        vocab_idx = vocabulary[feature]
+        word_occurences = j_idx_to_count_dict.get(vocab_idx, 0)
Why not use a `Counter`?
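For illustration, a `Counter`-based version of this per-document counting loop might look like the following minimal sketch. Here `analyze` and `vocabulary` are simplified stand-ins for CountVectorizer's analyzer callable and term-to-column mapping, not code taken from this PR:

```python
from collections import Counter

def count_one_document(doc, analyze, vocabulary):
    # Sketch only: `analyze` is an analyzer callable and `vocabulary`
    # a term -> column-index mapping; both are assumed here.
    counts = Counter()
    for feature in analyze(doc):
        if feature in vocabulary:
            counts[vocabulary[feature]] += 1
    # One (column index, count) pair per distinct term in the document,
    # so no duplicate column indices are ever produced.
    return counts

vocab = {"memory": 0, "count": 1}
print(count_one_document("memory memory count", str.split, vocab))
# Counter({0: 2, 1: 1})
```

Whether `Counter` or a plain dict with `.get` is used, the point is the same: each document contributes at most one entry per distinct term rather than one entry per token.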
👍 for some benchmarks :)
I've added this vectorizer to my benchmark script for #5122 and added a line to the results table.
Superseded by #7272
(SPLITTING UP PR #4941)
Previously, line 760 in `text.py` would cause memory issues because `len(j_indices)` equals the entire corpus' word count. I ran into this with a dataset of 200,000 documents of ~4000 words each, where many gigabytes would be allocated temporarily. I've eliminated the need for that line and for the `X.sum_duplicates()` calculation, without a perceptible performance hit.
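To make the memory argument concrete, here is a rough sketch of the two strategies (not this PR's actual code; `analyze`, `vocabulary`, and the function names are assumptions for illustration). The old path grows `j_indices` by one entry per token and then needs `X.sum_duplicates()`, while the counting path, mirroring the `j_idx_to_count_dict` lines in the diff above, holds only one entry per distinct term per document and builds a duplicate-free CSR matrix directly:

```python
import numpy as np
import scipy.sparse as sp


def per_token_counts(docs, analyze, vocabulary):
    # Old strategy (sketch): len(j_indices) ends up equal to the total
    # corpus word count, and scipy must merge the duplicates afterwards.
    j_indices, indptr = [], [0]
    for doc in docs:
        for feature in analyze(doc):
            if feature in vocabulary:
                j_indices.append(vocabulary[feature])
        indptr.append(len(j_indices))
    data = np.ones(len(j_indices), dtype=np.intp)
    X = sp.csr_matrix((data, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)))
    X.sum_duplicates()  # the temporary-allocation-heavy step being avoided
    return X


def per_document_counts(docs, analyze, vocabulary):
    # New strategy (sketch): count within each document first, so the CSR
    # arrays never contain duplicate column indices and sum_duplicates()
    # is unnecessary.
    j_indices, values, indptr = [], [], [0]
    for doc in docs:
        counts = {}
        for feature in analyze(doc):
            if feature in vocabulary:
                idx = vocabulary[feature]
                counts[idx] = counts.get(idx, 0) + 1
        j_indices.extend(counts.keys())
        values.extend(counts.values())
        indptr.append(len(j_indices))
    return sp.csr_matrix((values, j_indices, indptr),
                         shape=(len(indptr) - 1, len(vocabulary)))
```

Given the same `analyze` and `vocabulary`, both functions produce the same counts; the difference is peak memory, since the first keeps one list entry per token in the corpus before merging duplicates.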