Skip to content

more memory-efficient word count calculation #4968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

more memory-efficient word count calculation #4968

wants to merge 3 commits into from

Conversation

aupiff
Copy link

@aupiff aupiff commented Jul 12, 2015

(SPLITTING UP PR #4941)

previously, line 760 in text.py

values = np.ones(len(j_indices))

would cause memory issues as len(j_indices) is equal to the entire corpus' word count. I had issues with a dataset of 200,000 documents with ~4000 words each when many gigabytes would be allocated temporarily. I've eliminated the need for this line and the X.sum_duplicates calculation without a perceptible performance hit.

@amueller
Copy link
Member

travis is unhappy

@jmschrei
Copy link
Member

jmschrei commented Aug 3, 2015

@aupiff Is this still being worked on?

@aupiff
Copy link
Author

aupiff commented Aug 4, 2015

@jmschrei Yes! I forgot about this for a little while. I'm fixing a few failing tests now.

@aupiff
Copy link
Author

aupiff commented Aug 4, 2015

@jmschrei all tests passing now! sorry this took so long to fix.

@jmschrei
Copy link
Member

jmschrei commented Aug 4, 2015

Excellent! Would it be possible for you to provide some simple benchmarks, and the code which runs those benchmarks as well? It would make it easier for us to ensure the quality of your code.

for feature in analyze(doc):
try:
j_indices.append(vocabulary[feature])
vocab_idx = vocabulary[feature]
word_occurences = j_idx_to_count_dict.get(vocab_idx, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a Counter?

@amueller
Copy link
Member

👍 for some benchmarks :)

@ephes
Copy link
Contributor

ephes commented Aug 16, 2015

I've added this vectorizer to my benchmark script for #5122 and added a line to the results table.

@jnothman
Copy link
Member

Superseded by #7272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants