Idea: speed up the parallelization of CountVectorizer by adding a batch mechanism to joblib.Parallel #1401
Closed
@@ -16,6 +16,7 @@
 import unicodedata
 import warnings
 import numbers
+from ..externals.joblib import Parallel, delayed

 import numpy as np
 import scipy.sparse as sp
@@ -84,6 +85,41 @@ def _check_stop_list(stop):
     return stop


+###############################################################################
+# These two functions are required for the joblib parallelization of the
+# CV's analyze function. Since multiprocessing.Pool was causing me some
+# trouble with the pickling of instance members and lambda functions, I cut it
+# short by simply extracting the logic for a single case (word ngrams),
+# with some default parameters hardcoded. Hence, this is NOT meant to be a
+# complete solution, just the minimal code for my proof of concept.
+
+def _word_ngrams_single(tokens, stop_words=None):
+    """Turn tokens into a sequence of n-grams after stop words filtering"""
+    # handle stop words
+    if stop_words is not None:
+        tokens = [w for w in tokens if w not in stop_words]
+
+    # handle token n-grams
+    min_n, max_n = (1, 1)  # self.ngram_range
+    if max_n != 1:
+        original_tokens = tokens
+        tokens = []
+        n_original_tokens = len(original_tokens)
+        for n in xrange(min_n,
+                        min(max_n + 1, n_original_tokens + 1)):
+            for i in xrange(n_original_tokens - n + 1):
+                tokens.append(u" ".join(original_tokens[i: i + n]))
+
+    return tokens
+
+
+def _analyze_single(doc):
+    token_pattern = re.compile(ur"(?u)\b\w\w+\b")
+    return _word_ngrams_single(token_pattern.findall(doc.decode('utf-8', 'strict')))
+
+
+###############################################################################
+
+
 class CountVectorizer(BaseEstimator):
     """Convert a collection of raw documents to a matrix of token counts

@@ -432,7 +468,7 @@ def fit(self, raw_documents, y=None):
         self.fit_transform(raw_documents)
         return self

-    def fit_transform(self, raw_documents, y=None):
+    def fit_transform(self, raw_documents, y=None, n_jobs=1, batch_size=1):
         """Learn the vocabulary dictionary and return the count vectors

         This is more efficient than calling fit followed by transform.
@@ -467,15 +503,22 @@ def fit_transform(self, raw_documents, y=None):

         analyze = self.build_analyzer()

-        # TODO: parallelize the following loop with joblib?
-        # (see XXX up ahead)
-        for doc in raw_documents:
-            term_count_current = Counter(analyze(doc))
-            term_counts.update(term_count_current)
+        # Let's see if we can gain some speed by introducing a job batch mechanism
+        for analysis in Parallel(n_jobs=n_jobs,
+                                 batch_size=batch_size)(delayed(_analyze_single)(doc)
+                                                        for doc in raw_documents):
+            term_count_current = Counter(analysis)
+            term_counts.update(term_count_current)
             document_counts.update(term_count_current.iterkeys())

             term_counts_per_doc.append(term_count_current)

+        # TODO: parallelize the following loop with joblib?
+        # (see XXX up ahead)
+        # for doc in raw_documents:
+        #     term_count_current = Counter(analyze(doc))
+        #     term_counts.update(term_count_current)
+        #     document_counts.update(term_count_current.iterkeys())
+        #     term_counts_per_doc.append(term_count_current)
+
         n_doc = len(term_counts_per_doc)
         max_features = self.max_features
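Side note for readers trying this outside the PR branch: the bundled joblib had no batch_size argument at the time — adding one is precisely what this PR experiments with — but later joblib releases did grow a batch_size parameter on Parallel (default "auto"). Below is a minimal, self-contained sketch of the same batched analysis loop against a modern joblib, in Python 3; the document list and parameter values are made up for illustration and are not from this PR:

import re
from collections import Counter
from joblib import Parallel, delayed

# unigram-only analyzer, mirroring the hardcoded case in this PR
_token_pattern = re.compile(r"(?u)\b\w\w+\b")

def _analyze_single(doc):
    return _token_pattern.findall(doc)

docs = ["the quick brown fox", "jumps over the lazy dog"] * 1000

term_counts = Counter()
document_counts = Counter()
term_counts_per_doc = []
# batch_size groups several documents into one dispatched task, which is
# the point of the experiment: amortize the per-task dispatch overhead
for analysis in Parallel(n_jobs=2, batch_size=256)(
        delayed(_analyze_single)(doc) for doc in docs):
    term_count_current = Counter(analysis)
    term_counts.update(term_count_current)
    document_counts.update(term_count_current.keys())
    term_counts_per_doc.append(term_count_current)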
Review comment:

I don't get it. max_n will always be 1, since you set it that way the line before?

Author reply:

As I say in the comment just above, this whole PR is absolutely not meant to be a complete solution. It's just a tentative proof of concept, studying the idea for a single case (I guess I should have said "unigrams" or "1-grams" instead of "word ngrams"), hence the hardcoding of certain parameters.
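For what it's worth, lifting the hardcoded (1, 1) into an argument would be a small change. Here is a sketch of a generalized helper in Python 3; the ngram_range keyword below is an assumption mirroring CountVectorizer's attribute of the same name, not code from this PR:

def _word_ngrams_single(tokens, stop_words=None, ngram_range=(1, 1)):
    """Turn tokens into a sequence of n-grams after stop words filtering."""
    if stop_words is not None:
        tokens = [w for w in tokens if w not in stop_words]

    # the range is now supplied by the caller instead of being hardcoded,
    # so max_n is no longer pinned to 1 and the n-gram branch can run
    min_n, max_n = ngram_range
    if max_n != 1:
        original_tokens = tokens
        tokens = []
        n_original_tokens = len(original_tokens)
        for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
            for i in range(n_original_tokens - n + 1):
                tokens.append(" ".join(original_tokens[i: i + n]))

    return tokens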