-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX CountVectorizer: check upper case in vocab only in fit #21251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX CountVectorizer: check upper case in vocab only in fit #21251
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an entry to the change log at doc/whats_new/v1.0.1.rst
with tag |Fix|. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
.
Aside
What do you think about adding a new benchmark for performance regressions such as this one to benchmark suite? This is not a blocker for this PR and can be done as a follow up PR.
@@ -1318,6 +1307,17 @@ def fit_transform(self, raw_documents, y=None): | |||
min_df = self.min_df | |||
max_features = self.max_features | |||
|
|||
if self.fixed_vocabulary_ and self.lowercase: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered placing this check in _validate_vocabulary
, but the method is called here:
scikit-learn/sklearn/feature_extraction/text.py
Lines 2007 to 2010 in 31c66a9
@idf_.setter | |
def idf_(self, value): | |
self._validate_vocabulary() | |
if hasattr(self, "vocabulary_"): |
and I do not think it makes sense to warn here. (No action required)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I thought the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I wrong to say that we could add a test and where we could check that the warning will be called during fit_transform
but not during transform
?
The issue is that the benchmark suite already takes quite some time to run. I would only add benchmarks for the main estimators / parameters. However we don't have any bench for the vectorizers. We could maybe add one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@glemaitre I updated the existing test to check that now we don't raise at transform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @jeremiedbb |
Fixes #21242
Move the check for upper case chars in the vocabulary only at fit time.