Skip to content

FIX CountVectorizer: check upper case in vocab only in fit #21251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 20, 2021

Conversation

jeremiedbb
Copy link
Member

Fixes #21242

Move the check for upper case chars in the vocabulary only at fit time.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v1.0.1.rst with tag |Fix|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Aside

What do you think about adding a new benchmark for performance regressions such as this one to benchmark suite? This is not a blocker for this PR and can be done as a follow up PR.

@@ -1318,6 +1307,17 @@ def fit_transform(self, raw_documents, y=None):
min_df = self.min_df
max_features = self.max_features

if self.fixed_vocabulary_ and self.lowercase:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered placing this check in _validate_vocabulary, but the method is called here:

@idf_.setter
def idf_(self, value):
self._validate_vocabulary()
if hasattr(self, "vocabulary_"):

and I do not think it makes sense to warn here. (No action required)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought the same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I wrong to say that we could add a test and where we could check that the warning will be called during fit_transform but not during transform?

@thomasjpfan thomasjpfan added this to the 1.0.1 milestone Oct 6, 2021
@jeremiedbb
Copy link
Member Author

What do you think about adding a new benchmark for performance regressions such as this one to benchmark suite? This is not a blocker for this PR and can be done as a follow up PR.

The issue is that the benchmark suite already takes quite some time to run. I would only add benchmarks for the main estimators / parameters. However we don't have any bench for the vectorizers. We could maybe add one

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan changed the title CountVectorizer: check upper case in vocab only in fit FIX CountVectorizer: check upper case in vocab only in fit Oct 8, 2021
@jeremiedbb
Copy link
Member Author

@glemaitre I updated the existing test to check that now we don't raise at transform

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glemaitre glemaitre merged commit 7aabe53 into scikit-learn:main Oct 20, 2021
@glemaitre
Copy link
Member

Thanks @jeremiedbb

@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CountVectorizer.transform() much slower since version 1.0
3 participants