FIX CountVectorizer: check upper case in vocab only in fit #21251

jeremiedbb · 2021-10-06T10:10:21Z

Run the check for upper-case characters in the vocabulary only at fit time.
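
A minimal sketch of the intended behaviour, using a simplified stand-in class. `ToyCountVectorizer` and its warning message are hypothetical and are not the scikit-learn implementation; the point is only that the upper-case check on a user-supplied vocabulary runs in `fit_transform` and is skipped in `transform`.

```python
import warnings


class ToyCountVectorizer:
    """Hypothetical stand-in, not scikit-learn's CountVectorizer."""

    def __init__(self, vocabulary, lowercase=True):
        self.vocabulary = list(vocabulary)
        self.lowercase = lowercase

    def _count(self, raw_documents):
        # Map each vocabulary term to a column index and count matches.
        index = {term: i for i, term in enumerate(self.vocabulary)}
        rows = []
        for doc in raw_documents:
            text = doc.lower() if self.lowercase else doc
            row = [0] * len(index)
            for token in text.split():
                if token in index:
                    row[index[token]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, raw_documents):
        # The check lives here only: with lowercase=True, upper-case
        # vocabulary terms can never match the lowercased tokens.
        if self.lowercase and any(t != t.lower() for t in self.vocabulary):
            warnings.warn(
                "Upper case characters found in vocabulary "
                "while 'lowercase' is True.",
                UserWarning,
            )
        return self._count(raw_documents)

    def transform(self, raw_documents):
        # No vocabulary scan here, so repeated calls do not pay for it.
        return self._count(raw_documents)
```

With a vocabulary such as `["Sample", "text"]`, the first call to `fit_transform` warns once, and later `transform` calls stay silent.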

thomasjpfan

Please add an entry to the change log at doc/whats_new/v1.0.1.rst with tag |Fix|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Aside

What do you think about adding a new benchmark for performance regressions such as this one to benchmark suite? This is not a blocker for this PR and can be done as a follow up PR.

thomasjpfan · 2021-10-06T16:41:29Z

sklearn/feature_extraction/text.py

@@ -1318,6 +1307,17 @@ def fit_transform(self, raw_documents, y=None):
        min_df = self.min_df
        max_features = self.max_features

+        if self.fixed_vocabulary_ and self.lowercase:


I considered placing this check in _validate_vocabulary, but the method is called here:

scikit-learn/sklearn/feature_extraction/text.py

Lines 2007 to 2010 in 31c66a9

    @idf_.setter
    def idf_(self, value):
        self._validate_vocabulary()
        if hasattr(self, "vocabulary_"):

and I do not think it makes sense to warn here. (No action required)

Yes, I thought the same.

Could we add a test that checks the warning is raised during fit_transform but not during transform?
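
One way such a test could look, sketched with the standard library only. `StubVectorizer`, `warns_on` and `check_warning_only_at_fit` are hypothetical names introduced for illustration; the project's own test suite would presumably use its pytest helpers on the real estimator instead.

```python
import warnings


def warns_on(method, *args):
    """Return True if calling method(*args) emits at least one UserWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        method(*args)
    return any(issubclass(w.category, UserWarning) for w in caught)


def check_warning_only_at_fit(vectorizer, docs):
    # Assumes the object exposes fit_transform and transform.
    assert warns_on(vectorizer.fit_transform, docs)
    assert not warns_on(vectorizer.transform, docs)


class StubVectorizer:
    """Hypothetical stand-in that warns only in fit_transform."""

    def fit_transform(self, docs):
        warnings.warn("upper case in vocabulary", UserWarning)
        return self.transform(docs)

    def transform(self, docs):
        # Dummy output: number of whitespace-separated tokens per document.
        return [len(d.split()) for d in docs]


check_warning_only_at_fit(StubVectorizer(), ["a b"])
```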

jeremiedbb · 2021-10-07T15:56:11Z

> What do you think about adding a new benchmark for performance regressions such as this one to benchmark suite? This is not a blocker for this PR and can be done as a follow up PR.

The issue is that the benchmark suite already takes quite some time to run, so I would only add benchmarks for the main estimators and parameters. However, we don't have any benchmark for the vectorizers. Maybe we could add one.
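
A rough sketch of what a micro-benchmark for this kind of regression might measure, written with `timeit` rather than in the format of the project's benchmark suite. It assumes the cost comes from scanning the whole vocabulary on every `transform` call; the function names and sizes are hypothetical.

```python
import timeit


def has_upper(vocabulary):
    # Full scan of the vocabulary, linear in the number of terms.
    return any(term != term.lower() for term in vocabulary)


def transform_with_check(vocabulary, doc):
    has_upper(vocabulary)  # scan repeated on every call
    return sum(token in vocabulary for token in doc.split())


def transform_without_check(vocabulary, doc):
    return sum(token in vocabulary for token in doc.split())


def bench(n_terms=50_000, repeat=20):
    """Return total seconds for (with check, without check)."""
    vocabulary = {f"term{i}" for i in range(n_terms)}
    doc = "term1 term2 unknown"
    with_check = timeit.timeit(
        lambda: transform_with_check(vocabulary, doc), number=repeat
    )
    without_check = timeit.timeit(
        lambda: transform_without_check(vocabulary, doc), number=repeat
    )
    return with_check, without_check


if __name__ == "__main__":
    with_check, without_check = bench()
    print(f"with check: {with_check:.4f}s, without: {without_check:.4f}s")
```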

thomasjpfan

LGTM

jeremiedbb · 2021-10-20T14:25:40Z

@glemaitre I updated the existing test to check that we no longer raise the warning at transform.

glemaitre

LGTM

glemaitre · 2021-10-20T15:16:40Z

Thanks @jeremiedbb

…arn#21251)

vocab check for upper only in fit

abfcfb3

github-actions bot added the module:feature_extraction label Oct 6, 2021

thomasjpfan reviewed Oct 6, 2021

thomasjpfan added this to the 1.0.1 milestone Oct 6, 2021

what's new entry

9ad6e1f

thomasjpfan approved these changes Oct 8, 2021

thomasjpfan changed the title ~~CountVectorizer: check upper case in vocab only in fit~~ FIX CountVectorizer: check upper case in vocab only in fit Oct 8, 2021

jeremiedbb added 3 commits October 19, 2021 10:34

Merge branch 'master' into count-vect-vocab-upper

c8a33c1

cln

76a6a86

update existing test

bd1ff22

black

cd04847

glemaitre approved these changes Oct 20, 2021

glemaitre merged commit 7aabe53 into scikit-learn:main Oct 20, 2021

glemaitre mentioned this pull request Oct 23, 2021

Release 1.0.1 #21404

Merged

10 tasks

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021

FIX CountVectorizer: check upper case in vocab only in fit (scikit-le…

099d455

…arn#21251)

glemaitre pushed a commit that referenced this pull request Oct 25, 2021

FIX CountVectorizer: check upper case in vocab only in fit (#21251)

ff00693

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

FIX CountVectorizer: check upper case in vocab only in fit (scikit-le…

dfd061d

…arn#21251)
