Skip to content

FIX CountVectorizer: check upper case in vocab only in fit #21251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 20, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions doc/whats_new/v1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,14 @@ Fixed models
between sparse and dense input. :pr:`21195`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.feature_extraction`
.................................

- |Efficiency| Fixed an efficiency regression introduced in version 1.0.0 in the
`transform` method of :class:`feature_extraction.text.CountVectorizer` which no
longer checks for uppercase characters in the provided vocabulary. :pr:`21251`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.linear_model`
...........................

Expand Down
11 changes: 9 additions & 2 deletions sklearn/feature_extraction/tests/test_text.py
Original file line number Diff line number Diff line change
Expand Up @@ -436,7 +436,9 @@ def test_countvectorizer_custom_token_pattern_with_several_group():


def test_countvectorizer_uppercase_in_vocab():
vocabulary = ["Sample", "Upper", "CaseVocabulary"]
# Check that the check for uppercase in the provided vocabulary is only done at fit
# time and not at transform time (#21251)
vocabulary = ["Sample", "Upper", "Case", "Vocabulary"]
message = (
"Upper case characters found in"
" vocabulary while 'lowercase'"
Expand All @@ -445,8 +447,13 @@ def test_countvectorizer_uppercase_in_vocab():
)

vectorizer = CountVectorizer(lowercase=True, vocabulary=vocabulary)

with pytest.warns(UserWarning, match=message):
vectorizer.fit_transform(vocabulary)
vectorizer.fit(vocabulary)

with pytest.warns(None) as record:
vectorizer.transform(vocabulary)
assert not record


def test_tf_transformer_feature_names_out():
Expand Down
22 changes: 11 additions & 11 deletions sklearn/feature_extraction/text.py
Original file line number Diff line number Diff line change
Expand Up @@ -1194,17 +1194,6 @@ def _count_vocab(self, raw_documents, fixed_vocab):
j_indices = []
indptr = []

if self.lowercase:
for vocab in vocabulary:
if any(map(str.isupper, vocab)):
warnings.warn(
"Upper case characters found in"
" vocabulary while 'lowercase'"
" is True. These entries will not"
" be matched with any documents"
)
break

values = _make_int_array()
indptr.append(0)
for doc in raw_documents:
Expand Down Expand Up @@ -1327,6 +1316,17 @@ def fit_transform(self, raw_documents, y=None):
min_df = self.min_df
max_features = self.max_features

if self.fixed_vocabulary_ and self.lowercase:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered placing this check in _validate_vocabulary, but the method is called here:

@idf_.setter
def idf_(self, value):
self._validate_vocabulary()
if hasattr(self, "vocabulary_"):

and I do not think it makes sense to warn here. (No action required)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought the same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I wrong to say that we could add a test and where we could check that the warning will be called during fit_transform but not during transform?

for term in self.vocabulary:
if any(map(str.isupper, term)):
warnings.warn(
"Upper case characters found in"
" vocabulary while 'lowercase'"
" is True. These entries will not"
" be matched with any documents"
)
break

vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)

if self.binary:
Expand Down