[MRG] Fix stop words validation in text vectorizers with custom preprocessors / tokenizers #12393
Conversation
sklearn/feature_extraction/text.py
Outdated
    was previously performed, "error" if it could not be
    performed (e.g. because of the use of a custom
    preprocessor / tokenizer)
    """
    # NB: stop_words is validated, unlike self.stop_words
    if id(self.stop_words) != getattr(self, '_stop_words_id', None):
this is getting very nested. Perhaps reverse the condition here and return early instead
Thanks for the review. Good point, reduced the nesting.
Would anyone else be able to review this? Maybe @glemaitre or @qinhanmin2014?
Apologies for the delay.
doc/whats_new/v0.20.rst
Outdated
- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
  :class:`ensemble.BaggingRegressor` and :class:`ensemble.IsolationForest`,
  where ``max_features`` was sometimes rounded down to zero.
  :issue:`12388` by :user:`Connor Tann <Connossor>`.
redundant line :)
Same here -- fixed by the rebase.
doc/whats_new/v0.20.rst
Outdated
  :func:`feature_extraction.text.CountVectorizer` and other text vectorizers
  could error during stop words validation with custom preprocessors
  or tokenizers. :issue:`12393` by `Roman Yurchak`_.

- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
this should be in the ``sklearn.ensemble`` section
It was actually in the correct section, but there was a conflict that GitHub apparently resolved in a way that produced this odd rendering.
Rebased to be sure; should be fine now.
vec = CustomEstimator(stop_words=['and'])
assert _check_stop_words_consistency(vec) == 'error'

vec = CustomEstimator(tokenizer=lambda doc: re.compile(r'\w{1,}')
What's the purpose of this test? We now have no stop words to validate, so _check_stop_words_consistency
will return True, which is already checked by the previous test (L1168).
@rth so does no reply here mean that you want this test? I'm not opposed to it, I just feel that it's redundant.
The point of this test is to validate that stop words consistency is still checked when a custom tokenizer is used. But you are right, there were no stop words here, and my intention was to use a standard vectorizer, not CustomEstimator, I think.
Should be fixed now.
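The fixed test pairs a custom tokenizer with actual stop words. A self-contained sketch of the behaviour being tested (the helper below is a simplified stand-in for the real scikit-learn function and test, with hypothetical names):

```python
import re

def check_stop_words_consistency(stop_words, tokenize):
    """Simplified stand-in for the validation helper under test."""
    try:
        inconsistent = {token for word in stop_words
                        for token in tokenize(word)
                        if token not in stop_words}
        return not inconsistent
    except Exception:
        return 'error'

# Custom tokenizer *with* stop words: validation still runs and passes.
tokenize = re.compile(r'\w{1,}').findall
assert check_stop_words_consistency(['and'], tokenize) is True

# A multi-word "stop word": its tokens are not in the list, so it fails.
assert check_stop_words_consistency(['and the'], tokenize) is False

# A tokenizer that chokes on bare words: the check is skipped.
def broken(doc):
    return doc.words  # str has no .words attribute -> AttributeError
assert check_stop_words_consistency(['and'], broken) == 'error'
```

The middle assertion is why a custom tokenizer with stop words is still worth testing: the check can meaningfully pass or fail, not just error out.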
Gentle ping @rth, there's nothing much to do here (or you can share your opinion and I'll push to your branch).
Force-pushed from 4ff060e to 88d3fa4.
Thanks for the review @qinhanmin2014, there was indeed a bug in the last test :)
Thanks, will merge when green.
(Merge commits pulling upstream/master into the add_codeblock_copybutton branch later referenced this PR: "FIX stop words validation in text vectorizers with custom preprocessors / tokenizers", scikit-learn#12393.)
Revert "FIX stop words validation in text vectorizers with custom preprocessors / tokenizers (scikit-learn#12393)". This reverts commit d55ce9d.
Closes #12256
This fixes stop words validation in text vectorizers with custom preprocessors / tokenizers. In the end I went with attempting to validate the stop words and skipping the check if the attempt raises an exception. I think this is a more general solution than checking whether a custom tokenizer / preprocessor was provided, since for many of those it still makes sense to validate stop words.
Tests are added to ensure that exceptions are not silently swallowed, which I think should be enough.
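The approach described above, attempt the validation and fall back to skipping it, might be sketched as follows (hedged: the function name, warning message, and return values are illustrative, not the exact scikit-learn implementation):

```python
import warnings

def validate_stop_words(stop_words, analyze):
    """Try to validate stop words against the analyzer.

    Returns True when consistent, False (plus a warning) when not,
    and 'error' when the check could not be performed at all.
    """
    try:
        inconsistent = set()
        for word in stop_words:
            for token in analyze(word):
                if token not in stop_words:
                    inconsistent.add(token)
    except Exception:
        # A custom preprocessor / tokenizer may not accept bare
        # words: never let validation itself break fit().
        return 'error'
    if inconsistent:
        warnings.warn("Your stop_words may be inconsistent with your "
                      "preprocessing; tokenizing the stop words produced "
                      "tokens %r not in stop_words." % sorted(inconsistent))
        return False
    return True
```

The key property the tests check is that the `except` branch yields an explicit 'error' result rather than a silently swallowed exception.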