[MRG] Fix stop words validation in text vectorizers with custom preprocessors / tokenizers #12393
Conversation
sklearn/feature_extraction/text.py
Outdated
    was previously performed, "error" if it could not be
    performed (e.g. because of the use of a custom
    preprocessor / tokenizer)
    """
    # NB: stop_words is validated, unlike self.stop_words
    if id(self.stop_words) != getattr(self, '_stop_words_id', None):
this is getting very nested. Perhaps reverse the condition here and return early instead
Thanks for the review. Good point, reduced the nesting.
Would anyone else be able to review this? Maybe @glemaitre or @qinhanmin2014?
Apologies for the delay.
doc/whats_new/v0.20.rst
Outdated
- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
  :class:`ensemble.BaggingRegressor` and :class:`ensemble.IsolationForest`,
  where ``max_features`` was sometimes rounded down to zero.
  :issue:`12388` by :user:`Connor Tann <Connossor>`.
redundant line :)
Same here -- fixed by the rebase.
doc/whats_new/v0.20.rst
Outdated
  :func:`feature_extraction.text.CountVectorizer` and other text vectorizers
  could error during stop words validation with custom preprocessors
  or tokenizers. :issue:`12393` by `Roman Yurchak`_.

- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
this should be in the ``sklearn.ensemble`` section
It was actually in the correct section, but there was a conflict that GitHub apparently resolved in a way that produced this odd rendering.
Rebased to be sure; should be fine now.
vec = CustomEstimator(stop_words=['and'])
assert _check_stop_words_consistency(vec) == 'error'

vec = CustomEstimator(tokenizer=lambda doc: re.compile(r'\w{1,}')
What's the purpose of this test? We now have no stop words to validate, so _check_stop_words_consistency
will return True, which is already checked by the previous test (L1168).
@rth so does no reply here mean that you want this test? I'm not opposed to it, I just feel that it's redundant.
The point of this test is to validate that stop words consistency is still checked when a custom tokenizer is used. But you are right, there were no stop words here, and my intention was to use a standard vectorizer, not CustomEstimator, I think.
Should be fixed now.
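The fixed test pairs a custom tokenizer with actual stop words. A self-contained sketch of the behaviour being tested (the helper below is a simplified stand-in for the real scikit-learn function and test, with hypothetical names):

```python
import re

def check_stop_words_consistency(stop_words, tokenize):
    """Simplified stand-in for the validation helper under test."""
    try:
        inconsistent = {token for word in stop_words
                        for token in tokenize(word)
                        if token not in stop_words}
        return not inconsistent
    except Exception:
        return 'error'

# Custom tokenizer *with* stop words: validation still runs and passes.
tokenize = re.compile(r'\w{1,}').findall
assert check_stop_words_consistency(['and'], tokenize) is True

# A multi-word "stop word": its tokens are not in the list, so it fails.
assert check_stop_words_consistency(['and the'], tokenize) is False

# A tokenizer that chokes on bare words: the check is skipped.
def broken(doc):
    return doc.words  # str has no .words attribute -> AttributeError
assert check_stop_words_consistency(['and'], broken) == 'error'
```

The middle assertion is why a custom tokenizer with stop words is still worth testing: the check can meaningfully pass or fail, not just error out.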
Gentle ping @rth, there's nothing much to do here (or you can share your opinion and I'll push to your branch).
Force-pushed from 4ff060e to 88d3fa4.
Thanks for the review @qinhanmin2014, there was indeed a bug in the last test :)
Thanks, will merge when green.
(Merge commits pulling upstream/master into the add_codeblock_copybutton branch later referenced this PR: "FIX stop words validation in text vectorizers with custom preprocessors / tokenizers", scikit-learn#12393.)
Revert "FIX stop words validation in text vectorizers with custom preprocessors / tokenizers (scikit-learn#12393)". This reverts commit d55ce9d.
Closes #12256
This fixes stop words validation in text vectorizers with custom preprocessors / tokenizers. In the end I went with attempting to validate the stop words and skipping the check if the attempt raises an exception. I think this is a more general solution than checking whether a custom tokenizer / preprocessor was provided, since for many of those it still makes sense to validate stop words.
Tests are added to ensure that exceptions are not silently swallowed, which I think should be enough.
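The approach described above, attempt the validation and fall back to skipping it, might be sketched as follows (hedged: the function name, warning message, and return values are illustrative, not the exact scikit-learn implementation):

```python
import warnings

def validate_stop_words(stop_words, analyze):
    """Try to validate stop words against the analyzer.

    Returns True when consistent, False (plus a warning) when not,
    and 'error' when the check could not be performed at all.
    """
    try:
        inconsistent = set()
        for word in stop_words:
            for token in analyze(word):
                if token not in stop_words:
                    inconsistent.add(token)
    except Exception:
        # A custom preprocessor / tokenizer may not accept bare
        # words: never let validation itself break fit().
        return 'error'
    if inconsistent:
        warnings.warn("Your stop_words may be inconsistent with your "
                      "preprocessing; tokenizing the stop words produced "
                      "tokens %r not in stop_words." % sorted(inconsistent))
        return False
    return True
```

The key property the tests check is that the `except` branch yields an explicit 'error' result rather than a silently swallowed exception.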