Skip to content

[MRG] Fix stop words validation in text vectorizers with custom preprocessors / tokenizers #12393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 11, 2018

Conversation

rth
Copy link
Member

@rth rth commented Oct 16, 2018

Closes #12256

This fixes stop words validation in text vectorizers with custom preprocessors / tokenizers. In the end I went with trying to validate stop words and skipping the check if for some reason an exception is raised. I think it's a more general solution than checking if a custom tokenizer / preprocessor was provided as for a number of those it still makes sense to do stop word validation.

Tests are added to ensure that this does not produce silent exceptions which might be enough.

@rth rth added this to the 0.20.1 milestone Oct 16, 2018
was previously performed, "error" if it could not be
performed (e.g. because of the use of a custom
preprocessor / tokenizer)
"""
# NB: stop_words is validated, unlike self.stop_words
if id(self.stop_words) != getattr(self, '_stop_words_id', None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is getting very nested. Perhaps reverse the condition here and return early instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Good point, reduced the nesting.

@rth
Copy link
Member Author

rth commented Nov 6, 2018

Would anyone else able to review this? Maybe @glemaitre or @qinhanmin2014 ?

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay.

- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
:class:`ensemble.BaggingRegressor` and :class:`ensemble.IsolationForest`,
where ``max_features`` was sometimes rounded down to zero.
:issue:`12388` by :user:`Connor Tann <Connossor>`.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant line :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- fixed by the rebase.

:func:`feature_extraction.text.CountVectorizer` and other text vectorizers
could error during stop words validation with custom preprocessors
or tokenizers. :issue:`12393` by `Roman Yurchak`_.

- |Fix| Fixed a bug affecting :class:`ensemble.BaggingClassifier`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in sklearn.ensemble section

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was actually in the correct section, but there was some conflict that apparently got resolved by Github but must have resulted in this weird rendering.

Rebased it to be sure, should be fine now.

vec = CustomEstimator(stop_words=['and'])
assert _check_stop_words_consistency(vec) == 'error'

vec = CustomEstimator(tokenizer=lambda doc: re.compile(r'\w{1,}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this test? We now have no stop words to validate and _check_stop_words_consistency will return True, which is checked in previous test (L1168).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rth so no reply here means that you want this test? I'm not opposed to it though, just feel that it's redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this is to validate that when we use a custom tokenizer, stop words consistency will still be checked. But you are right the were no stop words here, and my intention was to use a standard vectorizer not CustomEstimator I think.

Should be fixed now.

@qinhanmin2014
Copy link
Member

gentle ping @rth, there's nothing much to do here (or you can share your opinion and I'll push to your branch)

@rth rth force-pushed the fix-stop-words-validation branch from 4ff060e to 88d3fa4 Compare November 11, 2018 13:03
@rth
Copy link
Member Author

rth commented Nov 11, 2018

Thanks for the review @qinhanmin2014 , there was indeed a bug in the last test :)

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will merge when green.

@qinhanmin2014 qinhanmin2014 merged commit eb36c28 into scikit-learn:master Nov 11, 2018
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 12, 2018
…ybutton

* upstream/master:
  FIX remove FutureWarning in _object_dtype_isnan and add test (scikit-learn#12567)
  DOC Add 's' to "correspond" in docs for Hamming Loss. (scikit-learn#12565)
  EXA Fix comment in plot-iris-logistic example (scikit-learn#12564)
  FIX stop words validation in text vectorizers with custom preprocessors / tokenizers (scikit-learn#12393)
  DOC Add skorch to related projects (scikit-learn#12561)
  MNT Don't change self.n_values in OneHotEncoder.fit (scikit-learn#12286)
  MNT Remove unused assert_true imports (scikit-learn#12560)
  TST autoreplace assert_true(...==...) with plain assert (scikit-learn#12547)
  DOC: add a testimonial from JP Morgan (scikit-learn#12555)
thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 13, 2018
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 13, 2018
…ikit-learn into add_codeblock_copybutton

* 'add_codeblock_copybutton' of https://github.com/thoo/scikit-learn:
  Move an extension under sphinx_copybutton/
  Move css/js file under sphinxext/
  Fix max_depth overshoot in BFS expansion of trees (scikit-learn#12344)
  TST don't test utils.fixes docstrings (scikit-learn#12576)
  DOC Fix typo (scikit-learn#12563)
  FIX Workaround limitation of cloudpickle under PyPy (scikit-learn#12566)
  MNT bare asserts (scikit-learn#12571)
  FIX incorrect error when OneHotEncoder.transform called prior to fit (scikit-learn#12443)
  Retrigger travis:max time limit error
  DOC: Clarify `cv` parameter description in `GridSearchCV` (scikit-learn#12495)
  FIX remove FutureWarning in _object_dtype_isnan and add test (scikit-learn#12567)
  DOC Add 's' to "correspond" in docs for Hamming Loss. (scikit-learn#12565)
  EXA Fix comment in plot-iris-logistic example (scikit-learn#12564)
  FIX stop words validation in text vectorizers with custom preprocessors / tokenizers (scikit-learn#12393)
  DOC Add skorch to related projects (scikit-learn#12561)
  MNT Don't change self.n_values in OneHotEncoder.fit (scikit-learn#12286)
  MNT Remove unused assert_true imports (scikit-learn#12560)
  TST autoreplace assert_true(...==...) with plain assert (scikit-learn#12547)
  DOC: add a testimonial from JP Morgan (scikit-learn#12555)
thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Errow thrown when running TF-IDF vectorizer on scikit-learn 0.20
3 participants