Skip to content

[MRG] Fix incorrect error when OneHotEncoder.transform called prior to fit #12443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 12, 2018

Conversation

dillongardner
Copy link
Contributor

@dillongardner dillongardner commented Oct 23, 2018

Reference Issues/PRs

Fixes #12395

What does this implement/fix? Explain your changes.

Fix checks for categories_ attribute when transform is called.

The edge case when class is instantiated with categorical_features set to all False, _legacy_mode is set to True but _legacy_fit_transform never gets called. To fix this, explicitly tested for this and set categories_ to an empty list. This fix also allows get_feature_names to return an empty array.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[3, 2, 1], [0, 1, 1]])

# Edge case: all non-categorical
cat = [False, False, False]
enc = OneHotEncoder(categorical_features=cat)
enc.fit(X)
print('Categories are {}'.format(enc.categories_))
print('Feature names are {}'.format(enc.get_feature_names()))
Categories are []
Feature names are []

Any other comments?

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a non regression test and rename the title to a summary of what this PR does.

n_features = X.shape[1]
sel = np.zeros(n_features, dtype=bool)
sel[np.asarray(self.categorical_features)] = True
if sum(sel) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not looked at the code in detail, but isn't this equivalent to,

not self.categorical_features.any()

or something similar ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categorical_features can be a array of indices or a mask. This works in either case. I followed the pattern in sklearn.preprocessing.base._transform_selected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment above this code block to explain the specific case that this code is checking for?

@dillongardner dillongardner changed the title [MRG] Fix #12395 [MRG] Fix incorrect error when OneHotEncoder.transform called prior to fit Oct 23, 2018
@amueller
Copy link
Member

the failure is because you don't ignore the FutureWarning, I think.

X_tr = enc.fit_transform(X)
expected_features = np.array(list(), dtype='object')
assert_array_equal(X, X_tr)
assert_equal(enc.categories_, list())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the adoption of pytest, we are phasing out use of test helpers assert_equal, assert_true, etc. Please use bare assert statements, e.g. assert x == y, assert not x, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true for assert_array_equal? Is it preferable to do assert np.array_equal(x, y)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note in the whatsnew file?

n_features = X.shape[1]
sel = np.zeros(n_features, dtype=bool)
sel[np.asarray(self.categorical_features)] = True
if sum(sel) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment above this code block to explain the specific case that this code is checking for?

@amueller
Copy link
Member

this looks good. I'm not sure if we need a whatsnew for a better error message. I'm fine with doing it this way for the 0.20.1 release and we can to a no-fitting transformation later. Wdyt? @jnothman @ogrisel ?

@jnothman jnothman merged commit 2afee93 into scikit-learn:master Nov 12, 2018
@jnothman
Copy link
Member

Thanks @dillongardner

thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 13, 2018
…ybutton

* upstream/master:
  Fix max_depth overshoot in BFS expansion of trees (scikit-learn#12344)
  TST don't test utils.fixes docstrings (scikit-learn#12576)
  DOC Fix typo (scikit-learn#12563)
  FIX Workaround limitation of cloudpickle under PyPy (scikit-learn#12566)
  MNT bare asserts (scikit-learn#12571)
  FIX incorrect error when OneHotEncoder.transform called prior to fit (scikit-learn#12443)
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 13, 2018
…ikit-learn into add_codeblock_copybutton

* 'add_codeblock_copybutton' of https://github.com/thoo/scikit-learn:
  Move an extension under sphinx_copybutton/
  Move css/js file under sphinxext/
  Fix max_depth overshoot in BFS expansion of trees (scikit-learn#12344)
  TST don't test utils.fixes docstrings (scikit-learn#12576)
  DOC Fix typo (scikit-learn#12563)
  FIX Workaround limitation of cloudpickle under PyPy (scikit-learn#12566)
  MNT bare asserts (scikit-learn#12571)
  FIX incorrect error when OneHotEncoder.transform called prior to fit (scikit-learn#12443)
  Retrigger travis:max time limit error
  DOC: Clarify `cv` parameter description in `GridSearchCV` (scikit-learn#12495)
  FIX remove FutureWarning in _object_dtype_isnan and add test (scikit-learn#12567)
  DOC Add 's' to "correspond" in docs for Hamming Loss. (scikit-learn#12565)
  EXA Fix comment in plot-iris-logistic example (scikit-learn#12564)
  FIX stop words validation in text vectorizers with custom preprocessors / tokenizers (scikit-learn#12393)
  DOC Add skorch to related projects (scikit-learn#12561)
  MNT Don't change self.n_values in OneHotEncoder.fit (scikit-learn#12286)
  MNT Remove unused assert_true imports (scikit-learn#12560)
  TST autoreplace assert_true(...==...) with plain assert (scikit-learn#12547)
  DOC: add a testimonial from JP Morgan (scikit-learn#12555)
@arnau126
Copy link

This change breaks transform in _legacy_mode. It raises NotFittedError when it's actually fitted. This is because check_is_fitted checks categories_ when _legacy_transform doesn't use it at all.

In _legacy_mode we should check _feature_indices_, _n_values_ and _active_features_ instead.

I propose:

def transform(self, X):
    if self._legacy_mode:
        check_is_fitted(self, ('_feature_indices_', '_n_values_', '_active_features_'))
        return _transform_selected(X, self._legacy_transform, self.dtype,
                                                    self._categorical_features, copy=True)
    else:
        check_is_fitted(self, 'categories_')
        return self._transform_new(X)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OneHotEncoder throws unhelpful error messages when tranform called prior to fit
6 participants