TST test_bagging_regressor/classifier_with_missing_inputs fails with SimpleImputer #11482

Closed
qinhanmin2014 opened this issue Jul 11, 2018 · 15 comments

Comments

@qinhanmin2014
Member

See #11480.
Git blame shows that we introduced a regression in #11211.
Pinging the author @jeremiedbb and the reviewers @jnothman @glemaitre @ogrisel @jorisvandenbossche.
Below are the logs from test_bagging_regressor_with_missing_inputs:

__________________ test_bagging_regressor_with_missing_inputs __________________
    def test_bagging_regressor_with_missing_inputs():
        # Check that BaggingRegressor can accept X with missing/infinite data
        X = np.array([
            [1, 3, 5],
            [2, None, 6],
            [2, np.nan, 6],
            [2, np.inf, 6],
            [2, np.NINF, 6],
        ])
        y_values = [
            np.array([2, 3, 3, 3, 3]),
            np.array([
                [2, 1, 9],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
            ])
        ]
        for y in y_values:
            regressor = DecisionTreeRegressor()
            pipeline = make_pipeline(
                SimpleImputer(),
                SimpleImputer(missing_values=np.inf),
                SimpleImputer(missing_values=np.NINF),
                regressor
            )
>           pipeline.fit(X, y).predict(X)
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
pipeline   = Pipeline(memory=None,
     steps=[('simpleimputer-1', SimpleImputer(copy=True,...tion_leaf=0.0,
           presort=False, random_state=None, splitter='best'))])
regressor  = DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
    ...raction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
y          = array([2, 3, 3, 3, 3])
y_values   = [array([2, 3, 3, 3, 3]), array([[2, 1, 9],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8]])]
/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/tests/test_bagging.py:785: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:253: in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:218: in _fit
    **fit_params_steps[name])
/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/_joblib/memory.py:362: in __call__
    return self.func(*args, **kwargs)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:602: in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/base.py:462: in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
/home/travis/build/scikit-learn/scikit-learn/sklearn/impute.py:209: in fit
    X = self._validate_input(X)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
X = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
    def _validate_input(self, X):
        allowed_strategies = ["mean", "median", "most_frequent", "constant"]
        if self.strategy not in allowed_strategies:
            raise ValueError("Can only use these strategies: {0} "
                             " got strategy={1}".format(allowed_strategies,
                                                        self.strategy))
    
        if self.strategy in ("most_frequent", "constant"):
            dtype = None
        else:
            dtype = FLOAT_DTYPES
    
        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
    
        try:
            X = check_array(X, accept_sparse='csc', dtype=dtype,
                            force_all_finite=force_all_finite, copy=self.copy)
        except ValueError as ve:
            if "could not convert" in str(ve):
                raise ValueError("Cannot use {0} strategy with non-numeric "
                                 "data. Received datatype :{1}."
                                 "".format(self.strategy, X.dtype.kind))
            else:
>               raise ve
E               ValueError: Input contains infinity or a value too large for dtype('float64').
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
allowed_strategies = ['mean', 'median', 'most_frequent', 'constant']
dtype      = (<type 'numpy.float64'>, <type 'numpy.float32'>, <type 'numpy.float16'>)
force_all_finite = 'allow-nan'
self       = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
ve         = ValueError("Input contains infinity or a value too large for dtype('float64').",)
@glemaitre
Member

Uhm, what is the reasoning for letting infinity pass through to the regressor?

@jnothman
Member

So the argument is that previously inf was a valid value for missing_values, but now it is not?

@jeremiedbb
Member

Not sure, but it seems that's because check_array is called with force_all_finite='allow-nan' if and only if missing_values is np.nan. But do you want to let inf pass through SimpleImputer, or to be able to do missing_values=np.inf, or both?
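
For illustration, here is a minimal sketch of the asymmetry described above (a hypothetical repro assuming the scikit-learn dev version from the traceback; not code from the thread):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, np.inf],
                  [2.0, np.nan]])

    # missing_values=np.nan -> check_array(..., force_all_finite='allow-nan'):
    # nan passes validation, but the inf raises the ValueError shown above.
    SimpleImputer(missing_values=np.nan).fit(X)  # raises ValueError

    # missing_values=np.inf -> check_array(..., force_all_finite=True):
    # neither nan nor inf passes validation, so inf can no longer be used
    # as the missing-value marker at all.
    SimpleImputer(missing_values=np.inf).fit(X)  # also raises ValueError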

@glemaitre
Member

@jnothman Apparently we don't allow infinite values in check_array, while previously we could fill up missing_values with inf.

For me, filling with inf is a really strange use case.

@qinhanmin2014
Member Author

FYI, #9707 introduced the test.

@glemaitre
Member

Yep, but #9707 actually wanted to check that BaggingRegressor lets NaN and infinite values pass through. I don't see why we should have an Imputer there.

@jnothman
Member

jnothman commented Jul 11, 2018 via email

@glemaitre
Member

@jeremiedbb What is your opinion on leaving missing_values=np.inf aside?

@glemaitre
Member

glemaitre commented Jul 12, 2018

@qinhanmin2014 You might want to update your PR as proposed by @jnothman.

@jeremiedbb
Member

Well, I'd say why not allow it; it does not require huge changes in the imputers, but I don't know whether it is common to encode missing values as np.inf.
However, what's done in the code above is quite different, because there are np.inf values in the array that are not the marker for the missing values. So the question is: do we want to let np.nan and np.inf go through the imputer?
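
For illustration, one hypothetical workaround (not something proposed in this thread) would be to map all non-finite values to nan first, so a single imputer handles them; inf_to_nan below is a made-up helper:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.tree import DecisionTreeRegressor

    def inf_to_nan(X):
        # Replace inf/-inf (anything non-finite) with nan so that a single
        # SimpleImputer(missing_values=np.nan) imputes all of them.
        X = np.asarray(X, dtype=float)
        return np.where(np.isfinite(X), X, np.nan)

    pipeline = make_pipeline(
        FunctionTransformer(inf_to_nan, validate=False),
        SimpleImputer(missing_values=np.nan),
        DecisionTreeRegressor(),
    )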

@qinhanmin2014
Member Author

The PR is for anyone who wants to look at the complete log. I would prefer the author or one of the reviewers to submit a PR here, since I'll need some time to go through the discussions in #11211.
Regarding the solution, I generally prefer not to add too many restrictions; it seems at least not harmful to provide users with more choices. For now, though, I'm OK as long as the tests pass with SimpleImputer (i.e., we can either modify SimpleImputer or modify the tests).

@jnothman
Member

jnothman commented Jul 12, 2018 via email

@glemaitre
Member

So the question is: do we want to let np.nan and np.inf go through the imputer?

I could not figure out any use case in which you would want to use specifically np.inf.

And inf in a feature space should ring alarm bells.

I would agree on not letting it pass.

@jnothman
Member

jnothman commented Jul 13, 2018 via email

@amueller
Member

So is there a PR/consensus on what to do? I don't think we need to be able to impute np.inf.
