Imputer to maintain missing columns #8613


Closed
wants to merge 7 commits into from

Conversation

janvanrijn
Contributor

The preprocessing.Imputer module removes attributes that are completely empty. While this makes sense in general, it is undesirable when the imputer is used in a pipeline (see issue #8539).

In consultation with @amueller I wrote an extension that (if desired) replaces these attributes with a constant. This way, the pipeline can always rely on a constant feature ordering (and, if needed, remove the constant features afterwards).
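The proposed behaviour can be sketched in plain NumPy. Note that `fill_empty_columns` and the `-1` default are illustrative names only, not the PR's actual API:

```python
import numpy as np

def fill_empty_columns(X, fill_value=-1):
    """Replace columns that are entirely NaN with a constant, so the
    column count and ordering are preserved instead of the imputer
    silently dropping them."""
    X = np.asarray(X, dtype=float).copy()
    # A column is "empty" when every one of its entries is missing.
    empty = np.all(np.isnan(X), axis=0)
    X[:, empty] = fill_value
    return X

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])
print(fill_empty_columns(X))  # second column becomes -1, shape stays (2, 2)
```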


Member

@jnothman jnothman left a comment


Thanks for the PR. A preliminary review:

empty_attribute_constant=-1)


def test_imputation_empty_column_missing_nan():
Member


Could you please add a comment describing what this intends to ensure?


X_imputed_mean = np.array([
[-1, 3, -1, -1, 5],
[-1, 1, -1, -1, 3],
Member


You seem to be asserting that some strange behaviour regarding nans is maintained. I'd argue that behaviour -- where a feature gets considered "empty" due to the presence of a single NaN where missing_values != 'nan' -- is a bug.

@@ -116,12 +116,13 @@ class Imputer(BaseEstimator, TransformerMixin):
contain missing values).
"""
def __init__(self, missing_values="NaN", strategy="mean",
axis=0, verbose=0, copy=True):
axis=0, verbose=0, copy=True, empty_attribute_constant=None):
Member


Can we just call it fill_empty?

- Changed behavior of constant empty_column imputation: it does not prevent columns with NaN values from being removed.
- Added comments to unit tests.
@janvanrijn
Contributor Author

@jnothman thanks for the quick review.

I added a commit that incorporates your requested changes. Re. your first point: I agree, and changed this behaviour. Unfortunately there is no longer a way to ensure that the number of columns passed to the imputer is also returned, but that seems to be the risk of passing data containing NaNs while missing_values != "NaN".

Let's see if Travis passes right now.

@lucasdavid
Contributor

lucasdavid commented Nov 17, 2017

@janvanrijn & @jnothman any chance of this getting merged soon? #2034 seems stuck and I'm not quite sure how to handle imputation and one-hot encoding without one of them.
Furthermore, even after #2034, this feature might still be useful.

@janvanrijn
Contributor Author

I would love to finish this PR, as most of my personal projects heavily rely on this fix. Please let me know how to proceed from here.

@amueller
Copy link
Member

@janvanrijn can you give a usage example? Would ColumnTransformer also solve the issue for you?

@janvanrijn
Contributor Author

Currently, the sklearn imputer removes columns completely if all their values are missing. In my pipelines this is undesirable behaviour, as in many cases I rely on the column indices staying fixed. My particular example is the OneHotEncoder, where I tell it which attributes should be encoded and which not by passing their indices to "categorical_features". I don't know if there are more use-cases.

Could you provide a link to the ColumnTransformer documentation? Google doesn't find it yet. For clarity, this specific PR does not solve the problem of imputing categorical and numerical columns differently.

@amueller
Member

It's not merged yet, and it's here: #9012
docs here: https://15367-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.experimental.ColumnTransformer.html#sklearn.experimental.ColumnTransformer

Basically the idea is that you don't need to store indices to any columns but instead have different columns go through different branches of the pipeline. The new CategoricalEncoder actually doesn't have an option to select columns and applies the transformation to all columns.

@janvanrijn
Contributor Author

Having the column encoder seems to also solve the problem that this PR tries to address.

Some questions:

  1. Can ColumnTransformer be part of a pipeline? (i.e., after all columns have been transformed I want to use an estimator)
  2. From my POV it would really be gold if ColumnTransformer could auto-detect the datatype of a column (in the case of a pandas DataFrame), e.g., perform transform chain A on all columns of datatype float and transform chain B on all columns of datatype string.

@amueller
Member

amueller commented Dec 12, 2017

  1. It's an estimator. It can be part of a pipeline, or contain pipelines or any other estimator.

  2. I don't think we want to do that, because it's likely to be fragile and hard to implement without pandas. If you have a DataFrame, you can just do df.dtypes == np.float or something like that to get a mask over the relevant columns.
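The dtype-mask suggestion above might look like this in practice (a sketch; the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 31.0],
                   "city": ["NYC", "LA"],
                   "income": [50.5, 61.2]})

# Boolean mask over columns: True where the column holds floats.
float_mask = (df.dtypes == np.float64).values
float_cols = df.columns[float_mask].tolist()
print(float_cols)
```

A list like `float_cols` could then be passed as the column selection of a ColumnTransformer branch.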

@lucasdavid
Contributor

lucasdavid commented Dec 12, 2017

Here's a use case:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

features = ['a', 'b', 'c', 'd']
x = np.random.randn(200, len(features))
y = np.random.randint(2, size=200)
x[:, np.random.randint(len(features))] = np.nan
model = make_pipeline(
    Imputer(),
    RandomForestClassifier()
)
model.fit(x, y)
print(features)
# feature_importances_ lives on the forest step, not on the pipeline:
print(model.named_steps['randomforestclassifier'].feature_importances_)

This prints two lists: the first containing four feature names and the second containing only three feature importances.
How do I know which importances belong to which features if Imputer clipped a random column?

This makes it really annoying to map the feature_importances_ vector back to the original feature space. The two ways I have imagined to do this so far are:

  • Avoiding total imputation with this:

    for i in range(x.shape[1]):
        if np.isnan(x[:, i]).all():
            x[:, i] = 0.
  • Reconstructing the original vector, maybe with something like this:

    clipped_columns = np.isnan(model.named_steps['imputer'].statistics_)
    feature_importances = np.zeros(len(features))
    feature_importances[~clipped_columns] = \
        model.named_steps['randomforestclassifier'].feature_importances_

Either way, they seem very messy.

@amueller
Member

amueller commented Dec 12, 2017

I think that's an important use-case, but I don't think fixing columns is the right approach. I think using inverse_transform might be better.

There's also some discussion on how to generalize this at #6425. If you had a feature selection method in there, you'd run into the same problem, but replacing features by zero columns seems pretty strange in that case. Or if you use polynomial features, you're pretty much lost without #6425.

Ideally we'd have a generic name to get correspondences of input and output features if possible (this breaks down as soon as you have a PCA in there).

@lucasdavid
Contributor

I don't quite follow you. Imputer doesn't have an inverse_transform, does it?
So are you saying I should use something like the below to keep features and importances consistent?

features = ['a', 'b', 'c', 'd']
imputed_features = model.named_steps['imputer'].get_feature_names(features)
print(imputed_features, model.named_steps['rf'].feature_importances_)

@amueller
Member

Sorry, I was talking about possible solutions. There is no solution now, and I would like to have a solution. But I'd like to have a solution that also works for feature selection, one-hot-encoding, polynomial features and imputation, not only imputation.

One possible solution for your case would be to add an inverse_transform, another, possibly more general solution to have get_feature_names.

Ideally, get_feature_names would be implemented in a way that you don't need to take apart the pipeline the way you're doing, though how that interface should look is not entirely clear to me yet. Maybe:

print(model.get_feature_names(features), model.named_steps['rf'].feature_importances_)

assuming that get_feature_names on a pipeline propagates until the last step (which might solve some of the use-cases, and probably most use-cases if we allow for slicing pipelines #8431).
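As a purely hypothetical sketch of the propagation idea discussed above (no such helper existed in scikit-learn at the time; the function name and behaviour are assumptions):

```python
def pipeline_feature_names(pipeline, input_names):
    """Hypothetical helper: propagate feature names through each step
    of a pipeline-like object (anything with a `steps` list of
    (name, estimator) pairs). Steps that don't define
    get_feature_names are assumed to preserve names unchanged."""
    names = list(input_names)
    for _, step in pipeline.steps:
        if hasattr(step, "get_feature_names"):
            names = list(step.get_feature_names(names))
    return names
```

Under this sketch, a feature-selection step that drops a column would shrink the name list in lock-step with the data, so the final names line up with feature_importances_.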

@lucasdavid
Contributor

lucasdavid commented Dec 12, 2017

Oh, I see. The generic get_feature_names does sound nice.

@amueller
Member

Well, I recommend not holding your breath for it, but it's definitely on my agenda ;)

@janvanrijn
Contributor Author

I honestly still think this PR is relevant, as the column transformer does not completely solve the issue. Will contribute an example later this week.

@janvanrijn
Contributor Author

This is my example:

import numpy as np
import sklearn.compose
import sklearn.impute
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

X = [[0, 0, 1],
     [0, 1, np.nan],
     [1, 0, np.nan],
     [1, 1, np.nan]]

y = [0, 1, 1, 0]

numeric = sklearn.pipeline.make_pipeline(sklearn.impute.SimpleImputer(),
                                         sklearn.preprocessing.StandardScaler())
nominal = sklearn.pipeline.make_pipeline(sklearn.impute.SimpleImputer(),
                                         sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore'))

transformer = sklearn.compose.ColumnTransformer(
     transformers=[('numeric', numeric, [0, 1]), ('nominal', nominal, [2])])
clf = sklearn.pipeline.make_pipeline(transformer,
                                     sklearn.tree.DecisionTreeClassifier())

sklearn.model_selection.cross_val_score(clf, X, y, cv=2)

I don't think it is a very important use case, as the ColumnTransformer already removed the requirement to give indices to the OneHotEncoder.
It hinges on the fact that the imputer can potentially remove all variables from a pipeline. Fingers crossed, but I don't think this will happen on the datasets from the OpenML100 anytime soon. If you want, I can update my PR.

For future reference, the output of this code:


/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py:542: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22.
  FutureWarning)
Traceback (most recent call last):
  File "/home/jan/projects/sklearn-bot/jantest.py", line 26, in <module>
    sklearn.model_selection.cross_val_score(clf, X, y, cv=2)
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 402, in cross_val_score
    error_score=error_score)
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/compose/_column_transformer.py", line 449, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/home/jan/projects/scikit-learn/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 986, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 300, in fit_transform
    return last_step.fit_transform(Xt, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 501, in fit_transform
    return self.fit(X).transform(X)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 416, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 63, in _fit
    X = self._check_X(X)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 49, in _check_X
    X_temp = check_array(X, dtype=None)
  File "/home/jan/projects/scikit-learn/sklearn/utils/validation.py", line 585, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.

@amueller
Member

a missingness indicator would also solve this ;) I think the issue here is not that it removes constant columns but that it removed all columns, right?
Can this happen in other cases in a pipeline? SelectFromModel could potentially remove all columns within cross-validation right?
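For reference, the missingness-indicator idea mentioned above later became available as sklearn.impute.MissingIndicator (from scikit-learn 0.20). A minimal sketch, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[0.0, 1.0],
              [np.nan, 2.0],
              [np.nan, 3.0]])

# features="all" emits one boolean indicator column per input feature,
# keeping the column count fixed, which matches this PR's motivation.
indicator = MissingIndicator(features="all")
mask = indicator.fit_transform(X)
print(mask)
```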

@janvanrijn
Contributor Author

a missingness indicator would also solve this ;) I think the issue here is not that it removes constant columns but that it removed all columns, right?

Agreed. When typing my initial message, I had in mind that the OneHotEncoder still requires a particular set of indices. When working on the example, I realized that this is no longer the case, and the current state has solved 99% of my issues. We probably have to move to the mindset that once the indices of the ColumnTransformer have been set, all subsequent operations will be performed on all columns. Adding the indicator variable would be good in any case.

@amueller amueller added the Needs Decision Requires decision label Aug 5, 2019
@adrinjalali adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54
5 participants