Imputer to maintain missing columns #8613


Closed
wants to merge 7 commits into from

Conversation

janvanrijn
Contributor

The preprocessing.Imputer module removes attributes that are completely empty. While this makes sense in general, it is undesirable when the imputer is used in a pipeline (see issue #8539).

In consultation with @amueller I wrote an extension that (if desired) replaces these attributes with a constant. This way, the pipeline can always rely on a constant feature ordering (and, if needed, remove the constant features afterwards).
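The proposed behaviour can be sketched in plain NumPy. Note that `fill_empty_columns` and the `-1` default are illustrative names only, not the PR's actual API:

```python
import numpy as np

def fill_empty_columns(X, fill_value=-1):
    """Replace columns that are entirely NaN with a constant, so the
    column count and ordering are preserved instead of the imputer
    silently dropping them."""
    X = np.asarray(X, dtype=float).copy()
    # A column is "empty" when every one of its entries is missing.
    empty = np.all(np.isnan(X), axis=0)
    X[:, empty] = fill_value
    return X

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])
print(fill_empty_columns(X))  # second column becomes -1, shape stays (2, 2)
```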


Member

@jnothman jnothman left a comment


Thanks for the PR. A preliminary review:

empty_attribute_constant=-1)


def test_imputation_empty_column_missing_nan():
Member


Could you please add a comment describing what this intends to ensure?


X_imputed_mean = np.array([
[-1, 3, -1, -1, 5],
[-1, 1, -1, -1, 3],
Member


You seem to be asserting that some strange behaviour regarding nans is maintained. I'd argue that behaviour -- where a feature gets considered "empty" due to the presence of a single NaN where missing_values != 'nan' -- is a bug.

@@ -116,12 +116,13 @@ class Imputer(BaseEstimator, TransformerMixin):
contain missing values).
"""
def __init__(self, missing_values="NaN", strategy="mean",
axis=0, verbose=0, copy=True):
axis=0, verbose=0, copy=True, empty_attribute_constant=None):
Member


Can we just call it fill_empty?

- Changed behavior of constant empty_column imputation: it does not prevent columns with NaN values from being removed.
- Added comments to unit tests.
@janvanrijn
Contributor Author

@jnothman thanks for the quick review.

I added a commit that incorporates your requested changes. Re. your first point: I agree, and changed this behaviour. Unfortunately there is no longer a way to ensure that the number of columns passed to the imputer is also returned, but that seems to be the risk of passing data containing NaNs while missing_values != "NaN".

Let's see if Travis passes right now.

@lucasdavid
Contributor

lucasdavid commented Nov 17, 2017

@janvanrijn & @jnothman any chance of this getting merged soon? #2034 seems stuck and I'm not quite sure how to handle imputation and one-hot encoding without one of them.
Furthermore, even after #2034, this feature might still be useful.

@janvanrijn
Contributor Author

I would love to finish this PR, as most of my personal projects heavily rely on this fix. Please let me know how to proceed from here.

@amueller
Copy link
Member

@janvanrijn can you give a usage example? Would ColumnTransformer also solve the issue for you?

@janvanrijn
Contributor Author

Currently, the sklearn imputer removes columns completely if all their values are missing. In my pipelines this is undesirable behaviour, as in many cases I rely on the column indices staying fixed. My particular example is the OneHotEncoder, where I tell it which attributes should be encoded and which not by passing their indices to "categorical_features". I don't know if there are more use-cases.

Could you provide a link to the ColumnTransformer documentation? Google doesn't find it yet. For clarity, this specific PR does not solve the problem of imputing categorical and numerical columns differently.

@amueller
Member

It's not merged yet, and it's here: #9012
docs here: https://15367-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.experimental.ColumnTransformer.html#sklearn.experimental.ColumnTransformer

Basically the idea is that you don't need to store indices to any columns but instead have different columns go through different branches of the pipeline. The new CategoricalEncoder actually doesn't have an option to select columns and applies the transformation to all columns.

@janvanrijn
Contributor Author

Having the column encoder seems to also solve the problem that this PR tries to address.

Some questions:

  1. Can ColumnTransformer be part of a pipeline? (i.e., after all columns have been transformed I want to use an estimator)
  2. From my POV it would really be gold if ColumnTransformer could auto-detect the datatype of a column (in the case of a pandas DataFrame), e.g., perform transform chain A on all columns of datatype float and transform chain B on all columns of datatype string.

@amueller
Member

amueller commented Dec 12, 2017

  1. It's an estimator. It can be part of a pipeline, or contain pipelines or any other estimator.

  2. I don't think we want to do that, because it's likely to be fragile and hard to implement without pandas. If you have a DataFrame, you can just do df.dtypes == np.float or something like that to get a mask over the relevant columns.
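The dtype-mask suggestion above might look like this in practice (a sketch; the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 31.0],
                   "city": ["NYC", "LA"],
                   "income": [50.5, 61.2]})

# Boolean mask over columns: True where the column holds floats.
float_mask = (df.dtypes == np.float64).values
float_cols = df.columns[float_mask].tolist()
print(float_cols)
```

A list like `float_cols` could then be passed as the column selection of a ColumnTransformer branch.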

@lucasdavid
Contributor

lucasdavid commented Dec 12, 2017

Here's a use case:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

features = ['a', 'b', 'c', 'd']
x = np.random.randn(200, len(features))
y = np.random.randint(2, size=200)
x[:, np.random.randint(len(features))] = np.nan
model = make_pipeline(
    Imputer(),
    RandomForestClassifier()
)
model.fit(x, y)
print(features)
# feature_importances_ lives on the forest step, not on the pipeline:
print(model.named_steps['randomforestclassifier'].feature_importances_)

This prints two lists: the first containing four feature names and the second containing only three feature importances.
How do I know which importances belong to which features if Imputer clipped a random column?

This makes it really annoying to map the feature_importances_ vector back to the original feature space. The two ways I have imagined to do this so far are:

  • Avoiding total imputation with this:

    for i in range(x.shape[1]):
        if np.isnan(x[:, i]).all():
            x[:, i] = 0.
  • Reconstructing the original vector, maybe with something like this:

    clipped_columns = np.isnan(model.named_steps['imputer'].statistics_)
    feature_importances = np.zeros(len(features))
    feature_importances[~clipped_columns] = \
        model.named_steps['randomforestclassifier'].feature_importances_

Either way, they seem very messy.

@amueller
Member

amueller commented Dec 12, 2017

I think that's an important use-case, but I don't think fixing columns is the right approach. I think using inverse_transform might be better.

There's also some discussion on how to generalize this at #6425. If you had a feature selection method in there, you'd run into the same problem, but replacing features by zero columns seems pretty strange in that case. Or if you use polynomial features, you're pretty much lost without #6425.

Ideally we'd have a generic name to get correspondences of input and output features if possible (this breaks down as soon as you have a PCA in there).

@lucasdavid
Contributor

I don't quite follow you. Imputer doesn't have an inverse_transform, does it?
So are you saying I should use something like the below to keep features and importances consistent?

features = ['a', 'b', 'c', 'd']
imputed_features = model.named_steps['imputer'].get_feature_names(features)
print(imputed_features, model.named_steps['rf'].feature_importances_)

@amueller
Member

Sorry, I was talking about possible solutions. There is no solution now, and I would like to have a solution. But I'd like to have a solution that also works for feature selection, one-hot-encoding, polynomial features and imputation, not only imputation.

One possible solution for your case would be to add an inverse_transform, another, possibly more general solution to have get_feature_names.

Ideally, get_feature_names would be implemented in a way that you don't need to take apart the pipeline the way you're doing, though how that interface should look is not entirely clear to me yet. Maybe:

print(model.get_feature_names(features), model.named_steps['rf'].feature_importances_)

assuming that get_feature_names on a pipeline propagates until the last step (which might solve some of the use-cases, and probably most use-cases if we allow for slicing pipelines #8431).
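As a purely hypothetical sketch of the propagation idea discussed above (no such helper existed in scikit-learn at the time; the function name and behaviour are assumptions):

```python
def pipeline_feature_names(pipeline, input_names):
    """Hypothetical helper: propagate feature names through each step
    of a pipeline-like object (anything with a `steps` list of
    (name, estimator) pairs). Steps that don't define
    get_feature_names are assumed to preserve names unchanged."""
    names = list(input_names)
    for _, step in pipeline.steps:
        if hasattr(step, "get_feature_names"):
            names = list(step.get_feature_names(names))
    return names
```

Under this sketch, a feature-selection step that drops a column would shrink the name list in lock-step with the data, so the final names line up with feature_importances_.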

@lucasdavid
Contributor

lucasdavid commented Dec 12, 2017

Oh, I see. The generic get_feature_names does sound nice.

@amueller
Member

Well, I recommend not holding your breath for it, but it's definitely on my agenda ;)

@janvanrijn
Contributor Author

I honestly still think this PR is relevant, as the column transformer does not completely solve the issue. Will contribute an example later this week.

@janvanrijn
Contributor Author

This is my example:

import numpy as np
import sklearn.compose
import sklearn.impute
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

X = [[0, 0, 1],
     [0, 1, np.nan],
     [1, 0, np.nan],
     [1, 1, np.nan]]

y = [0, 1, 1, 0]

numeric = sklearn.pipeline.make_pipeline(sklearn.impute.SimpleImputer(),
                                         sklearn.preprocessing.StandardScaler())
nominal = sklearn.pipeline.make_pipeline(sklearn.impute.SimpleImputer(),
                                         sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore'))

transformer = sklearn.compose.ColumnTransformer(
     transformers=[('numeric', numeric, [0, 1]), ('nominal', nominal, [2])])
clf = sklearn.pipeline.make_pipeline(transformer,
                                     sklearn.tree.DecisionTreeClassifier())

sklearn.model_selection.cross_val_score(clf, X, y, cv=2)

I don't think it is a very important use case, as the ColumnTransformer already removed the requirement to give indices to the OneHotEncoder.
It hinges on the fact that the imputer can potentially remove all variables from a pipeline. Fingers crossed, but I don't think this will happen on the datasets from the OpenML100 anytime soon. If you want, I can update my PR.

For future reference, the output of this code:


/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py:542: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22.
  FutureWarning)
Traceback (most recent call last):
  File "/home/jan/projects/sklearn-bot/jantest.py", line 26, in <module>
    sklearn.model_selection.cross_val_score(clf, X, y, cv=2)
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 402, in cross_val_score
    error_score=error_score)
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/compose/_column_transformer.py", line 449, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/home/jan/projects/scikit-learn/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 986, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/pipeline.py", line 300, in fit_transform
    return last_step.fit_transform(Xt, y, **fit_params)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 501, in fit_transform
    return self.fit(X).transform(X)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 416, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 63, in _fit
    X = self._check_X(X)
  File "/home/jan/projects/scikit-learn/sklearn/preprocessing/_encoders.py", line 49, in _check_X
    X_temp = check_array(X, dtype=None)
  File "/home/jan/projects/scikit-learn/sklearn/utils/validation.py", line 585, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.

@amueller
Member

a missingness indicator would also solve this ;) I think the issue here is not that it removes constant columns but that it removed all columns, right?
Can this happen in other cases in a pipeline? SelectFromModel could potentially remove all columns within cross-validation right?
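For reference, the missingness-indicator idea mentioned above later became available as sklearn.impute.MissingIndicator (from scikit-learn 0.20). A minimal sketch, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[0.0, 1.0],
              [np.nan, 2.0],
              [np.nan, 3.0]])

# features="all" emits one boolean indicator column per input feature,
# keeping the column count fixed, which matches this PR's motivation.
indicator = MissingIndicator(features="all")
mask = indicator.fit_transform(X)
print(mask)
```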

@janvanrijn
Contributor Author

a missingness indicator would also solve this ;) I think the issue here is not that it removes constant columns but that it removed all columns, right?

Agreed. When typing my initial message, I had in mind that the OneHotEncoder still requires a particular set of indices. When working on the example, I realized that this is no longer the case, and the current state has solved 99% of my issues. We probably have to move to the mindset that once the indices of the ColumnTransformer have been set, all subsequent operations will be performed on all columns. Adding the indicator variable would be good in any case.

@amueller amueller added the Needs Decision Requires decision label Aug 5, 2019
@adrinjalali adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54
5 participants