Imputer to maintain missing columns #8613
Conversation
thanks for the PR. A preliminary review:
```python
                 empty_attribute_constant=-1)


def test_imputation_empty_column_missing_nan():
```
Could you please add a comment describing what this intends to ensure?
```python
    X_imputed_mean = np.array([
        [-1, 3, -1, -1, 5],
        [-1, 1, -1, -1, 3],
```
You seem to be asserting that some strange behaviour regarding NaNs is maintained. I'd argue that behaviour -- where a feature gets considered "empty" due to the presence of a single NaN when missing_values != 'NaN' -- is a bug.
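To make the behaviour described above concrete, here is a minimal sketch (not from the PR; it uses the pre-0.20 `Imputer` API, which was removed in 0.22):

```python
import numpy as np
from sklearn.preprocessing import Imputer  # deprecated in 0.20, removed in 0.22

# missing_values=0 here, but the second column also contains a stray NaN.
X = np.array([[1.0, 2.0],
              [0.0, np.nan],
              [3.0, 4.0]])

imp = Imputer(missing_values=0, strategy="mean", axis=0)
X_out = imp.fit_transform(X)

# The NaN propagates into the column mean, so the statistic for the
# second column becomes NaN and the whole column is silently dropped:
print(X_out.shape)  # (3, 1)
```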
sklearn/preprocessing/imputation.py
```diff
@@ -116,12 +116,13 @@ class Imputer(BaseEstimator, TransformerMixin):
         contain missing values).
     """
     def __init__(self, missing_values="NaN", strategy="mean",
-                 axis=0, verbose=0, copy=True):
+                 axis=0, verbose=0, copy=True, empty_attribute_constant=None):
```
can we just call it `fill_empty`?
- changed behavior of constant empty_column imputation: it does not prevent columns with NaN values from being removed.
- Added comments to unit tests.
@jnothman thanks for the quick review. I added a commit that incorporates your requested changes. Re. your first point: I agree, and changed this behaviour. Unfortunately there is no longer a way to ensure that the number of columns passed to the imputer is also returned, but that seems to be the risk of passing data containing NaNs while missing_values != "NaN". Let's see if Travis passes now.
@janvanrijn & @jnothman any chance of this getting merged soon? #2034 seems stuck and I'm not quite sure how to handle imputation and one-hot encoding without one of them.
I would love to finish this PR, as most of my personal projects heavily rely on this fix. Please let me know how to proceed from here.
@janvanrijn can you give a usage example? Would the new ColumnTransformer solve this for you?
Currently, the sklearn data imputer removes columns completely if all values are missing. In my pipelines this is undesirable behaviour, as in many cases I rely on the column indices staying fixed. In my particular example this matters for the OneHotEncoder, where I tell it which attributes should be encoded and which not by passing the indices to `categorical_features`. I don't know if there are more use-cases. Could you provide the link to the ColumnTransformer documentation? Google doesn't find it yet. For clarity, this specific PR does not solve the problem that I want to impute categorical columns and numerical columns differently.
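A minimal sketch of the index-shift problem being described (my illustration, not from the thread; pre-0.20 API, data made up):

```python
import numpy as np
from sklearn.preprocessing import Imputer, OneHotEncoder  # pre-0.20 API

# Column 0: numeric, column 1: entirely missing, column 2: categorical codes.
X = np.array([[1.5, np.nan, 0.0],
              [2.5, np.nan, 1.0],
              [3.5, np.nan, 2.0]])

X_imp = Imputer(strategy="mean").fit_transform(X)
print(X_imp.shape)  # (3, 2): the all-NaN column was silently dropped

# The categorical column shifted from index 2 to index 1, so indices
# fixed before imputation now point at the wrong (or a missing) column:
ohe = OneHotEncoder(categorical_features=[2])  # index 2 no longer exists
```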
It's not merged yet, and it's here: #9012. Basically the idea is that you don't need to store indices to any columns, but instead have different columns go through different branches of the pipeline. The new ColumnTransformer makes those branches explicit.
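For reference, a minimal sketch of that idea using the API that eventually landed in 0.20 (`sklearn.compose.ColumnTransformer` together with `SimpleImputer`; the column indices are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Route columns through different branches instead of tracking indices
# by hand: numeric columns get imputation, the categorical one gets OHE.
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])
# X_trans = ct.fit_transform(X)
```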
Having the ColumnTransformer seems to also solve the problem that this PR tries to address. Some questions:
Here's a use case:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer  # pre-0.20 API

features = ['a', 'b', 'c', 'd']
x = np.random.randn(200, len(features))
y = np.random.randint(2, size=200)
# Make one randomly chosen column entirely missing:
x[:, np.random.randint(len(features))] = np.nan

model = make_pipeline(
    Imputer(),
    RandomForestClassifier()
)
model.fit(x, y)
print(features)
# Access the classifier step; the Pipeline itself has no feature_importances_:
print(model.named_steps['randomforestclassifier'].feature_importances_)
```

This prints two lists: the first containing four features and the second containing three feature importances. This makes it really annoying to backtrack the feature importances to the original feature names.
Either way, they seem very messy.
I think that's an important use-case, but I don't think fixing columns is the right approach; I think using `get_feature_names` would be a better fit. There's also some discussion on how to generalize this at #6425. If you had a feature selection method in there, you'd run into the same problem, but replacing features by zero columns seems pretty strange in that case. Or if you use polynomial features, you're pretty much lost without #6425. Ideally we'd have a generic method name to get correspondences of input and output features where possible (this breaks down as soon as you have a PCA in there).
I don't quite follow you.

```python
# Hypothetical sketch: Imputer has no get_feature_names method today.
features = ['a', 'b', 'c', 'd']
imputed_features = model.named_steps['imputer'].get_feature_names(features)
print(imputed_features, model.named_steps['randomforestclassifier'].feature_importances_)
```
Sorry, I was talking about possible solutions. There is no solution now, and I would like to have a solution. But I'd like to have a solution that also works for feature selection, one-hot encoding, polynomial features and imputation, not only imputation. One possible solution for your case would be to add a `get_feature_names` to the `Imputer`. Ideally the `Pipeline` would provide this generically, so you could write:

```python
# Hypothetical: Pipeline.get_feature_names does not exist today.
print(model.get_feature_names(features), model.named_steps['randomforestclassifier'].feature_importances_)
```

assuming that each step in the pipeline implements `get_feature_names`.
Oh, I see. The generic `get_feature_names` sounds good.
Well, I recommend not holding your breath for it, but it's definitely on my agenda ;)
I honestly still think this PR is relevant, as the column transformer does not completely solve the issue. Will contribute an example later this week.
This is my example:

I don't think it is a very important use case, as the ColumnTransformer already removed the requirement to give indices to the OneHotEncoder. FFR, the output of this code:
A missingness indicator would also solve this ;) I think the issue here is not that it removes constant columns but that it removes all-missing columns, right?
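For reference, scikit-learn later gained exactly this: `sklearn.impute.MissingIndicator` (0.20) and `SimpleImputer(add_indicator=True)` (0.21); since 1.2, `SimpleImputer(keep_empty_features=True)` also addresses the column-dropping directly. A minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [np.nan, 2.0],
              [3.0, np.nan]])

# add_indicator appends binary missingness columns, so downstream steps
# see a fixed number of features and keep the missingness signal.
imp = SimpleImputer(strategy="mean", add_indicator=True)
print(imp.fit_transform(X))  # 2 imputed columns + 2 indicator columns
```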
Agreed. When typing my initial message, I had in mind that the OneHotEncoder still requires a particular set of indices. When working on the example, I realized that this is no longer the case, and the current state has solved 99% of my issues. We probably have to move to the mindset that once the indices of the ColumnTransformer have been set, from there on all operations will be performed on all columns. Adding the indicator variable would be good in any case.
The preprocessing.Imputer removes attributes that are completely empty. While this makes sense in general, it is undesirable when the imputer is used in a pipeline (see issue #8539).
In consultation with @amueller I wrote an extension that (if desired) replaces these attributes with a constant. This way, in the pipeline we can always rely on a constant feature ordering (and, if needed, remove the constant features afterwards).
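Based on the signature shown in the diff above, intended usage would look roughly like this (a sketch of the proposed behaviour; `empty_attribute_constant` exists only in this PR's branch, not in released scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import Imputer  # the pre-0.20 class this PR extends

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])

# Current behaviour: the all-missing column is dropped.
print(Imputer(strategy="mean").fit_transform(X).shape)  # (2, 1)

# Proposed behaviour: keep the column, filled with the given constant,
# so the feature ordering stays fixed through the pipeline.
imp = Imputer(strategy="mean", empty_attribute_constant=-1)
print(imp.fit_transform(X).shape)  # (2, 2); second column is all -1
```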