
[MRG] IterativeImputer extended example #12100


Merged

Conversation

@sergeyf (Contributor) commented Sep 17, 2018

This PR builds on all of the IterativeImputer work. See #11977

It adds an example that shows how to cobble together a missForest instance using IterativeImputer.

Pretty simple example. We already have a plot_missing_values.py to compare performance, so this is just an example of how to put the code together.
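For reference, the basic missForest-via-IterativeImputer pattern looks roughly like this (a minimal sketch, not the example code itself; in released scikit-learn the parameter is `estimator` — in this PR's branch it was `predictor` — and the experimental enabling import is needed):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
X[rng.rand(100, 4) < 0.2] = np.nan  # knock out ~20% of the entries

# missForest-style imputation: iterative rounds where each feature with
# missing values is regressed on the others using randomized trees
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X)
```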

There is also a second example comparing different versions of IterativeImputer: like plot_missing_values.py, except with RidgeCV vs HuberRegressor vs RandomForestRegressor vs KNeighborsRegressor vs DecisionTreeRegressor, etc. It's kind of interesting that you can just stick KNeighborsRegressor in here, because there's a whole non-iterative family of KNN imputation approaches. It would be interesting to eventually compare them.
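In outline, that comparison boils down to something like this (a sketch with illustrative data, regressors, and parameters, not the exact example code; in released scikit-learn the per-round regressor is passed via the `estimator` parameter and the experimental enabling import is required):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import HuberRegressor, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan  # knock out ~10% of entries

# score the same downstream regressor with different imputation regressors
results = {}
for est in [RidgeCV(), HuberRegressor(), KNeighborsRegressor()]:
    pipe = make_pipeline(
        IterativeImputer(estimator=est, max_iter=5, random_state=0),
        RidgeCV())
    scores = cross_val_score(pipe, X_missing, y,
                             scoring='neg_mean_squared_error', cv=3)
    results[type(est).__name__] = scores.mean()
```

The choice of downstream regressor and the missingness rate here are arbitrary; the point is only that any regressor can be dropped into IterativeImputer's imputation loop.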

HuberRegressor turns out to be by far the best here, even beating the performance obtained with the full data.

Also, I found a bug in plot_missing_values.py that was the reason IterativeImputer was doing so poorly: the CV fold count was set to 5 for every pipeline except the IterativeImputer one. When it is also set to 5, IterativeImputer is on par with the other algorithms on the Boston dataset.

Paging @jnothman and @glemaitre.

@@ -40,6 +40,14 @@ Support for Python 3.4 and below has been officially dropped.
- An entry goes here
- An entry goes here

:mod:`sklearn.cluster`
Contributor Author

This is in master, and I saw there was a conflict for this file. Hopefully this resolves it.

Member

I think this is a bad fix to the conflict. Please just `git checkout master doc/whats_new/v0.21.rst`

Member

Yeah, this diff isn't right. I should say: `git checkout iterativeimputer doc/whats_new/v0.21.rst`

@sergeyf sergeyf changed the title first commit IterativeImputer as missForest example Sep 17, 2018
@sergeyf (Contributor Author) commented Sep 18, 2018

Here is what the results look like for one of the two new examples. Note how excellent HuberRegressor is!

[image: results plot comparing imputation regressors; HuberRegressor performs best]

@sergeyf (Contributor Author) commented Sep 18, 2018

And here is the fixed results plot from plot_missing_values.py. IterativeImputer is no longer doing poorly, and is even better than the full data on Diabetes.

[image: fixed results plot from plot_missing_values.py]

@sergeyf sergeyf changed the title IterativeImputer as missForest example [MRG] IterativeImputer as missForest example Sep 19, 2018
@sergeyf (Contributor Author) commented Sep 24, 2018

@jnothman Just pinging you here. Let me know if there's anything else that's needed for this PR.

@jnothman jnothman left a comment (Member)

To be fair, the imputation pipelines could include a MissingIndicator to help improve performance. But maybe that's beside the point.

These MCAR examples are boring, and we should look into porting @RianneSchouten's amputation algorithms (#6284)...

The left sides of your images are cut off.

Probably with the comparison of regressors, we don't need the separate missForest example.

It should be noted in the narrative docs that IterativeImputer includes the functionality of missForest.

Are there aspects of missForest that we do not have? Someone mentioned convergence detection and early stopping...?

    make_union(IterativeImputer(missing_values=0,
                                random_state=0,
                                n_nearest_features=5),
               MissingIndicator(missing_values=0)),
    RandomForestRegressor(random_state=0, n_estimators=100))
iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
Member

Can you use a function to encapsulate creating the pipeline with a specified imputer and getting the scores? That way we might avoid similar errors here.

Contributor Author

Sure, can do.
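Such a helper might look like this (a hypothetical sketch; `get_impute_scores` and its parameters are illustrative, not the final example code, and `MissingIndicator(features='all')` is used so the indicator columns stay stable across CV folds):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline, make_union


def get_impute_scores(imputer, X_missing, y, cv=5):
    """Wrap `imputer` in the shared evaluation pipeline and return CV scores."""
    estimator = make_pipeline(
        make_union(imputer, MissingIndicator(features='all')),
        RandomForestRegressor(n_estimators=10, random_state=0))
    return cross_val_score(estimator, X_missing, y,
                           scoring='neg_mean_squared_error', cv=cv)


X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan
scores = get_impute_scores(IterativeImputer(random_state=0), X_missing, y)
```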

y_missing = y_full.copy()

# Random Forest predictor with default values according to missForest docs
predictor = RandomForestRegressor(n_estimators=100, max_features='sqrt')
Member

So this uses a different random state in each imputation round. What if we want it to be deterministic? Is it a problem then if we use the exact same random state (e.g. random_state=0) in each imputation round? Is there a way to make each RF have a different, fixed random state???

Member

(What does missForest say about random determinism?)

Contributor Author

I can change this to random_state=0 for reproducibility. I can't think of any problem with using the same random state in each imputation round, since missForest is not like MICE, i.e. we are not sampling from posteriors.

There is nothing in the missForest paper about random determinism. I searched the PDF for variants of both of those words and nothing relevant came up: https://arxiv.org/abs/1105.0828
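A quick sketch of the reproducibility point (illustrative data and parameters; `estimator` is the released parameter name for the per-round regressor):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(42)
X = rng.rand(50, 3)
X[rng.rand(50, 3) < 0.2] = np.nan


def impute(seed):
    # fix the same seed for both the forest and the imputation order
    imp = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=5, random_state=seed),
        max_iter=3, random_state=seed)
    return imp.fit_transform(X)


# same fixed random state in every round -> identical results across runs
assert np.allclose(impute(0), impute(0))
```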

@sergeyf (Contributor Author) commented Sep 26, 2018

It is indeed boring! Should we try to get an implementation of ampute into this PR? Or a separate PR? And where should it go? Into sklearn.impute?

Re: not needing a separate missForest example. I included it for the extra verbosity, but I can move that verbosity into the regressor comparison example.

Re: adding missForest to narrative docs. Sure.

Re: convergence & early stopping. We don't have that yet. I can stick that into this PR.

@RianneSchouten

What is the status of the multiple imputation example? Should I improve the example that I made two months ago and put it in a PR, or has something else been decided?

I am working on some research comparing the MSE outcome of imputation methods with bias and statistical validity measures. I cannot wait to finish that and send it to you guys. But a few more months to go.

@sergeyf (Contributor Author) commented Sep 26, 2018

@RianneSchouten It would be great to have the MI examples updated & completed to use IterativeImputer in sampling mode.

@sergeyf sergeyf changed the title [MRG] IterativeImputer as missForest example [WIP] IterativeImputer extended example Sep 26, 2018
@jnothman (Member) commented Sep 26, 2018 via email

@sergeyf (Contributor Author) commented Sep 26, 2018

@RianneSchouten To add/clarify what @jnothman said: we would like to have a basic version of your ampute function as part of sklearn. Are you able to make a PR for that as part of your MI example? I would be happy to help if you'd like.

It would require a bunch of work on the order of what's in this unfinished PR: #7084. That is, thorough tests, documentation in all the right places, etc.

Your MI example (after rebasing) might then serve two purposes: to demonstrate how to use the ampute function and how to use IterativeImputer for multiple imputation.

@sergeyf (Contributor Author) commented Sep 26, 2018

@jnothman Regarding the stopping criteria. missForest has an odd one:

    After each iteration the difference between the previous and the new
    imputed data matrix is assessed for the continuous and categorical
    parts. The stopping criterion is defined such that the imputation
    process is stopped as soon as both differences have become larger
    once. In case of only one type of variable the computation stops as
    soon as the corresponding difference goes up for the first
    time. However, the imputation last performed where both differences
    went up is generally less accurate than the previous one. Therefore,
    whenever the computation stops due to the stopping criterion (and not
    due to 'maxiter') the before last imputation matrix is returned.

I think this wouldn't work in sample posterior mode, and is thus not general enough.

Most common stopping criteria that I've seen use a tol value and work as follows: stop if |X_{i-1} - X_i| < tol. In missing-value imputation, one would only look at the subset of entries in X_i that were originally missing.

I plan on implementing the latter. Any objections?
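In sketch form, the tol-based rule looks like this (a toy illustration of the stopping rule, not IterativeImputer's actual implementation; `one_round` is a hypothetical stand-in for one round of per-feature regression imputation):

```python
import numpy as np


def impute_until_converged(X, one_round, max_iter=10, tol=1e-3):
    """Repeat imputation rounds until the originally-missing entries
    change by less than `tol` (infinity norm) between rounds."""
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # mean initialization
    for _ in range(max_iter):
        X_new = one_round(X_filled, mask)
        # compare only the entries that were originally missing
        if np.max(np.abs(X_new[mask] - X_filled[mask])) < tol:
            return X_new
        X_filled = X_new
    return X_filled


def one_round(X_filled, mask):
    # toy stand-in for a regression round: pull each imputed entry
    # halfway toward its column mean, so the diffs shrink geometrically
    X_new = X_filled.copy()
    col_means = X_filled.mean(axis=0)
    X_new[mask] = 0.5 * X_filled[mask] + 0.5 * col_means[np.where(mask)[1]]
    return X_new


rng = np.random.RandomState(0)
X = rng.rand(20, 3)
X[rng.rand(20, 3) < 0.2] = np.nan
X_imp = impute_until_converged(X, one_round)
```

Note the criterion is computed only over the originally-missing entries, matching the comment above; in sample-posterior mode these diffs would stay noisy and the rule would need rethinking.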

@jnothman (Member)

I'm confused about what they mean by "the differences went up". Do they mean that the predictive performance on known (rather than missing) elements degraded?

@sergeyf (Contributor Author) commented Sep 27, 2018

I think it means the following:

|X_1 - X_0| = 4
|X_2 - X_1| = 3 <- diff went down from previous one of 4
|X_3 - X_2| = 2 <- diff went down from previous one of 3
|X_4 - X_3| = 3 <- diff went up from previous one of 2!

And then they return X_3, which is the last one before the diff goes up. That won't work when using IterativeImputer as MICE, because the diffs will be random.

@RianneSchouten

Something else: what is the planned time schedule for the MI example and the ampute function?

@glemaitre (Member)

I did not pay attention to this before, but we could probably improve the multi-label display.

@sergeyf (Contributor Author) commented Jan 24, 2019

I'll just change the labels on the y-axis manually.

cross_val_score(
    br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
    cv=N_SPLITS
)
Member

Suggested change:
-     )
+     ), columns=['Full Data']

keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

labels = ['Full Data',
Member

remove this block

ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.invert_yaxis()
ax.set_yticklabels(labels)
Member

Suggested change:
- ax.set_yticklabels(labels)
+ ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])

Contributor Author

It's neater, but you still don't get the 'Full Data' label. I'll leave it as is.

Contributor Author

Oh, I see you made an edit above to take care of this. OK.

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
Member

Suggested change:
- n_samples = X_full.shape[0]
+ n_samples, n_features = X_full.shape


X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
Member

Suggested change (delete this line):
- n_features = X_full.shape[1]
@glemaitre (Member)

After making the last changes, LGTM

missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies) of the missing values
Member

PEP8

@sergeyf (Contributor Author) commented Jan 24, 2019

OK, all addressed.

@glemaitre (Member)

@jnothman LGTM, if you want to take another look at it.

@jnothman (Member) commented Jan 24, 2019 via email

@sergeyf (Contributor Author) commented Jan 24, 2019

Mostly it's the addition of pandas to make the code shorter, plus some PEP8 line-length things. Also, since we swapped BayesianRidge in for RidgeCV, the BayesianRidge is doing about as well as ExtraTrees.

@jnothman jnothman left a comment (Member)

Otherwise lgtm.

In this example we compare some predictors for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

:class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
Member

Is there a reason for this ordering?

Contributor Author

No. Should there be?

Member

I think it would be better if there was. At least it seems awkward that the tree and the forest are not together if it's not alphabetical. Make it alphabetical by class name, and hide the import path by putting a `~` after the first backtick.

Contributor Author

Done.

# Estimate the score after iterative imputation of the missing values
# with different predictors
predictors = [
BayesianRidge(),
Member

Is there a reason for this ordering? It doesn't match the one above

Contributor Author

I'm confused. It's the same order as in the docstring?

    :class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
    :class:`sklearn.tree.DecisionTreeRegressor`: non-linear regression
    :class:`sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN imputation approaches
    :class:`sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R

Member

Oh, maybe I misread.

@jnothman (Member)

Thanks!!

@jnothman jnothman merged commit dc304a4 into scikit-learn:iterativeimputer Jan 25, 2019
4 participants