
[MRG] IterativeImputer extended example #12100


Merged

Conversation

@sergeyf (Contributor) commented Sep 17, 2018

This PR builds on all of the IterativeImputer work. See #11977

It adds an example that shows how to cobble together a missForest instance using IterativeImputer.

Pretty simple example. We already have a plot_missing_values.py to compare performance, so this is just an example of how to put the code together.
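For reference, the basic missForest-via-IterativeImputer pattern looks roughly like this (a minimal sketch, not the example code itself; in released scikit-learn the parameter is `estimator` — in this PR's branch it was `predictor` — and the experimental enabling import is needed):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
X[rng.rand(100, 4) < 0.2] = np.nan  # knock out ~20% of the entries

# missForest-style imputation: iterative rounds where each feature with
# missing values is regressed on the others using randomized trees
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X)
```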

There is also a second example comparing different versions of IterativeImputer: like plot_missing_values.py, except with RidgeCV vs HuberRegressor vs RandomForestRegressor vs KNeighborsRegressor vs DecisionTreeRegressor, etc. It's kind of interesting that you can just stick KNeighborsRegressor in here, because there's a whole non-iterative family of KNN imputation approaches. It would be interesting to eventually compare them.
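In outline, that comparison boils down to something like this (a sketch with illustrative data, regressors, and parameters, not the exact example code; in released scikit-learn the per-round regressor is passed via the `estimator` parameter and the experimental enabling import is required):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import HuberRegressor, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan  # knock out ~10% of entries

# score the same downstream regressor with different imputation regressors
results = {}
for est in [RidgeCV(), HuberRegressor(), KNeighborsRegressor()]:
    pipe = make_pipeline(
        IterativeImputer(estimator=est, max_iter=5, random_state=0),
        RidgeCV())
    scores = cross_val_score(pipe, X_missing, y,
                             scoring='neg_mean_squared_error', cv=3)
    results[type(est).__name__] = scores.mean()
```

The choice of downstream regressor and the missingness rate here are arbitrary; the point is only that any regressor can be dropped into IterativeImputer's imputation loop.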

HuberRegressor turns out to be by far the best here, even beating the performance obtained with the full data.

Also, I found a bug in plot_missing_values.py that was the reason IterativeImputer was doing so poorly: the CV fold count was set to 5 for every pipeline except the IterativeImputer one. When it is also set to 5, IterativeImputer is on par with the other algorithms on the Boston dataset.

Paging @jnothman and @glemaitre.

@@ -40,6 +40,14 @@ Support for Python 3.4 and below has been officially dropped.
- An entry goes here
- An entry goes here

:mod:`sklearn.cluster`
Contributor Author

This is in master, and I saw there was a conflict for this file. Hopefully this resolves it.

Member

I think this is a bad fix to the conflict. Please just `git checkout master doc/whats_new/v0.21.rst`

Member

Yeah, this diff isn't right. I should say: `git checkout iterativeimputer doc/whats_new/v0.21.rst`

@sergeyf sergeyf changed the title first commit IterativeImputer as missForest example Sep 17, 2018
@sergeyf (Contributor Author) commented Sep 18, 2018

Here is what the results look like for one of the two new examples. Note how excellent HuberRegressor is!

[image: results plot comparing imputation regressors; HuberRegressor performs best]

@sergeyf (Contributor Author) commented Sep 18, 2018

And here is the fixed results plot from plot_missing_values.py. IterativeImputer is no longer doing poorly, and is even better than the full data on Diabetes.

[image: fixed results plot from plot_missing_values.py]

@sergeyf sergeyf changed the title IterativeImputer as missForest example [MRG] IterativeImputer as missForest example Sep 19, 2018
@sergeyf (Contributor Author) commented Sep 24, 2018

@jnothman Just pinging you here. Let me know if there's anything else that's needed for this PR.

@jnothman jnothman left a comment (Member)

To be fair, the imputation pipelines could include a MissingIndicator to help improve performance. But maybe that's beside the point.

These MCAR examples are boring, and we should look into porting @RianneSchouten's amputation algorithms (#6284)...

The left sides of your images are cut off.

Probably with the comparison of regressors, we don't need the separate missForest example.

It should be noted in the narrative docs that IterativeImputer includes the functionality of missForest.

Are there aspects of missForest that we do not have? Someone mentioned convergence detection and early stopping...?

    make_union(IterativeImputer(missing_values=0,
                                random_state=0,
                                n_nearest_features=5),
               MissingIndicator(missing_values=0)),
    RandomForestRegressor(random_state=0, n_estimators=100))
iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
Member

Can you use a function to encapsulate creating the pipeline with a specified imputer and getting the scores? That way we might avoid similar errors here.

Contributor Author

Sure, can do.
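Such a helper might look like this (a hypothetical sketch; `get_impute_scores` and its parameters are illustrative, not the final example code, and `MissingIndicator(features='all')` is used so the indicator columns stay stable across CV folds):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline, make_union


def get_impute_scores(imputer, X_missing, y, cv=5):
    """Wrap `imputer` in the shared evaluation pipeline and return CV scores."""
    estimator = make_pipeline(
        make_union(imputer, MissingIndicator(features='all')),
        RandomForestRegressor(n_estimators=10, random_state=0))
    return cross_val_score(estimator, X_missing, y,
                           scoring='neg_mean_squared_error', cv=cv)


X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan
scores = get_impute_scores(IterativeImputer(random_state=0), X_missing, y)
```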

y_missing = y_full.copy()

# Random Forest predictor with default values according to missForest docs
predictor = RandomForestRegressor(n_estimators=100, max_features='sqrt')
Member

So this uses a different random state in each imputation round. What if we want it to be deterministic? Is it a problem then if we use the exact same random state (e.g. random_state=0) in each imputation round? Is there a way to make each RF have a different, fixed random state???

Member

(What does missForest say about random determinism?)

Contributor Author

I can change this to random_state=0 for reproducibility. I can't think of any problem with using the same random state in each imputation round, since missForest is not like MICE, i.e. we are not sampling from posteriors.

There is nothing in the missForest paper about random determinism. I searched the PDF for variants of both of those words and nothing relevant came up: https://arxiv.org/abs/1105.0828
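A quick sketch of the reproducibility point (illustrative data and parameters; `estimator` is the released parameter name for the per-round regressor):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(42)
X = rng.rand(50, 3)
X[rng.rand(50, 3) < 0.2] = np.nan


def impute(seed):
    # fix the same seed for both the forest and the imputation order
    imp = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=5, random_state=seed),
        max_iter=3, random_state=seed)
    return imp.fit_transform(X)


# same fixed random state in every round -> identical results across runs
assert np.allclose(impute(0), impute(0))
```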

@sergeyf (Contributor Author) commented Sep 26, 2018

It is indeed boring! Should we try to get an implementation of ampute into this PR? Or a separate PR? And where should it go? Into sklearn.impute?

Re: not needing a separate missForest example. I included it for the extra verbosity, but I can move that verbosity into the regressor comparison example.

Re: adding missForest to narrative docs. Sure.

Re: convergence & early stopping. We don't have that yet. I can stick that into this PR.

@RianneSchouten

What is the status of the multiple imputation example? Should I improve the example that I made two months ago and put it in a PR, or has something else been decided?

I am working on some research comparing the MSE outcome of imputation methods with bias and statistical validity measures. I cannot wait to finish that and send it to you guys. But a few more months to go.

@sergeyf (Contributor Author) commented Sep 26, 2018

@RianneSchouten It would be great to have the MI examples updated & completed to use IterativeImputer in sampling mode.

@sergeyf sergeyf changed the title [MRG] IterativeImputer as missForest example [WIP] IterativeImputer extended example Sep 26, 2018
@jnothman (Member) commented Sep 26, 2018 via email

@sergeyf (Contributor Author) commented Sep 26, 2018

@RianneSchouten To add/clarify what @jnothman said: we would like to have a basic version of your ampute function as part of sklearn. Are you able to make a PR for that as part of your MI example? I would be happy to help if you'd like.

It would require a bunch of work on the order of what's in this unfinished PR: #7084. That is, thorough tests, documentation in all the right places, etc.

Your MI example (after rebasing) might then serve two purposes: to demonstrate how to use the ampute function and how to use IterativeImputer for multiple imputation.

@sergeyf (Contributor Author) commented Sep 26, 2018

@jnothman Regarding the stopping criteria. missForest has an odd one:

    After each iteration the difference between the previous and the new
    imputed data matrix is assessed for the continuous and categorical
    parts. The stopping criterion is defined such that the imputation
    process is stopped as soon as both differences have become larger
    once. In case of only one type of variable the computation stops as
    soon as the corresponding difference goes up for the first
    time. However, the imputation last performed where both differences
    went up is generally less accurate than the previous one. Therefore,
    whenever the computation stops due to the stopping criterion (and not
    due to 'maxiter') the before last imputation matrix is returned.

I think this wouldn't work in sample posterior mode, and is thus not general enough.

Most common stopping criteria that I've seen use a tol value and work as follows: stop if |X_{i-1} - X_i| < tol. In missing-value imputation, one would only look at the subset of entries in X_i that were originally missing.

I plan on implementing the latter. Any objections?
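In sketch form, the tol-based rule looks like this (a toy illustration of the stopping rule, not IterativeImputer's actual implementation; `one_round` is a hypothetical stand-in for one round of per-feature regression imputation):

```python
import numpy as np


def impute_until_converged(X, one_round, max_iter=10, tol=1e-3):
    """Repeat imputation rounds until the originally-missing entries
    change by less than `tol` (infinity norm) between rounds."""
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # mean initialization
    for _ in range(max_iter):
        X_new = one_round(X_filled, mask)
        # compare only the entries that were originally missing
        if np.max(np.abs(X_new[mask] - X_filled[mask])) < tol:
            return X_new
        X_filled = X_new
    return X_filled


def one_round(X_filled, mask):
    # toy stand-in for a regression round: pull each imputed entry
    # halfway toward its column mean, so the diffs shrink geometrically
    X_new = X_filled.copy()
    col_means = X_filled.mean(axis=0)
    X_new[mask] = 0.5 * X_filled[mask] + 0.5 * col_means[np.where(mask)[1]]
    return X_new


rng = np.random.RandomState(0)
X = rng.rand(20, 3)
X[rng.rand(20, 3) < 0.2] = np.nan
X_imp = impute_until_converged(X, one_round)
```

Note the criterion is computed only over the originally-missing entries, matching the comment above; in sample-posterior mode these diffs would stay noisy and the rule would need rethinking.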

@jnothman (Member)

I'm confused about what they mean by "the differences went up". Do they mean that the predictive performance on known (rather than missing) elements degraded?

@sergeyf (Contributor Author) commented Sep 27, 2018

I think it means the following:

|X_1 - X_0| = 4
|X_2 - X_1| = 3 <- diff went down from previous one of 4
|X_3 - X_2| = 2 <- diff went down from previous one of 3
|X_4 - X_3| = 3 <- diff went up from previous one of 2!

And then they return X_3, which is the last one before the diff goes up. That won't work when using IterativeImputer as MICE, because the diffs will be random.

@RianneSchouten

Something else: what is the planned time schedule for the MI example and the ampute function?

@glemaitre (Member)

I did not pay attention to this before, but we could probably improve the multi-label display.

@sergeyf (Contributor Author) commented Jan 24, 2019

I'll just change the labels on the y-axis manually.

cross_val_score(
    br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
    cv=N_SPLITS
)
Member

Suggested change:
-     )
+     ), columns=['Full Data']

keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

labels = ['Full Data',
Member

remove this block

ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.invert_yaxis()
ax.set_yticklabels(labels)
Member

Suggested change:
- ax.set_yticklabels(labels)
+ ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])

Contributor Author

It's neater, but you still don't get the 'Full Data' label. I'll leave it as is.

Contributor Author

Oh, I see you made an edit above to take care of this. OK.

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
Member

Suggested change:
- n_samples = X_full.shape[0]
+ n_samples, n_features = X_full.shape


X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
Member

Suggested change (delete this line):
- n_features = X_full.shape[1]
@glemaitre (Member)

After making the last changes, LGTM

missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies) of the missing values
Member

PEP8

@sergeyf (Contributor Author) commented Jan 24, 2019

OK, all addressed.

@glemaitre (Member)

@jnothman LGTM, if you want to take another look at it.

@jnothman (Member) commented Jan 24, 2019 via email

@sergeyf (Contributor Author) commented Jan 24, 2019

Mostly it's the addition of pandas to make the code shorter, plus some PEP8 line-length things. Also, since we swapped BayesianRidge in for RidgeCV, the BayesianRidge is doing about as well as ExtraTrees.

@jnothman jnothman left a comment (Member)

Otherwise lgtm.

In this example we compare some predictors for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

:class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
Member

Is there a reason for this ordering?

Contributor Author

No. Should there be?

Member

I think it would be better if there was. At least it seems awkward that the tree and the forest are not together if it's not alphabetical. Make it alphabetical by class name, and hide the import path by putting a `~` after the first backtick.

Contributor Author

Done.

# Estimate the score after iterative imputation of the missing values
# with different predictors
predictors = [
BayesianRidge(),
Member

Is there a reason for this ordering? It doesn't match the one above

Contributor Author

I'm confused. It's the same order as in the docstring?

    :class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
    :class:`sklearn.tree.DecisionTreeRegressor`: non-linear regression
    :class:`sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN imputation approaches
    :class:`sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R

Member

Oh, maybe I misread.

@jnothman (Member)

Thanks!!

@jnothman jnothman merged commit dc304a4 into scikit-learn:iterativeimputer Jan 25, 2019
4 participants