[MRG] Merge IterativeImputer into master branch #11977
Conversation
Here's a good recent paper about imputation methods: http://www.jmlr.org/papers/volume18/17-073/17-073.pdf (it includes a useful comparison table). As discussed, an example could perhaps compare RidgeCV, KNN, CART, and RandomForest as the imputer's model inputs? |
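For reference, a minimal sketch of what such a comparison could look like against the released API (the `estimator` parameter name per the TODO below, the experimental import path, and the specific estimator settings are assumptions; DecisionTreeRegressor stands in for CART):

```python
# A sketch, not the shipped example: one IterativeImputer per
# candidate estimator, ready to be compared in a downstream pipeline.
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

candidates = [
    RidgeCV(),
    KNeighborsRegressor(n_neighbors=15),
    DecisionTreeRegressor(max_features="sqrt", random_state=0),  # CART
    RandomForestRegressor(n_estimators=50, random_state=0),
]
imputers = {type(est).__name__: IterativeImputer(estimator=est, random_state=0)
            for est in candidates}
```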
Here we are considering all the "sequential" approaches as being roughly the same thing, although we could consider multiple classes if we think the differences are substantial.
|
I don't have any reason to think the differences are substantial, but I also don't have enough experience using various sequential imputers. |
What's left to do here? Does this need reviews? |
@amueller not yet. We have to merge a few more examples and modifications in here before it's ready for more reviews. |
@jnothman looks like the most recent merge caused some branch conflicts, but I don't have access to resolve them. |
A PR to the IterativeImputer branch is welcome, but for simple merge conflicts it's fine if I resolve them.
|
Great, thanks. |
Test failures on some (but not the latest) versions of scipy, around the usage of truncnorm. |
Some thoughts on things we might want here to improve usability:
|
I still don't think […]. Also, it looks like you're having the same weird doctest issue that I'm having in the other PR. |
I am not proposing to apply it with sample_posterior=True. Without sample_posterior it would be good to have a way to identify how well it converged. I can only get 26 in the doctest locally (haven't tried installing different dependencies), including with repeated […]. |
OK, that makes sense to me. I'll make a PR with the changes you suggested once work dies down a little and once we get the missforest example PR in. Do you still want to wait on the amputer and MICE example before merging this? We may not get anyone to finish those up... |
I don't think they're essential. The multiple imputation example looks good but a bit complicated, and it needs an API update. Amputation needs to happen separately.
|
impute.MissingIndicator
Could you remove this
@jnothman I assume you don't want me to make a PR for this =)
Oh sorry, I did not see it was @jnothman's PR. I'll fix it then :)
I would be OK with merging it. I think that we need to improve the training examples. I find them a bit redundant with the current examples, and maybe we should go beyond the mean and std of predictions when doing the analysis. However, I think this is something which can be done and improved when reworking the MICE example. Maybe this is something we can discuss during the sprint. |
I agree. It would be good to get this into the hands of users (i.e. merged before next major release) as I assume there will be plenty of other things we'll learn from them - and that will inform further how to make the examples better. |
I'm not sure what you mean by improving training examples. Do you mean improving the examples to use real-world missing data, or data removed not completely at random? When you say we should go beyond the mean and std of predictions, do you mean the mean and std of scores? Why should we go beyond them in this case? |
Yes, I mean only "examples"; "training" was a typo. My issue with the current example is that we show that the mean is slightly better, but with a huge standard deviation, so it is difficult to conclude anything from it. Something like a Wilcoxon test, to check how many times the iterative imputer beats the others, might bring more insight.
Sent from my phone - sorry to be brief and for potential misspellings.
|
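A minimal sketch of that suggestion, with placeholder data and estimators (diabetes with 10% of values amputed at random, RandomForestRegressor downstream, and the released experimental import path are all assumptions): run both imputers over the same CV folds and apply a Wilcoxon signed-rank test to the paired per-fold scores.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # amputate 10% of the values

# Per-fold scores for each imputer over the same 10 folds.
scores = {}
for name, imputer in [("mean", SimpleImputer()),
                      ("iterative", IterativeImputer(random_state=0))]:
    pipe = make_pipeline(imputer, RandomForestRegressor(random_state=0))
    scores[name] = cross_val_score(pipe, X, y, cv=10)

# Paired signed-rank test: a small p-value suggests the difference is
# systematic rather than noise hidden by a huge per-fold std.
stat, p = wilcoxon(scores["iterative"], scores["mean"])
print("p =", p, "| iterative wins on",
      (scores["iterative"] > scores["mean"]).sum(), "of 10 folds")
```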
Btw, I'm now again not sure about your statement about stacking a missing indicator, @sergeyf. The feature indicating missingness for the column being imputed will be all 0 in training, since those samples where it is missing are deleted, aren't they?
|
Yea, they are 0 at training and 1 at predict time. Could this ever become a problem for any passed-in estimator?
|
Something like K nearest neighbours might be affected detrimentally by that... Most estimators should not.
|
Actually, even K neighbours will just offset the distances where the query has a 1, but it should make no difference to the neighbours. Radius neighbours might be a problem.
|
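A small numeric illustration of that point, assuming Euclidean distance on toy data: the all-zero training indicator plus a query value of 1 adds a constant 1 to every squared distance, so the neighbour ranking is unchanged, while the absolute distances that a radius threshold sees do grow.

```python
import numpy as np

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ind_train = np.zeros((3, 1))      # indicator column: all 0 at fit time
query = np.array([2.0, 3.0])      # at predict time the indicator is 1

d_plain = np.linalg.norm(X_train - query, axis=1)
d_stacked = np.linalg.norm(
    np.hstack([X_train, ind_train]) - np.append(query, 1.0), axis=1)

# Same ranking (k-nearest neighbours unaffected), larger distances
# (radius-based neighbours could be affected by the shifted threshold).
assert (np.argsort(d_plain) == np.argsort(d_stacked)).all()
print(d_plain, d_stacked)
```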
My instinct is to avoid the feature entirely if some estimators could have a problem, or to explicitly remove the associated missing-indicator column.
|
Yes, if we implement it as a feature within this estimator, removing that column makes sense. But if we just recommend stacking the input, I don't think it's a big risk.
|
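For concreteness, a sketch of the "recommend stacking the input" option, assuming the released import path (the toy data and the Ridge downstream model are placeholders): a FeatureUnion feeds both the imputed values and the missingness mask to the final estimator.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, make_pipeline

pipe = make_pipeline(
    FeatureUnion([
        ("imputed", IterativeImputer(random_state=0)),
        # features="all" keeps the indicator width fixed between fit
        # and predict, even for columns with no missing values at fit.
        ("indicators", MissingIndicator(features="all")),
    ]),
    Ridge(),
)

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
pipe.fit(X, y)
```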
I'm going to merge this and get it road-tested. Can you set the base for the multiple imputation PR to master? Thanks and congratulations! |
Unfortunately GitHub has recorded me as the author of that commit :( I should have done the merge manually |
Oh man, I am very happy! Thanks to you and @glemaitre for your patience and partnership on this project. I learned a lot from you both, and I'm a better writer of software for it.
And don't worry about the authorship of the PR. I don't care that much - it's more important to me that people find this work useful.
|
AWESOME! Amazing work y'all! |
@sergeyf Thanks for the hard work. |
Great work! In reading the documentation, it seems there is no variance adjustment of the so-imputed values for downstream estimators. Is that something that is out of scope? |
@reckoner Do you have a pointer to what variance adjustment means in this case? What purpose does it serve? |
@sergeyf In Chapter 5 ("Estimation of imputation uncertainty") of […], the issue is that the estimators generated from processing a dataset that has been filled in via imputation have to be adjusted for the variance in the imputed values. For example, the mean based on the so-imputed data has a different estimator variance (e.g., confidence interval) than that of the non-imputed data. Imputing multiple times means that the uncertainty (i.e., variance) across the multiple imputations is used to estimate this variance and flow it down to the ultimate estimator. I don't know, outside of […]. I hope that helps. |
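To make the adjustment concrete, here's a hedged sketch of Rubin's pooling rules applied to a column mean (the estimand, m=10, and the synthetic data are illustrative choices, not anything this PR implements): the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[rng.rand(*X.shape) < 0.2] = np.nan

m = 10
estimates, within = [], []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    col = imp.fit_transform(X)[:, 0]
    estimates.append(col.mean())
    within.append(col.var(ddof=1) / len(col))  # sampling variance of the mean

q_bar = np.mean(estimates)              # pooled point estimate
u_bar = np.mean(within)                 # within-imputation variance
b = np.var(estimates, ddof=1)           # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b     # Rubin's total variance
print(q_bar, np.sqrt(total_var))
```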
Ah, OK. Well, […] Please take a look and comment there if you think it's appropriate. We're also happy to take contributions to that PR, because the original person (who knows MICE-related topics better than I do) was unable to continue working on it. |
You are correct. It seems this issue was raised here by @jnothman. Let me study this carefully. |
Generally we don't report the variance of coefficients in scikit-learn. Where we give predictive intervals, we do not currently have a way to adjust them. So yes, that is DIY.
|
This is a placeholder to track work on a new IterativeImputer (formerly MICE, ChainedImputer) that should accommodate the best features of R's MICE and missForest, especially as applicable to the settings of predictive modelling, clustering and decomposition.
Closes #11259
TODO:
- n_iter -> max_iter, with checks for convergence at max(abs(X - Xprev)) < tol ([MRG] IterativeImputer: n_iter->max_iter #13061); a toy sketch of this criterion follows below
- predictor -> estimator ([MRG] IterativeImputer: n_iter->max_iter #13061)
Ping @sergeyf, @RianneSchouten
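To illustrate the first TODO item, a toy from-scratch sketch of the round-robin imputation loop with that stopping criterion (this is not the scikit-learn implementation; BayesianRidge and the synthetic data are stand-ins):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X_true = rng.randn(100, 3)
mask = rng.rand(*X_true.shape) < 0.2
X_missing = np.where(mask, np.nan, X_true)

# Start from column-mean fills, then round-robin: regress each column
# on the others and re-fill its missing entries, until the imputed
# matrix stops changing.
X_filled = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
tol, max_iter = 1e-3, 10
for it in range(max_iter):
    X_prev = X_filled.copy()
    for j in range(X_filled.shape[1]):
        miss = mask[:, j]
        if not miss.any():
            continue
        others = np.delete(X_filled, j, axis=1)
        model = BayesianRidge().fit(others[~miss], X_filled[~miss, j])
        X_filled[miss, j] = model.predict(others[miss])
    # The convergence check named in the TODO above.
    if np.max(np.abs(X_filled - X_prev)) < tol:
        break
print("stopped after", it + 1, "rounds")
```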