[WIP] Basic version of MICE Imputation #8465


Closed
wants to merge 447 commits into from

Conversation

@sergeyf (Contributor) commented Feb 27, 2017

Reference Issue

This is in reference to #7840, and builds on #7838.

What does this implement/fix? Explain your changes.

This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf

To-dos

(1) PEP8 compliance, etc.
(2) Additional documentation.
(3) Tests. I could use some suggestions here.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@jnothman I could use some help with an error triggered by from sklearn import metrics:

Traceback (most recent call last):

  File "<ipython-input-1-aad81aec5908>", line 1, in <module>
    from sklearn import metrics

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\__init__.py", line 16, in <module>
    from .classification import accuracy_score

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 31, in <module>
    from ..preprocessing import LabelBinarizer, label_binarize

  File "D:\Anaconda2\lib\site-packages\sklearn\preprocessing\__init__.py", line 32, in <module>
    from .mice import MICEImputer

  File "D:\Anaconda2\lib\site-packages\sklearn\preprocessing\mice.py", line 12, in <module>
    from ..linear_model import BayesianRidge

  File "D:\Anaconda2\lib\site-packages\sklearn\linear_model\__init__.py", line 15, in <module>
    from .least_angle import (Lars, LassoLars, lars_path, LarsCV, LassoLarsCV,

  File "D:\Anaconda2\lib\site-packages\sklearn\linear_model\least_angle.py", line 25, in <module>
    from ..model_selection import check_cv

  File "D:\Anaconda2\lib\site-packages\sklearn\model_selection\__init__.py", line 17, in <module>
    from ._validation import cross_val_score

  File "D:\Anaconda2\lib\site-packages\sklearn\model_selection\_validation.py", line 28, in <module>
    from ..metrics.scorer import check_scoring

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\scorer.py", line 26, in <module>
    from . import (r2_score, median_absolute_error, mean_absolute_error,

ImportError: cannot import name r2_score

It looks like a circular import issue. If I move from ..linear_model import BayesianRidge into the __init__ of MICEImputer in mice.py, the error goes away. Is that acceptable style?
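For context, the workaround described above is a deferred (function-level) import: the import statement only runs when the object is created, by which point the module cycle has finished loading. A minimal, stdlib-only sketch of the pattern; the class name `LazyImputer` is hypothetical, and `json` stands in for `BayesianRidge`:

```python
class LazyImputer:
    """Sketch of the deferred-import workaround for circular imports.

    The import below runs at instantiation time, not at module import
    time, so an import cycle passing through this module can finish
    loading before the name is resolved.
    """

    def __init__(self):
        # Deferred import: executed only when an instance is created.
        import json  # stand-in for sklearn's BayesianRidge
        self._model = json

    def describe(self, obj):
        return self._model.dumps(obj)
```

In the PR's case the deferred line would be `from ..linear_model import BayesianRidge` inside `MICEImputer.__init__`, exactly as described in the comment above.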

@codecov (bot) commented Feb 28, 2017

Codecov Report

Merging #8465 into master will decrease coverage by 0.1%.
The diff coverage is 61.34%.

@@            Coverage Diff            @@
##           master    #8465     +/-   ##
=========================================
- Coverage   95.47%   95.38%   -0.1%     
=========================================
  Files         342      343      +1     
  Lines       60907    61065    +158     
=========================================
+ Hits        58154    58249     +95     
- Misses       2753     2816     +63
Impacted Files                        Coverage Δ
sklearn/preprocessing/imputation.py   94.23% <100%> (+0.07%)
sklearn/utils/estimator_checks.py     93.28% <100%> (ø)
sklearn/preprocessing/__init__.py     100% <100%> (ø)
sklearn/preprocessing/mice.py         58.94% <58.94%> (ø)
sklearn/dummy.py                      97.71% <83.33%> (-0.54%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0ab5c67...c8cb82a.

@raghavrv (Member):

> Is that acceptable style?

It's okay for now I feel...

@sergeyf (Contributor, Author) commented Feb 28, 2017

@raghavrv Could you look at the failing CircleCI build? It doesn't look related to the code changes I'm making.

@glemaitre (Member):

Can you rebase on master?

MSE with the entire dataset = 3354.15
MSE without the samples containing missing values = 2968.98
MSE after mean imputation of the missing values = 3507.77
MSE after MICE imputation of the missing values = 3340.39
(Member) Do you have a nicer use case where this value would better demonstrate the advantage of MICE?

(Contributor, Author) Here the MSE is better than "MSE with the entire dataset", and better than "MSE after mean imputation of the missing values". Were you hoping for a more dramatic improvement?

Another option is the MICE imputer. This uses round-robin linear regression,
treating every variable as an output in turn. The simple version implemented
assumes Gaussian output variables. If your output variables are obviously
non-Gaussian, consider transforming them for improve performance.
(Member) improving
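The round-robin scheme the docstring describes can be sketched with plain least squares standing in for the PR's BayesianRidge chained equations. Everything here (the function name, the use of `np.linalg.lstsq`, the mean-imputation warm start) is illustrative, not the PR's actual implementation:

```python
import numpy as np

def round_robin_impute(X, n_iter=5):
    """Minimal sketch of round-robin (chained-equation) imputation.

    Each column with missing entries is regressed, in turn, on all the
    other columns via ordinary least squares, and its missing entries
    are replaced by the predictions.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    # Warm start: fill missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            # Design matrix: all other columns plus an intercept term.
            A = np.column_stack([np.delete(X, j, axis=1),
                                 np.ones(X.shape[0])])
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[missing[:, j], j] = A[missing[:, j]] @ coef
    return X
```

On data where one column is an exact linear function of another, the missing entry is recovered exactly; the Gaussian-output assumption noted in the docstring corresponds to the least-squares model used for each column here.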

"""
import numpy as np

from sklearn.datasets import load_boston
from sklearn.datasets import load_diabetes, load_boston
(Member) Could you do it on a separate line? (Helps avoid some merge conflicts...)

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import Imputer, MICEImputer
(Member) Same comment...

from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston()
dataset_name = 'diabetes' # 'diabetes' for another example
(Member) Do you think it would be useful to instead make a bar plot for each dataset... (cc: @jnothman @amueller)

(Contributor, Author) Sure, I could do that. Let me know if there's a consensus.

@raghavrv (Member):

I've triggered a rebuild to see if it fixes the issue...

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre I rebased, but the changes to plot_ols.py are still in there. My git-fu is weak, so I'm not sure why.

@glemaitre (Member):

> I rebased, but the changes to plot_ols.py are still in there

I should hope so ;). Rebasing brings the changes from master into your branch so that you have the latest version of the code with your changes on top. Some commits may have fixed the CI issue you saw, if it is not linked to your changes.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre What I mean is that my pull request currently appears to modify examples/linear_model/plot_ols.py, but that's not the case. That change was already merged in #8241.

@glemaitre (Member):

Hmm, it seems that something went wrong. The commits from #8241 appear in your history with different commit hashes than in master. I wouldn't be surprised if you merged at some point instead of rebasing, didn't you?

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre It's possible. I do most of my gitting by Googling, so who knows what happened here =)

Any ideas on how to fix it?

@glemaitre (Member):

I would check out a new branch just to be sure, and then remove the commit 0ab5c67.
Check this for some help:
https://sethrobertson.github.io/GitFixUm/fixup.html#remove_deep

Then rebase on master to pick up its latest changes, and that should be fine.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre Thanks, that mostly worked. There's a tiny change left due to a mishap on my part, but we're no longer re-applying those commits. I think the cleanest thing would be to close this pull request and start a new one from scratch. Thoughts?

@glemaitre (Member):

In fact, you can squash your commits into a single one:
https://ariejan.net/2011/07/05/git-squash-your-latests-commits-into-one/

@sergeyf (Contributor, Author) commented Feb 28, 2017

So, yes to the totally new pull request?

@glemaitre (Member):

I would squash the changes. It would give the same result.

dalmia and others added 12 commits February 28, 2017 14:05
…kit-learn#7860)

* DOC adding a warning on the relation between C and alpha

* DOC removing extra character

* DOC: changes to the relation described

* DOC fixing typo

* DOC fixing typo

* DOC fixing link to Ridge

* DOC link enhancement

* DOC fixing line length
Until now we were in an edge case on assert_array_equal
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy.sparse
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   # Get a local optimum
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   # Fitting starting from a local optimum shouldn't change the solution
   np.testing.assert_allclose(
       centers,
       KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   # The same should hold when X is sparse, but didn't before the bug fix
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
       centers,
       KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
…arn#7655)

* ENH Implement mean squared log error in sklearn.metrics.regression

* TST Add tests for mean squared log error.

* DOC Write user guide and docstring about mean squared log error.

* ENH Add neg_mean_squared_log_error in metrics.scorer
…scikit-learn#7838)

* initial commit for return_std

* initial commit for return_std

* adding tests, examples, ARD predict_std

* adding tests, examples, ARD predict_std

* a smidge more documentation

* a smidge more documentation

* Missed a few PEP8 issues

* Changing predict_std to return_std #1

* Changing predict_std to return_std scikit-learn#2

* Changing predict_std to return_std scikit-learn#3

* Changing predict_std to return_std final

* adding better plots via polynomial regression

* trying to fix flake error

* fix to ARD plotting issue

* fixing some flakes

* Two blank lines part 1

* Two blank lines part 2

* More newlines!

* Even more newlines

* adding info to the doc string for the two plot files

* Rephrasing "polynomial" for Bayesian Ridge Regression

* Updating "polynomia" for ARD

* Adding more formal references

* Another asked-for improvement to doc string.

* Fixing flake8 errors

* Cleaning up the tests a smidge.

* A few more flakes

* requested fixes from Andy

* Mini bug fix

* Final pep8 fix

* pep8 fix round 2

* Fix beta_ to alpha_ in the comments
@sergeyf sergeyf closed this Feb 28, 2017
@sergeyf sergeyf deleted the mice branch February 28, 2017 22:43