[WIP] Basic version of MICE Imputation #8465


Closed
wants to merge 447 commits into from

Conversation

@sergeyf (Contributor) commented Feb 27, 2017

Reference Issue

This is in reference to #7840, and builds on #7838.

What does this implement/fix? Explain your changes.

This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf

To-dos

(1) PEP8 compliance, etc.
(2) Additional documentation.
(3) Tests. I could use some suggestions here.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@jnothman I could use some help with an error triggered by from sklearn import metrics:

Traceback (most recent call last):

  File "<ipython-input-1-aad81aec5908>", line 1, in <module>
    from sklearn import metrics

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\__init__.py", line 16, in <module>
    from .classification import accuracy_score

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 31, in <module>
    from ..preprocessing import LabelBinarizer, label_binarize

  File "D:\Anaconda2\lib\site-packages\sklearn\preprocessing\__init__.py", line 32, in <module>
    from .mice import MICEImputer

  File "D:\Anaconda2\lib\site-packages\sklearn\preprocessing\mice.py", line 12, in <module>
    from ..linear_model import BayesianRidge

  File "D:\Anaconda2\lib\site-packages\sklearn\linear_model\__init__.py", line 15, in <module>
    from .least_angle import (Lars, LassoLars, lars_path, LarsCV, LassoLarsCV,

  File "D:\Anaconda2\lib\site-packages\sklearn\linear_model\least_angle.py", line 25, in <module>
    from ..model_selection import check_cv

  File "D:\Anaconda2\lib\site-packages\sklearn\model_selection\__init__.py", line 17, in <module>
    from ._validation import cross_val_score

  File "D:\Anaconda2\lib\site-packages\sklearn\model_selection\_validation.py", line 28, in <module>
    from ..metrics.scorer import check_scoring

  File "D:\Anaconda2\lib\site-packages\sklearn\metrics\scorer.py", line 26, in <module>
    from . import (r2_score, median_absolute_error, mean_absolute_error,

ImportError: cannot import name r2_score

It looks like a circular import issue. If I move from ..linear_model import BayesianRidge into the __init__ of MICEImputer in mice.py, the error goes away. Is that acceptable style?
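For context, the workaround described above is a deferred (function-level) import: the import statement only runs when the object is created, by which point the module cycle has finished loading. A minimal, stdlib-only sketch of the pattern; the class name `LazyImputer` is hypothetical, and `json` stands in for `BayesianRidge`:

```python
class LazyImputer:
    """Sketch of the deferred-import workaround for circular imports.

    The import below runs at instantiation time, not at module import
    time, so an import cycle passing through this module can finish
    loading before the name is resolved.
    """

    def __init__(self):
        # Deferred import: executed only when an instance is created.
        import json  # stand-in for sklearn's BayesianRidge
        self._model = json

    def describe(self, obj):
        return self._model.dumps(obj)
```

In the PR's case the deferred line would be `from ..linear_model import BayesianRidge` inside `MICEImputer.__init__`, exactly as described in the comment above.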

@codecov (bot) commented Feb 28, 2017

Codecov Report

Merging #8465 into master will decrease coverage by 0.1%.
The diff coverage is 61.34%.

@@            Coverage Diff            @@
##           master    #8465     +/-   ##
=========================================
- Coverage   95.47%   95.38%   -0.1%     
=========================================
  Files         342      343      +1     
  Lines       60907    61065    +158     
=========================================
+ Hits        58154    58249     +95     
- Misses       2753     2816     +63
Impacted Files                        Coverage Δ
sklearn/preprocessing/imputation.py   94.23% <100%> (+0.07%)
sklearn/utils/estimator_checks.py     93.28% <100%> (ø)
sklearn/preprocessing/__init__.py     100% <100%> (ø)
sklearn/preprocessing/mice.py         58.94% <58.94%> (ø)
sklearn/dummy.py                      97.71% <83.33%> (-0.54%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0ab5c67...c8cb82a.

@raghavrv (Member):

> Is that acceptable style?

It's okay for now I feel...

@sergeyf (Contributor, Author) commented Feb 28, 2017

@raghavrv Could you look at the failing CircleCI build? It doesn't look related to the code changes I'm making.

@glemaitre (Member):

Can you rebase on master?

MSE with the entire dataset = 3354.15
MSE without the samples containing missing values = 2968.98
MSE after mean imputation of the missing values = 3507.77
MSE after MICE imputation of the missing values = 3340.39
(Member) Do you have a nicer use case where this value would better demonstrate the advantage of MICE?

(Contributor, Author) Here the MSE is better than "MSE with the entire dataset", and better than "MSE after mean imputation of the missing values". Were you hoping for a more dramatic improvement?

Another option is the MICE imputer. This uses round-robin linear regression,
treating every variable as an output in turn. The simple version implemented
assumes Gaussian output variables. If your output variables are obviously
non-Gaussian, consider transforming them for improve performance.
(Member) improving
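The round-robin scheme the docstring describes can be sketched with plain least squares standing in for the PR's BayesianRidge chained equations. Everything here (the function name, the use of `np.linalg.lstsq`, the mean-imputation warm start) is illustrative, not the PR's actual implementation:

```python
import numpy as np

def round_robin_impute(X, n_iter=5):
    """Minimal sketch of round-robin (chained-equation) imputation.

    Each column with missing entries is regressed, in turn, on all the
    other columns via ordinary least squares, and its missing entries
    are replaced by the predictions.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    # Warm start: fill missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            # Design matrix: all other columns plus an intercept term.
            A = np.column_stack([np.delete(X, j, axis=1),
                                 np.ones(X.shape[0])])
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[missing[:, j], j] = A[missing[:, j]] @ coef
    return X
```

On data where one column is an exact linear function of another, the missing entry is recovered exactly; the Gaussian-output assumption noted in the docstring corresponds to the least-squares model used for each column here.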

"""
import numpy as np

from sklearn.datasets import load_boston
from sklearn.datasets import load_diabetes, load_boston
(Member) Could you do it on a separate line? (Helps avoid some merge conflicts...)

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import Imputer, MICEImputer
(Member) Same comment...

from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston()
dataset_name = 'diabetes' # 'diabetes' for another example
(Member) Do you think it would be useful to instead make a bar plot for each dataset... (cc: @jnothman @amueller)

(Contributor, Author) Sure, I could do that. Let me know if there's a consensus.

@raghavrv (Member):

I've triggered a rebuild to see if it fixes the issue...

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre I rebased, but the changes to plot_ols.py are still in there. My git-fu is weak, so I'm not sure why.

@glemaitre (Member):

> I rebased, but the changes to plot_ols.py are still in there

I should hope so ;). Rebasing brings the changes from master into your branch so that you have the latest version of the code with your changes on top. Some commits may have fixed the CI issue you saw, if it is not linked to your changes.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre What I mean is that my pull request currently appears to modify examples/linear_model/plot_ols.py, but that's not the case. That change was already merged in #8241.

@glemaitre (Member):

Hmm, it seems that something went wrong. The commits from #8241 appear in your history with different commit hashes than in master. I wouldn't be surprised if you merged at some point instead of rebasing, didn't you?

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre It's possible. I do most of my gitting by Googling, so who knows what happened here =)

Any ideas on how to fix it?

@glemaitre (Member):

I would check out a new branch just to be sure, and then remove the commit 0ab5c67.
Check this for some help:
https://sethrobertson.github.io/GitFixUm/fixup.html#remove_deep

Then rebase on master to pick up its latest changes, and that should be fine.

@sergeyf (Contributor, Author) commented Feb 28, 2017

@glemaitre Thanks, that mostly worked. There's a tiny change left due to a mishap on my part, but we're no longer re-applying those commits. I think the cleanest thing would be to close this pull request and start a new one from scratch. Thoughts?

@glemaitre (Member):

In fact, you can squash your commits into a single one:
https://ariejan.net/2011/07/05/git-squash-your-latests-commits-into-one/

@sergeyf (Contributor, Author) commented Feb 28, 2017

So, yes to the totally new pull request?

@glemaitre (Member):

I would squash the changes. It would give the same result.

dalmia and others added 12 commits February 28, 2017 14:05
…kit-learn#7860)

* DOC adding a warning on the relation between C and alpha

* DOC removing extra character

* DOC: changes to the relation described

* DOC fixing typo

* DOC fixing typo

* DOC fixing link to Ridge

* DOC link enhancement

* DOC fixing line length
Until now we were in an edge case on assert_array_equal
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy.sparse
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   # Get a local optimum
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   # Fitting starting from a local optimum shouldn't change the solution
   np.testing.assert_allclose(
       centers,
       KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   # The same should hold when X is sparse, but didn't before the bug fix
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
       centers,
       KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
…arn#7655)

* ENH Implement mean squared log error in sklearn.metrics.regression

* TST Add tests for mean squared log error.

* DOC Write user guide and docstring about mean squared log error.

* ENH Add neg_mean_squared_log_error in metrics.scorer
…scikit-learn#7838)

* initial commit for return_std

* initial commit for return_std

* adding tests, examples, ARD predict_std

* adding tests, examples, ARD predict_std

* a smidge more documentation

* a smidge more documentation

* Missed a few PEP8 issues

* Changing predict_std to return_std #1

* Changing predict_std to return_std scikit-learn#2

* Changing predict_std to return_std scikit-learn#3

* Changing predict_std to return_std final

* adding better plots via polynomial regression

* trying to fix flake error

* fix to ARD plotting issue

* fixing some flakes

* Two blank lines part 1

* Two blank lines part 2

* More newlines!

* Even more newlines

* adding info to the doc string for the two plot files

* Rephrasing "polynomial" for Bayesian Ridge Regression

* Updating "polynomia" for ARD

* Adding more formal references

* Another asked-for improvement to doc string.

* Fixing flake8 errors

* Cleaning up the tests a smidge.

* A few more flakes

* requested fixes from Andy

* Mini bug fix

* Final pep8 fix

* pep8 fix round 2

* Fix beta_ to alpha_ in the comments
@sergeyf sergeyf closed this Feb 28, 2017
@sergeyf sergeyf deleted the mice branch February 28, 2017 22:43