[WIP] Basic version of MICE Imputation #8465
@jnothman I could use some help with an error triggered by
Looks to be some kind of circular import issue. If I move
Codecov Report

```
@@            Coverage Diff            @@
##           master    #8465    +/-   ##
=========================================
- Coverage   95.47%   95.38%    -0.1%
=========================================
  Files         342      343       +1
  Lines       60907    61065     +158
=========================================
+ Hits        58154    58249      +95
- Misses       2753     2816      +63
```

Continue to review full report at Codecov.
It's okay for now I feel...
@raghavrv Could you look at the CircleCI failure? It doesn't look like it has anything to do with the code changes I'm making.
Can you rebase on
MSE with the entire dataset = 3354.15
MSE without the samples containing missing values = 2968.98
MSE after mean imputation of the missing values = 3507.77
MSE after MICE imputation of the missing values = 3340.39
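For context, the comparison behind numbers like these can be sketched as below. This is a hedged reconstruction, not the example script itself: it uses `load_diabetes` (since `load_boston` has been removed from recent scikit-learn) and `SimpleImputer` (the later name of `Imputer`), so the scores it produces will differ from the figures quoted above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X, y = load_diabetes(return_X_y=True)

# MSE with the entire (complete) dataset
estimator = RandomForestRegressor(n_estimators=50, random_state=0)
mse_full = -cross_val_score(
    estimator, X, y, scoring='neg_mean_squared_error', cv=5).mean()

# Knock out ~10% of the entries at random, then mean-impute before fitting
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan
pipeline = make_pipeline(SimpleImputer(strategy='mean'), estimator)
mse_mean_imputed = -cross_val_score(
    pipeline, X_missing, y, scoring='neg_mean_squared_error', cv=5).mean()
```

Swapping the `SimpleImputer` step for the MICE imputer gives the fourth number in the list above.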
Do you have a nice use case where this value would be more demonstrative of the advantage of MICE?
Here the MSE is better than "MSE with the entire dataset", and better than "MSE after mean imputation of the missing values". Were you hoping for a more dramatic improvement?
examples/missing_values.py
Outdated
Another option is the MICE imputer. This uses round-robin linear regression,
treating every variable as an output in turn. The simple version implemented
assumes Gaussian output variables. If your output variables are obviously
non-Gaussian, consider transforming them for improve performance.
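The round-robin scheme this docstring describes can be sketched roughly as follows. This is a minimal illustration only, using `BayesianRidge` as the per-feature regressor (matching the PR's choice of Bayesian linear regression); `round_robin_impute` is a hypothetical helper, not this PR's actual API.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def round_robin_impute(X, n_iter=10):
    """Impute NaNs by cycling through features, regressing each on the rest."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    # Initialize missing entries with column means
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]
    # Round-robin: treat each feature as the output in turn
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            model = BayesianRidge()
            model.fit(X[~missing[:, j]][:, others], X[~missing[:, j], j])
            X[missing[:, j], j] = model.predict(X[missing[:, j]][:, others])
    return X
```

Because the per-feature model is a Gaussian linear regression, heavily non-Gaussian columns are better transformed (e.g. log-transformed) before imputation, as the docstring suggests.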
improving
examples/missing_values.py
Outdated
""" | ||
import numpy as np | ||
|
||
from sklearn.datasets import load_boston | ||
from sklearn.datasets import load_diabetes, load_boston |
Could you do it in a separate line? (Helps avoid some merge conflicts...)
examples/missing_values.py
Outdated
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
- from sklearn.preprocessing import Imputer
+ from sklearn.preprocessing import Imputer, MICEImputer
Same comment...
examples/missing_values.py
Outdated
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

- dataset = load_boston()
+ dataset_name = 'diabetes'  # 'diabetes' for another examples
Sure, I could do that. Let me know if there's a consensus.
I've triggered a rebuild to see if it fixes the issue...
@glemaitre I rebased, but the changes to
@glemaitre What I mean is, it currently looks like my pull request is modifying
Uhm, it seems that something went wrong somehow.
@glemaitre It's possible. I do most of my gitting by Googling, so who knows what happened here =) Any ideas on how to fix it?
I would check out a new branch just to be sure, and then remove the commit 0ab5c67. Then you should rebase on master to get the changes from master, and that should be fine.
@glemaitre Thanks, that mostly worked. There is a tiny change left due to a mishap on my part, but we're no longer re-applying. I think the cleanest thing would be to close this pull request and just start a new one from scratch. Thoughts?
You can squash your commits into a single one in fact:
So, yes to the totally new pull request?
I would squash the changes. It would give the same results.
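The drop-a-commit-then-rebase workflow suggested above can be sketched as follows. This is a hedged illustration run in a throwaway repository: the file names and commit messages are invented, and `$bad` stands in for a commit id like 0ab5c67 from the real branch.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo

echo one > a.txt;  git add a.txt; git commit -qm "good: add a"
echo oops > b.txt; git add b.txt; git commit -qm "bad: unwanted commit"
bad=$(git rev-parse HEAD)          # stands in for a commit like 0ab5c67
echo two > c.txt;  git add c.txt; git commit -qm "good: add c"

git branch backup                  # safety copy before rewriting history
# Replay everything after $bad onto its parent, dropping $bad itself
git rebase --onto "$bad"^ "$bad"
```

After the rebase, `a.txt` and `c.txt` survive while `b.txt` is gone; in the real branch you would then `git rebase master` (or squash with an interactive rebase) as discussed above.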
…kit-learn#7860)
* DOC adding a warning on the relation between C and alpha
* DOC removing extra character
* DOC: changes to the relation described
* DOC fixing typo
* DOC fixing typo
* DOC fixing link to Ridge
* DOC link enhancement
* DOC fixing line length
Until now we were in an edge case on assert_array_equal
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are given. In this case the means of each of X's columns are computed and subtracted from init for no reason. To reproduce:

```python
import numpy as np
import scipy.sparse
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Get a local optimum
centers = KMeans(n_clusters=3).fit(X).cluster_centers_

# Fitting starting from a local optimum shouldn't change the solution
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_)

# The same should be true when X is sparse, but wasn't before the bug fix
X_sparse = scipy.sparse.csr_matrix(X)
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_)
```
…arn#7655)
* ENH Implement mean squared log error in sklearn.metrics.regression
* TST Add tests for mean squared log error.
* DOC Write user guide and docstring about mean squared log error.
* ENH Add neg_mean_squared_log_error in metrics.scorer
…scikit-learn#7838)
* initial commit for return_std
* initial commit for return_std
* adding tests, examples, ARD predict_std
* adding tests, examples, ARD predict_std
* a smidge more documentation
* a smidge more documentation
* Missed a few PEP8 issues
* Changing predict_std to return_std #1
* Changing predict_std to return_std scikit-learn#2
* Changing predict_std to return_std scikit-learn#3
* Changing predict_std to return_std final
* adding better plots via polynomial regression
* trying to fix flake error
* fix to ARD plotting issue
* fixing some flakes
* Two blank lines part 1
* Two blank lines part 2
* More newlines!
* Even more newlines
* adding info to the doc string for the two plot files
* Rephrasing "polynomial" for Bayesian Ridge Regression
* Updating "polynomia" for ARD
* Adding more formal references
* Another asked-for improvement to doc string.
* Fixing flake8 errors
* Cleaning up the tests a smidge.
* A few more flakes
* requested fixes from Andy
* Mini bug fix
* Final pep8 fix
* pep8 fix round 2
* Fix beta_ to alpha_ in the comments
Reference Issue
This is in reference to #7840, and builds on #7838.
What does this implement/fix? Explain your changes.
This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf
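In released scikit-learn this functionality is exposed as `IterativeImputer` (still experimental, hence the enabling import); the sketch below shows usage in today's API rather than this PR's proposed `MICEImputer` name. The toy matrix and expectations are illustrative only.

```python
import numpy as np
# IterativeImputer is experimental; importing this flag enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two perfectly correlated columns (col 1 == 2 * col 0), with one NaN in each
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [np.nan, 8.0],
              [5.0, np.nan]])

# Default estimator is BayesianRidge, i.e. Bayesian linear regression,
# matching the prediction model described in this PR
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

Each missing entry is predicted from the observed column, so the NaN in row 3 should land near 4.0 and the NaN in row 4 near 10.0.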
To-dos
(1) pep8, etc.
(2) Additional documentation.
(3) Tests. I could use some suggestions here.