[MRG] IterativeImputer extended example #12100
Conversation
doc/whats_new/v0.21.rst
Outdated
@@ -40,6 +40,14 @@ Support for Python 3.4 and below has been officially dropped.
- An entry goes here
- An entry goes here

:mod:`sklearn.cluster`
This is in master, and I saw there was a conflict for this file. Hopefully this resolves it.
I think this is a bad fix to the conflict. Please just git checkout master doc/whats_new/v0.21.rst
Yeah, this diff isn't right. I should say: git checkout iterativeimputer doc/whats_new/v0.21.rst
@jnothman Just pinging you here. Let me know if there's anything else that's needed for this PR.
To be fair, the imputation pipelines could include a MissingIndicator to help improve performance. But maybe that's beside the point.
These MCAR examples are boring, and we should look into porting @RianneSchouten's amputation algorithms (#6284)...
The left sides of your images are cut off.
With the comparison of regressors, we probably don't need the separate missForest example.
It should be noted in the narrative docs that IterativeImputer includes the functionality of missForest.
Are there aspects of missForest that we do not have? Someone mentioned convergence detection and early stopping...?
make_union(IterativeImputer(missing_values=0, random_state=0),
make_union(IterativeImputer(missing_values=0,
                            random_state=0,
                            n_nearest_features=5),
           MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
can you use a function to encapsulate creating the pipeline with a specified imputer and getting the scores? That way we might avoid similar errors here.
Sure, can do.
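For illustration, a minimal sketch of such a helper; the function name and the surrounding variables (X_missing, y_missing, the 0 missing-value sentinel) are assumptions for this sketch, not necessarily what the example ends up with:

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import MissingIndicator
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline, make_union


def get_impute_scores(imputer, X_missing, y_missing, n_splits=5):
    # Union of imputed features and missing-indicator features, followed by
    # the downstream regressor, so every imputer is scored the same way.
    estimator = make_pipeline(
        make_union(imputer, MissingIndicator(missing_values=0)),
        RandomForestRegressor(random_state=0, n_estimators=100))
    return cross_val_score(estimator, X_missing, y_missing,
                           scoring='neg_mean_squared_error', cv=n_splits)

It could then be called once per imputer, e.g. with SimpleImputer or IterativeImputer, so the pipeline is only defined in one place.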
y_missing = y_full.copy()

# Random Forest predictor with default values according to missForest docs
predictor = RandomForestRegressor(n_estimators=100, max_features='sqrt')
So this uses a different random state in each imputation round. What if we want it to be deterministic? Is it a problem then if we use the exact same random state (e.g. random_state=0) in each imputation round? Is there a way to make each RF have a different, fixed random state???
(What does missForest say about random determinism?)
I can change this to be random_state=0 for reproducibility. I can't think of any problem with using the same random state in each imputation round, since missForest is not like MICE, i.e. we are not sampling from posteriors. There is nothing in the missForest paper about random determinism - I searched the PDF for variants of both of those words and nothing relevant came up: https://arxiv.org/abs/1105.0828
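A small sketch of what a fixed random state would look like; it uses the estimator keyword and the experimental import of the released scikit-learn API, which is an assumption relative to this branch's predictor argument:

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# The forest gets a fixed random_state, so every imputation round refits the
# same deterministic configuration; the imputer's own random_state is used,
# e.g., for a randomized imputation order or for posterior sampling.
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=0)
imputer = IterativeImputer(estimator=rf, random_state=0)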
It is indeed boring! Should we try to get an implementation of those?
Re: not needing a separate missForest example. I included it for the extra verbosity, but I can move that verbosity into the regressor comparison example.
Re: adding missForest to narrative docs. Sure.
Re: convergence & early stopping. We don't have that yet. I can stick that into this PR.
What is the status of the multiple imputation example? Should I improve the example that I made two months ago and put it in a PR, or has something else been decided? I am working on some research comparing the MSE outcome of imputation methods with bias and statistical validity measures. I cannot wait to finish that and send it to you guys. But a few more months to go.
@RianneSchouten It would be great to have the MI examples updated & completed to use IterativeImputer.
The MI example needs to be rebased onto the iterativeimputer branch, and then needs review. And ampute would best live in sklearn/datasets/samples_generator.py, I think.
@RianneSchouten To add/clarify what @jnothman said: we would like to have a basic version of your ampute function in scikit-learn. It would require a bunch of work on the order of what's in this unfinished PR: #7084. I.e. thorough tests, documentation in all the right places, etc. Your MI example (after rebasing) might then serve two purposes: to demonstrate how to use the ampute function and how to do multiple imputation with IterativeImputer.
@jnothman Regarding the stopping criteria. missForest has an odd one: it stops as soon as the difference between the newly imputed data matrix and the previous one increases for the first time.
I think this wouldn't work in sample posterior mode, and is thus not general enough. Most common stopping criteria that I've seen use a tolerance on the change between successive imputations. I plan on implementing the latter. Any objections?
I'm confused about what they mean by "the differences went up". Do they mean that the predictive performance on known (rather than missing) elements degraded?
I think it means the following: suppose |X_1 - X_0| = 4, |X_2 - X_1| = 3, |X_3 - X_2| = 2, and |X_4 - X_3| = 5. Then they return X_3, which is the final one before the diff goes up. That won't work when using IterativeImputer as MICE because the diffs will be random.
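For concreteness, a rough sketch of the tolerance-based criterion being proposed; the variable names and the exact scaling are illustrative assumptions, not the code in this PR:

import numpy as np


def should_stop(X_prev, X_new, tol=1e-3):
    # Stop when the largest absolute change between two consecutive imputed
    # matrices drops below a tolerance scaled by the magnitude of the data.
    max_change = np.max(np.abs(X_new - X_prev))
    return max_change < tol * np.max(np.abs(X_new))

Unlike the missForest rule, this never requires the difference to start increasing, so it also behaves sensibly when imputations are sampled from posteriors.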
Something else: what is the planned time schedule for the MI example and the ampute function?
I did not pay attention, but we could probably improve the multi-label display.
I'll just change the labels on the y-axis manually.
cross_val_score(
    br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
    cv=N_SPLITS
)
-)
+), columns=['Full Data']
    keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

labels = ['Full Data',
remove this block
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.invert_yaxis()
ax.set_yticklabels(labels)
-ax.set_yticklabels(labels)
+ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
It's neater, but you still don't get the 'Full Data' label. I'll leave it as is.
Oh I see you made an edit above to take care of this. OK.
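To make these plotting fragments easier to follow, here is a rough sketch of how the pieces fit together; the score values and labels are placeholders, not results from the example:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder cross-validated scores (negative MSE), one column per setup.
scores = pd.DataFrame({
    'Full Data': np.array([-0.55, -0.60, -0.58]),
    'SimpleImputer': np.array([-0.70, -0.72, -0.69]),
    'IterativeImputer': np.array([-0.62, -0.65, -0.63]),
})

means = -scores.mean()   # flip the sign so the bars show MSE directly
errors = scores.std()

fig, ax = plt.subplots(figsize=(8, 4))
means.plot.barh(xerr=errors, ax=ax)
ax.set_xlabel('MSE (smaller is better)')
ax.invert_yaxis()
plt.show()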
rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
-n_samples = X_full.shape[0]
+n_samples, n_features = X_full.shape
X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
n_features = X_full.shape[1]
After making the last changes, LGTM.
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies) of the missing values
PEP8
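For context, a minimal sketch of the MCAR masking these lines belong to; the variable names follow the surrounding snippet, but the missing rate and the way rows are chosen are assumptions for the sketch:

import numpy as np
from sklearn.datasets import fetch_california_housing

rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Missing completely at random (MCAR): pick which rows receive a missing
# value and, independently, in which feature column it lands.
missing_rate = 0.75  # assumed fraction of rows with one missing entry
missing_samples = rng.choice(n_samples, int(n_samples * missing_rate),
                             replace=False)
missing_features = rng.choice(n_features, len(missing_samples), replace=True)

X_missing = X_full.copy()
X_missing[missing_samples, missing_features] = np.nan
y_missing = y_full.copy()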
OK, all addressed.
@jnothman LGTM if you want to make a new check on it.
Can you summarise what has changed?
Mostly it's the addition of Pandas to make the code shorter, plus some PEP8 line-length things. Also, since we swapped in BayesianRidge for RidgeCV, the BayesianRidge is doing about as well as ExtraTrees.
Otherwise lgtm.
In this example we compare some predictors for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

:class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
Is there a reason for this ordering?
No. Should there be?
I think it would be better if there was. At least it seems awkward that the tree and the forest are not together if it's not alphabetical. Make it alphabetical by class name. Hide the import path by putting a ~ after the first backtick.
Done.
# Estimate the score after iterative imputation of the missing values
# with different predictors
predictors = [
    BayesianRidge(),
Is there a reason for this ordering? It doesn't match the one above
I'm confused. It's the same order as in the docstring?
:class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
:class:`sklearn.tree.DecisionTreeRegressor`: non-linear regression
:class:`sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN imputation approaches
:class:`sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
Oh, maybe I misread.
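For reference, a sketch of the full predictor list in the docstring's order; the constructor arguments are assumptions and not necessarily what the example uses:

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Same order as the docstring: linear, tree, KNN, then the missForest analogue.
predictors = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    KNeighborsRegressor(n_neighbors=15),
    ExtraTreesRegressor(n_estimators=10, random_state=0),
]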
Thanks!!
This PR builds on all of the IterativeImputer work. See #11977.

It adds an example that shows how to cobble together a missForest instance using IterativeImputer. Pretty simple example. We already have a plot_missing_values.py to compare performance, so this is just an example of how to put the code together.

There is also a second example comparing different versions of IterativeImputer: like plot_missing_values.py, except with RidgeCV vs HuberRegressor vs RandomForestRegressor vs KNeighborsRegressor vs DecisionTreeRegressor, etc. It's kind of interesting that you can just stick KNeighborsRegressor in here because there's a whole non-iterative version of KNN imputation. It would be interesting to eventually compare them. You find that HuberRegressor is by far the best, even beating the performance obtained with all the data.

Also, I found a bug in plot_missing_values.py that was the reason IterativeImputer was doing so poorly. The CV k was set to 5 for all but the IterativeImputer pipeline. When you set it to 5 as well, it is on par with the other algorithms for the Boston dataset.

Paging @jnothman and @glemaitre.
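To illustrate the "missForest via IterativeImputer" idea described above, a minimal end-to-end sketch; it uses ExtraTreesRegressor (the docstring's missForest analogue) and the released scikit-learn API (the estimator keyword and the experimental import), which are assumptions relative to this branch:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer

# A forest of randomized trees as the round-robin regressor inside the
# iterative imputer, roughly mirroring missForest's setup.
miss_forest_like = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=100,
                                  max_features='sqrt',
                                  random_state=0),
    max_iter=10,
    random_state=0)

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [4.0, 5.0, 7.0],
              [np.nan, 6.0, 8.0]])
print(miss_forest_like.fit_transform(X))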