FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True #29779
Conversation
We will need some non-regression tests to check that we get the expected results. Off the top of my head, we should test that `keep_empty_features=True` and `keep_empty_features=False` give consistent results on the non-empty features, and that fully missing features no longer raise an error.
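A minimal sketch of such a non-regression test, assuming the public `IterativeImputer` API; the test name, data, and assertions below are illustrative and are not the actual tests added in this PR:

```python
import numpy as np
import pytest
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


@pytest.mark.parametrize("keep_empty_features", [True, False])
def test_iterative_imputer_with_empty_feature(keep_empty_features):
    # The last column is "empty": it consists exclusively of missing values.
    X = np.array(
        [
            [1.0, 10.0, np.nan],
            [2.0, np.nan, np.nan],
            [np.nan, 30.0, np.nan],
            [4.0, 40.0, np.nan],
        ]
    )
    imputer = IterativeImputer(
        keep_empty_features=keep_empty_features, random_state=0
    )
    X_imputed = imputer.fit_transform(X)

    if keep_empty_features:
        # The empty column is kept and filled with a constant value.
        assert X_imputed.shape == X.shape
        assert np.ptp(X_imputed[:, -1]) == 0.0
    else:
        # The empty column is dropped.
        assert X_imputed.shape == (X.shape[0], X.shape[1] - 1)
    # No missing values should remain after imputation.
    assert not np.isnan(X_imputed).any()
```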
So from the code, I think that the diff is not what I would expect. I would only expect a change in the initial imputation, with something like:

    if not self.keep_empty_features:
        # drop empty features
        valid_mask = np.flatnonzero(
            np.logical_not(np.isnan(self.initial_imputer_.statistics_))
        )
        Xt = X[:, valid_mask]
        mask_missing_values = mask_missing_values[:, valid_mask]
    else:
        # mark empty features as not missing and keep the original
        # imputation
        is_empty_feature = np.all(mask_missing_values, axis=0)
        mask_missing_values[:, is_empty_feature] = False
        Xt = X
        Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]
In addition to the tests and the right fix, we need an entry in the changelog.
Thanks for the feedback @glemaitre. I have added two more tests covering various cases where `keep_empty_features` is set to `True`.
I have updated the changes as suggested, and apparently there is one case that was not covered. Additionally, I have added a new entry to the 1.5.2 changelog.
It means that the line:

    valid_mask = np.flatnonzero(
        np.logical_not(np.isnan(self.initial_imputer_.statistics_))
    )

is not doing what we think: for most imputations, we expect to have

    is_empty_feature = np.all(mask_missing_values, axis=0)

instead.
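For context, a small standalone NumPy sketch (illustrative only, not code from the PR) of how `np.all` on the missing-value mask flags empty features directly:

```python
import numpy as np

# Toy data: the last column is an "empty" feature (all values missing).
X = np.array([
    [1.0, 10.0, np.nan],
    [2.0, np.nan, np.nan],
    [np.nan, 30.0, np.nan],
])
mask_missing_values = np.isnan(X)

# A feature is empty when every entry in its column is missing.
is_empty_feature = np.all(mask_missing_values, axis=0)
print(is_empty_feature)  # [False False  True]
```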
A couple of additional small changes. But otherwise it looks good.
Last nitpick, otherwise LGTM. We will need a second reviewer.
LGTM. Thanks @arifqodari
Let's merge and hopefully we won't forget to clean it up when we fix the inconsistency in the SimpleImputer directly 😄
FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True (scikit-learn#29779)
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Reference Issues
Fixes #29375
Empty feature = a feature that consists exclusively of missing values.
What does this implement/fix? Explain your changes.
This PR fixes two cases mentioned in issue #29375:

1. In `IterativeImputer`, when `keep_empty_features` is set to `True`, the iterative imputation step is skipped and the result is constructed solely from the initial imputation (`SimpleImputer`). This occurs because `mask_missing_values` is set to `True` for all non-empty feature columns. As a result, the outcomes differ depending on whether `keep_empty_features` is set to `False` or `True`. The proposed solution is to remove this step and retain the original `mask_missing_values`.
2. A `ValueError: Found array with 0 sample(s) ...` exception is raised because the estimator attempts to fit data with 0 rows. The proposed solution is to use the imputed values from the initial imputation result when the estimator encounters empty features; these values are constant across all rows.

Any other comments?
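As an illustration of the consistency described in case 1, here is a hedged sketch (not code from this PR, and it assumes a scikit-learn version that includes the fix): with the patch, the non-empty columns should be imputed the same way whether the empty column is kept or dropped. The data is made up, and the close-match assertion assumes the constant-filled empty column does not influence the regressor.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The last column is empty (all missing values).
X = np.array([
    [1.0, 10.0, np.nan],
    [2.0, np.nan, np.nan],
    [np.nan, 30.0, np.nan],
    [4.0, 40.0, np.nan],
])

X_keep = IterativeImputer(keep_empty_features=True, random_state=0).fit_transform(X)
X_drop = IterativeImputer(keep_empty_features=False, random_state=0).fit_transform(X)

# Empty column kept as a constant vs. dropped entirely.
print(X_keep.shape, X_drop.shape)  # (4, 3) (4, 2)

# With the fix, the imputation of the non-empty columns should match
# (up to numerical tolerance).
np.testing.assert_allclose(X_keep[:, :2], X_drop)
```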