Skip to content

FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True #29779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

arifqodari
Copy link
Contributor

Reference Issues

Fixes #29375

Empty Feature = Feature that consist exclusively of missing values.

What does this implement/fix? Explain your changes.

This PR fixes 2 cases mentioned on this issue #29375:

  1. In the IterativeImputer, when keep_empty_features is set to True, the iterative imputation step is skipped, and the result is constructed solely from the initial imputation (SimpleImputer). This occurs because mask_missing_values is set to True for all non-empty feature columns. As a result, the outcomes differ depending on whether keep_empty_features is set to False or True. The proposed solution is to remove this step and retain the original mask_missing_values.
  2. When running the iterative imputation step with empty features, a ValueError: Found array with 0 sample(s) ... exception is raised because the estimator attempts to fit data with 0 rows. The proposed solution is to use the imputed values from the initial imputation result when the estimator encounters empty features. These values are constant across all rows.

Any other comments?

Copy link

github-actions bot commented Sep 3, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a573512. Link to the linter CI: here

@glemaitre glemaitre self-requested a review September 5, 2024 14:29
@glemaitre
Copy link
Member

We will need some non-regression tests to check that we get the expected results. On the top of the head, we should test:

  • A matrix without a column full of np.nan and check that keep_empty_feature=True/False lead to the same results. We can strengthen the current test in this regard.
  • A matrix containing a entire np.nan column and check that other columns remain the same and check that get a zero column if keep_empty_feature=True otherwise, it should be drop.

So from the code, I think that the diff is not what I would expect. I would only expect a change in the initial imputation with something like:

        if not self.keep_empty_features:
            # drop empty features
            valid_mask = np.flatnonzero(
                np.logical_not(np.isnan(self.initial_imputer_.statistics_))
            )
            Xt = X[:, valid_mask]
            mask_missing_values = mask_missing_values[:, valid_mask]
        else:
            # mark empty features as not missing and keep the original
            # imputation
            is_empty_feature = np.all(mask_missing_values, axis=0)
            mask_missing_values[:, is_empty_feature] = False
            Xt = X
            Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]

Basically, keep_empty_feature=False remains the same behaviour as before. For the case of keep_empty_feature, we need to detect which column was full of np.nan. We can do that since we already computed the mask and we cannot use the initial_imputer_ since it does show this information (it imputes the values with 0). Once we have the info, we can set the empty features as not missing and fill the X matrix with the filled value that is zero.

@glemaitre glemaitre removed their request for review September 5, 2024 15:22
@glemaitre
Copy link
Member

In addition to the tests and the right fix, we need an entry in the changelog:

Please add an entry to the change log at doc/whats_new/v1.5.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@glemaitre glemaitre changed the title FIX Iterative Imputation Skipped when keep_empty_features is Set to True FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True Sep 5, 2024
@arifqodari
Copy link
Contributor Author

We will need some non-regression tests to check that we get the expected results. On the top of the head, we should test:

  • A matrix without a column full of np.nan and check that keep_empty_feature=True/False lead to the same results. We can strengthen the current test in this regard.

  • A matrix containing a entire np.nan column and check that other columns remain the same and check that get a zero column if keep_empty_feature=True otherwise, it should be drop.

Thanks for the feedback @glemaitre . I have added two more tests covering various initial_strategy as well.

So from the code, I think that the diff is not what I would expect. I would only expect a change in the initial imputation with something like:

        if not self.keep_empty_features:
            # drop empty features
            valid_mask = np.flatnonzero(
                np.logical_not(np.isnan(self.initial_imputer_.statistics_))
            )
            Xt = X[:, valid_mask]
            mask_missing_values = mask_missing_values[:, valid_mask]
        else:
            # mark empty features as not missing and keep the original
            # imputation
            is_empty_feature = np.all(mask_missing_values, axis=0)
            mask_missing_values[:, is_empty_feature] = False
            Xt = X
            Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]

Basically, keep_empty_feature=False remains the same behaviour as before. For the case of keep_empty_feature, we need to detect which column was full of np.nan. We can do that since we already computed the mask and we cannot use the initial_imputer_ since it does show this information (it imputes the values with 0). Once we have the info, we can set the empty features as not missing and fill the X matrix with the filled value that is zero.

I have updated the changes as suggested and apparently there is one case that is not covered. It is when the initial_strategy="constant", the iterative process estimator attempts to fit data with all missing rows. So, I put another if condition if it is in_fit it will mark the empty features as not missing and use results from initial imputation. Let me know your thoughts @glemaitre .

Additionally, I have added one new entry to the 1.5.2 changelogs.

@glemaitre glemaitre self-requested a review September 9, 2024 21:28
@glemaitre
Copy link
Member

I have updated the changes as suggested and apparently there is one case that is not covered. It is when the initial_strategy="constant", the iterative process estimator attempts to fit data with all missing rows. So, I put another if condition if it is in_fit it will mark the empty features as not missing and use results from initial imputation.

I don't think that the in_fit is the right fix here. I think that we are stroke by the following inconsistency: #29827

It means that the line:

        valid_mask = np.flatnonzero(
            np.logical_not(np.isnan(self.initial_imputer_.statistics_))
        )

is not doing what we think: for most imputation, we expect to have np.nan statistics for empty feature. However, due to the inconsistency discussed in the issue, we get a value that is fill_value. So I think that the right fix here would be to use:

is_empty_feature = np.all(mask_missing_values, axis=0)

instead ~is_empty_feature should be equivalent to valid_mask.

@glemaitre glemaitre requested review from glemaitre and removed request for glemaitre September 11, 2024 07:34
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of additional small changes. But otherwise it looks good.

@glemaitre
Copy link
Member

Last nitpick otherwise LGTM. We will need a second reviewer.
@adrinjalali do you want to have a look at this one.

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @arifqodari

@jeremiedbb
Copy link
Member

Let's merge and hopefully we won't forget to clean it up when we fix the inconsistency in the SimpleImputer directly 😄

@jeremiedbb jeremiedbb merged commit eec6ef0 into scikit-learn:main Oct 10, 2024
30 checks passed
BenJourdan pushed a commit to gregoryschwartzman/scikit-learn that referenced this pull request Oct 11, 2024
…eep_empty_features=True (scikit-learn#29779)

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IterativeImputer skip iterative part if keep_empty_features is set to True
3 participants