FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True #29779

arifqodari · 2024-09-03T14:59:15Z

Reference Issues

Fixes #29375

Empty feature: a feature that consists exclusively of missing values.

What does this implement/fix? Explain your changes.

This PR fixes the two cases described in issue #29375:

1. In IterativeImputer, when keep_empty_features is set to True, the iterative imputation step is skipped and the result is built solely from the initial imputation (SimpleImputer). This happens because mask_missing_values is set to True for all non-empty feature columns. As a result, the output differs depending on whether keep_empty_features is False or True. The proposed solution is to remove this step and retain the original mask_missing_values.
2. When the iterative imputation step runs with empty features, a ValueError: Found array with 0 sample(s) ... exception is raised because the estimator attempts to fit data with 0 rows. The proposed solution is to reuse the values from the initial imputation whenever the estimator encounters an empty feature. These values are constant across all rows.
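
The second case can be illustrated with plain NumPy. This is a minimal sketch, not the IterativeImputer internals: the toy matrix, the zero fill, and every name other than mask_missing_values are made up for illustration. It shows that an empty feature leaves zero observed rows to fit on, so reusing the constant initial fill is the natural fallback.

```python
import numpy as np

# Toy matrix (illustrative): column 1 is an empty feature, all values missing.
X = np.array(
    [
        [1.0, np.nan, 3.0],
        [4.0, np.nan, np.nan],
        [7.0, np.nan, 9.0],
    ]
)
mask_missing_values = np.isnan(X)

feat_idx = 1
# Rows where the target feature is observed are the only usable training rows.
observed_rows = ~mask_missing_values[:, feat_idx]
n_train_samples = int(observed_rows.sum())
print(n_train_samples)

# Assumed fallback: reuse the initial imputation, which is a constant column
# (0.0 here stands in for whatever the initial imputer produced).
X_filled = np.where(mask_missing_values, 0.0, X)
fallback_column = X_filled[:, feat_idx]
print(fallback_column)
```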

Any other comments?

github-actions · 2024-09-03T15:00:38Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a573512.

glemaitre · 2024-09-05T15:22:26Z

We will need some non-regression tests to check that we get the expected results. Off the top of my head, we should test:

- A matrix without a column full of np.nan, checking that keep_empty_features=True/False lead to the same results. We can strengthen the current test in this regard.
- A matrix containing an entire np.nan column, checking that the other columns remain the same and that we get a zero column if keep_empty_features=True; otherwise, the column should be dropped.
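
A rough sketch of the checks for the second case, written against plain arrays rather than a fitted imputer. The helper name check_empty_feature_outputs and the sample arrays are hypothetical; in a real test the two inputs would come from fit_transform with keep_empty_features=True and keep_empty_features=False.

```python
import numpy as np


def check_empty_feature_outputs(X_keep, X_drop, empty_idx):
    """Compare outputs obtained with and without keeping empty features.

    X_keep: output when empty features are kept.
    X_drop: output when empty features are dropped.
    empty_idx: column indices of the empty features in the input.
    """
    n_features = X_keep.shape[1]
    non_empty = np.setdiff1d(np.arange(n_features), empty_idx)
    # Empty columns are dropped in one output and kept in the other.
    assert X_drop.shape[1] == non_empty.size
    # The non-empty columns must be imputed identically in both cases.
    np.testing.assert_allclose(X_keep[:, non_empty], X_drop)
    # The kept empty columns hold the constant initial fill (zero here).
    np.testing.assert_allclose(X_keep[:, empty_idx], 0.0)
    return True


# Stand-in arrays for the two outputs (made-up values).
X_drop = np.array([[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]])
X_keep = np.insert(X_drop, 1, 0.0, axis=1)  # zero column at index 1
result = check_empty_feature_outputs(X_keep, X_drop, empty_idx=[1])
```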

So from the code, I think that the diff is not what I would expect. I would only expect a change in the initial imputation with something like:

        if not self.keep_empty_features:
            # drop empty features
            valid_mask = np.flatnonzero(
                np.logical_not(np.isnan(self.initial_imputer_.statistics_))
            )
            Xt = X[:, valid_mask]
            mask_missing_values = mask_missing_values[:, valid_mask]
        else:
            # mark empty features as not missing and keep the original
            # imputation
            is_empty_feature = np.all(mask_missing_values, axis=0)
            mask_missing_values[:, is_empty_feature] = False
            Xt = X
            Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]

Basically, keep_empty_features=False keeps the same behaviour as before. For keep_empty_features=True, we need to detect which columns were full of np.nan. We can do that because we have already computed the mask; we cannot use initial_imputer_ for this since it does not expose this information (it imputes the values with 0). Once we have that information, we can mark the empty features as not missing and fill the X matrix with the filled value, which is zero.
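
To make the suggestion concrete, here is a self-contained sketch of the same branch logic outside the estimator. The function name split_empty_features, the defensive copy of X, and the toy inputs are assumptions for illustration; statistics stands in for initial_imputer_.statistics_ and X_filled for the output of the initial imputation.

```python
import numpy as np


def split_empty_features(X, X_filled, statistics, keep_empty_features):
    """Return (Xt, mask_missing_values) following the suggested branches."""
    X = X.copy()  # avoid mutating the caller's array in this sketch
    mask_missing_values = np.isnan(X)
    if not keep_empty_features:
        # drop empty features
        valid_mask = np.flatnonzero(np.logical_not(np.isnan(statistics)))
        Xt = X[:, valid_mask]
        mask_missing_values = mask_missing_values[:, valid_mask]
    else:
        # mark empty features as not missing and keep the initial imputation
        is_empty_feature = np.all(mask_missing_values, axis=0)
        mask_missing_values[:, is_empty_feature] = False
        Xt = X
        Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]
    return Xt, mask_missing_values


X = np.array([[1.0, np.nan, 3.0], [4.0, np.nan, np.nan], [7.0, np.nan, 9.0]])
# Hand-written stand-ins: NaN statistic and zero fill for the empty column.
statistics = np.array([4.0, np.nan, 6.0])
X_filled = np.array([[1.0, 0.0, 3.0], [4.0, 0.0, 6.0], [7.0, 0.0, 9.0]])

Xt_keep, mask_keep = split_empty_features(X, X_filled, statistics, True)
Xt_drop, mask_drop = split_empty_features(X, X_filled, statistics, False)
```

In the kept case only the entry at row 1, column 2 stays flagged as missing, so an iterative step would still have something to impute.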

glemaitre · 2024-09-05T15:29:17Z

In addition to the tests and the right fix, we need an entry in the changelog:

Please add an entry to the change log at doc/whats_new/v1.5.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

arifqodari · 2024-09-06T08:52:00Z

> We will need some non-regression tests to check that we get the expected results. Off the top of my head, we should test:
>
> - A matrix without a column full of np.nan, checking that keep_empty_features=True/False lead to the same results. We can strengthen the current test in this regard.
> - A matrix containing an entire np.nan column, checking that the other columns remain the same and that we get a zero column if keep_empty_features=True; otherwise, the column should be dropped.

Thanks for the feedback @glemaitre. I have added two more tests that also cover various initial_strategy values.

> So from the code, I think that the diff is not what I would expect. I would only expect a change in the initial imputation with something like:
>
>         if not self.keep_empty_features:
>             # drop empty features
>             valid_mask = np.flatnonzero(
>                 np.logical_not(np.isnan(self.initial_imputer_.statistics_))
>             )
>             Xt = X[:, valid_mask]
>             mask_missing_values = mask_missing_values[:, valid_mask]
>         else:
>             # mark empty features as not missing and keep the original
>             # imputation
>             is_empty_feature = np.all(mask_missing_values, axis=0)
>             mask_missing_values[:, is_empty_feature] = False
>             Xt = X
>             Xt[:, is_empty_feature] = X_filled[:, is_empty_feature]
>
> Basically, keep_empty_features=False keeps the same behaviour as before. For keep_empty_features=True, we need to detect which columns were full of np.nan. We can do that because we have already computed the mask; we cannot use initial_imputer_ for this since it does not expose this information (it imputes the values with 0). Once we have that information, we can mark the empty features as not missing and fill the X matrix with the filled value, which is zero.

I have updated the changes as suggested, but apparently one case is still not covered: when initial_strategy="constant", the estimator in the iterative process attempts to fit data in which all rows are missing. So I added another condition: when in_fit is set, the empty features are marked as not missing and the results from the initial imputation are used. Let me know your thoughts @glemaitre.

Additionally, I have added one new entry to the 1.5.2 changelog.

doc/whats_new/v1.5.rst

glemaitre · 2024-09-10T15:07:21Z

> I have updated the changes as suggested and apparently there is one case that is not covered. It is when the initial_strategy="constant", the iterative process estimator attempts to fit data with all missing rows. So, I put another if condition if it is in_fit it will mark the empty features as not missing and use results from initial imputation.

I don't think that in_fit is the right fix here. I think that we are hit by the following inconsistency: #29827

It means that the line:

        valid_mask = np.flatnonzero(
            np.logical_not(np.isnan(self.initial_imputer_.statistics_))
        )

is not doing what we think: for most imputation strategies, we expect np.nan statistics for empty features. However, due to the inconsistency discussed in the issue, we get a value equal to fill_value. So I think that the right fix here would be to use:

        is_empty_feature = np.all(mask_missing_values, axis=0)

instead; ~is_empty_feature should be equivalent to valid_mask.
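
The difference between the two detection strategies can be illustrated without fitting anything. In this sketch the two statistics arrays are hand-written stand-ins: one with NaN for the empty column, as expected for most strategies, and one where the empty column reports the fill value, as described in #29827. Neither is read from a real SimpleImputer, and the helper valid_from_statistics is a hypothetical name.

```python
import numpy as np

X = np.array([[1.0, np.nan, 3.0], [4.0, np.nan, np.nan], [7.0, np.nan, 9.0]])
mask_missing_values = np.isnan(X)

# Hand-written stand-ins for initial_imputer_.statistics_ (assumed values).
statistics_with_nan = np.array([4.0, np.nan, 6.0])
statistics_with_fill = np.array([0.0, 0.0, 0.0])  # empty column reports fill_value


def valid_from_statistics(statistics):
    """Indices of columns whose statistic is not NaN."""
    return np.flatnonzero(np.logical_not(np.isnan(statistics)))


# Statistics-based detection depends on the empty column being NaN.
valid_nan = valid_from_statistics(statistics_with_nan)
valid_fill = valid_from_statistics(statistics_with_fill)

# Mask-based detection only looks at the data.
is_empty_feature = np.all(mask_missing_values, axis=0)
valid_from_mask = np.flatnonzero(~is_empty_feature)

print(valid_nan, valid_fill, valid_from_mask)
```

With the fill-value statistics, the empty column is wrongly reported as valid, while the mask-based result matches the NaN-based one.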

sklearn/impute/_iterative.py

sklearn/impute/tests/test_impute.py

glemaitre

A couple of additional small changes. But otherwise it looks good.

sklearn/impute/tests/test_impute.py

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

sklearn/impute/tests/test_impute.py

glemaitre · 2024-09-17T16:45:47Z

Last nitpick, otherwise LGTM. We will need a second reviewer.
@adrinjalali do you want to have a look at this one?

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

jeremiedbb

LGTM. Thanks @arifqodari

jeremiedbb · 2024-10-10T13:45:16Z

Let's merge and hopefully we won't forget to clean it up when we fix the inconsistency in the SimpleImputer directly 😄

Fix Iterative Imputation not Triggered when Keep Empty Features

e38593a

github-actions bot added the module:impute label Sep 3, 2024

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

43d61eb

glemaitre self-requested a review September 5, 2024 14:29

glemaitre removed their request for review September 5, 2024 15:22

glemaitre changed the title ~~FIX Iterative Imputation Skipped when keep_empty_features is Set to True~~ FIX make sure IterativeImputer does not skip iterative process when keep_empty_features=True Sep 5, 2024

arifqodari added 8 commits September 6, 2024 00:07

Added Two More Tests for IterativeImputer

d5eddf6

Fix Applied Only in the Initial Imputation Step

1f49868

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

e01bd1c

Added Changelog

47800e0

Remove Trailing Whitespace and Unused Lines

04c94f2

Added More Tests with Various Initial Strategis

b37cbdc

Fix Error in IterativeImputer When Initial Strategy Constant

498a346

Fix Lint Issues

33dcfbe

arifqodari added 3 commits September 9, 2024 11:43

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

59ca7ca

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

d0fc4c3

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

05da983

glemaitre reviewed Sep 9, 2024

doc/whats_new/v1.5.rst (outdated)

glemaitre self-requested a review September 9, 2024 21:28

Move Changelog to 1.6

4ff968b

glemaitre reviewed Sep 10, 2024

sklearn/impute/_iterative.py (outdated)

glemaitre reviewed Sep 10, 2024

arifqodari added 3 commits September 11, 2024 07:49

Better Variable Naming in Tests

8a71a26

Use Simple Handcrafted Matrix Instread of Generated Randomly

af88e79

Black Formatting

6d7e858

glemaitre requested review from glemaitre and removed request for glemaitre September 11, 2024 07:34

arifqodari added 4 commits September 12, 2024 08:48

Simplify Test Iterative Imputer with Empty Features

31cf853

Fix Missing Value Masking

3d04df9

Remove Redundant Test

0e1f067

Add Test Data along with Few More Tests

a3bbbda

glemaitre reviewed Sep 12, 2024

sklearn/impute/tests/test_impute.py (outdated, 3 comments)

arifqodari and others added 4 commits September 13, 2024 06:56

Update docstring for Test without Empty Feature

daf1adc

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

Update docstring for Test with Empty Feature

291edcf

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

Parametrize on X_test for test_iterative_imputer_with_empty_features

004df0c

Merge branch 'main' into fix_iterative_imputer_keep_empty_features

63048e4

arifqodari requested a review from glemaitre September 13, 2024 01:51

glemaitre approved these changes Sep 17, 2024

sklearn/impute/tests/test_impute.py (outdated)

Fix Typo in Docstring

a573512

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

jeremiedbb approved these changes Oct 10, 2024

jeremiedbb merged commit eec6ef0 into scikit-learn:main Oct 10, 2024
30 checks passed
