ENH avoid checking columns where training data is all nan in KNNImputer #29060

xuefeng-xu · 2024-05-21T06:21:15Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In KNNImputer, columns whose training data is all NaN are either removed or imputed with 0.
Therefore, we only need to check for missing values in the valid columns, selected with valid_mask.
This avoids computing pairwise distances when the data restricted to the valid columns has no missing values.
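
To make the idea concrete, here is a minimal NumPy sketch of the check described above. It is not the scikit-learn implementation: the helper name `needs_neighbor_search` and the toy arrays are hypothetical, and only `valid_mask` is a name taken from this PR.

```python
import numpy as np


def needs_neighbor_search(X_fit, X_new):
    """Return True if X_new has missing values in columns that can be imputed.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    # Columns with at least one observed value in the training data.
    valid_mask = ~np.all(np.isnan(X_fit), axis=0)
    # NaNs in all-NaN training columns are ignored here: those columns are
    # dropped or zero-filled without consulting any neighbors.
    return bool(np.isnan(X_new[:, valid_mask]).any())


X_fit = np.array([[np.nan, 1.0, 2.0],
                  [np.nan, 3.0, 4.0]])
X_new = np.array([[np.nan, 5.0, 6.0]])  # missing only in the all-NaN column

print(needs_neighbor_search(X_fit, X_new))  # False: distances can be skipped
print(bool(np.isnan(X_new).any()))          # True: a check on all columns would not skip
```

Checking only the valid columns lets the transform return early in the second situation, which is where the saving comes from.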

Any other comments?

This could potentially save some memory.

github-actions · 2024-05-21T06:22:31Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3db8f2e.

xuefeng-xu · 2024-06-03T02:39:02Z

This script shows the efficiency improvement when one column of the data is all NaN.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

# Randomly mark roughly 1 in `density` entries as missing.
rng = np.random.RandomState(42)
density = 10
mask = rng.randint(density, size=X.shape) == 0

# Use DataFrame.mask / .iloc instead of writing through .values, which may be
# a copy (or read-only) depending on the pandas version.
X_na = X.mask(mask)

X_na.iloc[:, 0] = np.nan  # one column is all nan

This PR

%timeit KNNImputer().fit_transform(X_na)
6.72 s ± 59.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Main

%timeit KNNImputer().fit_transform(X_na)
9.24 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

adrinjalali · 2024-08-13T11:11:05Z

maybe @adam2392 or @OmarManzoor could have a look?

OmarManzoor

LGTM. Thanks @xuefeng-xu


ENH avoid checking columns where training data is all nan in KNNImputer

fe12132

github-actions bot added the module:impute label May 21, 2024

row_missing_idx also remove invalid column mask

4f6bed6

xuefeng-xu mentioned this pull request Jun 7, 2024

FIX exclude samples with nan distance in KNNImputer for uniform weights #29135

Merged

adrinjalali approved these changes Aug 13, 2024


adrinjalali added the Quick Review and Waiting for Second Reviewer labels Aug 13, 2024

Merge branch 'main' into valid_mask_knnimpute

3db8f2e

OmarManzoor approved these changes Aug 15, 2024


OmarManzoor merged commit 4e44ede into scikit-learn:main Aug 15, 2024
30 checks passed

xuefeng-xu deleted the valid_mask_knnimpute branch August 20, 2024 15:43

MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024

ENH avoid checking columns where training data is all nan in KNNImput…

2bd05bf

…er (scikit-learn#29060)
