ENH avoid checking columns where training data is all nan in KNNImputer #29060

Merged
merged 3 commits into from
Aug 15, 2024


xuefeng-xu
Contributor

@xuefeng-xu xuefeng-xu commented May 21, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In KNNImputer, columns whose training data is all NaN are either removed or imputed with 0.
Therefore, we only need to check for missing values in the valid columns, selected with valid_mask.
This avoids computing pairwise distances when the data restricted to the valid columns has no missing values.
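
To illustrate the idea outside of scikit-learn, here is a minimal NumPy sketch. The helper `nan_checks` is hypothetical and is not the library's code; it only assumes that `valid_mask` marks the columns with at least one observed value in the training data, and compares a check over all columns with a check restricted to the valid ones.

```python
import numpy as np


def nan_checks(X_fit, X_new):
    """Hypothetical helper: compare the two ways of testing for missing values.

    Returns (check_all, check_valid), where each flag says whether the
    neighbor search would be triggered under that check.
    """
    # Columns that had at least one observed value in the training data.
    valid_mask = ~np.all(np.isnan(X_fit), axis=0)
    # Missing-value mask of the data to transform.
    mask = np.isnan(X_new)
    # Check over every column: any NaN at all triggers the distance computation.
    check_all = bool(np.any(mask))
    # Check over valid columns only: NaNs in all-NaN training columns are
    # ignored, since those columns are dropped or zero-filled anyway.
    check_valid = bool(np.any(mask[:, valid_mask]))
    return check_all, check_valid


X_fit = np.array([[np.nan, 1.0, 2.0],
                  [np.nan, 3.0, 4.0],
                  [np.nan, 5.0, 6.0]])
X_new = np.array([[np.nan, 7.0, 8.0]])

print(nan_checks(X_fit, X_new))
```

In this example the only missing value in `X_new` sits in the column that was all NaN during fitting, so the check restricted to valid columns finds nothing to impute and the pairwise distance step could be skipped.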

Any other comments?

This could potentially save some memory.

github-actions bot commented May 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3db8f2e. Link to the linter CI: here

@xuefeng-xu
Contributor Author

This script shows the efficiency improvement when one column of the data is all NaN.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
import pandas as pd

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 10
mask = rng.randint(density, size=X.shape) == 0

# Mask through pandas instead of writing into X_na.values,
# which is not guaranteed to be a writable view of the frame
X_na = X.mask(mask)

X_na.iloc[:, 0] = np.nan # one column is all nan

This PR

%timeit KNNImputer().fit_transform(X_na)
6.72 s ± 59.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Main

%timeit KNNImputer().fit_transform(X_na)
9.24 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
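
For reference, the two means above work out to roughly a 1.4x speedup, or about 27% less runtime. A small sketch of that arithmetic, using only the numbers reported in this comment (the variable names are illustrative):

```python
# Mean timings reported above, in seconds per loop.
t_pr = 6.72    # this PR
t_main = 9.24  # main

speedup = t_main / t_pr        # how many times faster the PR branch runs
reduction = 1 - t_pr / t_main  # fraction of runtime saved

print(f"speedup: {speedup:.3f}x, runtime reduction: {reduction:.1%}")
```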

@adrinjalali added the Quick Review and Waiting for Second Reviewer labels Aug 13, 2024
@adrinjalali
Member

maybe @adam2392 or @OmarManzoor could have a look?

Contributor

@OmarManzoor OmarManzoor left a comment

LGTM. Thanks @xuefeng-xu

@OmarManzoor OmarManzoor merged commit 4e44ede into scikit-learn:main Aug 15, 2024
30 checks passed
@xuefeng-xu xuefeng-xu deleted the valid_mask_knnimpute branch August 20, 2024 15:43
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
Labels
module:impute, Quick Review, Waiting for Second Reviewer
3 participants