Skip to content

FIX exclude samples with nan distance in KNNImputer for uniform weights #29135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 6, 2024

Conversation

xuefeng-xu
Copy link
Contributor

Reference Issues/PRs

Closes #29079

What does this implement/fix? Explain your changes.

For 'distance' weights, KNNImputer exclude samples with nan distance.
This PR perform similar behavior for 'uniform' weights.

Any other comments?

Copy link

github-actions bot commented May 30, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9215e67. Link to the linter CI: here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please do a benchmark and see how this affects the performance? Since before this PR np.ma.average was getting weights=None but now it's always an array.

@xuefeng-xu
Copy link
Contributor Author

I did a benchmark as follows, the time looks similar.

Ratio of nans This PR Main
0% 1.09 ms ± 25.3 µs 1.11 ms ± 40.1 µs
10% 9.34 s ± 129 ms 9.44 s ± 119 ms
20% 16.7 s ± 212 ms 16.5 s ± 275 ms
30% 23.1 s ± 227 ms 22.9 s ± 115 ms
40% 29.1 s ± 457 ms 29.2 s ± 458 ms
50% 31.4 s ± 121 ms 32.1 s ± 286 ms
60% 33.3 s ± 321 ms 33.6 s ± 392 ms
70% 30.8 s ± 190 ms 31.1 s ± 250 ms
80% 24.8 s ± 287 ms 24.5 s ± 228 ms
90% 16.8 s ± 222 ms 15.8 s ± 117 ms
100% 7.5 s ± 186 ms 7.19 s ± 72.1 ms
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
import pandas as pd

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
randint = rng.randint(10, size=X.shape)

density = 1 # ratio of nans 1/10
mask = randint < density

X_na = X.copy()
X_na.values[mask] = np.nan

%timeit KNNImputer().fit_transform(X_na)

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a changelog, otherwise LGTM.

@OmarManzoor would you wanna have a look?

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @xuefeng-xu. Generally looks good just a few comments.

knn = KNNImputer(missing_values=na, n_neighbors=2, weights=weights)
knn.fit(X_train)
assert_allclose(knn.transform(X_test), X_test_expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add another scenario where we have like 3 or 4 features, 3 or 4 rows in X_train and more 2 or 3 rows in X_test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added another test scenario.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @xuefeng-xu

@OmarManzoor OmarManzoor enabled auto-merge (squash) June 6, 2024 15:24
@OmarManzoor OmarManzoor merged commit 8d0b243 into scikit-learn:main Jun 6, 2024
28 checks passed
@xuefeng-xu xuefeng-xu deleted the nan_dist_knnimpute branch June 6, 2024 16:18
@xuefeng-xu
Copy link
Contributor Author

There is another minor issue in KNNImputer. If the dataset is all nan, it still takes some time to run (around 7 seconds, as shown in the benchmark) because the current code still compute the pairwise distances.

PR #29060 tackles this issue, because columns where training data is all nan can be excluded.

@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Samples with nan distance are included in the computation of mean in KNNImputer for uniform weights
3 participants