FIX exclude samples with nan distance in KNNImputer for uniform weights #29135
Conversation
Could you please do a benchmark and see how this affects the performance? Before this PR, np.ma.average was getting weights=None, but now it always receives an array.
I did a benchmark as follows; the timing looks similar.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
randint = rng.randint(10, size=X.shape)
density = 1  # fraction of nans = density / 10
mask = randint < density
X_na = X.copy()
X_na.values[mask] = np.nan

%timeit KNNImputer().fit_transform(X_na)
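A more targeted micro-benchmark of the np.ma.average call itself could look like the sketch below; the array shape, missing rate, and all-ones weights are illustrative assumptions, not the exact arrays KNNImputer builds internally.

import timeit
import numpy as np

rng = np.random.RandomState(0)
donors = rng.rand(1000, 5)
donors[rng.rand(1000, 5) < 0.1] = np.nan
donors = np.ma.masked_invalid(donors)  # mask the nan entries
weights = np.ones(donors.shape)        # explicit uniform weights, as after this PR

t_none = timeit.timeit(lambda: np.ma.average(donors, axis=1, weights=None), number=1000)
t_array = timeit.timeit(lambda: np.ma.average(donors, axis=1, weights=weights), number=1000)
print(f"weights=None:  {t_none:.4f}s")
print(f"weights=array: {t_array:.4f}s")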
This also needs a changelog, otherwise LGTM.
@OmarManzoor would you wanna have a look?
Thanks for the PR @xuefeng-xu. Generally looks good, just a few comments.
knn = KNNImputer(missing_values=na, n_neighbors=2, weights=weights)
knn.fit(X_train)
assert_allclose(knn.transform(X_test), X_test_expected)
Would it be possible to add another scenario where we have 3 or 4 features, 3 or 4 rows in X_train, and 2 or 3 more rows in X_test?
Sure, I added another test scenario.
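For reference, a hypothetical sketch of what such a scenario could look like (this is not necessarily the test that was added; the values, the na alias, and the expected output are illustrative and assume that, with this PR, a donor with a nan distance is excluded from the uniform average):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.impute import KNNImputer

na = np.nan

# Train row 2 is the only other donor for feature 0, but it shares no
# observed feature with either test row, so its distance to them is nan.
X_train = np.array([
    [1.0, 1.0, 1.0],
    [na, 2.0, 2.0],
    [9.0, na, na],
    [na, 4.0, 4.0],
])
X_test = np.array([
    [na, 1.0, 1.0],
    [na, 3.0, 3.0],
])

# With nan-distance donors excluded, only train row 0 contributes to the
# imputed feature 0 of both test rows.
X_test_expected = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 3.0, 3.0],
])

knn = KNNImputer(missing_values=na, n_neighbors=2, weights="uniform")
knn.fit(X_train)
assert_allclose(knn.transform(X_test), X_test_expected)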
LGTM. Thanks @xuefeng-xu
There is another minor issue, which PR #29060 tackles: columns where the training data is all nan can be excluded.
Reference Issues/PRs
Closes #29079
What does this implement/fix? Explain your changes.
For 'distance' weights, KNNImputer excludes samples with a nan distance. This PR implements similar behavior for 'uniform' weights.
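A minimal sketch of the behavior change, with made-up values: the second training row can donate a value for feature 0 but shares no observed feature with the test row, so its distance is nan; with this PR it should no longer be averaged in under 'uniform' weights.

import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([
    [2.0, 1.0],     # finite distance to the test row
    [9.0, np.nan],  # donor for feature 0, but nan distance to the test row
])
X_test = np.array([[np.nan, 1.0]])

imputer = KNNImputer(n_neighbors=2, weights="uniform").fit(X_train)
print(imputer.transform(X_test))
# Before this PR: [[5.5, 1.0]] (the nan-distance donor is averaged in).
# With this PR:   [[2.0, 1.0]] (only the finite-distance donor is used).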
Any other comments?