FIX exclude samples with nan distance in KNNImputer for uniform weights #29135
Conversation
Could you please do a benchmark and see how this affects the performance? Before this PR, np.ma.average was getting weights=None, but now it always receives an array.
I did a benchmark as follows; the timing looks similar.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
randint = rng.randint(10, size=X.shape)
density = 1  # fraction of nans = density / 10
mask = randint < density
X_na = X.copy()
X_na.values[mask] = np.nan

%timeit KNNImputer().fit_transform(X_na)
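A more targeted micro-benchmark of the np.ma.average call itself could look like the sketch below; the array shape, missing rate, and all-ones weights are illustrative assumptions, not the exact arrays KNNImputer builds internally.

import timeit
import numpy as np

rng = np.random.RandomState(0)
donors = rng.rand(1000, 5)
donors[rng.rand(1000, 5) < 0.1] = np.nan
donors = np.ma.masked_invalid(donors)  # mask the nan entries
weights = np.ones(donors.shape)        # explicit uniform weights, as after this PR

t_none = timeit.timeit(lambda: np.ma.average(donors, axis=1, weights=None), number=1000)
t_array = timeit.timeit(lambda: np.ma.average(donors, axis=1, weights=weights), number=1000)
print(f"weights=None:  {t_none:.4f}s")
print(f"weights=array: {t_array:.4f}s")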
This also needs a changelog, otherwise LGTM.
@OmarManzoor would you wanna have a look?
Thanks for the PR @xuefeng-xu. Generally looks good, just a few comments.
knn = KNNImputer(missing_values=na, n_neighbors=2, weights=weights)
knn.fit(X_train)
assert_allclose(knn.transform(X_test), X_test_expected)
Would it be possible to add another scenario where we have 3 or 4 features, 3 or 4 rows in X_train, and 2 or 3 more rows in X_test?
Sure, I added another test scenario.
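For reference, a hypothetical sketch of what such a scenario could look like (this is not necessarily the test that was added; the values, the na alias, and the expected output are illustrative and assume that, with this PR, a donor with a nan distance is excluded from the uniform average):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.impute import KNNImputer

na = np.nan

# Train row 2 is the only other donor for feature 0, but it shares no
# observed feature with either test row, so its distance to them is nan.
X_train = np.array([
    [1.0, 1.0, 1.0],
    [na, 2.0, 2.0],
    [9.0, na, na],
    [na, 4.0, 4.0],
])
X_test = np.array([
    [na, 1.0, 1.0],
    [na, 3.0, 3.0],
])

# With nan-distance donors excluded, only train row 0 contributes to the
# imputed feature 0 of both test rows.
X_test_expected = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 3.0, 3.0],
])

knn = KNNImputer(missing_values=na, n_neighbors=2, weights="uniform")
knn.fit(X_train)
assert_allclose(knn.transform(X_test), X_test_expected)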
LGTM. Thanks @xuefeng-xu
There is another minor issue, which PR #29060 tackles: columns where the training data is all nan can be excluded.
Reference Issues/PRs
Closes #29079
What does this implement/fix? Explain your changes.
For 'distance' weights, KNNImputer excludes samples with a nan distance. This PR implements similar behavior for 'uniform' weights.
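A minimal sketch of the behavior change, with made-up values: the second training row can donate a value for feature 0 but shares no observed feature with the test row, so its distance is nan; with this PR it should no longer be averaged in under 'uniform' weights.

import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([
    [2.0, 1.0],     # finite distance to the test row
    [9.0, np.nan],  # donor for feature 0, but nan distance to the test row
])
X_test = np.array([[np.nan, 1.0]])

imputer = KNNImputer(n_neighbors=2, weights="uniform").fit(X_train)
print(imputer.transform(X_test))
# Before this PR: [[5.5, 1.0]] (the nan-distance donor is averaged in).
# With this PR:   [[2.0, 1.0]] (only the finite-distance donor is used).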
Any other comments?