Describe the issue linked to the documentation
Two examples for missing-values imputation use k-neighbors imputation without scaling the data first.
As a result, the approaches underperform.
The examples are:
In the first example, the effect is quite small: adding scaling before calling the k-neighbors imputer changes the MSE for the California dataset (k-NN) from 0.2987 ± 0.1469 to 0.2912 ± 0.1410, and for the Diabetes dataset from 3314 ± 114 to 3323 ± 90.
In the second example (comparing iterative imputation strategies), the change is more significant: without scaling, iterative imputation with k-neighbors performs worse than mean imputation; with scaling, it performs better than mean imputation.
In both cases, it is better practice to scale the data before using a k-neighbors approach, which is based on distances between points.
Suggest a potential alternative/fix
I will submit a patch to fix the issue.
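A minimal sketch of the kind of change such a patch could make, assuming the examples build their models with make_pipeline; the downstream estimator and parameter values below (RandomForestRegressor, the n_neighbors settings) are illustrative stand-ins, not the exact code in the examples:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# First example: scale before KNNImputer so that the nearest-neighbour search
# is not dominated by features with large ranges. StandardScaler ignores NaNs
# when computing its statistics, so it can safely run before imputation.
knn_impute_then_predict = make_pipeline(
    StandardScaler(),
    KNNImputer(n_neighbors=5),  # n_neighbors is illustrative
    RandomForestRegressor(random_state=0),
)

# Second example: the same idea for iterative imputation driven by a
# k-neighbours regressor. Here the scaler is placed in front of the imputer;
# nesting it inside the estimator pipeline would be an alternative.
iterative_knn_impute_then_predict = make_pipeline(
    StandardScaler(),
    IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=15)),
    RandomForestRegressor(random_state=0),
)
```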
5nizza changed the title from "Add scaling when using k-neighbours imputation" to "Examples (imputation): add scaling when using k-neighbours imputation" on Apr 14, 2025.
5nizza added a commit to 5nizza/scikit-learn that referenced this issue on Apr 14, 2025.
I agree that k-NN requires feature scaling in general.
ArturoAmorQ changed the title from "Examples (imputation): add scaling when using k-neighbours imputation" to "DOC Examples (imputation): add scaling when using k-neighbours imputation" on Apr 24, 2025.