
DOC Scale data before using k-neighbours regression #31201


Merged
9 commits merged on Jun 12, 2025

Conversation

Contributor

@5nizza 5nizza commented Apr 14, 2025

Fixes #31200, essentially by replacing
KNeighborsRegressor(......) with make_pipeline(StandardScaler(), KNeighborsRegressor(......))
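For concreteness, a minimal before/after sketch of that change (n_neighbors=15 here is only a placeholder, not the value used in the examples):

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Before: distances are computed on raw features, so features with
    # large ranges dominate the neighbour search.
    model = KNeighborsRegressor(n_neighbors=15)

    # After: standardize the features first so they all contribute
    # comparably to the distance computation.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15))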


github-actions bot commented Apr 14, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 15acf02. Link to the linter CI: here

Member

@ogrisel ogrisel left a comment


Thanks for the PR. Since we now use custom estimator names, let's not use class names anymore.

Also, could you please fix the linting problems (use the formatters) as instructed in the automated comment above?

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ArturoAmorQ changed the title from "scaling data before using k-neighbours regression (fixes #31200)" to "DOC Scale data before using k-neighbours regression (fixes #31200)" on Apr 24, 2025
Member

@ArturoAmorQ ArturoAmorQ left a comment


Thanks for the PR @5nizza. These suggestions should fix the failing CI, but I strongly recommend using pre-commit to format your files before committing:

conda install pre-commit
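After installing, the usual pre-commit workflow is (standard pre-commit commands, not from the original comment; scikit-learn ships the hook configuration in its repository):

    pre-commit install            # set up the git hooks once per clone
    pre-commit run --all-files    # format and lint everything in one go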

Member

@virchan virchan left a comment


Thank you for the PR, @5nizza!

I have one small nitpick. Otherwise, LGTM!

Co-authored-by: Virgil Chan <virchan.math@gmail.com>
Member

@ArturoAmorQ ArturoAmorQ left a comment


Even though I agree that kNN requires feature scaling in general, in this case the features of the diabetes dataset already have the same scale, so it's normal that we don't really see an improvement with your PR.

For the California housing dataset we do have different scales, but the features "Population" and "AveOccup" have large outliers, so maybe using a StandardScaler is not the best option. How about we try a RobustScaler?
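Roughly like this (a hypothetical sketch, not code from the example; RobustScaler centres each feature on its median and scales by the interquartile range, so the "Population" and "AveOccup" outliers do not inflate the scale):

    from sklearn.datasets import fetch_california_housing
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    X, y = fetch_california_housing(return_X_y=True)
    # Median/IQR scaling is robust to the heavy-tailed features.
    model = make_pipeline(RobustScaler(), KNeighborsRegressor())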

@betatim changed the title from "DOC Scale data before using k-neighbours regression (fixes #31200)" to "DOC Scale data before using k-neighbours regression" on May 20, 2025
Contributor Author

5nizza commented May 21, 2025

Thanks for the comments.

  1. I will use RobustScaler instead of StandardScaler in both Python examples.
  2. There is another problem: the iterative imputer in the example plot_missing_values.py uses estimator=BayesianRidge, which also needs features on a similar scale to perform well (the ridge L2 penalty on the weights is scale-sensitive), so I will change this as well and update the issue description. See the sketch after this list.
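
A minimal sketch of what point 2 amounts to (illustrative only; variable names and parameters are not taken from the example):

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import BayesianRidge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    # Scale first so the L2 penalty inside BayesianRidge treats all
    # features comparably during each imputation round.
    imputer = make_pipeline(
        RobustScaler(),
        IterativeImputer(estimator=BayesianRidge(), random_state=0),
    )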

Contributor Author

5nizza commented May 26, 2025

Here is the summary of the updates. In plot_missing_values.py:

  • rephrasing and typo fixes
  • the same order throughout the file: first process the diabetes data, then the California data
  • a single evaluation function for both the full and the imputed data (to remove code repetition)
  • removed the argument missing_values=np.nan in the calls to all imputers (it is the default)
  • changed the order: first compute scores for ImputeByZero, then for ImputeByMean
  • as suggested, RobustScaler instead of StandardScaler for the California dataset (and no scaling for the diabetes dataset)
  • for the California dataset, apply RobustScaler before imputing with BayesianRidge (the default estimator) to put the features on the same scale; see the sketch after this list
  • the labels on the diagram do not contain "scaled", because the scaling is applied only to the California dataset
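
In code terms, the California branch now chains scaler → imputer → final regressor, roughly like this (a sketch under the assumption that the example's final estimator is RandomForestRegressor; parameters are illustrative):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    estimator = make_pipeline(
        RobustScaler(),                    # common scale for all features
        IterativeImputer(random_state=0),  # BayesianRidge is the default estimator
        RandomForestRegressor(random_state=0),
    )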

Changes in plot_iterative_imputer_variants_comparison.py:

  • RobustScaler instead of StandardScaler, as suggested
  • a single function, compute_score_for, computes the scores for both the full and the missing-values datasets (see the sketch after this list)
  • scaling is applied in all evaluations, because the final target estimator is BayesianRidge, which is affected by differently scaled features (due to regularization)
  • modified the call parameters of Ridge, the k-NN regressor, and RandomForestRegressor to improve computational stability during iterative imputation, and increased the number of iterations
  • removed the median simple imputation (it does not add anything interesting in this context)
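
The shared helper could look roughly like this (compute_score_for is the name used in the PR; the body below is a guess at its shape, with N_SPLITS as an illustrative constant):

    from sklearn.model_selection import cross_val_score

    N_SPLITS = 5  # illustrative; the example fixes its own value

    def compute_score_for(X, y, estimator):
        # Evaluate the full and imputed variants identically: negated MSE
        # over the same cross-validation splits.
        return cross_val_score(
            estimator, X, y, scoring="neg_mean_squared_error", cv=N_SPLITS
        )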

Member

betatim commented Jun 5, 2025

Looks good to me. What do you think @ArturoAmorQ?

@betatim added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) on Jun 6, 2025
@adrinjalali adrinjalali merged commit 9f86681 into scikit-learn:main Jun 12, 2025
40 checks passed
Labels
Documentation · Waiting for Second Reviewer (first reviewer is done, need a second one!)

Successfully merging this pull request may close these issues:

DOC Examples (imputation): add scaling when using k-neighbours imputation

6 participants