
DOC Scale data before using k-neighbours regression #31201


Merged
9 commits merged on Jun 12, 2025

Conversation

Contributor

@5nizza 5nizza commented Apr 14, 2025

Fixes #31200, essentially by replacing
KNeighborsRegressor(......) with make_pipeline(StandardScaler(), KNeighborsRegressor(......))
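For concreteness, a minimal before/after sketch of that change (n_neighbors=15 here is only a placeholder, not the value used in the examples):

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Before: distances are computed on raw features, so features with
    # large ranges dominate the neighbour search.
    model = KNeighborsRegressor(n_neighbors=15)

    # After: standardize the features first so they all contribute
    # comparably to the distance computation.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15))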


github-actions bot commented Apr 14, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 15acf02. Link to the linter CI: here

Member

@ogrisel ogrisel left a comment


Thanks for the PR. Since we now use custom estimator names, let's not use class names anymore.

Also, could you please fix the linting problems (use the formatters) as instructed in the automated comment above?

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ArturoAmorQ changed the title from "scaling data before using k-neighbours regression (fixes #31200)" to "DOC Scale data before using k-neighbours regression (fixes #31200)" on Apr 24, 2025
Member

@ArturoAmorQ ArturoAmorQ left a comment


Thanks for the PR @5nizza. These suggestions should fix the failing CI, but I strongly recommend using pre-commit to format your files before committing:

conda install pre-commit
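After installing, the usual pre-commit workflow is (standard pre-commit commands, not from the original comment; scikit-learn ships the hook configuration in its repository):

    pre-commit install            # set up the git hooks once per clone
    pre-commit run --all-files    # format and lint everything in one go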

Member

@virchan virchan left a comment


Thank you for the PR, @5nizza!

I have one small nitpick. Otherwise, LGTM!

Co-authored-by: Virgil Chan <virchan.math@gmail.com>
Member

@ArturoAmorQ ArturoAmorQ left a comment


Even though I agree that kNN requires feature scaling in general, in this case the features of the diabetes dataset already have the same scale, so it's normal that we don't really see an improvement with your PR.

For the California housing dataset we do have different scales, but the features "Population" and "AveOccup" have large outliers, so maybe using a StandardScaler is not the best option. How about we try a RobustScaler?
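Roughly like this (a hypothetical sketch, not code from the example; RobustScaler centres each feature on its median and scales by the interquartile range, so the "Population" and "AveOccup" outliers do not inflate the scale):

    from sklearn.datasets import fetch_california_housing
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    X, y = fetch_california_housing(return_X_y=True)
    # Median/IQR scaling is robust to the heavy-tailed features.
    model = make_pipeline(RobustScaler(), KNeighborsRegressor())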

@betatim changed the title from "DOC Scale data before using k-neighbours regression (fixes #31200)" to "DOC Scale data before using k-neighbours regression" on May 20, 2025
Contributor Author

5nizza commented May 21, 2025

Thanks for the comments.

  1. I will use RobustScaler instead of StandardScaler in both Python examples.
  2. There is another problem: the iterative imputer in the example plot_missing_values.py uses estimator=BayesianRidge, which also needs features on a similar scale to perform well (the ridge L2 penalty on the weights is scale-sensitive), so I will change this as well and update the issue description. See the sketch after this list.
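
A minimal sketch of what point 2 amounts to (illustrative only; variable names and parameters are not taken from the example):

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import BayesianRidge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    # Scale first so the L2 penalty inside BayesianRidge treats all
    # features comparably during each imputation round.
    imputer = make_pipeline(
        RobustScaler(),
        IterativeImputer(estimator=BayesianRidge(), random_state=0),
    )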

Contributor Author

5nizza commented May 26, 2025

Here is the summary of the updates. In plot_missing_values.py:

  • rephrasing and typo fixes
  • the same order throughout the file: first process the diabetes data, then the California data
  • a single evaluation function for both the full and the imputed data (to remove code repetition)
  • removed the argument missing_values=np.nan in the calls to all imputers (it is the default)
  • changed the order: first compute scores for ImputeByZero, then for ImputeByMean
  • as suggested, RobustScaler instead of StandardScaler for the California dataset (and no scaling for the diabetes dataset)
  • for the California dataset, apply RobustScaler before imputing with BayesianRidge (the default estimator) to put the features on the same scale; see the sketch after this list
  • the labels on the diagram do not contain "scaled", because the scaling is applied only to the California dataset
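
In code terms, the California branch now chains scaler → imputer → final regressor, roughly like this (a sketch under the assumption that the example's final estimator is RandomForestRegressor; parameters are illustrative):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    estimator = make_pipeline(
        RobustScaler(),                    # common scale for all features
        IterativeImputer(random_state=0),  # BayesianRidge is the default estimator
        RandomForestRegressor(random_state=0),
    )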

Changes in plot_iterative_imputer_variants_comparison.py:

  • RobustScaler instead of StandardScaler, as suggested
  • a single function, compute_score_for, computes the scores for both the full and the missing-values datasets (see the sketch after this list)
  • scaling is applied in all evaluations, because the final target estimator is BayesianRidge, which is affected by differently scaled features (due to regularization)
  • modified the call parameters of Ridge, the k-NN regressor, and RandomForestRegressor to improve computational stability during iterative imputation, and increased the number of iterations
  • removed the median simple imputation (it does not add anything interesting in this context)
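
The shared helper could look roughly like this (compute_score_for is the name used in the PR; the body below is a guess at its shape, with N_SPLITS as an illustrative constant):

    from sklearn.model_selection import cross_val_score

    N_SPLITS = 5  # illustrative; the example fixes its own value

    def compute_score_for(X, y, estimator):
        # Evaluate the full and imputed variants identically: negated MSE
        # over the same cross-validation splits.
        return cross_val_score(
            estimator, X, y, scoring="neg_mean_squared_error", cv=N_SPLITS
        )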

Member

betatim commented Jun 5, 2025

Looks good to me. What do you think @ArturoAmorQ?

@betatim added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) on Jun 6, 2025
@adrinjalali adrinjalali merged commit 9f86681 into scikit-learn:main Jun 12, 2025
40 checks passed
Labels
Documentation · Waiting for Second Reviewer (first reviewer is done, need a second one!)

Successfully merging this pull request may close these issues:

DOC Examples (imputation): add scaling when using k-neighbours imputation

6 participants