Skip to content

Make automatic validation for all scikit-learn public functions #24862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
glemaitre opened this issue Nov 8, 2022 · 243 comments
Closed

Make automatic validation for all scikit-learn public functions #24862

glemaitre opened this issue Nov 8, 2022 · 243 comments
Labels
good first issue Easy with clear instructions to resolve Meta-issue General issue associated to an identified list of tasks Sprint Validation related to input validation

Comments

@glemaitre
Copy link
Member

glemaitre commented Nov 8, 2022

PR #22722 introduced a decorator to validate the parameters of functions. We now need to use it for all functions where it is applicable.

Please open one PR per function. The title of the PR must mention which function it's dealing with. We recommend using the following pattern for titles:

MAINT Parameters validation for <function>

where <function> is a placeholder to be replaced with the function you chose.

The description of the PR must begin with Towards #24862 so that this issue and the PR are mutually crossed-linked.

Steps

  1. Chose a public function that is documented in https://scikit-learn.org/dev/modules/classes.html. Check in the source code if the function contains some manual parameter validation (i.e. you should see some if condition and error raising). In case there is no validation in the function, you can report it in the issue where we will decide whether or not to skip the function.

  2. To validate the function, you need to decorate it with the decorator sklearn.utils._param_validation.validate_params. Do not rely only on the docstring of the estimator to define it: although it can help, it's important to primarily rely on the implementation to find the valid values because the docstring might not be completely accurate. The decorator take a Python dictionary as input where each key corresponds to a parameter name and the value corresponds to the associate constraints. You can find an example for kmeans_plusplus below

    @validate_params(
    {
    "X": ["array-like", "sparse matrix"],
    "n_clusters": [Interval(Integral, 1, None, closed="left")],
    "x_squared_norms": ["array-like", None],
    "random_state": ["random_state"],
    "n_local_trials": [Interval(Integral, 1, None, closed="left"), None],
    }
    )
    def kmeans_plusplus(
    X, n_clusters, *, x_squared_norms=None, random_state=None, n_local_trials=None
    ):
    You can also get more details regarding the constraints by looking at the different Estimators validation previously implemented (cf. the _parameter_constraints attribute).

  3. All existing simple param validation can now be removed. (simple means that does not depend on the input data or that does not depend on the value of another parameter for instance).

  4. Tests that check error messages from simple param validation can also be removed (carefully: we need to keep the tests checking for more complex param validation !).

  5. Finally, add the function to the list of the common param validation test

    PARAM_VALIDATION_FUNCTION_LIST = [
    "sklearn.cluster.kmeans_plusplus",
    ]

    and make sure the test passes: pytest -vl sklearn/tests/test_public_functions.py

Functions already updated:

See "details" in section 1

Be aware that you can see an up-to-date list at the following link: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tests/test_public_functions.py#L132

@github-actions github-actions bot added the Needs Triage Issue requires triage label Nov 8, 2022
@glemaitre glemaitre added Easy Well-defined and straightforward way to resolve Sprint good first issue Easy with clear instructions to resolve Meta-issue General issue associated to an identified list of tasks Validation related to input validation and removed Needs Triage Issue requires triage labels Nov 8, 2022
@mrastgoo
Copy link
Contributor

mrastgoo commented Nov 8, 2022

Hi can I work on sklearn.cluster.spectral_clustering

@Ala-Na
Copy link
Contributor

Ala-Na commented Nov 8, 2022

Hi ! Can I try to do the sklearn.cluster.affinity_propagation function ?

@rth
Copy link
Member

rth commented Nov 8, 2022

Yes sure, no need to ask. Just commenting that you are working on it is enough :)

@nirezuluet
Copy link

Hi. I can work on sklearn.cluster.mean_shift

@vinayak-mehta
Copy link
Contributor

I can work on cluster.dbscan

@jirask
Copy link

jirask commented Nov 8, 2022

hi! I am working on ward_tree

@vinayak-mehta
Copy link
Contributor

cluster.dbscan doesn't need parameter validation because it's already happening inside the class. cc: @glemaitre

@glevv
Copy link
Contributor

glevv commented Nov 8, 2022

l1_min_c is not on the list, so it's not needed?

@OmarManzoor
Copy link
Contributor

@glemaitre I think cluster.k_means might already be done?

@OmarManzoor
Copy link
Contributor

Working on cluster.estimate_bandwidth

@glevv
Copy link
Contributor

glevv commented Nov 9, 2022

WIP covariance.ledoit_wolf

@OmarManzoor
Copy link
Contributor

Working on decomposition.dict_learning

@emirkmo
Copy link
Contributor

emirkmo commented Nov 9, 2022

since cluster.k_means simply calls cluster.KMeans which is a validated class, should we still validate k_means or mark it as skipped/validated? If it should be validated I will add it.

@adrinjalali
Copy link
Member

@emirkmo no we don't need to validate separately, if the underlying class already does it.

@emirkmo
Copy link
Contributor

emirkmo commented Nov 10, 2022

@emirkmo no we don't need to validate separately, if the underlying class already does it.

Then cluster.mean_shift is also validated.

@emirkmo
Copy link
Contributor

emirkmo commented Nov 10, 2022

Working on feature_extraction.image.grid_to_graph and feature_extraction.image.img_to_graph

@ggold7046
Copy link

2. You can also get more details regarding the constraints by looking at the different Estimators validation previously implemented (cf. the _parameter_constraints attribute).

Where can I get details about _parameter_constraints attributes ?

@glemaitre
Copy link
Member Author

Look at any previously merged PR from this thread. They will give a good overview of what is required.

@ggold7046
Copy link

Actually I was looking for what parameters and values to put in the dictionary key : value pair.

@Josipo
Copy link

Josipo commented Jul 21, 2023

Hi,
datasets.load_sample_images actually has no input parameters. Hence, you can consider it to be done 🟢

@lpsilvestrin
Copy link
Contributor

Working on neighbors.radius_neighbors_graph

@lpsilvestrin
Copy link
Contributor

lpsilvestrin commented Jul 27, 2023

@jeremiedbb It seems like neighbors.radius_neighbors_graph is not public. Perhaps it was not meant to be on the todo list, or am I missing something?

@sqali
Copy link
Contributor

sqali commented Jul 30, 2023

Hi @glemaitre ,

I can see that validation parameters in the files of the ensemble functions have been added. However, the list in this file does not contain any ensemble function. Kindly correct me if I am wrong.

"sklearn.datasets.make_spd_matrix",
"sklearn.datasets.make_swiss_roll",
"sklearn.decomposition.sparse_encode",
"sklearn.feature_extraction.grid_to_graph",

@Tialo
Copy link
Contributor

Tialo commented Aug 1, 2023

@AnuravModak
Copy link

is the utils.extmath.weighted_mode validation done? @glemaitre ??

@NikkiBht
Copy link

Is there anything left to be validated? I checked everything and didn't find any to work on. I am new to this so I am not sure if I ACTUALLY checked everything + a few of them have open PRs from a while ago I suppose, am I right? Can we work on those?

@jeremiedbb
Copy link
Member

@lpsilvestrin radius_neighbors_graph is a public function (by convention public functions are the ones documented, i.e. that are in classes.rst). Feel free to open a PR for this one if you're still interested. It's the last one left :)

@jeremiedbb
Copy link
Member

jeremiedbb commented Aug 30, 2023

I updated the list of functions that are not valiated yet #24862 (comment). All of them have open PRs except radius_neighbors_graph. It should be pretty straight forward, taking inspiration from kneighbors_graph.

@sqali
Copy link
Contributor

sqali commented Aug 30, 2023

Hi @jeremiedbb ,

Hope you are doing well. I am interested in this pull request. May I proceed to work on this?

@jeremiedbb
Copy link
Member

Sure go ahead !

@sqali
Copy link
Contributor

sqali commented Aug 30, 2023

Hi @jeremiedbb ,

I can see that validation has already been added as you can see below. Can you please guide me if I misunderstood your former comment?

@validate_params(
{
"X": ["array-like", KNeighborsMixin],
"n_neighbors": [Interval(Integral, 1, None, closed="left")],
"mode": [StrOptions({"connectivity", "distance"})],
"metric": [StrOptions(set(itertools.chain(*VALID_METRICS.values()))), callable],
"p": [Interval(Real, 0, None, closed="right"), None],
"metric_params": [dict, None],
"include_self": ["boolean", StrOptions({"auto"})],
"n_jobs": [Integral, None],
},
prefer_skip_nested_validation=False, # metric is not validated yet
)

@jeremiedbb
Copy link
Member

The function to validate is radius_neighbors_graph

@Snerdify
Copy link

Snerdify commented Sep 1, 2023

Hi @lpsilvestrin , can i proceed with utils.extmath.randomized_range_finder function ?

@jeremiedbb
Copy link
Member

jeremiedbb commented Sep 1, 2023

Hi @Snerdify we've decided not to add validation for this function. All the remaining functions to do already have open PRs.

@glemaitre
Copy link
Member Author

To be more explicit I will also lock the issue for new comment but let it open until we merged the remaining PRs.

@scikit-learn scikit-learn locked as resolved and limited conversation to collaborators Sep 27, 2023
@jeremiedbb
Copy link
Member

The last PRs have been merged. We can now close this issue. Thanks a lot everyone !

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
good first issue Easy with clear instructions to resolve Meta-issue General issue associated to an identified list of tasks Sprint Validation related to input validation
Projects
None yet
Development

No branches or pull requests