
add feature_indices param to permutation_importance() #30005


Open
andrewm4894 wants to merge 15 commits into main

Conversation

@andrewm4894 commented Oct 3, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Add a feature_indices param to permutation_importance() to enable a divide-and-conquer approach when you have thousands of features. The idea is to pass in the subset of features you want to calculate feature importance for. For example, you might randomly cover different subsets using feature_indices and repeat this across multiple separate permutation_importance() calls until all features are covered.
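To make this concrete, here is a rough sketch of the kind of call I have in mind (illustrative only; feature_indices is the parameter proposed in this PR and is not part of released scikit-learn, and the dataset/model are just placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=50, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Only permute (and score) this subset of columns instead of all 50.
feature_indices = [0, 3, 7, 12]
result = permutation_importance(
    model,
    X,
    y,
    n_repeats=5,
    scoring="neg_mean_squared_error",
    feature_indices=feature_indices,  # proposed in this PR, not in released sklearn
    random_state=0,
)

# Importances would be returned only for the requested columns.
assert result.importances.shape == (len(feature_indices), 5)
```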

Any other comments?

My use case is that I have some feature importance tasks running over a model (a DNN) with thousands of features, and I am finding permutation_importance() to be a bottleneck, since each call needs to process every feature. I would like to scale horizontally a bit more by having each permutation importance task randomly work on, say, a few hundred features; I would then combine all the results back together, accounting for how many times each feature column was sampled.
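And a rough sketch of the divide-and-conquer workflow I am describing (reusing model, X, y and the imports from the sketch above; the combine step is purely illustrative and assumes the importances_mean returned for a feature_indices call is aligned with the requested indices):

```python
import numpy as np

n_features = X.shape[1]
rng = np.random.default_rng(0)

importance_sums = np.zeros(n_features)
sample_counts = np.zeros(n_features)

# Each iteration could run as a separate job/worker in practice.
for seed in range(10):
    subset = rng.choice(n_features, size=10, replace=False)
    res = permutation_importance(
        model,
        X,
        y,
        n_repeats=3,
        scoring="neg_mean_squared_error",
        feature_indices=subset,  # proposed param, see the sketch above
        random_state=seed,
    )
    importance_sums[subset] += res.importances_mean
    sample_counts[subset] += 1

# Average importance per feature, accounting for how often each column was sampled.
combined_importance = np.divide(
    importance_sums,
    sample_counts,
    out=np.zeros(n_features),
    where=sample_counts > 0,
)
```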

I just wanted to make a quick, small, illustrative PR to show what I mean and to see if the maintainers think this is a reasonable or feasible approach/idea. I was thinking about coding this up myself in some way, but thought it could end up being useful for others, so I wanted to see about contributing it here.

add `n_features` param to enable a divide-and-conquer approach when you have hundreds of features. The idea is to randomly cover n_features and repeat the process multiple times as separate work.

github-actions bot commented Oct 3, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/inspection/tests/test_permutation_importance.py	2025-04-10 11:09:15.474510+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/inspection/tests/test_permutation_importance.py	2025-04-10 11:09:33.370666+00:00
@@ -555,11 +555,11 @@
         y,
         n_repeats=n_repeats_test,
         scoring="neg_mean_squared_error",
         feature_indices=feature_indices_test,
     )
-    assert results.importances.shape == (len(feature_indices_test),n_repeats_test)
+    assert results.importances.shape == (len(feature_indices_test), n_repeats_test)
 
 
 def test_permutation_importance_feature_indices_out_of_range_error():
     """Check you get error when feature_indices are out of range."""
     n_features = 10
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/inspection/tests/test_permutation_importance.py

Oh no! 💥 💔 💥
1 file would be reformatted, 918 files would be left unchanged.

Generated for commit: fc98ab6. Link to the linter CI: here

andrewm4894 marked this pull request as ready for review October 3, 2024 19:53
@andrewm4894 (Author)

Hmm, not sure how to fix that failing test:

FAILED tests/test_public_functions.py::test_function_param_validation[sklearn.inspection.permutation_importance] - AssertionError: Mismatch between _parameter_constraints and the parameters

@glemaitre (Member)

@andrewm4894 Actually, we had a similar feature request in the past: #18694

In general, I think that the scope of #18694 is better defined for the tabular learning use case, but using the API proposed there, you could request N random features, make a single permutation, and repeat the process.

You will need to combine the results, but that is already something you were planning to do.

It would be great if you could have a look at the proposed API in that issue.

@andrewm4894 (Author)

@andrewm4894 Actually, we had a similar feature request in the past: #18694

@glemaitre oh interesting, I had thought about something similar, but this change felt a bit smaller and simpler in some sense, so maybe it had more of a chance.

But I agree, being able to pass in a list of features as a subset is probably the most general case and would serve what I need here too.

@andrewm4894 (Author) commented Oct 8, 2024

@glemaitre although this is a small and simple PR here, you linked to an issue; I'm not sure if there is a draft PR anywhere for that?

(I will try to think about what that might look like, but I'm not sure how complex or big it might be.)

E.g. maybe there is a trade-off between a small incremental change and a bigger change that is more flexible in the longer term. Let's see what the maintainers think a potential way forward might be (I'm new here, this is my first PR to the repo, so I will defer to others).

@glemaitre (Member) commented Oct 8, 2024

I'm not sure if there is a draft PR anywhere for that?

Nope, I don't think we got any PR on the subject.

E.g. maybe there is a trade-off between a small incremental change and a bigger change that is more flexible in the longer term. Let's see what the maintainers think a potential way forward might be (I'm new here, this is my first PR to the repo, so I will defer to others).

Basically, the API in the issue is much better suited than going down the n_features route, because we cannot use the code as-is.

andrewm4894 changed the title from "add n_features param to permutation_importance()" to "add feature_indices param to permutation_importance()" on Oct 10, 2024
@andrewm4894 (Author) commented Oct 10, 2024

@glemaitre 👋 I made some changes here to switch to the feature_indices approach, which I think matches the sort of API desired in #18694.

This way the user can pass in a random list of feature indices if they want, or a more specific one if they only want to look at a particular feature or subset of features.

I just went with the feature_indices name, but maybe there is a better, more conventional name used for similar params elsewhere?

@andrewm4894 (Author)

@glemaitre 👋 do you think the updated approach in this PR, using the more flexible API, is a way forward?

I'm a little bit blocked by this at work, so I'm keen to push and see if it's feasible to get something merged in the next little while, or whether I'll need to try something else at work as a workaround.

@andrewm4894 (Author)

@glemaitre any idea how I can find someone to help me get this reviewed and merged over the next few weeks?

@andrewm4894 (Author)

@glemaitre any idea if I can make another run at seeing whether it's feasible to get this merged?
