add `feature_indices` param to `permutation_importance()` #30005
Conversation
add `n_features` param to enable a divide-and-conquer approach when you have 100's of features. Idea being to randomly cover `n_features` and do it multiple times in separate work.
❌ Linting issues — This PR is introducing linting issues. Here's a summary of the issues.
Hmm, not sure how to fix that failing test:
@andrewm4894 Actually, we had a similar feature request in the past: #18694. In general, I think that the scope of #18694 is better defined for the tabular learning use case, but using the proposed API there, you could request N random features, make a single permutation, and repeat the process. You would need to combine the results, but this is already something that you were planning. It would be great if you could have a look at the proposed API in that issue.
@glemaitre oh interesting, I had thought about something similar, but felt this change was a bit smaller and simpler in some sense, so maybe had more of a chance. But agreed, being able to pass in a list of features as a subset is probably the most general case and would serve what I need here too.
Although @glemaitre, this is a small and simple PR here, while you linked to an issue; I'm not sure if there is a draft PR anywhere? (I will try to think about what that might look like, but I'm not sure how complex/big it might be.) E.g., maybe there's a trade-off between a small incremental change vs. a bigger one that is more flexible longer term. Let's maybe see what maintainers think a potential way forward might be (I'm new here, this is my first PR to the repo, so I will defer to others).
Nope, I don't think that we got any PR on the subject.
Basically, the API in the issue is much better suited than going on the
*changed the title from* add `n_features` param to `permutation_importance()` *to* add `feature_indices` param to `permutation_importance()`
@glemaitre 👋 I made some changes here to switch to this approach: the user can just pass in a random list of feature indices if they want, or a more specific one if they only want to look at a specific feature or subset of features. Just went with `feature_indices`.
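For context, a sketch of the call pattern being proposed, contrasted with today's API where `permutation_importance` always scores every column. Note that `feature_indices` is the parameter proposed in this PR and does not exist in released scikit-learn, so that part is shown as a comment only:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Current API: importances are computed for all 8 columns.
result = permutation_importance(model, X, y, n_repeats=3, random_state=0)
print(result.importances_mean.shape)  # (8,)

# Proposed API (hypothetical, the parameter added by this PR): restrict the
# work to a subset of columns, e.g. the random sample handled by one worker:
# result = permutation_importance(model, X, y, n_repeats=3, random_state=0,
#                                 feature_indices=[0, 3, 7])
# result.importances_mean would then only cover those 3 columns.
```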
@glemaitre 👋 do you think the updated approach in this PR, using the more flexible API, is a way forward? I'm a little bit blocked by this at work, so I'm keen to push to see if it's feasible to get something merged in the next little while, or if I'll need to try something else at work as a workaround.
@glemaitre any idea how I can find someone to help me get this reviewed and merged over the next few weeks?
@glemaitre any idea if I can make another run at seeing whether it's feasible to get this merged?
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Add a `feature_indices` param to `permutation_importance()` to enable a divide-and-conquer approach when you have 1000's of features. The idea is to pass in the subset of features you want to calculate feature importance for. For example, you might want to randomly cover different subsets using `feature_indices`, and do it multiple times in separate `permutation_importance()` calls to cover all features.

Any other comments?
My use case here is that I have some feature importance tasks running over a model (a DNN) with 1000's of features, and I am finding this a bottleneck in the process, since each `permutation_importance()` call needs to process all features. I would like to scale horizontally a bit more by having each permutation importance task randomly work on, say, 100's of features; I will then combine all the results back together and account for how many times each feature column was sampled, etc. Just wanted to make a quick, small, illustrative PR to show what I mean and see if maintainers think this is a reasonable or feasible approach/idea. I was thinking about coding this up myself in some way, but thought it could end up being useful for others, so I wanted to see about contributing here.