ENH Support cardinality filtering of columns in make_column_selector
#22923
Are params
Those probably make more sense, and would allow for filtering ranges rather than just below or above a single value.
Please also update doc/modules/compose.rst
        (["col_low", "col_mid", "col_high"], None, 7),
        (["col_mid"], 2, 6),
        ([], 2, 3),
    ],
Please also check boundary condition on min, i.e. min=1 or min=5. Please also check max > 7.
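The boundary checks requested above can be sketched in plain Python. This is only an illustration, not the PR's implementation: `select_by_cardinality` is a hypothetical helper, both bounds are assumed to be inclusive, and the toy columns are assumed to hold 1, 5 and 7 distinct values (chosen so that the three quoted test cases hold). With a pandas DataFrame, the per-column count would typically come from `nunique()` instead of `len(set(...))`.

```python
from typing import Hashable, Iterable, List, Mapping, Optional


def select_by_cardinality(
    columns: Mapping[str, Iterable[Hashable]],
    min_cardinality: Optional[int] = None,
    max_cardinality: Optional[int] = None,
) -> List[str]:
    """Return names of columns whose distinct-value count is within bounds.

    Hypothetical helper for illustration only. Both bounds are assumed
    to be inclusive; None means "no bound on that side".
    """
    selected = []
    for name, values in columns.items():
        cardinality = len(set(values))  # number of distinct values
        if min_cardinality is not None and cardinality < min_cardinality:
            continue
        if max_cardinality is not None and cardinality > max_cardinality:
            continue
        selected.append(name)
    return selected


# Assumed toy data with 1, 5 and 7 distinct values respectively.
data = {
    "col_low": [0, 0, 0, 0, 0, 0, 0],
    "col_mid": [0, 1, 2, 3, 4, 4, 4],
    "col_high": [0, 1, 2, 3, 4, 5, 6],
}

# The three cases quoted from the test parametrization.
print(select_by_cardinality(data, None, 7))
print(select_by_cardinality(data, 2, 6))
print(select_by_cardinality(data, 2, 3))

# Boundary cases: min equal to an existing cardinality, max above the largest.
print(select_by_cardinality(data, 1, None))
print(select_by_cardinality(data, 5, None))
print(select_by_cardinality(data, None, 8))
```

If the PR's bounds turn out to be exclusive on one side, only the two comparison operators in the helper would change, and the boundary cases are the ones that would reveal the difference.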
Since it is close to what the SuperVectorizer of dirty-cat does as a heuristic, I was wondering if @GaelVaroquaux would have some thoughts regarding the API. Is there a need for a minimum cardinality? In practice, I would think that a low-cardinality column is categorical and anything else is numerical. OneHotEncoder can already apply different policies to binary and multi-category features. So I am just wondering whether we need two parameters or a single one. Any thoughts?
My feeling is that the "make_column_selector" API is very tedious to use. I think there is a benefit to having an object a bit like the SuperVectorizer (maybe under a different name, that one is a bit ... fun): a simplified API that makes simple things simple.
While SuperVectorizer is great, one reason
ColumnTransformer does not route columns of the input exclusively to different transformers, except where
@michelkluger Something missing in this PR is the documentation. We would need to have this feature demonstrated in the User Guide (next to where it is currently described). Then, it would be great to see if there is an example where we could make use of the feature.
Reference Issues/PRs
Resolves #15873.
What does this implement/fix? Explain your changes.
Implemented the functionality (selecting columns with cardinality > or <= some threshold) and added a test to the test suite to check that it works as expected.