ENH Support cardinality filtering of columns in make_column_selector #22923

Open
wants to merge 6 commits into
base: main
from

richardt94
Contributor

Reference Issues/PRs

Resolves #15873.

What does this implement/fix? Explain your changes.

Implemented the functionality requested in the issue (selecting columns with cardinality > or <= some threshold) and added a test to the test suite to check that it works as expected.

@jnothman
Member

Are params min_cardinality and max_cardinality more self-explanatory?

@richardt94
Contributor Author

> Are params min_cardinality and max_cardinality more self-explanatory?

Those probably make more sense, and would allow for filtering ranges rather than just below or above a single value.
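
A minimal sketch of what range filtering with the two proposed parameters could look like. The helper name `select_by_cardinality`, the inclusive treatment of both bounds, and the dict-of-lists stand-in for a DataFrame are assumptions made for illustration; this is not the code under review in this PR.

```python
def select_by_cardinality(columns, min_cardinality=None, max_cardinality=None):
    """Return names of columns whose number of distinct values is in range.

    Hypothetical helper: `columns` maps column name -> list of values.
    Both bounds are assumed inclusive; None means "no bound on that side".
    """
    selected = []
    for name, values in columns.items():
        cardinality = len(set(values))
        if min_cardinality is not None and cardinality < min_cardinality:
            continue
        if max_cardinality is not None and cardinality > max_cardinality:
            continue
        selected.append(name)
    return selected


# Toy data with cardinalities 1, 5 and 7 (made up for this sketch).
data = {
    "col_low": ["a"] * 7,
    "col_mid": list("abcdeab"),
    "col_high": list("abcdefg"),
}

# Only an upper bound: everything with at most 5 distinct values.
low_or_mid = select_by_cardinality(data, max_cardinality=5)

# Both bounds: a range, which a single threshold could not express.
mid_only = select_by_cardinality(data, min_cardinality=2, max_cardinality=6)
```

With a pandas DataFrame, the per-column cardinality would more naturally come from `DataFrame.nunique()` than from `len(set(values))`.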

Member

@jnothman left a comment

Please also update doc/modules/compose.rst

(["col_low", "col_mid", "col_high"], None, 7),
(["col_mid"], 2, 6),
([], 2, 3),
],
Member

Please also check boundary condition on min, i.e. min=1 or min=5. Please also check max > 7.
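
The requested boundary cases could be covered along these lines. This is a sketch only: it assumes the three test columns have cardinalities 1, 5 and 7 and that both bounds are inclusive, neither of which is stated in the thread, and it uses a stand-in `in_range` predicate rather than the selector under review.

```python
# Assumed cardinalities of the test columns (not confirmed by the PR).
CARDINALITY = {"col_low": 1, "col_mid": 5, "col_high": 7}


def in_range(cardinality, min_cardinality, max_cardinality):
    """Stand-in predicate with inclusive bounds; None disables a bound."""
    above = min_cardinality is None or cardinality >= min_cardinality
    below = max_cardinality is None or cardinality <= max_cardinality
    return above and below


def selected(min_cardinality, max_cardinality):
    """Names of the columns that pass the cardinality filter."""
    return [
        name
        for name, card in CARDINALITY.items()
        if in_range(card, min_cardinality, max_cardinality)
    ]


# (expected columns, min_cardinality, max_cardinality)
boundary_cases = [
    # min equal to the smallest cardinality keeps every column
    (["col_low", "col_mid", "col_high"], 1, None),
    # min equal to a column's own cardinality keeps that column
    (["col_mid", "col_high"], 5, None),
    # max above every cardinality keeps every column
    (["col_low", "col_mid", "col_high"], None, 100),
]

for expected, lo, hi in boundary_cases:
    assert selected(lo, hi) == expected
```

In the real test suite these tuples would be extra rows in the existing parametrized list rather than a separate loop.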

@glemaitre
Member

Since this is close to what the SuperVectorizer of dirty-cat does as a heuristic, I was wondering if @GaelVaroquaux would have some thoughts regarding the API. Is there a need for a minimum cardinality? In practice, I would think that a low-cardinality column would be categorical and that otherwise it would be numerical.

OneHotEncoder can already apply different policies to binary and multi-category features.

So I am just wondering if we need two parameters or a single one?

Any thoughts?

@GaelVaroquaux
Member

My feeling is that I find the "make_column_selector" API very tedious to use.

I think that there is a benefit to having an object a bit like the SuperVectorizer (maybe under a different name; that name is a bit ... fun): a simplified API that makes simple things simple.

@jnothman
Member

While SuperVectorizer is great, one reason ColumnTransformer + make_column_selector is tedious is its genericity; the other is that we're trying to configure it with lists of tuples. Having factory-like methods on ColumnTransformer to add new transformers and specify the columns they apply to would make its usage more succinct and pleasant (ColumnTransformer().add(OneHotEncoder(), max_cardinality=5).add(GapEncoder(), min_cardinality=6)), even when not providing something as immediately useful as SuperVectorizer.
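
The factory-style usage suggested above could be prototyped roughly as follows. `FluentColumnRouter` and its `add` and `resolve` methods are hypothetical names, not part of scikit-learn, and transformers are represented by plain strings so that the sketch runs without any dependency. Inclusive bounds are again an assumption.

```python
class FluentColumnRouter:
    """Hypothetical builder that collects (transformer, bounds) rules via add()."""

    def __init__(self):
        self._rules = []

    def add(self, transformer, min_cardinality=None, max_cardinality=None):
        self._rules.append((transformer, min_cardinality, max_cardinality))
        return self  # returning self is what enables chaining

    def resolve(self, columns):
        """Map each transformer to the column names it would receive."""
        cardinalities = {name: len(set(vals)) for name, vals in columns.items()}
        routing = []
        for transformer, lo, hi in self._rules:
            names = [
                name
                for name, card in cardinalities.items()
                if (lo is None or card >= lo) and (hi is None or card <= hi)
            ]
            routing.append((transformer, names))
        return routing


# Made-up data: "city" has 3 distinct values, "employer" has 6.
data = {
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
    "employer": ["a1", "b2", "c3", "d4", "e5", "f6"],
}

routing = (
    FluentColumnRouter()
    .add("one_hot", max_cardinality=5)
    .add("gap", min_cardinality=6)
    .resolve(data)
)
```

The chained calls replace the list of `(name, transformer, columns)` tuples, which is the part of the current configuration style described above as tedious.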

@jnothman
Member

> So I am just wondering if we need two parameters or a single one?

ColumnTransformer does not route columns of the input exclusively to different transformers, except where remainder is used. Unless we want to add an exclusive param to the ColumnTransformer, the user will need a way to specify both bounds, even to reimplement SuperVectorizer.
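
To make the point concrete: each transformer's selector is evaluated independently, so with only an upper bound available, a second selector has no way to exclude the columns the first one already matched. The sketch below uses a hypothetical `matches` helper, inclusive bounds, and made-up cardinalities.

```python
# Made-up column cardinalities for illustration.
cardinalities = {"gender": 2, "country": 40, "job_title": 900}


def matches(min_cardinality=None, max_cardinality=None):
    """Hypothetical selector: columns whose cardinality lies in the inclusive range."""
    return {
        name
        for name, card in cardinalities.items()
        if (min_cardinality is None or card >= min_cardinality)
        and (max_cardinality is None or card <= max_cardinality)
    }


# Only an upper bound is available: the second selector cannot exclude
# the columns already claimed by the first one.
low = matches(max_cardinality=5)
everything = matches()
overlap_single_bound = low & everything

# Both bounds are available: the two selectors partition the columns.
high = matches(min_cardinality=6)
overlap_two_bounds = low & high
```

Here `overlap_single_bound` contains `"gender"`, meaning that column would be encoded twice, while `overlap_two_bounds` is empty.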

@glemaitre
Member

@michelkluger Something missing in this PR is the documentation. We would need to have this feature demonstrated in the User Guide (next to where make_column_selector is already described). Then, it would be great to see if there is an example where we could make use of the feature.

Successfully merging this pull request may close these issues.

make_column_selector: support for cardinality selection
4 participants