-
-
Notifications
You must be signed in to change notification settings - Fork 26k
FIX ColumnTransformer.get_feature_names_out with string slices #22913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
How does this PR look? Am I missing something that I could fix? Any suggestions? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
To be more specific, the whole method can become:
def _get_feature_name_out_for_transformer(
self, name, trans, column, feature_names_in
):
"""Gets feature names of transformer.
Used in conjunction with self._iter(fitted=True) in get_feature_names_out.
"""
if trans == "drop" or _is_empty_column_selection(column):
return
elif trans == "passthrough":
column_indices = self._transformer_to_input_indices[name]
return feature_names_in[column_indices]
# An actual transformer
if not hasattr(trans, "get_feature_names_out"):
raise AttributeError(
f"Transformer {name} (type {type(trans).__name__}) does "
"not provide get_feature_names_out."
)
column_indices = self._transformer_to_input_indices[name]
names = feature_names_in[column_indices]
return trans.get_feature_names_out(names)
which removes many if
statements.
( | ||
[ | ||
("bycol1", TransWithNames(), ["b"]), | ||
("bycol2", "drop", slice("d", "d")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a slice that contains more than one element?
# Assuming column and self._transformer_to_input_indices[name] | ||
# are matched. True if column generated from _iter | ||
column = self._transformer_to_input_indices[name] | ||
return feature_names_in[column] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, I think we can simplify this whole section:
elif trans == "passthrough":
column_indices = self._transformer_to_input_indices[name]
return feature_names_in[column_indices]
@@ -453,6 +459,10 @@ def _get_feature_name_out_for_transformer( | |||
f"Transformer {name} (type {type(trans).__name__}) does " | |||
"not provide get_feature_names_out." | |||
) | |||
if isinstance(column, slice) and _determine_key_type(column) == "str": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here:
column_indices = self._transformer_to_input_indices[name]
names = feature_names_in[column_indices]
return trans.get_feature_names_out(names)
and remove the if statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan Thanks for your feedback. When I was writing this, I was not sure if keeping column
severed another purpose. Completely depending only on _transformer_to_input_indices
will simplify the logic even more as in the latest PR. Do you think the API for _get_feature_name_out_for_transformer
shouldn't change? Tx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _get_feature_name_out_for_transformer
is private, we can change the API if we need to. We need to make sure the test pass for the public get_feature_names_out
API.
Although in this case, the column
is still used in:
if trans == "drop" or _is_empty_column_selection(column):
I think we can keep this line here for now, so this PR is easier to review. In the future, we can open an follow up PR to refactor _get_feature_name_out_for_transformer
.
@@ -21,6 +21,7 @@ | |||
from ..utils import Bunch | |||
from ..utils import _safe_indexing | |||
from ..utils import _get_column_indices | |||
from ..utils import _determine_key_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the linting issue on the CI:
from ..utils import _determine_key_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @thomasjpfan for your feedback. I hope I have addressed them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan Not sure if this PR is still on your radar or you you want to see it updated in any way. Tx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@thomasjpfan Thanks. What are the next steps? I am still not able to merge PR. Does policy require another reviewer to review the PR? |
Yes, this PR requires a second reviewer |
@adrinjalali @ogrisel This PR requires another approver. Would it be possible for you to talk a look since it extends the previous PR you approved earlier? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @randomgeek78
…t-learn#22913) * FIX ColumnTransformer.get_feature_names_out with string slices * Fixed black formatting * Added tests * Updated whats new 1.1 section * Simplified logic and added slices containing two elements * Fixed black linting * Linting cleanup, revert API change * Fixed linting
…t-learn#22913) * FIX ColumnTransformer.get_feature_names_out with string slices * Fixed black formatting * Added tests * Updated whats new 1.1 section * Simplified logic and added slices containing two elements * Fixed black linting * Linting cleanup, revert API change * Fixed linting
The earlier PR (#22775) did not account for string slices. A string slice like
slice('col_A', 'col_B')
is a valid column specification format and was not handled in the earlier PR. This PR extends the earlier fix to include string slices.Reference Issues/PRs
Extends Fix #22775
Issue: #22774
What does this implement/fix? Explain your changes.
It extends get_feature_names_out API to include cases where the column specification was using a string slice like
slice('col_A', 'col_B')
Any other comments?