Skip to content

FIX ColumnTransformer.get_feature_names_out with string slices #22913

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 31, 2022

Conversation

randomgeek78
Copy link
Contributor

@randomgeek78 randomgeek78 commented Mar 21, 2022

The earlier PR (#22775) did not account for string slices. A string slice like slice('col_A', 'col_B') is a valid column specification format and was not handled in the earlier PR. This PR extends the earlier fix to include string slices.

Reference Issues/PRs

Extends Fix #22775

Issue: #22774

What does this implement/fix? Explain your changes.

It extends get_feature_names_out API to include cases where the column specification was using a string slice like slice('col_A', 'col_B')

Any other comments?

@randomgeek78
Copy link
Contributor Author

How does this PR look? Am I missing something that I could fix? Any suggestions? Thanks!

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

To be more specific, the whole method can become:

    def _get_feature_name_out_for_transformer(
        self, name, trans, column, feature_names_in
    ):
        """Gets feature names of transformer.

        Used in conjunction with self._iter(fitted=True) in get_feature_names_out.
        """
        if trans == "drop" or _is_empty_column_selection(column):
            return
        elif trans == "passthrough":
            column_indices = self._transformer_to_input_indices[name]
            return feature_names_in[column_indices]

        # An actual transformer
        if not hasattr(trans, "get_feature_names_out"):
            raise AttributeError(
                f"Transformer {name} (type {type(trans).__name__}) does "
                "not provide get_feature_names_out."
            )
        column_indices = self._transformer_to_input_indices[name]
        names = feature_names_in[column_indices]
        return trans.get_feature_names_out(names)

which removes many if statements.

(
[
("bycol1", TransWithNames(), ["b"]),
("bycol2", "drop", slice("d", "d")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a slice that contains more than one element?

# Assuming column and self._transformer_to_input_indices[name]
# are matched. True if column generated from _iter
column = self._transformer_to_input_indices[name]
return feature_names_in[column]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I think we can simplify this whole section:

elif trans == "passthrough":
    column_indices = self._transformer_to_input_indices[name]
    return feature_names_in[column_indices]

@@ -453,6 +459,10 @@ def _get_feature_name_out_for_transformer(
f"Transformer {name} (type {type(trans).__name__}) does "
"not provide get_feature_names_out."
)
if isinstance(column, slice) and _determine_key_type(column) == "str":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here:

column_indices = self._transformer_to_input_indices[name]
names = feature_names_in[column_indices]
return trans.get_feature_names_out(names)

and remove the if statements.

Copy link
Contributor Author

@randomgeek78 randomgeek78 Mar 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan Thanks for your feedback. When I was writing this, I was not sure if keeping column severed another purpose. Completely depending only on _transformer_to_input_indices will simplify the logic even more as in the latest PR. Do you think the API for _get_feature_name_out_for_transformer shouldn't change? Tx

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since _get_feature_name_out_for_transformer is private, we can change the API if we need to. We need to make sure the test pass for the public get_feature_names_out API.

Although in this case, the column is still used in:

if trans == "drop" or _is_empty_column_selection(column):

I think we can keep this line here for now, so this PR is easier to review. In the future, we can open an follow up PR to refactor _get_feature_name_out_for_transformer.

@@ -21,6 +21,7 @@
from ..utils import Bunch
from ..utils import _safe_indexing
from ..utils import _get_column_indices
from ..utils import _determine_key_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the linting issue on the CI:

Suggested change
from ..utils import _determine_key_type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan for your feedback. I hope I have addressed them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan Not sure if this PR is still on your radar or you you want to see it updated in any way. Tx

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@randomgeek78
Copy link
Contributor Author

@thomasjpfan Thanks. What are the next steps? I am still not able to merge PR. Does policy require another reviewer to review the PR?

@thomasjpfan
Copy link
Member

Does policy require another reviewer to review the PR?

Yes, this PR requires a second reviewer

@randomgeek78
Copy link
Contributor Author

@adrinjalali @ogrisel This PR requires another approver. Would it be possible for you to talk a look since it extends the previous PR you approved earlier? Thanks.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @randomgeek78

@adrinjalali adrinjalali merged commit 7acb247 into scikit-learn:main Mar 31, 2022
@randomgeek78 randomgeek78 deleted the patch-1 branch March 31, 2022 13:28
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…t-learn#22913)

* FIX ColumnTransformer.get_feature_names_out with string slices

* Fixed black formatting

* Added tests

* Updated whats new 1.1 section

* Simplified logic and added slices containing two elements

* Fixed black linting

* Linting cleanup, revert API change

* Fixed linting
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…t-learn#22913)

* FIX ColumnTransformer.get_feature_names_out with string slices

* Fixed black formatting

* Added tests

* Updated whats new 1.1 section

* Simplified logic and added slices containing two elements

* Fixed black linting

* Linting cleanup, revert API change

* Fixed linting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants