ENH Record output of transformers in ColumnTransformer #18393


Merged
merged 27 commits into from
Apr 2, 2021

Conversation

lbittarello
Contributor

This PR adds a fitted attribute to the ColumnTransformer, tentatively named transformers_output_. It records the columns produced by each transformer (if any). This is useful for debugging as well as modelling (e.g., to speed up partial dependence computations). I have extended the existing tests to inspect the new attribute.

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @lbittarello !

This is an interesting feature. Can you provide a code snippet of how you would use this in practice?

@lbittarello
Contributor Author

Here is a silly example:

import lightgbm as lgb
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

ct = ColumnTransformer(
    [
        ("x1", FunctionTransformer(lambda x: 2 * x.to_numpy()), ["x1"]),
        ("x2", OneHotEncoder(sparse=False), ["x2"]),
    ]
)

df = pd.DataFrame({"y": [0, 2, 5], "x1": [0, 1, 2], "x2": ["i", "i", "ii"]})
dft = ct.fit_transform(df)

estimator = lgb.LGBMRegressor(min_child_samples=1).fit(dft, df["y"])

# compute partial dependency (advantage: no unnecessary transformations)

ix = ct.transformers_output_["x2"]
dft[:, ix] = ct.named_transformers_["x2"].transform(pd.DataFrame({"x2": ["i"] * 3}))
estimator.predict(dft).mean()

# compute total gain from an untransformed feature

ix = ct.transformers_output_["x2"]
estimator.booster_.feature_importance(importance_type="gain")[ix].sum()

The attribute becomes more valuable when the column transformer is constructed programmatically: it may then contain many transformers, and it isn't obvious which one is responsible for which column.
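For intuition, the bookkeeping the attribute performs can be sketched in plain Python: given each fitted transformer's output width (in order), the recorded slices are just running offsets. This is an illustrative stand-in, not the PR's implementation; `output_slices` and the widths below are hypothetical.

```python
def output_slices(widths):
    """Map each transformer name to the slice of output columns it produces.

    `widths` is a dict of {transformer_name: n_output_columns}, listed in
    the order the transformers appear in the ColumnTransformer.
    """
    slices = {}
    start = 0
    for name, width in widths.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices


# e.g. the example above: "x1" emits 1 column, "x2" (one-hot, 2 levels) emits 2
print(output_slices({"x1": 1, "x2": 2}))
# {'x1': slice(0, 1, None), 'x2': slice(1, 3, None)}
```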

Member

@thomasjpfan thomasjpfan left a comment

I am +0.5 on this feature. This type of correspondence is related to SLEP 003.

I would recommend waiting to see what others think.

@lbittarello
Contributor Author

lbittarello commented Sep 24, 2020

This type of correspondence is related to SLEP 003.

Very true. But SLEP 003 refers to a slightly different problem: the relation between the inputs and outputs of a single transformer. It does not address the fact that we can't tell which transformers in a ColumnTransformer generated which columns.

Consider the following example:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm="l1"), [1, 2]), ("norm2", Normalizer(norm="l1"), [0, 3])]
)
X = np.array([[0.0, 1.0, 2.0, 2.0], [1.0, 1.0, 0.0, 1.0]])
Xt = ct.fit_transform(X)

Xt has four columns. Each Normalizer generates two of those columns.

I could use the proposed get_feature_dependence of the first Normalizer to determine that column 1 of X is responsible for the first column in the output of this particular transformer. But I don't know which column of Xt corresponds to the first column in the output of "norm1". So I am not much wiser in the end.

On the other hand, I can combine transformers_output_ and get_feature_dependence to have a full mapping from each input column to each column in the output data, which is more informative.
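The composition described here can be sketched with hand-written stand-ins for both mappings (get_feature_dependence is only a SLEP 003 proposal, so its output format below is assumed, and the slice values mirror the Normalizer example above):

```python
# transformer -> slice of columns in Xt (what transformers_output_ would record)
transformer_slices = {"norm1": slice(0, 2), "norm2": slice(2, 4)}

# transformer -> {input column of X: local output column} (the kind of mapping
# a get_feature_dependence-style API might expose; hand-written here)
local_maps = {"norm1": {1: 0, 2: 1}, "norm2": {0: 0, 3: 1}}

# compose them into a full input-column -> Xt-column mapping
full_map = {}
for name, local in local_maps.items():
    offset = transformer_slices[name].start
    for x_col, local_out in local.items():
        full_map[x_col] = offset + local_out

print(full_map)  # {1: 0, 2: 1, 0: 2, 3: 3}
```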

Member

@jnothman jnothman left a comment

This is precisely one of the things I was considering when proposing SLEP003, and I long ago suggested similar functionality for FeatureUnion. I'm generally positive about the idea.

One awkwardness for FeatureUnion was that, without n_features_out_ or consistent availability of get_feature_names, transformers_output_ would only be available if fit_transform, rather than fit, was called. Here, fit calls fit_transform, so that's not an issue.

So the question here is: what is the right name for the attribute, and what is the right format for its data? I like the idea of it being a dict. I might be more comfortable, however, with its values being slices.

@lbittarello
Contributor Author

lbittarello commented Sep 25, 2020

I tried to make the docstrings more explicit, spun off the tests and changed the values in the dictionary to slices. Happy to change the name of the attribute too.

@mlondschien
Contributor

What is the state of this? I would be interested in this as well. @jnothman @thomasjpfan

@thomasjpfan
Member

Slices may not be the best representation, because the selected columns can be disjoint. In this case a boolean mask would be better.

With either solution, it may conflict with the discussion in #14251, which asks whether we are okay with allowing columns in transform that were not seen in fit.

@lbittarello
Contributor Author

it may conflict with the discussion in #14251

As far as I understand, #14251 is about input columns. This PR only relates transformers to columns in the output.

Slices may not be the best representation, because the selected columns can be disjoint

As far as I understand, columns in the output are never disjoint (unlike input columns, which are not the object of this PR).
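The contiguity follows from how ColumnTransformer assembles its output: each transformer's block of columns is concatenated horizontally, in order, so every transformer owns one contiguous slice. A toy illustration with plain lists standing in for arrays (the values are the l1-normalized blocks from the example above):

```python
# toy version of how ColumnTransformer assembles Xt: each transformer's
# block of columns is stacked horizontally, in order, so every transformer
# ends up owning one contiguous slice of the output
blocks = {
    "norm1": [[1 / 3, 2 / 3], [1.0, 0.0]],  # Normalizer on columns [1, 2]
    "norm2": [[0.0, 1.0], [0.5, 0.5]],      # Normalizer on columns [0, 3]
}

out_rows = [[], []]  # two samples
slices = {}
start = 0
for name, block in blocks.items():
    width = len(block[0])
    slices[name] = slice(start, start + width)
    for row, block_row in zip(out_rows, block):
        row.extend(block_row)
    start += width

print(slices)  # {'norm1': slice(0, 2, None), 'norm2': slice(2, 4, None)}
```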

@thomasjpfan
Member

thomasjpfan commented Nov 14, 2020

As far as I understand, columns in the output are never disjoint (unlike input columns, which are not the object of this PR).

Ah yes, you are correct. This PR does not conflict with the issue.

I think it would be useful to extend one of the examples to showcase the new attribute, thus increasing its visibility. It may be difficult to find an example to extend, because we usually place the column transformer into a pipeline, where we do not need the connection between a transformer's name and its output feature indices.

@NicolasHug
Member

I think it would be useful to extend one of the examples to showcase the new attribute, thus increasing its visibility. It may be difficult to find an example to extend, because we usually place the column transformer into a pipeline, where we do not need the connection between a transformer's name and its output feature indices.

@thomasjpfan wouldn't this be a perfect match for the categorical_features of HistGradientBoosting ? #18394 (comment)
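A hedged sketch of that match: the slice recorded for an encoder could be turned into the boolean mask that HistGradientBoosting's categorical_features accepts. The slice value and feature count below are hard-coded stand-ins for what a fitted ColumnTransformer would provide.

```python
import numpy as np

# stand-ins for values a fitted ColumnTransformer would provide
n_output_features = 5
encoder_slice = slice(2, 5)  # e.g. the proposed attribute's entry for "encoder"

# boolean mask marking the encoded columns as categorical, in the format
# accepted by HistGradientBoosting's categorical_features parameter
categorical_mask = np.zeros(n_output_features, dtype=bool)
categorical_mask[encoder_slice] = True
print(categorical_mask)  # [False False  True  True  True]
```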

Member

@NicolasHug NicolasHug left a comment

Thanks @lbittarello , made a first pass, this looks good.

I think it would be more natural if all transformers were in the dict (including passthrough), even if they're mapped to an empty slice. I'm happy to be convinced otherwise, though.

Regarding the name, I like output_indices_:
ct.output_indices_['encoder'] reads naturally to me

lbittarello and others added 8 commits November 19, 2020 22:15
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Member

@NicolasHug NicolasHug left a comment

Thanks @lbittarello , some minor nits but LGTM!

We'll also need a whats new entry in doc/whats_new/v0.24.rst. Make sure to reference this PR as illustrated in the other entries there.

lbittarello and others added 4 commits November 21, 2020 22:18
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Base automatically changed from master to main January 22, 2021 10:53
@thomasjpfan thomasjpfan changed the title Record output of transformers in ColumnTransformer ENH Record output of transformers in ColumnTransformer Mar 22, 2021
Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit 26e688d into scikit-learn:main Apr 2, 2021
@thomasjpfan
Member

Thank you for working on this @lbittarello !
