ENH Record output of transformers in ColumnTransformer #18393
Thank you for the PR @lbittarello !
This is an interesting feature. Can you provide a code snippet of how you would use this in practice?
Here is a silly example:

import lightgbm as lgb
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

ct = ColumnTransformer(
    [
        ("x1", FunctionTransformer(lambda x: 2 * x.to_numpy()), ["x1"]),
        ("x2", OneHotEncoder(sparse=False), ["x2"]),
    ]
)
df = pd.DataFrame({"y": [0, 2, 5], "x1": [0, 1, 2], "x2": ["i", "i", "ii"]})
dft = ct.fit_transform(df)
estimator = lgb.LGBMRegressor(min_child_samples=1).fit(dft, df["y"])

# compute partial dependence (advantage: no unnecessary transformations)
ix = ct.transformers_output_["x2"]
dft[:, ix] = ct.named_transformers_["x2"].transform(pd.DataFrame({"x2": ["i"] * 3}))
estimator.predict(dft).mean()

# compute total gain from an untransformed feature
ix = ct.transformers_output_["x2"]
estimator.booster_.feature_importance(importance_type="gain")[ix].sum()

The attribute becomes more valuable when you construct the column transformer programmatically: it may then contain many transformers, and it isn't obvious which one is responsible for which column.
I am +0.5 on this feature. This type of correspondence is related to SLEP 003.
I would recommend waiting to see what others think.
Very true. But SLEP 003 refers to a slightly different problem: the relation between the inputs and outputs of a single transformer. It does not address the fact that we can't tell which transformers in a
Consider the example above:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm="l1"), [1, 2]), ("norm2", Normalizer(norm="l1"), [0, 3])]
)
X = np.array([[0.0, 1.0, 2.0, 2.0], [1.0, 1.0, 0.0, 1.0]])
Xt = ct.fit_transform(X)
I could use the proposed On the other hand, I can combine
This is precisely one of the things I was considering when proposing SLEP003, and I long ago suggested similar functionality for FeatureUnion. I'm generally positive about the idea.
One awkwardness for FeatureUnion was that, without n_features_out_ or consistent availability of get_feature_names, transformers_output_ is only available if fit_transform and not fit is called. Here, fit calls fit_transform, so that's not an issue.
So the question here is: what is the right name for the attribute, and what is the right format for its data? I like the idea of it being a dict. I might be more comfortable, however, with its values being slices.
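To make the dict-of-slices idea concrete, here is a minimal pure-Python sketch, not the PR's implementation. It assumes we already know how many output columns each transformer produced; the `output_slices` helper and the `widths` mapping are hypothetical names used only for illustration. The widths match the silly example earlier in the thread, where "x1" yields one column and the one-hot encoded "x2" yields two.

```python
from itertools import accumulate


def output_slices(widths):
    """Map each transformer name to the slice of output columns it produced.

    `widths` is a hypothetical ordered mapping: name -> number of output columns.
    Outputs are assumed to be stacked left to right in that order.
    """
    names = list(widths)
    # Running totals give the (exclusive) stop index of each block.
    stops = list(accumulate(widths[name] for name in names))
    # Each block starts where the previous one stopped.
    starts = [0] + stops[:-1]
    return {name: slice(start, stop) for name, start, stop in zip(names, starts, stops)}


slices = output_slices({"x1": 1, "x2": 2})
print(slices)
```

A transformer that produces no columns would simply get a width of 0 and therefore an empty slice.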
I tried to make the docstrings more explicit, spun off the tests and changed the values in the dictionary to slices. Happy to change the name of the attribute too.
What is the state on this? I would be interested in this as well. @jnothman @thomasjpfan
Slices may not be the best representation, because the selected columns can be disjoint. In this case a boolean mask would be better. With either solution, it may conflict with the discussion in #14251, which is about whether we are okay with allowing columns in
As far as I understand, #14251 is about input columns. This PR only relates transformers to columns in the output.
As far as I understand, columns in the output are never disjoint (unlike input columns, which are not the object of this PR).
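The slice versus boolean-mask question can be illustrated with a small sketch. It assumes, as argued above, that each transformer's output occupies one contiguous block of columns; `slice_to_mask` is a hypothetical helper, not part of scikit-learn. Under that assumption the two representations select the same columns, and the slice is simply the more compact form.

```python
def slice_to_mask(sl, n_columns):
    """Expand a slice over the output columns into an equivalent boolean mask."""
    # slice.indices() resolves None and out-of-range bounds for a given length.
    selected = set(range(*sl.indices(n_columns)))
    return [i in selected for i in range(n_columns)]


# One transformed row with three output columns; the slice covers the last two.
row = [0.0, 1.0, 0.0]
block = slice(1, 3)
mask = slice_to_mask(block, len(row))

by_slice = row[block]
by_mask = [value for value, keep in zip(row, mask) if keep]
print(mask, by_slice == by_mask)
```

A mask would only be strictly necessary if a transformer's output columns could be interleaved with those of another transformer.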
Ah yes, you are correct. This PR does not conflict with the issue. I think it would be useful to extend one of the examples to showcase the new attribute and increase its visibility. It may be difficult to find an example to extend, because we usually place the column transformer in a pipeline, where we do not need the connection between a transformer's name and the indices of its output features.
@thomasjpfan wouldn't this be a perfect match for the
Thanks @lbittarello , made a first pass, this looks good.
I think it would be more natural if all transformers were in the dict (including passthrough), even if they're mapped to an empty slice. I'm happy to be convinced otherwise though.
Regarding the name, I like output_indices_: ct.output_indices_['encoder'] reads naturally to me
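As a rough illustration of why keeping every transformer in the dict is convenient, the sketch below aggregates a per-column quantity (for instance, feature gains) by transformer. It is a sketch under stated assumptions: the `total_by_transformer` helper, the "unused" entry and the gain values are all made up for illustration, and the dict only mimics the name-to-slice layout discussed here.

```python
def total_by_transformer(values, output_indices):
    """Sum a per-output-column quantity over each transformer's block of columns."""
    return {name: sum(values[sl]) for name, sl in output_indices.items()}


# Hypothetical per-column gains for three output columns (illustrative numbers).
gains = [4.0, 1.5, 2.5]

# Hypothetical mapping; "unused" stands for a transformer that produced no columns.
output_indices = {"x1": slice(0, 1), "x2": slice(1, 3), "unused": slice(0, 0)}

totals = total_by_transformer(gains, output_indices)
print(totals)
```

Because the empty slice selects nothing, the transformer without output needs no special case: it contributes a total of zero rather than raising a KeyError.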
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com> Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Thanks @lbittarello , some minor nits but LGTM!
We'll also need a what's new entry in doc/whats_new/v0.24.rst. Make sure to reference this PR as illustrated in the other entries there.
LGTM
Thank you for working on this @lbittarello !
This PR adds a fitted attribute to the ColumnTransformer, tentatively named transformers_output_. It records the columns produced by each transformer (if any). This is useful for debugging as well as modelling (e.g., to speed up partial dependence computations). I have extended the existing tests to inspect the new attribute.