ENH Record output of transformers in ColumnTransformer #18393


Merged
merged 27 commits into from
Apr 2, 2021

Conversation

lbittarello
Contributor

This PR adds a fitted attribute to the ColumnTransformer, tentatively named transformers_output_. It records the columns produced by each transformer (if any). This is useful for debugging as well as modelling (e.g., to speed up partial dependence computations). I have extended the existing tests to inspect the new attribute.

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @lbittarello !

This is an interesting feature. Can you provide a code snippet of how you would use this in practice?

@lbittarello
Contributor Author

Here is a silly example:

import lightgbm as lgb
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

ct = ColumnTransformer(
    [
        ("x1", FunctionTransformer(lambda x: 2 * x.to_numpy()), ["x1"]),
        ("x2", OneHotEncoder(sparse=False), ["x2"]),
    ]
)

df = pd.DataFrame({"y": [0, 2, 5], "x1": [0, 1, 2], "x2": ["i", "i", "ii"]})
dft = ct.fit_transform(df)

estimator = lgb.LGBMRegressor(min_child_samples=1).fit(dft, df["y"])

# compute partial dependency (advantage: no unnecessary transformations)

ix = ct.transformers_output_["x2"]
dft[:, ix] = ct.named_transformers_["x2"].transform(pd.DataFrame({"x2": ["i"] * 3}))
estimator.predict(dft).mean()

# compute total gain from an untransformed feature

ix = ct.transformers_output_["x2"]
estimator.booster_.feature_importance(importance_type="gain")[ix].sum()

The attribute becomes more valuable when the column transformer is constructed programmatically: it may then contain many transformers, and it isn't obvious which one is responsible for which column.
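For intuition, the bookkeeping the attribute performs can be sketched in plain Python: given each fitted transformer's output width (in order), the recorded slices are just running offsets. This is an illustrative stand-in, not the PR's implementation; `output_slices` and the widths below are hypothetical.

```python
def output_slices(widths):
    """Map each transformer name to the slice of output columns it produces.

    `widths` is a dict of {transformer_name: n_output_columns}, listed in
    the order the transformers appear in the ColumnTransformer.
    """
    slices = {}
    start = 0
    for name, width in widths.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices


# e.g. the example above: "x1" emits 1 column, "x2" (one-hot, 2 levels) emits 2
print(output_slices({"x1": 1, "x2": 2}))
# {'x1': slice(0, 1, None), 'x2': slice(1, 3, None)}
```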

Member

@thomasjpfan thomasjpfan left a comment

I am +0.5 on this feature. This type of correspondence is related to SLEP 003.

I would recommend waiting to see what others think.

@lbittarello
Contributor Author

lbittarello commented Sep 24, 2020

This type of correspondence is related to SLEP 003.

Very true. But SLEP 003 refers to a slightly different problem: the relation between the inputs and outputs of a single transformer. It does not address the fact that we can't tell which transformers in a ColumnTransformer generated which columns.

Consider the following example:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm="l1"), [1, 2]), ("norm2", Normalizer(norm="l1"), [0, 3])]
)
X = np.array([[0.0, 1.0, 2.0, 2.0], [1.0, 1.0, 0.0, 1.0]])
Xt = ct.fit_transform(X)

Xt has four columns. Each Normalizer generates two of those columns.

I could use the proposed get_feature_dependence of the first Normalizer to determine that column 1 of X is responsible for the first column in the output of this particular transformer. But I don't know which column of Xt corresponds to the first column in the output of "norm1". So I am not much wiser in the end.

On the other hand, I can combine transformers_output_ and get_feature_dependence to have a full mapping from each input column to each column in the output data, which is more informative.
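The composition described here can be sketched with hand-written stand-ins for both mappings (get_feature_dependence is only a SLEP 003 proposal, so its output format below is assumed, and the slice values mirror the Normalizer example above):

```python
# transformer -> slice of columns in Xt (what transformers_output_ would record)
transformer_slices = {"norm1": slice(0, 2), "norm2": slice(2, 4)}

# transformer -> {input column of X: local output column} (the kind of mapping
# a get_feature_dependence-style API might expose; hand-written here)
local_maps = {"norm1": {1: 0, 2: 1}, "norm2": {0: 0, 3: 1}}

# compose them into a full input-column -> Xt-column mapping
full_map = {}
for name, local in local_maps.items():
    offset = transformer_slices[name].start
    for x_col, local_out in local.items():
        full_map[x_col] = offset + local_out

print(full_map)  # {1: 0, 2: 1, 0: 2, 3: 3}
```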

Member

@jnothman jnothman left a comment

This is precisely one of the things I was considering when proposing SLEP003, and I long ago suggested similar functionality for FeatureUnion. I'm generally positive about the idea.

One awkwardness for FeatureUnion was that, without n_features_out_ or consistent availability of get_feature_names, transformers_output_ would only be available if fit_transform, rather than fit, was called. Here, fit calls fit_transform, so that's not an issue.

So the question here is: what is the right name for the attribute, and what is the right format for its data? I like the idea of it being a dict. I might be more comfortable, however, with its values being slices.

@lbittarello
Contributor Author

lbittarello commented Sep 25, 2020

I tried to make the docstrings more explicit, spun off the tests and changed the values in the dictionary to slices. Happy to change the name of the attribute too.

@mlondschien
Contributor

What is the state of this? I would be interested in this as well. @jnothman @thomasjpfan

@thomasjpfan
Member

Slices may not be the best representation, because the selected columns can be disjoint. In this case a boolean mask would be better.

With either solution, it may conflict with the discussion in #14251, which asks whether we are okay with allowing columns in transform that were not seen in fit.

@lbittarello
Contributor Author

it may conflict with the discussion in #14251

As far as I understand, #14251 is about input columns. This PR only relates transformers to columns in the output.

Slices may not be the best representation, because the selected columns can be disjoint

As far as I understand, columns in the output are never disjoint (unlike input columns, which are not the object of this PR).
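The contiguity follows from how ColumnTransformer assembles its output: each transformer's block of columns is concatenated horizontally, in order, so every transformer owns one contiguous slice. A toy illustration with plain lists standing in for arrays (the values are the l1-normalized blocks from the example above):

```python
# toy version of how ColumnTransformer assembles Xt: each transformer's
# block of columns is stacked horizontally, in order, so every transformer
# ends up owning one contiguous slice of the output
blocks = {
    "norm1": [[1 / 3, 2 / 3], [1.0, 0.0]],  # Normalizer on columns [1, 2]
    "norm2": [[0.0, 1.0], [0.5, 0.5]],      # Normalizer on columns [0, 3]
}

out_rows = [[], []]  # two samples
slices = {}
start = 0
for name, block in blocks.items():
    width = len(block[0])
    slices[name] = slice(start, start + width)
    for row, block_row in zip(out_rows, block):
        row.extend(block_row)
    start += width

print(slices)  # {'norm1': slice(0, 2, None), 'norm2': slice(2, 4, None)}
```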

@thomasjpfan
Member

thomasjpfan commented Nov 14, 2020

As far as I understand, columns in the output are never disjoint (unlike input columns, which are not the object of this PR).

Ah yes, you are correct. This PR does not conflict with the issue.

I think it would be useful to extend one of the examples to showcase the new attribute, thus increasing its visibility. It may be difficult to find an example to extend, because we usually place the column transformer into a pipeline, where we do not need the connection between a transformer's name and its output feature indices.

@NicolasHug
Member

I think it would be useful to extend one of the examples to showcase the new attribute, thus increasing its visibility. It may be difficult to find an example to extend, because we usually place the column transformer into a pipeline, where we do not need the connection between a transformer's name and its output feature indices.

@thomasjpfan wouldn't this be a perfect match for the categorical_features of HistGradientBoosting ? #18394 (comment)
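A hedged sketch of that match: the slice recorded for an encoder could be turned into the boolean mask that HistGradientBoosting's categorical_features accepts. The slice value and feature count below are hard-coded stand-ins for what a fitted ColumnTransformer would provide.

```python
import numpy as np

# stand-ins for values a fitted ColumnTransformer would provide
n_output_features = 5
encoder_slice = slice(2, 5)  # e.g. the proposed attribute's entry for "encoder"

# boolean mask marking the encoded columns as categorical, in the format
# accepted by HistGradientBoosting's categorical_features parameter
categorical_mask = np.zeros(n_output_features, dtype=bool)
categorical_mask[encoder_slice] = True
print(categorical_mask)  # [False False  True  True  True]
```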

Member

@NicolasHug NicolasHug left a comment

Thanks @lbittarello , made a first pass, this looks good.

I think it would be more natural if all transformers were in the dict (including passthrough), even if they're mapped to an empty slice. I'm happy to be convinced otherwise, though.

Regarding the name, I like output_indices_:
ct.output_indices_['encoder'] reads naturally to me

lbittarello and others added 8 commits November 19, 2020 22:15
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Joel Nothman <78827+jnothman@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Member

@NicolasHug NicolasHug left a comment

Thanks @lbittarello , some minor nits but LGTM!

We'll also need a whats new entry in doc/whats_new/v0.24.rst. Make sure to reference this PR as illustrated in the other entries there.

lbittarello and others added 4 commits November 21, 2020 22:18
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Co-Authored-By: Nicolas Hug <NicolasHug@users.noreply.github.com>
Base automatically changed from master to main January 22, 2021 10:53
@thomasjpfan thomasjpfan changed the title Record output of transformers in ColumnTransformer ENH Record output of transformers in ColumnTransformer Mar 22, 2021
Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit 26e688d into scikit-learn:main Apr 2, 2021
@thomasjpfan
Member

Thank you for working on this @lbittarello !
