[MRG] Add experimental.ColumnTransformer #9012
Conversation
feel free to squash my commits
(force-pushed from 2111a57 to 30bea38, then to 2333e61)
examples/column_transformer.py
Outdated
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),
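For context, the fragment under discussion nests a text-processing pipeline inside a ``(name, transformer, column)`` entry. Below is a self-contained sketch of that pattern using today's ``sklearn.compose.ColumnTransformer``; the data frame and the small ``n_components`` are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy frame with a free-text 'body' column (hypothetical data).
X = pd.DataFrame({"body": ["spam spam eggs", "ham and eggs", "spam only"] * 10})

ct = ColumnTransformer([
    ("body_pipeline", Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("best", TruncatedSVD(n_components=2)),  # small value for the toy corpus
    ]), "body"),  # a scalar string selects a 1-D column, as TfidfVectorizer expects
])

out = ct.fit_transform(X)  # shape: (n_samples, n_components)
```

Note that passing the column as a scalar ``'body'`` (not ``['body']``) is what hands the vectorizer a 1-D column of strings.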
Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say ``column_name='body'`` somehow)
While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?
doc/modules/feature_extraction.rst
Outdated
Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be
they=?
I'd remove the whole sentence and just say ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."
doc/modules/feature_extraction.rst
Outdated
.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
Do you mean "will give all feature values for the selected sample", e.g. ``X_array[1].shape == (n_features,)``?
Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these
sure, don't take my comment personally, just making notes
doc/modules/feature_extraction.rst
Outdated
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).
Are we supporting integer column labels?
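The asymmetry the quoted note describes (including the integer-label case asked about above) can be checked directly; a minimal sketch with numpy and pandas:

```python
import numpy as np
import pandas as pd

X_array = np.arange(12).reshape(4, 3)  # 4 samples, 3 features

# On an ndarray, integer indexing selects a *row*, i.e. one sample:
assert X_array[1].shape == (3,)        # (n_features,)

# A DataFrame built without explicit column names gets integer labels
# 0, 1, 2 -- and the same key now selects a *column*, i.e. one feature:
X_columns = pd.DataFrame(X_array)
assert X_columns[1].shape == (4,)      # (n_samples,)
```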
doc/modules/feature_extraction.rst
Outdated
.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
-> a pandas DataFrame
Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.
So the current way to specify a transformer is like this: … (where …). There was some discussion and back-and-forth about this in the original PR, and other options mentioned are (as far as I read it correctly): … or …
BTW, when using dicts, I would actually find this interface more logical: … which switches the place of column and transformer name: it gives you a tuple of (name, trans) similar to the Pipeline interface, and uses the dict key to select the column (which mimics how values are also selected from the input data via __getitem__).
The column is a numpy array, right, so it's not hashable. I think we could use either the list or dict thing here, and have a helper …
Oh I didn't fully read your comment. I think multiple columns are essential, so we can't do that...
Maybe I'm a little bit confused, but then: does this overlap in scope with FeatureUnion? If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?
Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.
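For reference, the API that eventually shipped resolved this debate in favour of an ordered list of ``(name, transformer, columns)`` tuples, mirroring ``Pipeline`` and ``FeatureUnion``. A sketch with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"city": ["London", "Paris", "London"],
                  "temp": [10.0, 15.0, 12.0]})

ct = ColumnTransformer([
    # (name, transformer, columns) -- list order fixes the stacking order
    ("onehot", OneHotEncoder(), ["city"]),
    ("scale", StandardScaler(), ["temp"]),
])

out = ct.fit_transform(X)  # 2 one-hot columns followed by 1 scaled column
```

The list form gives a deterministic column-stacking order, which the dict-keyed proposals above could not guarantee.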
doc/modules/pipeline.rst
Outdated
transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.
Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.
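The parallel-application-and-concatenation behaviour the quoted passage describes can be sketched with ``FeatureUnion`` (toy data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(10, 4)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),   # contributes 2 output columns
    ("scaled", StandardScaler()),   # contributes 4 output columns
])

# Each transformer sees the full X; outputs are stacked side by side.
out = union.fit_transform(X)
```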
Additional problem that we encountered: in the meantime (since the original PR was made), transformers need 2D input. But by ensuring this, the example using a … So possible options: …
Another option would be: …
Some things I noticed:
doc/modules/feature_extraction.rst
Outdated
@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.

.. _column_transformer:
This should be in compose.rst, but perhaps noted at the top of this file
Yes, I know, but also (related to what I mentioned here: #9012 (comment)):
- when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
- we should reference this in preprocessing.rst
- we should add a better 'typical data science use case' example for the example gallery
- I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.
I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do a minimal here and simply move the current docs addition to compose.rst without any of the other mentioned improvements).
feature_names : list of strings
    Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')
Shouldn't remainder be handled here too?
Ideally, yes. I am only not fully sure what to do here currently, given that get_feature_names is in general not really well supported.
I think ideally I would add the names of the passed-through columns to feature_names: the actual string column names in the case of pandas DataFrames, and in the case of numpy arrays the indices into that array as strings (['0', '1', ...])?
I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (but a removal of the error).
Can just raise NotImplementedError in case of ``remainder != 'drop'`` for now... Or you can tack the remainder transformer onto the end of ``_iter``.
I agree ``get_feature_names`` is not quite the right design.
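A minimal sketch of the suggested guard. The class and attribute names here (``ToyColumnTransformer``, ``_names``) are made up for illustration; only the ``remainder != 'drop'`` check mirrors the suggestion:

```python
class ToyColumnTransformer:
    """Illustrative stand-in for the real ColumnTransformer."""

    def __init__(self, remainder="drop"):
        self.remainder = remainder
        # Hypothetical names collected from the fitted transformers.
        self._names = ["body__tfidf0", "body__tfidf1"]

    def get_feature_names(self):
        # Refuse rather than silently omit the passed-through columns:
        # removing this error later is backwards compatible, while
        # silently wrong output would not be.
        if self.remainder != "drop":
            raise NotImplementedError(
                "get_feature_names is not yet supported when "
                "remainder != 'drop'")
        return list(self._names)
```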
Can we have an issue for this?
Yes, just move it. The rest can happen after release or at the hands of other contributors.
Added last update: added an additional error for get_feature_names, and moved the docs to compose.rst. As far as I am concerned, somebody can push the green button ;-)
Indeed: let's see how this flies!
Thanks Joris for some great work on finally making this happen!
Woohoo, thanks for merging!
Great work, congrats!
Yes, I am jumping up and down with excitement.
Really nice!!! Let's stack those columns then ;)
Looking forward to the next release
Next release will be amazing!
OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche
Thank you @jorisvandenbossche!! Great stuff. Is there going to be an effort (I would like to contribute) to implement …? I find that one of the big advantages of … What do you guys think? Thank you in advance.
@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main use case will be with OneHotEncoder/CategoricalEncoder, which will provide a …
@amueller thank you for the links. For instance, if we want to use …
You might want to look at the eli5 library, where more feature names have been implemented in a single-dispatch framework that is easily extended.
Also: #5523 suggests pandas in/out. I think it might be a good idea to have something like df_out on transformers, but this is a harder change to make.
Just learned of this class. Very happy to find it! Was wondering if there ever might be support for allowing elements of the column list to themselves be transformers. The composition operation is natural: a transformer, given X, will ultimately produce some columns that can be column-stacked. This implicitly would require a Pipeline whenever one of the elements is itself a transformer. For a (perhaps far-fetched) example: …
Here, … If there's any interest, I have some code I could share that builds these objects, though.
Hi Jonathan,
I think that most of what you want to do is achievable with some objects
already available in scikit-learn:
https://scikit-learn.org/stable/modules/compose.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
The remaining gaps are probably not frequently used by most scikit-learn
users. Hence, the best option is probably to write a custom class that
implements what you need.
Cheers
Hi Gael,
In theory, yes it should be possible, I'm mostly thinking of convenience and maybe complexity. I hadn't closely considered sklearn to do the internal work because I hadn't heard of this ColumnTransformer until yesterday.
If T1, T2 and T3 are transformer classes then I'd like to be able to specify a transformer in ColumnTransformer as ('mytrans', T1(), [3,4,5,T2(),6,T3()]).
It can be expressed several ways. Most simply, it can be expressed as T1 piped with a FeatureUnion of six transformers where each column is converted to a ColumnTransformer with a single passthrough column (or maybe one could inspect the list and see that one could use columns [3,4,5] and [6]) and the transformers T2() and T3() remain as they are. This seems a simpler way to express it and is pretty close to what I've been doing (again except things not being transformers). Do you think this is going to blow up in complexity quickly?
Of course, any specialized solution I write would have the same complexity in array manipulation as using transformers so I guess the issue is how much overhead is there really in using transformers? Not much I would guess.
And I guess I could get convenience by writing a helper function, something like make_column_transformer.
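Such a helper does in fact exist: ``sklearn.compose.make_column_transformer`` builds a ``ColumnTransformer`` from ``(transformer, columns)`` pairs and auto-generates the names. A sketch with toy data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"city": ["London", "Paris"], "temp": [10.0, 15.0]})

ct = make_column_transformer(
    (OneHotEncoder(), ["city"]),
    (StandardScaler(), ["temp"]),
)  # names like 'onehotencoder' and 'standardscaler' are generated

out = ct.fit_transform(X)  # 2 one-hot columns + 1 scaled column
```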
Lastly, I haven't yet figured out how well ColumnTransformer will respect pd.DataFrame. If a ColumnTransformer fit on a dataframe with columns ['X', 'Y', 'Z', 'W'] only refers to ['X', 'Z'], can one call transform on a dataframe with just the columns ['X', 'Z'] (implicitly assuming remainder="drop")? I'm guessing this will fail in sklearn, but it is kind of convenient in some cases if you're going to work with dataframes a lot.
The behaviour will be defined as follows: see scikit-learn/sklearn/compose/_column_transformer.py, lines 725 to 742 at 57658ba.
So if you drop, it should be fine.
Continuation of @amueller's PR #3886 (for now just rebased and updated for changes in sklearn)
Fixes #2034.
Closes #2034, closes #3886, closes #8540, closes #8539