
[MRG] Add experimental.ColumnTransformer #9012


Conversation

@jorisvandenbossche (Member) commented Jun 6, 2017

Continuation of @amueller's PR #3886 (for now, just rebased and updated for changes in sklearn).

Closes #2034, closes #3886, closes #8540, closes #8539

@amueller (Member) commented Jun 6, 2017

feel free to squash my commits

@jorisvandenbossche force-pushed the amueller/heterogeneous_feature_union branch from 2111a57 to 30bea38 (June 6, 2017 16:03)
@jorisvandenbossche force-pushed the amueller/heterogeneous_feature_union branch from 30bea38 to 2333e61 (June 6, 2017 16:10)
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),
Review comment (Member):

Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say column_name='body' somehow)

Review comment (Member):

While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?

Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be
Review comment (Member):

they=?

I'd remove the whole sentence and just say: ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."


.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
Review comment (Member):

Do you mean "will give all feature values for the selected sample", e.g. ``X_array[1].shape == (n_features,)``?

Review comment (Member Author):

Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these

Review comment (Member):

sure, don't take my comment personally, just making notes

:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).
Review comment (Member):

Are we supporting integer column labels?
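
To illustrate the contrast the quoted note draws, including an integer column label (a minimal sketch):

    import numpy as np
    import pandas as pd

    X_array = np.array([[1, 2], [3, 4], [5, 6]])
    X_array[1]    # second sample, all features: array([3, 4]), shape (n_features,)

    X_columns = pd.DataFrame(X_array, columns=[0, 1])
    X_columns[1]  # the feature labelled 1, for all samples: shape (n_samples,)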

.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
Review comment (Member):

-> a pandas DataFrame

Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.

@jorisvandenbossche (Member Author) commented Jun 6, 2017

So the current way to specify a transformer is like this:

ColumnTransformer({"name": (Transformer(), column), ..})

(where 'name' is the transformer name, and column is the column on which to apply the transformer).

There was some discussion and back-and-forth about this in the original PR; other options mentioned are (as far as I read it correctly):

ColumnTransformer([("name", Transformer(), column), ..]) # more similar to Pipeline interface

or

ColumnTransformer([('column', Transformer()), ..])   # in this case transformer name / column name has to be identical

BTW, when using dicts, I would actually find this interface more logical:

ColumnTransformer({"column": ("name", Transformer()), ..})

which switches the places of the column and the transformer name; this gives you a tuple of (name, trans) similar to the Pipeline interface, and uses the dict key to select the column (mimicking the ``__getitem__`` access by which values are selected from the input data).
But this has the disadvantage that we cannot extend this to multiple columns with lists (since lists cannot be dict keys).
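
For concreteness, a sketch of the list-of-tuples variant, which is the form that eventually shipped in sklearn.compose (the column names 'age' and 'city' are hypothetical):

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # (name, transformer, columns) triples, mirroring the Pipeline interface
    ct = ColumnTransformer([
        ('scale', StandardScaler(), ['age']),
        ('onehot', OneHotEncoder(), ['city']),
    ])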

@amueller (Member) commented Jun 6, 2017

ColumnTransformer({"column": ("name", Transformer()), ..})

The column is a numpy array, right, so it's not hashable.

I think we could use either the list or the dict approach here, and have a helper make_column_transformer or something that does
make_column_transformer({Transformer(): column})
Transformer is hashable, so that works, and we can generate the name from the class name as in make_pipeline.
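
Such a helper did eventually land. In current releases it takes (transformer, columns) tuples rather than a dict, and generates the names from the class names as in make_pipeline; a sketch (column names hypothetical):

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # generated names: 'standardscaler' and 'onehotencoder'
    ct = make_column_transformer(
        (StandardScaler(), ['age']),
        (OneHotEncoder(), ['city']),
    )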

@amueller (Member) commented Jun 6, 2017

Oh I didn't fully read your comment. I think multiple columns are essential, so we can't do that...

@vene (Member) commented Jun 7, 2017

Maybe I'm a little bit confused, but then: does this overlap in scope with FeatureUnion?

If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?

@vene (Member) commented Jun 7, 2017

Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.

transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.
Review comment (Member):

Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.

@jorisvandenbossche (Member Author) commented Jun 7, 2017

Additional problem that we encountered:

In the meantime (since the original PR was made), transformers need 2D X values. Therefore, I made sure that the ColumnTransformer always passes the subset of columns to the transformer as a 2D object, even when the transformer applies to only a single column.

But, by ensuring this, the example using a TfidfVectorizer fails, because that one expects a 1D object of text samples.

So possible options:

  1. Add a way to specify in the ColumnTransformer that for certain transformers the passed X values should be kept as 1D.
    E.g. we could have

    ColumnTransformer([('tfidf', TfidfVectorizer(), 'text_col'), ...],
                      flatten_column={'tfidf': True})
    

    where flatten_column (or another name like keep_1d) is False by default (satisfying the normal transformers), but you can override this default per transformer.
    The use of a dict here is similar to the transformer_weights keyword.

  2. Adapt the example (which means: letting the user write more boilerplate) to fix it, e.g. by adding a step in the pipeline to select the single column from the 2D object.
    We could add one line to the current pipeline that holds the TFIDF (this tuple is one of the transformers in the ColumnTransformer)

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
    added -->   ('flatten', FunctionTransformer(lambda x: x[:, 0], validate=False)),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 'body'),
    
  3. Adapt TfidfVectorizer to, e.g., have a keyword that allows specifying that 2D data is expected (which would be False by default, for backwards compatibility).
    If we want to do this one, it should ideally be a separate PR; the second option could then be used as a temporary hack to keep the example working, and removed in that PR.

@jorisvandenbossche (Member Author) commented:

Another option would be:

  4. Make a distinction between specifying the columns as a scalar or as a list: when using a scalar, the data is passed to the transformer as 1D; when using a list, as 2D.
    The disadvantage of this is that for all transformers except the text vectorizers, you will have to specify a single column as a list:

    ColumnTransformer([('tfidf', TfidfVectorizer(), 'text_col'),
                       ('scaler', StandardScaler(), ['other_col'])])
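
This scalar-versus-list distinction is what the released ColumnTransformer ended up adopting. A sketch of how it plays out ('text_col' and 'num_col' are hypothetical column names):

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler

    ct = ColumnTransformer([
        # scalar column -> the transformer receives a 1D array/Series
        ('tfidf', TfidfVectorizer(), 'text_col'),
        # list of columns -> the transformer receives a 2D DataFrame/array
        ('scaler', StandardScaler(), ['num_col']),
    ])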
    

@jnothman (Member) left a comment:

Some things I noticed

@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.


.. _column_transformer:
Review comment (Member):

This should be in compose.rst, but perhaps noted at the top of this file

Review comment (Member Author):

Yes, I know, but also (related to what I mentioned here: #9012 (comment)):

  • when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
  • we should reference this in preprocessing.rst
  • we should add a better 'typical data science use case' example for the example gallery
  • I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.

I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do a minimal version here and simply move the current docs addition to compose.rst without any of the other mentioned improvements.)

feature_names : list of strings
Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')
Review comment (Member):

Shouldn't remainder be handled here too?

Review comment (Member Author):

Ideally, yes. I'm just not fully sure what to do here currently, given that get_feature_names is in general not really well supported.
Ideally I would add the names of the passed-through columns to feature_names: the actual string column names in the case of pandas DataFrames, and in the case of numpy arrays the indices into that array as strings (['0', '1', ..])?

I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (only to the removal of the error).

Review comment (Member):

We can just raise NotImplementedError when remainder != 'drop' for now... Or you can tack the remainder transformer onto the end of _iter.

I agree get_feature_names is not quite the right design.
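
A minimal sketch of that suggestion (hypothetical code, not the merged implementation; the _iter signature and the name prefixing are assumed here):

    def get_feature_names(self):
        check_is_fitted(self, 'transformers_')
        if self.remainder != 'drop':
            raise NotImplementedError(
                "get_feature_names is not yet supported when there are "
                "passed-through (remainder) columns")
        feature_names = []
        for name, trans, _ in self._iter(fitted=True):  # assumed signature
            # prefix each transformer's names with its own name
            feature_names.extend(name + '__' + f
                                 for f in trans.get_feature_names())
        return feature_names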

Review comment (Member):

Can we have an issue for this?

@jnothman (Member) commented May 28, 2018 via email

@jorisvandenbossche (Member Author) commented:

Last update: added an additional error for get_feature_names, and moved the docs to compose.rst.

As far as I am concerned, somebody can push the green button ;-)

@jnothman (Member) commented:

Indeed: let's see how this flies!

@jnothman merged commit 0b6308c into scikit-learn:master May 29, 2018
@jnothman (Member) commented:

Thanks Joris for some great work on finally making this happen!

@jorisvandenbossche (Member Author) commented:

Woohoo, thanks for merging!

@TomDLT (Member) commented May 29, 2018

Great work, congrats!

@GaelVaroquaux (Member) commented May 29, 2018 via email

@glemaitre (Member) commented:

Really nice!!! Let's stack those columns then ;)

@eyadsibai commented:

Looking forward to the next release.

@jorisvandenbossche deleted the amueller/heterogeneous_feature_union branch May 30, 2018 07:48
@armgilles commented:

The next release will be amazing!

@amueller (Member) commented:

OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche

@partmor (Contributor) commented Jun 1, 2018

Thank you @jorisvandenbossche!! Great stuff.
I have a question regarding this feature (I'm very new to GitHub, not sure if this is the right place... apologies in advance):

Is there going to be an effort (I would like to contribute) to implement get_feature_names in the majority of transformers?

I find that one of the big advantages DataFrameMapper from sklearn-pandas brought us is the ability to trace the names of derived features (using aliases and df_out=True), with all that this means for interpretability (e.g. getting feature importances for a tree-based model after a fairly complex non-sequential preprocessing pipeline). Having get_feature_names work consistently in ColumnTransformer would be the bomb.

What do you guys think?

Thank you in advance.

@amueller (Member) commented Jun 1, 2018

@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main use case will be with OneHotEncoder/CategoricalEncoder, which will provide a get_feature_names. For most multivariate transformations it's hard to get feature names, so I don't know how DataFrameMapper did that.

@partmor (Contributor) commented Jun 1, 2018

@amueller thank you for the links. For instance, if we want to use StandardScaler on a set of numeric variables, ColumnTransformer raises an exception because StandardScaler does not have get_feature_names implemented. For "1 to 1" column transformations like standard scaling, DataFrameMapper could simply pass the feature names through: ['x0', 'x1', ...] to ['x0', 'x1', ...]. ColumnTransformer.get_feature_names() just raises an exception when StandardScaler is used in it.
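
(For later readers: this gap is what get_feature_names_out, introduced in scikit-learn 1.0, addresses for one-to-one transformers; a quick illustration against a recent release:)

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler

    X = pd.DataFrame({'x0': [0.0, 1.0], 'x1': [2.0, 3.0]})
    ct = ColumnTransformer([('scale', StandardScaler(), ['x0', 'x1'])])
    ct.fit(X)
    # get_feature_names() raised here in 0.20-era releases;
    # get_feature_names_out passes the input names through:
    ct.get_feature_names_out()  # array(['scale__x0', 'scale__x1'], dtype=object)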

@amueller (Member) commented Jun 2, 2018

@partmor yeah, for univariate things like that it would be easy to implement. We should revisit #6425, keeping in mind that ColumnTransformer relies heavily on it.

@jnothman (Member) commented Jun 3, 2018 via email

@jonathan-taylor commented:

Just learned of this class. Very happy to find it!

Was wondering if there might ever be support for allowing elements of columns to be ColumnTransformers themselves, or even Transformers. (I have actually been writing a Transformer to do this in the last little while, unaware of ColumnTransformer.)

The composition operation is natural: a Transformer, given X, will ultimately produce some columns that can be column-stacked. This would implicitly require a Pipeline whenever one of the columns was a Transformer. Indeed, this is essentially how I've coded things in my use case (to be clear, my base object is not really a Transformer, so it can't be used in a GridSearch; it is just an object that can transform columns). It would likely be hard to expose parameters of such embedded Transformers, but GridSearchCV should be able to work on the top-level objects as in https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

For a (perhaps far-fetched for sklearn) use case, consider interactions in regression models, in particular interactions between transformed columns and/or categorical variables. So, one might want to have something like

pcaX = ColumnTransformer([('pca(X)', PCA(n_components=4), list(range(10)))])
color = ColumnTransformer([('color', OneHotEncoder(), [10])])
interaction = ColumnTransformer([('pca(X):color', Interaction(), [pcaX, color])])

Here Interaction would be a transform that forms the columns of the interactions, i.e. pairwise products (this probably exists in sklearn, but I'm not 100% certain). You might ask who wants interactions between PCA components and a categorical variable? Not sure, but you could replace PCA with some other rich basis and get varying-coefficient models easily. Setting n_components in a GridSearch would seem tricky, but maybe the dunder notation could just be propagated deeper?

If there's any interest, I have some code I could share that builds these objects, though ColumnTransformer is not a Transformer per se. The one Transformer I have built actually takes a list of (my analog of) ColumnTransformer in its constructor, but that's a distinction without a difference. My intent is to manipulate this other object by e.g. slicing the transformers_.
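
A rough, self-contained sketch of such an Interaction transformer (hypothetical; no such class exists in scikit-learn):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class Interaction(BaseEstimator, TransformerMixin):
        """Column-stack pairwise products of two feature blocks."""

        def __init__(self, left, right):
            self.left = left      # e.g. a ColumnTransformer wrapping PCA
            self.right = right    # e.g. a ColumnTransformer wrapping OneHotEncoder

        def fit(self, X, y=None):
            self.left.fit(X, y)
            self.right.fit(X, y)
            return self

        def transform(self, X):
            A = np.asarray(self.left.transform(X))
            B = self.right.transform(X)
            if hasattr(B, 'toarray'):  # densify sparse one-hot output
                B = B.toarray()
            B = np.asarray(B)
            # per-sample outer product, flattened back to a 2D feature block
            return (A[:, :, None] * B[:, None, :]).reshape(len(A), -1)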

@GaelVaroquaux (Member) commented Oct 7, 2021 via email

@jonathan-taylor commented Oct 7, 2021 via email

@glemaitre (Member) commented:

Lastly, I haven't yet figured out how well ColumnTransformer will respect pd.DataFrame. If a ColumnTransformer fit on a dataframe with columns ['X', 'Y', 'Z', 'W'] only refers to ['X', 'Z'], can one call transform on a dataframe with columns ['X', 'Z'] (implicitly assuming remainder="drop")? I'm guessing this will fail in sklearn, but it is kind of convenient in some cases if you're going to work with dataframes a lot.

The behaviour will be defined as follows:

if fit_dataframe_and_transform_dataframe:
    named_transformers = self.named_transformers_
    # check that all names seen in fit are in transform, unless
    # they were dropped
    non_dropped_indices = [
        ind
        for name, ind in self._transformer_to_input_indices.items()
        if name in named_transformers
        and isinstance(named_transformers[name], str)
        and named_transformers[name] != "drop"
    ]
    all_indices = set(chain(*non_dropped_indices))
    all_names = set(self.feature_names_in_[ind] for ind in all_indices)
    diff = all_names - set(X.columns)
    if diff:
        raise ValueError(f"columns are missing: {diff}")

So if you drop, it should be fine.
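
In usage terms, a sketch of what that check permits (assuming a release, 1.0 or later, that includes it):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({'X': [1.0, 2.0], 'Y': [3.0, 4.0],
                       'Z': [5.0, 6.0], 'W': [7.0, 8.0]})
    ct = ColumnTransformer([('scale', StandardScaler(), ['X', 'Z'])],
                           remainder='drop')
    ct.fit(df)
    # 'Y' and 'W' were dropped at fit time, so a frame holding only the
    # used columns is accepted at transform time
    ct.transform(df[['X', 'Z']])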
