[MRG] Add experimental.ColumnTransformer #9012
Conversation
feel free to squash my commits
(force-pushed from 2111a57 to 30bea38, then to 2333e61)
examples/column_transformer.py
Outdated
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),
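For context, the fragment under discussion nests a text-processing pipeline inside a ``(name, transformer, column)`` entry. Below is a self-contained sketch of that pattern using today's ``sklearn.compose.ColumnTransformer``; the data frame and the small ``n_components`` are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy frame with a free-text 'body' column (hypothetical data).
X = pd.DataFrame({"body": ["spam spam eggs", "ham and eggs", "spam only"] * 10})

ct = ColumnTransformer([
    ("body_pipeline", Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("best", TruncatedSVD(n_components=2)),  # small value for the toy corpus
    ]), "body"),  # a scalar string selects a 1-D column, as TfidfVectorizer expects
])

out = ct.fit_transform(X)  # shape: (n_samples, n_components)
```

Note that passing the column as a scalar ``'body'`` (not ``['body']``) is what hands the vectorizer a 1-D column of strings.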
Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say ``column_name='body'`` somehow)
While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?
doc/modules/feature_extraction.rst
Outdated
Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be
they=?
I'd remove the whole sentence and just say ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."
doc/modules/feature_extraction.rst
Outdated
.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
Do you mean "will give all feature values for the selected sample", e.g. ``X_array[1].shape == (n_features,)``?
Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these
sure, don't take my comment personally, just making notes
doc/modules/feature_extraction.rst
Outdated
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).
Are we supporting integer column labels?
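The asymmetry the quoted note describes (including the integer-label case asked about above) can be checked directly; a minimal sketch with numpy and pandas:

```python
import numpy as np
import pandas as pd

X_array = np.arange(12).reshape(4, 3)  # 4 samples, 3 features

# On an ndarray, integer indexing selects a *row*, i.e. one sample:
assert X_array[1].shape == (3,)        # (n_features,)

# A DataFrame built without explicit column names gets integer labels
# 0, 1, 2 -- and the same key now selects a *column*, i.e. one feature:
X_columns = pd.DataFrame(X_array)
assert X_columns[1].shape == (4,)      # (n_samples,)
```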
doc/modules/feature_extraction.rst
Outdated
.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
-> a pandas DataFrame
Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.
So the current way to specify a transformer is like this: … (where …). There was some discussion and back-and-forth about this in the original PR, and other options mentioned are (as far as I read it correctly): … or …
BTW, when using dicts, I would actually find this interface more logical: … which switches the place of column and transformer name: it gives you a tuple of (name, trans) similar to the Pipeline interface, and uses the dict key to select the column (which mimics how values are also selected from the input data via __getitem__).
The column is a numpy array, right, so it's not hashable. I think we could use either the list or dict thing here, and have a helper …
Oh I didn't fully read your comment. I think multiple columns are essential, so we can't do that...
Maybe I'm a little bit confused, but then: does this overlap in scope with FeatureUnion? If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?
Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.
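For reference, the API that eventually shipped resolved this debate in favour of an ordered list of ``(name, transformer, columns)`` tuples, mirroring ``Pipeline`` and ``FeatureUnion``. A sketch with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"city": ["London", "Paris", "London"],
                  "temp": [10.0, 15.0, 12.0]})

ct = ColumnTransformer([
    # (name, transformer, columns) -- list order fixes the stacking order
    ("onehot", OneHotEncoder(), ["city"]),
    ("scale", StandardScaler(), ["temp"]),
])

out = ct.fit_transform(X)  # 2 one-hot columns followed by 1 scaled column
```

The list form gives a deterministic column-stacking order, which the dict-keyed proposals above could not guarantee.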
doc/modules/pipeline.rst
Outdated
transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.
Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.
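The parallel-application-and-concatenation behaviour the quoted passage describes can be sketched with ``FeatureUnion`` (toy data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(10, 4)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),   # contributes 2 output columns
    ("scaled", StandardScaler()),   # contributes 4 output columns
])

# Each transformer sees the full X; outputs are stacked side by side.
out = union.fit_transform(X)
```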
Additional problem that we encountered: in the meantime (since the original PR was made), transformers need 2D input. But by ensuring this, the example using a … So possible options: …
Another option would be: …
Some things I noticed:
doc/modules/feature_extraction.rst
Outdated
@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.

.. _column_transformer:
This should be in compose.rst, but perhaps noted at the top of this file
Yes, I know, but also (related to what I mentioned here: #9012 (comment)):
- when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
- we should reference this in preprocessing.rst
- we should add a better 'typical data science use case' example for the example gallery
- I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.
I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do a minimal here and simply move the current docs addition to compose.rst without any of the other mentioned improvements).
feature_names : list of strings
    Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')
Shouldn't remainder be handled here too?
Ideally, yes. I am only not fully sure what to do here currently, given that get_feature_names is in general not really well supported.
I think ideally I would add the names of the passed-through columns to feature_names: the actual string column names in the case of pandas DataFrames, and in the case of numpy arrays the indices into that array as strings (['0', '1', ...])?
I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (but a removal of the error).
Can just raise NotImplementedError in case of ``remainder != 'drop'`` for now... Or you can tack the remainder transformer onto the end of ``_iter``.
I agree ``get_feature_names`` is not quite the right design.
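A minimal sketch of the suggested guard. The class and attribute names here (``ToyColumnTransformer``, ``_names``) are made up for illustration; only the ``remainder != 'drop'`` check mirrors the suggestion:

```python
class ToyColumnTransformer:
    """Illustrative stand-in for the real ColumnTransformer."""

    def __init__(self, remainder="drop"):
        self.remainder = remainder
        # Hypothetical names collected from the fitted transformers.
        self._names = ["body__tfidf0", "body__tfidf1"]

    def get_feature_names(self):
        # Refuse rather than silently omit the passed-through columns:
        # removing this error later is backwards compatible, while
        # silently wrong output would not be.
        if self.remainder != "drop":
            raise NotImplementedError(
                "get_feature_names is not yet supported when "
                "remainder != 'drop'")
        return list(self._names)
```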
Can we have an issue for this?
Yes, just move it. The rest can happen after release or at the hands of other contributors.
Added last update: added an additional error for get_feature_names, and moved the docs to compose.rst. As far as I am concerned, somebody can push the green button ;-)
Indeed: let's see how this flies!
Thanks Joris for some great work on finally making this happen!
Woohoo, thanks for merging!
Great work, congrats!
Yes, I am jumping up and down with excitement.
Really nice!!! Let's stack those columns then ;)
Looking forward to the next release
Next release will be amazing!
OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche
Thank you @jorisvandenbossche!! Great stuff. Is there going to be an effort (I would like to contribute) to implement …? I find that one of the big advantages of … What do you guys think? Thank you in advance.
@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main use case will be with OneHotEncoder/CategoricalEncoder, which will provide a …
@amueller thank you for the links. For instance, if we want to use …
You might want to look at the eli5 library, where more feature names have been implemented in a single-dispatch framework that is easily extended.
Also: #5523 suggests pandas in/out. I think it might be a good idea to have something like df_out on transformers, but this is a harder change to make.
Just learned of this class. Very happy to find it! Was wondering if there ever might be support for allowing elements of the column list to themselves be transformers. The composition operation is natural: a transformer, given X, will ultimately produce some columns that can be column-stacked. This implicitly would require a Pipeline whenever one of the elements is itself a transformer. For a (perhaps far-fetched) example: …
Here, … If there's any interest, I have some code I could share that builds these objects, though.
Hi Jonathan,
I think that most of what you want to do is achievable with some objects
already available in scikit-learn:
https://scikit-learn.org/stable/modules/compose.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
The remaining gaps are probably not frequently used by most scikit-learn
users. Hence, the best option is probably to write a custom class that
implements what you need.
Cheers
Hi Gael,
In theory, yes it should be possible, I'm mostly thinking of convenience and maybe complexity. I hadn't closely considered sklearn to do the internal work because I hadn't heard of this ColumnTransformer until yesterday.
If T1, T2 and T3 are transformer classes then I'd like to be able to specify a transformer in ColumnTransformer as ('mytrans', T1(), [3,4,5,T2(),6,T3()]).
It can be expressed several ways. Most simply, it can be expressed as T1 piped with a FeatureUnion of six transformers where each column is converted to a ColumnTransformer with a single passthrough column (or maybe one could inspect the list and see that one could use columns [3,4,5] and [6]) and the transformers T2() and T3() remain as they are. This seems a simpler way to express it and is pretty close to what I've been doing (again except things not being transformers). Do you think this is going to blow up in complexity quickly?
Of course, any specialized solution I write would have the same complexity in array manipulation as using transformers so I guess the issue is how much overhead is there really in using transformers? Not much I would guess.
And I guess I could get convenience by writing a helper function, something like make_column_transformer.
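Such a helper does in fact exist: ``sklearn.compose.make_column_transformer`` builds a ``ColumnTransformer`` from ``(transformer, columns)`` pairs and auto-generates the names. A sketch with toy data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"city": ["London", "Paris"], "temp": [10.0, 15.0]})

ct = make_column_transformer(
    (OneHotEncoder(), ["city"]),
    (StandardScaler(), ["temp"]),
)  # names like 'onehotencoder' and 'standardscaler' are generated

out = ct.fit_transform(X)  # 2 one-hot columns + 1 scaled column
```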
Lastly, I haven't yet figured out how well ColumnTransformer will respect pd.DataFrame. If a ColumnTransformer fit on a dataframe with columns ['X', 'Y', 'Z', 'W'] only refers to ['X', 'Z'], can one call transform on a dataframe with just the columns ['X', 'Z'] (implicitly assuming remainder="drop")? I'm guessing this will fail in sklearn, but it is kind of convenient in some cases if you're going to work with dataframes a lot.
The behaviour will be defined as follows: see scikit-learn/sklearn/compose/_column_transformer.py, lines 725 to 742 at 57658ba.
So if you drop, it should be fine.
Continuation of @amueller's PR #3886 (for now just rebased and updated for changes in sklearn)
Fixes #2034.
Closes #2034, closes #3886, closes #8540, closes #8539