[WIP] Add feature_extraction.ColumnTransformer #3886
Conversation
    )),

    # Use a SVC classifier on the combined features
    - ('svc', SVC(kernel='linear')),
    + ('svc', LinearSVC(dual=False)),
much more appropriate.
FeatureUnions from previous versions of sklearn will not be unpickleable after this merges. Is that OK?
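For context, the hunk being discussed swaps the final estimator of a Pipeline whose features come from a FeatureUnion. A minimal, self-contained sketch of that shape, with placeholder union members rather than the feature extractors from the example being edited:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    pipeline = Pipeline([
        # Combine two feature extractors into a single feature matrix.
        ('union', FeatureUnion([
            ('pca', PCA(n_components=2)),
            ('kbest', SelectKBest(k=1)),
        ])),
        # Linear SVM on the combined features; dual=False is preferred
        # when n_samples > n_features.
        ('svc', LinearSVC(dual=False)),
    ])
    pipeline.fit(X, y)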
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
good riddance.
?
Happy to see this go. Your change makes all this much more elegant.
Looks good to me. I wrote something similar to this, but didn't get around to writing the tests. How do you test that the parallel dispatch stays parallel?
We don't provide pickle compatibility between versions. That is unfortunate, but we don't have the resources / infrastructure for that at the moment, so we just don't worry about it. I am not sure I understand your question about parallelism. You mean how do we test that joblib actually dispatches? I guess we don't.
You understood my poorly worded question.
That's what the WIPs are for in PR titles!
@amueller, is it okay to allow passing a function? Why won't previous FeatureUnions unpickle?
I think there is a pickling problem because old ones don't have a fields attribute, right? Can you give an application for passing a function? I would rather have data-transforming functions in transformer objects, I think.
@jnothman Strictly speaking, they will unpickle just fine. A v0.15 pickle hydrated in v0.16 will not have self.fields and will bonk when calling _check_fields().
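To make the failure mode concrete, here is a generic sketch (not the PR's actual classes; `fields` and `_check_fields` are just the names mentioned above): unpickling restores an object's __dict__ without re-running __init__, so an attribute added in a newer release is simply missing.

    import pickle

    class FeatureUnionOld:
        """Stand-in for the 0.15-era class: no `fields` attribute."""
        def __init__(self, transformer_list):
            self.transformer_list = transformer_list

    class FeatureUnionNew:
        """Stand-in for the 0.16-era class: methods assume `self.fields` exists."""
        def __init__(self, transformer_list, fields=None):
            self.transformer_list = transformer_list
            self.fields = fields

        def _check_fields(self):
            # Raises AttributeError on instances restored from old pickles.
            return self.fields

    # Pickle an "old" instance, then restore its state onto the "new" class,
    # which is roughly what happens when an old pickle meets new code.
    state = pickle.loads(pickle.dumps(FeatureUnionOld(transformer_list=[]))).__dict__
    restored = FeatureUnionNew.__new__(FeatureUnionNew)
    restored.__dict__.update(state)
    # restored._check_fields()  # -> AttributeError: no attribute 'fields'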
Force-pushed b701a58 to edcc143.
Force-pushed edcc143 to 5f6f7af.
Can I has reviews?
    --------
    >>> from sklearn.preprocessing import Normalizer
    >>> union = FeatureUnion([("norm1", Normalizer(norm='l1')), \
            ("norm2", Normalizer(norm='l1'))], \
I'm not really sure it's a useful example if both undergo the same column-wise transformation.
It is. If both are histograms this is different than doing it per column ;)
Ah. I've never used Normalizer before. I confused it for a feature scaler. It's norming each sample. Thanks...
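A quick check of the point above, as a small sketch: Normalizer(norm='l1') works per sample (each row is divided by its own L1 norm), whereas a column-wise scaler such as MinMaxScaler rescales each feature across samples.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, Normalizer

    X = np.array([[1.0, 3.0],
                  [2.0, 2.0]])

    # Row-wise: each sample is divided by its own L1 norm, so rows sum to 1.
    print(Normalizer(norm='l1').fit_transform(X))
    # [[0.25 0.75]
    #  [0.5  0.5 ]]

    # Column-wise: each feature is rescaled independently across samples.
    print(MinMaxScaler().fit_transform(X))
    # [[0. 1.]
    #  [1. 0.]]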
Changed my mind: can I suggest that this new functionality be noted in the description of
now LGTM! ;)
Thanks for your help @jnothman :) Any other reviews?
@ogrisel a review would be much appreciated ;)
Force-pushed 0a3b75a to ddefe62.
Force-pushed ddefe62 to f31d5db.
Force-pushed f31d5db to fec5175.
Ping @ogrisel, what do you think of this?
I'm new to the code base but FWIW LGTM. I'm also biased since I'm excited to see this merged. :)
While I understand the problem that this is trying to solve, and I think that it is very important, I am a bit worried by the indexing in the sample direction. The changes are toying with our clear convention that the first direction of indexing should be the sample direction. The implications of such blurring of conventions are probably very deep. In particular, I expect code validation and error reporting to become harder and harder. I know where this comes from: pandas has very strange indexing logic, and as a result an API that is hard to learn and error messages that are very open-ended. By contrast, scikit-learn has so far had a very strict set of conventions, which has made it easier to learn and to give good error messages.

As this change basically introduces a new kind of input data, to capture heterogeneous data, I suggest that it be confined to a new sub-module, in which objects only deal with heterogeneous data and refuse to deal with standard data matrices. We could document this module as the part of scikit-learn dedicated to heterogeneous data, and define the input data type as anything that, when indexed with a string, returns an array of length n_samples. This would enable us to support pandas DataFrames, dictionaries of 1D arrays, and structured dtypes. It would probably make the documentation, discovery and future evolution of such support easier.

As a side note, the name 'field' is very unclear to me. I understood where it came from after reading half of the pull request, because the pull request has an obvious consistency and story behind it, but looking locally at a bit of the code I had a hard time understanding why a 'field' was applied in the sample direction.
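As a sketch of the proposed input contract (the container contents below are illustrative), each of the following returns an array of length n_samples when indexed with a string:

    import numpy as np
    import pandas as pd

    n_samples = 3

    # dict of 1D arrays
    as_dict = {'age': np.array([25, 32, 47]),
               'city': np.array(['Paris', 'Lille', 'Lyon'])}

    # pandas DataFrame
    as_frame = pd.DataFrame(as_dict)

    # structured NumPy array
    as_struct = np.array([(25, 'Paris'), (32, 'Lille'), (47, 'Lyon')],
                         dtype=[('age', int), ('city', 'U10')])

    # All three satisfy the contract: indexing with a field name yields an
    # array-like of length n_samples.
    for data in (as_dict, as_frame, as_struct):
        assert len(data['age']) == n_samples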
Well, @GaelVaroquaux feels strongly about not having anything that slices in the first direction in a general module. If we keep the FeatureUnion interface for this transformer there would be much less duplication.
Okay. Together or apart aside: I am tempted to suggest we just have a list of pairs for the transformers list. The first element in each pair would be both the field name and the transformer name, which means that field names must be valid identifiers not containing '__'.
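As an illustration of this proposal (a hypothetical construction; 'body' and 'headers' are made-up field names, and this is not the API that was merged):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Each pair is (field name == step name, transformer). Because the name is
    # also used for parameter addressing (e.g. 'body__ngram_range' in a grid
    # search), it must not contain the reserved '__' separator.
    transformers = [
        ('body', TfidfVectorizer()),
        ('headers', DictVectorizer()),
    ]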
OK, then I'll go back to the old interface, which would get rid of the sorting issue, give the same interface as in pipeline / feature union, and remove the duplication. So that seems like a good idea.
Does this API allow multiple transformers for the same column? Maybe they get automatically aliased to
The current implementation does allow multiple transformers for the same column. I'm suggesting an interface that does not; for such things was FeatureUnion created.

But part of the problem here is that we have come to the limits of an interface where all configuration is provided in an object's construction using Python/NumPy primitives. Pipeline-like construction would be much more readable, future-proof and self-documenting if we used a factory idiom, doing away with complicated nested and parallel dict/list/tuple monsters:

    myunion = FeatureUnion().append(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)

Something similar could be achieved (uglier, IMO) by providing a namedtuple-like class in the API:

    myunion = FeatureUnion([UnionEntry(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)])

Indeed, this would likely happen behind the scenes in the former example. Yes, both of these begin to look like Java, but perhaps, used with discretion, they are the right way to make usable APIs and legible code. I actually think the former passes the Zen test better than the incumbent on the "Flat is better than nested... Readability counts... Practicality beats purity" measures.
By "this" I indeed meant the API you proposed. It's probably an important feature. |
And by "you" you mean me, or @amueller? If me, I had been proposing to not allow multiple transformers for the same column without nesting a FeatureUnion.
I feel that if we make users nest FeatureUnion and ColumnTransformer then we have failed. What we could do is use the "name is the column name" thing by default and keep the feature union. And if anyone wants to use multiple transformers on the same column, they have to provide a separate array of column names.
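For comparison only (not part of this PR): the interface that eventually shipped in sklearn.compose from 0.20 resolves this tension by making the columns explicit in each entry, so the step name is independent of the column name and several transformers can target the same column without nesting. A runnable sketch:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import StandardScaler

    X = pd.DataFrame({
        'text': ['spam spam', 'ham', 'spam ham'],
        'length': [9.0, 3.0, 8.0],
        'digits': [0.0, 1.0, 2.0],
    })

    ct = ColumnTransformer([
        # (name, transformer, columns): names and columns are independent.
        ('bow', CountVectorizer(), 'text'),                  # single column, passed as 1D
        ('scaled', StandardScaler(), ['length', 'digits']),  # list of columns, passed as 2D
    ])
    Xt = ct.fit_transform(X)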
    class ColumnTransformer(BaseEstimator, TransformerMixin):
        """Applies transformers to columns of a dataframe / dict.

        This estimator applies a transformer objects to each column or field of the
applies transformer objects
:)
Force-pushed 5e42880 to 6addec5.
Force-pushed 6addec5 to c3b6568.
    class ColumnTransformer(BaseEstimator, TransformerMixin):
        """Applies transformers to columns of a dataframe / dict.
While I see the point of this transformer on dataframes and dicts, I find it too bad we cannot apply it to NumPy arrays. I would love to see a built-in way to apply transformers to selected columns only.
(Coming late to the party, this might have been discussed before...)
That would be pretty easy with the FunctionTransformer #4798
Indeed, +1
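A sketch of the workaround being suggested (column positions and transformers below are arbitrary choices): a FunctionTransformer that slices columns of a plain NumPy array, piped into the per-block transformer inside a FeatureUnion.

    import numpy as np
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.preprocessing import FunctionTransformer, Normalizer, StandardScaler

    X = np.arange(20, dtype=float).reshape(5, 4)

    def select_columns(columns):
        # A stateless transformer that keeps only the given column positions.
        return FunctionTransformer(lambda X: X[:, columns], validate=False)

    union = FeatureUnion([
        ('scaled_01', make_pipeline(select_columns([0, 1]), StandardScaler())),
        ('normed_23', make_pipeline(select_columns([2, 3]), Normalizer())),
    ])
    Xt = union.fit_transform(X)  # shape (5, 4): the two transformed column blocks, side by side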
            Input data, used to fit transformers.
        """
        transformers = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one_transformer)(trans, X[column], y)
Should use .iloc if it exists, otherwise slice in the second direction, and allow multiple columns.
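A sketch of what that suggestion could look like (the helper name is made up, not the PR's code): prefer the positional indexer when the container provides one, otherwise slice the second axis of an array-like; columns may be a single position or a list of positions.

    import numpy as np

    def _get_columns(X, columns):
        # Positional selection of one or more columns.
        if hasattr(X, 'iloc'):
            return X.iloc[:, columns]       # pandas DataFrame
        return np.asarray(X)[:, columns]    # NumPy array or other array-like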
Should we close this in favor of #9012, to clean up the tracker?
Fixes #2034.
Todo:
Also see here for how this would help people.