[WIP] Add feature_extraction.ColumnTransformer #3886
Conversation
    )),

    # Use a SVC classifier on the combined features
    - ('svc', SVC(kernel='linear')),
    + ('svc', LinearSVC(dual=False)),
much more appropriate.
FeatureUnions from previous versions of sklearn will not be unpickleable after this merges. Is that OK?
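For context, the hunk being discussed swaps the final estimator of a Pipeline whose features come from a FeatureUnion. A minimal, self-contained sketch of that shape, with placeholder union members rather than the feature extractors from the example being edited:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    pipeline = Pipeline([
        # Combine two feature extractors into a single feature matrix.
        ('union', FeatureUnion([
            ('pca', PCA(n_components=2)),
            ('kbest', SelectKBest(k=1)),
        ])),
        # Linear SVM on the combined features; dual=False is preferred
        # when n_samples > n_features.
        ('svc', LinearSVC(dual=False)),
    ])
    pipeline.fit(X, y)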
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
good riddance.
?
Happy to see this go. Your change makes all this much more elegant.
Looks good to me. I wrote something similar to this, but didn't get around to writing the tests. How do you test that the parallel dispatch stays parallel?
We don't provide pickle compatibility between versions. That is unfortunate, but we don't have the resources / infrastructure for that at the moment, so we just don't worry about it. I am not sure I understand your question about parallelism. You mean how do we test that joblib actually dispatches? I guess we don't.
You understood my poorly worded question.
That's what the WIPs are for in PR titles!
@amueller, is it okay to allow passing a function? Why won't previous FeatureUnions unpickle?
I think there is a pickling problem because old ones don't have a fields attribute, right? Can you give an application for passing a function? I would rather have data-transforming functions in transformer objects, I think.
@jnothman Strictly speaking, they will unpickle just fine. A v0.15 pickle hydrated in v0.16 will not have self.fields and will bonk when calling _check_fields().
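To make the failure mode concrete, here is a generic sketch (not the PR's actual classes; `fields` and `_check_fields` are just the names mentioned above): unpickling restores an object's __dict__ without re-running __init__, so an attribute added in a newer release is simply missing.

    import pickle

    class FeatureUnionOld:
        """Stand-in for the 0.15-era class: no `fields` attribute."""
        def __init__(self, transformer_list):
            self.transformer_list = transformer_list

    class FeatureUnionNew:
        """Stand-in for the 0.16-era class: methods assume `self.fields` exists."""
        def __init__(self, transformer_list, fields=None):
            self.transformer_list = transformer_list
            self.fields = fields

        def _check_fields(self):
            # Raises AttributeError on instances restored from old pickles.
            return self.fields

    # Pickle an "old" instance, then restore its state onto the "new" class,
    # which is roughly what happens when an old pickle meets new code.
    state = pickle.loads(pickle.dumps(FeatureUnionOld(transformer_list=[]))).__dict__
    restored = FeatureUnionNew.__new__(FeatureUnionNew)
    restored.__dict__.update(state)
    # restored._check_fields()  # -> AttributeError: no attribute 'fields'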
Force-pushed b701a58 to edcc143.
Force-pushed edcc143 to 5f6f7af.
Can I has reviews?
    --------
    >>> from sklearn.preprocessing import Normalizer
    >>> union = FeatureUnion([("norm1", Normalizer(norm='l1')), \
            ("norm2", Normalizer(norm='l1'))], \
I'm not really sure it's a useful example if both undergo the same column-wise transformation.
It is. If both are histograms this is different than doing it per column ;)
Ah. I've never used Normalizer before. I confused it for a feature scaler. It's norming each sample. Thanks...
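A quick check of the point above, as a small sketch: Normalizer(norm='l1') works per sample (each row is divided by its own L1 norm), whereas a column-wise scaler such as MinMaxScaler rescales each feature across samples.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, Normalizer

    X = np.array([[1.0, 3.0],
                  [2.0, 2.0]])

    # Row-wise: each sample is divided by its own L1 norm, so rows sum to 1.
    print(Normalizer(norm='l1').fit_transform(X))
    # [[0.25 0.75]
    #  [0.5  0.5 ]]

    # Column-wise: each feature is rescaled independently across samples.
    print(MinMaxScaler().fit_transform(X))
    # [[0. 1.]
    #  [1. 0.]]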
Changed my mind: can I suggest that this new functionality be noted in the description of
now LGTM! ;)
Thanks for your help @jnothman :) Any other reviews?
@ogrisel a review would be much appreciated ;)
Force-pushed 0a3b75a to ddefe62.
Force-pushed ddefe62 to f31d5db.
Force-pushed f31d5db to fec5175.
Ping @ogrisel, what do you think of this?
I'm new to the code base but FWIW LGTM. I'm also biased since I'm excited to see this merged. :)
While I understand the problem that this is trying to solve, and I think that it is very important, I am a bit worried by the indexing in the sample direction. The changes are toying with our clear convention that the first direction of indexing should be the sample direction. The implications of such blurring of conventions are probably very deep. In particular, I expect code validation and error reporting to become harder and harder. I know where this comes from: pandas has very strange indexing logic, and as a result an API that is hard to learn and error messages that are very open-ended. By contrast, scikit-learn has so far had a very strict set of conventions, which has made it easier to learn and to give good error messages.

As this change basically introduces a new kind of input data, to capture heterogeneous data, I suggest that it be confined to a new sub-module, in which objects only deal with heterogeneous data and refuse to deal with standard data matrices. We could document this module as the part of scikit-learn dedicated to heterogeneous data, and define the input data type as anything that, when indexed with a string, returns an array of length n_samples. This would enable us to support pandas DataFrames, dictionaries of 1D arrays, and structured dtypes. It would probably make the documentation, discovery and future evolution of such support easier.

As a side note, the name 'field' is very unclear to me. I understood where it came from after reading half of the pull request, because the pull request has an obvious consistency and story behind it, but looking locally at a bit of the code I had a hard time understanding why a 'field' was applied in the sample direction.
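As a sketch of the proposed input contract (the container contents below are illustrative), each of the following returns an array of length n_samples when indexed with a string:

    import numpy as np
    import pandas as pd

    n_samples = 3

    # dict of 1D arrays
    as_dict = {'age': np.array([25, 32, 47]),
               'city': np.array(['Paris', 'Lille', 'Lyon'])}

    # pandas DataFrame
    as_frame = pd.DataFrame(as_dict)

    # structured NumPy array
    as_struct = np.array([(25, 'Paris'), (32, 'Lille'), (47, 'Lyon')],
                         dtype=[('age', int), ('city', 'U10')])

    # All three satisfy the contract: indexing with a field name yields an
    # array-like of length n_samples.
    for data in (as_dict, as_frame, as_struct):
        assert len(data['age']) == n_samples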
Well, @GaelVaroquaux feels strongly about not having anything that slices in the first direction in a general module. If we keep the FeatureUnion interface for this transformer there would be much less duplication.
Okay. Together or apart aside: I am tempted to suggest we just have a list of pairs for the transformers list. The first element in each pair would be both the field name and the transformer name, which means that field names must be valid identifiers not containing '__'.
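As an illustration of this proposal (a hypothetical construction; 'body' and 'headers' are made-up field names, and this is not the API that was merged):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Each pair is (field name == step name, transformer). Because the name is
    # also used for parameter addressing (e.g. 'body__ngram_range' in a grid
    # search), it must not contain the reserved '__' separator.
    transformers = [
        ('body', TfidfVectorizer()),
        ('headers', DictVectorizer()),
    ]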
OK, then I'll go back to the old interface, which would get rid of the sorting issue, give the same interface as in pipeline / feature union, and remove the duplication. So that seems like a good idea.
Does this API allow multiple transformers for the same column? Maybe they get automatically aliased to
The current implementation does allow multiple transformers for the same column. I'm suggesting an interface that does not; for such things was FeatureUnion created.

But part of the problem here is that we have come to the limits of an interface where all configuration is provided in an object's construction using Python/NumPy primitives. Pipeline-like construction would be much more readable, future-proof and self-documenting if we used a factory idiom, doing away with complicated nested and parallel dict/list/tuple monsters:

    myunion = FeatureUnion().append(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)

Something similar could be achieved (uglier, IMO) by providing a namedtuple-like class in the API:

    myunion = FeatureUnion([UnionEntry(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)])

Indeed, this would likely happen behind the scenes in the former example. Yes, both of these begin to look like Java, but perhaps, used with discretion, they are the right way to make usable APIs and legible code. I actually think the former passes the Zen test better than the incumbent on the "Flat is better than nested... Readability counts... Practicality beats purity" measures.
By "this" I indeed meant the API you proposed. It's probably an important feature. |
And by "you" you mean me, or @amueller? If me, I had been proposing to not allow multiple transformers for the same column without nesting a FeatureUnion.
I feel that if we make users nest FeatureUnion and ColumnTransformer then we have failed. What we could do is use the "name is the column name" thing by default and keep the feature union. And if anyone wants to use multiple transformers on the same column, they have to provide a separate array of column names.
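For comparison only (not part of this PR): the interface that eventually shipped in sklearn.compose from 0.20 resolves this tension by making the columns explicit in each entry, so the step name is independent of the column name and several transformers can target the same column without nesting. A runnable sketch:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import StandardScaler

    X = pd.DataFrame({
        'text': ['spam spam', 'ham', 'spam ham'],
        'length': [9.0, 3.0, 8.0],
        'digits': [0.0, 1.0, 2.0],
    })

    ct = ColumnTransformer([
        # (name, transformer, columns): names and columns are independent.
        ('bow', CountVectorizer(), 'text'),                  # single column, passed as 1D
        ('scaled', StandardScaler(), ['length', 'digits']),  # list of columns, passed as 2D
    ])
    Xt = ct.fit_transform(X)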
    class ColumnTransformer(BaseEstimator, TransformerMixin):
        """Applies transformers to columns of a dataframe / dict.

        This estimator applies a transformer objects to each column or field of the
applies transformer objects
:)
Force-pushed 5e42880 to 6addec5.
Force-pushed 6addec5 to c3b6568.
    class ColumnTransformer(BaseEstimator, TransformerMixin):
        """Applies transformers to columns of a dataframe / dict.
While I see the point of this transformer on dataframes and dicts, I find it too bad we cannot apply it to NumPy arrays. I would love to see a built-in way to apply transformers to selected columns only.
(Coming late to the party, this might have been discussed before...)
That would be pretty easy with the FunctionTransformer #4798
Indeed, +1
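A sketch of the workaround being suggested (column positions and transformers below are arbitrary choices): a FunctionTransformer that slices columns of a plain NumPy array, piped into the per-block transformer inside a FeatureUnion.

    import numpy as np
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.preprocessing import FunctionTransformer, Normalizer, StandardScaler

    X = np.arange(20, dtype=float).reshape(5, 4)

    def select_columns(columns):
        # A stateless transformer that keeps only the given column positions.
        return FunctionTransformer(lambda X: X[:, columns], validate=False)

    union = FeatureUnion([
        ('scaled_01', make_pipeline(select_columns([0, 1]), StandardScaler())),
        ('normed_23', make_pipeline(select_columns([2, 3]), Normalizer())),
    ])
    Xt = union.fit_transform(X)  # shape (5, 4): the two transformed column blocks, side by side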
            Input data, used to fit transformers.
        """
        transformers = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one_transformer)(trans, X[column], y)
Should use .iloc if it exists, otherwise slice in the second direction, and allow multiple columns.
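A sketch of what that suggestion could look like (the helper name is made up, not the PR's code): prefer the positional indexer when the container provides one, otherwise slice the second axis of an array-like; columns may be a single position or a list of positions.

    import numpy as np

    def _get_columns(X, columns):
        # Positional selection of one or more columns.
        if hasattr(X, 'iloc'):
            return X.iloc[:, columns]       # pandas DataFrame
        return np.asarray(X)[:, columns]    # NumPy array or other array-like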
Should we close this in favor of #9012, to clean up the tracker?
Fixes #2034.
Todo:
Also see here for how this would help people.