[MRG] ENH: Adds inverse_transform to ColumnTransformer #11639
Conversation
    for name, trans, cols in self.transformers:
        col_indices = _get_column_indices(X, cols)
        if not all_indexes.isdisjoint(set(col_indices)):
            self._invertible = (False,
This doesn't appear to be covered by tests.
test_column_transformer_inverse_transform_with_overlaping_slices should cover this; I added the ValueError message in the assertion.
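For reference, the overlap check under discussion boils down to set disjointness over the column indices each transformer consumes. A minimal standalone sketch (the helper name is hypothetical, not the PR's actual code):

```python
def find_overlap(transformer_columns):
    """Return True if any two transformers claim the same input column."""
    seen = set()
    for cols in transformer_columns:
        cols = set(cols)
        if not seen.isdisjoint(cols):
            return True  # overlapping columns: inverse_transform is ambiguous
        seen |= cols
    return False

# Columns [0, 1] and [1, 2] overlap on column 1, so inversion is ambiguous.
print(find_overlap([[0, 1], [1, 2]]))  # True
print(find_overlap([[0, 1], [2, 3]]))  # False
```

When two transformers both produced output from column 1, there is no single value to write back into it, which is why the PR raises a ValueError in this case.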
    if not Xs:
        # All transformers are None
        return np.zeros((X.shape[0], 0))
This is not covered by tests
I added test_column_transformer_inverse_transform_all_transformers_drop to cover this.
        inverse_Xs[:, indices] = inverse_X

    if self._X_is_sparse:
        return sparse.csr_matrix(inverse_Xs)
This is not covered by tests
I added an assert to test_column_transformer_sparse_array to cover this.
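The sparse branch above can be illustrated in isolation: the inverse is assembled in an intermediate buffer and converted back to CSR only when the original X was sparse. A sketch, assuming scipy is available:

```python
import numpy as np
from scipy import sparse

# Dense buffer for the reassembled inverse; a lil_matrix would work similarly.
inverse_Xs = np.zeros((2, 3))
inverse_Xs[:, [0, 2]] = np.array([[1.0, 2.0], [3.0, 4.0]])

# Mirror of the branch under review: return CSR when the input was sparse.
X_was_sparse = True
result = sparse.csr_matrix(inverse_Xs) if X_was_sparse else inverse_Xs
print(sparse.issparse(result))  # True
```

This keeps the round-trip type-consistent: a sparse input yields a sparse inverse.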
        Returns
        -------
        Xt : array-like, shape = [n_samples, n_features]
Note that it is a pandas DataFrame when ..
This is not resolved
Not yet a full review
        try:
            import pandas as pd
            return pd.DataFrame(inverse_Xs, columns=self._X_columns)
        except ImportError:
This makes no sense. Either we promise pandas or we don't. Changing the return type on the basis of not having a dependency can't happen.
    input_indices = []
    for name, trans, cols in self.transformers:
        col_indices = _get_column_indices(X, cols)
        if not all_indexes.isdisjoint(set(col_indices)):
I'm probably missing something. I can't see where you update all_indexes to be non-empty. If I'm right, add tests
Good catch! It turns out pytest.raises(..., match=...) was needed to properly test the exception.
        input_indices.append(col_indices)

    # check remainder
    remainder_indices = self._remainder[2]
We should make _remainder a namedtuple so that this is more legible
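The namedtuple suggestion would replace the positional self._remainder[2] with a named field. A hypothetical sketch (field names are illustrative, not the actual implementation):

```python
from collections import namedtuple

# Illustrative structure mirroring the (name, transformer, indices) triple.
Remainder = namedtuple('Remainder', ['name', 'transformer', 'indices'])

remainder = Remainder('remainder', 'passthrough', [3, 4])

# The opaque self._remainder[2] becomes the self-documenting:
print(remainder.indices)  # [3, 4]
```

Since namedtuples still support positional indexing, existing call sites would keep working during a transition.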
    input_indices.append(remainder_indices)

    self._input_indices = input_indices
    self._X_features = X.shape[1]
Perhaps name this _n_features_in
Thanks for working on it!
Added a few comments (didn't look at the tests yet)
        col_indices_set = set(col_indices)
        if not all_indexes.isdisjoint(col_indices_set):
            self._invert_error = ("Unable to invert: transformers "
                                  "contain overlaping columns")
overlaping -> overlapping
I also think the "Unable to invert" is already included in the message in inverse_transform?
"contain overlaping columns") | ||
return | ||
if trans == 'drop': | ||
self._invert_error = "'{}' drops columns".format(name) |
I would add something explicitly saying that dropping columns is not supported
        Private function to calcuate indicies for inverse_transform
        """
        # checks for overlap
        all_indexes = set()
minor nit, but maybe also use 'indices' instead of 'indexes' since all other variables do that?
        self._input_indices = input_indices
        self._n_features_in = X.shape[1]
        self._X_columns = X.columns if hasattr(X, 'columns') else None
maybe getattr(X, 'columns', None)?
        self._output_indices = []
        cur_index = 0

        Xs = self._fit_transform(X[0:1], None, _transform_one, fitted=True)
instead of transforming here again, we could also save the dimensions of the outputs in self.fit_transform itself?
or actually, you can also pass Xs if we want to keep the code here?
That is a great idea! Thank you!
            trans = FunctionTransformer(
                validate=False, accept_sparse=True, check_inverse=False)

        inv_transformers.append((name, trans, sub, get_weight(name)))
it seems name is not used below, so no need to pass it?
            inverse_Xs = sparse.lil_matrix((Xs[0].shape[0],
                                            self._n_features_in))
        else:
            inverse_Xs = np.zeros((Xs[0].shape[0], self._n_features_in))
zeros -> empty?
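The zeros -> empty suggestion works because, in the non-overlapping, non-dropping case, every column of the buffer is overwritten by some transformer's inverse (or by the remainder), so zero-initialization is wasted work. A sketch of why np.empty is safe only under that full-coverage assumption:

```python
import numpy as np

n_samples, n_features_in = 4, 3
# np.empty skips zero-filling; safe only if every column is assigned below.
inverse_Xs = np.empty((n_samples, n_features_in))

# Each (indices, block) pair covers a disjoint slice of the input columns;
# together they cover all n_features_in columns, so no garbage survives.
for indices, block in [([0, 2], np.ones((4, 2))), ([1], np.full((4, 1), 5.0))]:
    inverse_Xs[:, indices] = block

print(inverse_Xs[0])  # [1. 5. 1.]
```

If a transformer could drop columns, uncovered columns would keep whatever bytes np.empty left behind, which is presumably why the PR raises for 'drop' before reaching this point.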
            else:
                inverse_Xs[:, indices] = inverse_X.toarray()
        else:
            if inverse_X.ndim == 1:
is this possible?
Are you referring to inverse_Xs[:, indices] = ... or inverse_X.ndim == 1?
inverse_Xs[:, indices] = ... runs when inverse_X is sparse and X is not sparse. test_column_transformer_sparse_array was updated to test this. inverse_X.ndim == 1 runs when inverse_X is not sparse and only has one dimension. test_column_transformer_sparse_stacking tests for this use case.
I meant the second, about inverse_X being 1-dimensional. I would think that any valid sklearn transformer should always return 2D output? (but maybe I am overlooking something). And so I would expect input for inverse_transform to have the same constraint.
preprocessing.LabelEncoder's transform and inverse_transform return 1-D arrays.
LabelEncoder shouldn't be used on X
This PR was updated to remove the check for 1D outputs.
Force-pushed from 350652b to 81e3e68
Force-pushed from b6db9ca to 360d09b
This pull request introduces 1 alert when merging 9d0cd00 into 7166cd5 - view the new alerts on LGTM.com.
Comment posted by LGTM.com
I wonder if we could come up with a helpful example of this. I'm not sure it's necessary. Noting the available functionality in the user guide might be worthwhile.
Only a partial review.
            else:
                inverse_Xs[:, indices] = inverse_X.toarray()
        else:
            if inverse_X.ndim == 1:
LabelEncoder shouldn't be used on X
sklearn/pipeline.py
Outdated
@@ -609,6 +609,12 @@ def _transform_one(transformer, X, y, weight, **fit_params):
    return res * weight


def _inverse_transform_one(transformer, X, weight, **fit_params):
    weight = weight or 1
I'd rather make this 1 if weight is None else weight, more explicitly... bool(weight) is not a good way to check for None.
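The concern is that weight or 1 treats a legitimate weight of 0 (or 0.0) the same as None, while the explicit None check does not. A quick illustration:

```python
def scale_truthy(res, weight):
    weight = weight or 1          # bug: weight=0 silently becomes 1
    return res * weight

def scale_explicit(res, weight):
    weight = 1 if weight is None else weight  # only None gets the default
    return res * weight

print(scale_truthy(10, 0))       # 10  (the zero weight was ignored)
print(scale_explicit(10, 0))     # 0   (the zero weight is respected)
print(scale_explicit(10, None))  # 10  (default applies only to None)
```

Whether a zero weight is meaningful here depends on how FeatureUnion uses transformer_weights, but the explicit check avoids the ambiguity entirely.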
A motivating example would be nice. I will see if there is a place to add it in
This modifies whats_new/v0.20.rst and should be updated
I have a use case internally where I want to use the preprocessor to process
@glemaitre thank you for the idea! Let's see if there is a way to integrate it into one of our examples 🤔
do you want to merge with master for another round of reviews?
@glemaitre What kind of insights do you get when you have the quantiles in the original scale?
@@ -456,6 +461,60 @@ def _fit_transform(self, X, y, func, fitted=False):
        else:
            raise


    def _calculate_inverse_indices(self, X, Xs):
        """
        Private function to calcuate indices for inverse_transform
Minor typo:
- Private function to calcuate indices for inverse_transform
+ Private function to calculate indices for inverse_transform
Ouch. I don't remember what was my use case now. I should have directly given what I was programming at that time. It could have been linked to some neuroscience stuff but I am unsure now. |
Is this merge-able? |
@Jude188 Not right now, do you have an example where this feature would be useful to you? |
@thomasjpfan I've got a case where I'm using Sklearn for only data preprocessing and not using an estimator. This is because the model that I have doesn't really have a ... I'm sure I did a terrible job of explaining that!
Hello, any news on this PR? I'm working on interpretability and started adding an inverse_transform for ColumnTransformer. I stumbled upon this PR and it would be extremely helpful to me.
Reference Issues/PRs
Fixes #11463
What does this implement/fix? Explain your changes.
- inverse_transform with overlap or 'drop' will raise a ValueError
- _calculate_inverse_indices is used to connect indices from the output space back to the input space.
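The index bookkeeping the PR describes can be sketched independently of scikit-learn: record which input columns each fitted transformer consumed, then scatter each transformer's inverse back into those columns. A simplified, hypothetical version of the idea (not the PR's actual code):

```python
import numpy as np

def inverse_transform_sketch(blocks, input_indices, n_features_in):
    """blocks[i] is transformer i's inverse output; input_indices[i] lists the
    original input columns it covered. Scatter blocks back into X's shape."""
    n_samples = blocks[0].shape[0]
    Xt = np.zeros((n_samples, n_features_in))
    for block, indices in zip(blocks, input_indices):
        Xt[:, indices] = block  # disjointness guarantees no column is written twice
    return Xt

# Transformer 1 covered column 1; transformer 2 covered columns 0 and 2.
blocks = [np.array([[1.0], [2.0]]), np.array([[3.0, 4.0], [5.0, 6.0]])]
X = inverse_transform_sketch(blocks, [[1], [0, 2]], 3)
print(X)  # [[3. 1. 4.]
          #  [5. 2. 6.]]
```

The disjointness requirement is exactly why overlapping column specifications (and 'drop') make the real inverse_transform raise a ValueError.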