[MRG + 1] ENH add check_inverse in FunctionTransformer #9399
Conversation
Just some thoughts on when this might be too restrictive
@@ -59,23 +62,60 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

        .. deprecated::0.19

    check_inverse : bool, (default=False)
        Whether to check that ``transform`` followed by ``inverse_transform``
        or ``func`` followed by ``inverse_func`` leads to the original targets.
Targets -> input
            size=n_subsample,
            replace=False)
        if sparse.issparse(X):
            X_sel = X[subsample_idx].A
I don't think you can be sure that the function accepts dense data, so you can't do this.
Also, use safe_indexing to support passing lists or DataFrames through.
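For reference, a rough sketch of what that suggestion looks like; `safe_indexing` lived in `sklearn.utils` at the time of this PR (it was later made private as `_safe_indexing`), and the data below is purely illustrative:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import safe_indexing  # made private as _safe_indexing in later releases

X_dense = np.arange(20).reshape(10, 2)
X_sparse = sparse.csr_matrix(X_dense)
X_list = X_dense.tolist()

idx = [0, 3, 7]
for X in (X_dense, X_sparse, X_list):
    # Row selection that works uniformly for arrays, sparse matrices,
    # lists and DataFrames, without densifying sparse input.
    X_sel = safe_indexing(X, idx)
    print(type(X_sel))
```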
    def _validate_inverse(self, X):
        """Check that func and inverse_func are the inverse."""
        # Apply the transform and inverse_transform on few samples.
        X = check_array(X, accept_sparse='csr')
This makes too many assumptions to be practical. Then again allclose below will convert to array (but not to a particular shape or sparse format, etc)
@jnothman I used the […]. Also, it seems that […]
ahh right. sure, it's worth mentioning.
    def _validate_inverse(self, X):
        """Check that func and inverse_func are the inverse."""
        random_state = check_random_state(self.random_state)
utils.resample?
Looks good apart from prints and possible use of resample.
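For reference, a minimal sketch of what the suggested `sklearn.utils.resample` call could look like in place of the manual `choice`-based subsampling (the array and values here are illustrative, not the PR's actual code):

```python
import numpy as np
from sklearn.utils import resample

X = np.random.RandomState(0).rand(1000, 3)

# Draw a small subsample without replacement, keeping the round-trip check cheap.
X_sel = resample(X, replace=False, n_samples=100, random_state=0)
print(X_sel.shape)  # (100, 3)
```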
            replace=False)

        X_sel = safe_indexing(X, subsample_idx)
        print(subsample_idx)
print?
              sparse.csc_matrix(X_dense)]

    for X in X_list:
        print(X)
print?
LGTM apart from the PEP8 break.
Added some comments.

One further thing: I was wondering whether the random_state kwarg is worth adding. It's used by check_inverse, so if you for some reason need to control it, I agree you need this keyword, but it just feels a bit like adding 'noise' to the class for most cases.
@@ -59,23 +60,54 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

        .. deprecated::0.19

    check_inverse : bool, (default=False)
        Whether to check that ``transform`` followed by ``inverse_transform``
        or ``func`` followed by ``inverse_func`` leads to the original inputs.
mentioning both transform/inverse_transform and func/inverse_func is a leftover from copying from the other PR?
As here you only have func and inverse_func as kwargs.
@@ -59,23 +60,54 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

        .. deprecated::0.19

    check_inverse : bool, (default=False)
Spurious comma?
On line 23 above (can't put a comment there), it is stated that "A FunctionTransformer will not do any checks on its function's output", which is still correct for the default, but might get an update to mention check_inverse?
        try:
            assert_allclose_dense_sparse(
                X_sel, self.inverse_transform(self.transform(X_sel)),
                atol=1e-7)
why not the default?
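One plausible reason for setting an explicit `atol` (this is an illustration of the numerical issue, not something taken from the PR discussion): a purely relative tolerance cannot absorb tiny round-trip errors at entries that are exactly zero.

```python
import numpy as np

a = np.array([0.0, 1.0])
b = a + 1e-9  # tiny round-trip error left by func followed by inverse_func

# rtol scales with |b|, so at the exact-zero entry it allows no error at all;
# an absolute tolerance absorbs the small discrepancy.
print(np.allclose(a, b, rtol=1e-7, atol=0.0))   # False
print(np.allclose(a, b, rtol=1e-7, atol=1e-7))  # True
```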
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by np.random. Note that this is used to compute if func and
        inverse_func are the inverse of each other.
I would clarify that this is only done for check_inverse=True
@amueller @jnothman what are your thoughts regarding the […]
Drop […]
My main concern with this PR is that users aren't going to use the feature if it's not the default. Yes, we might internally. Should we make it eventually be default behaviour? But doing so also complicates what's meant to be a simple utility.
@jnothman If we make it the default it will not be backward-compatible, will it? (We might need to deprecate the current behaviour.) As a default it would make sense, since it is the behaviour the majority should want.
It makes sense to me.
Not sure what you mean by "take a prefix of the data". If it is what I am thinking of, I am not sure that random sampling will make it more robust.
👍 for making it deterministic.
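A rough sketch of one deterministic alternative to random subsampling (illustrative only; the merged code may differ in the details): take evenly spaced rows, so repeated fits always check the same samples.

```python
import numpy as np

X = np.random.RandomState(0).rand(1000, 3)

# Check roughly 100 evenly spaced rows instead of a random subset,
# so the result does not depend on any random_state.
step = max(1, X.shape[0] // 100)
X_sel = X[::step]
print(X_sel.shape)  # (100, 3)
```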
Test failure is related.
Codecov Report
@@ Coverage Diff @@
## master #9399 +/- ##
==========================================
+ Coverage 96.19% 96.19% +<.01%
==========================================
Files 335 335
Lines 61800 61825 +25
==========================================
+ Hits 59448 59473 +25
Misses 2352 2352
Continue to review full report at Codecov.
@jnothman do you see any other changes?
@@ -59,6 +58,12 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

        .. deprecated::0.19

    check_inverse : bool, default=False
        Whether to check that or ``func`` followed by ``inverse_func`` leads to
        the original inputs.
I think this is weak on motivation. Are we trying to recommend it basically as a sanity check / quality assurance?
Sanity check seems a good motivation
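As an illustration of the sanity-check motivation (this uses the parameter as it ended up in released scikit-learn, where `check_inverse` defaults to True; the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Well-matched pair: log1p / expm1 round-trip cleanly, so no warning is raised.
ok = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, check_inverse=True)
ok.fit(X)

# Mismatched pair: exp does not invert sqrt, so fit warns that the provided
# functions are not strictly inverse of each other.
bad = FunctionTransformer(func=np.sqrt, inverse_func=np.exp, check_inverse=True)
bad.fit(X)
```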
                                 inverse_func=np.log1p,
                                 accept_sparse=accept_sparse,
                                 check_inverse=True)
    Xt = trans.fit_transform(X)
assert_no_warning?
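Presumably this refers to the helper in `sklearn.utils.testing` (moved to `sklearn.utils._testing` in later releases); a sketch of how the "good case" could be asserted warning-free:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.utils.testing import assert_no_warnings  # sklearn.utils._testing in newer versions

X = np.random.RandomState(0).rand(10, 2)
trans = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, check_inverse=True)

# Fails the test if any warning (including the check_inverse one) is emitted.
Xt = assert_no_warnings(trans.fit_transform, X)
```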
doc/modules/preprocessing.rst (outdated)

@@ -610,6 +610,9 @@ a transformer that applies a log transformation in a pipeline, do::
    array([[ 0.        ,  0.69314718],
           [ 1.09861229,  1.38629436]])

We can ensure that ``func`` and ``inverse_func`` are the inverse of each other
I would say "You" but doesn't matter ;)
LGTM (still) apart from checking no warning is raised in the good case.
@jnothman Could you put this PR in your revision queue?
@@ -92,7 +121,9 @@ def fit(self, X, y=None):
        self
        """
        if self.validate:
-           check_array(X, self.accept_sparse)
+           X = check_array(X, self.accept_sparse)
+       if self.check_inverse:
and self.inverse_func is not None
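A rough sketch of the guard being suggested, in a skeleton class used for illustration only (not the PR's actual code):

```python
from sklearn.utils import check_array


class _SketchTransformer:
    """Illustrative skeleton showing where the extra guard would go in fit."""

    def __init__(self, func=None, inverse_func=None, validate=True,
                 accept_sparse=False, check_inverse=True):
        self.func = func
        self.inverse_func = inverse_func
        self.validate = validate
        self.accept_sparse = accept_sparse
        self.check_inverse = check_inverse

    def _check_inverse_transform(self, X):
        pass  # the round-trip check would live here

    def fit(self, X, y=None):
        if self.validate:
            X = check_array(X, self.accept_sparse)
        # Only run the check when there is actually a pair of functions to
        # compare; with func or inverse_func left as None (identity),
        # nothing can be verified.
        if (self.check_inverse and self.func is not None
                and self.inverse_func is not None):
            self._check_inverse_transform(X)
        return self
```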
" inverse of each other. If you are sure you" | ||
" want to proceed regardless, set" | ||
" 'check_inverse=False'. This warning will turn to" | ||
" an error from 0.22", DeprecationWarning) |
It should not be a DeprecationWarning. It's a UserWarning, I think, or perhaps a RuntimeWarning.
We could just allow the user to use a warning filter to make it raise an error (particularly if we define a new warning class), rather than changing behaviour in two versions. I think that's okay in the transformed target case too...
After all, raising an error because the inverse transform was not perfect seems a bit harsh.
I am not against keeping a warning
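For what it's worth, a sketch of how a user could opt in to an error via a warning filter, assuming the message wording of the released version (the data is made up):

```python
import warnings
import numpy as np
from sklearn.preprocessing import FunctionTransformer

with warnings.catch_warnings():
    # Escalate the round-trip warning locally instead of changing library
    # behaviour across two versions.
    warnings.filterwarnings("error", message=".*not strictly inverse.*",
                            category=UserWarning)
    bad = FunctionTransformer(func=np.sqrt, inverse_func=np.exp, check_inverse=True)
    bad.fit(np.array([[1.0, 4.0], [9.0, 16.0]]))  # now raises instead of warning
```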
Believe it or not, the current test suite doesn't appear to include a case of actually fitting a […]
Otherwise LGTM
        .. versionadded:: 0.20

        .. deprecated:: 0.20
           ``check_inverse=True`` is currently raising a warning if the
Please update. Definitely don't use the deprecated tag.
@@ -610,6 +610,9 @@ a transformer that applies a log transformation in a pipeline, do::
    array([[ 0.        ,  0.69314718],
           [ 1.09861229,  1.38629436]])

You can ensure that ``func`` and ``inverse_func`` are the inverse of each other
Please note [in the docs] that a warning is raised and can be turned into an error; give a simplefilter example which tests the message.
@jnothman so it should be ready to be merged this time :)
Sorry, these should've been included in the previous batch of comments, but GitHub must've glitched.
doc/modules/preprocessing.rst (outdated)

    >>> import warnings
    >>> warnings.filterwarnings("error", message="The provided functions are not"
    ...                         " strictly inverse of each other", append=True)
why append=True? append=False just means it takes precedence over other filters.
I think we want to make the message check loose enough to allow us to change the message. Perhaps '.*check_inverse.*' but with the UserWarning filter on category, so that if we ever deprecate check_inverse the filter isn't fired. Or else we should just make a new category.
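A sketch of the two filter shapes being discussed (the warning class name below is purely illustrative, not something scikit-learn defines):

```python
import warnings

# Option 1: loose message match so the wording can change, pinned to UserWarning
# so a hypothetical future DeprecationWarning about check_inverse is not caught.
warnings.filterwarnings("error", message=".*check_inverse.*", category=UserWarning)


# Option 2: a dedicated warning class, so users can filter on category alone.
class InverseCheckWarning(UserWarning):
    """Hypothetical category for the check_inverse round-trip warning."""


warnings.filterwarnings("error", category=InverseCheckWarning)
```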
@jnothman this is corrected
I'm fine with this, but @amueller should confirm his +1 since we changed to always warn, rather than having a deprecated behaviour.
And it's still possible we should make it a custom warning class so filtering is easy. But I'm ambivalent.
thanks!
…9399)

* EHN add check_inverse in FunctionTransformer
* Add whats new entry and short narrative doc
* Sparse support
* better handle sparse data
* Address andreas comments
* PEP8
* Absolute tolerance default
* DOC fix docstring
* Remove random state and make check_inverse deterministic
* FIX remove random_state from init
* PEP8
* DOC motivation for the inverse
* make check_inverse=True default with a warning
* PEP8
* FIX get back X from check_array
* Andread comments
* Update whats new
* remove blank line
* joel s comments
* no check if one of forward or inverse not provided
* DOC fixes and example of filterwarnings
* DOC fix warningfiltering
* DOC fix merge error git
Reference Issue

We mentioned in #9041 that it could be nice to port the ``check_inverse`` inside the ``FunctionTransformer``.

What does this implement/fix? Explain your changes.

Make the checking in the ``fit`` method.

Any other comments?
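For readers landing here later, a short usage sketch of the feature as it shipped (``check_inverse`` was added in 0.20 and ended up defaulting to True; the data below is made up):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [2.0, 3.0]])

# At fit time, check_inverse verifies on a subset of X that inverse_func
# really undoes func, and warns if the round trip does not match.
trans = FunctionTransformer(func=np.log1p, inverse_func=np.expm1,
                            check_inverse=True)
Xt = trans.fit_transform(X)
print(np.allclose(X, trans.inverse_transform(Xt)))  # True
```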