
[MRG + 1] ENH add check_inverse in FunctionTransformer #9399


Merged

Conversation

glemaitre

Reference Issue

We mentioned in #9041 that it would be nice to port check_inverse into FunctionTransformer.

What does this implement/fix? Explain your changes.

Perform the check in the fit method.
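In spirit, the fit-time check works like this (a minimal sketch in plain NumPy, not the actual scikit-learn code; the function name, subsample size, and tolerance are illustrative):

```python
import numpy as np

def check_inverse(func, inverse_func, X, n_subsample=100):
    """Illustrative version of the check: applying func then
    inverse_func on a subsample of X should recover that subsample."""
    # Deterministic, copy-free subsample of at most ~n_subsample rows.
    X_sel = X[::max(1, X.shape[0] // n_subsample)]
    if not np.allclose(inverse_func(func(X_sel)), X_sel, atol=1e-7):
        raise ValueError("The provided functions are not strictly"
                         " inverse of each other.")

X = np.abs(np.random.RandomState(0).randn(200, 3))
check_inverse(np.log1p, np.expm1, X)  # passes: expm1 inverts log1p
```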

Any other comments?


@jnothman jnothman left a comment


Just some thoughts on when this might be too restrictive

@@ -59,23 +62,60 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

.. deprecated::0.19

check_inverse : bool, (default=False)
Whether to check that ``transform`` followed by ``inverse_transform``
or ``func`` followed by ``inverse_func`` leads to the original targets.

Targets -> input

size=n_subsample,
replace=False)
if sparse.issparse(X):
X_sel = X[subsample_idx].A

I don't think you can be sure that the function accepts dense data, so you can't do this.


Also, use safe_indexing to support passing lists or DataFrames through.
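For reference, the kind of dispatch `safe_indexing` performs can be sketched roughly like this (a simplified stand-in, not the real `sklearn.utils.safe_indexing`):

```python
import numpy as np

def safe_indexing_sketch(X, indices):
    # Dispatch on container type so ndarrays, sparse matrices,
    # DataFrames and plain lists can all be subsampled uniformly.
    if hasattr(X, "iloc"):              # pandas DataFrame / Series
        return X.iloc[indices]
    if hasattr(X, "shape"):             # numpy array or scipy sparse
        return X[indices]
    return [X[i] for i in indices]      # plain Python list

print(safe_indexing_sketch([10, 20, 30], [0, 2]))  # [10, 30]
```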

def _validate_inverse(self, X):
"""Check that func and inverse_func are the inverse."""
# Apply the transform and inverse_transform on few samples.
X = check_array(X, accept_sparse='csr')

This makes too many assumptions to be practical. Then again allclose below will convert to array (but not to a particular shape or sparse format, etc)

@glemaitre

@jnothman I used assert_allclose_dense_sparse, which looks particularly ugly. Is it good enough, or should we move the base code into utils.allclose_dense_sparse?

Also, it seems that safe_indexing does not handle COO matrices. Is it worth mentioning this in the docstring?

@jnothman

jnothman commented Jul 19, 2017 via email


def _validate_inverse(self, X):
"""Check that func and inverse_func are the inverse."""
random_state = check_random_state(self.random_state)

utils.resample?


@amueller amueller left a comment


Looks good apart from prints and possible use of resample.

replace=False)

X_sel = safe_indexing(X, subsample_idx)
print(subsample_idx)

print?

sparse.csc_matrix(X_dense)]

for X in X_list:
print(X)

print?

@amueller amueller changed the title [MRG] EHN add check_inverse in FunctionTransformer [MRG + 1] EHN add check_inverse in FunctionTransformer Jul 21, 2017
@amueller

LGTM apart from the PEP8 break.


@jorisvandenbossche jorisvandenbossche left a comment


Added some comments.

One further thing I was wondering is whether the random_state kwarg is worth adding. It's used by check_inverse, so if you need to control it for some reason, I agree you need this keyword, but it just feels a bit like adding 'noise' to the class for most cases.

@@ -59,23 +60,54 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

.. deprecated::0.19

check_inverse : bool, (default=False)
Whether to check that ``transform`` followed by ``inverse_transform``
or ``func`` followed by ``inverse_func`` leads to the original inputs.

mentioning both transform/inverse_transform and func/inverse_func is a leftover from copying from the other PR?
As here you only have func and inverse_func as kwargs.

@@ -59,23 +60,54 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

.. deprecated::0.19

check_inverse : bool, (default=False)

spurious comma ?


On line 23 above (can't put a comment there), it is stated that "A FunctionTransformer will not do any checks on its function's output.", which is still correct for the default, but might get an update to mention check_inverse ?

try:
assert_allclose_dense_sparse(
X_sel, self.inverse_transform(self.transform(X_sel)),
atol=1e-7)

why not the default?

If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random. Note that this is used to compute if func and
inverse_func are the inverse of each other.

I would clarify that this is only done for check_inverse=True

@glemaitre

@amueller @jnothman what are your thoughts regarding random_state?
I have the same concerns as @jorisvandenbossche, so I would need advice on the way to go.

@jnothman

Drop random_state. Make it deterministic. An evenly-spaced slice, ``X[::max(1, X.shape[0] // 100)]``, perhaps (good because it avoids copying the array; bad because in some cases it will effectively only take a prefix of the data)?
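Concretely, the suggested deterministic subsample behaves like this (array sizes here are hypothetical, chosen just to illustrate the slicing):

```python
import numpy as np

X = np.arange(1000).reshape(500, 2)
# At most ~100 evenly-spaced rows, no copy, no random_state needed.
X_sel = X[::max(1, X.shape[0] // 100)]
assert X_sel.shape == (100, 2)
assert np.shares_memory(X_sel, X)  # a view, not a copy
```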

@jnothman

My main concern with this PR is that users aren't going to use the feature if it's not the default. Yes, we might internally. Should we eventually make it the default behaviour? But doing so also complicates what's meant to be a simple utility.

@glemaitre

@jnothman If we make it the default it will not be backward-compatible, will it? (We might need to deprecate the current behaviour.) As a default it would make sense, since the majority should want this behaviour.

Drop random_state. Make it deterministic.

It makes sense to me.

bad because in some cases it will only take a prefix of the data)

I'm not sure what you mean by "take a prefix of the data". If it is what I am thinking of, I am not sure that random sampling would make it more robust.

@amueller

👍 for making it deterministic.

@amueller

test failure related.

@codecov

codecov bot commented Jul 31, 2017

Codecov Report

Merging #9399 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9399      +/-   ##
==========================================
+ Coverage   96.19%   96.19%   +<.01%     
==========================================
  Files         335      335              
  Lines       61800    61825      +25     
==========================================
+ Hits        59448    59473      +25     
  Misses       2352     2352
Impacted Files Coverage Δ
sklearn/preprocessing/_function_transformer.py 97.95% <100%> (+0.66%) ⬆️
...n/preprocessing/tests/test_function_transformer.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 02a31f0...f3c0d10.

@glemaitre

@jnothman do you see any other changes?

@@ -59,6 +58,12 @@ class FunctionTransformer(BaseEstimator, TransformerMixin):

.. deprecated::0.19

check_inverse : bool, default=False
    Whether to check that ``func`` followed by ``inverse_func`` leads to
    the original inputs.

I think this is weak on motivation. Are we trying to recommend it basically as a sanity check / quality assurance?

@glemaitre

Sanity check seems a good motivation

inverse_func=np.log1p,
accept_sparse=accept_sparse,
check_inverse=True)
Xt = trans.fit_transform(X)

assert_no_warning?

@@ -610,6 +610,9 @@ a transformer that applies a log transformation in a pipeline, do::
array([[ 0. , 0.69314718],
[ 1.09861229, 1.38629436]])

We can ensure that ``func`` and ``inverse_func`` are the inverse of each other

I would say "You" but doesn't matter ;)
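As a quick sanity check (plain NumPy, independent of scikit-learn): `np.expm1` exactly inverts `np.log1p`, which is why this pair passes with `check_inverse=True`:

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 3.0]])
Xt = np.log1p(X)                     # forward transform: log(1 + x)
assert np.allclose(np.expm1(Xt), X)  # roundtrip recovers X exactly
```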

@amueller

amueller commented Aug 2, 2017

LGTM (still) apart from checking no warning is raised in the good case.

@glemaitre

@jnothman Could you put this PR in your revision queue?

@@ -92,7 +121,9 @@ def fit(self, X, y=None):
self
"""
if self.validate:
check_array(X, self.accept_sparse)
X = check_array(X, self.accept_sparse)
if self.check_inverse:

and self.inverse_func is not None

" inverse of each other. If you are sure you"
" want to proceed regardless, set"
" 'check_inverse=False'. This warning will turn to"
" an error from 0.22", DeprecationWarning)

It should not be a DeprecationWarning. It's a UserWarning, I think, or perhaps a RuntimeWarning.

We could just allow the user to use a warning filter to make it raise an error (particularly if we define a new warning class), rather than changing behaviour in two versions. I think that's okay in the transformed target case too...


After all, raising an error because the inverse transform was not perfect seems a bit harsh.

@glemaitre

I am not against keeping a warning

@jnothman

Believe it or not, the current test suite doesn't appear to include a case of actually fitting a FunctionTransformer where inverse_func is not provided.


@jnothman jnothman left a comment


Otherwise LGTM

.. versionadded:: 0.20

.. deprecated:: 0.20
``check_inverse=True`` is currently raising a warning if the

Please update. Definitely don't use deprecated tag

@@ -610,6 +610,9 @@ a transformer that applies a log transformation in a pipeline, do::
array([[ 0. , 0.69314718],
[ 1.09861229, 1.38629436]])

You can ensure that ``func`` and ``inverse_func`` are the inverse of each other

Please note that a warning is raised and can be turned into an error with: give simplefilter example which tests message

@glemaitre

@jnothman so it should be ready to be merged this time :)


@jnothman jnothman left a comment


Sorry, these should've been included in the previous batch of comments, but GitHub must've glitched.


>>> import warnings
>>> warnings.filterwarnings("error", message="The provided functions are not"
... " strictly inverse of each other", append=True)

why append=True? append=False just means it takes precedence over other filters.


I think we want to make the message check loose enough to allow us to change the message. Perhaps '.*check_inverse.*' but with the UserWarning filter on category, so that if we ever deprecate check_inverse the filter isn't fired. Or else we should just make a new category.
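The looser filter being proposed would look roughly like this (the message pattern and category follow the suggestion above, not necessarily what was finally merged):

```python
import warnings

# Escalate the check_inverse warning to an error without pinning the
# exact wording; restricting the filter to UserWarning means a future
# DeprecationWarning mentioning check_inverse would not trip it.
warnings.filterwarnings("error", message=".*check_inverse.*",
                        category=UserWarning)

try:
    warnings.warn("check_inverse failed: the provided functions are"
                  " not strictly inverse of each other", UserWarning)
    raised = False
except UserWarning:
    raised = True
print(raised)  # True: the warning was escalated to an error
```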

@glemaitre

@jnothman this is corrected

@jnothman jnothman changed the title [MRG + 1] EHN add check_inverse in FunctionTransformer [MRG + 1] ENG add check_inverse in FunctionTransformer Sep 1, 2017
@jnothman jnothman changed the title [MRG + 1] ENG add check_inverse in FunctionTransformer [MRG + 1] ENH add check_inverse in FunctionTransformer Sep 1, 2017
@jnothman

jnothman commented Sep 1, 2017

I'm fine with this, but @amueller should confirm his +1, since we changed to always warn rather than having deprecated behaviour.

@jnothman

jnothman commented Sep 1, 2017

And it's still possible we should make it a custom warning class so filtering is easy. But I'm ambivalent.

@glemaitre

@lesteve @amueller Can you take a look at this PR?

@amueller amueller merged commit 8472350 into scikit-learn:master Oct 25, 2017
@amueller

thanks!

donigian added a commit to donigian/scikit-learn that referenced this pull request Oct 28, 2017
…cs/donigian-update-contribution-guidelines

* 'master' of github.com:scikit-learn/scikit-learn: (23 commits)
  fixes scikit-learn#10031: fix attribute name and shape in documentation (scikit-learn#10033)
  [MRG+1] add changelog entry for fixed and merged PR scikit-learn#10005 issue scikit-learn#9633 (scikit-learn#10025)
  [MRG] Fix LogisticRegression see also should include LogisticRegressionCV(scikit-learn#9995) (scikit-learn#10022)
  [MRG + 1] Labels of clustering should start at 0 or -1 if noise (scikit-learn#10015)
  MAINT Remove redundancy in scikit-learn#9552 (scikit-learn#9573)
  [MRG+1] correct comparison in GaussianNB for 'priors' (scikit-learn#10005)
  [MRG + 1] ENH add check_inverse in FunctionTransformer (scikit-learn#9399)
  [MRG] FIX bug in nested set_params usage (scikit-learn#9999)
  [MRG+1] Fix LOF and Isolation benchmarks (scikit-learn#9798)
  [MRG + 1] Fix negative inputs checking in mean_squared_log_error (scikit-learn#9968)
  DOC Fix typo (scikit-learn#9996)
  DOC Fix typo: x axis -> y axis (scikit-learn#9985)
  improve example plot_forest_iris.py (scikit-learn#9989)
  [MRG+1] Deprecate pooling_func unused parameter in AgglomerativeClustering (scikit-learn#9875)
  DOC update news
  DOC Fix three typos in manifold documentation (scikit-learn#9990)
  DOC add missing dot in docstring
  DOC Add what's new for 0.19.1 (scikit-learn#9983)
  Improve readability of outlier detection example. (scikit-learn#9973)
  DOC: Fixed typo (scikit-learn#9977)
  ...
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…9399)

* EHN add check_inverse in FunctionTransformer

* Add whats new entry and short narrative doc

* Sparse support

* better handle sparse data

* Address andreas comments

* PEP8

* Absolute tolerance default

* DOC fix docstring

* Remove random state and make check_inverse deterministic

* FIX remove random_state from init

* PEP8

* DOC motivation for the inverse

* make check_inverse=True default with a warning

* PEP8

* FIX get back X from check_array

* Andread comments

* Update whats new

* remove blank line

* joel s comments

* no check if one of forward or inverse not provided

* DOC fixes and example of filterwarnings

* DOC fix warningfiltering

* DOC fix merge error git
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017