[MRG+1] More robust input validation, more testing. #4136

Merged
merged 4 commits into from
Feb 13, 2015

Conversation

amueller
Member

Test a lot more of the estimators for input validation etc.
Enforce that they can accept other dtypes, in particular float32.
Hack for parts of #4134, fixing parts of #4056, fixing #4124, fixing #4133 (debatable) and #4132.

@amueller amueller added the Bug label Jan 20, 2015
@amueller amueller added this to the 0.16 milestone Jan 20, 2015
@amueller amueller added the API label Jan 20, 2015
@amueller amueller changed the title Test more estimators, test dtype handling [MRG] Test more estimators, test dtype handling Jan 20, 2015
@amueller
Member Author

@VirgileFritsch if you have any time, your input on what is happening in the covariance module would be much appreciated.
If I use the one-sample dataset, it now crashes with a ValueError because of NaNs.

@amueller
Member Author

It looks like there was an issue in the tests. The X that is passed is 1d, and the covariance module didn't do proper input validation, so the X was not treated as one sample. If you reshape the X in the test in master to be (1, n_features), which I think was the intent given the comment, then the test also breaks in master.

@amueller
Member Author

For reference, the error was here: https://travis-ci.org/scikit-learn/scikit-learn/jobs/47715706

@amueller amueller mentioned this pull request Feb 1, 2015
@amueller amueller changed the title [MRG] Test more estimators, test dtype handling [MRG] More robust input validation, more testing. Feb 3, 2015
@amueller
Member Author

amueller commented Feb 3, 2015

Any reviews?

assert_warns(UserWarning, cov.fit, X_1sample)
# FIXME I don't know what this test does
# X_1sample = np.arange(5)
# cov = EmpiricalCovariance()
Member

did you git blame?

Member Author

Yeah, I tried to ping @VirgileFritsch

Member Author

In particular, if you reshape X to (1, n_features), which is what happens now internally, the tests fail. As the test is called "with one sample", I'm not sure what the intention is.

Member

I suspect (n_features,) was converted to (n_features, 1) to work out of the box in 1D. Which was wrong... so bug fix and API change?

Member Author

The interpretation in the code is actually different from (n_features, 1)...

Member

I don't know...

Member

In master I get the following:

>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/empirical_covariance_.py:73: UserWarning: Only one sample available. You may want to reshape your data array
  warnings.warn("Only one sample available. "
>>> emp_cov.covariance_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> emp_cov.precision_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

For regularized models:

>>> lw_cov = LedoitWolf().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/shrunk_covariance_.py:283: UserWarning: Only one sample available. You may want to reshape your data array
  warnings.warn("Only one sample available. "
>>> lw_cov.covariance_
array([[ 4.,  2.,  0., -2., -4.],
       [ 2.,  1.,  0., -1., -2.],
       [ 0.,  0.,  0.,  0.,  0.],
       [-2., -1.,  0.,  1.,  2.],
       [-4., -2.,  0.,  2.,  4.]])
>>> lw_cov.precision_
array([[ 0.04,  0.02,  0.  , -0.02, -0.04],
       [ 0.02,  0.01,  0.  , -0.01, -0.02],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [-0.02, -0.01,  0.  ,  0.01,  0.02],
       [-0.04, -0.02,  0.  ,  0.02,  0.04]])

So indeed the covariance module seems to treat 1D input as a one-sample case, which is not consistent with the rest of the library. On this branch I get:

>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
>>> emp_cov.covariance_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

So no more warning, but the covariance matrix is still 5x5, meaning that the 1D input is still treated as a single-sample dataset, which is still not consistent with the rest of the library IIRC.

Anyway, covariance models of single-sample datasets are pretty much meaningless to me. Covariance models of datasets with a single feature are barely better: they are singleton matrices which, while well defined (this is just the variance of the feature), are pretty useless to treat as covariance estimation models in practice.

I think we should consider the case of covariance estimation of datasets with a single sample as invalid and raise a ValueError telling explicitly that X needs at least 2 samples.
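To see why the single-sample case is degenerate, here is a minimal NumPy sketch (not scikit-learn code) of the empirical covariance computation: centering a one-row X yields the zero matrix, so the resulting 5x5 covariance matrix is all zeros, matching the output above.

```python
import numpy as np

# A single "sample" with 5 features.
X = np.arange(5, dtype=float).reshape(1, -1)

# Empirical covariance: average outer product of the centered samples.
centered = X - X.mean(axis=0)          # all zeros for a single sample
cov = centered.T @ centered / X.shape[0]

print(cov.shape)  # (5, 5)
print(np.allclose(cov, 0.0))  # True: a zero matrix, carrying no information
```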

Member Author

We do interpret X.shape=(n,) as X.shape=(1, n_features). This is what check_array does, so it should be pretty consistent now. I'm not sure we have a common test for that; maybe we should.

I agree that the covariance estimate is not really meaningful for this case.

The warning was just raised on the wrong condition (if X.ndim == 1, not if X.shape[0] == 1).

It seems I commented out one test too many; try out LedoitWolf().fit(np.arange(5)) on this branch.
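A quick sketch of the convention described above, using np.atleast_2d as a stand-in for what check_array does with 1D input (the actual validation code does more, e.g. dtype and NaN checks):

```python
import numpy as np

x = np.arange(5)

# Shape (n,) is interpreted as a single sample with n features,
# i.e. reshaped to (1, n_features).
X = np.atleast_2d(x)
print(X.shape)  # (1, 5)

# The "only one sample" warning should therefore key off the number
# of rows after conversion, not the original ndim:
assert X.shape[0] == 1
```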

Member Author

On master:

 LedoitWolf().fit(np.arange(5))

Works, but

 LedoitWolf().fit(np.arange(5).reshape(1, -1))

crashes with ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

That is clearly inconsistent. Now the test crashes on np.arange(5) as it is (sensibly) converted to .reshape(1, -1).

@agramfort agramfort changed the title [MRG] More robust input validation, more testing. [MRG+1] More robust input validation, more testing. Feb 3, 2015
@agramfort
Member

besides LGTM

@@ -521,5 +521,5 @@ def fit_transform(self, X):
X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.
"""
self._fit(X)
self.fit(X)
Member

This looks strange.

Member

Never mind, I thought that we were in the fit method.

@arjoly
Member

arjoly commented Feb 4, 2015

Besides my comments and @agramfort's remark, it looks good to merge.

@amueller
Member Author

amueller commented Feb 4, 2015

I have no idea how to fix the broken covariance test :-/ And I don't know what it is supposed to test.

@@ -413,7 +413,7 @@ def _fit(self, X):
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], dtype=np.float)
Member

I would rather use dtype=np.float64 explicitly here. np.float is an alias for the Python float builtin, and the precision could in theory be platform-specific, although I think it maps to 64-bit floats on all supported platforms. Anyway, let's be explicit rather than implicit.
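A quick check of the aliasing being described. (As a side note: the np.float alias was later deprecated in NumPy 1.20 and eventually removed, which is another reason to prefer the explicit np.float64.)

```python
import numpy as np

# The Python builtin float maps to a C double, i.e. 64-bit floats,
# on the platforms scikit-learn supports.
print(np.dtype(float))  # float64
assert np.dtype(float) == np.float64

# Being explicit leaves no ambiguity about the requested precision:
X = np.asarray([[1, 2], [3, 4]], dtype=np.float64)
print(X.dtype)  # float64
```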

@amueller amueller force-pushed the test_dtypes_all branch 2 times, most recently from 7d4f60a to 70a65d4 Compare February 5, 2015 12:42
@@ -348,6 +348,9 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
ElasticNetCV
"""
X = check_array(X, 'csc', dtype=np.float64, order='F', copy=copy_X)
if Xy is not None:
Member Author

Maybe @agramfort wants to have a look at this. This comes up in ElasticNetCV. In c2f0b31 I enforced it there, but there was a comment that it shouldn't be enforced there. Not sure about that.

@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.02% when pulling f9a5244 on amueller:test_dtypes_all into 1d08f08 on scikit-learn:master.

# FIXME these should be done also for non-mixin estimators!
estimators = all_estimators(type_filter=['classifier', 'regressor',
'transformer', 'cluster'])
estimators = all_estimators()
Member Author

Just to emphasize, this is the important line in this PR. It extends testing to many more estimators, namely those that don't inherit from one of the four mixins used before.

@amueller
Member Author

amueller commented Feb 7, 2015

Merging #4064 added more tests that we now extend to more estimators, so I need to fix the GMMs etc. ^^

@amueller
Member Author

amueller commented Feb 7, 2015

I "fixed" #4132 by making y=None the second positional argument to MDS.fit. I did the same for OneClassSVM.
That is a mostly backward-compatible change that might result in parameters being ignored if users relied on positional arguments. However, it is the only way to make things work properly with Pipeline.
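A minimal sketch of why Pipeline needs y as the second positional parameter of fit, even for unsupervised estimators (the estimator classes here are hypothetical, for illustration only):

```python
import numpy as np

class GoodEstimator:
    # Pipeline calls step.fit(X, y) positionally, so y must be the
    # second positional parameter even if the estimator ignores it.
    def fit(self, X, y=None):
        self.n_features_ = np.asarray(X).shape[1]
        return self

class BadEstimator:
    # An extra positional parameter before y (like MDS.fit's old
    # `init`) silently absorbs the labels that Pipeline passes.
    def fit(self, X, init=None, y=None):
        self.init_ = init
        return self

X = np.zeros((3, 2))
y = np.array([0, 1, 0])

good = GoodEstimator().fit(X, y)   # fine: y is accepted and ignored
bad = BadEstimator().fit(X, y)     # y ends up in init!
print(bad.init_ is y)  # True
```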

@amueller
Member Author

amueller commented Feb 7, 2015

Tests failing as SpectralEmbedding is super non-deterministic. And I'm not sure if we want the kneighbors_graph with include_self==True.

@amueller amueller force-pushed the test_dtypes_all branch 2 times, most recently from c8473fc to 87f6c31 Compare February 10, 2015 20:40
@amueller
Member Author

The test error is a sign flip in the eigensolver. I think we need something similar to sklearn.utils.extmath.svd_flip for eigensolvers, to make the signs of the eigenvectors deterministic. That seems outside the scope of this PR, though.
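A minimal sketch of the kind of sign convention meant here, adapted for eigenvectors (the helper below is hypothetical, not the actual scikit-learn fix): flip each eigenvector so that its largest-magnitude entry is positive, which removes the solver-dependent sign ambiguity.

```python
import numpy as np

def eig_sign_flip(vecs):
    """Make eigenvector signs deterministic by flipping each column so
    that its largest-magnitude entry is positive (hypothetical helper,
    analogous in spirit to sklearn.utils.extmath.svd_flip)."""
    idx = np.argmax(np.abs(vecs), axis=0)
    signs = np.sign(vecs[idx, np.arange(vecs.shape[1])])
    return vecs * signs

# A symmetric matrix whose eigenvectors are defined only up to sign.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
_, vecs = np.linalg.eigh(A)

flipped = eig_sign_flip(vecs)
# Applying the convention twice is a no-op: the result is deterministic
# regardless of which signs the eigensolver happened to return.
assert np.allclose(eig_sign_flip(flipped), flipped)
```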

@amueller
Member Author

Removed SpectralEmbedding from the tests for now; should be passing.

@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.04% when pulling 8ef0b9a on amueller:test_dtypes_all into 73e5cf5 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Feb 13, 2015

The non-deterministic signs in spectral_embedding is tracked in #4236.

@ogrisel
Member

ogrisel commented Feb 13, 2015

@amueller I fixed the case of regularized covariance on 1D data in this PR to your branch: amueller#24

Once integrated, +1 for merge as well on my side.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 95.04% when pulling 021bf74 on amueller:test_dtypes_all into 73e5cf5 on scikit-learn:master.

@amueller
Member Author

Ok, merging. Thanks for the reviews :)

amueller added a commit that referenced this pull request Feb 13, 2015
[MRG+1] More robust input validation, more testing.
@amueller amueller merged commit cd931fa into scikit-learn:master Feb 13, 2015
@amueller amueller deleted the test_dtypes_all branch February 13, 2015 22:14
@@ -357,7 +357,7 @@ def __init__(self, n_components=2, metric=True, n_init=4,
def _pairwise(self):
return self.kernel == "precomputed"

def fit(self, X, init=None, y=None):
Member Author

Maybe I merged too fast. I am still not 100% sure about this :-/

Member

I forgot about that one. I don't think it's a big deal though.
