[MRG+1] More robust input validation, more testing. #4136
Conversation
fc2b2fc to 8b4342d (force-push)
@VirgileFritsch if you have any time, your input on what is happening in the covariance module would be much appreciated.
It looks like there was an issue in the tests. The X that is passed is 1d, and the covariance module didn't do proper input validation, so the X was not treated as one sample. If you reshape the X in the test in master to be ...
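For illustration, a minimal sketch of the ambiguity with 1d input (variable names are mine, not from the test):

```python
import numpy as np

X = np.arange(5)                  # shape (5,): ambiguous without explicit validation
X_one_sample = X.reshape(1, -1)   # shape (1, 5): one sample with 5 features
X_one_feature = X.reshape(-1, 1)  # shape (5, 1): 5 samples with 1 feature
```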
For reference, the error was here: https://travis-ci.org/scikit-learn/scikit-learn/jobs/47715706
Any reviews?
assert_warns(UserWarning, cov.fit, X_1sample)
# FIXME I don't know what this test does
# X_1sample = np.arange(5)
# cov = EmpiricalCovariance()
did you git blame?
Yeah, I tried to ping @VirgileFritsch
In particular, if you reshape X to (1, n_features), which is what happens now internally, the tests fail. As the test is called "with one sample", I'm not sure what the intention is.
I suspect (n_features,) was converted to (n_features, 1) to work out of the box in 1D. Which was wrong... so bug fix and API change?
The interpretation in the code is actually different from (n_features, 1)...
I don't know...
In master I get the following:
>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/empirical_covariance_.py:73: UserWarning: Only one sample available. You may want to reshape your data array
warnings.warn("Only one sample available. "
>>> emp_cov.covariance_
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
>>> emp_cov.precision_
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
For regularized models:
>>> lw_cov = LedoitWolf().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/shrunk_covariance_.py:283: UserWarning: Only one sample available. You may want to reshape your data array
warnings.warn("Only one sample available. "
>>> lw_cov.covariance_
array([[ 4., 2., 0., -2., -4.],
[ 2., 1., 0., -1., -2.],
[ 0., 0., 0., 0., 0.],
[-2., -1., 0., 1., 2.],
[-4., -2., 0., 2., 4.]])
>>> lw_cov.precision_
array([[ 0.04, 0.02, 0. , -0.02, -0.04],
[ 0.02, 0.01, 0. , -0.01, -0.02],
[ 0. , 0. , 0. , 0. , 0. ],
[-0.02, -0.01, 0. , 0.01, 0.02],
[-0.04, -0.02, 0. , 0.02, 0.04]])
So indeed the covariance module seems to treat 1D input as a single-sample case, which is not consistent with the rest of the library. On this branch I get:
>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
>>> emp_cov.covariance_
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
So no more warning, but the covariance matrix is still 5x5, meaning that the 1D input is still treated as a single-sample dataset, which is still not consistent with the rest of the library IIRC.
Anyway, covariance models of single-sample datasets are pretty much meaningless to me. Covariance models of datasets with a single feature are barely better: they are singleton matrices which, while well defined (it is just the variance of the feature), are pretty useless to treat as covariance estimation models in practice.
I think we should consider covariance estimation on datasets with a single sample as invalid and raise a ValueError stating explicitly that X needs at least 2 samples.
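A minimal sketch of what such a check could look like (hypothetical helper, not the actual scikit-learn code):

```python
import numpy as np
from sklearn.utils import check_array

def _validate_cov_input(X):
    # Hypothetical helper: reject single-sample input for covariance estimation.
    X = check_array(X)
    if X.shape[0] < 2:
        raise ValueError("Covariance estimation requires at least 2 samples; "
                         "got X with shape %r." % (X.shape,))
    return X
```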
We do interpret X.shape=(n,) as X.shape=(1, n_features). This is what check_array does, so it should be pretty consistent now. I'm not sure we have a common test for that; maybe we should.
I agree that the covariance estimate is not really meaningful for this case.
The warning was just raised in the wrong way (if X.ndim == 1, not if X.shape[0] == 1).
It seems I commented out one test too many; try out LedoitWolf().fit(np.arange(5)) on this branch.
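As a small illustration of that interpretation (this reflects check_array behaviour at the time of this PR; recent scikit-learn versions raise an error on 1D input instead of reshaping it):

```python
import numpy as np
from sklearn.utils import check_array

X = np.arange(5)    # shape (5,)
X = check_array(X)  # at the time of this PR: promoted to shape (1, 5), i.e. one sample
print(X.shape)      # (1, 5); newer scikit-learn versions raise a ValueError here instead
```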
On master, LedoitWolf().fit(np.arange(5)) works, but LedoitWolf().fit(np.arange(5).reshape(1, -1)) crashes with ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
That is clearly inconsistent. Now the test crashes on np.arange(5), as it is (sensibly) converted to .reshape(1, -1).
Besides that, LGTM.
@@ -521,5 +521,5 @@ def fit_transform(self, X):
X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.
"""
self._fit(X)
self.fit(X) |
This looks strange.
Never mind, I thought that we were in the fit method.
Besides my comments and @agramfort's remark, it looks good to merge.
I have no idea how to fix the broken covariance test :-/ And I don't know what it is supposed to test.
3621dc6 to df02a39 (force-push)
@@ -413,7 +413,7 @@ def _fit(self, X):
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], dtype=np.float)
I would rather use dtype=np.float64 explicitly here. np.float is an alias for the Python float builtin, and the precision could in theory be platform specific, although I think it maps to 64-bit floats on all supported platforms. Anyway, let's be explicit rather than implicit.
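A small standalone sketch of the suggested change (the X here is made up for the example):

```python
import numpy as np
from sklearn.utils import check_array

X = [[0, 1], [2, 3], [4, 5]]
# np.float was just an alias for the Python builtin float (and has since been
# removed from NumPy), so spell out the intended precision explicitly:
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], dtype=np.float64)
print(X.dtype)  # float64
```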
7d4f60a to 70a65d4 (force-push)
@@ -348,6 +348,9 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
ElasticNetCV
"""
X = check_array(X, 'csc', dtype=np.float64, order='F', copy=copy_X)
if Xy is not None: |
Maybe @agramfort wants to have a look at this. This comes up in ElasticNetCV. In c2f0b31 I enforced it there, but there was a comment that it shouldn't be enforced there. Not sure about that.
# FIXME these should be done also for non-mixin estimators!
estimators = all_estimators(type_filter=['classifier', 'regressor',
                                          'transformer', 'cluster'])
estimators = all_estimators() |
Just to emphasize, this is the important line in this PR. It extends testing to many more estimators, namely those that don't inherit from one of the four mixins used before.
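A minimal sketch of what that line now iterates over (hypothetical loop; the import path has moved over time, and in recent scikit-learn versions all_estimators lives in sklearn.utils):

```python
from sklearn.utils import all_estimators

# Without a type_filter, every public estimator is discovered, including those
# that do not inherit from the classifier/regressor/transformer/cluster mixins.
for name, Estimator in all_estimators():
    print(name, Estimator)
```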
Merging #4064 means more tests that get extended to more estimators, so now I need to fix the GMMs etc. ^^
f9a5244 to bb6417b (force-push)
I "fixed" #4132 by making |
Tests failing as |
c8473fc to 87f6c31 (force-push)
The test error is a sign flip in the eigensolver. I think we need to do something similar to ...
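For context, one common way to make eigenvector signs deterministic is to flip each vector so that its largest-magnitude entry is positive; a sketch of that idea (not necessarily the fix adopted here):

```python
import numpy as np

def flip_eigenvector_signs(vectors):
    # vectors: array of shape (n_features, n_components), one eigenvector per column.
    # Flip each column so its entry with the largest absolute value is positive.
    max_abs_rows = np.argmax(np.abs(vectors), axis=0)
    signs = np.sign(vectors[max_abs_rows, np.arange(vectors.shape[1])])
    signs[signs == 0] = 1  # keep columns whose pivot entry is exactly zero unchanged
    return vectors * signs
```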
87f6c31 to 8ef0b9a (force-push)
Removed Spectral embedding from the tests for now, should be passing.
The non-deterministic signs in ...
@amueller I fixed the case of regularized covariance on 1D data in this PR to your branch: amueller#24. Once integrated, +1 for merge on my side as well.
FIX regularized covariance on 1D data
Ok, merging. Thanks for the reviews :)
[MRG+1] More robust input validation, more testing.
@@ -357,7 +357,7 @@ def __init__(self, n_components=2, metric=True, n_init=4,
def _pairwise(self):
return self.kernel == "precomputed"
def fit(self, X, init=None, y=None): |
Maybe I merged too fast. I am still not 100% sure about this :-/
I forgot about that one. I don't think it's a big deal though.
Test a lot more of the estimators for input validation etc. Enforce that they can accept other dtypes, in particular float32. Hack for parts of #4134, fixing parts of #4056, fixing #4124, fixing #4133 (debatable) and #4132.
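As an illustration of the dtype point, a small hedged check of the kind this PR pushes towards (the estimator choice is mine, purely for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 3).astype(np.float32)
y = rng.rand(20).astype(np.float32)

# Estimators should accept float32 input without raising.
Ridge().fit(X, y).predict(X)
```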