[MRG+1] More robust input validation, more testing. #4136

Merged
merged 4 commits into from
Feb 13, 2015

Conversation

amueller
Member

Test a lot more of the estimators for input validation etc.
Enforce that they can accept other dtypes, in particular float32.
Hack for parts of #4134, fixing parts of #4056, fixing #4124, fixing #4133 (debatable) and #4132.

@amueller amueller added the Bug label Jan 20, 2015
@amueller amueller added this to the 0.16 milestone Jan 20, 2015
@amueller amueller added the API label Jan 20, 2015
@amueller amueller changed the title Test more estimators, test dtype handling [MRG] Test more estimators, test dtype handling Jan 20, 2015
@amueller
Member Author

@VirgileFritsch if you have any time, your input on what is happening in the covariance module would be much appreciated.
If I use the one-sample dataset, it now crashes with a ValueError because of NaNs.

@amueller
Member Author

It looks like there was an issue in the tests. The X that is passed is 1d, and the covariance module didn't do proper input validation, so the X was not treated as one sample. If you reshape the X in the test in master to be (1, n_features), which I think was the intent given the comment, then the test also breaks in master.

@amueller
Member Author

For reference, the error was here: https://travis-ci.org/scikit-learn/scikit-learn/jobs/47715706

@amueller amueller mentioned this pull request Feb 1, 2015
@amueller amueller changed the title [MRG] Test more estimators, test dtype handling [MRG] More robust input validation, more testing. Feb 3, 2015
@amueller
Member Author

amueller commented Feb 3, 2015

Any reviews?

assert_warns(UserWarning, cov.fit, X_1sample)
# FIXME I don't know what this test does
# X_1sample = np.arange(5)
# cov = EmpiricalCovariance()
Member

did you git blame?

Member Author

Yeah, I tried to ping @VirgileFritsch

Member Author

In particular, if you reshape X to (1, n_features), which is what happens now internally, the tests fail. As the test is called "with one sample", I'm not sure what the intention is.

Member

I suspect (n_features,) was converted to (n_features, 1) to work out of the box in 1D. Which was wrong... so bug fix and API change?

Member Author

The interpretation in the code is actually different from (n_features, 1)...

Member

I don't know...

Member

In master I get the following:

>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/empirical_covariance_.py:73: UserWarning: Only one sample available. You may want to reshape your data array
  warnings.warn("Only one sample available. "
>>> emp_cov.covariance_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
>>> emp_cov.precision_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

For regularized models:

>>> lw_cov = LedoitWolf().fit(np.arange(5))
/Users/ogrisel/code/scikit-learn/sklearn/covariance/shrunk_covariance_.py:283: UserWarning: Only one sample available. You may want to reshape your data array
  warnings.warn("Only one sample available. "
>>> lw_cov.covariance_
array([[ 4.,  2.,  0., -2., -4.],
       [ 2.,  1.,  0., -1., -2.],
       [ 0.,  0.,  0.,  0.,  0.],
       [-2., -1.,  0.,  1.,  2.],
       [-4., -2.,  0.,  2.,  4.]])
>>> lw_cov.precision_
array([[ 0.04,  0.02,  0.  , -0.02, -0.04],
       [ 0.02,  0.01,  0.  , -0.01, -0.02],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [-0.02, -0.01,  0.  ,  0.01,  0.02],
       [-0.04, -0.02,  0.  ,  0.02,  0.04]])

So indeed the covariance module seems to treat 1D input as a one-sample case, which is not consistent with the rest of the library. On this branch I get:

>>> emp_cov = EmpiricalCovariance().fit(np.arange(5))
>>> emp_cov.covariance_
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

So no more warning, but the covariance matrix is still 5x5, meaning that the 1D input is still treated as a single-sample dataset, which is still not consistent with the rest of the library IIRC.

Anyway, covariance models of single-sample datasets are pretty much meaningless to me. Covariance models of datasets with a single feature are barely better: they are singleton matrices which, while well defined (this is just the variance of the feature), are pretty useless to treat as covariance estimation models in practice.

I think we should consider the case of covariance estimation of datasets with a single sample as invalid and raise a ValueError telling explicitly that X needs at least 2 samples.
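To see why the single-sample case is degenerate, here is a minimal NumPy sketch (not scikit-learn code) of the empirical covariance computation: centering a one-row X yields the zero matrix, so the resulting 5x5 covariance matrix is all zeros, matching the output above.

```python
import numpy as np

# A single "sample" with 5 features.
X = np.arange(5, dtype=float).reshape(1, -1)

# Empirical covariance: average outer product of the centered samples.
centered = X - X.mean(axis=0)          # all zeros for a single sample
cov = centered.T @ centered / X.shape[0]

print(cov.shape)  # (5, 5)
print(np.allclose(cov, 0.0))  # True: a zero matrix, carrying no information
```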

Member Author

We do interpret X.shape=(n,) as X.shape=(1, n_features). This is what check_array does, so it should be pretty consistent now. I'm not sure we have a common test for that; maybe we should.

I agree that the covariance estimate is not really meaningful for this case.

The warning was just raised on the wrong condition (if X.ndim == 1, not if X.shape[0] == 1).

It seems I commented out one test too many; try out LedoitWolf().fit(np.arange(5)) on this branch.
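A quick sketch of the convention described above, using np.atleast_2d as a stand-in for what check_array does with 1D input (the actual validation code does more, e.g. dtype and NaN checks):

```python
import numpy as np

x = np.arange(5)

# Shape (n,) is interpreted as a single sample with n features,
# i.e. reshaped to (1, n_features).
X = np.atleast_2d(x)
print(X.shape)  # (1, 5)

# The "only one sample" warning should therefore key off the number
# of rows after conversion, not the original ndim:
assert X.shape[0] == 1
```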

Member Author

On master:

 LedoitWolf().fit(np.arange(5))

Works, but

 LedoitWolf().fit(np.arange(5).reshape(1, -1))

crashes with ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

That is clearly inconsistent. Now the test crashes on np.arange(5) as it is (sensibly) converted to .reshape(1, -1).

@agramfort agramfort changed the title [MRG] More robust input validation, more testing. [MRG+1] More robust input validation, more testing. Feb 3, 2015
@agramfort
Member

besides LGTM

@@ -521,5 +521,5 @@ def fit_transform(self, X):
X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.
"""
self._fit(X)
self.fit(X)
Member

This looks strange.

Member

Never mind, I thought that we were in the fit method.

@arjoly
Member

arjoly commented Feb 4, 2015

Besides my comments and @agramfort's remark, it looks good to merge.

@amueller
Member Author

amueller commented Feb 4, 2015

I have no idea how to fix the broken covariance test :-/ And I don't know what it is supposed to test.

@@ -413,7 +413,7 @@ def _fit(self, X):
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], dtype=np.float)
Member

I would rather use dtype=np.float64 explicitly here. np.float is an alias for the Python float builtin, and the precision could in theory be platform-specific, although I think it maps to 64-bit floats on all supported platforms. Anyway, let's be explicit rather than implicit.
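A quick check of the aliasing being described. (As a side note: the np.float alias was later deprecated in NumPy 1.20 and eventually removed, which is another reason to prefer the explicit np.float64.)

```python
import numpy as np

# The Python builtin float maps to a C double, i.e. 64-bit floats,
# on the platforms scikit-learn supports.
print(np.dtype(float))  # float64
assert np.dtype(float) == np.float64

# Being explicit leaves no ambiguity about the requested precision:
X = np.asarray([[1, 2], [3, 4]], dtype=np.float64)
print(X.dtype)  # float64
```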

@amueller amueller force-pushed the test_dtypes_all branch 2 times, most recently from 7d4f60a to 70a65d4 Compare February 5, 2015 12:42
@@ -348,6 +348,9 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
ElasticNetCV
"""
X = check_array(X, 'csc', dtype=np.float64, order='F', copy=copy_X)
if Xy is not None:
Member Author

Maybe @agramfort wants to have a look at this. This comes up in ElasticNetCV. In c2f0b31 I enforced it there, but there was a comment that it shouldn't be enforced there. Not sure about that.

@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.02% when pulling f9a5244 on amueller:test_dtypes_all into 1d08f08 on scikit-learn:master.

# FIXME these should be done also for non-mixin estimators!
estimators = all_estimators(type_filter=['classifier', 'regressor',
'transformer', 'cluster'])
estimators = all_estimators()
Member Author

Just to emphasize, this is the important line in this PR. It extends testing to many more estimators, namely those that don't inherit from one of the four mixins used before.

@amueller
Member Author

amueller commented Feb 7, 2015

Merging #4064 added more tests that we now extend to more estimators, so I need to fix the GMMs etc. ^^

@amueller
Member Author

amueller commented Feb 7, 2015

I "fixed" #4132 by making y=None the second positional argument to MDS.fit. I did the same for OneClassSVM.
That is a mostly backward-compatible change that might result in parameters being ignored if users relied on positional arguments. However, it is the only way to make things work properly with Pipeline.
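A minimal sketch of why Pipeline needs y as the second positional parameter of fit, even for unsupervised estimators (the estimator classes here are hypothetical, for illustration only):

```python
import numpy as np

class GoodEstimator:
    # Pipeline calls step.fit(X, y) positionally, so y must be the
    # second positional parameter even if the estimator ignores it.
    def fit(self, X, y=None):
        self.n_features_ = np.asarray(X).shape[1]
        return self

class BadEstimator:
    # An extra positional parameter before y (like MDS.fit's old
    # `init`) silently absorbs the labels that Pipeline passes.
    def fit(self, X, init=None, y=None):
        self.init_ = init
        return self

X = np.zeros((3, 2))
y = np.array([0, 1, 0])

good = GoodEstimator().fit(X, y)   # fine: y is accepted and ignored
bad = BadEstimator().fit(X, y)     # y ends up in init!
print(bad.init_ is y)  # True
```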

@amueller
Member Author

amueller commented Feb 7, 2015

Tests failing as SpectralEmbedding is super non-deterministic. And I'm not sure if we want the kneighbors_graph with include_self==True.

@amueller amueller force-pushed the test_dtypes_all branch 2 times, most recently from c8473fc to 87f6c31 Compare February 10, 2015 20:40
@amueller
Member Author

The test error is a sign flip in the eigensolver. I think we need something similar to sklearn.utils.extmath.svd_flip for eigensolvers, to make the signs of the eigenvectors deterministic. That seems outside the scope of this PR, though.
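A minimal sketch of the kind of sign convention meant here, adapted for eigenvectors (the helper below is hypothetical, not the actual scikit-learn fix): flip each eigenvector so that its largest-magnitude entry is positive, which removes the solver-dependent sign ambiguity.

```python
import numpy as np

def eig_sign_flip(vecs):
    """Make eigenvector signs deterministic by flipping each column so
    that its largest-magnitude entry is positive (hypothetical helper,
    analogous in spirit to sklearn.utils.extmath.svd_flip)."""
    idx = np.argmax(np.abs(vecs), axis=0)
    signs = np.sign(vecs[idx, np.arange(vecs.shape[1])])
    return vecs * signs

# A symmetric matrix whose eigenvectors are defined only up to sign.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
_, vecs = np.linalg.eigh(A)

flipped = eig_sign_flip(vecs)
# Applying the convention twice is a no-op: the result is deterministic
# regardless of which signs the eigensolver happened to return.
assert np.allclose(eig_sign_flip(flipped), flipped)
```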

@amueller
Member Author

Removed SpectralEmbedding from the tests for now; should be passing.

@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.04% when pulling 8ef0b9a on amueller:test_dtypes_all into 73e5cf5 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Feb 13, 2015

The non-deterministic signs in spectral_embedding is tracked in #4236.

@ogrisel
Member

ogrisel commented Feb 13, 2015

@amueller I fixed the case of regularized covariance on 1D data in this PR to your branch: amueller#24

Once integrated, +1 for merge as well on my side.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 95.04% when pulling 021bf74 on amueller:test_dtypes_all into 73e5cf5 on scikit-learn:master.

@amueller
Member Author

Ok, merging. Thanks for the reviews :)

amueller added a commit that referenced this pull request Feb 13, 2015
[MRG+1] More robust input validation, more testing.
@amueller amueller merged commit cd931fa into scikit-learn:master Feb 13, 2015
@amueller amueller deleted the test_dtypes_all branch February 13, 2015 22:14
@@ -357,7 +357,7 @@ def __init__(self, n_components=2, metric=True, n_init=4,
def _pairwise(self):
return self.kernel == "precomputed"

def fit(self, X, init=None, y=None):
Member Author

Maybe I merged too fast. I am still not 100% sure about this :-/

Member

I forgot about that one. I don't think it's a big deal though.
