
Tests for sample order invariance in estimator_checks #8695


Closed
jnothman opened this issue Apr 3, 2017 · 10 comments · Fixed by MLH-Fellowship/scikit-learn#1
Labels
Easy Well-defined and straightforward way to resolve

Comments

@jnothman
Member

jnothman commented Apr 3, 2017

While sample and feature order can have subtle effects on the model fit by an estimator, I think we should have common tests to ensure that reordering or subsampling X in predict, transform, score_samples, predict_proba, or decision_function does not change the sample-wise output. That is:

idx = np.random.randint(X.shape[0], size=X.shape[0] // 2)
assert_array_equal(method(X)[idx], method(X[idx]))

Apologies if we already have such tests, but I can't see them (which is also an issue: we don't actually have a clear list of what is asserted by estimator_checks)
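
A minimal runnable sketch of the proposed check, assuming only NumPy. The helper name check_subset_invariance and the two toy functions are hypothetical and are not part of scikit-learn; the comparison uses np.allclose instead of exact equality, which is an assumption made here to tolerate floating-point differences between batch sizes.

```python
import numpy as np


def check_subset_invariance(method, X, seed=0):
    """Return True if method(X)[idx] matches method(X[idx]) for a random idx.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    rng = np.random.RandomState(seed)
    # Draw half as many row indices as there are rows (with replacement),
    # mirroring the two-line snippet in the issue.
    idx = rng.randint(X.shape[0], size=X.shape[0] // 2)
    full = np.asarray(method(X))[idx]
    subset = np.asarray(method(X[idx]))
    # Assumption: tolerate tiny floating-point differences instead of
    # requiring bit-for-bit equality.
    return full.shape == subset.shape and np.allclose(full, subset)


X = np.random.RandomState(42).normal(size=(20, 3))


def row_sum(X):
    # Sample-wise: each output depends only on its own row.
    return X.sum(axis=1)


def centered_sum(X):
    # Batch-dependent: each output depends on the mean of all rows passed in.
    return (X - X.mean(axis=0)).sum(axis=1)


row_sum_ok = check_subset_invariance(row_sum, X)
centered_ok = check_subset_invariance(centered_sum, X)
print(row_sum_ok, centered_ok)
```

The first function passes because its output for a row never looks at other rows; the second fails because the batch mean changes when rows are subsampled, which is the kind of behaviour the proposed common test is meant to catch.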

@jnothman jnothman added Easy Well-defined and straightforward way to resolve Need Contributor labels Apr 3, 2017
@jmcol

jmcol commented Apr 5, 2017

Hi! I would like to work on this, but I'm new to this project. I took a look at estimator_checks.py and there's a lot there. Would this be a completely new test or would this be added to test_check_estimator?

@jnothman
Member Author

jnothman commented Apr 5, 2017 via email

jmcol added a commit to jmcol/scikit-learn that referenced this issue Apr 7, 2017
… for sample invariance in predict_proba to ensure that reordering or subsampling
does not change the sample-wide output

Addresses: scikit-learn#8695
jmcol added a commit to jmcol/scikit-learn that referenced this issue Apr 7, 2017
Adds a check for sample order invariance for regressors
in estimator_checks

Addresses: scikit-learn#8695
@jmcol

jmcol commented Apr 10, 2017

Would the test for transformers only be in check_transformer_general, or would it need to be present in all the check_transformer* functions? I'm also unsure how I would produce an estimator that would fail one of these tests. Also, I've been unable to find a straightforward way to build from source on Windows. Do you (or anyone) have any recommendations?

Thanks so much!

jmcol added a commit to jmcol/scikit-learn that referenced this issue Apr 11, 2017
Adds a simple check for sample order invariance
in the predict function when using the
estimator_checks

Addresses: scikit-learn#8695
@jnothman
Member Author

I'm sorry, I know nothing about building on Windows beyond what's in the docs.

An example of an estimator that fails on one of these tests is:

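import numpy as np
from sklearn.base import BaseEstimator

class Bad(BaseEstimator):
    def fit(self, X, y=None): return self
    def predict(self, X): return np.arange(len(X))  # depends on row position, not row content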

I'm not sure the best way to structure the tests in the current checking framework. Try one and we'll see if there's better when reviewing your pull request.
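
To see why an estimator like Bad would fail, here is a small worked sketch. It assumes a plain class standing in for the BaseEstimator subclass shown in the comment, so that only NumPy is needed; the variable names are illustrative.

```python
import numpy as np


class Bad:
    # Stand-in for the BaseEstimator subclass in the comment (an assumption
    # made here to keep the sketch dependency-light).
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        # The output is each row's position in the batch, so it changes
        # whenever rows are reordered or subsampled.
        return np.arange(len(X))


X = np.arange(12.0).reshape(6, 2)
idx = np.array([4, 1, 3])

est = Bad().fit(X)
full_then_index = est.predict(X)[idx]     # positions within the full array
index_then_predict = est.predict(X[idx])  # positions within the subset

print(full_then_index)     # [4 1 3]
print(index_then_predict)  # [0 1 2]
invariant = np.array_equal(full_then_index, index_then_predict)
print(invariant)           # False
```

Predicting on the full array and then indexing returns idx itself, while predicting on the subset returns 0..len(idx)-1, so the two disagree and the proposed assertion would raise.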

@jmcol

jmcol commented Apr 14, 2017

Hmm, this might not be the place to ask this, but I'm having a hard time building and testing. I created a VM on my Windows machine and I'm not able to use make in the scikit-learn directory. I'm getting the following error:

ERROR: Failure: ImportError (cannot import name _hierarchical)
Traceback (most recent call last):
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/nose/loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/home/jcolfer/scikit-learn/sklearn/tests/test_random_projection.py", line 14, in <module>
    from sklearn.utils.testing import assert_less
  File "/home/jcolfer/scikit-learn/sklearn/utils/testing.py", line 61, in <module>
    from sklearn.cluster import DBSCAN
  File "/home/jcolfer/scikit-learn/sklearn/cluster/__init__.py", line 10, in <module>
    from .hierarchical import (ward_tree, AgglomerativeClustering, linkage_tree,
  File "/home/jcolfer/scikit-learn/sklearn/cluster/hierarchical.py", line 23, in <module>
    from . import _hierarchical
ImportError: cannot import name _hierarchical
Ran 348 tests in 1.328s

FAILED (errors=143)
Makefile:32: recipe for target 'test-code' failed

Building doesn't fail when I just type python setup.py install. However, I get a similar error about the hierarchical module when I try to execute the following code:

from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator
import numpy as np


class Bad(BaseEstimator):
    def fit(self, X, y=None): return self

    def predict(self, X): return np.arange(len(X))


check_estimator(Bad)

Traceback (most recent call last):
  File "test_check_estimators.py", line 1, in <module>
    from sklearn.utils.estimator_checks import check_estimator
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/sklearn/utils/estimator_checks.py", line 16, in <module>
    from sklearn.utils.testing import assert_raises
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/sklearn/utils/testing.py", line 61, in <module>
    from sklearn.cluster import DBSCAN
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/sklearn/cluster/__init__.py", line 10, in <module>
    from .hierarchical import (ward_tree, AgglomerativeClustering, linkage_tree,
  File "/home/jcolfer/anaconda2/lib/python2.7/site-packages/sklearn/cluster/hierarchical.py", line 23, in <module>
    from . import _hierarchical
  File "sklearn/utils/fast_dict.pxd", line 22, in init sklearn.cluster._hierarchical (sklearn/cluster/_hierarchical.cpp:21043)
ImportError: /home/jcolfer/anaconda2/lib/python2.7/site-packages/sklearn/utils/fast_dict.so: undefined symbol: _ZTINSt8ios_base7failureB5cxx11E

I'm using Ubuntu 16.04.2 in a VMware Workstation VM on a 64-bit Intel system.

@AishwaryaRK
Contributor

@jnothman this issue seems open. I would like to contribute; can you please give me some pointers to start with?

@jnothman
Member Author

Get your head around sklearn/utils/estimator_checks.py for a start. Thanks!

@AishwaryaRK
Contributor

Thanks, I'm picking it up.

@anhqngo

anhqngo commented Jun 8, 2020

Hey, is someone working on this? Can I pick this up?

@cmarmo
Contributor

cmarmo commented Oct 8, 2020

Fixed in #17598 via #18570.

@cmarmo cmarmo closed this as completed Oct 8, 2020