[WIP] Common test for equivalence between sparse and dense matrices. #13246

Closed
wants to merge 127 commits

Conversation

wdevazelhes
Contributor

@wdevazelhes wdevazelhes commented Feb 25, 2019

Fixes #1572
Follow-up of #7590
TODO:

@wdevazelhes
Contributor Author

I tried to integrate the dictionary of values that overwrite the defaults, as in this comment by @lesteve: #7590 (comment)
My strategy is to look at the tests that fail and, if they fail because of parameter settings that differ from the defaults, add them to this dictionary; otherwise I'll file a bug. Does this sound good @jnothman, @lesteve?
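Concretely, the override dictionary is a small sketch like the following (the `Ridge` entry is the one used in the check later in this thread; other entries would be added as failures are triaged):

```python
from collections import defaultdict

# Per-estimator parameter overrides for the sparse-vs-dense check.
# defaultdict(dict) returns {} for estimators that need no override, so
# the check can unconditionally call estimator.set_params(**params).
dense_vs_sparse_additional_params = defaultdict(
    dict, {'Ridge': {'solver': 'cholesky'}})

print(dense_vs_sparse_additional_params['Ridge'])             # {'solver': 'cholesky'}
print(dense_vs_sparse_additional_params['LinearRegression'])  # {}
```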

@wdevazelhes
Contributor Author

Right now, here are the tests that fail (without looking yet at whether each is a real failure or whether the input just takes different code paths between sparse and dense):

  • TransformedTargetRegressor
  • SGDRegressor
  • SGDClassifier
  • RobustScaler
  • RidgeClassifierCV
  • RidgeCV
  • Ridge
  • RANSACRegressor
  • Perceptron
  • PassiveAggressiveRegressor
  • PassiveAggressiveClassifier
  • MultiOutputRegressor
  • LinearRegression
  • KNeighborsRegressor
  • KNeighborsClassifier
  • GradientBoostingClassifier

@agramfort
Member

agramfort commented Feb 25, 2019 via email

@wdevazelhes
Contributor Author

Setting fit_intercept=False, these are the tests that fail:

  • TransformedTargetRegressor
  • RobustScaler
  • RANSACRegressor
  • MultiOutputRegressor
  • KNeighborsRegressor
  • KNeighborsClassifier
  • GradientBoostingClassifier

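For the linear models that now pass, the agreement can be reproduced in isolation. A minimal sketch with synthetic data, using Ridge with the same `sparse_cg` solver on both representations so the numerical paths are as close as possible:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(40, 10)
X[X < .8] = 0          # make the data mostly sparse, as in the common test
y = rng.rand(40)

# Same solver for both inputs and no intercept: the two fits should agree.
est_dense = Ridge(fit_intercept=False, solver='sparse_cg').fit(X, y)
est_sparse = Ridge(fit_intercept=False,
                   solver='sparse_cg').fit(sparse.csr_matrix(X), y)

print(np.allclose(est_dense.coef_, est_sparse.coef_, atol=1e-3))  # True
```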
@@ -148,7 +148,7 @@ def predict(self, X):
Target values
"""
if issparse(X) and self.metric == 'precomputed':
raise ValueError(
raise TypeError(
Contributor Author

@agramfort should I leave this for another PR, or is it OK to change it here?

Contributor Author

@agramfort related to our discussion: in fact, they do call check_array here, but the sparse error is raised by this "if", which used a ValueError, so I guess this is the right fix (and I'll leave it in this PR for this one, as you said).
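As a toy sketch of the convention being enforced (the function and message below are hypothetical, mirroring the diff above): unsupported sparse input should surface as a TypeError mentioning "sparse", which is what the common test catches.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def predict(X, metric='precomputed'):
    # Unsupported sparse input raises TypeError (not ValueError),
    # which is the exception type the common test expects and inspects.
    if issparse(X) and metric == 'precomputed':
        raise TypeError("Sparse matrices are not supported for "
                        "precomputed distance matrices.")
    return np.zeros(X.shape[0])

try:
    predict(csr_matrix((3, 3)))
except TypeError as e:
    print('sparse' in repr(e).lower())  # True
```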

@@ -2406,7 +2406,7 @@ def check_estimator_sparse_dense(name, estimator_orig):

assert_equal(probs.shape, (X.shape[0], 4))
except TypeError as e:
if 'sparse' not in repr(e):
if 'sparse' not in str.lower(repr(e)):
Contributor Author

I added this because the estimator that used kernel='precomputed' raised a message starting with "Sparse...", which is an acceptable message.
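To illustrate why the lowercasing matters (the exception text below is a made-up example of the kind of message such estimators raise):

```python
e = TypeError("Sparse precomputed matrices are not supported.")

# The original check was case-sensitive and missed the capitalized "Sparse":
print('sparse' in repr(e))          # False
# Lowercasing the repr first accepts it:
print('sparse' in repr(e).lower())  # True
```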

@@ -2338,7 +2338,7 @@ def check_classifiers_regression_target(name, estimator_orig):

@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_estimator_sparse_dense(name, estimator_orig):
rng = np.random.RandomState(42)
rng = np.random.RandomState(52)
Contributor Author

@agramfort I hit the same random-seed problem here for AdaBoostClassifier (tested at the end of test_check_estimator); this new seed fixed it.

@wdevazelhes
Contributor Author

The new seed (52) introduced a weird bug for AffinityPropagation, which now predicts two very different arrays (one with zeros everywhere and one with -1 everywhere); putting the seed back to 0 fixes the problem. I'll try to investigate that.

Error message:

Testing started at 09:57 ...
/home/will/anaconda3/envs/sprint/bin/python /snap/pycharm-community/112/helpers/pycharm/_jb_pytest_runner.py --target test_common.py::test_estimators -- -k test_estimators[AffinityPropagation-check_estimator_sparse_dense]
Launching pytest with arguments -k test_estimators[AffinityPropagation-check_estimator_sparse_dense] test_common.py::test_estimators in /home/will/Code/sklearn-forks/wdevazelhes/scikit-learn/sklearn/tests

============================= test session starts ==============================
platform linux -- Python 3.7.2, pytest-4.3.0, py-1.7.0, pluggy-0.8.1
rootdir: /home/will/Code/sklearn-forks/wdevazelhes/scikit-learn, inifile: setup.cfg
plugins: sugar-0.9.2
collected 5385 items / 5384 deselected / 1 selected

test_common.py F
Estimator AffinityPropagation doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.

sklearn/tests/test_common.py:101 (test_estimators[AffinityPropagation-check_estimator_sparse_dense])
estimator = AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.5, max_iter=5, preference=None, verbose=False)
check = <function check_estimator_sparse_dense at 0x7fa5ee03e048>

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/testing.py:350: in wrapper
    return fn(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

name = 'AffinityPropagation'
estimator_orig = AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.5, max_iter=5, preference=None, verbose=False)

    @ignore_warnings(category=(DeprecationWarning, FutureWarning))
    def check_estimator_sparse_dense(name, estimator_orig):
        rng = np.random.RandomState(52)
        X = rng.rand(40, 10)
        X[X < .8] = 0
        X_csr = sparse.csr_matrix(X)
        y = (4 * rng.rand(40)).astype(np.int)
        estimator = clone(estimator_orig)
        estimator_sp = clone(estimator_orig)
        for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
            X_sp = X_csr.asformat(sparse_format)
            # catch deprecation warnings
            with ignore_warnings(category=DeprecationWarning):
                if name in ['Scaler', 'StandardScaler']:
                    estimator.set_params(with_mean=False)
                    estimator_sp.set_params(with_mean=False)
            dense_vs_sparse_additional_params = defaultdict(dict,
                    {'Ridge': {'solver': 'cholesky'}})
            params = dense_vs_sparse_additional_params[
                estimator.__class__.__name__]
            estimator.set_params(**params)
            estimator_sp.set_params(**params)
            set_random_state(estimator)
            set_random_state(estimator_sp)
            try:
                with ignore_warnings(category=DeprecationWarning):
                    estimator_sp.fit(X_sp, y)
                    estimator.fit(X, y)
                if hasattr(estimator, "predict"):
                    pred = estimator.predict(X)
                    pred_sp = estimator_sp.predict(X_sp)
>                   assert_array_almost_equal(pred, pred_sp, 2)
E                   AssertionError: 
E                   Arrays are not almost equal to 2 decimals
E                   
E                   (mismatch 100.0%)
E                    x: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
E                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
E                    y: array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
E                          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
E                          -1, -1, -1, -1, -1, -1])

../utils/estimator_checks.py:2367: AssertionError
                                                         [100%]

Base automatically changed from master to main January 22, 2021 10:50
@wdevazelhes
Contributor Author

wdevazelhes commented Feb 26, 2021

This issue: #17554, is causing the following test not to pass: test_estimators[PolynomialFeatures()-check_estimator_sparse_data(strict_mode=True)]. Indeed, one of the sparse matrices tested in check_estimator_sparse_data has int64 as dtype for the indices. I'll comment on that issue to see if I can fix it.
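For illustration, a sparse matrix with int64 index dtype (the case that trips #17554) can be constructed by hand; this is my own minimal reproduction, not the exact code of check_estimator_sparse_data:

```python
import numpy as np
from scipy import sparse

X = sparse.random(40, 10, density=0.2, format='csr', random_state=0)
# SciPy normally picks int32 indices for small matrices.

# Force int64 indices to mimic the large-matrix case the common test covers.
X64 = X.copy()
X64.indices = X64.indices.astype(np.int64)
X64.indptr = X64.indptr.astype(np.int64)
print(X64.indices.dtype)  # int64
```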

@@ -2974,7 +2974,9 @@ def check_estimator_sparse_dense(name, estimator_orig,
else:
tags = estimator_orig._get_tags()
centers = 2 if tags["binary_only"] else None
X, y = make_blobs(random_state=rng, cluster_std=0.5, centers=centers)
X, y = make_blobs(n_samples=10, n_features=2, random_state=rng,
Contributor Author

@wdevazelhes wdevazelhes Mar 3, 2021

I simplified the data here, similarly to before, because there were some numerical uncertainties with the Nyström method (particularly with almost-zero elements: the relative error between two numbers can be quite high even when both are of order 1e-15, i.e. clearly meant to be zeros). With a simpler dataset that leads to fewer operations (fewer samples, fewer features, and fewer nonzero decimals), the uncertainties disappeared on my machine.
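The numerical issue described above can be illustrated in isolation (the values are made up, of the same order as the ones observed):

```python
import numpy as np

# Two results that are "zero" up to floating-point noise:
a, b = 1e-15, 3e-15

# Their relative error is huge even though both are negligible...
rel_err = abs(a - b) / abs(b)
print(rel_err > 0.5)  # True

# ...so the comparison needs an absolute tolerance to treat them as equal.
print(np.isclose(a, b, rtol=0, atol=1e-8))  # True
```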

@cmarmo
Contributor

cmarmo commented Mar 3, 2021

Hi @wdevazelhes , do you mind synchronizing with upstream? Checks are failing looking for the master branch. Thanks!

@wdevazelhes
Contributor Author

Hi @cmarmo, thanks, I just merged with master.
I also added a few things to the TODO that I need to finish before moving the PR to MRG.

@jjerphan jjerphan added this to the 1.1 milestone Mar 4, 2022
@adrinjalali
Member

removing the milestone for now.

@haiatn
Contributor

haiatn commented Jul 29, 2023

Is this PR ready for review?

@agramfort
Member

@haiatn this one would deserve a big refresher to be now considered for merge. I don't know how much we should invest on this one at this point.

@haiatn
Contributor

haiatn commented Aug 12, 2023

> @haiatn this one would deserve a big refresher to be now considered for merge. I don't know how much we should invest on this one at this point.

I do like the idea of testing this, but if you think so, maybe we should close this and leave the issue open for whoever that would like to start fresh

@agramfort
Member

agramfort commented Aug 12, 2023 via email

@adrinjalali
Member

I think this can be closed; the original issue is still open and relevant.

Labels
module:test-suite (everything related to our tests), module:utils
Successfully merging this pull request may close these issues.

Common tests: check that results on sparse and dense matrices are identical