[WIP] Common test for equivalence between sparse and dense matrices. #13246

Closed · wants to merge 127 commits
127 commits
a22ee34
Add test for sparse and dense equivalence
maniteja123 Oct 4, 2016
17b4da6
Set random state
maniteja123 Oct 6, 2016
c4241f0
Merge branch 'master' into tests_sparse
Feb 25, 2019
3c4cd3e
wip tests for sparse estimators
Feb 25, 2019
9f70939
remove wrongly merged code
Feb 25, 2019
80cc509
Fix: make it work
Feb 25, 2019
6acd14d
Add fit_intercept=False for linear models
Feb 25, 2019
d7b56a1
add brute exception for kneighbors classifier
Feb 25, 2019
e5cc48c
more workaround
agramfort Feb 25, 2019
fd45ad2
fix test by fixing the seed :)
agramfort Feb 25, 2019
a8e6606
STY: fix pep8 errors
Feb 26, 2019
f4e885b
Merge branch 'tests_sparse' of https://github.com/wdevazelhes/scikit-…
Feb 26, 2019
66000f5
FIX deal with the case that was failing (test_check_estimator_pairwise)
Feb 26, 2019
f2d8b16
FIX fix random seed to make test_check_estimator pass (lines at the e…
Feb 26, 2019
22af420
Fix the test that was checking a ValueError
Feb 26, 2019
1e5d8fc
FIX: do another way around to pass tests for kernel algorithms (becau…
Feb 26, 2019
0f47078
FIX: fix mistake on n_samples, n_features
Feb 26, 2019
a1fc52a
remove spurious comment
Feb 26, 2019
f849d24
Also convert X_csr
Feb 26, 2019
27f2177
Convert X_sp instead
Feb 26, 2019
926d27f
Fix typo
Feb 26, 2019
7eb9710
Merge branch 'master' into tests_sparse
Feb 27, 2019
4163387
FIX: fix the check that was making AffinityPropagation fail and remov…
Feb 27, 2019
b01a1ab
FIX: remove the change that belongs to #13334
Mar 1, 2019
f626717
Merge branch 'master' into tests_sparse
Mar 1, 2019
49e7bbe
Add more friendly dataset to avoid problems of random seeds with rand…
Mar 1, 2019
1db1408
Address jeremie's review
Mar 1, 2019
df985a6
FIX: put friendly dataset that allows to clean up nearest neighbors
Mar 1, 2019
ca3c046
simplify
agramfort Mar 6, 2019
adfe965
Merge branch 'master' into tests_sparse
agramfort Mar 6, 2019
a09fd9a
Merge branch 'master' into tests_sparse
Apr 4, 2019
4c80312
Merge branch 'tests_sparse' of https://github.com/wdevazelhes/scikit-…
Apr 4, 2019
78ba594
Merge branch 'master' into tests_sparse
Apr 4, 2019
f561a25
fix precomputed case + SGD
agramfort Apr 27, 2019
c07c412
Merge branch 'master' into tests_sparse
agramfort Apr 27, 2019
1e906a1
more try
agramfort Apr 27, 2019
e9ba178
Merge remote-tracking branch 'upstream/master' into common_check_spar…
jeromedockes Jul 30, 2019
49e3981
only 2 classes in check_estimator_sparse_dense if tag "binary_only"
jeromedockes Jul 30, 2019
a149f8f
check value of multioutput or multioutput_only in tags
jeromedockes Jul 30, 2019
247a2ce
Merge remote-tracking branch 'upstream/master' into common_check_spar…
jeromedockes Jul 30, 2019
2842f1e
Address https://github.com/scikit-learn/scikit-learn/pull/13246/files…
Aug 19, 2019
f3d027c
Remove file that should not be here
Aug 19, 2019
6705cee
Merge branch 'master' into tests_sparse
Aug 20, 2019
f11264c
Change ValueError into TypeError for QuantileTransformer
Aug 20, 2019
10bccbe
Modify also the test
Aug 20, 2019
a4dd99b
Avoid reiterating the distance matrix of the distance matrix, etc.
Aug 21, 2019
ef0657c
Put the right error for error message in approximate nearest neighbors
Aug 21, 2019
6651743
Also update the test with TypeError
Aug 21, 2019
feb2932
Remove comment since the issue is opened
Aug 26, 2019
49b31ea
merge with master
wdevazelhes Sep 26, 2019
5a65aa1
fix pep8 errors
wdevazelhes Sep 26, 2019
a011adb
Put TypeError instead of ValueError in neighbors tree
wdevazelhes Sep 26, 2019
8ad06b7
update QuantileTransformer with TypeError when sparse not accepted
wdevazelhes Sep 26, 2019
1c28543
Remove the test of exceptions in check_estimator_sparse_dense
wdevazelhes Sep 30, 2019
a89a94c
Merge branch 'master' into tests_sparse
wdevazelhes Sep 30, 2019
e7975ad
- Test that check_estimator_dense_sparse indeed checks equality
wdevazelhes Oct 1, 2019
1bf69d5
STY: fix pep8 errors
wdevazelhes Oct 1, 2019
6fe2184
Use a BaseBadClassifier to not raise the NotFittedError since NoSpars…
wdevazelhes Oct 1, 2019
c57401d
no reason to remove this decorator, must be a mistake in merge or sth
wdevazelhes Oct 1, 2019
eb854c0
be clearer about NoSparseClassifier
wdevazelhes Oct 1, 2019
7fa703e
Use the undeprecated version of _pairwise_estimator_convert
wdevazelhes Oct 7, 2019
00360ca
merge tentative
wdevazelhes Oct 25, 2020
b068c3f
remove remaining from merge
wdevazelhes Oct 25, 2020
d7e0c70
fix merge
wdevazelhes Oct 25, 2020
baff986
fixes
wdevazelhes Oct 25, 2020
da0a3ad
pep8 fix
wdevazelhes Oct 25, 2020
2063343
replace _safe_tags by _get_tags
wdevazelhes Oct 25, 2020
85862ce
put back the original code that I accidentally removed in the merge
wdevazelhes Oct 25, 2020
6088c7e
fix indentation
wdevazelhes Oct 25, 2020
6cdeaed
the block should be in fact removed
wdevazelhes Oct 25, 2020
2f14fb9
stochastic gradient not found, replacing by private module
wdevazelhes Oct 25, 2020
b2f22fb
add strict mode
wdevazelhes Oct 25, 2020
51c1db2
use instance not class
wdevazelhes Oct 25, 2020
12047be
fix tests
wdevazelhes Oct 25, 2020
792b1ef
remove message since referenced pr is closed
wdevazelhes Oct 26, 2020
2067f4e
remove pyorig file
wdevazelhes Oct 26, 2020
f92a953
use enforce_estimator_tags
wdevazelhes Oct 26, 2020
f952402
use fstrings
wdevazelhes Nov 2, 2020
dde4da5
put a standard random seed
wdevazelhes Nov 2, 2020
5a155fe
rename mock classes with better names
wdevazelhes Nov 2, 2020
0f88016
address https://github.com/scikit-learn/scikit-learn/pull/13246/files…
wdevazelhes Nov 4, 2020
3c152ed
Allow sparsity in AffinityPropagation and fix error in check that was…
wdevazelhes Nov 4, 2020
0f1e39b
Fix problem that appeared due to the previous fix:
wdevazelhes Nov 6, 2020
e32e6aa
Remove specific centering for StandardScaler and RobustScaler in esti…
wdevazelhes Nov 8, 2020
40dd3e9
fix indent
wdevazelhes Nov 11, 2020
0c83018
Remove if precomputed metric or kernel, see comment https://github.co…
wdevazelhes Nov 11, 2020
ed6554a
try to not exclude SGD to see if it still works
wdevazelhes Nov 20, 2020
d7587e1
Merge branch 'master' into tests_sparse
wdevazelhes Nov 22, 2020
70e9d2c
Use BaseBadClassifier instead of LogisticRegression
Nov 23, 2020
d9deadc
fix typo
wdevazelhes Nov 23, 2020
62a893e
Add checks for BaseSGD that sets the intercept decay to the same valu…
wdevazelhes Nov 23, 2020
b955d0c
remove unused import
wdevazelhes Nov 23, 2020
d504d30
try with a smaller intercept decay to be more stable ?
wdevazelhes Nov 24, 2020
0ec7a96
fix code to set the intercept decay to a smaller value for both
wdevazelhes Nov 24, 2020
0ac94c0
fix code to set the intercept decay to a smaller value for both
wdevazelhes Nov 24, 2020
806aa47
ensure that the test doesn't change the intercept decay
wdevazelhes Nov 24, 2020
665053f
remove blank line
wdevazelhes Nov 24, 2020
138b2b9
Be more specific by excluding only estimators that fail (with the jus…
wdevazelhes Nov 24, 2020
b3f99f8
remove unused imports
wdevazelhes Nov 24, 2020
017c318
Create simple regression problem for regressors
wdevazelhes Nov 26, 2020
04fce9d
round the dataset to limit numerical problems
wdevazelhes Nov 26, 2020
e6112e2
use fewer significant digits
wdevazelhes Nov 26, 2020
7c3fd40
try with float16 precision
wdevazelhes Nov 26, 2020
ed37152
Update sklearn/utils/estimator_checks.py
wdevazelhes Dec 15, 2020
152bc66
refactor X_size into n_samples
wdevazelhes Dec 15, 2020
762112f
Update sklearn/utils/estimator_checks.py
wdevazelhes Dec 15, 2020
c615d5c
remove comment
wdevazelhes Dec 15, 2020
70d6e88
add comment about the precision of numbers
wdevazelhes Dec 15, 2020
c74d998
add other prediction functions
wdevazelhes Dec 15, 2020
1b292dc
Remove Scalers tests
wdevazelhes Dec 15, 2020
8a5778a
separate scoring function and prediction/transform functions
wdevazelhes Dec 15, 2020
691a845
score_samples doesn't take y
wdevazelhes Dec 15, 2020
5da39b3
test more methods in check_estimator_sparse_data
wdevazelhes Dec 15, 2020
d9c9645
the try except should only be on the fit, we don't want assertion err…
wdevazelhes Dec 20, 2020
4b8f2a4
subsequent tests should be done only on estimators that pass the 'fit…
wdevazelhes Dec 20, 2020
d066b32
make shape testing work with outlier detector
wdevazelhes Dec 20, 2020
6bb5e36
Fix AdditiveChi2Sampler
wdevazelhes Feb 26, 2021
1e67503
fix IncrementalPCA: accept to transform some sparse inputs with index…
wdevazelhes Feb 26, 2021
9643a38
Fix problems with linear_models
wdevazelhes Mar 2, 2021
71ad125
Simplify data to remove numerical uncertainties
wdevazelhes Mar 3, 2021
ac27d73
Merge branch 'master' into tests_sparse
wdevazelhes Mar 3, 2021
084b604
Refactor dataset use
wdevazelhes Mar 5, 2021
25316ad
use the more extensive _generate_sparse_matrix
wdevazelhes Mar 5, 2021
891f5a9
fix ortho
wdevazelhes Mar 5, 2021
31ecac2
add changelog
wdevazelhes Mar 5, 2021
a5bfc98
use float because np.float is deprecated
wdevazelhes Mar 5, 2021
6a67de0
merge with master
wdevazelhes Mar 14, 2021
10 changes: 10 additions & 0 deletions doc/whats_new/v1.0.rst
Expand Up @@ -223,6 +223,16 @@ Changelog
:class:`calibration.CalibratedClassifierCV` can now properly be used on
prefitted pipelines. :pr:`19641` by :user:`Alek Lefebvre <AlekLefebvre>`

:mod:`sklearn.utils`
....................

- |Enhancement| Improve common tests in `estimator_checks` to ensure that
estimators give the same result when given sparse and dense inputs.
:pr:`13246` by :user:`Maniteja Nandana <maniteja123>`,
:user:`William de Vazelhes <wdevazelhes>`,
:user:`Alexandre Gramfort <agramfort>`, and
:user:`Jérôme Dockès <jeromedockes>`.

Code and Documentation Contributors
-----------------------------------

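The idea behind the new common test can be sketched as follows — a simplified, hypothetical stand-in for `check_estimator_sparse_dense` (the names `check_sparse_dense_equivalence` and `MeanRegressor` are illustrative, not the PR's actual implementation): fit the same estimator on dense and CSR inputs and require identical predictions.

```python
import numpy as np
from scipy import sparse


class MeanRegressor:
    """Tiny stand-in estimator that accepts both dense and sparse X."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # X.shape[0] works for both ndarrays and scipy sparse matrices
        return np.full(X.shape[0], self.mean_)


def check_sparse_dense_equivalence(make_estimator, X, y):
    # Fit twice -- once on dense, once on CSR -- and compare predictions.
    est_dense = make_estimator().fit(X, y)
    est_sparse = make_estimator().fit(sparse.csr_matrix(X), y)
    np.testing.assert_allclose(est_dense.predict(X),
                               est_sparse.predict(sparse.csr_matrix(X)))


X = np.eye(4)
y = np.array([1.0, 2.0, 3.0, 4.0])
check_sparse_dense_equivalence(MeanRegressor, X, y)  # passes silently
```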
2 changes: 1 addition & 1 deletion sklearn/cluster/_affinity_propagation.py
Expand Up @@ -448,7 +448,7 @@ def predict(self, X):
Cluster labels.
"""
check_is_fitted(self)
X = self._validate_data(X, reset=False)
X = self._validate_data(X, reset=False, accept_sparse=True)
if not hasattr(self, "cluster_centers_"):
raise ValueError("Predict method is not supported when "
"affinity='precomputed'.")
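The one-line `accept_sparse=True` change matters because sklearn-style validation rejects sparse input unless the caller opts in. A minimal sketch of that behavior (the `validate` helper here is hypothetical, not the real `_validate_data`):

```python
import numpy as np
from scipy import sparse


def validate(X, accept_sparse=False):
    """Simplified sketch of sklearn-style input validation: sparse input
    is rejected unless the caller opts in with accept_sparse=True."""
    if sparse.issparse(X) and not accept_sparse:
        raise TypeError("Sparse data was passed, but dense data is required.")
    return X


X_sp = sparse.csr_matrix(np.eye(2))
try:
    validate(X_sp)          # default: sparse is refused
    raised = False
except TypeError:
    raised = True
assert raised
assert validate(X_sp, accept_sparse=True) is X_sp  # opt-in: passes through
```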
3 changes: 3 additions & 0 deletions sklearn/decomposition/_incremental_pca.py
Expand Up @@ -350,6 +350,9 @@ def transform(self, X):
if sparse.issparse(X):
n_samples = X.shape[0]
output = []
if type(X) in [sparse.bsr_matrix, sparse.coo_matrix,
sparse.dia_matrix]:
X = X.tocsr(copy=True)
for batch in gen_batches(n_samples, self.batch_size_,
min_batch_size=self.n_components or 0):
output.append(super().transform(X[batch].toarray()))
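The conversion to CSR is needed because BSR, COO, and DIA matrices do not support the row slicing that the batch loop relies on. A quick illustration with SciPy:

```python
import numpy as np
from scipy import sparse

X = sparse.coo_matrix(np.eye(4))
# COO (like BSR and DIA) does not support slicing at all:
try:
    X[0:2]
    sliceable = True
except TypeError:
    sliceable = False
assert not sliceable

# After conversion to CSR, batch-wise row slicing works:
X_csr = X.tocsr(copy=True)
assert X_csr[0:2].shape == (2, 4)
```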
16 changes: 10 additions & 6 deletions sklearn/kernel_approximation.py
Expand Up @@ -593,16 +593,20 @@ def _transform_dense(self, X):
return np.hstack(X_new)

def _transform_sparse(self, X):
indices = X.indices.copy()
indptr = X.indptr.copy()

data_step = np.sqrt(X.data * self.sample_interval_)
# We remove possible explicit zeros, which will lead to infinity
# in the log:
X_pruned = X.copy()
X_pruned.eliminate_zeros()

indices = X_pruned.indices.copy()
indptr = X_pruned.indptr.copy()
data_step = np.sqrt(X_pruned.data * self.sample_interval_)
X_step = sp.csr_matrix((data_step, indices, indptr),
shape=X.shape, dtype=X.dtype, copy=False)
X_new = [X_step]

log_step_nz = self.sample_interval_ * np.log(X.data)
step_nz = 2 * X.data * self.sample_interval_
log_step_nz = self.sample_interval_ * np.log(X_pruned.data)
step_nz = 2 * X_pruned.data * self.sample_interval_

for j in range(1, self.sample_steps):
factor_nz = np.sqrt(step_nz /
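The reason for the pruning step: CSR matrices may carry *explicit* zeros in `.data`, and `np.log(0)` yields `-inf`. A small demonstration of the failure mode and of the `eliminate_zeros` fix:

```python
import numpy as np
from scipy import sparse

# Build a CSR matrix that stores an explicit zero in .data
data = np.array([0.0, 2.0])
indices = np.array([0, 1])
indptr = np.array([0, 2])
X = sparse.csr_matrix((data, indices, indptr), shape=(1, 3))
assert X.nnz == 2  # the explicit zero counts as a stored entry

with np.errstate(divide="ignore"):
    assert np.isneginf(np.log(X.data)).any()  # log(0) -> -inf

X.eliminate_zeros()  # drop explicit zeros, as X_pruned does in the diff
assert X.nnz == 1
assert np.isfinite(np.log(X.data)).all()
```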
3 changes: 2 additions & 1 deletion sklearn/linear_model/_base.py
Expand Up @@ -45,6 +45,7 @@
# TODO: bayesian_ridge_regression and bayesian_regression_ard
# should be squashed into its respective objects.

DENSE_INTERCEPT_DECAY = 1.0
SPARSE_INTERCEPT_DECAY = 0.01
# For sparse data intercept updates are scaled by this decay factor to avoid
# intercept oscillation.
Expand Down Expand Up @@ -186,7 +187,7 @@ def make_dataset(X, y, sample_weight, random_state=None):
else:
X = np.ascontiguousarray(X)
dataset = ArrayData(X, y, sample_weight, seed=seed)
intercept_decay = 1.0
intercept_decay = DENSE_INTERCEPT_DECAY

return dataset, intercept_decay

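For context, the two constants exist because on sparse data the intercept update is scaled down to avoid oscillation. A toy sketch (hypothetical, not the actual SGD implementation) showing the damping effect:

```python
def sgd_intercept_steps(grads, eta, intercept_decay):
    """Toy intercept-only SGD: each update is scaled by intercept_decay."""
    intercept = 0.0
    path = []
    for g in grads:
        intercept -= eta * g * intercept_decay
        path.append(intercept)
    return path


grads = [1.0, -1.0] * 5  # oscillating gradient
dense = sgd_intercept_steps(grads, eta=0.5, intercept_decay=1.0)    # decay 1.0
damped = sgd_intercept_steps(grads, eta=0.5, intercept_decay=0.01)  # decay 0.01
# The damped path stays much closer to zero than the undamped one
assert max(map(abs, damped)) < max(map(abs, dense))
```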
34 changes: 34 additions & 0 deletions sklearn/linear_model/_passive_aggressive.py
Expand Up @@ -257,6 +257,23 @@ def fit(self, X, y, coef_init=None, intercept_init=None):
loss="hinge", learning_rate=lr,
coef_init=coef_init, intercept_init=intercept_init)

def _more_tags(self):
return {
'_xfail_checks': {
'check_estimator_sparse_dense':
"PassiveAggressiveClassifier has a "
"special intercept_decay for sparse inputs (see "
"the constant "
"`linear_model._base.SPARSE_INTERCEPT_DECAY`), "
"which gives different results than the one for "
"dense data. Therefore it is not tested in common "
"tests but rather in `linear_model.test_sgd` (namely "
"`test_sgd_sparse_dense_same_decay`), with the sparse "
"intercept set to the same value between sparse and "
"dense, on a toy example."
}
}


class PassiveAggressiveRegressor(BaseSGDRegressor):
"""Passive Aggressive Regressor
Expand Down Expand Up @@ -468,3 +485,20 @@ def fit(self, X, y, coef_init=None, intercept_init=None):
learning_rate=lr,
coef_init=coef_init,
intercept_init=intercept_init)

def _more_tags(self):
return {
'_xfail_checks': {
'check_estimator_sparse_dense':
"PassiveAggressiveRegressor has a "
"special intercept_decay for sparse inputs (see "
"the constant "
"`linear_model._base.SPARSE_INTERCEPT_DECAY`), "
"which gives different results than the one for "
"dense data. Therefore it is not tested in common "
"tests but rather in `linear_model.test_sgd` (namely "
"`test_sgd_sparse_dense_same_decay`), with the sparse "
"intercept set to the same value between sparse and "
"dense, on a toy example."
}
}
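The `_xfail_checks` entries above rely on sklearn's tag mechanism, where each class in the MRO can contribute tags via `_more_tags`, later classes overriding earlier ones. A simplified sketch of that collection pattern (not sklearn's exact code):

```python
class BaseEstimatorSketch:
    """Simplified sketch of sklearn-style tag collection."""

    def _more_tags(self):
        return {"_xfail_checks": {}}

    def _get_tags(self):
        tags = {}
        # Walk the MRO from base to subclass so subclasses override bases
        for klass in reversed(type(self).__mro__):
            if hasattr(klass, "_more_tags"):
                tags.update(klass._more_tags(self))
        return tags


class MyEstimator(BaseEstimatorSketch):
    def _more_tags(self):
        return {"_xfail_checks":
                {"check_estimator_sparse_dense": "reason to skip"}}


tags = MyEstimator()._get_tags()
assert "check_estimator_sparse_dense" in tags["_xfail_checks"]
```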
17 changes: 17 additions & 0 deletions sklearn/linear_model/_perceptron.py
Expand Up @@ -169,3 +169,20 @@ def __init__(self, *, penalty=None, alpha=0.0001, l1_ratio=0.15,
validation_fraction=validation_fraction,
n_iter_no_change=n_iter_no_change, power_t=0.5,
warm_start=warm_start, class_weight=class_weight, n_jobs=n_jobs)

def _more_tags(self):
return {
'_xfail_checks': {
'check_estimator_sparse_dense':
"Perceptron has a "
"special intercept_decay for sparse inputs (see "
"the constant "
"`linear_model._base.SPARSE_INTERCEPT_DECAY`), "
"which gives different results than the one for "
"dense data. Therefore it is not tested in common "
"tests but rather in `linear_model.test_sgd` (namely "
"`test_sgd_sparse_dense_same_decay`), with the sparse "
"intercept set to the same value between sparse and "
"dense, on a toy example."
}
}
22 changes: 22 additions & 0 deletions sklearn/linear_model/_stochastic_gradient.py
Expand Up @@ -1111,6 +1111,17 @@ def _more_tags(self):
'_xfail_checks': {
'check_sample_weights_invariance':
'zero sample_weight is not equivalent to removing samples',
'check_estimator_sparse_dense':
"SGDClassifier has a "
"special intercept_decay for sparse inputs (see "
"the constant "
"`linear_model._base.SPARSE_INTERCEPT_DECAY`), "
"which gives different results than the one for "
"dense data. Therefore it is not tested in common "
"tests but rather in `linear_model.test_sgd` (namely "
"`test_sgd_sparse_dense_same_decay`), with the sparse "
"intercept set to the same value between sparse and "
"dense, on a toy example."
}
}

Expand Down Expand Up @@ -1604,5 +1615,16 @@ def _more_tags(self):
'_xfail_checks': {
'check_sample_weights_invariance':
'zero sample_weight is not equivalent to removing samples',
'check_estimator_sparse_dense':
"SGDRegressor has a "
"special intercept_decay for sparse inputs (see "
"the constant "
"`linear_model._base.SPARSE_INTERCEPT_DECAY`), "
"which gives different results than the one for "
"dense data. Therefore it is not tested in common "
"tests but rather in `linear_model.test_sgd` (namely "
"`test_sgd_sparse_dense_same_decay`), with the sparse "
"intercept set to the same value between sparse and "
"dense, on a toy example."
}
}
26 changes: 26 additions & 0 deletions sklearn/linear_model/tests/test_sgd.py
Expand Up @@ -10,6 +10,7 @@
from sklearn.utils._testing import assert_array_almost_equal
from sklearn.utils._testing import assert_raises_regexp
from sklearn.utils._testing import ignore_warnings
from sklearn.utils.estimator_checks import check_estimator_sparse_dense
from sklearn.utils.fixes import parse_version

from sklearn import linear_model, datasets, metrics
Expand Down Expand Up @@ -1646,3 +1647,28 @@ def test_SGDClassifier_fit_for_all_backends(backend):
with joblib.parallel_backend(backend=backend):
clf_parallel.fit(X, y)
assert_array_almost_equal(clf_sequential.coef_, clf_parallel.coef_)


@pytest.mark.parametrize('estimator_orig',
[linear_model.SGDRegressor(),
linear_model.PassiveAggressiveRegressor(),
linear_model.SGDClassifier(),
linear_model.Perceptron(),
linear_model.PassiveAggressiveClassifier()])
def test_sgd_sparse_dense_same_decay(estimator_orig):
"""Tests that with default parameters, estimators that inherit from
`sklearn.linear_model._stochastic_gradient.BaseSGD`
return the same results on dense and sparse data. It's
tested here and not in common tests because for sparse data,
the "intercept decay" variable is set to a different value than for
dense data, which would give different results between sparse and
dense. Here we test that for toy examples, if this intercept
decay is set to the same value, the result is the same between
sparse and dense."""
old_dense_intercept_decay = linear_model._base.DENSE_INTERCEPT_DECAY
old_sparse_intercept_decay = linear_model._base.SPARSE_INTERCEPT_DECAY
linear_model._base.DENSE_INTERCEPT_DECAY = 0.01
linear_model._base.SPARSE_INTERCEPT_DECAY = 0.01
check_estimator_sparse_dense(None, estimator_orig)
linear_model._base.DENSE_INTERCEPT_DECAY = old_dense_intercept_decay
linear_model._base.SPARSE_INTERCEPT_DECAY = old_sparse_intercept_decay
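One fragility worth noting: if `check_estimator_sparse_dense` raises, the test above never restores the patched constants. A context manager (a sketch, not part of this PR) would make the patching exception-safe:

```python
import contextlib
import types


@contextlib.contextmanager
def same_intercept_decay(module, value):
    """Temporarily set both decay constants, restoring them even on error."""
    old_dense = module.DENSE_INTERCEPT_DECAY
    old_sparse = module.SPARSE_INTERCEPT_DECAY
    module.DENSE_INTERCEPT_DECAY = value
    module.SPARSE_INTERCEPT_DECAY = value
    try:
        yield
    finally:
        module.DENSE_INTERCEPT_DECAY = old_dense
        module.SPARSE_INTERCEPT_DECAY = old_sparse


# Demonstrate on a stand-in "module":
mod = types.SimpleNamespace(DENSE_INTERCEPT_DECAY=1.0,
                            SPARSE_INTERCEPT_DECAY=0.01)
try:
    with same_intercept_decay(mod, 0.01):
        assert mod.DENSE_INTERCEPT_DECAY == 0.01
        raise RuntimeError("failing check")
except RuntimeError:
    pass
assert mod.DENSE_INTERCEPT_DECAY == 1.0  # restored despite the failure
```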
19 changes: 17 additions & 2 deletions sklearn/preprocessing/_data.py
Expand Up @@ -953,7 +953,15 @@ def inverse_transform(self, X, copy=None):

def _more_tags(self):
return {'allow_nan': True,
'preserves_dtype': [np.float64, np.float32]}
'preserves_dtype': [np.float64, np.float32],
'_xfail_checks':
{'check_estimator_sparse_dense':
"Default StandardScaler doesn't support sparse "
"inputs. But StandardScaler is tested on sparse "
"data in `preprocessing.tests.test_data."
"test_scaler_without_centering`."
}
}


class MaxAbsScaler(TransformerMixin, BaseEstimator):
Expand Down Expand Up @@ -1459,7 +1467,14 @@ def inverse_transform(self, X):
return X

def _more_tags(self):
return {'allow_nan': True}
return {'allow_nan': True,
'_xfail_checks':
{'check_estimator_sparse_dense':
"Default RobustScaler doesn't support sparse inputs. "
"But RobustScaler is tested on sparse data in "
"`preprocessing.tests.test_data."
"test_robust_scaler_equivalence_dense_sparse`."
}}


@_deprecate_positional_args
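For background on why the default scalers are excluded: centering subtracts the column means, which turns stored zeros into nonzeros and densifies a sparse matrix. A quick demonstration:

```python
import numpy as np
from scipy import sparse

# A very sparse random matrix: ~1% of entries are stored
X = sparse.random(100, 50, density=0.01, format="csr", random_state=0)
means = np.asarray(X.mean(axis=0)).ravel()

# Dense fallback just to illustrate what centering would do
centered = X.toarray() - means

assert X.nnz < 100                          # sparse before centering
assert np.count_nonzero(centered) > X.nnz   # far denser after centering
```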