[WIP] Make SVC tests independent of SV ordering #12849

ogrisel · 2018-12-21T11:02:01Z

This PR changes the assertions that compare the dual coefficients of two SVM models to check that the models are equal, because the ordering of the support vectors is non-deterministic when the training set contains duplicated samples (as is the case for the iris dataset and the subset of 20 newsgroups we use in some of the tests).
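To illustrate what an ordering-independent comparison can look like, here is a minimal sketch. It is not the code from this PR: the helper names `sort_support_vectors` and `equal_up_to_sv_ordering` are hypothetical, the inputs are assumed to be dense arrays shaped like `support_vectors_` and `dual_coef_`, and the tolerances are arbitrary. Both models are put into a canonical (lexicographic) row order before comparing.

```python
import numpy as np


def sort_support_vectors(support_vectors, dual_coef):
    """Reorder support vectors lexicographically, permuting dual_coef to match.

    support_vectors: dense array of shape (n_SV, n_features)
    dual_coef: dense array of shape (n_rows, n_SV)
    """
    sv = np.asarray(support_vectors)
    dc = np.asarray(dual_coef)
    # np.lexsort treats the LAST key as the primary one, so the feature
    # columns are passed in reverse to make column 0 the primary sort key.
    order = np.lexsort(sv.T[::-1])
    return sv[order], dc[:, order]


def equal_up_to_sv_ordering(sv1, dc1, sv2, dc2, rtol=1e-7, atol=1e-12):
    """Return True if both parametrizations match after canonical sorting."""
    sv1, dc1 = sort_support_vectors(sv1, dc1)
    sv2, dc2 = sort_support_vectors(sv2, dc2)
    if sv1.shape != sv2.shape or dc1.shape != dc2.shape:
        return False
    return bool(
        np.allclose(sv1, sv2, rtol=rtol, atol=atol)
        and np.allclose(dc1, dc2, rtol=rtol, atol=atol)
    )


# Toy data: the second "model" is the first one with permuted support vectors.
sv_a = np.array([[2.0, 0.0], [1.0, 5.0], [1.0, 3.0]])
dc_a = np.array([[0.2, -0.5, 0.3]])
perm = [1, 2, 0]
sv_b, dc_b = sv_a[perm], dc_a[:, perm]

sorted_sv_a, sorted_dc_a = sort_support_vectors(sv_a, dc_a)
same = equal_up_to_sv_ordering(sv_a, dc_a, sv_b, dc_b)
print(same)
```

Sorting only removes the dependence on ordering. It does not help when a duplicated sample shows up a different number of times in `support_vectors_`, or when its coefficient is split differently between the copies.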

The problem was identified in #12738 when trying to replace the libsvm-based solver by the one from DAAL using the daal4py monkeypatch from uxlfoundation/scikit-learn-intelex#15.

I have tried to run the tests with:

python -m daal4py -m pytest -vl  sklearn/svm/tests

and they pass on this branch (using the daal4py branch from uxlfoundation/scikit-learn-intelex#15).

However, if I decrease the regularization (larger C value) in test_svc_iris or in test_sparse_20newsgroups_subset, they fail. I am not sure whether this reveals a true discrepancy between the two solvers or whether the optimization problem simply becomes too underdetermined to guarantee such a strict equivalence of the estimated parametrization of the decision function.

This also relaxes some strict assertions on the fitted coefficients.

oleksandr-pavlyk · 2019-01-10T21:18:07Z

sklearn/svm/tests/test_sparse.py

+    dc1 = _toarray(svc1.dual_coef_)
+    sv2 = _toarray(svc2.support_vectors_)
+    dc2 = _toarray(svc2.dual_coef_)
+    assert dc1.shape == dc2.shape


Actually, the sets of support vectors may be different. If the input contains duplicate samples, and such a point happens to be a support vector, then support_vectors_ may contain a single entry or multiple entries for it, depending on the history of accumulated floating point errors.
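One possible way to compare two fits despite this, sketched under assumptions: since exact duplicates contribute to the decision function only through the sum of their dual coefficients, the copies can be merged before comparing. The helper `merge_duplicate_support_vectors` below is hypothetical (it is not part of this PR or of scikit-learn), expects dense arrays, and only merges rows that are bitwise identical.

```python
import numpy as np


def merge_duplicate_support_vectors(support_vectors, dual_coef):
    """Collapse identical support vectors and sum their dual coefficients.

    support_vectors: dense array of shape (n_SV, n_features)
    dual_coef: dense array of shape (n_rows, n_SV)
    Returns the unique rows (in lexicographic order) and the summed coefficients.
    """
    sv = np.asarray(support_vectors, dtype=float)
    dc = np.asarray(dual_coef, dtype=float)
    unique_sv, inverse = np.unique(sv, axis=0, return_inverse=True)
    # Flatten defensively: the shape of `inverse` has varied across NumPy versions.
    inverse = np.asarray(inverse).ravel()
    merged = np.zeros((dc.shape[0], unique_sv.shape[0]))
    for k in range(dc.shape[0]):
        # Sum the coefficients of all rows that map to the same unique vector.
        merged[k] = np.bincount(inverse, weights=dc[k], minlength=unique_sv.shape[0])
    return unique_sv, merged


# Toy data: fit "a" keeps two copies of the point (1, 2), fit "b" keeps one.
sv_a = np.array([[1.0, 2.0], [1.0, 2.0], [0.0, 1.0]])
dc_a = np.array([[0.25, 0.25, -0.5]])
sv_b = np.array([[0.0, 1.0], [1.0, 2.0]])
dc_b = np.array([[-0.5, 0.5]])

merged_sv_a, merged_dc_a = merge_duplicate_support_vectors(sv_a, dc_a)
merged_sv_b, merged_dc_b = merge_duplicate_support_vectors(sv_b, dc_b)
print(np.array_equal(merged_sv_a, merged_sv_b), np.allclose(merged_dc_a, merged_dc_b))
```

With this canonical form the shapes of the two fits no longer need to match before merging, only after it.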

adrinjalali

I wonder: if the SVs are not deterministic in the case of duplicate entries, should we change the datasets used in these tests?

Doesn't seem like there's an easy solution to this.

adrinjalali · 2020-02-14T13:49:46Z

sklearn/svm/tests/test_sparse.py

+def _toarray(a):
+    if sparse.issparse(a):
+        return a.toarray()
+    return a


would np.asarray(a) not work?
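For context on this question, a small check of how `np.asarray` treats a SciPy sparse matrix. To my knowledge, SciPy sparse matrices are not densified by `np.asarray`; NumPy wraps the matrix object in a 0-d object array instead, which is why an explicit `toarray()` call is needed. The variable names are illustrative only.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.5], [2.0, 0.0]])
sp = sparse.csr_matrix(dense)

# np.asarray does not convert the sparse matrix to a 2D float array.
wrapped = np.asarray(sp)
# toarray() returns the dense 2D ndarray with the same values.
densified = sp.toarray()

print(wrapped.shape, wrapped.dtype)
print(densified.shape, densified.dtype)
```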

oleksandr-pavlyk · 2020-02-14

This was the initial proposal in #12732.

scikit-learn still needs to be able to deal with inputs where the feature matrix has duplicates. It is just that, in that case, the original quadratic optimization problem admits infinitely many solutions.

The test for such a case should be that the training succeeds and predictions can be made.

ogrisel · 2020-02-14

Summarizing some IRL discussion:

We could have a test that checks that the predictions (on a random test set) are the same if:

SVC is trained on both the dense and the sparse version of the same training dataset with duplicates;

SVC is trained on datasets with and without duplicates, using a sample weight of 0.5 for the training points that appear twice.

We could then update the sparse / dense coef_ comparison test to use data without duplicated data points.
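The two prediction checks proposed above could be sketched roughly as follows. This is only an illustration under assumptions, not the test that was eventually written: the synthetic `make_blobs` data, the linear kernel, the tightened `tol`, and the helper `fit_predict` are all choices made for the example. Exact agreement of predictions may also be sensitive to solver tolerance for test points very close to the decision boundary.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so that some repeated points are likely support vectors
# (illustrative data, not the datasets used in the scikit-learn tests).
X, y = make_blobs(n_samples=60, centers=[[-1.0, -1.0], [1.0, 1.0]],
                  cluster_std=1.0, random_state=0)

# Append a second copy of the first n_dup samples.
n_dup = 10
X_dup = np.vstack([X, X[:n_dup]])
y_dup = np.concatenate([y, y[:n_dup]])

# Random test set on which predictions are compared.
X_test = np.random.RandomState(42).uniform(-4, 4, size=(50, 2))


def fit_predict(X_train, y_train, sample_weight=None):
    """Fit a linear SVC (hypothetical settings) and predict on X_test."""
    clf = SVC(kernel="linear", C=1.0, tol=1e-6)
    clf.fit(X_train, y_train, sample_weight=sample_weight)
    return clf.predict(X_test)


# Check 1: dense vs. sparse version of the same data with duplicates.
pred_dense = fit_predict(X_dup, y_dup)
pred_sparse = fit_predict(sparse.csr_matrix(X_dup), y_dup)

# Check 2: data without duplicates vs. data with duplicates where each of
# the two copies of a repeated point gets a sample weight of 0.5.
weights = np.ones(len(y_dup))
weights[:n_dup] = 0.5
weights[-n_dup:] = 0.5
pred_weighted_dup = fit_predict(X_dup, y_dup, sample_weight=weights)
pred_no_dup = fit_predict(X, y)

print("dense == sparse:", np.array_equal(pred_dense, pred_sparse))
print("weighted duplicates == no duplicates:",
      np.array_equal(pred_weighted_dup, pred_no_dup))
```

Comparing predictions rather than `dual_coef_` sidesteps the non-uniqueness of the dual solution, since both fits should describe the same decision function even when the coefficients are distributed differently among duplicated points.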

Make SVC tests independent of SV ordering

6b9da82

This also relaxes some strict assertions on the fitted coefficients.

ogrisel changed the title ~~[WIP] Improve SVC tests to make the~~ [WIP] Make SVC tests independent of SV ordering Dec 21, 2018

oleksandr-pavlyk reviewed Jan 10, 2019

amueller added the Waiting for Reviewer label Aug 6, 2019

adrinjalali added the module:svm label Feb 14, 2020

adrinjalali reviewed Feb 14, 2020

Base automatically changed from master to main January 22, 2021 10:50

cmarmo added Needs work and removed Waiting for Reviewer labels Jul 16, 2022
