[WIP] Make SVC tests independent of SV ordering #12849

Open
wants to merge 1 commit into
base: main
from

ogrisel
Member

@ogrisel ogrisel commented Dec 21, 2018

This PR changes the assertions on the dual-coefficient parametrization that the SVM tests use to check the equality of two models, because the ordering of the support vectors is non-deterministic when the training set contains duplicated samples (as is the case in the iris dataset and the subset of 20 newsgroups we use in some of the tests).

The problem was identified in #12738 when trying to replace the libsvm-based solver by the one from DAAL using the daal4py monkeypatch from uxlfoundation/scikit-learn-intelex#15.

I have tried to run the tests with:

python -m daal4py -m pytest -vl  sklearn/svm/tests

and they pass on this branch (using the daal4py branch from uxlfoundation/scikit-learn-intelex#15).

However, if I decrease the regularization (larger C value) in test_svc_iris or in test_sparse_20newsgroups_subset, they fail. I am not sure whether this reveals a true discrepancy between the two solvers or whether the optimization problem simply becomes too underdetermined to guarantee such a strict equivalence of the estimated parametrization of the decision function.

This also relaxes some strict assertions on the fitted coefficients.
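
To illustrate the idea of an ordering-independent comparison, here is a minimal NumPy-only sketch. It is not the code from this PR: the helper name `canonical_order` and the toy arrays are hypothetical, and it assumes the two models hold the same set of support vectors, only in a different order.

```python
import numpy as np


def canonical_order(support_vectors, dual_coef):
    """Sort support vectors lexicographically and permute dual_coef to match.

    support_vectors has shape (n_sv, n_features); dual_coef has shape
    (n_rows, n_sv), with one column per support vector.
    """
    sv = np.asarray(support_vectors)
    dc = np.asarray(dual_coef)
    # np.lexsort treats the LAST key as the primary one, so the feature
    # columns are reversed to make the first feature the primary sort key.
    order = np.lexsort(sv.T[::-1])
    return sv[order], dc[:, order]


# Hypothetical parametrizations of the same model, differing only in SV order.
sv1 = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 3.0]])
dc1 = np.array([[0.3, -0.7, 0.4]])
perm = np.array([2, 0, 1])
sv2, dc2 = sv1[perm], dc1[:, perm]

sv1_s, dc1_s = canonical_order(sv1, dc1)
sv2_s, dc2_s = canonical_order(sv2, dc2)

# After canonical ordering the two parametrizations can be compared directly.
np.testing.assert_allclose(sv1_s, sv2_s)
np.testing.assert_allclose(dc1_s, dc2_s)
```

Sorting alone does not settle the case where a duplicated sample shows up a different number of times in the two sets of support vectors.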
@ogrisel ogrisel changed the title from "[WIP] Improve SVC tests to make the" to "[WIP] Make SVC tests independent of SV ordering" on Dec 21, 2018
dc1 = _toarray(svc1.dual_coef_)
sv2 = _toarray(svc2.support_vectors_)
dc2 = _toarray(svc2.dual_coef_)
assert dc1.shape == dc2.shape
Contributor

Actually, the sets of support vectors may themselves differ. If the input contains duplicated samples and such a point happens to be a support vector, then support_vectors_ may contain a single entry or multiple entries for it, depending on how floating point errors accumulated.
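
One way to compare two parametrizations despite this is to merge identical support vectors and sum their dual coefficients: identical rows produce identical kernel values, so the decision function is unchanged. The following is a sketch under that observation only; `merge_duplicate_svs` and the toy arrays are hypothetical and not part of this PR.

```python
import numpy as np


def merge_duplicate_svs(support_vectors, dual_coef):
    """Collapse identical support vectors and sum their dual coefficients."""
    sv = np.asarray(support_vectors)
    dc = np.asarray(dual_coef)
    # np.unique with axis=0 returns the distinct rows in sorted order, plus
    # the index of the distinct row that each original row maps to.
    unique_sv, inverse = np.unique(sv, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions
    merged = np.zeros((dc.shape[0], unique_sv.shape[0]))
    for j, k in enumerate(inverse):
        merged[:, k] += dc[:, j]
    return unique_sv, merged


# Hypothetical case: model A keeps a duplicated point twice, model B once.
sv_a = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
dc_a = np.array([[0.2, 0.3, -0.5]])
sv_b = np.array([[3.0, 4.0], [1.0, 2.0]])
dc_b = np.array([[-0.5, 0.5]])

sv_a_m, dc_a_m = merge_duplicate_svs(sv_a, dc_a)
sv_b_m, dc_b_m = merge_duplicate_svs(sv_b, dc_b)

np.testing.assert_allclose(sv_a_m, sv_b_m)
np.testing.assert_allclose(dc_a_m, dc_b_m)
```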

Member

@adrinjalali adrinjalali left a comment

If the SVs are not deterministic in case of duplicate entries, I wonder whether we should change the datasets used in these tests.

Doesn't seem like there's an easy solution to this.

Comment on lines +139 to +142
def _toarray(a):
    if sparse.issparse(a):
        return a.toarray()
    return a
Member

Would np.asarray(a) not work?

Contributor

This was the initial proposal in #12732.

scikit-learn still needs to be able to deal with inputs where the feature matrix has duplicates; it is just that the original quadratic optimization problem admits infinitely many solutions.

The test for such a case should be that the training succeeds and predictions can be made.
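
A minimal sketch of such a smoke test, assuming synthetic data in place of the real test datasets (the data, kernel choice and variable names are illustrative, not taken from the scikit-learn test suite):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = (X[:, 0] > 0).astype(int)

# Duplicate every sample so the dual problem has no unique solution.
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])

# The only requirements: fitting succeeds and predictions can be made.
clf = SVC(kernel="linear").fit(X_dup, y_dup)
pred = clf.predict(X)

assert pred.shape == (X.shape[0],)
assert set(pred.tolist()) <= {0, 1}
```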

Member Author

Summarizing some IRL discussion:

We could have a test that checks that the predictions (on a random test set) are the same if:

  • SVC is trained on both the dense and the sparse version of the same training dataset with duplicates;
  • SVC is trained on datasets with and without duplicates, but with a 0.5 sample weight for training points that are duplicated twice.

And then update the sparse / dense coef_ comparison test to use data without duplicated data-points.
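
The two prediction checks above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the synthetic Gaussian data, the linear kernel, and the reading of the second check as "every point appears twice with weight 0.5" versus "every point appears once with the default weight". It measures agreement instead of asserting exact equality, since both solvers stop at a finite tolerance.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

rng = np.random.RandomState(42)
# Two overlapping Gaussian classes (synthetic stand-in for the real datasets).
X_base = np.vstack([rng.randn(40, 3) - 1.0, rng.randn(40, 3) + 1.0])
y_base = np.array([0] * 40 + [1] * 40)
X_test = np.vstack([rng.randn(50, 3) - 1.0, rng.randn(50, 3) + 1.0])

# Training set in which every point appears twice.
X_dup = np.vstack([X_base, X_base])
y_dup = np.concatenate([y_base, y_base])

# Check 1: dense vs. sparse fit on the same data with duplicates.
dense_clf = SVC(kernel="linear", C=1.0).fit(X_dup, y_dup)
sparse_clf = SVC(kernel="linear", C=1.0).fit(sparse.csr_matrix(X_dup), y_dup)
agree_sparse = np.mean(
    dense_clf.predict(X_test) == sparse_clf.predict(sparse.csr_matrix(X_test))
)

# Check 2: duplicates with weight 0.5 vs. no duplicates with default weights.
half_weights = np.full(len(y_dup), 0.5)
weighted_clf = SVC(kernel="linear", C=1.0).fit(
    X_dup, y_dup, sample_weight=half_weights
)
plain_clf = SVC(kernel="linear", C=1.0).fit(X_base, y_base)
agree_weights = np.mean(weighted_clf.predict(X_test) == plain_clf.predict(X_test))

print("dense vs sparse agreement:", agree_sparse)
print("weighted duplicates vs no duplicates agreement:", agree_weights)
```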

Base automatically changed from master to main January 22, 2021 10:50
5 participants