[WIP] Make SVC tests independent of SV ordering #12849
Conversation
This also relaxes some strict assertions on the fitted coefficients.
dc1 = _toarray(svc1.dual_coef_)
sv2 = _toarray(svc2.support_vectors_)
dc2 = _toarray(svc2.dual_coef_)
assert dc1.shape == dc2.shape
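One way to make such a comparison fully independent of support-vector ordering could be to sort the support vectors lexicographically and permute `dual_coef_` with the same order before comparing. This is only a sketch, not the PR's actual test code: the `_sorted_svs` helper, the toy data, and the linear kernel are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def _sorted_svs(svc):
    # Sort support vectors lexicographically and apply the same
    # permutation to the dual coefficients.
    sv = np.asarray(svc.support_vectors_)
    dc = np.asarray(svc.dual_coef_)
    order = np.lexsort(sv.T)
    return sv[order], dc[:, order]

# Toy data without duplicates, so the dual solution is unique.
X = np.array([[0., 0.], [1., 1.], [2., 0.], [0., 2.], [3., 3.]])
y = np.array([0, 1, 0, 0, 1])

# Same data in a different row order: the set of support vectors is the
# same, but libsvm may report them in a different order.
perm = np.array([4, 2, 0, 3, 1])
svc1 = SVC(kernel="linear", C=1.0).fit(X, y)
svc2 = SVC(kernel="linear", C=1.0).fit(X[perm], y[perm])

sv1, dc1 = _sorted_svs(svc1)
sv2, dc2 = _sorted_svs(svc2)
assert sv1.shape == sv2.shape
assert np.allclose(sv1, sv2)
```

After sorting, the support-vector matrices can be compared element-wise regardless of the order in which the solver enumerated them.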
Actually, the sets of support vectors may be different. If the input contains duplicate samples, and such a point happens to be a support vector, then support_vectors_ may contain a single entry or multiple entries for it, depending on the history of accumulation of floating point errors.
If the SVs are not deterministic in the case of duplicate entries, should we change the datasets used in these tests?
Doesn't seem like there's an easy solution to this.
def _toarray(a):
    if sparse.issparse(a):
        return a.toarray()
    return a
Would np.asarray(a) not work?
This was the initial proposal in #12732.
scikit-learn still needs to be able to deal with inputs where the feature matrix has duplicates; it is just that the original quadratic optimization problem then admits infinitely many solutions.
The test for such a case should be that the training succeeds and predictions can be made.
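A minimal version of such a test might look like the following sketch (the toy data with exact duplicate rows is an assumption for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Training data containing exact duplicate rows: the dual problem is then
# degenerate, but fitting and predicting should still work.
X = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.], [2., 2.]])
y = np.array([0, 0, 1, 1, 1])

svc = SVC(kernel="linear", C=1.0).fit(X, y)
pred = svc.predict(np.array([[0.1, 0.1], [1.9, 1.9]]))
assert pred.tolist() == [0, 1]
```

The point is not to pin down which duplicates end up as support vectors, only that training succeeds and the resulting model makes sensible predictions.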
Summarizing some IRL discussion:
We could have a test that checks that the predictions (on a random test set) are the same when:
- SVC is trained on both the dense and the sparse version of the same training dataset with duplicates;
- SVC is trained on datasets with and without duplicates, with a 0.5 sample weight for the training points duplicated twice.
We should then update the sparse / dense coef_ comparison test to use data without duplicated data points.
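The second check could be sketched like this (not the PR's actual test; the well-separated toy clusters and C=1.0 are assumptions so that the predictions are unambiguous). Each training point is duplicated and every copy gets a sample weight of 0.5, which defines the same weighted optimization problem as the original unweighted data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 5.0])
y = np.array([0] * 10 + [1] * 10)

# Duplicate every sample and halve its weight: same weighted problem.
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])
w_dup = np.full(40, 0.5)

svc_plain = SVC(kernel="linear", C=1.0).fit(X, y)
svc_dup = SVC(kernel="linear", C=1.0).fit(X_dup, y_dup, sample_weight=w_dup)

# Predictions on a fresh test set drawn from the same two clusters
# should agree between the two models.
X_test = np.vstack([rng.randn(5, 2), rng.randn(5, 2) + 5.0])
assert np.array_equal(svc_plain.predict(X_test), svc_dup.predict(X_test))
```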
This PR changes the assertions made on the parametrization of the dual coefficients of SVM when checking the equality of two models, because the ordering of the support vectors is non-deterministic in the case of duplicated samples in the training set (as is the case in the iris dataset and the subset of 20 newsgroups we use in some of the tests).
The problem was identified in #12738 when trying to replace the libsvm-based solver with the one from DAAL using the daal4py monkeypatch from uxlfoundation/scikit-learn-intelex#15.
I have tried to run the tests with:
and they pass on this branch (using the daal4py branch from uxlfoundation/scikit-learn-intelex#15).
However, if I decrease the regularization (larger C value) in test_svc_iris or in test_sparse_20newsgroups_subset, they fail. I am not sure whether this reveals a true discrepancy between the two solvers or whether the optimization problem simply becomes too underdetermined to guarantee such a strict equivalence of the estimated parametrization of the decision function.