[MRG] Linear One-Class SVM using SGD implementation #10027
Conversation
This looks cool, thank you. Since this is a pretty significant contribution, expect that this will take some time to review...
Right.
Thanks for the review @TomDLT! The estimator is now using the future default values of …
The idea behind having a smaller intercept update for sparse data is that the intercept is updated at each sample, whereas a particular feature coefficient is updated only when that feature is nonzero in the current sample. The intercept is thus updated much more often than the other coefficients, hence the decay to balance this effect.
I don't mind having a parameter `offset_decay`, but I would be in favor of homogeneity among the SGD classes, as it seems that `SGDOneClassSVM` is not fundamentally different. Also, I would put the parameter in `__init__`, and not in `fit`.
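To make the frequency imbalance concrete, here is a small illustrative sketch; the matrix size, density, and the decay constant are made-up numbers for illustration, not the scikit-learn internals:

```python
# Illustrative sketch: the intercept is touched on every sample, while
# coefficient j is only touched when feature j is nonzero in the sample.
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.random(1000, 20, density=0.05, format="csr", random_state=rng)

updates_per_coef = np.diff(X.tocsc().indptr)   # nonzeros per feature column
updates_for_intercept = X.shape[0]             # one update per sample

print(updates_per_coef.mean())   # ~50 updates per coefficient
print(updates_for_intercept)     # 1000 updates for the intercept
# Scaling the intercept step by a small decay (e.g. 0.01) rebalances
# this ~20x difference in update frequency.
```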
This pull request introduces 1 alert - view on lgtm.com.
Comment posted by lgtm.com
Thanks for the clarification @TomDLT. Let's use the same decay for …
I removed the …
This pull request introduces 2 alerts - view on lgtm.com.
Comment posted by lgtm.com
Force-pushed from 4fc2376 to afea83b (compare)
Some minor remarks on Sphinx link rendering.
Common tests are also failing.
Thanks again for the review @TomDLT! I will double-check the doc when the rendering is available.
I will have to investigate the last failing test... apparently `clf.coef_` is reallocated when doing a `partial_fit`. But this fails only on Python 2.7 on AppVeyor.
Force-pushed from b838450 to fd7f202 (compare)
The test that was failing is not failing anymore, so this appears to be random.
Force-pushed from fd7f202 to e523a22 (compare)
Is this still of interest @albertcthomas? Lots of red crosses.
Force-pushed from e523a22 to c324021 (compare)
There should be fewer red crosses now that I rebased. This PR was almost done already, except for the tests. Should be good now.
Rendered doc: outlier detection, sgd and user guide.
Looks great overall; only minor syntactic comments.
Thanks for the review @bthirion!
Hello. In which version of scikit-learn can I use `SGDOneClassSVM`? I can't find it in scikit-learn v0.21.3.
Thanks a lot @TomDLT for all your reviews and great comments.
Hum, sorry, I broke the tests when resolving the conflicts via GitHub. Let me fix it.
In the meantime, here are other comments.
The PR looks very good, and the benchmarks are really impressive. Don't forget to add a what's new entry in `doc/whats_new/v1.0.rst`.
```python
# Loading datasets
if dataset_name in ['http', 'smtp', 'SA', 'SF']:
    dataset = fetch_kddcup99(subset=dataset_name, shuffle=False,
                             percent10=False, random_state=88)
```
I have trouble loading this on a machine with 16GB of RAM: `fetch_kddcup99(percent10=False)` never completes because my machine swaps...
The code for parsing the source dataset file is complex, written in pure Python, and does weird numpy object array conversions which are not efficient.
The following is much faster and does not swap (1GB in RAM max):

```python
X, y = fetch_openml(name="KDDCup99", as_frame=True, return_X_y=True, version=1)
```

It's quite easy to then use pandas to filter the rows for a specific subset (I think); see the sketch below. Not sure if it's worth updating this benchmark script, though.
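For reference, a hedged sketch of that pandas filtering; the `service` column name and the `"smtp"` value are assumptions about the OpenML encoding of KDDCup99, not verified here:

```python
# Sketch: fetch the OpenML copy and filter rows with pandas to emulate
# fetch_kddcup99(subset="smtp"). Column/value names are assumptions.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(name="KDDCup99", as_frame=True, return_X_y=True,
                    version=1)
mask = X["service"] == "smtp"       # assumed column name and value
X_smtp, y_smtp = X[mask], y[mask]
print(X_smtp.shape)
```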
I'll check this. If it's much faster this would definitely be better.
I tried the two fetchers and for me it seems `fetch_kddcup99` is faster than `fetch_openml`. Note that to get the full dataset (`percent10=False`) from OpenML you need to set the version to 5 (https://www.openml.org/d/42746). I might be missing something.

```python
In [11]: %timeit fetch_openml(name="KDDCup99", return_X_y=True, version=5)
3min 23s ± 975 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit fetch_kddcup99(percent10=False, return_X_y=True)
28 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
LGTM once the comments above are taken care of.
One more suggestion:
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Hi @albertcthomas, I believe this is worth a changelog entry for 1.0... :)
Yes, will do. I've just started working on @ogrisel's review :)
Force-pushed from 700492c to 47e5978 (compare)
Merged! Thank you very much @albertcthomas. I did not wait for the last open comment of https://github.com/scikit-learn/scikit-learn/pull/10027/files#r598734037, which is minor and can be tackled in a separate PR if you wish.
Thank you again for the very nice contribution!
As I understand it, this is the same idea as the Uniclass Passive-Aggressive algorithm (p. 13/563) but with kernel support?
What does this implement/fix? Explain your changes.
This implements a linear version of the One-Class SVM based on the SGD implementation. This implementation thus scales linearly with the number of samples and has a `partial_fit` method. Combining this implementation with kernel approximation techniques, we can approximate the solution of a kernelized `OneClassSVM` and still benefit from the training time complexity improvement (see example and benchmark below).
The optimization problem of the One-Class SVM can be rewritten in a form that is very close to the ones solved by the SGD implementation (see the doc for details). This implementation thus requires very few changes in the SGD Cython code.
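As a concrete illustration of that combination, here is a minimal sketch assuming the `SGDOneClassSVM` API introduced by this PR, with `Nystroem` as the kernel approximation; the data and hyperparameter values are arbitrary:

```python
# Minimal sketch: approximate a kernelized One-Class SVM by chaining a
# kernel approximation with the linear SGD-based one-class SVM.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(42)
X_train = rng.randn(500, 2)            # toy inlier data

pipe = make_pipeline(
    Nystroem(gamma=0.1, n_components=100, random_state=42),
    SGDOneClassSVM(nu=0.05, random_state=42),
)
pipe.fit(X_train)

X_test = rng.randn(10, 2)
print(pipe.predict(X_test))            # +1 inlier, -1 outlier
print(pipe.decision_function(X_test))  # signed distance to the boundary
```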
Benchmark comparing `OneClassSVM` and `SGDOneClassSVM` in terms of training time and AUC (n: number of training samples, d: number of features). The training size has been reduced for some of the datasets for LibSVM to finish in a decent time.
Toy example

Any other comments?
This is still WIP because the tests can be refactored. Any comment is more than welcome.
cc @agramfort