[MRG] Linear One-Class SVM using SGD implementation #10027


Merged
merged 18 commits into from
Mar 23, 2021

Conversation

albertcthomas
Contributor

@albertcthomas albertcthomas commented Oct 27, 2017

What does this implement/fix? Explain your changes.

This implements a linear version of the One-Class SVM based on the SGD implementation. This implementation thus scales linearly with the number of samples and has a partial_fit method. Combining this implementation with kernel approximation techniques, we can approximate the solution of a kernelized OneClassSVM and still benefit from the training time complexity improvement (see example and benchmark below).

The optimization problem of the One-Class SVM can be written as an optimization problem that is very close to the ones solved by the SGD implementation (see the doc for details). This implementation thus requires very few changes to the SGD Cython code.

Benchmark comparing OneClassSVM and SGDOneClassSVM in terms of training time and AUC (n: number of training samples, d: number of features). The training size has been reduced for some of the datasets for LibSVM to finish in a decent time.

[figures: training time and AUC benchmark results]

Toy example
[figure: toy example]

Any other comments?

This is still WIP because the tests can be refactored. Any comment is more than welcome.
cc @agramfort

@amueller
Member

This looks cool, thank you.
Needs a user guide and maybe adding to some of the outlier detection examples? I think the docs should also be pretty explicit that this mostly makes sense with some kernel approximation in front of it (right?).

Since this is a pretty significant contribution, expect that this will take some time to review...

@albertcthomas
Contributor Author

I think the docs should also be pretty explicit that this mostly makes sense with some kernel approximation in front of it (right?).

Right

@albertcthomas
Contributor Author

For the part of the doc explaining how the SGD implementation is used see here.
I will add this algorithm to the outlier detection example suggested in #10004 once it is merged.

@albertcthomas
Contributor Author

Thanks for the review @TomDLT! The estimator now uses the future default values of max_iter and tol. For the moment, offset_decay is a parameter of the fit method, but we should maybe remove it depending on the benchmark results. I do not know yet whether we should use the same decay as the other SGD estimators for sparse data (see my comment above).

Member

@TomDLT TomDLT left a comment


The idea behind having a smaller intercept update for sparse data is that the intercept is updated at each sample, whereas a particular feature coefficient is updated only when this feature is nonzero in the current sample. The intercept is thus updated much more often than the feature coefficients, hence the decay to balance this effect.

I don't mind having a parameter offset_decay, but I would be in favor of homogeneity among SGD classes, as it seems that SGDOneClassSVM is not fundamentally different. Also, I would put the parameter in __init__, and not in fit.
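The imbalance described above can be illustrated with a deliberately simplified toy update. This is a hypothetical sketch, not the actual Cython implementation; the 0.01 decay value is hardcoded here purely for illustration.

```python
import numpy as np

def sparse_sgd_step(w, intercept, x, grad, lr, intercept_decay=0.01):
    """One toy SGD update; x maps feature index -> nonzero value.

    A coefficient w[j] is touched only when feature j is nonzero in the
    current sample, while the intercept is updated on every sample, so
    the decay rebalances the effective learning rates.
    """
    for j, xj in x.items():
        w[j] -= lr * grad * xj
    intercept -= lr * grad * intercept_decay
    return w, intercept

# Only feature 0 is nonzero: w[1] and w[2] are untouched, the intercept
# moves by lr * grad * 0.01 instead of lr * grad.
w, b = sparse_sgd_step(np.zeros(3), 0.0, {0: 1.0}, grad=2.0, lr=0.1)
```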

@jnothman
Member

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Unreachable code

Comment posted by lgtm.com

@albertcthomas
Contributor Author

Thanks for the clarification @TomDLT. Let's use the same decay for SGDOneClassSVM then.

@albertcthomas
Contributor Author

I removed the offset_decay parameter as I'm using the same default values as the other SGD estimators.

@jnothman
Member

jnothman commented Dec 3, 2017

This pull request introduces 2 alerts - view on lgtm.com

new alerts:

  • 1 for Unreachable code
  • 1 for init method calls overridden method

Comment posted by lgtm.com

@albertcthomas
Contributor Author

I rebased on master and added the algorithm to the example plot_anomaly_comparison.py.
[figure: anomaly comparison example]

@albertcthomas
Contributor Author

I also fixed the benchmark so that the shuttle dataset is now downloaded with fetch_openml. Results are:
[figures: training time and AUC]

Member

@TomDLT TomDLT left a comment


Some minor remarks on sphinx link rendering.

Common tests are also failing.

@albertcthomas
Contributor Author

Thanks again for the review @TomDLT! I will double check the doc when the rendering is available.

@albertcthomas
Contributor Author

I will have to investigate the last failing test... apparently clf.coef_ is reallocated when doing a partial_fit. But this fails only on Python 2.7 on AppVeyor.

@albertcthomas
Contributor Author

albertcthomas commented Nov 25, 2018

The test that was failing is no longer failing, so the failure appears to be random.

@jnothman
Member

Is this still of interest @albertcthomas? Lots of red crosses.

@albertcthomas
Contributor Author

There should be fewer red crosses now that I rebased. This PR was almost done except for the tests. Should be good now.

@albertcthomas albertcthomas changed the title [WIP] Online linear One-Class SVM using SGD implementation [MRG] Online linear One-Class SVM using SGD implementation Feb 28, 2019
@albertcthomas
Contributor Author

Rendered doc: outlier detection, sgd and user guide.

Member

@bthirion bthirion left a comment


Looks great overall; only minor syntactic comments.

@albertcthomas
Contributor Author

Thanks for the review @bthirion

@DanyYan

DanyYan commented Aug 8, 2019

Hello. I want to know in which version of scikit-learn I can use SGDOneClassSVM. I can't find it in scikit-learn v0.21.3.

@albertcthomas
Contributor Author

Thanks a lot @TomDLT for all your reviews and great comments.

Base automatically changed from master to main January 22, 2021 10:49
Member

@ogrisel ogrisel left a comment


Hum sorry I broke the tests when resolving the conflicts via github. Let me fix it.

In the mean time here are other comments.

The PR looks very good, and the benchmarks are really impressive. Don't forget to add a what's new entry in doc/whats_new/v1.0.rst.

# Loading datasets
if dataset_name in ['http', 'smtp', 'SA', 'SF']:
    dataset = fetch_kddcup99(subset=dataset_name, shuffle=False,
                             percent10=False, random_state=88)
Member


I have trouble loading this on a machine with 16GB of RAM: fetch_kddcup99(percent10=False) never completes because my machine swaps...

The code for parsing the source dataset file is complex, written in pure Python, and performs inefficient numpy object-array conversions.

The following is much faster and does not swap (1GB in RAM max):

 X, y = fetch_openml(name="KDDCup99", as_frame=True, return_X_y=True, version=1)

It's quite easy to then use pandas to filter the rows for a specific subset (I think). Not sure if it's worth updating this benchmark script, though.
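The pandas filtering step could look like the sketch below. A toy frame stands in for the one fetch_openml would return (so the example runs without the download); the "service" column name follows the KDD'99 schema.

```python
import pandas as pd

# Toy stand-in for:
#   X, y = fetch_openml(name="KDDCup99", as_frame=True,
#                       return_X_y=True, version=1)
X = pd.DataFrame({"service": ["http", "smtp", "http"],
                  "duration": [0, 1, 2]})
y = pd.Series(["normal.", "normal.", "back."])

# Keep only the rows of one subset, e.g. "http".
mask = X["service"] == "http"
X_http, y_http = X[mask], y[mask]
```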

Contributor Author


I'll check this. If it's much faster this would definitely be better.

Contributor Author


I tried the two fetchers and for me fetch_kddcup99 seems faster than fetch_openml. Note that to get the full dataset (percent10=False) from OpenML you need to set version to 5 (https://www.openml.org/d/42746). I might be missing something.

In [11]: %timeit fetch_openml(name="KDDCup99", return_X_y=True, version=5)
3min 23s ± 975 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit fetch_kddcup99(percent10=False, return_X_y=True)
28 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Member

@ogrisel ogrisel left a comment


LGTM once the comments above are taken care of.

One more suggestion:

albertcthomas and others added 2 commits March 18, 2021 22:41
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@cmarmo
Contributor

cmarmo commented Mar 19, 2021

Hi @albertcthomas I believe this is worth a change log entry for 1.0... :)

@albertcthomas
Contributor Author

Hi @albertcthomas I believe this is worth a change log entry for 1.0... :)

Yes will do. I've just started working on @ogrisel's review :)

@albertcthomas albertcthomas changed the title [MRG] Online linear One-Class SVM using SGD implementation [MRG] Linear One-Class SVM using SGD implementation Mar 22, 2021
@ogrisel ogrisel merged commit c854b83 into scikit-learn:main Mar 23, 2021
@ogrisel
Member

ogrisel commented Mar 23, 2021

Merged! Thank you very much @albertcthomas. I did not wait for the last open comment of https://github.com/scikit-learn/scikit-learn/pull/10027/files#r598734037 which is minor and can be tackled in a separate PR if you wish.

@ogrisel
Member

ogrisel commented Mar 23, 2021

Thank you again for the very nice contribution!

@albertcthomas
Contributor Author

Wow thanks a lot!! Thanks again for the first review and all the help @TomDLT :) Thanks for the additional reviews and comments @ogrisel @bthirion @banilo @amueller. Thanks @agramfort for making me work on this :) and thanks @cmarmo for reviving this PR

@glemaitre glemaitre mentioned this pull request Apr 22, 2021
@glevv
Contributor

glevv commented Oct 28, 2021

As I understand it, this is the same idea as the Uniclass Passive-Aggressive Algorithm (p. 13/563) but with kernel support?
