[MRG] Linear One-Class SVM using SGD implementation #10027


Merged
merged 18 commits into from
Mar 23, 2021

Conversation

albertcthomas
Contributor

@albertcthomas albertcthomas commented Oct 27, 2017

What does this implement/fix? Explain your changes.

This implements a linear version of the One-Class SVM based on the SGD implementation. This implementation thus scales linearly with the number of samples and has a partial_fit method. Combining this implementation with kernel approximation techniques, we can approximate the solution of a kernelized OneClassSVM and still benefit from the training time complexity improvement (see example and benchmark below).

The optimization problem of the One-Class SVM can be written as an optimization problem that is very close to the ones solved by the SGD implementation (see the doc for details). This implementation thus requires very few changes to the SGD Cython code.

Benchmark comparing OneClassSVM and SGDOneClassSVM in terms of training time and AUC (n: number of training samples, d: number of features). The training size has been reduced for some of the datasets for LibSVM to finish in a decent time.

[figures: training time and AUC benchmark results]

Toy example
[figure: toy example]

Any other comments?

This is still WIP because the tests can be refactored. Any comment is more than welcome.
cc @agramfort

@amueller
Member

This looks cool, thank you.
Needs a user guide and maybe adding to some of the outlier detection examples? I think the docs should also be pretty explicit that this mostly makes sense with some kernel approximation in front of it (right?).

Since this is a pretty significant contribution, expect that this will take some time to review...

@albertcthomas
Contributor Author

I think the docs should also be pretty explicit that this mostly makes sense with some kernel approximation in front of it (right?).

Right

@albertcthomas
Contributor Author

For the part of the doc explaining how the SGD implementation is used see here.
I will add this algorithm to the outlier detection example suggested in #10004 once it is merged.

@albertcthomas
Contributor Author

Thanks for the review @TomDLT! The estimator now uses the future default values of max_iter and tol. For the moment, offset_decay is a parameter of the fit method, but we should maybe remove it depending on the benchmark results. I do not know yet whether we should use the same decay as the other SGD estimators for sparse data (see my comment above).

Member

@TomDLT TomDLT left a comment


The idea behind having a smaller intercept update for sparse data is that the intercept is updated at each sample, whereas a particular feature coefficient is updated only when this feature is nonzero in the current sample. The intercept is thus updated much more often than the feature coefficients, hence the decay to balance this effect.

I don't mind having a parameter offset_decay, but I would be in favor of homogeneity among SGD classes, as it seems that SGDOneClassSVM is not fundamentally different. Also, I would put the parameter in __init__, and not in fit.
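The imbalance described above can be illustrated with a deliberately simplified toy update. This is a hypothetical sketch, not the actual Cython implementation; the 0.01 decay value is hardcoded here purely for illustration.

```python
import numpy as np

def sparse_sgd_step(w, intercept, x, grad, lr, intercept_decay=0.01):
    """One toy SGD update; x maps feature index -> nonzero value.

    A coefficient w[j] is touched only when feature j is nonzero in the
    current sample, while the intercept is updated on every sample, so
    the decay rebalances the effective learning rates.
    """
    for j, xj in x.items():
        w[j] -= lr * grad * xj
    intercept -= lr * grad * intercept_decay
    return w, intercept

# Only feature 0 is nonzero: w[1] and w[2] are untouched, the intercept
# moves by lr * grad * 0.01 instead of lr * grad.
w, b = sparse_sgd_step(np.zeros(3), 0.0, {0: 1.0}, grad=2.0, lr=0.1)
```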

@jnothman
Member

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Unreachable code

Comment posted by lgtm.com

@albertcthomas
Contributor Author

Thanks for the clarification @TomDLT. Let's use the same decay for SGDOneClassSVM then.

@albertcthomas
Contributor Author

I removed the offset_decay parameter as I'm using the same default values as the other SGD estimators.

@jnothman
Member

jnothman commented Dec 3, 2017

This pull request introduces 2 alerts - view on lgtm.com

new alerts:

  • 1 for Unreachable code
  • 1 for init method calls overridden method

Comment posted by lgtm.com

@albertcthomas
Contributor Author

I rebased on master and added the algorithm to the example plot_anomaly_comparison.py.
[figure: anomaly comparison example]

@albertcthomas
Contributor Author

I also fixed the benchmark so that the shuttle dataset is now downloaded with fetch_openml. Results are:
[figures: training time and AUC]

Member

@TomDLT TomDLT left a comment


Some minor remarks on sphinx link rendering.

Common tests are also failing.

@albertcthomas
Contributor Author

Thanks again for the review @TomDLT! I will double check the doc when the rendering is available.

@albertcthomas
Contributor Author

I will have to investigate the last failing test... apparently clf.coef_ is reallocated when doing a partial_fit. But this fails only on Python 2.7 on AppVeyor.

@albertcthomas
Contributor Author

albertcthomas commented Nov 25, 2018

The test that was failing is no longer failing, so the failure appears to be random.

@jnothman
Member

Is this still of interest @albertcthomas? Lots of red crosses.

@albertcthomas
Contributor Author

There should be fewer red crosses now that I rebased. This PR was almost done except for the tests. Should be good now.

@albertcthomas albertcthomas changed the title [WIP] Online linear One-Class SVM using SGD implementation [MRG] Online linear One-Class SVM using SGD implementation Feb 28, 2019
@albertcthomas
Contributor Author

Rendered doc: outlier detection, sgd and user guide.

Member

@bthirion bthirion left a comment


Looks great overall; only minor syntactic comments.

@albertcthomas
Contributor Author

Thanks for the review @bthirion

@DanyYan

DanyYan commented Aug 8, 2019

Hello. I want to know in which version of scikit-learn I can use SGDOneClassSVM. I can't find it in scikit-learn v0.21.3.

@albertcthomas
Contributor Author

Thanks a lot @TomDLT for all your reviews and great comments.

Base automatically changed from master to main January 22, 2021 10:49
Member

@ogrisel ogrisel left a comment


Hum sorry I broke the tests when resolving the conflicts via github. Let me fix it.

In the mean time here are other comments.

The PR looks very good, and the benchmarks are really impressive. Don't forget to add a what's new entry in doc/whats_new/v1.0.rst.

# Loading datasets
if dataset_name in ['http', 'smtp', 'SA', 'SF']:
    dataset = fetch_kddcup99(subset=dataset_name, shuffle=False,
                             percent10=False, random_state=88)
Member


I have trouble loading this on a machine with 16GB of RAM: fetch_kddcup99(percent10=False) never completes because my machine swaps...

The code for parsing the source dataset file is complex, written in pure Python, and performs inefficient numpy object-array conversions.

The following is much faster and does not swap (1GB in RAM max):

 X, y = fetch_openml(name="KDDCup99", as_frame=True, return_X_y=True, version=1)

It's quite easy to then use pandas to filter the rows for a specific subset (I think). Not sure if it's worth updating this benchmark script, though.
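The pandas filtering step could look like the sketch below. A toy frame stands in for the one fetch_openml would return (so the example runs without the download); the "service" column name follows the KDD'99 schema.

```python
import pandas as pd

# Toy stand-in for:
#   X, y = fetch_openml(name="KDDCup99", as_frame=True,
#                       return_X_y=True, version=1)
X = pd.DataFrame({"service": ["http", "smtp", "http"],
                  "duration": [0, 1, 2]})
y = pd.Series(["normal.", "normal.", "back."])

# Keep only the rows of one subset, e.g. "http".
mask = X["service"] == "http"
X_http, y_http = X[mask], y[mask]
```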

Contributor Author


I'll check this. If it's much faster this would definitely be better.

Contributor Author


I tried the two fetchers and for me fetch_kddcup99 seems faster than fetch_openml. Note that to get the full dataset (percent10=False) from OpenML you need to set version to 5 (https://www.openml.org/d/42746). I might be missing something.

In [11]: %timeit fetch_openml(name="KDDCup99", return_X_y=True, version=5)
3min 23s ± 975 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit fetch_kddcup99(percent10=False, return_X_y=True)
28 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Member

@ogrisel ogrisel left a comment


LGTM once the comments above are taken care of.

One more suggestion:

albertcthomas and others added 2 commits March 18, 2021 22:41
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@cmarmo
Contributor

cmarmo commented Mar 19, 2021

Hi @albertcthomas I believe this is worth a change log entry for 1.0... :)

@albertcthomas
Contributor Author

Hi @albertcthomas I believe this is worth a change log entry for 1.0... :)

Yes will do. I've just started working on @ogrisel's review :)

@albertcthomas albertcthomas changed the title [MRG] Online linear One-Class SVM using SGD implementation [MRG] Linear One-Class SVM using SGD implementation Mar 22, 2021
@ogrisel ogrisel merged commit c854b83 into scikit-learn:main Mar 23, 2021
@ogrisel
Member

ogrisel commented Mar 23, 2021

Merged! Thank you very much @albertcthomas. I did not wait for the last open comment of https://github.com/scikit-learn/scikit-learn/pull/10027/files#r598734037 which is minor and can be tackled in a separate PR if you wish.

@ogrisel
Member

ogrisel commented Mar 23, 2021

Thank you again for the very nice contribution!

@albertcthomas
Contributor Author

Wow thanks a lot!! Thanks again for the first review and all the help @TomDLT :) Thanks for the additional reviews and comments @ogrisel @bthirion @banilo @amueller. Thanks @agramfort for making me work on this :) and thanks @cmarmo for reviving this PR

@glemaitre glemaitre mentioned this pull request Apr 22, 2021
@glevv
Contributor

glevv commented Oct 28, 2021

As I understand it, this is the same idea as the Uniclass Passive-Aggressive Algorithm (p. 13/563) but with kernel support?
