[MRG] Comparison plot for anomaly detection methods. #10004


Merged: 6 commits into scikit-learn:master, Nov 2, 2017

Conversation

@agramfort (Member)

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

Add a comparison plot (like the ones we have for classification and clustering) of our anomaly/outlier/novelty detection methods.

@agramfort (Member Author)

Looks like this:

[figure_4: comparison plot of the anomaly detection methods]

@jnothman (Member) left a comment:

A few small things. The plot looks great!

of algorithms to cope with multimodal data.

While these examples give some intuition about the
algorithms, this intuition might not apply to very high
Member:

Is it possible to expand on this or provide a reference? What features of the shown decision functions would not transfer to higher dimensions?

Member Author:

Well, it's the same sentence as in our clustering summary example.

anomaly_scores = algorithm.decision_function(X)

threshold = stats.scoreatpercentile(anomaly_scores,
                                    100 * outliers_fraction)
Member:

Do you need to comment on why this is necessary when outliers_fraction is also provided to constructors?

Member:

This is only used in drawing contours. Put it in the if.

Member:

(and remove from the if above)

Member:

(and remove the special case above)

Member Author:

With the consistent anomaly API that is coming, we'll have a consistent threshold_ attribute. I'll switch to it once that is merged.

# Example settings
n_samples = 300
outliers_fraction = 0.15
clusters_separation = (0, 1, 2)
Member:

Not used?

anomaly/novelty detection algorithms on 2D datasets. Datasets contain
one or two modes (regions of high density) to illustrate the ability
of algorithms to cope with multimodal data.

Member:

I think you should state:

15% of samples in each dataset are generated as random uniform noise. We display decision boundaries with a threshold that would mark 15% of training samples as outliers.

Although admittedly I'm a bit confused by this threshold...
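
For concreteness, a minimal sketch of how such a threshold behaves (the dataset construction below is an illustrative assumption, not the example's exact code):

import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
n_samples, outliers_fraction = 300, 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# inliers from a tight Gaussian blob, 15% outliers from a uniform box
X = np.concatenate([0.3 * rng.randn(n_inliers, 2),
                    rng.uniform(low=-6, high=6, size=(n_outliers, 2))])

clf = IsolationForest(contamination=outliers_fraction, random_state=42).fit(X)
scores = clf.decision_function(X)

# the contour level that marks 15% of training samples as outliers
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
print((scores < threshold).mean())  # ~0.15 by construction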

@amueller (Member) left a comment:

LGTM apart from nitpicks. nice one :)

:align: center
:scale: 50

A comparison of the clustering algorithms in scikit-learn
Member:

outlier?

of algorithms to cope with multimodal data.

While these examples give some intuition about the
algorithms, this intuition might not apply to very high
Member:

I think you should mention that LOF cannot predict on new data, so there is no decision boundary.

Member Author:

done
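
For reference, a minimal sketch of the LOF point (the data is illustrative; later scikit-learn releases added a novelty=True mode that enables predict on new data):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(100, 2)
lof = LocalOutlierFactor(n_neighbors=35, contamination=0.15)

# LOF scores the set it was fit on: fit_predict, not fit followed by predict
y_pred = lof.fit_predict(X)      # +1 for inliers, -1 for outliers
print(hasattr(lof, "predict"))   # False, hence no decision boundary to draw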

n_inliers = n_samples - n_outliers


def make_dataset(n_inliers, offset, stds=(0.5, 0.5), random_state=0):
Member:

This is a special case of make_blobs, right? Well, here we have explicit offsets, but you could just pick a nice-looking random state for make_blobs?
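
A sketch of that suggestion (the centers and stds here are assumptions chosen to resemble the example's datasets):

from sklearn.datasets import make_blobs

# unimodal and bimodal inlier sets, as in the example's one- and two-mode cases
X_uni, _ = make_blobs(n_samples=255, centers=[[0, 0]],
                      cluster_std=0.5, random_state=0)
X_bi, _ = make_blobs(n_samples=255, centers=[[2, 2], [-2, -2]],
                     cluster_std=[0.5, 0.5], random_state=0)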

# define outlier/anomaly detection methods to be compared
anomaly_algorithms = [
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
    ("One-Class SVM", svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
Member:

can you explain how nu was set?

Member:

and gamma, I guess....

("Isolation Forest", IsolationForest(contamination=outliers_fraction,
random_state=42)),
("Local Outlier Factor", LocalOutlierFactor(
n_neighbors=35, contamination=outliers_fraction))]
Member:

can you explain how n_neighbors was set?

plt.title(name, size=18)

# fit the data and tag outliers
if name == "Local Outlier Factor":
Member:

Maybe use if hasattr(algorithm, "predict"), as that's what you're using further down? Or just use this one there? Either way, I'd use the same check in both places.



Member Author:

used if name == "Local Outlier Factor":
in both places.
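
A minimal sketch of the two equivalent dispatch styles discussed in this thread (the estimator list is abridged):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(100, 2)
anomaly_algorithms = [
    ("Isolation Forest", IsolationForest(random_state=42)),
    ("Local Outlier Factor", LocalOutlierFactor(n_neighbors=35))]

for name, algorithm in anomaly_algorithms:
    # equivalently: if hasattr(algorithm, "predict") / else
    if name == "Local Outlier Factor":
        y_pred = algorithm.fit_predict(X)  # LOF scores the training set only
    else:
        y_pred = algorithm.fit(X).predict(X)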


plt.contour(xx, yy, Z, levels=[threshold],
            linewidths=2, colors='black')

colors = np.array(['#377eb8', '#ff7f00'])
Member:

plt.cm.tab10?

Member Author:

I used the same colors as the clustering gallery. tab10 doesn't match the scikit-learn color style :)

@codecov (bot) commented Oct 26, 2017:

Codecov Report

Merging #10004 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #10004      +/-   ##
==========================================
+ Coverage   96.17%   96.18%   +<.01%     
==========================================
  Files         336      336              
  Lines       62669    62733      +64     
==========================================
+ Hits        60274    60338      +64     
  Misses       2395     2395

@albertcthomas (Contributor) left a comment:

I would remove 'novelty' everywhere and only use 'outlier' (or 'anomaly'), as this example is an outlier detection example and not a novelty detection example. I would also use 0 as the threshold and remove anomaly_scores (see my comment below). Otherwise LGTM. I really like the second dataset showing the particularity of LOF :) Thanks @agramfort!

if name != "Local Outlier Factor":  # LOF does not implement predict
    Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z, levels=[threshold],
Contributor:

If we use predict on the grid, I would rather use levels=[0] (predict takes values in {-1, 1} for all estimators), and then we no longer need threshold and anomaly_scores.

The other solution would be to keep levels=[threshold] and use decision_function on the grid, but then the decision boundary might not agree with the colors of the points given by y_pred (which is the case in this example).
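
A sketch of the first option (the dataset and estimator here are illustrative): contour the hard predictions at level 0, since predict takes values in {-1, 1}:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).randn(200, 2)
clf = IsolationForest(random_state=42).fit(X)

xx, yy = np.meshgrid(np.linspace(-5, 5, 150), np.linspace(-5, 5, 150))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# the -1/+1 transition is the decision boundary; no separate threshold needed
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()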

@albertcthomas (Contributor)

I finally decided to keep decision_function and threshold, as nu is a bit different from the contamination parameter of the other algorithms.
The plot now looks like this:

[figure_1: updated comparison plot]

@amueller (Member) left a comment:

lgtm


threshold = stats.scoreatpercentile(anomaly_scores,
                                    100 * outliers_fraction)
y_pred = (anomaly_scores >= threshold).astype(int)
Member:

I think it's a bit weird to reimplement predict, but whatever... lgtm

Contributor:

Hm, I suppose @agramfort and I are a bit biased about how the OCSVM should be used :)
This is the plot we have if we use predict only and nu=outliers_fraction (instead of nu=0.2). It's not good for the OCSVM, but I guess users are going to use the algorithms this way, so if there is a consensus I'm OK with this.

[figure_1: plot using predict with nu=outliers_fraction]

Member:

Can you elaborate? If you're saying OCSVM should not be used the way it is implemented, that seems pretty bad?

Member:

In particular, if the example shows a usage pattern that is not explained somewhere else, that seems pretty confusing.

@albertcthomas (Contributor) commented Oct 26, 2017:

The nu parameter of the OCSVM does not allow controlling the proportion of outliers the way the contamination parameter of the other algorithms does. The contamination parameter lets us define a threshold such that we predict exactly a proportion of outliers equal to contamination. The OCSVM automatically finds an offset/threshold such that the proportion of outliers is at most nu (but not necessarily equal to nu, depending on the approximation and the number of samples).
I prefer to choose a value for nu a bit higher than the proportion of outliers that I want to find, and then threshold the decision function to obtain the desired proportion of outliers (as is done for the other algorithms). Choosing a value for nu a bit higher than the proportion of outliers gives a bit more support vectors to approximate the decision function in the neighborhood of the decision boundary. But this extends the original formulation of the OCSVM and does not use the learned offset (and thus not the default predict).
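
A sketch contrasting the two usages described above (the dataset and gamma value are illustrative; the nu formula is the one from the example):

import numpy as np
from scipy import stats
from sklearn import svm

rng = np.random.RandomState(42)
outliers_fraction = 0.15
X = np.concatenate([0.3 * rng.randn(255, 2),
                    rng.uniform(-6, 6, size=(45, 2))])

# 1) plain usage: nu = desired fraction, rely on the learned offset/predict;
#    the realized outlier fraction is only bounded by nu, not equal to it
clf = svm.OneClassSVM(nu=outliers_fraction, gamma=0.1).fit(X)
print((clf.predict(X) == -1).mean())

# 2) usage described above: slightly larger nu, then re-threshold
#    decision_function to flag the desired fraction exactly
clf = svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, gamma=0.1).fit(X)
scores = clf.decision_function(X).ravel()
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
print((scores < threshold).mean())  # ~0.15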

@albertcthomas (Contributor)

Having thought about it a bit more, I think it is better to expose the OneClassSVM with nu equal to the fraction of outliers and use predict, as users will very likely use the OneClassSVM this way. Besides, the OneClassSVM here roughly predicts this same fraction of outliers for the considered datasets. Thanks for your comments @amueller.

[figure_1: final plot with nu=outliers_fraction and predict]

@agramfort changed the title from "Comparison plot for anomaly detection methods." to "[MRG] Comparison plot for anomaly detection methods." on Oct 30, 2017
@glemaitre (Member) left a comment:

LGTM.

data, the problem is completely unsupervised so model selection can be
a challenge.
"""
print(__doc__) # noqa
Member:

You can always put it after the imports, without the # noqa.
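
That is, a layout like this sketch (docstring abridged; the point is that print(__doc__) after the imports needs no lint suppression):

"""
Comparing anomaly detection algorithms for outlier detection on toy datasets
============================================================================
(docstring abridged)
"""

# all imports first...
import numpy as np

# ...then echo the module docstring; no # noqa needed on this line
print(__doc__)

rng = np.random.RandomState(42)  # the example code continues from here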


for i_dataset, X in enumerate(datasets):
    # Add outliers
    rng = np.random.RandomState(42)
Member:

Do you want to put the rng outside of the loop?

Comparing anomaly detection algorithms for outlier detection on toy datasets
============================================================================

This example shows characteristics of different
Member:

I personally like that this goes up to the 80 characters :)

a challenge.
"""
print(__doc__) # noqa

Member:

Do we usually put authors and license notes?

@agramfort (Member Author)

comments addressed

@jnothman jnothman merged commit 8e599c6 into scikit-learn:master Nov 2, 2017
@jnothman (Member) commented Nov 2, 2017:

Great stuff!

@albertcthomas (Contributor)

Thanks @agramfort!
