[MRG] Comparison plot for anomaly detection methods. #10004


Merged: 6 commits into scikit-learn:master, Nov 2, 2017

Conversation

@agramfort (Member)

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

Add a comparison plot (like the ones we have for classification and clustering) of our anomaly/outlier/novelty detection methods.

@agramfort (Member Author)

Looks like this:

[figure_4: comparison plot of the anomaly detection methods]

@jnothman (Member) left a comment:

A few small things. The plot looks great!

of algorithms to cope with multimodal data.

While these examples give some intuition about the
algorithms, this intuition might not apply to very high
Member:

Is it possible to expand on this or provide a reference? What features of the shown decision functions would not transfer to higher dimensions?

Member Author:

Well, it's the same sentence as in our clustering summary example.

anomaly_scores = algorithm.decision_function(X)

threshold = stats.scoreatpercentile(anomaly_scores,
                                    100 * outliers_fraction)
Member:

Do you need to comment on why this is necessary when outliers_fraction is also provided to constructors?

Member:

This is only used in drawing contours. Put it in the if.

Member:

(and remove from the if above)

Member:

(and remove the special case above)

Member Author:

With the consistent anomaly API that is coming, we'll have a consistent threshold_ attribute. I'll switch to it once that is merged.

# Example settings
n_samples = 300
outliers_fraction = 0.15
clusters_separation = (0, 1, 2)
Member:

Not used?

anomaly/novelty detection algorithms on 2D datasets. Datasets contain
one or two modes (regions of high density) to illustrate the ability
of algorithms to cope with multimodal data.

Member:

I think you should state:

15% of samples in each dataset are generated as random uniform noise. We display decision boundaries with a threshold that would mark 15% of training samples as outliers.

Although admittedly I'm a bit confused by this threshold...
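
For concreteness, a minimal sketch of how such a threshold behaves (the dataset construction below is an illustrative assumption, not the example's exact code):

import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
n_samples, outliers_fraction = 300, 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# inliers from a tight Gaussian blob, 15% outliers from a uniform box
X = np.concatenate([0.3 * rng.randn(n_inliers, 2),
                    rng.uniform(low=-6, high=6, size=(n_outliers, 2))])

clf = IsolationForest(contamination=outliers_fraction, random_state=42).fit(X)
scores = clf.decision_function(X)

# the contour level that marks 15% of training samples as outliers
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
print((scores < threshold).mean())  # ~0.15 by construction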

@amueller (Member) left a comment:

LGTM apart from nitpicks. nice one :)

:align: center
:scale: 50

A comparison of the clustering algorithms in scikit-learn
Member:

outlier?

of algorithms to cope with multimodal data.

While these examples give some intuition about the
algorithms, this intuition might not apply to very high
Member:

I think you should mention that LOF cannot predict on new data, so there is no decision boundary.

Member Author:

done
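
For reference, a minimal sketch of the LOF point (the data is illustrative; later scikit-learn releases added a novelty=True mode that enables predict on new data):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(100, 2)
lof = LocalOutlierFactor(n_neighbors=35, contamination=0.15)

# LOF scores the set it was fit on: fit_predict, not fit followed by predict
y_pred = lof.fit_predict(X)      # +1 for inliers, -1 for outliers
print(hasattr(lof, "predict"))   # False, hence no decision boundary to draw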

n_inliers = n_samples - n_outliers


def make_dataset(n_inliers, offset, stds=(0.5, 0.5), random_state=0):
Member:

This is a special case of make_blobs, right? Well, here we have explicit offsets, but you could just pick a nice-looking random state for make_blobs?
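
A sketch of that suggestion (the centers and stds here are assumptions chosen to resemble the example's datasets):

from sklearn.datasets import make_blobs

# unimodal and bimodal inlier sets, as in the example's one- and two-mode cases
X_uni, _ = make_blobs(n_samples=255, centers=[[0, 0]],
                      cluster_std=0.5, random_state=0)
X_bi, _ = make_blobs(n_samples=255, centers=[[2, 2], [-2, -2]],
                     cluster_std=[0.5, 0.5], random_state=0)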

# define outlier/anomaly detection methods to be compared
anomaly_algorithms = [
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
    ("One-Class SVM", svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
Member:

can you explain how nu was set?

Member:

and gamma, I guess....

("Isolation Forest", IsolationForest(contamination=outliers_fraction,
random_state=42)),
("Local Outlier Factor", LocalOutlierFactor(
n_neighbors=35, contamination=outliers_fraction))]
Member:

can you explain how n_neighbors was set?

plt.title(name, size=18)

# fit the data and tag outliers
if name == "Local Outlier Factor":
Member:

Maybe use if hasattr(algorithm, "predict"), as that's what you're using further down? Or just use this one there? Either way, I'd use the same check in both places.



Member Author:

used if name == "Local Outlier Factor":
in both places.
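
A minimal sketch of the two equivalent dispatch styles discussed in this thread (the estimator list is abridged):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(100, 2)
anomaly_algorithms = [
    ("Isolation Forest", IsolationForest(random_state=42)),
    ("Local Outlier Factor", LocalOutlierFactor(n_neighbors=35))]

for name, algorithm in anomaly_algorithms:
    # equivalently: if hasattr(algorithm, "predict") / else
    if name == "Local Outlier Factor":
        y_pred = algorithm.fit_predict(X)  # LOF scores the training set only
    else:
        y_pred = algorithm.fit(X).predict(X)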


plt.contour(xx, yy, Z, levels=[threshold],
            linewidths=2, colors='black')

colors = np.array(['#377eb8', '#ff7f00'])
Member:

plt.cm.tab10?

Member Author:

I used the same colors as the clustering gallery. tab10 doesn't match the scikit-learn color style :)

@codecov (bot) commented Oct 26, 2017:

Codecov Report

Merging #10004 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #10004      +/-   ##
==========================================
+ Coverage   96.17%   96.18%   +<.01%     
==========================================
  Files         336      336              
  Lines       62669    62733      +64     
==========================================
+ Hits        60274    60338      +64     
  Misses       2395     2395

@albertcthomas (Contributor) left a comment:

I would remove 'novelty' everywhere and only use 'outlier' (or 'anomaly'), as this example is an outlier detection example and not a novelty detection example. I would also use 0 as the threshold and remove anomaly_scores (see my comment below). Otherwise LGTM. I really like the second dataset showing the particularity of LOF :) Thanks @agramfort!

if name != "Local Outlier Factor":  # LOF does not implement predict
    Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z, levels=[threshold],
Contributor:

If we use predict on the grid, I would rather use levels=[0] (predict takes values in {-1, 1} for all estimators), and then we no longer need threshold and anomaly_scores.

The other solution would be to keep levels=[threshold] and use decision_function on the grid, but then the decision boundary might not agree with the colors of the points given by y_pred (which is the case in this example).
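
A sketch of the first option (the dataset and estimator here are illustrative): contour the hard predictions at level 0, since predict takes values in {-1, 1}:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).randn(200, 2)
clf = IsolationForest(random_state=42).fit(X)

xx, yy = np.meshgrid(np.linspace(-5, 5, 150), np.linspace(-5, 5, 150))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# the -1/+1 transition is the decision boundary; no separate threshold needed
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()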

@albertcthomas (Contributor)

I finally decided to keep decision_function and threshold, as nu is a bit different from the contamination parameter of the other algorithms.
The plot now looks like this:

[figure_1: updated comparison plot]

@amueller (Member) left a comment:

lgtm


threshold = stats.scoreatpercentile(anomaly_scores,
                                    100 * outliers_fraction)
y_pred = (anomaly_scores >= threshold).astype(int)
Member:

I think it's a bit weird to reimplement predict, but whatever... lgtm

Contributor:

Hm, I suppose @agramfort and I are a bit biased about how the OCSVM should be used :)
This is the plot we have if we use predict only and nu=outliers_fraction (instead of nu=0.2). It's not good for the OCSVM, but I guess users are going to use the algorithms this way, so if there is a consensus I'm OK with this.

[figure_1: plot using predict with nu=outliers_fraction]

Member:

Can you elaborate? If you're saying OCSVM should not be used the way it is implemented, that seems pretty bad?

Member:

In particular, if the example shows a usage pattern that is not explained somewhere else, that seems pretty confusing.

@albertcthomas (Contributor) commented Oct 26, 2017:

The nu parameter of the OCSVM does not allow controlling the proportion of outliers the way the contamination parameter of the other algorithms does. The contamination parameter lets us define a threshold such that we predict exactly a proportion of outliers equal to contamination. The OCSVM automatically finds an offset/threshold such that the proportion of outliers is at most nu (but not necessarily equal to nu, depending on the approximation and the number of samples).
I prefer to choose a value for nu a bit higher than the proportion of outliers that I want to find, and then threshold the decision function to obtain the desired proportion of outliers (as is done for the other algorithms). Choosing a value for nu a bit higher than the proportion of outliers gives a bit more support vectors to approximate the decision function in the neighborhood of the decision boundary. But this extends the original formulation of the OCSVM and does not use the learned offset (and thus not the default predict).
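
A sketch contrasting the two usages described above (the dataset and gamma value are illustrative; the nu formula is the one from the example):

import numpy as np
from scipy import stats
from sklearn import svm

rng = np.random.RandomState(42)
outliers_fraction = 0.15
X = np.concatenate([0.3 * rng.randn(255, 2),
                    rng.uniform(-6, 6, size=(45, 2))])

# 1) plain usage: nu = desired fraction, rely on the learned offset/predict;
#    the realized outlier fraction is only bounded by nu, not equal to it
clf = svm.OneClassSVM(nu=outliers_fraction, gamma=0.1).fit(X)
print((clf.predict(X) == -1).mean())

# 2) usage described above: slightly larger nu, then re-threshold
#    decision_function to flag the desired fraction exactly
clf = svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, gamma=0.1).fit(X)
scores = clf.decision_function(X).ravel()
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
print((scores < threshold).mean())  # ~0.15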

@albertcthomas (Contributor)

Having thought about it a bit more, I think it is better to expose the OneClassSVM with nu equal to the fraction of outliers and use predict, as users will very likely use the OneClassSVM this way. Besides, the OneClassSVM here roughly predicts this same fraction of outliers for the considered datasets. Thanks for your comments @amueller.

[figure_1: final plot with nu=outliers_fraction and predict]

@agramfort changed the title from "Comparison plot for anomaly detection methods." to "[MRG] Comparison plot for anomaly detection methods." on Oct 30, 2017
@glemaitre (Member) left a comment:

LGTM.

data, the problem is completely unsupervised so model selection can be
a challenge.
"""
print(__doc__) # noqa
Member:

You can always put it after the imports, without the # noqa.
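
That is, a layout like this sketch (docstring abridged; the point is that print(__doc__) after the imports needs no lint suppression):

"""
Comparing anomaly detection algorithms for outlier detection on toy datasets
============================================================================
(docstring abridged)
"""

# all imports first...
import numpy as np

# ...then echo the module docstring; no # noqa needed on this line
print(__doc__)

rng = np.random.RandomState(42)  # the example code continues from here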


for i_dataset, X in enumerate(datasets):
    # Add outliers
    rng = np.random.RandomState(42)
Member:

Do you want to put the rng outside of the loop?

Comparing anomaly detection algorithms for outlier detection on toy datasets
============================================================================

This example shows characteristics of different
Member:

I personally like that this goes up to the 80 characters :)

a challenge.
"""
print(__doc__) # noqa

Member:

Do we usually put authors and license notes?

@agramfort (Member Author)

comments addressed

@jnothman jnothman merged commit 8e599c6 into scikit-learn:master Nov 2, 2017
@jnothman (Member) commented Nov 2, 2017:

Great stuff!

@albertcthomas (Contributor)

Thanks @agramfort!
