[MRG] Comparison plot for anomaly detection methods. #10004
Conversation
A few small things. The plot looks great!
examples/plot_anomaly_comparison.py (outdated):

    of algorithms to cope with multimodal data.

    While these examples give some intuition about the
    algorithms, this intuition might not apply to very high
Is it possible to expand on this or provide a reference? What features of the shown decision functions would not transfer to higher dimensions?
Well it's the same sentence as for our clustering summary example.
examples/plot_anomaly_comparison.py (outdated):

    anomaly_scores = algorithm.decision_function(X)

    threshold = stats.scoreatpercentile(anomaly_scores,
                                        100 * outliers_fraction)
Do you need to comment on why this is necessary when outliers_fraction is also provided to constructors?
This is only used in drawing contours. Put it in the if.
(and remove from the if above)
(and remove the special case above)
With the new consistent anomaly API coming, we'll have a consistent `threshold_` attribute. I'll switch to this when merged.
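To illustrate the point under discussion, here is a minimal sketch (toy data and parameter values are my own, not the PR's exact code): `contamination` already fixes the threshold that `predict` uses internally, so the explicit percentile threshold is only needed to draw the decision boundary as a contour line of `decision_function`.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(300, 2)
outliers_fraction = 0.15

# contamination already thresholds inside predict: labels come back in {-1, +1}
clf = IsolationForest(contamination=outliers_fraction, random_state=42).fit(X)
y_pred = clf.predict(X)

# The explicit percentile threshold is only needed for plotting: it is the
# level at which plt.contour would draw the decision boundary.
scores = clf.decision_function(X)
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
```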
examples/plot_anomaly_comparison.py (outdated):

    # Example settings
    n_samples = 300
    outliers_fraction = 0.15
    clusters_separation = (0, 1, 2)
Not used?
examples/plot_anomaly_comparison.py (outdated):

    anomaly/novelty detection algorithms on 2D datasets. Datasets contain
    one or two modes (regions of high density) to illustrate the ability
    of algorithms to cope with multimodal data.
I think you should state:
15% of samples in each dataset are generated as random uniform noise. We display decision boundaries with a threshold that would mark 15% of training samples as outliers.
Although admittedly I'm a bit confused by this threshold...
LGTM apart from nitpicks. nice one :)
doc/modules/outlier_detection.rst (outdated):

    :align: center
    :scale: 50

    A comparison of the clustering algorithms in scikit-learn
outlier?
examples/plot_anomaly_comparison.py (outdated):

    of algorithms to cope with multimodal data.

    While these examples give some intuition about the
    algorithms, this intuition might not apply to very high
I think you should mention that LOF cannot predict on new data, so there is no decision boundary.
done
examples/plot_anomaly_comparison.py (outdated):

    n_inliers = n_samples - n_outliers


    def make_dataset(n_inliers, offset, stds=(0.5, 0.5), random_state=0):
This is a special case of make_blobs, right? Well, here we have explicit offsets, but you could just pick a nice-looking random state for make_blobs?
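As a hedged sketch of the suggestion (the centers, sample count, and random state below are illustrative, not the PR's values), `make_blobs` with explicit centers gives the same "blob at a chosen offset" control as a custom helper:

```python
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the custom make_dataset helper: explicit centers
# play the role of the offsets, and a hand-picked random_state keeps the
# figure looking nice.
X, _ = make_blobs(n_samples=255, centers=[[0, 0], [2, 2]],
                  cluster_std=0.5, random_state=0)
```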
examples/plot_anomaly_comparison.py
Outdated
# define outlier/anomaly detection methods to be compared | ||
anomaly_algorithms = [ | ||
("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)), | ||
("One-Class SVM", svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, |
can you explain how nu was set?
and gamma, I guess....
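For context on the question, a minimal sketch of how that `nu` value feeds the estimator (the dataset and `gamma=0.1` below are my own illustrative choices, not the PR's):

```python
from sklearn import svm
from sklearn.datasets import make_blobs

outliers_fraction = 0.15
# nu upper-bounds the fraction of training errors (and lower-bounds the
# fraction of support vectors); the formula from the diff inflates it
# slightly above the assumed contamination.
nu = 0.95 * outliers_fraction + 0.05

X, _ = make_blobs(n_samples=300, centers=1, random_state=0)
clf = svm.OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit(X)
frac_out = (clf.predict(X) == -1).mean()  # fraction flagged as outliers
```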
examples/plot_anomaly_comparison.py
Outdated
("Isolation Forest", IsolationForest(contamination=outliers_fraction, | ||
random_state=42)), | ||
("Local Outlier Factor", LocalOutlierFactor( | ||
n_neighbors=35, contamination=outliers_fraction))] |
can you explain how n_neighbors was set?
    plt.title(name, size=18)

    # fit the data and tag outliers
    if name == "Local Outlier Factor":
Maybe use `if hasattr(algorithm, "predict")`, as that's what you're using further down? Or just use this one there? Either way, I'd use the same.
examples/plot_anomaly_comparison.py (outdated):

    anomaly_scores = algorithm.decision_function(X)

    threshold = stats.scoreatpercentile(anomaly_scores,
                                        100 * outliers_fraction)
(and remove from the if above)
    plt.title(name, size=18)

    # fit the data and tag outliers
    if name == "Local Outlier Factor":
Maybe use `if hasattr(algorithm, "predict")`, as that's what you're using further down? Or just use this one there? Either way, I'd use the same.
Used `if name == "Local Outlier Factor":` in both places.
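The branching above exists because LOF in its outlier-detection mode has no separate `predict` for new data; labels for the training samples come from `fit_predict`. A minimal sketch (toy data, my own parameter values):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(100, 2)

# LOF in outlier-detection mode exposes no predict for unseen points, so
# the example branches on the estimator and labels training data directly.
lof = LocalOutlierFactor(n_neighbors=35, contamination=0.15)
y_pred = lof.fit_predict(X)  # +1 inlier, -1 outlier
```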
examples/plot_anomaly_comparison.py (outdated):

    anomaly_scores = algorithm.decision_function(X)

    threshold = stats.scoreatpercentile(anomaly_scores,
                                        100 * outliers_fraction)
(and remove the special case above)
examples/plot_anomaly_comparison.py (outdated):

    plt.contour(xx, yy, Z, levels=[threshold],
                linewidths=2, colors='black')

    colors = np.array(['#377eb8', '#ff7f00'])
plt.cm.tab10?
I used the same colors as the clustering gallery. tab10 is not the scikit-learn color style :)
Codecov Report
@@            Coverage Diff             @@
##           master   #10004    +/-   ##
========================================
+ Coverage   96.17%   96.18%   +<.01%
========================================
  Files         336      336
  Lines       62669    62733      +64
========================================
+ Hits        60274    60338      +64
  Misses       2395     2395
Continue to review full report at Codecov.
I would remove 'novelty' everywhere and only use 'outlier' (or 'anomaly'), as this example is an outlier detection example and not a novelty detection example. I would also use 0 as the threshold and remove `anomaly_scores` (see my comment below). Otherwise LGTM. I really like the second data set showing the particularity of LOF :) Thanks @agramfort!
examples/plot_anomaly_comparison.py (outdated):

    if name != "Local Outlier Factor":  # LOF does not implement predict
        Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.contour(xx, yy, Z, levels=[threshold],
If we use `predict` on the grid I would rather use `levels=[0]` (`predict` is in {-1, 1} for all estimators), and then we do not need `threshold` and `anomaly_scores` anymore.
The other solution would be to keep `levels=[threshold]` and use `decision_function` on the grid, but then the decision boundary might not agree with the colors of the points given by `y_pred` (which is the case in this example).
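A minimal sketch of the first option (toy data and grid bounds are my own): since `predict` on the grid returns labels in {-1, +1}, the inlier/outlier boundary is simply the zero level-set, with no separate threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(300, 2)
clf = IsolationForest(random_state=42).fit(X)

# predict on the grid is in {-1, +1}; this Z is what would feed
# plt.contour(xx, yy, Z, levels=[0]) to draw the boundary.
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```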
lgtm
examples/plot_anomaly_comparison.py (outdated):

    threshold = stats.scoreatpercentile(anomaly_scores,
                                        100 * outliers_fraction)
    y_pred = (anomaly_scores >= threshold).astype(int)
I think it's a bit weird to reimplement predict, but whatever... lgtm
Hm, I suppose @agramfort and I are a bit biased about how the OCSVM should be used :)
This is the plot we get if we use `predict` only and `nu=outliers_fraction` (instead of `nu=0.2`). It's not good for the OCSVM, but I guess users are going to use the algorithms this way, so if there is a consensus I'm OK with this.
Can you elaborate? If you're saying OCSVM should not be used the way it is implemented, that seems pretty bad?
In particular if the example shows a usage pattern that is not explained somewhere else, that seems pretty confusing.
The `nu` parameter of the OCSVM does not allow one to control the proportion of outliers the way the `contamination` parameter does for the other algorithms. The `contamination` parameter of the other algorithms defines a threshold such that we predict exactly a proportion of outliers equal to `contamination`. The OCSVM automatically finds an offset/threshold such that the proportion of outliers is less than `nu` (but not necessarily equal to `nu`, depending on the approximation/number of samples).
I prefer to choose a value for `nu` a bit higher than the proportion of outliers I want to find, and then threshold the decision function to obtain the desired proportion of outliers (as is done for the other algorithms). Choosing a value for `nu` a bit higher than the proportion of outliers gives a bit more support vectors to approximate the decision function in the neighborhood of the decision boundary. But this extends the original implementation of the OCSVM and does not use the learned offset (and thus the default `predict`).
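The two routes being weighed can be sketched side by side (toy data and `gamma` are my own illustrative choices, not the PR's):

```python
import numpy as np
from scipy import stats
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
outliers_fraction = 0.15

# Route 1 (what users will likely do): nu equals the desired outlier
# fraction; rely on the learned offset via predict.
clf = OneClassSVM(nu=outliers_fraction, gamma=0.35).fit(X)
frac_predict = (clf.predict(X) == -1).mean()

# Route 2 (described above): inflate nu for extra support vectors near the
# boundary, then threshold decision_function at the exact percentile.
clf2 = OneClassSVM(nu=0.95 * outliers_fraction + 0.05, gamma=0.35).fit(X)
scores = clf2.decision_function(X)
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
frac_manual = (scores < threshold).mean()
```

Route 2 pins the flagged fraction at (almost exactly) `outliers_fraction`, while route 1 only upper-bounds it via `nu`.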
Having thought about it a bit more, I think it is better to expose the OneClassSVM with `nu` equal to the fraction of outliers and use `predict`, as users will very likely use the OneClassSVM this way. Besides, the OneClassSVM here roughly predicts the same fraction of outliers for the considered datasets. Thanks for your comments @amueller.
Good to go from my end. Here is the rendered doc:
LGTM.
examples/plot_anomaly_comparison.py (outdated):

    data, the problem is completely unsupervised so model selection can be
    a challenge.
    """
    print(__doc__)  # noqa
You can always put it after the imports, without the `# noqa`.
examples/plot_anomaly_comparison.py (outdated):

    for i_dataset, X in enumerate(datasets):
        # Add outliers
        rng = np.random.RandomState(42)
Do you want to put the rng outside of the loop?
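A minimal sketch of the suggestion (sample counts and noise bounds are my own, not the PR's): with the RNG hoisted out of the loop, results stay reproducible but each dataset receives distinct uniform noise rather than the identical draw.

```python
import numpy as np

# RNG created once, outside the dataset loop
rng = np.random.RandomState(42)
blobs = [rng.randn(100, 2), rng.randn(100, 2) + 3]

# Each iteration consumes fresh draws from the same RNG, so the appended
# uniform outliers differ between datasets while staying reproducible.
datasets = [np.concatenate([X, rng.uniform(low=-6, high=6, size=(15, 2))],
                           axis=0)
            for X in blobs]
```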
examples/plot_anomaly_comparison.py (outdated):

    Comparing anomaly detection algorithms for outlier detection on toy datasets
    ============================================================================

    This example shows characteristics of different
I personally like that this goes up to 80 characters :)
examples/plot_anomaly_comparison.py (outdated):

    a challenge.
    """
    print(__doc__)  # noqa
Do we usually put author and license notes?
comments addressed
Great stuff!
Thanks @agramfort !
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
Add a comparison plot (like we have for classification or clustering) of our anomaly/outlier/novelty detection methods.