[MRG + 1] Isolation forest - new anomaly detection algo #4163


Merged: 1 commit into scikit-learn:master on Oct 24, 2015

Conversation

@ngoix (Contributor) commented Jan 26, 2015

Isolation Forest (iForest) algorithm
Based on previous discussion
http://sourceforge.net/p/scikit-learn/mailman/message/32485020

plt.title("IForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
# a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
# plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
A reviewer (Member) commented on this diff:

I think scaffolding could be removed... :)

@GaelVaroquaux (Member):

What's the reference publication? Is this algorithm a "classic" of outlier detection?

@amueller (Member):

http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation
International Conference on Data Mining 2008, 66 citations (sad trombone playing)

This does not seem to be a particularly established method.

# 'smtp-le', 'smtp-lb', 'smtp-3d', 'SF-le', 'SF-lb', 'SF-3d',
# 'SA-le','SA-lb', 'SA-3d']

datasets = ['http10-le','shuttle', 'forestcover','smtp-le']
A reviewer (Member) commented on this diff:

Unfortunately, that requires a manual download of the data. Any chance to use fetch_mldata? Also, for forestcover (aka covertype) we already have functions in sklearn.datasets.

@amueller (Member):

I think the method is not established enough to warrant inclusion, sorry; see the FAQ http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published (or the updated one at #4131).

@agramfort (Member):

We should add fetchers for the KDD dataset, which is a very good benchmark for outlier detection. The IForest did particularly well in this competition.

This PR is a clear WIP, but at least people know it's there now.

@agramfort (Member):

even if we end up not including it, we should continue benchmarks and code
improvements to see what it has to offer.

@raghavrv (Member):

Could we add a sandbox directory for this kind of out-of-scope but good-to-have stuff?

@GaelVaroquaux (Member):

even if we end up not including it, we should continue benchmarks and code
improvements to see what it has to offer.

It's a question of where to put priorities. The original thread
http://sourceforge.net/p/scikit-learn/mailman/message/32457192/
had links to algorithms that are classics of outlier detection (I have
in particular LOF in mind). We should implement, tune and benchmark
these before the fancier versions. In addition, some of these are rather
easy to implement, and can be made very fast.

@GaelVaroquaux (Member):

Could we add a sandbox directory for these kind of not within scope but good to
have stuff in there?

https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg12039.html

(maybe we need this in the FAQ?)

@amueller (Member):

On 01/26/2015 02:35 PM, ragv wrote:

Could we add a sandbox directory for these kind of not within scope
but good to have stuff in there?

No, see #4131 and the
thread on the mailing list about it.

@raghavrv (Member):

Ah... I should have done a little research before... sorry for the noise!

(maybe we need this in the FAQ?)

#4131 takes care of that (once merged).

@amueller (Member):

@GaelVaroquaux that is what #4131 is about.

@GaelVaroquaux (Member):

@GaelVaroquaux that is what #4131 is about.

Sorry. In my current state of trying to lock myself up to write grants, I
had missed it. Thanks a lot for writing it.

@amueller (Member):

Np ;) I quoted some of your mails but not all; if you find time to read it and want to add more from your mails, let me know.

@GaelVaroquaux (Member):

Np ;) I quoted some of your mails but not all, if you find time to read it and
want to add more from your mail to it, let me know.

I think that I might do that. I think that a bit more of that mail could
go in the FAQ, if you agree (in particular the parts on sandbox and
bitrot). Tell me what you think.

@amueller (Member):

Feel free.

@ngoix (Contributor, author) commented Jan 27, 2015:

So I propose to begin an implementation of more classical anomaly detection algorithms like LOF (on another git branch?), while keeping this PR updated (code improvements, benchmarks, ...). What do you think?

@GaelVaroquaux (Member):

more classical anomaly detection algo like LOF (on another git
branch?),

Sounds good. And yes another branch would be necessary.

while keeping this PR updated (code improvements, benchmark,...). What
do you think ?

Well, what do the classic review papers on outlier detection say the
major algorithms are? Those are the ones that we want to implement first (I
read a review paper a few years ago, and I remember LOF, but there were
other things).

@amueller (Member):

@ngoix you should probably not put too much effort into making this PR adhere to the sklearn coding standards, as its future is unclear. Apart from that, I agree.

@agramfort (Member):

@ngoix you need to demonstrate (after the ICML deadline maybe :) ) that IForest
is a clear benefit compared to what sklearn has to offer now. It won't be
lost anyway.

@agramfort (Member):

Pushed an iteration.

@ngoix please now build your case with public datasets

@glouppe as tree grower in chief any opinion?

General API discussion: what should the score method of an anomaly detection method return? In anomaly detection, the score refers to the level of anomaly (bigger means more likely an anomaly).
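The tension in this API question can be sketched as follows (a minimal illustration with hypothetical helper names, not this PR's API): the anomaly detection literature uses scores where bigger means more anomalous, while scikit-learn scorers generally assume bigger means better, so exposing the negated score keeps model-selection utilities usable.

```python
import numpy as np

def anomaly_scores_to_sklearn_scores(anomaly_scores):
    """Convert 'bigger = more anomalous' scores (the convention in the
    anomaly detection literature) into 'bigger = more normal' scores
    (the convention scikit-learn scorers generally assume)."""
    return -np.asarray(anomaly_scores, dtype=float)

def predict_labels(anomaly_scores, threshold):
    """Turn raw anomaly scores into labels: -1 for anomalies
    (score above threshold), 1 for inliers."""
    scores = np.asarray(anomaly_scores, dtype=float)
    return np.where(scores > threshold, -1, 1)
```

Either convention ranks samples identically; the choice only matters for interoperability with tools that maximize scores.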

@agramfort (Member):

@ngoix also please add tests

@glouppe (Contributor) commented Mar 5, 2015:

Why is this code reimplementing the construction procedure of ExtraTreeClassifier instead of reusing the existing implementation? I think we could instead reuse the decision tree implementation and simply implement the anomaly metric on top of the forest.

@glouppe (Contributor) commented Mar 5, 2015:

Regarding the algorithm itself, I have to admit this is the first time I have read about it. The tree-based anomaly detection algorithm that I know is the one due to Breiman, based on proximity scores. They share the same principles though (computing a score derived from the structure of the forest and from where the samples lie in it).
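For context, the score in the Liu, Ting & Zhou paper linked above combines the average path length of a sample across the trees with a normalizing factor c(n), the expected path length of an unsuccessful BST search. A minimal numpy sketch of those formulas (function names are mine, not this PR's):

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search on n
    samples, used to normalize tree depths (Liu et al., 2008), with
    the harmonic number approximated by H(i) ~ ln(i) + Euler gamma."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def iforest_score(depths, n_samples):
    """Anomaly score s(x, n) = 2 ** (-E[h(x)] / c(n)).

    `depths` holds the path length of one sample in each tree.
    Scores close to 1 indicate anomalies (short average paths);
    scores well below 0.5 indicate inliers."""
    mean_depth = np.mean(depths)
    return 2.0 ** (-mean_depth / average_path_length(n_samples))
```

A sample whose mean depth equals c(n) gets a score of exactly 0.5, which is the paper's decision boundary between "nothing special" and "anomalous".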

@agramfort (Member) commented Mar 5, 2015 via email

n_samples = check_array(n_samples)

x, y = n_samples.shape
n_samples = n_samples.reshape(1, -1)
A reviewer (Member) commented on this diff:

You are passing the order argument.

@arjoly (Member) commented Oct 22, 2015:

You shouldn't have modifications of _util.c.

@ngoix (Contributor, author) commented Oct 22, 2015:

rebased

delayed(_parallel_helper)(tree.tree_,
'apply_depth', X)
for tree in self.estimators_)

A reviewer (Member) commented on this diff:

Don't we need to concatenate the results of depth?
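The concern above can be illustrated: if each parallel call returns one depth array per tree, those per-tree results need to be stacked and averaged per sample before computing the anomaly score (a sketch with simulated arrays; the `apply_depth` outputs below are made up, not this PR's actual values):

```python
import numpy as np

# Simulated outputs of per-tree depth computation: one array per tree,
# each holding the path length of every sample in that tree.
per_tree_depths = [
    np.array([3.0, 7.0, 2.0]),  # depths of the 3 samples in tree 0
    np.array([4.0, 6.0, 3.0]),  # depths of the 3 samples in tree 1
    np.array([2.0, 8.0, 2.0]),  # depths of the 3 samples in tree 2
]

# Stack into shape (n_trees, n_samples), then average over trees so each
# sample gets the mean path length E[h(x)] used by the anomaly score.
depths = np.vstack(per_tree_depths)
mean_depths = depths.mean(axis=0)
```

Without the stacking step, the per-tree results stay a Python list and the per-sample averaging is awkward; `vstack` plus `mean(axis=0)` makes the aggregation explicit.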

@arjoly (Member) commented Oct 23, 2015:

The bug that you mention is solved in #5560.

It will only occur with pickle tests and joblib tests.

@ngoix (Contributor, author) commented Oct 23, 2015:

Bench using apply_depth: (benchmark plot: bench_fulldata_with_apply_depth)

and using decision_path: (benchmark plot: bench_fulldata_with_decision_path)

The largest dataset here is SA which has close to one million samples with 42 features.

@agramfort (Member):

you need to rebase

@ngoix (Contributor, author) commented Oct 23, 2015:

rebased
I think we should avoid duplicate code, even if we lose a 1.5× speed factor in predict().
@glouppe what do you think?

@ngoix (Contributor, author) commented Oct 23, 2015:

And I plotted the memory used; there is not much difference.

@glouppe (Contributor) commented Oct 23, 2015:

rebased
I think we should avoid duplicate code, even if we lose a 1.5× speed factor in predict().
@glouppe what do you think?

+1 for not duplicating code

@agramfort (Member):

please remove C files from the PR and rebase

@agramfort (Member):

also squash into 1 commit and I think we're good to go!

example + benchmark

explanation

make some private functions + fix public API

IForest using BaseForest base class for trees

debug + plot_iforest

classic anomaly detection datasets and benchmark

small modif

BaseBagging inheritance

shuffle dataset before benchmarking

BaseBagging inheritance

remove class label 4 from shuttle dataset

pep8 + rm shuttle.csv bench_IsolationForest.png + doc decision_function

add tests

remove comments

fetching kddcup99 and shuttle datasets

fetching kddcup99 and shuttle datasets

pep8

fetching kddcup99 and shuttle datasets

pep8

new files iforest.py and test_iforest.py

sc

alternative to pandas (but very slow)
in kddcup99.py

faster parser

sc

pep8 + cleanup + simplification

example outlier detection

clean and correct

idem

random_state added

percent10=True in benchmark

mc

remove shuttle + minor changes

sc

undo modif on forest.py and recompile cython on _tree.c

fix travis

cosmit

change bagging to fix travis

Revert "change bagging to fix travis"

This reverts commit 30ea500.

add max_samples_ in BaseBagging.fit to fix travis

mc

API : don't add fit param but use a private _fit + update tests + examples to avoid warning

adapt to the new structure of _tree.pyx

cosmit

add performance test for iforest

add _tree.c _utils.c _criterion.c

TST : pass on tests

remove test

relax roc-auc to fix AppVeyor

add test on toy samples

Handle depth averaging at python level

plot example: rm html add png

load_kddcup99 -> fetch_kddcup99 + doc

Take into account arjoly comments

sh -> shuffle

add decision_path code from scikit-learn#5487 to bench

Take into account arjoly comments

Revert "add decision_path code from scikit-learn#5487 to bench"

This reverts commit 46ad44a.

fix bug with max_samples != int
@glouppe (Contributor) commented Oct 24, 2015:

@ngoix @agramfort @arjoly Anything left for this PR? Shall we merge?

@agramfort (Member):

You have my +1 if Travis is happy and examples run fine

On 24 oct. 2015, at 18:42, Gilles Louppe notifications@github.com wrote:

@ngoix @agramfort @arjoly Anything left for this PR? Shall we merge?



@glouppe (Contributor) commented Oct 24, 2015:

Travis is happy. I am merging. Thanks a lot @ngoix for your contribution and your patience throughout the process! This is a nice addition to scikit-learn.

glouppe added a commit that referenced this pull request Oct 24, 2015
[MRG + 1] Isolation forest - new anomaly detection algo
@glouppe glouppe merged commit 5c8855d into scikit-learn:master Oct 24, 2015
@agramfort (Member) commented Oct 24, 2015 via email

@glouppe (Contributor) commented Oct 24, 2015:

What's new entry added in c99f5db

@arjoly (Member) commented Oct 24, 2015:

Congratulations Nicolas!!!

@GaelVaroquaux (Member) commented Oct 24, 2015 via email

@ngoix (Contributor, author) commented Oct 25, 2015:

Thanks, it's teamwork! 🍻
