[MRG + 1] Isolation forest - new anomaly detection algo #4163
Conversation
plt.title("IForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
# a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
# plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
I think this scaffolding could be removed... :)
What's the reference publication? Is this algorithm a "classic" of outlier detection?
http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation
This does not seem to be a particularly established method.
# 'smtp-le', 'smtp-lb', 'smtp-3d', 'SF-le', 'SF-lb', 'SF-3d',
# 'SA-le', 'SA-lb', 'SA-3d']

datasets = ['http10-le', 'shuttle', 'forestcover', 'smtp-le']
Unfortunately that requires manual download of the data. Any chance to use fetch_mldata? Also, for forestcover aka covertype we already have functions in sklearn.datasets.
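For reference, a minimal sketch of using the built-in covertype fetcher instead of a manual download (assuming the current sklearn.datasets API; the first call downloads and caches the data):

```python
from sklearn.datasets import fetch_covtype

# Covertype ("forestcover") is already shipped with a fetcher, so the
# benchmark would not need any manual download step.
covtype = fetch_covtype(shuffle=True, random_state=42)
X, y = covtype.data, covtype.target
print(X.shape)  # (581012, 54)
```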
I think the method is not established enough to warrant inclusion, sorry, see the FAQ http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published (or the updated one at #4131).
We should add fetchers for the KDD dataset, which is a very good benchmark for outlier detection. The IForest did particularly well on this competition. This PR is a clear WIP, but at least people know it's there now.
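For the record, a fetcher along these lines did eventually land in sklearn.datasets; a usage sketch, assuming a scikit-learn version that ships fetch_kddcup99:

```python
from sklearn.datasets import fetch_kddcup99

# percent10=True downloads the 10% subset, which keeps the benchmark light;
# the 'http' subset is one of the classic outlier detection settings.
kddcup = fetch_kddcup99(subset='http', percent10=True, shuffle=True,
                        random_state=42)
X, y = kddcup.data, kddcup.target  # y mixes b'normal.' with attack labels
```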
even if we end up not including it, we should continue the benchmarks and code
Could we add a sandbox directory for this kind of not-within-scope but good-to-have stuff?
It's a question of where to put priorities. The original thread: https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg12039.html (maybe we need this in the FAQ?)
On 01/26/2015 02:35 PM, ragv wrote:
Ah... I should have done a little research before... sorry for the noise!
#4131 takes care of that (once merged)
@GaelVaroquaux that is what #4131 is about.
Sorry. In my current state of trying to lock myself up to write grants, I ...
Np ;) I quoted some of your mails but not all; if you find time to read it and want to add more from your mail to it, let me know.
I think that I might do that. I think that a bit more of that mail could ...
Feel free.
So I propose to begin an implementation of a more classical anomaly detection algorithm like LOF (on another git branch?), while keeping this PR updated (code improvements, benchmarks, ...). What do you think?
Sounds good. And yes another branch would be necessary.
Well, what do the classic review papers on outlier detection say are the ...
@ngoix you should probably not put too much effort into making this PR adhere to the sklearn coding standards, as its future is unclear. Apart from that, I agree.
@ngoix you need to demonstrate (after the ICML deadline maybe :) ) that IForest ...
Pushed an iteration. @ngoix please now build your case with public datasets. @glouppe, as tree grower in chief, any opinion?
General API discussion: what should the score method of an anomaly detection method return? In anomaly detection the score refers to the level of anomaly (bigger means more likely an anomaly).
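To make the sign question concrete, here is a tiny hypothetical illustration of the two conventions being discussed (the variable names are made up for the example; scikit-learn's outlier detectors later settled on "bigger means more normal" for decision_function, with predict returning -1 for outliers):

```python
import numpy as np

# Raw anomaly scores, following "bigger means more likely an anomaly".
raw_anomaly_score = np.array([0.2, 0.9, 0.4, 0.1])

# The opposite convention: a decision_function-style score where larger
# values mean "more normal", obtained here by a simple sign flip.
decision_score = -raw_anomaly_score
threshold = -0.5
y_pred = np.where(decision_score < threshold, -1, 1)  # -1 flags outliers
```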
@ngoix also please add tests
Why is this code reimplementing the construction procedure of ExtraTreeClassifier instead of reusing the existing implementation? I think we could instead reuse the decision tree implementation and then simply implement the anomaly metric on top of the forest.
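As a rough sketch of that idea (not the PR's actual implementation): grow randomized trees with the existing machinery, measure each sample's path length, and apply the score from the iForest paper, 2 ** (-E[h(x)] / c(n)). Using ExtraTreesRegressor on random targets is only an approximation of fully random splits:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesRegressor


def c(n):
    # Average path length of an unsuccessful BST search (Liu et al. 2008),
    # used to normalize depths; 0.5772... is the Euler-Mascheroni constant.
    return 2.0 * (np.log(n - 1.0) + 0.5772156649) - 2.0 * (n - 1.0) / n


X, _ = make_blobs(n_samples=500, centers=1, random_state=0)
rng = np.random.RandomState(0)

# Random targets + max_features=1 make the splits essentially random.
forest = ExtraTreesRegressor(n_estimators=50, max_features=1, random_state=0)
forest.fit(X, rng.rand(X.shape[0]))

# Path length of each sample in each tree (number of edges from the root).
depths = np.column_stack([
    np.asarray(tree.decision_path(X).sum(axis=1)).ravel() - 1
    for tree in forest.estimators_])

scores = 2.0 ** (-depths.mean(axis=1) / c(X.shape[0]))  # bigger = more anomalous
```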
Regarding the algorithm itself, I have to admit this is the first time I have read about it. The tree-based anomaly detection algorithm that I know is the one due to Breiman, based on proximity scores. They share the same principles though (computing a score derived from the structure of the forest and from where the samples lie in the forest).
we should bench indeed.
n_samples = check_array(n_samples)

x, y = n_samples.shape
n_samples = n_samples.reshape(1, -1)
I don't think this does what you think it does:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.reshape.html#numpy.ndarray.reshape
You are passing the order argument.
You shouldn't have modifications to _utils.c
rebased
delayed(_parallel_helper)(tree.tree_,
                          'apply_depth', X)
for tree in self.estimators_)
Don't we need to concatenate the results of depth?
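A hypothetical toy illustration of that point: joblib's Parallel returns a plain list with one array per tree, which still needs to be stacked and averaged afterwards (fake_tree_depths stands in for the real per-tree computation):

```python
import numpy as np
from joblib import Parallel, delayed


def fake_tree_depths(seed, n_samples=5):
    # Stand-in for the per-tree depth computation.
    return np.random.RandomState(seed).randint(1, 10, size=n_samples)


per_tree = Parallel(n_jobs=2)(
    delayed(fake_tree_depths)(seed) for seed in range(4))
depths = np.column_stack(per_tree)   # shape (n_samples, n_trees)
mean_depth = depths.mean(axis=1)     # averaged depth per sample
```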
the bug that you mention is solved in #5560. It will only occur with pickle tests and joblib tests.
you need to rebase
rebased
And I plotted the memory used; there is not much difference.
+1 for not duplicating code
please remove C files from the PR and rebase
also squash into 1 commit and I think we're good to go!
example + benchmark explanation
make some private functions + fix public API
IForest using BaseForest base class for trees
debug + plot_iforest
classic anomaly detection datasets and benchmark
small modif
BaseBagging inheritance
shuffle dataset before benchmarking
BaseBagging inheritance
remove class label 4 from shuttle dataset
pep8 + rm shuttle.csv
bench_IsolationForest.png + doc
decision_function
add tests
remove comments
fetching kddcup99 and shuttle datasets
fetching kddcup99 and shuttle datasets
pep8
fetching kddcup99 and shuttle datasets
pep8
new files iforest.py and test_iforest.py
sc
alternative to pandas (but very slow) in kddcup99.py
faster parser
sc
pep8 + cleanup + simplification
example outlier detection
clean and correct
idem
random_state added
percent10=True in benchmark
mc
remove shuttle + minor changes
sc
undo modif on forest.py and recompile cython on _tree.c
fix travis
cosmit
change bagging to fix travis
Revert "change bagging to fix travis" This reverts commit 30ea500.
add max_samples_ in BaseBagging.fit to fix travis
mc
API : don't add fit param but use a private _fit + update tests + examples to avoid warning
adapt to the new structure of _tree.pyx
cosmit
add performance test for iforest
add _tree.c _utils.c _criterion.c
TST : pass on tests
remove test
relax roc-auc to fix AppVeyor
add test on toy samples
Handle depth averaging at python level
plot example: rm html add png
load_kddcup99 -> fetch_kddcup99 + doc
Take into account arjoly comments
sh -> shuffle
add decision_path code from scikit-learn#5487 to bench
Take into account arjoly comments
Revert "add decision_path code from scikit-learn#5487 to bench" This reverts commit 46ad44a.
fix bug with max_samples != int
@ngoix @agramfort @arjoly Anything left for this PR? Shall we merge?
You have my +1 if Travis is happy and examples run fine
Travis is happy. I am merging. Thanks a lot @ngoix for your contribution and your patience throughout the process! This is a nice addition to scikit-learn.
@ngoix you deserve some 🍻!
What's new entry added in c99f5db
Congratulations Nicolas!!!
This is awesome! Thanks Nicolas!
Thanks, it's teamwork! 🍻
Isolation Forest (iForest) algorithm
Based on previous discussion: http://sourceforge.net/p/scikit-learn/mailman/message/32485020
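For completeness, a minimal usage sketch of the estimator this PR adds, assuming a recent scikit-learn release (defaults such as contamination have changed over time):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                      # a single "normal" blob
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))  # scattered points

clf = IsolationForest(n_estimators=100, max_samples=64, random_state=rng)
clf.fit(X_train)

print(clf.predict(X_outliers))            # -1 for outliers, +1 for inliers
print(clf.decision_function(X_outliers))  # lower values = more abnormal
```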