[MRG + 1] Isolation forest - new anomaly detection algo #4163
Conversation
plt.title("IForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
# a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
# plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
I think this scaffolding could be removed... :)
What's the reference publication? Is this algorithm a "classic" of outlier detection?
http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation
This does not seem to be a particularly established method.
# 'smtp-le', 'smtp-lb', 'smtp-3d', 'SF-le', 'SF-lb', 'SF-3d',
# 'SA-le', 'SA-lb', 'SA-3d']

datasets = ['http10-le', 'shuttle', 'forestcover', 'smtp-le']
Unfortunately that requires manual download of the data. Any chance to use fetch_mldata? Also, for forestcover aka covertype we already have functions in sklearn.datasets.
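For reference, a minimal sketch of using the built-in covertype fetcher instead of a manual download (assuming the current sklearn.datasets API; the first call downloads and caches the data):

```python
from sklearn.datasets import fetch_covtype

# Covertype ("forestcover") is already shipped with a fetcher, so the
# benchmark would not need any manual download step.
covtype = fetch_covtype(shuffle=True, random_state=42)
X, y = covtype.data, covtype.target
print(X.shape)  # (581012, 54)
```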
I think the method is not established enough to warrant inclusion, sorry, see the FAQ http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published (or the updated one at #4131).
We should add fetchers for the KDD dataset, which is a very good benchmark for outlier detection. The IForest did particularly well on this competition. This PR is a clear WIP, but at least people know it's there now.
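For the record, a fetcher along these lines did eventually land in sklearn.datasets; a usage sketch, assuming a scikit-learn version that ships fetch_kddcup99:

```python
from sklearn.datasets import fetch_kddcup99

# percent10=True downloads the 10% subset, which keeps the benchmark light;
# the 'http' subset is one of the classic outlier detection settings.
kddcup = fetch_kddcup99(subset='http', percent10=True, shuffle=True,
                        random_state=42)
X, y = kddcup.data, kddcup.target  # y mixes b'normal.' with attack labels
```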
even if we end up not including it, we should continue the benchmarks and code
Could we add a sandbox directory for this kind of not-within-scope but good-to-have stuff?
It's a question of where to put priorities. The original thread: https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg12039.html (maybe we need this in the FAQ?)
On 01/26/2015 02:35 PM, ragv wrote:
Ah... I should have done a little research before... sorry for the noise!
#4131 takes care of that (once merged)
@GaelVaroquaux that is what #4131 is about.
Sorry. In my current state of trying to lock myself up to write grants, I ...
Np ;) I quoted some of your mails but not all; if you find time to read it and want to add more from your mail to it, let me know.
I think that I might do that. I think that a bit more of that mail could ...
Feel free.
So I propose to begin an implementation of a more classical anomaly detection algorithm like LOF (on another git branch?), while keeping this PR updated (code improvements, benchmarks, ...). What do you think?
Sounds good. And yes another branch would be necessary.
Well, what do the classic review papers on outlier detection say are the ...
@ngoix you should probably not put too much effort into making this PR adhere to the sklearn coding standards, as its future is unclear. Apart from that, I agree.
@ngoix you need to demonstrate (after the ICML deadline maybe :) ) that IForest ...
Pushed an iteration. @ngoix please now build your case with public datasets. @glouppe, as tree grower in chief, any opinion?
General API discussion: what should the score method of an anomaly detection method return? In anomaly detection the score refers to the level of anomaly (bigger means more likely an anomaly).
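To make the sign question concrete, here is a tiny hypothetical illustration of the two conventions being discussed (the variable names are made up for the example; scikit-learn's outlier detectors later settled on "bigger means more normal" for decision_function, with predict returning -1 for outliers):

```python
import numpy as np

# Raw anomaly scores, following "bigger means more likely an anomaly".
raw_anomaly_score = np.array([0.2, 0.9, 0.4, 0.1])

# The opposite convention: a decision_function-style score where larger
# values mean "more normal", obtained here by a simple sign flip.
decision_score = -raw_anomaly_score
threshold = -0.5
y_pred = np.where(decision_score < threshold, -1, 1)  # -1 flags outliers
```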
@ngoix also please add tests
Why is this code reimplementing the construction procedure of ExtraTreeClassifier instead of reusing the existing implementation? I think we could instead reuse the decision tree implementation and then simply implement the anomaly metric on top of the forest.
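As a rough sketch of that idea (not the PR's actual implementation): grow randomized trees with the existing machinery, measure each sample's path length, and apply the score from the iForest paper, 2 ** (-E[h(x)] / c(n)). Using ExtraTreesRegressor on random targets is only an approximation of fully random splits:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesRegressor


def c(n):
    # Average path length of an unsuccessful BST search (Liu et al. 2008),
    # used to normalize depths; 0.5772... is the Euler-Mascheroni constant.
    return 2.0 * (np.log(n - 1.0) + 0.5772156649) - 2.0 * (n - 1.0) / n


X, _ = make_blobs(n_samples=500, centers=1, random_state=0)
rng = np.random.RandomState(0)

# Random targets + max_features=1 make the splits essentially random.
forest = ExtraTreesRegressor(n_estimators=50, max_features=1, random_state=0)
forest.fit(X, rng.rand(X.shape[0]))

# Path length of each sample in each tree (number of edges from the root).
depths = np.column_stack([
    np.asarray(tree.decision_path(X).sum(axis=1)).ravel() - 1
    for tree in forest.estimators_])

scores = 2.0 ** (-depths.mean(axis=1) / c(X.shape[0]))  # bigger = more anomalous
```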
Regarding the algorithm itself, I have to admit this is the first time I have read about it. The tree-based anomaly detection algorithm that I know is the one due to Breiman, based on proximity scores. They share the same principles though (computing a score derived from the structure of the forest and from where the samples lie in the forest).
we should bench indeed.
n_samples = check_array(n_samples)

x, y = n_samples.shape
n_samples = n_samples.reshape(1, -1)
I don't think this does what you think it does:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.reshape.html#numpy.ndarray.reshape
You are passing the order argument.
You shouldn't have modifications to _utils.c
rebased
delayed(_parallel_helper)(tree.tree_,
                          'apply_depth', X)
for tree in self.estimators_)
Don't we need to concatenate the results of depth?
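A hypothetical toy illustration of that point: joblib's Parallel returns a plain list with one array per tree, which still needs to be stacked and averaged afterwards (fake_tree_depths stands in for the real per-tree computation):

```python
import numpy as np
from joblib import Parallel, delayed


def fake_tree_depths(seed, n_samples=5):
    # Stand-in for the per-tree depth computation.
    return np.random.RandomState(seed).randint(1, 10, size=n_samples)


per_tree = Parallel(n_jobs=2)(
    delayed(fake_tree_depths)(seed) for seed in range(4))
depths = np.column_stack(per_tree)   # shape (n_samples, n_trees)
mean_depth = depths.mean(axis=1)     # averaged depth per sample
```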
the bug that you mention is solved in #5560. It will only occur with pickle tests and joblib tests.
you need to rebase
rebased
And I plotted the memory used; there is not much difference.
+1 for not duplicating code
please remove C files from the PR and rebase
also squash into 1 commit and I think we're good to go!
example + benchmark explanation
make some private functions + fix public API
IForest using BaseForest base class for trees
debug + plot_iforest
classic anomaly detection datasets and benchmark
small modif
BaseBagging inheritance
shuffle dataset before benchmarking
BaseBagging inheritance
remove class label 4 from shuttle dataset
pep8 + rm shuttle.csv
bench_IsolationForest.png + doc
decision_function
add tests
remove comments
fetching kddcup99 and shuttle datasets
fetching kddcup99 and shuttle datasets
pep8
fetching kddcup99 and shuttle datasets
pep8
new files iforest.py and test_iforest.py
sc
alternative to pandas (but very slow) in kddcup99.py
faster parser
sc
pep8 + cleanup + simplification
example outlier detection
clean and correct
idem
random_state added
percent10=True in benchmark
mc
remove shuttle + minor changes
sc
undo modif on forest.py and recompile cython on _tree.c
fix travis
cosmit
change bagging to fix travis
Revert "change bagging to fix travis" This reverts commit 30ea500.
add max_samples_ in BaseBagging.fit to fix travis
mc
API : don't add fit param but use a private _fit + update tests + examples to avoid warning
adapt to the new structure of _tree.pyx
cosmit
add performance test for iforest
add _tree.c _utils.c _criterion.c
TST : pass on tests
remove test
relax roc-auc to fix AppVeyor
add test on toy samples
Handle depth averaging at python level
plot example: rm html add png
load_kddcup99 -> fetch_kddcup99 + doc
Take into account arjoly comments
sh -> shuffle
add decision_path code from scikit-learn#5487 to bench
Take into account arjoly comments
Revert "add decision_path code from scikit-learn#5487 to bench" This reverts commit 46ad44a.
fix bug with max_samples != int
@ngoix @agramfort @arjoly Anything left for this PR? Shall we merge?
You have my +1 if Travis is happy and examples run fine
Travis is happy. I am merging. Thanks a lot @ngoix for your contribution and your patience throughout the process! This is a nice addition to scikit-learn.
@ngoix you deserve some 🍻!
What's new entry added in c99f5db
Congratulations Nicolas!!!
This is awesome! Thanks Nicolas!
Thanks, it's teamwork! 🍻
Isolation Forest (iForest) algorithm
Based on previous discussion: http://sourceforge.net/p/scikit-learn/mailman/message/32485020
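For completeness, a minimal usage sketch of the estimator this PR adds, assuming a recent scikit-learn release (defaults such as contamination have changed over time):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                      # a single "normal" blob
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))  # scattered points

clf = IsolationForest(n_estimators=100, max_samples=64, random_state=rng)
clf.fit(X_train)

print(clf.predict(X_outliers))            # -1 for outliers, +1 for inliers
print(clf.decision_function(X_outliers))  # lower values = more abnormal
```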