Documentation issues in sklearn.datasets.fetch_kddcup99 #7861

Closed
frishrash opened this issue Nov 13, 2016 · 3 comments

@frishrash

Description

  1. The body of the documentation says the default value of the percent10 parameter is False. The actual default is True (as correctly stated in the one-line description at the beginning). This should be fixed.

  2. The documentation says

SF is obtained as in [2] by simply picking up the data whose attribute logged_in is positive, thus focusing on the intrusion attack, which gives a proportion of 0.3% of attack.

However, in [2] (A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data (2002) by Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, Sal Stolfo) they only mention that

In order to make the data set more realistic we filtered many of the attacks so that the resulting data set consisted of 1 to 1.5% attack and 98.5 to 99% normal instances

They do not say how the filtering was done. Is it documented anywhere? If it is, the proper reference should be given. Otherwise, the documentation should cite the paper only for the goal of the filtering, not for how the filtering is done.
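
For anyone who wants to check both points locally, here is a minimal sketch. It assumes nothing about scikit-learn internals: `default_of` reads a default straight from a function signature, and `attack_fraction` applies the "logged_in is positive" filter quoted above to a toy list of records. The names `default_of`, `attack_fraction`, `fake_fetch`, and the toy data are hypothetical and only for illustration; the real dataset is not downloaded here.

```python
import inspect


def default_of(func, param):
    """Return the default value declared for `param` in `func`'s signature."""
    return inspect.signature(func).parameters[param].default


def attack_fraction(records):
    """Keep records with logged_in > 0 and return the share that are not normal."""
    kept = [r for r in records if r["logged_in"] > 0]
    if not kept:
        return 0.0
    attacks = sum(1 for r in kept if r["label"] != "normal.")
    return attacks / len(kept)


# Hypothetical stand-in whose default matches what this issue reports
# for the real function (percent10=True).
def fake_fetch(subset=None, percent10=True):
    return None


# Toy records (made up): four logged-in rows, one of them an attack,
# plus two non-logged-in attack rows that the filter drops.
toy = [
    {"logged_in": 1, "label": "normal."},
    {"logged_in": 1, "label": "normal."},
    {"logged_in": 1, "label": "normal."},
    {"logged_in": 1, "label": "back."},
    {"logged_in": 0, "label": "smurf."},
    {"logged_in": 0, "label": "smurf."},
]

print(default_of(fake_fetch, "percent10"))  # True
print(attack_fraction(toy))                 # 0.25
```

With scikit-learn installed, passing `sklearn.datasets.fetch_kddcup99` to `default_of` in place of `fake_fetch` should show the real default. The 0.25 above comes from the toy data only and says nothing about the actual proportion of attacks in SF.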

@ngoix
Contributor

ngoix commented Nov 14, 2016

Yes, it is the wrong reference (SF is not even mentioned in it).

[2] should be:

[2] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. Online
unsupervised outlier detection using finite mixtures with
discounting learning algorithms. In Proceedings of the sixth
ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 320–324. ACM Press, 2000.

b-carter added a commit to b-carter/scikit-learn that referenced this issue Dec 17, 2016
@b-carter
Contributor

I can submit a PR for this with the updated reference.

@albertcthomas
Contributor

@agramfort can you close this issue?
