Skip to content

[MRG+1] BUG: Fix fetch_kddcup99 for Python 3 #5946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 13, 2015

Conversation

nmayorov
Copy link
Contributor

@nmayorov nmayorov commented Dec 1, 2015

This is a small fix to account for str / bytes issues in fetch_kddcup99.

@raghavrv
Copy link
Member

raghavrv commented Dec 2, 2015

LGTM Thanks for the fix :)

@nmayorov nmayorov changed the title [MRG] BUG: Fix fetch_kddcup99 for Python 3 [MRG+1] BUG: Fix fetch_kddcup99 for Python 3 Dec 6, 2015
@nmayorov
Copy link
Contributor Author

nmayorov commented Dec 6, 2015

OK, I changed to MRG+1

@amueller
Copy link
Member

amueller commented Dec 9, 2015

why was this not caught by a test? because the tests don't download? But if the file is downloaded already, the tests should run, right?

@raghavrv
Copy link
Member

raghavrv commented Dec 9, 2015

Ah there are no tests for kddcup99.py. @nmayorov could you introduce a test for this?

@nmayorov
Copy link
Contributor Author

nmayorov commented Dec 9, 2015

@rvraghav93 I will add a test similar to test_covtype.py.

@raghavrv
Copy link
Member

raghavrv commented Dec 9, 2015

yes thanks :)

@nmayorov
Copy link
Contributor Author

@rvraghav93 not sure how to test it, test_covtype.py doesn't download data during testing (the same for other fetch test routines).

The problem with fetch_kddcup99 occurs during the download and parsing phase. So what should I test? The download part doesn't seem to fit into unit testing.

Could someone just merge it and be done with it?

@raghavrv
Copy link
Member

I guess yes... @amueller merge?

@amueller
Copy link
Member

Wait, there are still tests for the other downloaded datasets, right? If the data was downloaded, then the tests are run with the downloaded data, otherwise they are mocked, right?
If not, we should make this more consistent.

@nmayorov
Copy link
Contributor Author

I added a test (even though it doesn't target this fix), and it was helpful to find a related bug with comparison (so @amueller was right).

I noticed other strange thing: the current implementation with subset='SF', 'http' or 'smtp'returns data with 4, 3 and 3 features respectively (and from the code it seems like it was intended). I'm not well familiar with this dataset, but it looks strange and contradicts the docstring (for now I modified the docstring).

I think we need to ask the author @ngoix of kddcup99.py what does it mean, is it right or wrong?

@ngoix
Copy link
Contributor

ngoix commented Dec 10, 2015

Yes you are right the docstring was wrong.

data = fetch_kddcup99('http', percent10=True)
assert_equal(data.data.shape, (58725, 3))
assert_equal(data.target.shape, (58725,))
print(data.data.shape, data.target.shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print to be removed

@ngoix
Copy link
Contributor

ngoix commented Dec 10, 2015

+1 for merge

@nmayorov
Copy link
Contributor Author

@ngoix thanks,

I also changed "Load and return the kddcup 99 dataset (regression)" to "Load and return the kddcup 99 dataset (anomaly detection)". Is it appropriate description (calling it "regression" was definitely misleading)?

@ngoix
Copy link
Contributor

ngoix commented Dec 10, 2015

Yes right thanks! You could also call it more generally classification.

@nmayorov
Copy link
Contributor Author

Changed to standard "classification".

assert_equal(data.target.shape, (9571,))


if __name__ == '__main__':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this. the file should be run using nostests.

@amueller
Copy link
Member

+1 apart from the __main__...

@nmayorov
Copy link
Contributor Author

Done.

@nmayorov
Copy link
Contributor Author

Another thing: how about changing default of percent10 to True? The full data is big, and perhaps requires more than 8GB RAM to be fetched without issues (all data is stored as Python list at some point).

@agramfort
Copy link
Member

@ngoix any objection?

@ngoix
Copy link
Contributor

ngoix commented Dec 11, 2015

No I'm ok with this change!

@agramfort
Copy link
Member

+1 for merge when @amueller is happy

@amueller
Copy link
Member

squash maybe? LGTM.

BUG: Fixed comparison with bytes in kddcup99.py + test

MAINT: Changed default 'percent10' to True in fetch_kddcup99
@nmayorov
Copy link
Contributor Author

@amueller, squashed.

agramfort added a commit that referenced this pull request Dec 13, 2015
[MRG+1] BUG: Fix fetch_kddcup99 for Python 3
@agramfort agramfort merged commit 8d0a299 into scikit-learn:master Dec 13, 2015
@agramfort
Copy link
Member

thanks @nmayorov

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants