ENH Better error for corrupted files in fetch_kddcup99 #19669

thomasjpfan · 2021-03-13T01:50:23Z

Reference Issues/PRs

Resolves #19667

What does this implement/fix? Explain your changes.

This PR adds a clearer error message when the dataset cannot be loaded from the cache.
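
A minimal sketch of the pattern described above, not the actual diff: wrap the cache load in a try/except and re-raise with a message that tells the user what to delete. The helper name `load_cached`, the message wording, and the use of the stdlib `pickle` module in place of the fetcher's real serializer are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def load_cached(cache_file):
    """Load a cached object, raising a descriptive error if it is unreadable.

    Hypothetical helper: `pickle` stands in for the fetcher's real serializer.
    """
    cache_file = Path(cache_file)
    try:
        with cache_file.open("rb") as f:
            return pickle.load(f)
    except Exception as exc:
        # Chain the original exception so the low-level cause stays visible.
        raise OSError(
            f"The cache file {cache_file} is invalid or corrupted. "
            "Delete it and run the fetcher again."
        ) from exc
```

Chaining with `from exc` keeps the original traceback available, so the friendlier message does not hide the underlying cause.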

ogrisel

LGTM.

Just a side note: when reviewing #10027 I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml.

Therefore, I wonder whether we should deprecate our own fetcher in favor of fetch_openml, although it does not return exactly the same data structure and so is not a drop-in replacement...
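
For readers who want to try the alternative suggested above, a hedged sketch follows. The OpenML dataset name `"KDDCup99"` and `version=1` are assumptions and should be checked against OpenML; the wrapper name `load_kddcup99_from_openml` is hypothetical. The call needs network access on first use.

```python
def load_kddcup99_from_openml(name="KDDCup99", version=1):
    """Fetch KDD Cup 99 through OpenML instead of fetch_kddcup99.

    Assumption: the dataset is published on OpenML under this name and version.
    """
    # Imported lazily so that merely defining the function does not import sklearn.
    from sklearn.datasets import fetch_openml

    # return_X_y=True yields a (data, target) tuple rather than a Bunch.
    return fetch_openml(name=name, version=version, return_X_y=True)
```

As noted in the comment above, the result is not a drop-in replacement for the output of fetch_kddcup99, so downstream code may need adjusting.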

rth · 2021-03-18T13:14:22Z

> I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml

Which is also slow (ARFF parsing in pure python) but I imagine less so :)

rth

Thanks, LGTM!

thomasjpfan · 2021-03-18T16:15:10Z

> Just a side note: when reviewing #10027 I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml.

fetch_openml is more memory efficient because it parses the ARFF file in chunks.
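
To illustrate why chunked parsing helps, here is a generic sketch. It is not fetch_openml's implementation: the function name `parse_in_chunks`, the comma-separated numeric input, and the use of the stdlib `array` module as compact storage are all assumptions. The point is that only one chunk of rows exists as Python objects at a time, while parsed values go into a packed buffer.

```python
from array import array
from itertools import islice


def parse_in_chunks(lines, chunk_size=1000):
    """Parse comma-separated numeric rows chunk by chunk into a compact array."""
    values = array("d")  # packed C doubles, far smaller than Python float objects
    n_rows = 0
    lines = iter(lines)
    while True:
        # Only `chunk_size` rows are held as Python strings at any one time.
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            break
        for line in chunk:
            values.extend(float(field) for field in line.split(","))
        n_rows += len(chunk)
    return values, n_rows
```

Passing an open file object as `lines` keeps the whole file from being read into memory at once.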

> Which is also slow (ARFF parsing in pure python) but I imagine less so :)

In the (near?) future, we can use Parquet files! A Parquet parser would be harder to vendor, so it may lead us down the path of using optional dependencies (like how pandas.read_parquet works).

XREF: openml/openml-python#1032
XREF: renatopp/liac-arff#120
XREF: #11821
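
The optional-dependency approach mentioned in this comment can be sketched as a lazy import with an actionable error. This is only an illustration under assumptions: the helper name `import_optional_dependency` and the message wording are hypothetical here, and this is not code from pandas or scikit-learn.

```python
import importlib


def import_optional_dependency(name, feature):
    """Import `name` lazily, with an actionable error if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{name}'. "
            f"Install it, for example with `pip install {name}`."
        ) from exc
```

A Parquet-based loader could call such a helper at the point of use, so the package imports cleanly for users who never load Parquet files.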

rth · 2021-03-18T16:20:00Z

> In the (near?) future, we can use Parquet files! A parquet parser would be harder to vendor, so it may lead us down a path of using optional dependencies.

Great! Looking forward to that.

ogrisel · 2021-03-18T19:28:21Z

Really good news :)

ogrisel · 2021-03-18T19:31:05Z

> Which is also slow (ARFF parsing in pure python) but I imagine less so :)

I cannot even use the scikit-learn loader without my 16 GB RAM system starting to swap, whereas fetch_openml uses less than 1 GB and completes in a few seconds:

https://github.com/scikit-learn/scikit-learn/pull/10027/files#r595370382

ENH Better error for corrupted files in fetch_kddcup99

d386d71

github-actions bot added the module:datasets label Mar 13, 2021

DOC Adds whats new

e46c951

thomasjpfan mentioned this pull request Mar 16, 2021

Corrupted Loaded Dataset Returns Unhelpful Error #19667

Closed

ogrisel approved these changes Mar 17, 2021


rth approved these changes Mar 18, 2021


rth merged commit 2e7009b into scikit-learn:main Mar 18, 2021

rth mentioned this pull request Mar 27, 2021

fetch_openml with mnist_784 uses excessive memory #19774

Closed

glemaitre mentioned this pull request Apr 22, 2021

Release 0.24.2 #19954

Merged

12 tasks
