Skip to content

ENH Better error for corrupted files in fetch_kddcup99 #19669

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 18, 2021

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Resolves #19667

What does this implement/fix? Explain your changes.

This PR adds a nicer error message when the file can not be loaded from cache.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just a side note: when reviewing #10027 I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml.

Therefore, I wonder if we should not deprecate our own fetcher in favor of fetch_openml although it will not return exactly the same datastructure so is not a drop-in replacement...

@rth
Copy link
Member

rth commented Mar 18, 2021

I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml

Which is also slow (ARFF parsing in pure python) but I imagine less so :)

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@rth rth merged commit 2e7009b into scikit-learn:main Mar 18, 2021
@thomasjpfan
Copy link
Member Author

Just a side note: when reviewing #10027 I noticed that this fetcher is extremely memory inefficient and slow. For all practical purposes I think it's much better to use fetch_openml.

fetch_openml is more memory efficient because it does parsing the arff file in chunks.

Which is also slow (ARFF parsing in pure python) but I imagine less so :)

The (near?) future, we can use Parquet files! A parquet parser would be harder to vendor, so it may lead us down a path of using optional dependencies. (Like how pandas.read_parquet works)

XREF: openml/openml-python#1032
XREF: renatopp/liac-arff#120
XREF: #11821

@rth
Copy link
Member

rth commented Mar 18, 2021

The (near?) future, we can use Parquet files! A parquet parser would be harder to vendor, so it may lead us down a path of using optional dependencies.

Great! Looking forward to that.

@ogrisel
Copy link
Member

ogrisel commented Mar 18, 2021

Really good news :)

@ogrisel
Copy link
Member

ogrisel commented Mar 18, 2021

Which is also slow (ARFF parsing in pure python) but I imagine less so :)

I cannot even use the scikit-learn loader without getting my 16GB RAM system to start swapping while fetch_openml uses less than 1GB and completes in a few seconds:

https://github.com/scikit-learn/scikit-learn/pull/10027/files#r595370382

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Corrupted Loaded Dataset Returns Unhelpful Error
3 participants