EXA Download parquet file from data.openml.org #30824
Full disclosure: the example inside JupyterLite fails on the next cell when trying to load the data into a polars DataFrame, probably because of pola-rs/polars#20876.
@PGijsbers I would be curious how much you think relying on a parquet URL like |
As far as hard-coded URLs go, this one should be safe. It connects to our multi-node MinIO deployment, so it has high fault tolerance. We are looking into ways to redirect traffic more quickly if a similar event occurs where our university's IT department disables all incoming connections altogether, though such events should be extremely rare in the first place.
OK great, thanks for your insights @PGijsbers! For completeness: I am not too worried about a situation similar to the university cyberattack. I was more trying to figure out whether the parquet URL could be considered an implementation detail and subject to change, but it looks like relying on it should be fine then.
It is technically an implementation detail (the dataset description's |
Sounds great, thanks for the additional details!
Upside: CORS headers are set on openml.org, so the data download works in JupyterLite without any workaround; see #30708 (comment).
Downside seems to be that the parquet URL may be considered an OpenML implementation detail; see #30715 (comment).
If we want to make this more robust, we can always add a bit of Python to do the equivalent of:
curl -s https://api.openml.org/api/v1/json/data/44063 | jq .data_set_description.parquet_url
This would make the example more robust at the cost of making it less realistic and more convoluted (people would normally load a parquet URL or parquet file directly).
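For reference, a minimal Python sketch of that `curl | jq` lookup, using only the standard library. The helper names `extract_parquet_url` and `fetch_parquet_url` and the `timeout` default are hypothetical, and the JSON layout is assumed to match the `jq` path above. It also assumes a regular CPython environment with network access; whether `urllib` works unmodified inside JupyterLite has not been checked here.

```python
import json
from urllib.request import urlopen

# Endpoint taken from the curl command above; the dataset id is filled in later.
OPENML_API = "https://api.openml.org/api/v1/json/data/{data_id}"


def extract_parquet_url(payload):
    """Return the parquet URL from a decoded dataset-description payload.

    Mirrors the jq filter `.data_set_description.parquet_url`.
    """
    return payload["data_set_description"]["parquet_url"]


def fetch_parquet_url(data_id, timeout=30):
    """Query the OpenML JSON API and return the dataset's parquet URL.

    Hypothetical helper: performs one HTTP GET and raises on network errors
    or if the expected keys are missing from the response.
    """
    with urlopen(OPENML_API.format(data_id=data_id), timeout=timeout) as response:
        payload = json.load(response)
    return extract_parquet_url(payload)
```

The example would then call `fetch_parquet_url(44063)` and pass the returned URL to the parquet reader, instead of hard-coding the URL.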
cc @ogrisel.