EXA Download parquet file from data.openml.org #30824

Merged: 1 commit merged into scikit-learn:main from use-openml-parquet-file on Feb 17, 2025

Conversation

@lesteve (Member) commented Feb 13, 2025

Upside: CORS headers are set on openml.org, so the data download works in JupyterLite without any work-around; see #30708 (comment).

Downside: this may be considered an OpenML implementation detail; see #30715 (comment).

If we want to make this more robust, we can always add a bit of Python to do the equivalent of:

curl -s https://api.openml.org/api/v1/json/data/44063 | jq .data_set_description.parquet_url

This would make the example robust, at the cost of making it less realistic and more convoluted (people would normally load a parquet URL or parquet file directly).
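For illustration, a minimal standard-library sketch of what that bit of Python could look like, mirroring the curl | jq pipeline above (a sketch, not the example's actual code):

```python
import json
from urllib.request import urlopen

# Fetch the OpenML dataset description and read its parquet_url field,
# the equivalent of the curl | jq one-liner above.
with urlopen("https://api.openml.org/api/v1/json/data/44063") as response:
    description = json.load(response)["data_set_description"]

print(description["parquet_url"])
```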

cc @ogrisel.


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1fd4ac3.

@lesteve (Member, Author) commented Feb 13, 2025

Full disclosure: the example inside JupyterLite fails on the next cell when trying to load the data into a polars DataFrame, probably due to pola-rs/polars#20876.

AttributeError: type object 'builtins.PyLazyFrame' has no attribute 'new_from_parquet'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 df = pl.read_parquet(bike_sharing_data_file)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:213, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, credential_provider, retries, use_pyarrow, pyarrow_options, memory_map, include_file_paths, allow_missing_columns)
    203     return _read_parquet_with_pyarrow(
    204         source,
    205         columns=columns,
   (...)
    209         rechunk=rechunk,
    210     )
    212 # For other inputs, defer to `scan_parquet`
--> 213 lf = scan_parquet(
    214     source,
    215     n_rows=n_rows,
    216     row_index_name=row_index_name,
    217     row_index_offset=row_index_offset,
    218     parallel=parallel,
    219     use_statistics=use_statistics,
    220     hive_partitioning=hive_partitioning,
    221     schema=schema,
    222     hive_schema=hive_schema,
    223     try_parse_hive_dates=try_parse_hive_dates,
    224     rechunk=rechunk,
    225     low_memory=low_memory,
    226     cache=False,
    227     storage_options=storage_options,
    228     credential_provider=credential_provider,
    229     retries=retries,
    230     glob=glob,
    231     include_file_paths=include_file_paths,
    232     allow_missing_columns=allow_missing_columns,
    233 )
    235 if columns is not None:
    236     if is_int_sequence(columns):

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:489, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, credential_provider, retries, include_file_paths, allow_missing_columns)
    481     source = [
    482         normalize_filepath(source, check_not_directory=False) for source in source
    483     ]
    485 credential_provider = _maybe_init_credential_provider(
    486     credential_provider, source, storage_options, "scan_parquet"
    487 )
--> 489 return _scan_parquet_impl(
    490     source,  # type: ignore[arg-type]
    491     n_rows=n_rows,
    492     cache=cache,
    493     parallel=parallel,
    494     rechunk=rechunk,
    495     row_index_name=row_index_name,
    496     row_index_offset=row_index_offset,
    497     storage_options=storage_options,
    498     credential_provider=credential_provider,
    499     low_memory=low_memory,
    500     use_statistics=use_statistics,
    501     hive_partitioning=hive_partitioning,
    502     schema=schema,
    503     hive_schema=hive_schema,
    504     try_parse_hive_dates=try_parse_hive_dates,
    505     retries=retries,
    506     glob=glob,
    507     include_file_paths=include_file_paths,
    508     allow_missing_columns=allow_missing_columns,
    509 )

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:546, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, credential_provider, low_memory, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, retries, include_file_paths, allow_missing_columns)
    542 else:
    543     # Handle empty dict input
    544     storage_options = None
--> 546 pylf = PyLazyFrame.new_from_parquet(
    547     source,
    548     sources,
    549     n_rows,
    550     cache,
    551     parallel,
    552     rechunk,
    553     parse_row_index_args(row_index_name, row_index_offset),
    554     low_memory,
    555     cloud_options=storage_options,
    556     credential_provider=credential_provider,
    557     use_statistics=use_statistics,
    558     hive_partitioning=hive_partitioning,
    559     schema=schema,
    560     hive_schema=hive_schema,
    561     try_parse_hive_dates=try_parse_hive_dates,
    562     retries=retries,
    563     glob=glob,
    564     include_file_paths=include_file_paths,
    565     allow_missing_columns=allow_missing_columns,
    566 )
    567 return wrap_ldf(pylf)

AttributeError: type object 'builtins.PyLazyFrame' has no attribute 'new_from_parquet'

@lesteve (Member, Author) commented Feb 13, 2025

@PGijsbers I would be curious how bad an idea you think it is to rely on a parquet URL like https://data.openml.org/datasets/0004/44063/dataset_44063.pq 😅
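(For context, roughly what the example does with that hard-coded URL; the variable name bike_sharing_data_file comes from the traceback above, the rest is an illustrative sketch rather than the example's exact code:)

```python
from urllib.request import urlretrieve

import polars as pl

# Download the parquet file from the hard-coded data.openml.org URL,
# then load it into a polars DataFrame.
url = "https://data.openml.org/datasets/0004/44063/dataset_44063.pq"
bike_sharing_data_file, _ = urlretrieve(url)
df = pl.read_parquet(bike_sharing_data_file)
```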

@PGijsbers (Contributor) commented

As far as hard-coded URLs go, this one should be safe. It connects to our multi-node MinIO deployment, so it has high fault tolerance. We are looking into ways to redirect traffic more quickly if a similar event occurs in which our university's IT department disables all incoming connections altogether, though such events should be extremely rare in the first place.

@lesteve (Member, Author) commented Feb 14, 2025

OK, great, thanks for your insights @PGijsbers!

For completeness: I am not too worried about a repeat of the university cyberattack situation. I was more trying to figure out whether the parquet URL could be considered an implementation detail subject to change, but it sounds like relying on it should be fine.

@PGijsbers (Contributor) commented

It is technically an implementation detail (the dataset description's parquet_url is the authoritative source). But with the data.openml.org subdomain fixed, and the /datasets/... path defined by our MinIO bucket structure, there won't be any sudden changes. We set up the buckets based on advice from MinIO support, and there are currently no indications that we should use a different bucket organisation. If we were ever to reorganize, I don't currently see any reason why we would not be able to give a heads-up and/or a transition period.
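Building on that, a hedged sketch of treating the dataset description's parquet_url as authoritative and keeping the hard-coded MinIO URL only as a fallback (the function name and fallback logic here are assumptions for illustration, not anything OpenML prescribes):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Hard-coded MinIO URL, used only if the API lookup fails (assumption:
# this fallback strategy is illustrative, not an OpenML recommendation).
FALLBACK_URL = "https://data.openml.org/datasets/0004/44063/dataset_44063.pq"

def resolve_parquet_url(data_id: int) -> str:
    """Return the authoritative parquet_url from the OpenML dataset
    description, falling back to the hard-coded URL on failure."""
    try:
        with urlopen(f"https://api.openml.org/api/v1/json/data/{data_id}") as resp:
            return json.load(resp)["data_set_description"]["parquet_url"]
    except (URLError, KeyError):
        return FALLBACK_URL

print(resolve_parquet_url(44063))
```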

@lesteve (Member, Author) commented Feb 14, 2025

Sounds great, thanks for the additional details!

@jeremiedbb merged commit 3c0e722 into scikit-learn:main on Feb 17, 2025
38 checks passed
@lesteve deleted the use-openml-parquet-file branch on February 18, 2025 at 08:53