EXA Download parquet file from data.openml.org #30824

Merged: 1 commit merged into scikit-learn:main from use-openml-parquet-file on Feb 17, 2025

Conversation

@lesteve (Member) commented Feb 13, 2025

Upside: CORS headers are set on openml.org, so the data download works in JupyterLite without any work-around; see #30708 (comment).

Downside: this may be considered an OpenML implementation detail; see #30715 (comment).

If we want to make this more robust, we can always add a bit of Python to do the equivalent of:

curl -s https://api.openml.org/api/v1/json/data/44063 | jq .data_set_description.parquet_url

This would make the example robust, at the cost of making it less realistic and more convoluted (people would normally load a parquet URL or parquet file directly).
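For illustration, a minimal standard-library sketch of what that bit of Python could look like, mirroring the curl | jq pipeline above (a sketch, not the example's actual code):

```python
import json
from urllib.request import urlopen

# Fetch the OpenML dataset description and read its parquet_url field,
# the equivalent of the curl | jq one-liner above.
with urlopen("https://api.openml.org/api/v1/json/data/44063") as response:
    description = json.load(response)["data_set_description"]

print(description["parquet_url"])
```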

cc @ogrisel.


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1fd4ac3.

@lesteve (Member, Author) commented Feb 13, 2025

Full disclosure: the example inside JupyterLite fails on the next cell when trying to load the data into a polars DataFrame, probably due to pola-rs/polars#20876.

AttributeError: type object 'builtins.PyLazyFrame' has no attribute 'new_from_parquet'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 df = pl.read_parquet(bike_sharing_data_file)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:213, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, credential_provider, retries, use_pyarrow, pyarrow_options, memory_map, include_file_paths, allow_missing_columns)
    203     return _read_parquet_with_pyarrow(
    204         source,
    205         columns=columns,
   (...)
    209         rechunk=rechunk,
    210     )
    212 # For other inputs, defer to `scan_parquet`
--> 213 lf = scan_parquet(
    214     source,
    215     n_rows=n_rows,
    216     row_index_name=row_index_name,
    217     row_index_offset=row_index_offset,
    218     parallel=parallel,
    219     use_statistics=use_statistics,
    220     hive_partitioning=hive_partitioning,
    221     schema=schema,
    222     hive_schema=hive_schema,
    223     try_parse_hive_dates=try_parse_hive_dates,
    224     rechunk=rechunk,
    225     low_memory=low_memory,
    226     cache=False,
    227     storage_options=storage_options,
    228     credential_provider=credential_provider,
    229     retries=retries,
    230     glob=glob,
    231     include_file_paths=include_file_paths,
    232     allow_missing_columns=allow_missing_columns,
    233 )
    235 if columns is not None:
    236     if is_int_sequence(columns):

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/_utils/deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:489, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, credential_provider, retries, include_file_paths, allow_missing_columns)
    481     source = [
    482         normalize_filepath(source, check_not_directory=False) for source in source
    483     ]
    485 credential_provider = _maybe_init_credential_provider(
    486     credential_provider, source, storage_options, "scan_parquet"
    487 )
--> 489 return _scan_parquet_impl(
    490     source,  # type: ignore[arg-type]
    491     n_rows=n_rows,
    492     cache=cache,
    493     parallel=parallel,
    494     rechunk=rechunk,
    495     row_index_name=row_index_name,
    496     row_index_offset=row_index_offset,
    497     storage_options=storage_options,
    498     credential_provider=credential_provider,
    499     low_memory=low_memory,
    500     use_statistics=use_statistics,
    501     hive_partitioning=hive_partitioning,
    502     schema=schema,
    503     hive_schema=hive_schema,
    504     try_parse_hive_dates=try_parse_hive_dates,
    505     retries=retries,
    506     glob=glob,
    507     include_file_paths=include_file_paths,
    508     allow_missing_columns=allow_missing_columns,
    509 )

File /lib/python3.12/site-packages/polars/io/parquet/functions.py:546, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, credential_provider, low_memory, use_statistics, hive_partitioning, glob, schema, hive_schema, try_parse_hive_dates, retries, include_file_paths, allow_missing_columns)
    542 else:
    543     # Handle empty dict input
    544     storage_options = None
--> 546 pylf = PyLazyFrame.new_from_parquet(
    547     source,
    548     sources,
    549     n_rows,
    550     cache,
    551     parallel,
    552     rechunk,
    553     parse_row_index_args(row_index_name, row_index_offset),
    554     low_memory,
    555     cloud_options=storage_options,
    556     credential_provider=credential_provider,
    557     use_statistics=use_statistics,
    558     hive_partitioning=hive_partitioning,
    559     schema=schema,
    560     hive_schema=hive_schema,
    561     try_parse_hive_dates=try_parse_hive_dates,
    562     retries=retries,
    563     glob=glob,
    564     include_file_paths=include_file_paths,
    565     allow_missing_columns=allow_missing_columns,
    566 )
    567 return wrap_ldf(pylf)

AttributeError: type object 'builtins.PyLazyFrame' has no attribute 'new_from_parquet'

@lesteve (Member, Author) commented Feb 13, 2025

@PGijsbers I would be curious how bad an idea you think it is to rely on a parquet URL like https://data.openml.org/datasets/0004/44063/dataset_44063.pq 😅
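(For context, roughly what the example does with that hard-coded URL; the variable name bike_sharing_data_file comes from the traceback above, the rest is an illustrative sketch rather than the example's exact code:)

```python
from urllib.request import urlretrieve

import polars as pl

# Download the parquet file from the hard-coded data.openml.org URL,
# then load it into a polars DataFrame.
url = "https://data.openml.org/datasets/0004/44063/dataset_44063.pq"
bike_sharing_data_file, _ = urlretrieve(url)
df = pl.read_parquet(bike_sharing_data_file)
```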

@PGijsbers (Contributor) commented

As far as hard-coded URLs go, this one should be safe. It connects to our multi-node MinIO deployment, so it has high fault tolerance. We are looking into ways to redirect traffic more quickly if a similar event occurs in which our university's IT department disables all incoming connections altogether, though such events should be extremely rare in the first place.

@lesteve (Member, Author) commented Feb 14, 2025

OK, great, thanks for your insights @PGijsbers!

For completeness: I am not too worried about a repeat of the university cyberattack situation. I was more trying to figure out whether the parquet URL could be considered an implementation detail subject to change, but it sounds like relying on it should be fine.

@PGijsbers (Contributor) commented

It is technically an implementation detail (the dataset description's parquet_url is the authoritative source). But with the data.openml.org subdomain fixed, and the /datasets/... path defined by our MinIO bucket structure, there won't be any sudden changes. We set up the buckets based on advice from MinIO support, and there are currently no indications that we should use a different bucket organisation. If we were ever to reorganize, I don't currently see any reason why we would not be able to give a heads-up and/or a transition period.
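Building on that, a hedged sketch of treating the dataset description's parquet_url as authoritative and keeping the hard-coded MinIO URL only as a fallback (the function name and fallback logic here are assumptions for illustration, not anything OpenML prescribes):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Hard-coded MinIO URL, used only if the API lookup fails (assumption:
# this fallback strategy is illustrative, not an OpenML recommendation).
FALLBACK_URL = "https://data.openml.org/datasets/0004/44063/dataset_44063.pq"

def resolve_parquet_url(data_id: int) -> str:
    """Return the authoritative parquet_url from the OpenML dataset
    description, falling back to the hard-coded URL on failure."""
    try:
        with urlopen(f"https://api.openml.org/api/v1/json/data/{data_id}") as resp:
            return json.load(resp)["data_set_description"]["parquet_url"]
    except (URLError, KeyError):
        return FALLBACK_URL

print(resolve_parquet_url(44063))
```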

@lesteve (Member, Author) commented Feb 14, 2025

Sounds great, thanks for the additional details!

@jeremiedbb merged commit 3c0e722 into scikit-learn:main on Feb 17, 2025
38 checks passed
@lesteve deleted the use-openml-parquet-file branch on February 18, 2025 at 08:53