FIX downcast nominal features whenever possible in LIAC-ARFF #22354

Closed · wants to merge 4 commits
4 changes: 4 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -145,6 +145,10 @@ Changelog
3 times in case of a network failure with a delay between each try.
:pr:`21901` by :user:`Rileran <rileran>`.

- |Fix| :func:`datasets.fetch_openml` now downcasts, whenever possible, values
  that are specified as nominal features instead of storing them as strings.
  :pr:`22354` by :user:`Guillaume Lemaitre <glemaitre>`.

:mod:`sklearn.decomposition`
............................

12 changes: 8 additions & 4 deletions sklearn/datasets/tests/test_openml.py
@@ -32,7 +32,7 @@
from sklearn.utils._testing import assert_allclose, assert_array_equal
from urllib.error import HTTPError
from sklearn.datasets.tests.test_common import check_return_X_y
-from sklearn.externals._arff import ArffContainerType
+from sklearn.externals._arff import ArffContainerType, _downcast
from functools import partial
from sklearn.utils._testing import fails_if_pypy

@@ -90,12 +90,16 @@ def decode_column(data_bunch, col_idx):

    data_downloaded = np.array(list(data_arff["data"]), dtype="O")

-   for i in range(len(data_bunch.feature_names)):
+   for column_idx, column_name in enumerate(data_bunch.feature_names):
+       # XXX: Test per column, as this makes it easier to avoid problems with
+       # missing values
+       column_downloaded = data_downloaded[:, column_idx]
+       if column_name in data_bunch.categories:
+           column_downloaded = np.array([_downcast(v) for v in column_downloaded])

        np.testing.assert_array_equal(
-           data_downloaded[:, i], decode_column(data_bunch, i)
+           column_downloaded, decode_column(data_bunch, column_idx)
        )


@@ -821,7 +825,7 @@ def test_fetch_openml_titanic_pandas(monkeypatch):
"boat": object,
"body": np.float64,
"home.dest": object,
-   "survived": CategoricalDtype(["0", "1"]),
+   "survived": CategoricalDtype([0, 1]),
}

frame_columns = [
19 changes: 16 additions & 3 deletions sklearn/externals/_arff.py
@@ -280,21 +280,34 @@ def _unquote(v):
return v


+def _downcast(value):
+    """Downcast a value to integral or float type if possible."""
+    if value is None:
+        return value
+    try:
+        return int(value)
+    except ValueError:
+        try:
+            return float(value)
+        except ValueError:
+            return value
@ogrisel (Member) commented on Feb 1, 2022:

    If we decide to pursue this PR anyway, please write a dedicated unittest
    for this helper function in isolation.

def _parse_values(s):
    '''(INTERNAL) Split a line into a list of values'''
    if not _RE_NONTRIVIAL_DATA.search(s):
        # Fast path for trivial cases (unfortunately we have to handle missing
        # values because of the empty string case :(.)
-       return [None if s in ('?', '') else s
+       return [None if s in ('?', '') else _downcast(s)
                for s in next(csv.reader([s]))]
@ogrisel (Member) commented:

    This is so confusing: s is reused as the comprehension variable name when
    it is already a local variable name of the function!

The author (@glemaitre) replied:

    Agreed, and do not look around too much :P

    # _RE_DENSE_VALUES tokenizes despite quoting, whitespace, etc.
    values, errors = zip(*_RE_DENSE_VALUES.findall(',' + s))
    if not any(errors):
-       return [_unquote(v) for v in values]
+       return [_downcast(_unquote(v)) for v in values]
    if _RE_SPARSE_LINE.match(s):
        try:
-           return {int(k): _unquote(v)
+           return {int(k): _downcast(_unquote(v))
                    for k, v in _RE_SPARSE_KEY_VALUES.findall(s)}
        except ValueError:
            # an ARFF syntax error in sparse data
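Putting the changed call sites together, the fast path of `_parse_values` can be sketched standalone. `parse_trivial_line` below is a hypothetical stand-in for that path (it omits the `_RE_NONTRIVIAL_DATA` regex guard present in the real code) and reuses a copy of the PR's `_downcast` helper:

```python
import csv


def _downcast(value):
    """Downcast a value to int or float if possible, else return it as-is."""
    if value is None:
        return value
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


def parse_trivial_line(line):
    """Fast path sketch: split a trivial ARFF data row on commas, map the
    missing-value tokens '?' and '' to None, and downcast everything else."""
    return [None if tok in ('?', '') else _downcast(tok)
            for tok in next(csv.reader([line]))]


print(parse_trivial_line("1,2.5,?,red"))  # [1, 2.5, None, 'red']
```

After this change, a row of a nominal attribute declared as `{0, 1}` yields Python ints rather than the strings "0" and "1", which is what lets the titanic test above expect `CategoricalDtype([0, 1])`.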