FIX downcast nominal features whenever possible in LIAC-ARFF #22354

glemaitre · 2022-02-01T17:03:02Z

This fix downcasts the values of nominal features to int or float whenever possible, instead of always storing them as strings.
It would be consistent with the behaviour of the future pandas parser implemented in #21938.

Indeed it is related to the following quirk: https://github.com/scikit-learn/scikit-learn/pull/21938/files#diff-c14ad6f3f0f87029a67f4cca75114754191e3d2db225bb895b30e4d48b662cf0R825-R838
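
For readers without the diff at hand, here is a minimal sketch of the per-value downcasting described above. The name `downcast` and the int-before-float order are assumptions made for illustration; the excerpt quoted in the review only shows the `float` fallback, so this is not necessarily the code in the PR.

```python
def downcast(value):
    """Hypothetical stand-in: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    # Neither conversion worked: the category stays a string.
    return value


# One parsed data row whose nominal values are still raw strings.
row = ["1", "2.5", "red"]
converted = [downcast(v) for v in row]
```

Under these assumptions, `converted` holds an int, a float, and the untouched string, because each value is converted independently of the others.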

ogrisel

I am not sure whether or not this is a good idea:

a) I suspect this can cause a large performance regression on datasets with categorical columns with string values;
b) the result would be very weird if a column has some values that can be converted to int, others to float, and others that would stay as str. In this case, I believe pandas would use a single type for all values of the same column.

To address a) and b), we could first parse the metadata to know which categorical columns can be consistently converted to int values and only attempt to convert those.
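
The metadata-first idea could be sketched roughly as follows. The function names are hypothetical, and the input is assumed to be a list of `(name, type)` pairs in which a nominal type is a list of category strings. This is an illustration of the suggestion, not code from the PR.

```python
def _is_int(text):
    """Return True if the string parses as a Python int."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def int_convertible_columns(attributes):
    """Return names of nominal attributes whose categories all parse as int.

    `attributes` is assumed to be a list of (name, type) pairs where a
    nominal type is a list of category strings.
    """
    selected = []
    for name, attr_type in attributes:
        if not isinstance(attr_type, list):
            continue  # not a nominal attribute (e.g. "NUMERIC")
        if attr_type and all(_is_int(c) for c in attr_type):
            selected.append(name)
    return selected


attributes = [
    ("grade", ["1", "2", "3"]),      # every category is an int
    ("mixed", ["1", "2.5", "x"]),    # inconsistent: left as strings
    ("color", ["red", "blue"]),      # plain string categories
    ("age", "NUMERIC"),              # not nominal
]
columns = int_convertible_columns(attributes)
```

Deciding once per column from the declared categories avoids a conversion attempt on every value of string-only columns, and a column with mixed categories keeps a single type.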

But that would not solve the issue of type inference for numerical columns with only int values. I don't think it is worth investing time in trying to fix LIAC-ARFF.

I think we can move forward with #21938 without changing the behavior of liac-arff as long as we properly document that the 2 parsers do not necessarily yield the same types for int/float values and categories.

ogrisel · 2022-02-01T17:16:06Z

sklearn/externals/_arff.py

+        try:
+            return float(value)
+        except ValueError:
+            return value


If we decide to pursue this PR anyway, please write a dedicated unit test for this helper function in isolation.

doc/whats_new/v1.1.rst

ogrisel · 2022-02-01T17:19:44Z

sklearn/externals/_arff.py

 def _parse_values(s):
    '''(INTERNAL) Split a line into a list of values'''
    if not _RE_NONTRIVIAL_DATA.search(s):
        # Fast path for trivial cases (unfortunately we have to handle missing
        # values because of the empty string case :(.)
-        return [None if s in ('?', '') else s
+        return [None if s in ('?', '') else _downcast(s)
                for s in next(csv.reader([s]))]


It is so confusing to reuse s as the comprehension variable name when it is already the name of a local variable of the function!

Agreed and do not look around too much :P
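
To illustrate the naming point, here is a simplified sketch of the fast path with distinct names for the input line and the comprehension variable. The function name `parse_trivial_values` is hypothetical; the sketch leaves out the `_RE_NONTRIVIAL_DATA` guard and the downcasting step, and is not a proposed patch.

```python
import csv


def parse_trivial_values(line):
    """Split a simple comma-separated data line.

    '?' and the empty string are mapped to None (missing value).
    """
    return [None if value in ("?", "") else value
            for value in next(csv.reader([line]))]


values = parse_trivial_values("1,?,red,")
```

With `line` for the argument and `value` for each field, the comprehension no longer shadows the function parameter.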

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel · 2022-02-11T17:07:38Z

@glemaitre agreed we can close IRL.

FIX downcast nominal features whenver possible when reading ARFF

7a87ade

github-actions bot added the module:datasets label Feb 1, 2022

glemaitre added 2 commits February 1, 2022 18:04

add changelog

e9c07d7

revert black on _arff.py

60c5817

glemaitre changed the title ~~FIX downcast nominal features whenver possible when reading ARFF~~ FIX downcast nominal features whenever possible in LIAC-ARFF Feb 1, 2022

ogrisel reviewed Feb 1, 2022


Update doc/whats_new/v1.1.rst

bc635f5

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre mentioned this pull request Feb 9, 2022

ENH improve ARFF parser using pandas #21938

Merged

ogrisel closed this Feb 11, 2022
