Skip to content

FIX downcast nominal features whenever possible in LIAC-ARFF #22354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Feb 1, 2022

This is a fix such that we downcast to int or float values of nominal features instead of storing them always as strings.
It would be consistent with the behaviour of the future pandas parser implemented in #21938

Indeed it is related to the following quirk: https://github.com/scikit-learn/scikit-learn/pull/21938/files#diff-c14ad6f3f0f87029a67f4cca75114754191e3d2db225bb895b30e4d48b662cf0R825-R838

@glemaitre glemaitre changed the title FIX downcast nominal features whenver possible when reading ARFF FIX downcast nominal features whenever possible in LIAC-ARFF Feb 1, 2022
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether or not this is a good idea:

  • a) I suspect this can cause a large performance regression datasets with categorical columns with string values
  • b) the result would be very weird if a column has some values that can be converted to int and other to float and other that would stay as str: in this case I believe pandas would use a single type for all values of the same column.

To address a) and b) we could first parse the metadata to know which categorical column can be consistently converted to int values and only attempt to cover those.

But that would not solve the issue of type inference for numerical columns with only int values. I don't think it worse investing time in trying to fix LIAC-ARFF.

I think we can move forward with #21938 without changing the behavior of liac-arff as long as we properly document that the 2 parsers do not necessarily yield the same types for int/float values and categories.

try:
return float(value)
except ValueError:
return value
Copy link
Member

@ogrisel ogrisel Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to pursue this PR anyway, please write a dedicate unittest for this helper function in isolation.

def _parse_values(s):
'''(INTERNAL) Split a line into a list of values'''
if not _RE_NONTRIVIAL_DATA.search(s):
# Fast path for trivial cases (unfortunately we have to handle missing
# values because of the empty string case :(.)
return [None if s in ('?', '') else s
return [None if s in ('?', '') else _downcast(s)
for s in next(csv.reader([s]))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so confusing to reuse s as the comprehension variable name as it's already a local variable name of the function!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and do not look around too much :P

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel
Copy link
Member

ogrisel commented Feb 11, 2022

@glemaitre agreed we can close IRL.

@ogrisel ogrisel closed this Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants