-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX downcast nominal features whenever possible in LIAC-ARFF #22354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether or not this is a good idea:
- a) I suspect this can cause a large performance regression datasets with categorical columns with string values
- b) the result would be very weird if a column has some values that can be converted to int and other to float and other that would stay as
str
: in this case I believe pandas would use a single type for all values of the same column.
To address a) and b) we could first parse the metadata to know which categorical column can be consistently converted to int values and only attempt to cover those.
But that would not solve the issue of type inference for numerical columns with only int values. I don't think it worse investing time in trying to fix LIAC-ARFF.
I think we can move forward with #21938 without changing the behavior of liac-arff as long as we properly document that the 2 parsers do not necessarily yield the same types for int/float values and categories.
try: | ||
return float(value) | ||
except ValueError: | ||
return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to pursue this PR anyway, please write a dedicate unittest for this helper function in isolation.
def _parse_values(s): | ||
'''(INTERNAL) Split a line into a list of values''' | ||
if not _RE_NONTRIVIAL_DATA.search(s): | ||
# Fast path for trivial cases (unfortunately we have to handle missing | ||
# values because of the empty string case :(.) | ||
return [None if s in ('?', '') else s | ||
return [None if s in ('?', '') else _downcast(s) | ||
for s in next(csv.reader([s]))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so confusing to reuse s
as the comprehension variable name as it's already a local variable name of the function!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed and do not look around too much :P
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@glemaitre agreed we can close IRL. |
This is a fix such that we downcast to
int
orfloat
values of nominal features instead of storing them always as strings.It would be consistent with the behaviour of the future
pandas
parser implemented in #21938Indeed it is related to the following quirk: https://github.com/scikit-learn/scikit-learn/pull/21938/files#diff-c14ad6f3f0f87029a67f4cca75114754191e3d2db225bb895b30e4d48b662cf0R825-R838