ENH allows to overwrite read_csv parameter in fetch_openml #25488

Closed

glemaitre
Member

Allows overriding the parameters passed to read_csv when reading a dataframe.
It is not intended to be used widely, but it could be useful when things go sideways.
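
A minimal sketch of the idea, not the implementation in this PR: internally chosen defaults are merged with user-supplied overrides, and the user's values win. The helper name `merge_read_csv_kwargs` and the example override are hypothetical; only `header` and `skipinitialspace` are taken from the diff under review, which sets more parameters than shown here.

```python
def merge_read_csv_kwargs(defaults, overrides=None):
    """Return a new dict in which user overrides take precedence over defaults."""
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(overrides or {})
    return merged


# Defaults that appear in this PR's diff; the real code sets more parameters.
default_kwargs = {"header": None, "skipinitialspace": True}

# Hypothetical user override, e.g. to work around a parsing problem.
user_kwargs = {"skipinitialspace": False, "on_bad_lines": "skip"}

read_csv_kwargs = merge_read_csv_kwargs(default_kwargs, user_kwargs)
print(read_csv_kwargs)
```

The merged dictionary would then be unpacked into the reader call, as in `pd.read_csv(gzip_file, **read_csv_kwargs)`.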

@glemaitre glemaitre marked this pull request as draft January 26, 2023 15:39
@glemaitre glemaitre marked this pull request as ready for review January 27, 2023 13:01
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR! I am okay with adding this option.

the default options. Internally, we used the default parameters of
:func:`pandas.read_csv` except for the following parameters:

- `header`: set to `None`
Member

Should this be in fetch_openml as part of the public API?

dtype=dtypes,
skipinitialspace=True, # skip spaces after delimiter to follow ARFF specs
)
frame = pd.read_csv(gzip_file, **read_csv_kwargs)
Member

Currently, if there is an exception while reading the data, one would need to enter a debugger to find out where the file is and what the read_csv_kwargs are. I think it would be helpful to re-raise an exception that includes the read_csv_kwargs and gzip_file in its message, to help with debugging the issue.
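
One possible shape for this, as a sketch only: wrap the reader call and re-raise with the source and keyword arguments in the message. The helper `read_with_context` and the stand-in `failing_reader` are hypothetical. The reader is passed as a callable so the snippet runs without pandas; in the code under review it would be `pd.read_csv`, and the source would be `gzip_file`.

```python
def read_with_context(reader, source, **read_csv_kwargs):
    """Call reader(source, **read_csv_kwargs); on failure, re-raise with context."""
    try:
        return reader(source, **read_csv_kwargs)
    except Exception as exc:
        # Chain the original exception so its traceback is preserved.
        raise ValueError(
            f"Failed to read {source!r} with read_csv_kwargs={read_csv_kwargs!r}"
        ) from exc


def failing_reader(source, **kwargs):
    # Stand-in for pd.read_csv raising a parser error.
    raise RuntimeError("simulated parser error")


try:
    read_with_context(failing_reader, "data.csv.gz", header=None)
except ValueError as err:
    message = str(err)
    cause = err.__cause__
    print(message)
```

Chaining with `from exc` keeps the original parser error visible in the traceback, so the extra context does not hide the underlying failure.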

@glemaitre
Member Author

I will close this one. Let's keep in mind that it exists in case we really need more flexibility to tweak the parameters in the future.

@glemaitre
Member Author

OK, reopening this one. It seems that we will need it if we want to handle a read_csv breaking change between pandas 1.X and 2.X ourselves: #25878 (comment)
