Skip to content

MAINT refactor ARFF parser #22026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 21, 2021
Merged

Conversation

glemaitre
Copy link
Member

Refactoring ARFF parser in order to introduce pandas parser.
Preparation of #21938

This PR is only moving functions aside and slightly modifying the signature of one of them.

@glemaitre
Copy link
Member Author

@thomasjpfan did you have this refactoring in mind?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not involved in the initial implementation of the ARFF parser in the OpenMP fetcher from scikit-learn and I am a bit confused by the name liac_arff_parser:

  • do we vendor code that was originally developed outside of scikit-learn (e.g. https://github.com/renatopp/liac-arff)? If so the functions that are derivatives from outside code should mention it in their docstrings or at the top of the module, including the original author name (or copyright holder name) and the license.
  • do we plan to resync this code with upstream from time to time in the future? Probably not because OpenML is moving towards parquet as a replacement for ARFF so I suspect that we won't have to deal with long term maintenance of the ARFF parser.

@ogrisel
Copy link
Member

ogrisel commented Dec 20, 2021

Beyond the point of making the relationship between upstream code and derivative code explicit, I am fine with the idea of moving ARFF parsing code in a dedicated private module. So +1 once the above comment is addressed.

@glemaitre
Copy link
Member Author

do we vendor code that was originally developed outside of scikit-learn (e.g. https://github.com/renatopp/liac-arff)? If so the functions that are derivatives from outside code should mention it in their docstrings or at the top of the module, including the original author name (or copyright holder name) and the license.

Yes, we do vendor in the sklearn.externals._arff.py. This file has all the info regarding the external package LIAC-ARFF.
Here, the change is just a wrapper that calls this ARFF parser with some additional work on the output to make it consistent with our fetchers.

do we plan to resync this code with upstream from time to time in the future?

Basically, the development in the original repository is not super active: https://github.com/renatopp/liac-arff
Indeed, it was people from OpenML and scikit-learn that did the latest maintenance. So in short, our current version in scikit-learn is synchronized.

Probably not because OpenML is moving towards parquet as a replacement for ARFF so I suspect that we won't have to deal with long term maintenance of the ARFF parser.

I open the following issue to ask a rough timeline regarding the landing of parquet format in OpenML: openml/OpenML#1133
Indeed, if parquet is landing all at once (for all datasets) then the ARFF parser would be obsolete.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. LGTM.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @glemaitre !

LGTM

@thomasjpfan thomasjpfan merged commit 826032f into scikit-learn:main Dec 21, 2021
@thomasjpfan
Copy link
Member

Merging because the CI failure is unrelated to this PR.

venkyyuvy pushed a commit to venkyyuvy/scikit-learn that referenced this pull request Jan 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants