-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT refactor ARFF parser #22026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT refactor ARFF parser #22026
Conversation
@thomasjpfan did you have this refactoring in mind? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not involved in the initial implementation of the ARFF parser in the OpenMP fetcher from scikit-learn and I am a bit confused by the name liac_arff_parser
:
- do we vendor code that was originally developed outside of scikit-learn (e.g. https://github.com/renatopp/liac-arff)? If so the functions that are derivatives from outside code should mention it in their docstrings or at the top of the module, including the original author name (or copyright holder name) and the license.
- do we plan to resync this code with upstream from time to time in the future? Probably not because OpenML is moving towards parquet as a replacement for ARFF so I suspect that we won't have to deal with long term maintenance of the ARFF parser.
Beyond the point of making the relationship between upstream code and derivative code explicit, I am fine with the idea of moving ARFF parsing code in a dedicated private module. So +1 once the above comment is addressed. |
Yes, we do vendor in the
Basically, the development in the original repository is not super active: https://github.com/renatopp/liac-arff
I open the following issue to ask a rough timeline regarding the landing of parquet format in OpenML: openml/OpenML#1133 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @glemaitre !
LGTM
Merging because the CI failure is unrelated to this PR. |
Refactoring ARFF parser in order to introduce pandas parser.
Preparation of #21938
This PR is only moving functions aside and slightly modifying the signature of one of them.