Investigate separating test data from repository #5329
This discussion could also benefit Basemap as its package data is fairly On Mon, Oct 26, 2015 at 6:14 PM, Michael Droettboom <
This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!
matplotlib includes its test data for image comparison tests in the git repository. Current HEAD is about 131MB of test data uncompressed. Not sure what the whole history of that data is, but it's a safe bet it's a significant fraction of the git repository.
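A rough way to check the size figure against a local checkout is to sum the file sizes under the image-comparison directories. This is only a sketch: the directory name `baseline_images` and both function names are assumptions made for illustration, and the result measures the working tree only, not the git history.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that linked files are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def image_data_size_bytes(repo_root, dirname="baseline_images"):
    """Sum the sizes of every directory called dirname below repo_root.

    The default dirname is an assumption about where the test images live.
    """
    total = 0
    for dirpath, dirnames, _filenames in os.walk(repo_root):
        if dirname in dirnames:
            total += directory_size_bytes(os.path.join(dirpath, dirname))
            # Already counted, so do not descend into it again.
            dirnames.remove(dirname)
    return total
```

Calling `image_data_size_bytes(".") / 1e6` from the top of a checkout would give the uncompressed size in MB under those assumptions.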
There are some real advantages to this approach: the test data and the version of matplotlib it corresponds to are easily synchronized by being in the same repo. The downside, of course, is the size of the repo.
There are a few alternative solutions I've been investigating, none of which seem to be the perfect answer, so I thought I'd open this up to a wider discussion.
- `git submodule`: The test data would move to another repo (call it the `tests` repo), and the main repo has a special kind of symbolic link that points to a specific revision in the `tests` repo. The `tests` repo is not cloned unless specifically asked for (`git submodule update`). The downside of `git submodule` is that a PR that requires both updating functionality in matplotlib and updating test data would have to be split into two PRs, one for each repo, and coordinated very carefully. The link in the matplotlib repo cannot point to a revision in the fork of the `tests` repo, so it will fail until the PR for the `tests` repo is merged. In short: `git submodule` is awfully close to what we need, but it doesn't interact very well with the GitHub PR workflow.
- `git subtree`: Seems to avoid the extreme separation of repos that `git submodule` imposes, and merges can take place involving both repos. However, it doesn't solve the problem of only cloning the test data if requested -- `git subtree`s are always deeply cloned. Additionally, `git submodule` seems more appropriate if the two repos are separate projects usable on their own. I don't think that's the case here.
- `git annex`: Allows checking special links into the git repo instead of files. The files these links refer to can then be fetched or cleared as requested. The actual file contents can live in a number of places, like a WebDAV server, or another git repo (which probably makes the most sense for us, to use free GitHub hosting). `git annex` is a cool but fairly complex tool, and I think it's the closest to what we need.

Of course, none of this impacts how we distribute matplotlib, and more and more of our packages for end users just don't include the tests, and this is easy enough to do. So given the added complexity of all of the options above vs. the bandwidth and data costs of the status quo, I'm not sure it's obvious we should do anything. But, as I said, there might be some good solutions that come out of discussion.
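To make the `git annex` option above a little more concrete, here is a toy model of the general idea: the working tree holds a small pointer in place of each large file, and the real content lives in a separate store from which it can be fetched or dropped on request. This is not how `git annex` is implemented and it uses none of its commands. The pointer format and the function names `add`, `fetch`, `drop` and `is_pointer` are all made up for illustration.

```python
import hashlib
import os
import shutil

# Hypothetical pointer format, invented for this sketch.
POINTER_PREFIX = "pointer:sha256:"


def _sha256(path):
    """Return the hex SHA-256 digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def _write_pointer(path, digest):
    with open(path, "w") as f:
        f.write(POINTER_PREFIX + digest + "\n")


def is_pointer(path):
    """True if path holds a pointer rather than real content."""
    with open(path, "rb") as f:
        return f.read(len(POINTER_PREFIX)) == POINTER_PREFIX.encode()


def add(path, store):
    """Move the file's content into store and leave a pointer behind."""
    digest = _sha256(path)
    os.makedirs(store, exist_ok=True)
    target = os.path.join(store, digest)
    if os.path.exists(target):
        os.remove(path)  # identical content is already stored
    else:
        shutil.move(path, target)
    _write_pointer(path, digest)
    return digest


def fetch(path, store):
    """Replace a pointer with the real content copied from store."""
    if not is_pointer(path):
        return  # content is already present locally
    with open(path) as f:
        digest = f.read().strip()[len(POINTER_PREFIX):]
    shutil.copyfile(os.path.join(store, digest), path)


def drop(path, store):
    """Replace local content with a pointer, if store still has a copy."""
    if is_pointer(path):
        return
    digest = _sha256(path)
    if not os.path.exists(os.path.join(store, digest)):
        raise FileNotFoundError("refusing to drop: no copy in store")
    _write_pointer(path, digest)
```

In this model only the small pointer files would be committed to the main repo, so a plain clone stays small and the image data is pulled in only by whoever needs to run the image comparison tests.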