Investigate separating test data from repository #5329

Open
mdboom opened this issue Oct 26, 2015 · 2 comments
Labels
keep (Items to be ignored by the “Stale” Github Action) · topic: testing

Comments

@mdboom
Member

mdboom commented Oct 26, 2015

matplotlib includes its test data for image comparison tests in the git repository. Current HEAD is about 131MB of test data uncompressed. Not sure what the whole history of that data is, but it's a safe bet it's a significant fraction of the git repository.
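A figure like the 131MB above can be rechecked with a few lines of Python. This is only a sketch: the helper names `tree_size_bytes` and `report` are made up for illustration, and the directory to measure (for example `lib/matplotlib/tests/baseline_images`) is an assumption about where the comparison images live in a checkout. It measures the working tree only, not the git history.

```python
import os
from pathlib import Path


def tree_size_bytes(root):
    """Sum the sizes of all regular files under root (uncompressed, on disk)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so linked content is not counted.
            if path.is_file() and not path.is_symlink():
                total += path.stat().st_size
    return total


def report(root):
    """Format the total as decimal megabytes (1 MB = 10**6 bytes)."""
    size_mb = tree_size_bytes(root) / 1e6
    return f"{root}: {size_mb:.1f} MB"
```

Calling `report(...)` on the test data directory of a checkout gives the uncompressed size at that revision. How much the data contributes to the full repository history would need git's own tooling and is not covered here.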

There are some real advantages to this approach: The test data and the version of matplotlib they correspond to are easily synchronized by being in the same repo. The downside, of course, is the size of the repo.

There are a few alternative solutions I've been investigating, none of which seem to be the perfect answer, so I thought I'd open this up to a wider discussion.

git submodule: The test data would move to another repo (call it the tests repo), and the main repo would hold a special kind of symbolic link that points to a specific revision in the tests repo. The tests repo is not cloned unless specifically asked for (git submodule update). The downside of git submodule is that a PR that updates both functionality in matplotlib and test data would have to be split into two PRs, one for each repo, and coordinated very carefully. The link in the matplotlib repo cannot point to a revision in a fork of the tests repo, so the matplotlib PR will fail until the PR for the tests repo is merged. In short: git submodule is awfully close to what we need, but it doesn't interact very well with the GitHub PR workflow.
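The coordination problem can be made concrete with a toy model. This is not git itself: `SubmodulePin`, `resolves`, the URLs, and the commit ids are all hypothetical, and the model simply assumes, as described above, that the main repo records one URL plus one exact revision for the tests repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmodulePin:
    url: str     # the single URL the main repo records for the tests repo
    commit: str  # the exact revision the main repo expects to find there


def resolves(pin, commits_by_remote):
    """True if the pinned commit is available from the pinned URL."""
    return pin.commit in commits_by_remote.get(pin.url, set())


# Hypothetical remotes: the contributor's fork holds a new data commit
# that the upstream tests repo does not have yet.
UPSTREAM = "https://example.invalid/upstream/tests-data"
FORK = "https://example.invalid/contributor/tests-data"
remotes = {
    UPSTREAM: {"aaa111"},
    FORK: {"aaa111", "bbb222"},
}

# The matplotlib PR pins the new data commit but still uses the upstream URL.
pr_pin = SubmodulePin(url=UPSTREAM, commit="bbb222")
print(resolves(pr_pin, remotes))  # False: the commit lives only in the fork

# Once the tests-repo PR is merged, the same pin resolves.
remotes[UPSTREAM].add("bbb222")
print(resolves(pr_pin, remotes))  # True
```

The ordering constraint is the point: the data PR has to land first, and only then can the code PR pass.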

git subtree: Seems to avoid the extreme separation of repos that git submodule imposes, and merges can take place involving both repos. However, it doesn't solve the problem of only cloning the test data if requested -- git subtrees are always deeply cloned. Additionally, git submodule seems more appropriate if the two repos are separate projects usable on their own. I don't think that's the case here.

git annex: Lets you check special links into the git repo instead of the files themselves. The files these links refer to can then be fetched or cleared as requested. The actual file contents can live in a number of places, like a WebDAV server, or another git repo (which probably makes the most sense for us, to use free GitHub hosting). git annex is a cool, if fairly complex, tool, and I think it's the closest to what we need.
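The idea of checking in links and fetching or clearing the content on demand can be sketched as a small content-addressed store. This is a toy illustration, not git-annex's actual on-disk format or command set: the `pointer:` marker, the `store` directory, and the functions `drop`, `fetch`, and `read_key` are all invented for this example.

```python
import hashlib
from pathlib import Path

POINTER_PREFIX = b"pointer:"  # hypothetical marker, not git-annex's real format


def read_key(path: Path):
    """Return the content key if path holds a pointer, else None."""
    raw = path.read_bytes()
    if not raw.startswith(POINTER_PREFIX):
        return None
    return raw[len(POINTER_PREFIX):].strip().decode("ascii")


def drop(path: Path, store: Path) -> str:
    """Make sure store holds the content, then leave only a pointer behind."""
    key = read_key(path)
    if key is not None:
        return key  # already a pointer, nothing to do
    data = path.read_bytes()
    key = hashlib.sha256(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    target = store / key
    if not target.exists():
        target.write_bytes(data)
    path.write_bytes(POINTER_PREFIX + key.encode("ascii") + b"\n")
    return key


def fetch(path: Path, store: Path) -> None:
    """Replace a pointer with the real content, verifying its hash."""
    key = read_key(path)
    if key is None:
        return  # content is already present
    data = (store / key).read_bytes()
    if hashlib.sha256(data).hexdigest() != key:
        raise ValueError(f"content for {key} is corrupted")
    path.write_bytes(data)
```

In this model only the small pointer files would be committed to the main repo, while `store` stands in for wherever the real contents live (a server or a second repo). A clone stays small until someone calls `fetch` on the images they need.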

Of course, none of this impacts how we distribute matplotlib, and more and more of our packages for end users just don't include the tests, and this is easy enough to do. So given the added complexity of all of the options above vs. the bandwidth and data costs of the status quo, I'm not sure it's obvious we should do anything. But, as I said, there might be some good solutions that come out of discussion.

@WeatherGod
Member

This discussion could also benefit Basemap as its package data is fairly
substantial and rarely updated.


@tacaswell tacaswell added this to the unassigned milestone Oct 28, 2015
@story645 story645 modified the milestones: unassigned, needs sorting Oct 6, 2022
@github-actions

github-actions bot commented Oct 9, 2023

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Oct 9, 2023
@github-actions github-actions bot added the status: closed as inactive Issues closed by the "Stale" Github Action. Please comment on any you think should still be open. label Nov 8, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 8, 2023
@story645 story645 added topic: testing keep Items to be ignored by the “Stale” Github Action and removed status: inactive Marked by the “Stale” Github Action status: closed as inactive Issues closed by the "Stale" Github Action. Please comment on any you think should still be open. labels Nov 8, 2023
@story645 story645 reopened this Nov 8, 2023