parallelize_tests #1951

Merged: 18 commits from the parallelize_tests branch into matplotlib:master on May 17, 2013
Conversation

@mdboom (Member) commented Apr 26, 2013

This is a possible solution to #1508 to get the test suite running in parallel.

The main fix was to stop using a single long-lived gs process for the conversions and instead launch a new gs process for each one. This makes parallelizing much easier, and the long-lived process was never much of an optimization anyway. (Alternatively, we could keep a single gs process when not parallelized, but I couldn't figure out how to detect in nose which mode we're in.)
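
Roughly, the per-file conversion now amounts to the sketch below; the helper name and the Ghostscript flags are illustrative, not the exact code in matplotlib.testing.compare:

import subprocess

def pdf_to_png(pdf_path, png_path, dpi=100):
    # Launch a fresh Ghostscript process for each conversion instead of
    # feeding filenames to one long-lived gs process.  Slightly slower per
    # file, but every test worker process is fully independent.
    cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
           "-sDEVICE=png16m", "-r%d" % dpi,
           "-sOutputFile=%s" % png_path, pdf_path]
    subprocess.check_call(cmd)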

There was also a bug where the mathtext tests were being generated with function names different from the names they were bound to in the module's namespace, which broke pickling of those functions through multiprocessing.
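
The constraint is that multiprocessing pickles a test function by its module and name, so a generated function has to be stored in the module globals under the same name as its __name__. A minimal sketch of that pattern (make_test and the test names here are hypothetical, not the actual mathtext test generator):

import sys

def make_test(expr, index):
    def test():
        # The real tests render 'expr' with mathtext and compare images.
        assert expr
    # __name__ and the attribute the function is stored under must agree;
    # otherwise the worker processes cannot look the function up by name
    # when unpickling and the parallel run fails.
    test.__name__ = 'test_mathtext_%d' % index
    return test

_module = sys.modules[__name__]
for _i, _expr in enumerate([r'$x^2$', r'$\alpha + \beta$']):
    _func = make_test(_expr, _i)
    setattr(_module, _func.__name__, _func)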

On my 4-core i7 machine, the wall-clock time dropped from 10:32 to 3:05 when running on all 4 cores. Pretty nice!

This also adds more tests to the default suite run by Travis.

@mdboom (Member Author) commented Apr 29, 2013

Does anyone have any thoughts about the test failures? I'm stumped, and I can't reproduce them on a Ubuntu Precise VM I have (which apparently matches what Travis has).

@WeatherGod (Member) commented:

Why is the testing/util.py file deleted?

@mdboom (Member Author) commented Apr 29, 2013

testing/util.py only contained one class, a version of expect used to communicate with a long-lived gs process -- you feed it PDF filenames and it writes out PNG files. Since that approach doesn't work with multiprocessing (each worker process would need its own Ghostscript process), I just deleted it. It could probably be made to work by being more clever about it, but it's a very small optimization over just creating a new gs process for each image.

@pelson (Member) commented Apr 30, 2013

Does anyone have any thoughts about the test failures?

I can't see travis-ci test results anymore (this has been true for some time now) -- my browser crashes and burns when it sees the long log (Firefox 17.0.5). That renders the travis-ci system into a pretty useless binary pass/fail indicator for me. Can we look at making the logs shorter again? (I have a "print if fails" executable in cartopy to hide the build output unless something went wrong; for example: https://travis-ci.org/SciTools/cartopy/builds/6686231.) I can't see beyond the build output as it stands.

@pelson (Member) commented Apr 30, 2013

In principle this looks good. I'm guessing you're adding the necessary multiprocess arguments to your python tests.py call since there is no update to that file?

I also can't see what has changed that would make travis-ci use multiprocessing - I would have expected an update to .travis.yml

@mdboom (Member Author) commented Apr 30, 2013

@pelson: I haven't had any trouble getting the logs from Travis lately. They recently did make the log fetching more chunked which has helped a lot. But I could capitulate and hide setup parts of the log if there's no other way.

In any event, for this specific PR, you can get the raw log here:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/6665520/log.txt

As for how to run this with multiprocess -- tests.py merely passes its arguments along to nosetests verbatim, so it's still up to the user to pass --processes=-1 to run the tests in parallel. I think that's the right thing, because we don't want to override any of the nose defaults and surprise people.
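
For context, the forwarding is essentially the following simplified sketch (the real tests.py also passes defaultTest=default_test_modules and does some extra setup):

import sys
import nose

def run():
    # Hand everything on the command line (e.g. --processes=-1,
    # --process-timeout=300, -sv) straight through to nose.
    nose.main(argv=sys.argv)

if __name__ == '__main__':
    run()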

I had originally added --processes=-1 to the .travis.yml, but then took it out when the tests failed, to see if that would resolve it. It didn't (the results are identical). So the failures are not the result of multiprocessing but of some other change here, which makes it all the more puzzling.

@ghost assigned mdboom Apr 30, 2013
@mdboom (Member Author) commented Apr 30, 2013

Ok -- I've got this mostly working. The initial failures were due to still importing matplotlib.testing.util after it had been removed. :rage1:

It still seems, however, that running the tests in parallel on Travis doesn't work:

The command "cd ../tmp_test_dir" exited with 0.
$ python ../matplotlib/tests.py -sv --processes=-1 --process-timeout=300
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

I'm perfectly fine with merging this as is, because it's still very useful for running tests locally. I'm not entirely sure that Travis would give us access to more than one core per VM anyway -- I find the Travis docs completely unwieldy, so I can't confirm or deny that, but it wouldn't surprise me if we only get one.

@pelson (Member) commented May 1, 2013

FWIW you can get at the log output in the form: https://api.travis-ci.org/jobs/{build_id}/log.txt?deansi=true

I was able to reproduce the freezing that travis-ci saw -- it was related to providing a negative value to the processes argument, so instead I went for the less elegant --processes=$(nproc) approach.

After fixing that, I was able to get the tests to run in ~91 seconds with just 3 failures: https://api.travis-ci.org/jobs/6781057/log.txt?deansi=true . I think it might be worth tracking these down (if it's not too thorny). (My extra commits are in my copy of your branch: https://github.com/pelson/matplotlib/tree/multiprocess_testing)

@mdboom (Member Author) commented May 1, 2013

@pelson: Thanks for getting to the bottom of the hanging. nosetests uses multiprocessing.cpu_count under the hood, which in turn uses num = os.sysconf('SC_NPROCESSORS_ONLN'). I guess that system call never returns?

In any case, I think your solution is just fine. I think it also makes sense to log the value of $(nproc) for informational purposes.
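
For reference, this is roughly where the number comes from when nose picks it automatically; the fallback below is only an illustration and would not help with an outright hang inside the sysconf call:

import multiprocessing

def worker_count(default=1):
    # multiprocessing.cpu_count() reads os.sysconf('SC_NPROCESSORS_ONLN')
    # on Linux and raises NotImplementedError if the count is unavailable.
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return default

print('Running tests with %d processes' % worker_count())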

@mdboom (Member Author) commented May 1, 2013

Ok -- it seems we're down to one failure: test_bbox_inches_tight_suptile_legend. The image is a different size than expected.

@mdboom (Member Author) commented May 1, 2013

With the latest commit (which should not have affected test_bbox_inches_tight_suptile_legend), that test is now passing, so there's obviously some kind of race condition with that test. (That doesn't surprise me; it's always been a bit flaky.) Not sure how to tackle that, though.

I wonder if the inkscape failure may be due to some race condition there -- since it's now possible that many inkscape processes will be launched at the same time, I wouldn't be surprised if there is greater memory pressure, etc.

@mdboom (Member Author) commented May 8, 2013

I think I finally have something that works on Travis (excepting the usual Travis network errors). Any volunteers to do a little more local testing before merging?

@WeatherGod (Member) commented:

Sure, I'll give it a spin.

@WeatherGod (Member) commented:

Uhm, I am getting an import error saying that pyparsing >= 1.5.6 is required, but I have 1.5.7 installed.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: Is the import error on building, or importing once installed? Are you certain you're in the same virtualenv/version of Python where you have pyparsing installed?

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: also -- how are you running the tests? If from nosetests at the command line, I have personally run into issues before where the nosetests command uses a different Python from the one I intended.

@WeatherGod (Member) commented:

It is an error on import. See the following output (where I am using a virtualenv called "centos6"):

[centos6] [broot@rd22 matplotlib]$ python
Python 2.7.3 (default, May 29 2012, 13:36:18)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/nas/home/broot/centos6/lib/python2.7/site-packages/matplotlib-1.3.x-py2.7-linux-x86_64.egg/matplotlib/__init__.py",
line 121, in <module>
    '.'.join(str(x) for x in _required)))
ImportError: matplotlib requires pyparsing >= 1.5.6
>>> import pyparsing
>>> pyparsing.__version__
'1.5.7'

@mdboom (Member Author) commented May 8, 2013

Odd indeed. Here's a wild guess: do you have a pyparsing.py file sitting around from an old matplotlib installation in /nas/home/broot/centos6/lib/python2.7/site-packages/matplotlib-1.3.x-py2.7-linux-x86_64.egg/matplotlib/? matplotlib is still using the old import semantics (it doesn't do from __future__ import absolute_import), so if there were a pyparsing.py in that directory it would take precedence over the globally installed one. If that's the case, try cleaning the git directory with git clean -fxd and reinstalling.
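
A quick way to check which copy is actually being imported (plain introspection, nothing matplotlib-specific):

import pyparsing
print(pyparsing.__file__)     # should point into site-packages, not the matplotlib egg
print(pyparsing.__version__)  # should report 1.5.7 here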

@WeatherGod (Member) commented:

Oh, wow, I didn't think of that. I didn't think to check the egg directory. Indeed, I did have that. I blasted away all of that and I am rebuilding now.

As a side note, I notice that setup.py now does a download, if needed, during the build/install/test process. Wouldn't the Debian people have a problem with that? I vaguely recall them raising an issue with it when our tests used to download the Yahoo stock data.

Review comment on the .travis.yml changes:

- python setup.py install

script:
- mkdir ../tmp_test_dir
- cd ../tmp_test_dir
- python ../matplotlib/tests.py -sv
- echo Testing using 4 processes

(Member) The echo statement here doesn't match what is passed to the command line.

@mdboom (Member Author) Indeed. Thanks for catching.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: since setuptools is doing the downloading, my understanding is that it's ok, since the Debian build environment prevents the download from happening. They basically have a generic way to handle all setuptools-based builds.

@WeatherGod (Member) commented:

So, even with two processes and 4 GB of RAM, I am running into issues where the processes use so much memory (~3 GB each) that my system starts swapping like crazy. And it seems like it isn't making any further progress even though my processors stay pegged (whenever an I/O wait finishes, that is). I can do a simple matplotlib.test() run with no problems, so I don't know why doing it with two processes is so much worse.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: Thanks. That's a useful data point. I'm not sure what to investigate further. For me, on Fedora 18 (which shouldn't be fundamentally different from your CentOS box), with 4 GB and 4 cores, I'm not seeing runaway memory usage. This is also on Python 2.7. What do you get for:

› gs --version
9.06
› inkscape --version
Inkscape 0.48.4 r9939 (Dec 18 2012)

@WeatherGod (Member) commented:

[centos6] [broot@rd22 ~]$ gs --version
8.70
[centos6] [broot@rd22 ~]$ inkscape --version
inkscape: Command not found.

I can do tests tonight on a Ubuntu 12.04 machine that is quite a bit beefier, and I have full control over the packages installed on it.

@WeatherGod (Member) commented:

So, on my Ubuntu 12.04 machine, setting it to 2 processes, it just hangs immediately after completing the first test. But, if I run it in single process mode, it works just fine.

ben@tigger:~$ gs --version
9.05
ben@tigger:~$ inkscape --version
Inkscape 0.48.3.1 r9886 (Jan 29 2013)

@mdboom (Member Author) commented May 9, 2013

Ok. Interesting -- I guess this is farther from ready than I thought. The Travis tests are running in a Ubuntu 12.04 VM, if I recall correctly, so I'm surprised it works there and not for you. We did see hanging initially when trying to use multiprocessing.cpu_count() to get the number of cores, but if you're explicitly specifying 2 cores you aren't just hitting that problem. Thanks for trying -- I'll have to go back to the drawing board (perhaps try this on a Ubuntu VM myself) and see if I can come up with any good question to ask...

@pelson: Have you tried running the tests locally, or only on Travis?

@mdboom (Member Author) commented May 15, 2013

I'm still unable to reproduce @WeatherGod's issue (I tried on a clean Ubuntu 12.04 VM). Very strange.

How do we all feel about this PR? Personally, I'd love to have it in (since it works for me -- sorry to be selfish), and it shouldn't be any worse for anyone not running the tests in parallel.

Travis seems to like running the tests in parallel as well, so I'm leaning toward turning it on in .travis.yml, but that of course is also optional.

@WeatherGod (Member) commented:

Perhaps we should double-check our basic assumptions. Exactly what command did you use to run the parallelized tests?

@mdboom (Member Author) commented May 16, 2013

After installing matplotlib, from a temporary directory, I run:

$PATH_TO_MATPLOTLIB_SOURCE/tests.py --processes=-1 --process-timeout=300

@pelson (Member) commented May 17, 2013

This is absolutely fine to be merged. I too have memory problems (Intel® Xeon® E5520 with 8 threads and 5.7 GB addressable RAM on 64-bit RHEL 6, Python 2.7), which mean I cannot run the tests in parallel (even with --processes=1), but this PR represents an improvement (and the Travis tests are a lot quicker). So 👍 for v1.3.x.

@pelson (Member) commented May 17, 2013

For the record, running with the -sv flags, my machine froze (and was terminated with Ctrl+C) at:

$> python tests.py  --processes=1 --process-timeout=300 -sv
...

test_line_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_for_non_affine_transData (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok

^C
Traceback (most recent call last):
  File "tests.py", line 20, in <module>
    run()
  File "tests.py", line 17, in run
    defaultTest=default_test_modules)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 118, in __init__
    **extra_args)
  File "lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 197, in runTests
    result = self.testRunner.run(self.test)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 349, in run
    timeout=nexttimeout)
  File "<string>", line 2, in get
  File "lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
KeyboardInterrupt
Process Process-2:
Traceback (most recent call last):
  File "lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 625, in runner
    keyboardCaught, shouldStop, loaderClass, resultClass, config)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 692, in __runner
    keyboardCaught.set()
  File "iris/sci/lib/python2.7/multiprocessing/managers.py", line 1010, in set
    return self._callmethod('set')
  File "lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe

@WeatherGod (Member) commented:

Come to think of it, I also ran my commands with the -sv option.

@pelson (Member) commented May 17, 2013

Come to think of it, I also ran my commands with the -sv option.

For the record, I don't think it is because you're running it with -sv -- I just did that to see if I could see where it stalls (I'm not sure whether you can, though...)

@mdboom (Member Author) commented May 17, 2013

Ok -- it definitely sucks that it's failing for an unknown reason for (at least) @pelson and @WeatherGod, but I think I'll merge this so we at least get the benefits on Travis as we head into the release period, and then I'll open a new issue for getting to the bottom of the failures.

mdboom added a commit that referenced this pull request on May 17, 2013
@mdboom merged commit d39f9c0 into matplotlib:master on May 17, 2013
@mdboom deleted the parallelize_tests branch on August 7, 2014