parallelize_tests #1951

Merged: 18 commits from the parallelize_tests branch into matplotlib:master on May 17, 2013
Conversation

@mdboom (Member) commented Apr 26, 2013

This is a possible solution to #1508 to get the test suite running in parallel.

The main fix was to stop using a single long-lived gs process for the conversions and instead launch a new gs process for each one. This makes parallelizing much easier, and the long-lived process was never much of an optimization anyway. (Alternatively, we could keep a single gs process when not parallelized, but I couldn't figure out how to detect in nose which mode we're in.)
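
Roughly, the per-file conversion now amounts to the sketch below; the helper name and the Ghostscript flags are illustrative, not the exact code in matplotlib.testing.compare:

import subprocess

def pdf_to_png(pdf_path, png_path, dpi=100):
    # Launch a fresh Ghostscript process for each conversion instead of
    # feeding filenames to one long-lived gs process.  Slightly slower per
    # file, but every test worker process is fully independent.
    cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
           "-sDEVICE=png16m", "-r%d" % dpi,
           "-sOutputFile=%s" % png_path, pdf_path]
    subprocess.check_call(cmd)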

There was also a bug where the mathtext tests were being generated with function names different from the names they were bound to in the module's namespace, which broke pickling of those functions through multiprocessing.
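
The constraint is that multiprocessing pickles a test function by its module and name, so a generated function has to be stored in the module globals under the same name as its __name__. A minimal sketch of that pattern (make_test and the test names here are hypothetical, not the actual mathtext test generator):

import sys

def make_test(expr, index):
    def test():
        # The real tests render 'expr' with mathtext and compare images.
        assert expr
    # __name__ and the attribute the function is stored under must agree;
    # otherwise the worker processes cannot look the function up by name
    # when unpickling and the parallel run fails.
    test.__name__ = 'test_mathtext_%d' % index
    return test

_module = sys.modules[__name__]
for _i, _expr in enumerate([r'$x^2$', r'$\alpha + \beta$']):
    _func = make_test(_expr, _i)
    setattr(_module, _func.__name__, _func)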

On my 4-core i7 machine, the wall-clock time dropped from 10:32 to 3:05 when running on all 4 cores. Pretty nice!

This also adds more tests to the default suite run by Travis.

@mdboom (Member Author) commented Apr 29, 2013

Does anyone have any thoughts about the test failures? I'm stumped, and I can't reproduce them on a Ubuntu Precise VM I have (which apparently matches what Travis has).

@WeatherGod (Member) commented:

Why is the testing/util.py file deleted?

@mdboom (Member Author) commented Apr 29, 2013

testing/util.py only contained one class, a version of expect used to communicate with a long-lived gs process -- you feed it PDF filenames and it writes out PNG files. Since that approach doesn't work with multiprocessing (each worker process would need its own Ghostscript process), I just deleted it. It could probably be made to work by being more clever about it, but it's a very small optimization over just creating a new gs process for each image.

@pelson (Member) commented Apr 30, 2013

Does anyone have any thoughts about the test failures?

I can't see travis-ci test results anymore (this has been true for some time now) -- my browser crashes and burns when it sees the long log (Firefox 17.0.5). That renders the travis-ci system into a pretty useless binary pass/fail indicator for me. Can we look at making the logs shorter again? (I have a "print if fails" executable in cartopy to hide the build output unless something went wrong; for example: https://travis-ci.org/SciTools/cartopy/builds/6686231.) I can't see beyond the build output as it stands.

@pelson (Member) commented Apr 30, 2013

In principle this looks good. I'm guessing you're adding the necessary multiprocess arguments to your python tests.py call since there is no update to that file?

I also can't see what has changed that would make travis-ci use multiprocessing - I would have expected an update to .travis.yml

@mdboom (Member Author) commented Apr 30, 2013

@pelson: I haven't had any trouble getting the logs from Travis lately. They recently did make the log fetching more chunked which has helped a lot. But I could capitulate and hide setup parts of the log if there's no other way.

In any event, for this specific PR, you can get the raw log here:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/6665520/log.txt

As for how to run this with multiprocess -- tests.py merely passes its arguments along to nosetests verbatim, so it's still up to the user to pass --processes=-1 to run the tests in parallel. I think that's the right thing, because we don't want to override any of the nose defaults and surprise people.
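
For context, the forwarding is essentially the following simplified sketch (the real tests.py also passes defaultTest=default_test_modules and does some extra setup):

import sys
import nose

def run():
    # Hand everything on the command line (e.g. --processes=-1,
    # --process-timeout=300, -sv) straight through to nose.
    nose.main(argv=sys.argv)

if __name__ == '__main__':
    run()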

I had originally added --processes=-1 to the .travis.yml, but then took it out when the tests failed, to see if that would resolve it. It didn't (the results are identical). So the failures are not the result of multiprocessing but of some other change here, which makes it all the more puzzling.

@ghost assigned mdboom Apr 30, 2013
@mdboom (Member Author) commented Apr 30, 2013

Ok -- I've got this mostly working. The initial failures were due to still importing matplotlib.testing.util after it had been removed. :rage1:

It still seems, however, that running the tests in parallel on Travis doesn't work:

The command "cd ../tmp_test_dir" exited with 0.
$ python ../matplotlib/tests.py -sv --processes=-1 --process-timeout=300
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

I'm perfectly fine with merging this as is, because it's still very useful for running tests locally. I'm not entirely sure that Travis would give us access to more than one core per VM anyway -- I find the Travis docs completely unwieldy, so I can't confirm or deny that, but it wouldn't surprise me if we only get one.

@pelson (Member) commented May 1, 2013

FWIW you can get at the log output in the form: https://api.travis-ci.org/jobs/{build_id}/log.txt?deansi=true

I was able to reproduce the freezing that travis-ci saw -- it was related to providing a negative value to the processes argument, so instead I went for the less elegant --processes=$(nproc) approach.

After fixing that, I was able to get the tests to run in ~91 seconds with just 3 failures: https://api.travis-ci.org/jobs/6781057/log.txt?deansi=true . I think it might be worth tracking these down (if it's not too thorny). (My extra commits are in my copy of your branch: https://github.com/pelson/matplotlib/tree/multiprocess_testing)

@mdboom (Member Author) commented May 1, 2013

@pelson: Thanks for getting to the bottom of the hanging. nosetests uses multiprocessing.cpu_count under the hood, which in turn uses num = os.sysconf('SC_NPROCESSORS_ONLN'). I guess that system call never returns?

In any case, I think your solution is just fine. I think it also makes sense to log the value of $(nproc) for informational purposes.
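
For reference, this is roughly where the number comes from when nose picks it automatically; the fallback below is only an illustration and would not help with an outright hang inside the sysconf call:

import multiprocessing

def worker_count(default=1):
    # multiprocessing.cpu_count() reads os.sysconf('SC_NPROCESSORS_ONLN')
    # on Linux and raises NotImplementedError if the count is unavailable.
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return default

print('Running tests with %d processes' % worker_count())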

@mdboom (Member Author) commented May 1, 2013

Ok -- it seems we're down to one failure: test_bbox_inches_tight_suptile_legend. The image is a different size than expected.

@mdboom (Member Author) commented May 1, 2013

With the latest commit (which should not have affected test_bbox_inches_tight_suptile_legend), that test is now passing, so there's obviously some kind of race condition with that test. (That doesn't surprise me; it's always been a bit flaky.) Not sure how to tackle that, though.

I wonder if the inkscape failure may be due to some race condition there -- since it's now possible that many inkscape processes will be launched at the same time, I wouldn't be surprised if there is greater memory pressure, etc.

@mdboom (Member Author) commented May 8, 2013

I think I finally have something that works on Travis (excepting the usual Travis network errors). Any volunteers to do a little more local testing before merging?

@WeatherGod (Member) commented:

Sure, I'll give it a spin.

@WeatherGod (Member) commented:

Uhm, I am getting an import error saying that pyparsing >= 1.5.6 is required, but I have 1.5.7 installed.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: Is the import error on building, or importing once installed? Are you certain you're in the same virtualenv/version of Python where you have pyparsing installed?

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: also -- how are you running the tests? If from nosetests at the command line, I have personally run into issues before where the nosetests command uses a different Python from the one I intended.

@WeatherGod (Member) commented:

It is an error on import. See the following output (where I am using a virtualenv called "centos6"):

[centos6] [broot@rd22 matplotlib]$ python
Python 2.7.3 (default, May 29 2012, 13:36:18)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/nas/home/broot/centos6/lib/python2.7/site-packages/matplotlib-1.3.x-py2.7-linux-x86_64.egg/matplotlib/__init__.py",
line 121, in <module>
    '.'.join(str(x) for x in _required)))
ImportError: matplotlib requires pyparsing >= 1.5.6
>>> import pyparsing
>>> pyparsing.__version__
'1.5.7'

@mdboom (Member Author) commented May 8, 2013

Odd indeed. Here's a wild guess: do you have a pyparsing.py file sitting around from an old matplotlib installation in /nas/home/broot/centos6/lib/python2.7/site-packages/matplotlib-1.3.x-py2.7-linux-x86_64.egg/matplotlib/? matplotlib is still using the old import semantics (it doesn't do from __future__ import absolute_import), so if there were a pyparsing.py in that directory it would take precedence over the globally installed one. If that's the case, try cleaning the git directory with git clean -fxd and reinstalling.
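
A quick way to check which copy is actually being imported (plain introspection, nothing matplotlib-specific):

import pyparsing
print(pyparsing.__file__)     # should point into site-packages, not the matplotlib egg
print(pyparsing.__version__)  # should report 1.5.7 here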

@WeatherGod (Member) commented:

Oh, wow, I didn't think of that. I didn't think to check the egg directory. Indeed, I did have that. I blasted away all of that and I am rebuilding now.

As a side note, I notice that setup.py now does a download, if needed, during the build/install/test process. Wouldn't the Debian people have a problem with that? I vaguely recall them raising an issue with it when our tests used to download the Yahoo stock data.

Review comment on the .travis.yml changes:

- python setup.py install

script:
- mkdir ../tmp_test_dir
- cd ../tmp_test_dir
- python ../matplotlib/tests.py -sv
- echo Testing using 4 processes

(Member) The echo statement here doesn't match what is passed to the command line.

@mdboom (Member Author) Indeed. Thanks for catching.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: since setuptools is doing the downloading, my understanding is that it's ok, since the Debian build environment prevents the download from happening. They basically have a generic way to handle all setuptools-based builds.

@WeatherGod (Member) commented:

So, even with two processes and 4 GB of RAM, I am running into issues where the processes use so much memory (~3 GB each) that my system starts swapping like crazy. And it seems like it isn't making any further progress even though my processors stay pegged (whenever an I/O wait finishes, that is). I can do a simple matplotlib.test() run with no problems, so I don't know why doing it with two processes is so much worse.

@mdboom (Member Author) commented May 8, 2013

@WeatherGod: Thanks. That's a useful data point. I'm not sure what to investigate further. For me, on Fedora 18 (which shouldn't be fundamentally different from your CentOS box), with 4 GB and 4 cores, I'm not seeing runaway memory usage. This is also on Python 2.7. What do you get for:

› gs --version
9.06
› inkscape --version
Inkscape 0.48.4 r9939 (Dec 18 2012)

@WeatherGod (Member) commented:

[centos6] [broot@rd22 ~]$ gs --version
8.70
[centos6] [broot@rd22 ~]$ inkscape --version
inkscape: Command not found.

I can do tests tonight on a Ubuntu 12.04 machine that is quite a bit beefier, and I have full control over the packages installed on it.

@WeatherGod (Member) commented:

So, on my Ubuntu 12.04 machine, setting it to 2 processes, it just hangs immediately after completing the first test. But, if I run it in single process mode, it works just fine.

ben@tigger:~$ gs --version
9.05
ben@tigger:~$ inkscape --version
Inkscape 0.48.3.1 r9886 (Jan 29 2013)

@mdboom (Member Author) commented May 9, 2013

Ok. Interesting -- I guess this is farther from ready than I thought. The Travis tests are running in a Ubuntu 12.04 VM, if I recall correctly, so I'm surprised it works there and not for you. We did see hanging initially when trying to use multiprocessing.cpu_count() to get the number of cores, but if you're explicitly specifying 2 cores you aren't just hitting that problem. Thanks for trying -- I'll have to go back to the drawing board (perhaps try this on a Ubuntu VM myself) and see if I can come up with any good question to ask...

@pelson: Have you tried running the tests locally, or only on Travis?

@mdboom (Member Author) commented May 15, 2013

I'm still unable to reproduce @WeatherGod's issue (I tried on a clean Ubuntu 12.04 VM). Very strange.

How do we all feel about this PR? Personally, I'd love to have it in (since it works for me -- sorry to be selfish), and it shouldn't be any worse for anyone not running the tests in parallel.

Travis seems to like running the tests in parallel as well, so I'm leaning toward turning it on in .travis.yml, but that of course is also optional.

@WeatherGod (Member) commented:

Perhaps we should double-check our basic assumptions. Exactly what command did you use to run the parallelized tests?

@mdboom (Member Author) commented May 16, 2013

After installing matplotlib, from a temporary directory, I run:

$PATH_TO_MATPLOTLIB_SOURCE/tests.py --processes=-1 --process-timeout=300

@pelson (Member) commented May 17, 2013

This is absolutely fine to be merged. I too have memory problems (Intel® Xeon® E5520 with 8 threads and 5.7 GB addressable RAM on 64-bit RHEL 6, Python 2.7), which mean I cannot run the tests in parallel (even with --processes=1), but this PR represents an improvement (and the Travis tests are a lot quicker). So 👍 for v1.3.x.

@pelson (Member) commented May 17, 2013

For the record, running with the -sv flags, my machine froze (and was terminated with Ctrl+C) at:

$> python tests.py  --processes=1 --process-timeout=300 -sv
...

test_line_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_for_non_affine_transData (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok

^C
Traceback (most recent call last):
  File "tests.py", line 20, in <module>
    run()
  File "tests.py", line 17, in run
    defaultTest=default_test_modules)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 118, in __init__
    **extra_args)
  File "lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 197, in runTests
    result = self.testRunner.run(self.test)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 349, in run
    timeout=nexttimeout)
  File "<string>", line 2, in get
  File "lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
KeyboardInterrupt
Process Process-2:
Traceback (most recent call last):
  File "lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 625, in runner
    keyboardCaught, shouldStop, loaderClass, resultClass, config)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 692, in __runner
    keyboardCaught.set()
  File "iris/sci/lib/python2.7/multiprocessing/managers.py", line 1010, in set
    return self._callmethod('set')
  File "lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe

@WeatherGod (Member) commented:

Come to think of it, I also ran my commands with the -sv option.

@pelson (Member) commented May 17, 2013

Come to think of it, I also ran my commands with the -sv option.

For the record, I don't think it is because you're running it with -sv -- I just did that to see if I could see where it stalls (I'm not sure whether you can, though...)

@mdboom (Member Author) commented May 17, 2013

Ok -- it definitely sucks that it's failing for an unknown reason for (at least) @pelson and @WeatherGod, but I think I'll merge this so we at least get the benefits on Travis as we head into the release period, and then I'll open a new issue for getting to the bottom of the failures.

mdboom added a commit that referenced this pull request on May 17, 2013
@mdboom merged commit d39f9c0 into matplotlib:master on May 17, 2013
@mdboom deleted the parallelize_tests branch on August 7, 2014