possible race condition stops build with python3.4 #3738

Closed
sandrotosi opened this issue Oct 29, 2014 · 28 comments

Comments

@sandrotosi
Contributor

Hello,
we noticed a problem on the Debian buildd machines, where the build of mpl with python3.4 sometimes gets stuck at:

python3.4 ./setup.py build 
============================================================================
Edit setup.cfg to change the build options

BUILDING MATPLOTLIB
            matplotlib: yes [1.4.2]
                python: yes [3.4.2 (default, Oct  8 2014, 12:51:46)  [GCC
                        4.9.1]]
              platform: yes [gnukfreebsd9]

REQUIRED DEPENDENCIES AND EXTENSIONS
                 numpy: yes [version 1.8.2]
                   six: yes [using six version 1.8.0]
              dateutil: yes [using dateutil version 2.2]
                  pytz: yes [using pytz version 2012c]
               tornado: yes [using tornado version 3.2.2]
             pyparsing: yes [using pyparsing version 2.0.3]
                 pycxx: yes [Official versions of PyCXX are not compatible
                        with matplotlib on Python 3.x, since they lack
                        support for the buffer object.  Using local copy]
                libagg: yes [pkg-config information for 'libagg' could not
                        be found. Using local copy.]
              freetype: yes [version 2.5.2]
                   png: yes [version 1.2.50]
                 qhull: yes [pkg-config information for 'qhull' could not be
                        found. Using local copy.]

OPTIONAL SUBPACKAGES
           sample_data: yes [installing]
              toolkits: yes [installing]
                 tests: yes [using nose version 1.3.4 / using unittest.mock]
        toolkits_tests: yes [using nose version 1.3.4 / using unittest.mock]

OPTIONAL BACKEND EXTENSIONS
                macosx: no  [Mac OS-X only]
                qt5agg: yes [installing, Qt: 5.3.2, PyQt: 5.3.2]
E: Caught signal ‘Terminated’: terminating immediately
make: *** [build-3.4-stamp] Terminated
debian/rules:32: recipe for target 'build-3.4-stamp' failed
Build killed with signal TERM after 150 minutes of inactivity

(which got killed after 150 mins).

So far this has happened on kfreebsd-i386[1], s390x[2], and i386[3], starting from the early 1.4.1RC releases[4] (the 'maybe-failed' entries).

[1] https://buildd.debian.org/status/fetch.php?pkg=matplotlib&arch=kfreebsd-i386&ver=1.4.2-1&stamp=1414402804
[2] https://buildd.debian.org/status/fetch.php?pkg=matplotlib&arch=s390x&ver=1.4.2-1&stamp=1414192251
[3] https://buildd.debian.org/status/fetch.php?pkg=matplotlib&arch=i386&ver=1.4.1~rc1-1&stamp=1413455146
[4] https://buildd.debian.org/status/logs.php?pkg=matplotlib

This currently prevents matplotlib from being built on all Debian release architectures, and thus from reaching testing and Jessie for the freeze.

@mdboom
Member

mdboom commented Oct 29, 2014

This may be due to trying to import PyQt4 after PyQt5. You can control which GUI backends are checked using the setup.cfg file. Try adding qt4agg = False to the [gui_support] section of setup.cfg.
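
As a rough illustration, the suggested fragment could look like the string below. This is only a sketch: the section name and key come from the comment above, while the surrounding Python (which just parses the fragment with the standard library's `configparser` to confirm it is well formed) is an assumption for demonstration and says nothing about how matplotlib's own `setup.py` reads the file.

```python
import configparser

# Hypothetical setup.cfg fragment; only the [gui_support] section name and
# the qt4agg key are taken from the comment above.
SETUP_CFG = """
[gui_support]
qt4agg = False
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

# getboolean() accepts "False", "no", "off" and "0" as false values.
check_qt4agg = parser.getboolean("gui_support", "qt4agg")
print("check qt4agg:", check_qt4agg)
```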

@jenshnielsen
Member

Since this is a timeout, I guess it is likely using the other code path: forking a process with multiprocessing to import PyQt (4 or 5) into, and never returning from the process called by map.

Perhaps we should change map to use map_async and get with a timeout to avoid this?

https://docs.python.org/2/library/multiprocessing.html

@mdboom
Member

mdboom commented Oct 29, 2014

Good point, @jenshnielsen: I missed that multiprocessing path. Yes, I think having a timeout (and ideally getting a traceback from the child process) would be better.

@sandrotosi
Contributor Author

yeah I was looking suspiciously at the mp code too, but I just had a glance and didn't want to point fingers :) if you come up with a patch to test, gimme a shout and I will try to upload the new package asap.

@tacaswell
Member

@sandrotosi What is the deadline for the Jessie freeze?

@sandrotosi
Contributor Author

@tacaswell ehm... a week ago :) but I will try to get an unblock for mpl

@jenshnielsen
Member

@sandrotosi Do you have any chance to test the fixes in #3741?

This puts a 5-second timeout on the get call that returns the result. But it might still wait at the p.close() stage, so alternatively we might have to call p.terminate() if a timeout error is raised?
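
The timeout-then-terminate idea can be sketched roughly as follows. This is not the code from #3741; the helper name `run_with_timeout` is hypothetical, `time.sleep` stands in for an import that never returns, and the sketch assumes the POSIX-only "fork" start method so that it stays short.

```python
import multiprocessing
import time


def run_with_timeout(func, arg, timeout):
    """Run func(arg) in a worker process; return (finished, result)."""
    # Assumption: the "fork" start method (POSIX only) keeps the sketch
    # short, since it needs no `if __name__ == "__main__"` guard.
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(processes=1)
    try:
        async_result = pool.map_async(func, [arg])
        # get() raises multiprocessing.TimeoutError if no result arrives
        # within `timeout` seconds.
        return True, async_result.get(timeout)[0]
    except multiprocessing.TimeoutError:
        return False, None
    finally:
        # terminate() stops the worker immediately; close() followed by
        # join() would instead wait for a stuck worker to finish.
        pool.terminate()
        pool.join()


# time.sleep stands in for an import that hangs.
print(run_with_timeout(time.sleep, 30, timeout=1))
print(run_with_timeout(abs, -3, timeout=10))
```

The first call gives up after one second instead of blocking for thirty; the second returns normally because the worker answers well within its timeout.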

@sandrotosi
Contributor Author

@jenshnielsen my "test" would be to upload it to Debian and see if it really fixes the problem. Since that requires a bit of time on my side and quite a lot of resources on the Debian build machine, I would rather not rush the fix (even if it's important) and would first like an answer to your last question.

@tacaswell
Member

@jenshnielsen I bet this is the same issue we were having with 3.2, could you reproduce that on any of your machines?

@jenshnielsen
Member

@sandrotosi Sounds fair. The problem is that I can't reproduce the issue so I am working in the blind.

@tacaswell No, I can't reproduce it right now. Good point about 3.2; I will try testing it on 3.2 in an Ubuntu 12.04 VM. That is probably the closest I can get to the Travis issue locally.

@tacaswell
Member

You can also turn 3.2 back on on travis

@sandrotosi
Contributor Author

@jenshnielsen I have never faced this problem on my amd64 box either; it has only been spotted on our build nodes. if you're confident that's the way to go, I will upload it and see the results

@jenshnielsen
Member

@sandrotosi Let me do a few more experiments with VMs and Travis and see if I can reproduce the issue.

@jenshnielsen
Member

I could not reproduce the issue (either locally or on Travis), but I have changed the code a bit to forcefully terminate if the process has not returned within 5 sec. With my current understanding of multiprocessing I think this is the right thing to do, but I could be wrong.

@sandrotosi
Contributor Author

To my surprise, I found a couple[1][2] of patches that already disable the multiprocessing checks in Debian for GTK3Agg and GTK3Cairo (so I adapted the proposed patch a bit). If the test with Qt is successful, we might also have fixed the same problems with GTK3* (hence dropping a bit of delta between vanilla mpl and the Debian release).

During the night (EU TZ) I will build the package on my machine, and tomorrow morning I will upload the results; hopefully we will know the outcome during the day.

@jenshnielsen
Member

Great. It would be good to get to the bottom of this.

@jenshnielsen
Member

BTW: I did have a similar issue with the GTK3 backend on OSX and homebrew GTK. This seemed to be caused by a broken pygobject build which would segfault the subprocess.

@sandrotosi
Contributor Author

package built fine on my machine, uploaded to Debian, and it's now building on all the relevant architectures: I'll monitor it and let you know of any developments

@sandrotosi
Contributor Author

while s390x succeeded, kfreebsd-i386 didn't :( https://buildd.debian.org/status/package.php?p=matplotlib&suite=sid

@mdboom mdboom assigned mdboom and unassigned mdboom Oct 30, 2014
@jenshnielsen
Member

Sorry, I messed up the fix that I wrote last night. The new version should function correctly. If the subprocess runs for more than 10 sec it will be killed and the backend will be skipped. This is still not optimal for the GTK backends, since they will not be built. It doesn't matter for the Qt backends, since these are runtime-only dependencies.

@sandrotosi
Contributor Author

@jenshnielsen where can I find the new version? anyway, incredibly enough, as I was about to set up a loop to see when it would get stuck, I replicated the problem on my laptop at the first iteration: what tests do you want me to perform? of course I ran a simple "python3.4 setup.py build"...

@jenshnielsen
Member

I have pushed a new commit on the pull request which should terminate the right way

@sandrotosi
Contributor Author

ok, I'm doing the same dance as yesterday and will let you know after the buildd picks up the new upload

@sandrotosi
Contributor Author

Good news!!! most of the architectures have finished building successfully \o/ https://buildd.debian.org/status/package.php?p=matplotlib I'll keep an eye on the others but I don't expect (hopefully...) surprises - THANKS A LOT for the quick support on this

@tacaswell
Member

closed by #3741

@sedimentation-fault

Sorry for commenting on an old bug, but I am trying to install matplotlib-2.2.4 for Python 3.6 and 3.7 on a Gentoo system. It had worked previously, but I had to rebuild it because I wanted to remove Python 2.7 support.

So I had it installed for Python 2.7, 3.6 and 3.7. Then I decided to throw away 2.7. This time compilation for Python 3.6 went smoothly, but for 3.7 it could not find gtk3agg due to 'Check timed out'. I thought it was a 3.7 problem, so I disabled 3.7 too. Now it cannot find gtk3agg for 3.6 due to the same timeout reason.

I decided to re-enable building for both Python 3.6 and 3.7 again - and again, gtk3agg is built for 3.6 but not for 3.7 due to timeout.

Now, a word about the load on the system while this is going on: the load is anywhere between 17 and 44(!). Yes, forty-four. Who cares? There are 8 virtual CPUs at 3GHz each and 50 GB of RAM waiting to be used - so I use them. Since when does the load on a system have any effect on the result of a test like 'do you have GTK3 for Python installed?'?

As you can see, just setting a 5 sec. timeout on a test will not cut it. The result is clearly dependent on the load - and you should not impose a maximum load on your users. What you can do is possibly inquire about the current load and set a timeout that depends on it. That is, if you would set a timeout of 5 sec. for a load of 0.2, then how high should it be for a load of 17 - or even 44?
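
To make the suggestion concrete, here is one way a load-dependent timeout might be computed. It is only a sketch: the function name `scaled_timeout` and the linear load-per-CPU scaling are assumptions chosen for illustration, not something matplotlib implements or that was agreed on in this thread.

```python
import os


def scaled_timeout(base_timeout, load=None, cpus=None):
    """Scale base_timeout by the load per CPU, never going below the base."""
    if load is None:
        # 1-minute load average; os.getloadavg() is only available on Unix.
        load = os.getloadavg()[0]
    if cpus is None:
        cpus = os.cpu_count() or 1
    # Assumed heuristic: stretch the timeout linearly once the load
    # exceeds the number of CPUs.
    return base_timeout * max(1.0, load / cpus)


# The figures from the comment above: 8 CPUs, load 0.2 versus load 44.
print(scaled_timeout(5.0, load=0.2, cpus=8))
print(scaled_timeout(5.0, load=44.0, cpus=8))
```

With these numbers the lightly loaded case keeps the 5-second base, while a load of 44 on 8 CPUs stretches it by a factor of 5.5.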

You have to think more deeply about this...

@tacaswell
Member

tacaswell commented May 2, 2020

We cannot do the checks in-process due to conflicts of importing more than one GUI toolkit into the same process, so we do it in a sub-process. If we do it without a timeout we run the risk of hanging the build forever if things go extremely sideways (as this issue reported).
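
The general pattern of an out-of-process import check with a timeout can be sketched as below. Note that this uses the standard library's `subprocess` module rather than the multiprocessing approach discussed earlier in the thread, and the helper name `can_import` is hypothetical; it is not matplotlib's actual build code.

```python
import subprocess
import sys


def can_import(module_name, timeout=10.0):
    """Return True/False for import success, or None if the check timed out."""
    try:
        proc = subprocess.run(
            # A fresh interpreter keeps the toolkit out of this process.
            [sys.executable, "-c", "import " + module_name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired,
        # so a hung import cannot stall the caller forever.
        return None
    return proc.returncode == 0


# json stands in for a GUI toolkit binding here.
print(can_import("json", timeout=60.0))
```

Returning `None` on timeout keeps "the import failed" distinct from "the check never finished", which is the case this issue is about.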

This seems like a rather extreme edge case where you have loaded your system to the point where it is essentially non-responsive. I am disinclined to pick up the complexity of a load-dependent timeout for this case, but if you open a PR we can discuss what the complexity trade-off actually looks like.

The fastest path forward for you is to inject a patch into the Gentoo build process (https://wiki.gentoo.org/wiki//etc/portage/patches; I think that this falls under the category of "site-specific patches") that removes the timeout. It is between you and the Gentoo packagers whether they take that upstream.

In the future please open a new issue that refers to old ones rather than commenting on old ones.


As a side-point, I strongly suspect that you are actually taking longer in wall time to get things done (due to the cost of context switching / cache misses etc.) by overloading your system that way.
