-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Use os.getpid() in configdir, to avoid multiprocess concurrency issues #15911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why would it a problem that multiple subprocesses share the same config/cachedir? In fact that's normally the case (they all share ~/.config/matplotlib ~/.cache/matplotlib), why would it be different when using a tmpdir? It's actually not really clear to me what this is trying to solve. |
Perhaps this will illustrate:
process1:
create /tmp/foo
atexit rmtree /tmp/foo
process2:
create /tmp/foo
atexit rmtree /tmp/foo
process1:
use /tmp/foo
exit, triggering rmtree
process2:
try to use /tmp/foo
discover that /tmp/foo has been rmtree'd before we were done with it
…On Wed, Dec 11, 2019 at 9:41 AM Antony Lee ***@***.***> wrote:
Why would it a problem that multiple subprocesses share the same
config/cachedir? In fact that's normally the case (they all share
~/.config/matplotlib ~/.cache/matplotlib), why would it be different when
using a tmpdir?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#15911?email_source=notifications&email_token=AMGVVMK7IQ2ZFTOR4DNUQNDQYEQ4PA5CNFSM4JZSLZK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGT7EWA#issuecomment-564654680>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AMGVVMLWMVWVU677F3PRYZ3QYEQ4PANCNFSM4JZSLZKQ>
.
|
fair point... but I think separate processes can't actually end up creating the same directory? mkdtemp is documented as
and if you look at the implementation I think at the OS-level if there is a name collision this is handled by the FileExistsError case (https://github.com/python/cpython/blob/3.8/Lib/tempfile.py#L358). |
I'm not really sure why they were all picking the same directory name
yesterday. I was a little surprised by it too.
uwsgi was involved, forking 20 processes.
I'm convinced that adding the pid to the directory name helped.
I suppose it's possible the random number generator (on which I assume
mkdtemp depends) was seeded in a common parent process before the child
forks, and then the children inherited that random number sequence without
reseeding.
…On Wed, Dec 11, 2019 at 9:58 AM Antony Lee ***@***.***> wrote:
fair point... but I think separate processes can't actually end up
creating the same directory? mkdtemp is documented as
There are no race conditions in the directory’s creation.
and if you look at the implementation I think at the OS-level if there is
a name collision this is handled by the FileExistsError case (
https://github.com/python/cpython/blob/3.8/Lib/tempfile.py#L358).
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#15911?email_source=notifications&email_token=AMGVVMJEI6VTYI2GORVN3GTQYES6DA5CNFSM4JZSLZK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGUA3EY#issuecomment-564661651>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AMGVVMMJXJBNN6FTJ5R34JLQYES6DANCNFSM4JZSLZKQ>
.
|
Can you check whether the fork occurred before or after the temporary directory was created and the atexit call registered? I'm still puzzled... |
On Wed, Dec 11, 2019 at 10:41 AM Antony Lee ***@***.***> wrote:
Can you check whether the fork occurred before or after the temporary
directory was created and the atexit call registered? I'm still puzzled...
Unfortunately, this was on a production system that I can't bounce lightly.
But it seems safe to assume that uwsgi was forking the 20 processes before
matplotlib code got involved. Otherwise, uwsgi would about have to have a
matplotlib dependency itself, which would seem strange.
I believe uwsgi would fork 20 processes, each just prior to chaining into
my company's code that does have a matplotlib dependency.
Here's a quick grep:
$ find . -type f -print | grep -i uwsgi
below cmd output started 2019 Wed Dec 11 10:50:37 AM PST
./uWSGI-2.0.18.dist-info/RECORD
./uWSGI-2.0.18.dist-info/top_level.txt
./uWSGI-2.0.18.dist-info/METADATA
./uWSGI-2.0.18.dist-info/LICENSE
./uWSGI-2.0.18.dist-info/WHEEL
./uWSGI-2.0.18.dist-info/INSTALLER
./__pycache__/uwsgidecorators.cpython-36.pyc
./uwsgidecorators.py
above cmd output done 2019 Wed Dec 11 10:50:37 AM PST
dstromberg@dstromberg-inspiron-5570:~/.local/lib/python3.6/site-packages
x86_64-pc-linux-gnu 2352
$ find . -type f -print | grep -i uwsgi | xargs grep -il matplotlib
below cmd output started 2019 Wed Dec 11 10:50:39 AM PST
above cmd output done 2019 Wed Dec 11 10:50:39 AM PST
Thanks!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I interpret both the docs of mkdtemp and its implementation as guaranteeing that two calls to mkdtemp, even in separate (e.g., after-fork) processes, can never return the same directory.
Note that if that's not the case, then the proposed patch here can still fail if e.g. two separate OSes access the same tmpdir (e.g. on an NFS), so that's why I would like to understand what is really happening...
On Thu, Dec 12, 2019 at 2:08 AM Antony Lee ***@***.***> wrote:
***@***.**** requested changes on this pull request.
I interpret both the docs of mkdtemp and its implementation as
guaranteeing that two calls to mkdtemp, even in separate (e.g., after-fork)
processes, can never return the same directory.
Note that if that's not the case, then the proposed patch here can still
fail if e.g. two separate OSes access the same tmpdir (e.g. on an NFS), so
that's why I would like to understand what is really happening..
Sadly, it could take significant time to track down the more general cause
of this issue.
Here's a small test program that confirms what you were saying: mkdtemp
tries to be unique even across fork, even with a common RNG seed:
#!/usr/bin/python3
"""Test mkdtemp with fork."""
import os
import random
import shutil
import sys
import tempfile
def main():
"""Do the test."""
for i in range(5):
pid = os.fork()
if pid == 0:
# child
random.seed(0)
directory = tempfile.mkdtemp(prefix='foo')
print(f'child: {directory}')
shutil.rmtree(directory)
sys.exit(0)
else:
# parent
print('parent')
continue
os.wait()
main()
Here's the error I was seeing (from a pty log, but with control characters
removed) :
*** Python threads support is disabled. You can enable it with
--enable-threads ***
Python main interpreter initialized at 0x18d1050
your server socket listen backlog is limited to 100 connections
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
orig_st = os.lstat(path)
orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
orig_st = os.lstat(path)
orig_st = os.lstat(path)
orig_st = os.lstat(path)
orig_st = os.lstat(path)
orig_st = os.lstat(path)
orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
FileNotFoundError: [Errno 2] No such file or directory:
'/tmp/matplotlib-sroiri33'
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
worker 5 buried after 1 seconds
Involved technologies: Python 3.6, falcon, uwsgi, matplotlib, my company's
product named grok3.
FWIW,I think diskless computers are rare today, and even when they were
popular they would seldom share /tmp instead using a volatile filesystem
like a tmpfs. But I could add a hostname and nonce if you feel that would
help get something along these lines merged.
Thanks.
|
Just to be sure:
|
Also, your log suggest the error is only occurring at shutdown (the line |
On Thu, Dec 12, 2019 at 10:08 AM Antony Lee ***@***.***> wrote:
Just to be sure:
- what's the version of matplotlib you're using? does it happen with
mpl master?
- you are positive that adding the pid solves the problem?
We have a few different docker containers using matplotlib, two unpinned
and one pinned:
$ find . -name 'req*.txt' -print | xargs egrep matplotlib
below cmd output started 2019 Thu Dec 12 10:22:58 AM PST
./rt/classifier/requirements.txt:matplotlib==2.2.4
./falcon_server/requirements.txt:matplotlib
./requirements.txt:matplotlib
The one that matters in this case should be the falcon_server one - so
unpinned. If it's important, I can try to get back onto the production
system, but access to that system is constrained to being initiated from a
single laptop in this office (a remote worker has a 2nd). I checked my
logs for a 'pip freeze', but unfortunately didn't have one.
I'm highly confident that adding the pid helped. I was Slacking with the
only other person on the production system at the time, and no one else
spoke up about it in our daily standup meeting when we discussed the matter
- so a surprise fix from someone out of communication with us is unlikely.
Also, we tried to start the software a few times and got the error quoted a
few times in a row, but when I added the pid to the temporary directory,
the problem went away on the next software start.
Also, your log suggest the error is only occurring at shutdown (the line
orig_st = os.lstat(path) only occurs in the implementation of
shutil.rmtree, which is called by the atexit handler)?
Yes, something in our API initialization is starting a short-lived
subprocess that uses matplotlib and exits almost immediately, before
getting into the pith of our API. The API actually starts up even without
the pid in the tempdir path, but it gives those ominous errors early on.
We do make use of matplotlib for actual plotting, but sometimes it's a
dependency of something else we're building on without graphs/charts too.
Grok3 is not a small system - it's mostly in CPython and there are about
38,000 SLOC Python.
Thanks.
|
At least it would be nice to know if this only occurs with mpl2.2.4, which is quite old (and likely will go unsupported relatively soon, though it's not my call), or mpl3+, which has diverged quite a bit (due to the dropping of Py2 support). |
Actually I now have a repro, that fails even if including the pid in the tmpdir (i.e., this PR is insufficient to fix it): import os
import time
import matplotlib as mpl
print(mpl.get_configdir())
print(mpl.get_cachedir())
if os.fork(): # parent exits immediately, destroying the tmp configdir.
pass
else:
time.sleep(1) # child waits that parent quits, then crashes at exit...
# or if later doing something that tries to access the cachedir, e.g.
# import matplotlib.font_manager (first do Looks like the following patch works, can you confirm? diff --git i/lib/matplotlib/__init__.py w/lib/matplotlib/__init__.py
index 11968a34b..f62a55932 100644
--- i/lib/matplotlib/__init__.py
+++ w/lib/matplotlib/__init__.py
@@ -523,16 +523,6 @@ def get_home():
return None
-def _create_tmp_config_or_cache_dir():
- """
- If the config or cache directory cannot be created, create a temporary one.
- """
- configdir = os.environ['MPLCONFIGDIR'] = (
- tempfile.mkdtemp(prefix='matplotlib-'))
- atexit.register(shutil.rmtree, configdir)
- return configdir
-
-
def _get_xdg_config_dir():
"""
Return the XDG configuration directory, according to the `XDG
@@ -551,7 +541,29 @@ def _get_xdg_cache_dir():
return os.environ.get('XDG_CACHE_HOME') or str(Path.home() / ".cache")
-def _get_config_or_cache_dir(xdg_base):
+# The following functions rely on getpid() to make sure 1) that a temporarily
+# created config/cachedir is not shared between parent and child processes (the
+# parent may exit and destroy it first), and 2) that child processes don't
+# inherit the parent's atexit handler.
+
+
+def _create_tmp_config_or_cache_dir():
+ """
+ If the config or cache directory cannot be created, create a temporary one.
+ """
+ configdir = tempfile.mkdtemp(prefix='matplotlib-')
+
+ @atexit.register
+ def cleanup(getpid=os.getpid, pid=os.getpid(),
+ rmtree=shutil.rmtree, configdir=configdir):
+ if getpid() == pid:
+ rmtree(configdir)
+
+ return configdir
+
+
+@functools.lru_cache()
+def _get_config_or_cache_dir(xdg_base, pid):
configdir = os.environ.get('MPLCONFIGDIR')
if configdir:
configdir = Path(configdir).resolve()
@@ -569,7 +581,6 @@ def _get_config_or_cache_dir(xdg_base):
return _create_tmp_config_or_cache_dir()
-@_logged_cached('CONFIGDIR=%s')
def get_configdir():
"""
Return the string representing the configuration directory.
@@ -586,10 +597,11 @@ def get_configdir():
configuration directory.
5. A writable directory could not be found or created; return None.
"""
- return _get_config_or_cache_dir(_get_xdg_config_dir())
+ configdir = _get_config_or_cache_dir(_get_xdg_config_dir(), os.getpid())
+ _log.debug("CONFIGDIR=%s", configdir)
+ return configdir
-@_logged_cached('CACHEDIR=%s')
def get_cachedir():
"""
Return the location of the cache directory.
@@ -597,7 +609,9 @@ def get_cachedir():
The procedure used to find the directory is the same as for
_get_config_dir, except using `$XDG_CACHE_HOME`/`~/.cache` instead.
"""
- return _get_config_or_cache_dir(_get_xdg_cache_dir())
+ cachedir = _get_config_or_cache_dir(_get_xdg_cache_dir(), os.getpid())
+ _log.debug("CACHEDIR=%s", cachedir)
+ return cachedir
def _get_data_path(): |
Sounds good.
Sadly, I can't try to reproduce the error right now. We're a young startup
and we're aiming to deploy our product to our first customer this week - so
time is in very short supply right now.
Also, this is a customer system - it's possible I won't be able to try to
reproduce the bug on their system even after we have more time.
I might be able to try to reproduce it on another system in a week or two.
I need to set up a uwsgi test anyway.
Thanks.
…On Thu, Dec 12, 2019 at 11:47 AM Antony Lee ***@***.***> wrote:
Actually I now have a repro, that fails even if including the pid in the
tmpdir (i.e., this PR is insufficient to fix it):
import osimport time
import matplotlib as mplprint(mpl.get_configdir())print(mpl.get_cachedir())
if os.fork(): # parent exits immediately, destroying the tmp configdir.
passelse:
time.sleep(1) # child waits that parent quits, then crashes at exit...
# or if later doing something that tries to access the cachedir, e.g.
# import matplotlib.font_manager
(first do chmod a-x ~/.{config,cache}/matplotlib (temporarily) to force
the creation of config/cachedirs in /tmp)
Looks like the following patch works, can you confirm?
diff --git i/lib/matplotlib/__init__.py w/lib/matplotlib/__init__.py
index 11968a34b..f62a55932 100644--- i/lib/matplotlib/__init__.py+++ w/lib/matplotlib/__init__.py@@ -523,16 +523,6 @@ def get_home():
return None
-def _create_tmp_config_or_cache_dir():- """- If the config or cache directory cannot be created, create a temporary one.- """- configdir = os.environ['MPLCONFIGDIR'] = (- tempfile.mkdtemp(prefix='matplotlib-'))- atexit.register(shutil.rmtree, configdir)- return configdir--
def _get_xdg_config_dir():
"""
Return the XDG configuration directory, according to the `XDG@@ -551,7 +541,29 @@ def _get_xdg_cache_dir():
return os.environ.get('XDG_CACHE_HOME') or str(Path.home() / ".cache")
-def _get_config_or_cache_dir(xdg_base):+# The following functions rely on getpid() to make sure 1) that a temporarily+# created config/cachedir is not shared between parent and child processes (the+# parent may exit and destroy it first), and 2) that child processes don't+# inherit the parent's atexit handler.+++def _create_tmp_config_or_cache_dir():+ """+ If the config or cache directory cannot be created, create a temporary one.+ """+ configdir = tempfile.mkdtemp(prefix='matplotlib-')++ @atexit.register+ def cleanup(getpid=os.getpid, pid=os.getpid(),+ rmtree=shutil.rmtree, configdir=configdir):+ if getpid() == pid:+ rmtree(configdir)++ return ***@***.***_cache()+def _get_config_or_cache_dir(xdg_base, pid):
configdir = os.environ.get('MPLCONFIGDIR')
if configdir:
configdir = Path(configdir).resolve()@@ -569,7 +581,6 @@ def _get_config_or_cache_dir(xdg_base):
return _create_tmp_config_or_cache_dir()
-@_logged_cached('CONFIGDIR=%s')
def get_configdir():
"""
Return the string representing the configuration directory.@@ -586,10 +597,11 @@ def get_configdir():
configuration directory.
5. A writable directory could not be found or created; return None.
"""- return _get_config_or_cache_dir(_get_xdg_config_dir())+ configdir = _get_config_or_cache_dir(_get_xdg_config_dir(), os.getpid())+ _log.debug("CONFIGDIR=%s", configdir)+ return configdir
-@_logged_cached('CACHEDIR=%s')
def get_cachedir():
"""
Return the location of the cache directory.@@ -597,7 +609,9 @@ def get_cachedir():
The procedure used to find the directory is the same as for
_get_config_dir, except using `$XDG_CACHE_HOME`/`~/.cache` instead.
"""- return _get_config_or_cache_dir(_get_xdg_cache_dir())+ cachedir = _get_config_or_cache_dir(_get_xdg_cache_dir(), os.getpid())+ _log.debug("CACHEDIR=%s", cachedir)+ return cachedir
def _get_data_path():
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#15911?email_source=notifications&email_token=AMGVVMNWU6WYTAQW7YKYTMDQYKIM7A5CNFSM4JZSLZK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGXZ4NI#issuecomment-565157429>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AMGVVMP4KROKN37NJJ7JTSTQYKIM7ANCNFSM4JZSLZKQ>
.
|
Actually, looking at this again, there's a much simpler solution: you can set the MPLCONFIGDIR environment variable (which, despite the name, applies to both the config and cache dirs) to a writable directory that you control. Would that work for you? |
PR Summary
PR Checklist