Limit the reading size from pipes to their default buffer size on Unix systems #121313
Comments
Note: on Linux, you can change the size of the buffer backing a pipe (up to the value found in /proc/sys/fs/pipe-max-size, via fcntl's F_SETPIPE_SZ). Hence, changing the defaults could have an impact on code where this buffer is adjusted beyond the default.
I've added a constant that can easily be changed if a pipe buffer has another size.
Merged, thanks for the contribution. If anyone wants to investigate whether things could be even better with Linux's fcntl F_GETPIPE_SZ and F_SETPIPE_SZ, feel free, and report back, opening a new issue if that seems worthwhile.
Re-opening until either of the 2 open PRs is merged (to the future committer, if any, don't forget to close the issue if I forget).
…fer size to 64KiB (GH-123559): increases the multiprocessing connection buffer size from 8KiB to 64KiB for efficiency, without overallocating. Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Victor Stinner <vstinner@python.org>
Feature or enhancement
Proposal:
There are different ways for processes to communicate in Python, one of which, on Unix-based systems, is pipes.
While analyzing the performance behavior of the following code snippet, I found that Python handles reading large amounts of data from pipes very inefficiently:
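A minimal sketch of the kind of workload in question (sizes, names, and the worker logic are illustrative, not the original code):

```python
import multiprocessing as mp

CHUNK = 16 * 1024 * 1024  # 16 MiB per task

def work(chunk):
    # Stand-in for real computation; both the large argument and the
    # large return value travel through the pool's pipes.
    return chunk[::-1]

if __name__ == "__main__":
    data = b"x" * (4 * CHUNK)
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with mp.Pool(processes=4) as pool:
        results = pool.map(work, chunks)
```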
Because chunks of arrays are passed to the different processes, Python has to use pipes rather than shared memory to distribute the workload.
For a process to receive data from other processes in this form, the _recv_bytes and subsequently _recv functions in Lib/multiprocessing/connection.py are called, where a while loop reads from a given pipe file descriptor until all data has been read.
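Paraphrased (not verbatim), the receive loop in Lib/multiprocessing/connection.py behaves like this:

```python
import io
import os

def _recv(fd, size):
    # Keep requesting the full remaining size from the pipe until the
    # whole message has arrived; each os.read may return far less.
    buf = io.BytesIO()
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if len(chunk) == 0:
            raise EOFError("got end of file during message")
        buf.write(chunk)
        remaining -= len(chunk)
    return buf
```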
To read from the pipe on Unix-based systems, the os_read_impl function in Modules/posixmodule.c is executed, with the length that has to be read and the file descriptor as parameters. It is in this function that the problematic behavior arises.
Let's say the process needs to read 16MiB from the pipe. The os_read_impl function is called with a length of 16MiB. To read this much data, the function creates a new PyBytes object of size 16MiB. This is done through a malloc call which, due to the huge size of the object, can't place it on the existing heap and must divert to calling mmap. The mmap system call creates a new virtual memory area of size 16MiB.

After that, a _Py_read call tries to read 16MiB from the pipe. The problem is that, on Unix systems, a pipe can by default only return 16 times the base page size per read, i.e. 16 * 4KiB = 64KiB on x86-64 (source: man 7 pipe). So only 64KiB of data are read into the 16MiB area. Here, a Linux-specific issue arises: since the virtual memory area is big enough and properly aligned, a 2MiB Transparent Huge Page is placed to back this chunk of data. This means 2MiB of physical memory are zeroed in order to read just 64KiB, which already degrades performance.

Due to the way the read data is later stored in the _recv function, the entire virtual memory area then needs to be resized to the amount that was actually read. Therefore, another costly system call is made to shrink the 16MiB area to 64KiB, which throws away all the work done to prepare the 2MiB huge page.
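The per-read cap is easy to observe from Python on a typical Linux system (assuming the default 64KiB pipe buffer):

```python
import os

r, w = os.pipe()
os.write(w, b"x" * 65536)             # exactly fills the default 64 KiB buffer
data = os.read(r, 16 * 1024 * 1024)   # ask for 16 MiB in a single read
print(len(data))                      # 65536: one read returns at most 64 KiB
```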
This process is repeated until the entire 16MiB has been read, 64KiB at a time. Each time os_read_impl is called again, the virtual memory area holding the previously read data has to be unmapped through yet another system call. In total, hundreds of mmap, read, mremap, and munmap system calls are executed to read all of the data.
Tracing the script above with strace confirms exactly this pattern of repeated mmap, read, mremap, and munmap calls.
A patch inside the os_read_impl function results in a 3x+ performance improvement by eliminating all of the expensive memory system calls (this was the initial proposal; see the pull request for the updated code).
This patch first gets the base page size once and stores it in a static variable. It then checks whether the requested read exceeds 16 times the base page size; if so, it checks whether the read is from a pipe, and if it is, caps the read length to exactly 16 times the base page size. The patch adds about 700-900ns of overhead when trying to read more than 64KiB from a pipe; when the capping path is not taken, because the requested size is below 64KiB, the measured overhead is less than 100ns.
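The patch itself is C code in Modules/posixmodule.c; sketched in Python for illustration, the capping logic amounts to something like this (resource.getpagesize and stat.S_ISFIFO stand in for the C-level sysconf and fstat checks; capped_read and MAX_PIPE_READ are made-up names):

```python
import os
import stat
import resource

MAX_PIPE_READ = 16 * resource.getpagesize()  # 64 KiB with 4 KiB pages

def capped_read(fd, length):
    # Only pay for the fstat check when the requested length could
    # actually exceed what a single pipe read can return.
    if length > MAX_PIPE_READ and stat.S_ISFIFO(os.fstat(fd).st_mode):
        length = MAX_PIPE_READ
    return os.read(fd, length)
```

Capping the request is safe because callers such as _recv already loop until all bytes of the message have arrived.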
This caps the PyBytes object's buffer size to 64KiB. As a result, the buffer is small enough to be placed on the normal heap (glibc's malloc, for example, only diverts to mmap above its default 128KiB threshold), eliminating all of the mmap, mremap, and munmap system calls.
A Linux-specific advantage of this patch is that, since the addition of medium-sized Transparent Huge Pages, smaller huge pages such as 64KiB and 128KiB ones can be placed to cover this memory range, instead of a 2MiB page or a lot of 4KiB pages.
Performance stats for the multiprocessing code snippet:
Without patch:
With patch:
There is probably a better example than the multiprocessing script above, but the issue is clear: allocating a huge virtual memory area just to read 64KiB and then resizing it is very costly in terms of performance.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs