
Limit the reading size from pipes to their default buffer size on Unix systems #121313


Closed
aplaikner opened this issue Jul 3, 2024 · 5 comments
Assignees
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@aplaikner
Contributor

aplaikner commented Jul 3, 2024

Feature or enhancement

Proposal:

There are different ways for processes to communicate in Python, one of which, on Unix-based systems, is pipes.

While analyzing the performance behavior of the following code snippet, I found that Python handles reading large amounts of data from pipes very inefficiently:

import multiprocessing as mp
import numpy

# 128 Mi (2**27 = 512**3) elements; as float64 this is 1 GiB of data
size = pow(512, 3)
# fill the array with ones
arr = numpy.ones(size)

def fn(array):
    return numpy.sum(array)

def multi_process_sum(arr):
    # each of the 8 processes receives a slice of 16 Mi elements
    c = size // 8

    # create a pool of 8 worker processes
    pool = mp.Pool(processes=8)
    # pass each process its slice of the array
    return int(sum(pool.map(fn, [arr[i:i + c] for i in range(0, size, c)])))

# every element is one, so the sum must equal the element count
assert multi_process_sum(arr) == size
# zero the array
arr[0:size] = 0
# the sum of an all-zero array must be zero
assert multi_process_sum(arr) == 0

Because slices of the array are passed to the worker processes, Python has to use pipes rather than shared memory to distribute the workload.

For a process to receive data from other processes in this form, the _recv_bytes and subsequently _recv functions in Lib/multiprocessing/connection.py are called, where a while loop reads from a given pipe file descriptor until all data has been read.
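That receive loop can be sketched in pure Python. This is an illustrative simplification, not CPython's actual code; the name `recv_exact` is hypothetical:

```python
import os

def recv_exact(fd, size):
    # Simplified sketch of the loop in multiprocessing's _recv: keep
    # calling os.read until `size` bytes have arrived, since a single
    # read may return far less than requested.
    buf = bytearray()
    remaining = size
    while remaining:
        data = os.read(fd, remaining)
        if not data:
            raise EOFError("pipe closed before all data arrived")
        buf += data
        remaining -= len(data)
    return bytes(buf)
```

Note that each `os.read(fd, remaining)` call asks for all the bytes still outstanding, which is exactly what triggers the over-allocation described below.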

To read from the pipe on Unix-based systems, the os_read_impl function in Modules/posixmodule.c is executed, with the length that has to be read and the file descriptor as parameters. It is in this function that the problematic behavior arises.

Suppose the process needs to read 16 MiB from the pipe. The os_read_impl function is called with a length of 16 MiB and creates a new PyBytes object of that size. The underlying malloc call cannot satisfy such a large request from the existing heap and falls back to mmap, which creates a new 16 MiB virtual memory area. A _Py_read call then tries to read 16 MiB from the pipe. The problem is that on Unix systems a pipe can, by default, only return 16 times the base page size of data per read, i.e. 16 * 4 KiB = 64 KiB on x86-64 (source: man 7 pipe). So only 64 KiB of data is read into the 16 MiB area.

Here a Linux-specific issue arises. Since the virtual memory area is big enough and properly aligned, a 2 MiB Transparent Huge Page is placed to back this chunk of data. That zeroes 2 MiB of physical memory in order to read just 64 KiB, which already degrades performance. Due to the way the read data is later stored in the _recv function, the virtual memory area then has to be resized to the amount actually read. So another costly system call shrinks the 16 MiB area to 64 KiB, destroying all the work done to prepare the 2 MiB huge page.

This cycle repeats, 64 KiB at a time, until the entire 16 MiB has been read. Each time os_read_impl is called again, the virtual memory area holding the previously read data must also be unmapped with a system call. As a result, hundreds of mmap, read, mremap, and munmap system calls are executed to transfer all of the data.

The following excerpt from strace shows the system calls the script above triggers:

[pid 35834] mmap(NULL, 68227072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f68fa800000
[pid 35834] read(3, "\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0"..., 68223207) = 65536
[pid 35834] mremap(0x7f68fa800000, 68227072, 69632, MREMAP_MAYMOVE) = 0x7f68fa800000
[pid 35834] munmap(0x7f68f6600000, 69632) = 0
[pid 35834] mmap(NULL, 68161536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f68f6600000
[pid 35834] read(3, "\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0"..., 68157671) = 65536
[pid 35834] mremap(0x7f68f6600000, 68161536, 69632, MREMAP_MAYMOVE) = 0x7f68f6600000
[pid 35834] munmap(0x7f68fa800000, 69632) = 0
[pid 35834] mmap(NULL, 68096000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f68fa800000
[pid 35834] read(3, "\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0"..., 68092135) = 65536
[pid 35834] mremap(0x7f68fa800000, 68096000, 69632, MREMAP_MAYMOVE) = 0x7f68fa800000
[pid 35834] munmap(0x7f68f6600000, 69632) = 0

The following patch inside the os_read_impl function yields a 3x+ performance improvement by eliminating all of the expensive memory system calls. (This is the initial proposal; see the pull request for the updated code.)

#ifndef MS_WINDOWS
    static long page_size;
    if (page_size == 0)
        page_size = sysconf(_SC_PAGE_SIZE);

    if (length > page_size * 16) {
        struct stat statbuffer;
        /* Only cap reads from pipes (FIFOs), and only if fstat() succeeds. */
        if (fstat(fd, &statbuffer) == 0 && S_ISFIFO(statbuffer.st_mode)) {
            length = page_size * 16;
        }
    }
#endif

The patch first obtains the base page size once and caches it in a static variable. It then checks whether more than 16 times the base page size is being requested; if so, it checks whether the read is from a pipe, and if it is, caps the read length at exactly 16 times the base page size. The patch adds about 700-900 ns of overhead when reading more than 64 KiB from a pipe; when the cap does not apply because the requested size is below 64 KiB, the measured overhead is under 100 ns.

This caps the PyBytes object's buffer size at 64 KiB. As a result, the buffer is small enough to be placed on the normal heap, eliminating all of the mmap, mremap, and munmap system calls.

A Linux-specific advantage of this patch is that, since the addition of medium-sized Transparent Huge Pages, smaller huge pages (such as 64 KiB and 128 KiB) can be placed to cover this memory range, instead of a single 2 MiB page or a lot of 4 KiB pages.

Performance stats for the multiprocessing code snippet:

Without patch:

[root@archlinux python_stuff]# perf stat -r 20 -B python python-benchmark.py

 Performance counter stats for 'python python-benchmark.py' (20 runs):

          8,298.58 msec task-clock                       #    1.158 CPUs utilized               ( +-  0.16% )
            33,740      context-switches                 #    4.066 K/sec                       ( +-  0.09% )
                46      cpu-migrations                   #    5.543 /sec                        ( +-  3.60% )
            94,066      page-faults                      #   11.335 K/sec                       ( +-  0.05% )
    28,161,034,405      cycles                           #    3.393 GHz                         ( +-  0.14% )
     9,578,089,890      instructions                     #    0.34  insn per cycle              ( +-  0.04% )
     1,809,040,836      branches                         #  217.994 M/sec                       ( +-  0.04% )
         8,786,927      branch-misses                    #    0.49% of all branches             ( +-  0.55% )

            7.1681 +- 0.0174 seconds time elapsed  ( +-  0.24% )

With patch:

[root@archlinux python_stuff]# perf stat -r 20 -B /foo/bin/python3 python-benchmark.py

 Performance counter stats for '/foo/bin/python3 python-benchmark.py' (20 runs):

          3,244.46 msec task-clock                       #    1.445 CPUs utilized               ( +-  0.27% )
            47,142      context-switches                 #   14.530 K/sec                       ( +-  0.43% )
                41      cpu-migrations                   #   12.637 /sec                        ( +-  4.07% )
            27,259      page-faults                      #    8.402 K/sec                       ( +-  0.19% )
    11,065,428,740      cycles                           #    3.411 GHz                         ( +-  0.34% )
     4,633,648,706      instructions                     #    0.42  insn per cycle              ( +-  0.02% )
       735,229,240      branches                         #  226.610 M/sec                       ( +-  0.03% )
         5,904,629      branch-misses                    #    0.80% of all branches             ( +-  0.91% )

           2.24607 +- 0.00299 seconds time elapsed  ( +-  0.13% )

There is probably a better example than the multiprocessing script above, but the issue is clear: allocating a huge virtual memory area just to read 64 KiB and then resizing it is very costly in terms of performance.
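The core observation is easy to reproduce without multiprocessing: even when a single os.read asks for 1 MiB, a default-sized pipe hands back at most 64 KiB per call. A minimal demonstration, assuming a stock Linux pipe capacity of 64 KiB:

```python
import os
import threading

TOTAL = 1 << 20  # push 1 MiB through a pipe

r, w = os.pipe()

def writer():
    # A blocking write to a pipe returns only once all bytes have been
    # written, so this thread blocks until the reader drains the pipe.
    os.write(w, b"z" * TOTAL)
    os.close(w)

t = threading.Thread(target=writer)
t.start()

# Ask for the full 1 MiB in a single read. With the default 64 KiB
# pipe capacity the kernel can return at most 65536 bytes here.
first = os.read(r, TOTAL)

# Drain the rest so the writer thread can finish.
received = len(first)
while received < TOTAL:
    chunk = os.read(r, TOTAL - received)
    if not chunk:
        break
    received += len(chunk)

t.join()
os.close(r)
```

Inspecting `len(first)` shows the single large read was truncated to (at most) the pipe capacity, even though 1 MiB was requested and eventually delivered.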

Has this already been discussed elsewhere?

This is a minor feature that does not need prior discussion elsewhere.

Links to previous discussion of this feature:

No response

Linked PRs

@NicolasT
Contributor

NicolasT commented Jul 6, 2024

Note that on Linux you can change the size of the buffer backing a pipe (up to the value found in /proc/sys/fs/pipe-max-size) using fcntl with the F_SETPIPE_SZ flag. IIRC, some OpenStack Swift services do this (for use with splice).

Hence, changing the defaults could have an impact on code where this buffer has been enlarged beyond the default.

@aplaikner
Contributor Author

I've added a constant that can easily be changed if a pipe buffer has another size.

@aplaikner
Contributor Author

aplaikner commented Aug 19, 2024

Hi all,

Just a friendly reminder about pull request #121315. I submitted the latest version on July 7th, and I wanted to check if there’s any chance someone could take a look when they have time. It seems like @gpshead is the core reviewer assigned to this.

Thank you!

@gpshead gpshead self-assigned this Aug 31, 2024
gpshead pushed a commit that referenced this issue Aug 31, 2024
…size on POSIX systems (GH-121315)

See #121313 for analysis, but this greatly reduces memory overallocation and overhead when multiprocessing is sending non-small data over its pipes between processes.
@gpshead
Member

gpshead commented Aug 31, 2024

Merged, thanks for the contribution. If anyone wants to investigate whether things could be even better with Linux's fcntl F_GETPIPE_SZ and F_SETPIPE_SZ, feel free, and report back; open a new issue if that seems worthwhile.

@picnixz
Member

picnixz commented Sep 1, 2024

Re-opening until either of the two open PRs is merged (to the future committer, if any: don't forget to close the issue if I forget).

@picnixz picnixz reopened this Sep 1, 2024
@picnixz picnixz added the stdlib Python modules in the Lib dir label Sep 1, 2024
gpshead pushed a commit that referenced this issue Sep 3, 2024
…fer size to 64KiB (GH-123559)

Increases the multiprocessing connection buffer size from 8k to 64k for efficiency, without overallocating.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
@gpshead gpshead closed this as completed Sep 3, 2024