gh-120754: Reduce system calls in full-file readall case #120755
Conversation
This reduces the system call count of a simple program[0] that reads all the `.rst` files in Doc by over 10% (5706 -> 4734 system calls on my Linux system, 5813 -> 4875 on my macOS).

This reduces the number of `fstat()` calls always, and seek calls most of the time. Stat was always called twice: once at open (to error early on directories), and a second time to get the size of the file so the whole file could be read in one read. Now the size is cached with the first call.

The code keeps an optimization that if the user had previously read a lot of data, the current position is subtracted from the number of bytes to read. That is somewhat expensive, so it is only done on larger files; otherwise we just try to read the extra bytes and resize the PyBytes as needed.

I built a little test program to validate the behavior + assumptions around relative costs and then ran it under `strace` to get a log of the system calls. Full samples below[1]. After the changes, this is everything in one `filename.read_text()`:

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1) = 0
close(3) = 0
```

This does make some tradeoffs:

1. If the file size changes between open() and readall(), this will still get all the data but might need more read calls.
2. I experimented with avoiding the stat + cached result for small files in general, but on my dev workstation at least that tended to reduce performance compared to using the fstat().

[0]
```python3
from pathlib import Path

nlines = []
for filename in Path("cpython/Doc").glob("**/*.rst"):
    nlines.append(len(filename.read_text()))
```

[1] Before (small file):

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffe52525930) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1) = 0
close(3) = 0
```

After (small file):

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1) = 0
close(3) = 0
```

Before (large file):

```
openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
ioctl(3, TCGETS, 0x7ffe52525930) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104
read(3, "", 1) = 0
close(3) = 0
```

After (large file):

```
openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104
read(3, "", 1) = 0
close(3) = 0
```
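As a sketch of the resulting control flow (written in Python for readability; it mirrors the `_pyio` side of the change, and the threshold constant is illustrative rather than the patch's exact cutoff):

```python3
import os

class FileIOSketch:
    """Minimal sketch of the size-caching approach, not the real code."""

    _LARGE_FILE = 64 * 1024  # illustrative cutoff, not the patch's value

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)
        # The fstat() that open already needed (to error early on
        # directories) now also seeds the size estimate.
        self._estimated_size = os.fstat(self._fd).st_size

    def readall(self):
        if self._estimated_size <= 0:
            bufsize = 8192  # no useful estimate, start from a default
        elif self._estimated_size >= self._LARGE_FILE:
            # Only pay for the lseek() on larger files, where re-reading
            # already-consumed bytes would actually hurt.
            pos = os.lseek(self._fd, 0, os.SEEK_CUR)
            bufsize = max(self._estimated_size - pos, 0) + 1
        else:
            # Small file: ask for size + 1 bytes so a zero-length read()
            # signals EOF without a buffer-grow step.
            bufsize = self._estimated_size + 1
        chunks = []
        while True:
            chunk = os.read(self._fd, bufsize)
            if not chunk:
                return b"".join(chunks)
            chunks.append(chunk)
```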
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the `skip news` label instead.
Could you add some tests? And share benchmark results compared against the main branch?
Is there a standard way to add tests for "this set of system calls is made" or "this many system calls is made"? I tried hunting through the existing tests but couldn't find anything like that, or a good way to do that for underlying C code. Would definitely be nice to have a test around it.

re: Benchmarking, I did some with a test program and included details in the initial commit: 78c4de0; wall clock changes on my dev machine were generally in the noise. Happy to work on running a more general suite.
I simply meant to test that the code still works correctly with the changes you made. Set up a git worktree, build the main branch, and …
For testing, the existing `test_fileio` checks basic behavior of …
I ran the pyperformance benchmark suite and didn't get any big swings, just noise. Writing a little pyperf benchmark around "read whole file":

```python3
import pyperf
from pathlib import Path

def read_file(path_obj):
    path_obj.read_text()

runner = pyperf.Runner()
runner.bench_func('read_file_small', read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func('read_file_large', read_file, Path("Doc/c-api/typeobj.rst"))
```

Results comparing `cmaloney/readall_faster` against `main` on my particular Mac are attached.
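(For anyone reproducing this: each run can be written to JSON with pyperf's `-o` option, e.g. `./python bench_read.py -o main.json` once per build, and the two files diffed with `python -m pyperf compare_to main.json readall_faster.json`; the script and file names here are placeholders.)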
Thanks, this is excellent. See `cpython/Lib/test/test_subprocess.py`, line 3437 at 0152dc4.
@hauntsaninja planning to try and make a separate PR for that (one list item per what I think will be separate commits), then use that infrastructure here (so this PR will get a merge commit + a new commit which updates the test for the reduced system call count). I don't think that needs a separate GH issue to track; if it does, let me know.
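For concreteness, a rough sketch of what such strace-based checking could look like. This is illustrative only, not the eventual shared helper: it assumes a Linux box with `strace` installed, and the parsing is deliberately simplistic.

```python3
import subprocess
import sys

def trace_file_syscalls(path, code):
    # Run `code` in a child interpreter under strace and keep only the
    # syscall lines for the fd that was opened on `path`.
    proc = subprocess.run(
        ["strace", "-e", "trace=openat,fstat,lseek,read,close",
         sys.executable, "-c", code],
        capture_output=True, text=True)
    lines, fd = [], None
    for line in proc.stderr.splitlines():  # strace logs to stderr
        if fd is None:
            if "openat" in line and path in line:
                # e.g. openat(AT_FDCWD, "...", O_RDONLY|O_CLOEXEC) = 3
                fd = line.rpartition("= ")[2].strip()
                lines.append(line)
        elif line.startswith((f"fstat({fd},", f"lseek({fd},",
                              f"read({fd},", f"close({fd}")):
            lines.append(line)
            if line.startswith("close("):
                break
    return lines

calls = trace_file_syscalls(
    "clinic.rst", "open('cpython/Doc/howto/clinic.rst').read()")
# After this change there should be a single fstat() per open + readall.
assert sum(c.startswith("fstat(") for c in calls) == 1
```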
> Is there a standard way to add tests for "this set of system calls is made" or "this many system calls is made"?

The tests for IO are spread around multiple files, but I think `test_fileio` is the best one for this. If you want to check the number of calls being made, you could try to align the Python implementation with the C implementation (which is usually what we try to achieve). Note that the Python implementation calls `read`/`lseek`/`fstat` directly for `FileIO`, so you may also try to mock them. For the C implementation, yes, the `strace` alternative is probably the best, but I think it's a nice idea to see whether you could also improve the Python implementation itself.
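To illustrate the mocking route for the Python implementation: `unittest.mock.patch(..., wraps=...)` can wrap the real `os` functions that `_pyio` calls, leaving behavior unchanged while recording calls. A minimal sketch; the expected count reflects the behavior after this PR and is illustrative:

```python3
import os
import unittest
from unittest import mock

import _pyio

class ReadallSyscallCounts(unittest.TestCase):
    def test_readall_does_not_refstat(self):
        # wraps= keeps the real behavior but records every call.
        with mock.patch("os.fstat", wraps=os.fstat) as fstat:
            with _pyio.FileIO(__file__, "rb") as f:
                f.readall()
        # One fstat() at open() seeds the size estimate; readall()
        # should not need a second one after this change.
        self.assertEqual(fstat.call_count, 1)

if __name__ == "__main__":
    unittest.main()
```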
re: `_pyio`, I'll look at how far its behavior currently is from `_io` when I add the system call test. I would like to not pull getting them to match into the scope of work for this PR. Longer term I would really like to make …
Thanks, this looks good to me!

It might make sense to just use `DEFAULT_BUFFER_SIZE` for your threshold, especially so if #118144 is merged.

I agree that pyio and a strace test can be done in another PR.

I requested review from Serhiy in case he has time to take a look; if not, I'll merge soon.
Misc/NEWS.d/next/Core and Builtins/2024-06-19-19-54-35.gh-issue-120754.uF29sj.rst
Re: `DEFAULT_BUFFER_SIZE`, I actually experimented with "just allocate and try to read `DEFAULT_BUFFER_SIZE` always", and found that for both small and large files it was slower. Not entirely sure what the slowdown was, but it led me to the "cache the size" approach, which is uniformly faster. Definitely an interesting constant to raise, and I think fairly important on the write side. Would be curious to see numbers for read.
size_t is too small (and read would cap it anyways) to read the whole file
Looks good!
A minor nit regarding the comments: I'm going to align them to the existing style used in this file; hope you don't mind :)
Modules/_io/fileio.c (outdated):

```
@@ -72,6 +75,7 @@ typedef struct {
     unsigned int closefd : 1;
     char finalizing;
     unsigned int blksize;
+    Py_off_t size_estimated;
```
I would prefer to use the same name in the C and Python implementations; I suggest renaming this member to `estimated_size`.
Modules/_io/fileio.c (outdated):

```
    bufsize = _PY_READ_MAX;
}
else {
    bufsize = Py_SAFE_DOWNCAST(end, Py_off_t, size_t) + 1;
```
I don't think that this cast is safe; `Py_off_t` can be bigger than `size_t`. You should do something like:

```
bufsize = (size_t)Py_MIN(end, SIZE_MAX);
bufsize++;
```
I ran into issues in test_largefile on Windows x86 which caused me to add this. `Py_off_t` is `long long` there, while `size_t` is only 32 bits:

cpython/Modules/_io/_iomodule.h, lines 95 to 106 at 4f1e1df:

```
#ifdef MS_WINDOWS
/* Windows uses long long for offsets */
typedef long long Py_off_t;
# define PyLong_AsOff_t     PyLong_AsLongLong
# define PyLong_FromOff_t   PyLong_FromLongLong
# define PY_OFF_T_MAX       LLONG_MAX
# define PY_OFF_T_MIN       LLONG_MIN
# define PY_OFF_T_COMPAT    long long /* type compatible with off_t */
# define PY_PRIdOFF         "lld" /* format to use for that type */
#else
```
Oop, misread this. The `if (end >= _PY_READ_MAX)` just before should catch this (`_PY_READ_MAX <= SIZE_MAX`). https://github.com/python/cpython/blob/main/Include/internal/pycore_fileutils.h#L65-L76
Sorry, in fact the maximum is `PY_SSIZE_T_MAX`:

```
bufsize = (size_t)Py_MIN(end, PY_SSIZE_T_MAX);
if (bufsize < PY_SSIZE_T_MAX) {
    bufsize++;
}
```
In this case, replace `bufsize = Py_SAFE_DOWNCAST(end, Py_off_t, size_t) + 1;` with just `bufsize = (size_t)end + 1;`. I just dislike the `Py_SAFE_DOWNCAST()` macro; it's not safe, and the name is misleading.
```
if self._estimated_size <= 0:
    bufsize = DEFAULT_BUFFER_SIZE
else:
    bufsize = self._estimated_size + 1
```
What is the purpose of the "+1"? It may overallocate 1 byte, which is inefficient.
The read loop currently needs to do an `os.read()` / `_Py_read` of a single byte which returns size 0 in order to find the end of the file and exit the loop. The very beginning of that loop checks "if buffer is full, grow buffer", so not over-allocating by one byte triggers a much bigger allocation there. In the `_io` case the buffer is then shrunk back down at the end, whereas in the `_pyio` case the EOF read is never appended.

We could avoid the extra byte by writing a specialized "read known size" path (with a fallback to "read until EOF"), but I was trying to avoid making more variants of the read loop and to limit risk a bit.

As an aside: the `_pyio` implementation seems to have a lot of extra memory allocation and copying in the default case, because `os.read()` internally allocates a buffer which it then copies into its `bytearray`...
I'll work on renaming the members to be consistent tomorrow
Per review, update range checks to be more clear and accurate
Co-authored-by: Victor Stinner <vstinner@python.org>
LGTM
Merged, thank you. It's a nice optimization.
@cmaloney: Oh, test_largefile failed. Can you investigate?
Looks like just the C implementation (…)
@vstinner I think #121357 will fix the failure, although I'm unable to reproduce locally so far.