-
-
Notifications
You must be signed in to change notification settings - Fork 32.5k
GH-116380: Speed up glob.[i]glob()
by making fewer system calls (take 2)
#137474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ls (take 2) ## Filtered recursive walk Expanding a recursive `**` segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on "`**/*.py`, with no further filesystem access needed. This fixes an issue where `glob()` could return duplicate results. ## Tracking path existence We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern: - Certain special pattern segments (`""`, `"."` and `".."`) leave the flag unchanged - Literal pattern segments (e.g. `foo/bar`) set the flag to false - Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`) - Recursive pattern segments (e.g. `**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`. If the flag is false at the end, we call `lstat()` on each path to filter out missing paths. ## Minor speed-ups - Exclude paths that don't match a non-terminal non-recursive wildcard pattern _prior_ to calling `is_dir()`. - Use a stack rather than recursion to implement recursive wildcards. - This fixes a recursion error when globbing deep trees. - Pre-compile regular expressions and pre-join literal pattern segments. - Convert to/from `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`. - Avoid calling `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed. - Avoid calling `os.path.normcase()`; instead we use case-insensitive regex matching. ## Implementation notes Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are: 1. Support for `dir_fd` 2. Support for `include_hidden` 3. Support for generating paths relative to `root_dir` This unifies the implementations of globbing in the `glob` and `pathlib` modules. Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Lib/glob.py
Outdated
def __init__(self, sep=os.path.sep, case_sensitive=os.name != 'nt', | ||
case_pedantic=False, recursive=False, include_hidden=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps make these keyword-only arguments? Might improve readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind expanding on that suggestion a bit? I don't follow how that would help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. def __init__(self, *, sep=os.path.sep, ...)
. No behavioural changes, but it might help for future refactorings as it will enforce kw usage at the call-site. Not an issue in this case, though, as _StringGlobber()
is already constructed with keyword-arguments.
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
This was previously merged and then reverted because a newly-added test failed on one buildbot. I'm not sure that the test was ever valid, so I've removed it here.
Filtered recursive walk
Expanding a recursive
**
segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example,glob.glob("foo/**/*.py", recursive=True)
recursively walksfoo/
withos.scandir()
, and then filters paths through a regex based on "**/*.py
, with no further filesystem access needed.This fixes an issue where
glob()
could return duplicate results.Tracking path existence
We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:
""
,"."
and".."
) leave the flag unchangedfoo/bar
) set the flag to false*/*.py
) set the flag to true (because children are found viaos.scandir()
)**
) leave the flag unchanged for the root path, and set it to true for descendants discovered viaos.scandir()
.If the flag is false at the end, we call
lstat()
on each path to filter out missing paths.Minor speed-ups
is_dir()
.bytes
(a minor use-case) iniglob()
rather than supportingbytes
throughout. This particularly simplifies the code needed to handle relative bytes paths withdir_fd
.os.path.join()
; instead we keep paths in a normalized form and append trailing slashes when needed.os.path.normcase()
; instead we use case-insensitive regex matching.Implementation notes
Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:
dir_fd
include_hidden
root_dir
This unifies the implementations of globbing in the
glob
andpathlib
modules.Results
Speedups via
python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)"
from CPython source directory on Linux:Lib/*
Lib/*/
Lib/*.py
Lib/**
Lib/**/
Lib/**/*
Lib/**/**
Lib/**/*/
Lib/**/*.py
Lib/**/__init__.py
Lib/**/*/*.py
Lib/**/*/__init__.py
glob.glob()
by reducing number of system calls made #116380📚 Documentation preview 📚: https://cpython-previews--137474.org.readthedocs.build/