GH-116380: Speed up `glob.[i]glob()` by making fewer system calls (take 2) #137474

barneygale · 2025-08-06T16:22:41Z

This was previously merged and then reverted because a newly-added test failed on one buildbot. I'm not sure that the test was ever valid, so I've removed it here.

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py, with no further filesystem access needed.

This fixes an issue where glob() could return duplicate results.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

Certain special pattern segments ("", "." and "..") leave the flag unchanged
Literal pattern segments (e.g. foo/bar) set the flag to false
Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.

Minor speed-ups

Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
Use a stack rather than recursion to implement recursive wildcards.
- This fixes a recursion error when globbing deep trees.
Pre-compile regular expressions and pre-join literal pattern segments.
Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

Support for dir_fd
Support for include_hidden
Support for generating paths relative to root_dir
Support for suppressing the top-level path

This unifies the implementations of globbing in the glob and pathlib modules.

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern	speedup
`Lib/*`	1.89x
`Lib/*/`	1.89x
`Lib/*.py`	1.29x
`Lib/**`	5.27x
`Lib/**/`	1.21x
`Lib/*/`	1.88x
`Lib//`	16.4x
`Lib/*//`	2.13x
`Lib/*/.py`	1.70x
`Lib/**/__init__.py`	1.01x
`Lib/*//*.py`	2.33x
`Lib/*//__init__.py`	1.68x

Issue: Speed up glob.glob() by reducing number of system calls made #116380

📚 Documentation preview 📚: https://cpython-previews--137474.org.readthedocs.build/

…ls (take 2) ## Filtered recursive walk Expanding a recursive `**` segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on "`**/*.py`, with no further filesystem access needed. This fixes an issue where `glob()` could return duplicate results. ## Tracking path existence We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern: - Certain special pattern segments (`""`, `"."` and `".."`) leave the flag unchanged - Literal pattern segments (e.g. `foo/bar`) set the flag to false - Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`) - Recursive pattern segments (e.g. `**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`. If the flag is false at the end, we call `lstat()` on each path to filter out missing paths. ## Minor speed-ups - Exclude paths that don't match a non-terminal non-recursive wildcard pattern _prior_ to calling `is_dir()`. - Use a stack rather than recursion to implement recursive wildcards. - This fixes a recursion error when globbing deep trees. - Pre-compile regular expressions and pre-join literal pattern segments. - Convert to/from `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`. - Avoid calling `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed. - Avoid calling `os.path.normcase()`; instead we use case-insensitive regex matching. ## Implementation notes Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are: 1. Support for `dir_fd` 2. Support for `include_hidden` 3. Support for generating paths relative to `root_dir` This unifies the implementations of globbing in the `glob` and `pathlib` modules. Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

Doc/library/glob.rst

Lib/glob.py

AA-Turner · 2025-08-06T17:44:00Z

Lib/glob.py

+    def __init__(self, sep=os.path.sep, case_sensitive=os.name != 'nt',
+                 case_pedantic=False, recursive=False, include_hidden=False):


Perhaps make these keyword-only arguments? Might improve readability.

Would you mind expanding on that suggestion a bit? I don't follow how that would help

i.e. def __init__(self, *, sep=os.path.sep, ...). No behavioural changes, but it might help for future refactorings as it will enforce kw usage at the call-site. Not an issue in this case, though, as _StringGlobber() is already constructed with keyword-arguments.

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>

Doc/library/glob.rst

barneygale added the performance Performance or resource usage label Aug 6, 2025

bedevere-app bot added the awaiting core review label Aug 6, 2025

bedevere-app bot mentioned this pull request Aug 6, 2025

Speed up glob.glob() by reducing number of system calls made #116380

Open

Fix version number

7cc555f

AA-Turner reviewed Aug 6, 2025

View reviewed changes

barneygale and others added 2 commits August 6, 2025 18:58

Apply suggestions from code review

51515ce

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>

Address review feedback

9843b6a

AA-Turner reviewed Aug 6, 2025

View reviewed changes

Doc/library/glob.rst Outdated Show resolved Hide resolved

barneygale added 3 commits August 6, 2025 19:02

Errant tabs

a116d94

Make Globber arguments keyword-only

e2bb3cb

Tweak Globber argument defaults

c093c4e

barneygale requested a review from AA-Turner August 7, 2025 20:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

GH-116380: Speed up `glob.[i]glob()` by making fewer system calls (take 2) #137474

GH-116380: Speed up `glob.[i]glob()` by making fewer system calls (take 2) #137474

barneygale commented Aug 6, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AA-Turner Aug 6, 2025

Uh oh!

barneygale Aug 6, 2025

Uh oh!

AA-Turner Aug 6, 2025

Uh oh!

Uh oh!

Uh oh!

		def __init__(self, sep=os.path.sep, case_sensitive=os.name != 'nt',
		case_pedantic=False, recursive=False, include_hidden=False):

Uh oh!

GH-116380: Speed up glob.[i]glob() by making fewer system calls (take 2) #137474

Are you sure you want to change the base?

GH-116380: Speed up glob.[i]glob() by making fewer system calls (take 2) #137474

Conversation

barneygale commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Filtered recursive walk

Tracking path existence

Minor speed-ups

Implementation notes

Results

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AA-Turner Aug 6, 2025

Choose a reason for hiding this comment

Uh oh!

barneygale Aug 6, 2025

Choose a reason for hiding this comment

Uh oh!

AA-Turner Aug 6, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

GH-116380: Speed up `glob.[i]glob()` by making fewer system calls (take 2) #137474

GH-116380: Speed up `glob.[i]glob()` by making fewer system calls (take 2) #137474

barneygale commented Aug 6, 2025 •

edited

Loading