Skip to content

GH-116380: Speed up glob.[i]glob() by making fewer system calls (take 2) #137474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Aug 6, 2025

This was previously merged and then reverted because a newly-added test failed on one buildbot. I'm not sure that the test was ever valid, so I've removed it here.

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py, with no further filesystem access needed.

This fixes an issue where glob() could return duplicate results.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

  • Certain special pattern segments ("", "." and "..") leave the flag unchanged
  • Literal pattern segments (e.g. foo/bar) set the flag to false
  • Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
  • Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.

Minor speed-ups

  • Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
  • Use a stack rather than recursion to implement recursive wildcards.
    • This fixes a recursion error when globbing deep trees.
  • Pre-compile regular expressions and pre-join literal pattern segments.
  • Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
  • Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
  • Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

  1. Support for dir_fd
  2. Support for include_hidden
  3. Support for generating paths relative to root_dir
  4. Support for suppressing the top-level path

This unifies the implementations of globbing in the glob and pathlib modules.

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern speedup
Lib/* 1.89x
Lib/*/ 1.89x
Lib/*.py 1.29x
Lib/** 5.27x
Lib/**/ 1.21x
Lib/**/* 1.88x
Lib/**/** 16.4x
Lib/**/*/ 2.13x
Lib/**/*.py 1.70x
Lib/**/__init__.py 1.01x
Lib/**/*/*.py 2.33x
Lib/**/*/__init__.py 1.68x

📚 Documentation preview 📚: https://cpython-previews--137474.org.readthedocs.build/

…ls (take 2)

## Filtered recursive walk

Expanding a recursive `**` segment entails walking the entire directory
tree, and so any subsequent pattern segments (except special segments) can
be evaluated by filtering the expanded paths through a regex. For example,
`glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with
`os.scandir()`, and then filters paths through a regex based on "`**/*.py`,
with no further filesystem access needed.

This fixes an issue where `glob()` could return duplicate results.

## Tracking path existence

We store a flag alongside each path indicating whether the path is
guaranteed to exist. As we process the pattern:

- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag
  unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because
  children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the
  root path, and set it to true for descendants discovered via
  `os.scandir()`.

If the flag is false at the end, we call `lstat()` on each path to filter
out missing paths.

## Minor speed-ups

- Exclude paths that don't match a non-terminal non-recursive wildcard
  pattern _prior_ to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
  - This fixes a recursion error when globbing deep trees.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than
  supporting `bytes` throughout. This particularly simplifies the code
  needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized
  form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex
  matching.

## Implementation notes

Much of this functionality is already present in pathlib's implementation
of globbing. The specific additions we make are:

1. Support for `dir_fd`
2. Support for `include_hidden`
3. Support for generating paths relative to `root_dir`

This unifies the implementations of globbing in the `glob` and `pathlib`
modules.

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Lib/glob.py Outdated
Comment on lines 172 to 173
def __init__(self, sep=os.path.sep, case_sensitive=os.name != 'nt',
case_pedantic=False, recursive=False, include_hidden=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make these keyword-only arguments? Might improve readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind expanding on that suggestion a bit? I don't follow how that would help

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. def __init__(self, *, sep=os.path.sep, ...). No behavioural changes, but it might help for future refactorings as it will enforce kw usage at the call-site. Not an issue in this case, though, as _StringGlobber() is already constructed with keyword-arguments.

barneygale and others added 2 commits August 6, 2025 18:58
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
@barneygale barneygale requested a review from AA-Turner August 7, 2025 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting core review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants