Skip to content

GH-116380: Speed up glob.[i]glob() by making fewer system calls (take 2) #137474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
GH-116380: Speed up glob.[i]glob() by making fewer system calls (ta…
…ke 2)

## Filtered recursive walk

Expanding a recursive `**` segment entails walking the entire directory
tree, and so any subsequent pattern segments (except special segments) can
be evaluated by filtering the expanded paths through a regex. For example,
`glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with
`os.scandir()`, and then filters paths through a regex based on "`**/*.py`,
with no further filesystem access needed.

This fixes an issue where `glob()` could return duplicate results.

## Tracking path existence

We store a flag alongside each path indicating whether the path is
guaranteed to exist. As we process the pattern:

- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag
  unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because
  children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the
  root path, and set it to true for descendants discovered via
  `os.scandir()`.

If the flag is false at the end, we call `lstat()` on each path to filter
out missing paths.

## Minor speed-ups

- Exclude paths that don't match a non-terminal non-recursive wildcard
  pattern _prior_ to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
  - This fixes a recursion error when globbing deep trees.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than
  supporting `bytes` throughout. This particularly simplifies the code
  needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized
  form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex
  matching.

## Implementation notes

Much of this functionality is already present in pathlib's implementation
of globbing. The specific additions we make are:

1. Support for `dir_fd`
2. Support for `include_hidden`
3. Support for generating paths relative to `root_dir`

This unifies the implementations of globbing in the `glob` and `pathlib`
modules.

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
  • Loading branch information
3 people committed Aug 6, 2025
commit 74aed7628efb5c1b25840d03d730e6f9c42096fe
18 changes: 10 additions & 8 deletions Doc/library/glob.rst
Original file line number Diff line number Diff line change
Expand Up @@ -75,10 +75,6 @@ The :mod:`glob` module defines the following functions:
Using the "``**``" pattern in large directory trees may consume
an inordinate amount of time.

.. note::
This function may return duplicate path names if *pathname*
contains multiple "``**``" patterns and *recursive* is true.

.. versionchanged:: 3.5
Support for recursive globs using "``**``".

Expand All @@ -88,6 +84,11 @@ The :mod:`glob` module defines the following functions:
.. versionchanged:: 3.11
Added the *include_hidden* parameter.

.. versionchanged:: 3.14
Matching path names are returned only once. In previous versions, this
function may return duplicate path names if *pathname* contains multiple
"``**``" patterns and *recursive* is true.


.. function:: iglob(pathname, *, root_dir=None, dir_fd=None, recursive=False, \
include_hidden=False)
Expand All @@ -98,10 +99,6 @@ The :mod:`glob` module defines the following functions:
.. audit-event:: glob.glob pathname,recursive glob.iglob
.. audit-event:: glob.glob/2 pathname,recursive,root_dir,dir_fd glob.iglob

.. note::
This function may return duplicate path names if *pathname*
contains multiple "``**``" patterns and *recursive* is true.

.. versionchanged:: 3.5
Support for recursive globs using "``**``".

Expand All @@ -111,6 +108,11 @@ The :mod:`glob` module defines the following functions:
.. versionchanged:: 3.11
Added the *include_hidden* parameter.

.. versionchanged:: 3.14
Matching path names are yielded only once. In previous versions, this
function may yield duplicate path names if *pathname* contains multiple
"``**``" patterns and *recursive* is true.


.. function:: escape(pathname)

Expand Down
8 changes: 8 additions & 0 deletions Doc/whatsnew/3.15.rst
Original file line number Diff line number Diff line change
Expand Up @@ -234,6 +234,14 @@ difflib
(Contributed by Jiahao Li in :gh:`134580`.)


glob
----

* Reduce the number of system calls in :func:`glob.glob` and :func:`~glob.iglob`,
thereby improving the speed of globbing operations by 20-80%.
(Contributed by Barney Gale in :gh:`116380`.)


hashlib
-------

Expand Down
Loading
Loading