GH-115060: Speed up `pathlib.Path.glob()` by removing redundant regex matching #115061

barneygale · 2024-02-06T04:10:20Z

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object.

Also skip compiling a pattern when expanding a `*` wildcard segment.
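
As a rough illustration of the `*` fast path (not the code in this PR), the sketch below skips regex compilation when a component is a bare `*`, since every child name matches. The helper name `select_children` and the use of `fnmatch.translate` are assumptions made for this example.

```python
import fnmatch
import os
import re
import tempfile


def select_children(directory, component):
    """Yield entry names in `directory` that match one glob component."""
    if component == '*':
        # Fast path: a bare '*' accepts every name, so no regex is compiled.
        match = None
    else:
        # Any other component is translated to a regex and compiled once.
        match = re.compile(fnmatch.translate(component)).match
    with os.scandir(directory) as entries:
        names = [entry.name for entry in entries]
    for name in names:
        if match is None or match(name):
            yield name


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.py', 'b.txt', '.hidden'):
        open(os.path.join(tmp, name), 'w').close()
    everything = sorted(select_children(tmp, '*'))
    python_only = sorted(select_children(tmp, '*.py'))
```

Here `everything` lists all three names without any pattern being built, while `python_only` goes through the compiled regex.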

Issue: Speed up pathlib.Path.glob() by removing redundant regex matching #115060

barneygale · 2024-02-06T04:20:36Z

Notable improvements:

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('*', follow_symlinks=False))"
2000 loops, best of 5: 180 usec per loop  # before
2000 loops, best of 5: 159 usec per loop  # after
# --> 1.13x faster

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*.py', follow_symlinks=False))"
5 loops, best of 5: 54   msec per loop  # before
5 loops, best of 5: 40.9 msec per loop  # after
# --> 1.32x faster

Everything else is about the same.

zooba · 2024-02-08T00:22:51Z

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

Since it doesn't require changing any test cases, and I know the test cases are pretty thorough for this area, I don't think there's any reason to not sign off. Maybe trigger a buildbot run with the tag to make sure it doesn't behave strangely on any of those setups - they can occasionally be a bit unusual and find some edge cases.

barneygale · 2024-02-08T17:13:11Z

Thanks Steve.

> For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

The algorithm might be worthy of a blog post at this point!

The main change is that we now filter partial paths through a regex corresponding to a partial pattern in _select_recursive, rather than complete paths through a regex corresponding to a complete pattern in PathBase.glob(). We can do this because previous parts have already been filtered by _select_children(), and so there's no need to re-filter them.
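
To make the equivalence concrete, here is a minimal sketch, assuming POSIX-style `/` separators and a simplified glob-to-regex translation. The helpers `component_regex` and `compile_components` are hypothetical and are not pathlib's internals. It shows that once the leading `src` component has been selected, matching only the remainder of each path against the partial pattern gives the same result as matching the complete path against the complete pattern.

```python
import re


def component_regex(component):
    """Hypothetical helper: translate one glob component to a regex fragment."""
    if component == '**':
        # Zero or more complete directory names, each followed by a slash.
        return r'(?:[^/]+/)*'
    fragment = ''.join('[^/]*' if ch == '*' else re.escape(ch) for ch in component)
    return fragment + '/'


def compile_components(components):
    """Compile a list of glob components into a single regex."""
    regex = ''.join(component_regex(c) for c in components)
    if regex.endswith('/'):
        regex = regex[:-1]  # the final component has no trailing slash
    return re.compile(regex)


paths = ['src/a.py', 'src/pkg/b.py', 'src/pkg/c.txt', 'src/pkg/sub/d.py']

# Complete path against complete pattern.
full = compile_components(['src', '**', '*.py'])
matched_full = [p for p in paths if full.fullmatch(p)]

# 'src' is already known to match, so only the part after 'src/' is checked,
# using the `pos` argument of Pattern.fullmatch().
partial = compile_components(['**', '*.py'])
parent_len = len('src/')
matched_partial = [p for p in paths if partial.fullmatch(p, parent_len)]
```

Both lists contain the same three `.py` paths; the partial regex is shorter and never re-examines the already-filtered prefix.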

The secondary change (which includes the addition of _entry_str()) is to match against os.DirEntry.path directly, which allows us to skip construction of path objects for files that don't match.
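
A simplified sketch of that idea, assuming a plain `os.scandir()` walk that only yields files: the regex is applied to the `os.DirEntry.path` string, and a `Path` is constructed only for entries that match. The function name `iter_matching` is made up for this example and does not correspond to `_entry_str()` or any other pathlib internal.

```python
import os
import re
import tempfile
from pathlib import Path


def iter_matching(root, pattern):
    """Yield Path objects for files under `root` whose path below `root` matches."""
    root = os.fspath(root)
    # Length of "root/" so that matching starts after the parent prefix.
    parent_len = len(os.path.join(root, ''))
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            # Close the scandir iterator before descending further.
            with os.scandir(current) as it:
                entries = list(it)
        except OSError:
            continue
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif pattern.fullmatch(entry.path, parent_len):
                # Only matching entries pay the cost of building a Path.
                yield Path(entry.path)


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'pkg'))
    for rel in ('a.py', os.path.join('pkg', 'b.py'), os.path.join('pkg', 'c.txt')):
        open(os.path.join(tmp, rel), 'w').close()
    sep = re.escape(os.sep)
    # Any number of directories, then a name ending in ".py".
    pattern = re.compile(rf'(?:[^{sep}]+{sep})*[^{sep}]*\.py')
    found = sorted(p.name for p in iter_matching(tmp, pattern))
```

In this run `c.txt` is rejected while it is still a plain string, so no path object is ever created for it.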

zooba · 2024-02-08T22:07:39Z

Okay, today it made sense :) Guess I'm more awake right now. Reading the changes from the bottom up might have helped as well.

Personally, I don't think you can have too many comments in an algorithm like this, particularly when it's recursive and split between a couple of functions. I'll suggest a few comments that would've helped me, but I don't think there are any code changes needed.

zooba

Just comments that may help make it more understandable. No changes required.

Lib/pathlib/_abc.py

Lib/pathlib/__init__.py

barneygale added the performance and topic-pathlib labels Feb 6, 2024

barneygale requested a review from zooba February 6, 2024 04:10

bedevere-app bot added the awaiting core review label Feb 6, 2024

bedevere-app bot mentioned this pull request Feb 6, 2024

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

Closed

barneygale added 4 commits February 6, 2024 04:58

Match against os.DirEntry.path in _select_recursive()

6abb80d

Matching against dot-prefixed path is fine (and faster!)

b382e40

Revert "Matching against dot-prefixed path is fine (and faster!)"

e1472fc

This reverts commit b382e40.

Skip computing prefix len when not matching

284c42e

Rename prefix_len --> parent_len for clarity.

169b1e7

zooba reviewed Feb 8, 2024

Lib/pathlib/_abc.py (resolved)

Lib/pathlib/_abc.py (resolved)

Lib/pathlib/_abc.py (resolved)

Lib/pathlib/__init__.py (outdated, resolved)

barneygale added 4 commits February 8, 2024 22:56

Comments, naming.

1c4184f

segment --> component

2873ed8

Test post-** matching when globbing ..

90d5a12

Couple more test cases

a40924b

barneygale merged commit 6f93b4d into python:main Feb 10, 2024

bedevere-app bot removed the awaiting core review label Feb 10, 2024
