Skip to content

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Feb 6, 2024

When expanding and filtering paths for a ** wildcard segment, build an re.Pattern object from the subsequent pattern parts, rather than the entire pattern, and match against the os.DirEntry object prior to instantiating a path object.

Also skip compiling a pattern when expanding a * wildcard segment.

… regex matching

When expanding and filtering paths for a `**` wildcard segment, build an
`re.Pattern` object from the subsequent pattern parts, rather than the
entire pattern.

Also skip compiling a pattern when expanding a `*` wildcard segment.
@barneygale
Copy link
Contributor Author

barneygale commented Feb 6, 2024

Notable improvements:

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('*', follow_symlinks=False))"
2000 loops, best of 5: 180 usec per loop  # before
2000 loops, best of 5: 159 usec per loop  # after
# --> 1.13x faster

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*.py', follow_symlinks=False))"
5 loops, best of 5: 54   msec per loop  # before
5 loops, best of 5: 40.9 msec per loop  # after
# --> 1.32x faster

Everything else is about the same.

@zooba
Copy link
Member

zooba commented Feb 8, 2024

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

Since it doesn't require changing any test cases, and I know the tests cases are pretty thorough for this area, I don't think there's any reason to not sign off. Maybe trigger a buildbot run with the tag to make sure it doesn't behave strangely on any of those setups - they can occasionally be a bit unusual and find some edge cases.

@barneygale
Copy link
Contributor Author

barneygale commented Feb 8, 2024

Thanks Steve.

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

The algorithm might be worthy of a blog post at this point!

The main change is that we now filter partial paths through a regex corresponding to a partial pattern in _select_recursive, rather than complete paths through a regex corresponding to a complete pattern in PathBase.glob(). We can do this because previous parts have already been filtered by _select_children(), and so there's no need to re-filter them.

The secondary change (which includes the addition of _entry_str()) is to match against os.DirEntry.path directly, which allows us to skip construction of path objects for files that don't match.

@zooba
Copy link
Member

zooba commented Feb 8, 2024

Okay, today it made sense :) Guess I'm more awake right now. Reading the changes from the bottom up might have helped as well.

Personally, I don't think you can have too many comments in an algorithm like this, particularly when it's recursive and split between a couple of functions. I'll suggest a few comments that would've helped me, but I don't think there are any code changes needed.

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just comments that may help make it more understandable. No changes required

@barneygale barneygale merged commit 6f93b4d into python:main Feb 10, 2024
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this pull request Feb 14, 2024
… regex matching (python#115061)

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants