Skip to content

Commit 5810430

Browse files
committed
Convert regex engine's subre tree from binary to N-ary style.
Instead of having left and right child links in subre structs, have a single child link plus a sibling link. Multiple children of a tree node are now reached by chasing the sibling chain. The beneficiary of this is alternation tree nodes. A regular expression with N (>1) branches is now represented by one alternation node with N children, rather than a tree that includes N alternation nodes as well as N children. While the old representation didn't really cost anything extra at execution time, it was pretty horrid for compilation purposes, because each of the alternation nodes had its own NFA, which we were too stupid not to separately optimize. (To make matters worse, all of those NFAs described the entire alternation pattern, not just the portion of it that one might expect from the tree structure.) We continue to require concatenation nodes to have exactly two children. This data structure is now prepared to support more, but the executor's logic would need some careful redesign, and it's not clear that a lot of benefit could be had. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
1 parent cebc1d3 commit 5810430

File tree

4 files changed

+185
-165
lines changed

4 files changed

+185
-165
lines changed

src/backend/regex/README

+2-2
Original file line numberDiff line numberDiff line change
@@ -129,9 +129,9 @@ If not, we can reject the match immediately without iterating through many
129129
possibilities.
130130

131131
As an example, consider the regex "(a[bc]+)\1". The compiled
132-
representation will have a top-level concatenation subre node. Its left
132+
representation will have a top-level concatenation subre node. Its first
133133
child is a capture node, and the child of that is a plain DFA node for
134-
"a[bc]+". The concatenation's right child is a backref node for \1.
134+
"a[bc]+". The concatenation's second child is a backref node for \1.
135135
The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
136136
where the backref has been replaced by a copy of the DFA for its referent
137137
expression. When executed, the concatenation node will have to search for

0 commit comments

Comments
 (0)