Commit 2a0af7f

Allow complemented character class escapes within regex brackets.
The complement-class escapes \D, \S, \W are now allowed within bracket expressions. There is no semantic difficulty with doing that, but the rather hokey macro-expansion-based implementation previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w" is now equivalent to "[[:word:]]" outside brackets, or "[:word:]" within brackets. POSIX allows such implementation-specific extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs such as "[\w-_]" are now disallowed, as our documentation has always said they should be: character classes can't be endpoints of a range. Previously, because \w was just a macro for "[:alnum:]_", such a construct was read as "[[:alnum:]_-_]", so it was accepted so long as the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs() to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors involved.

* Get rid of useless-as-far-as-I-can-see calls of element() on single-character character element names in brackpart(). element() always maps these to the character itself, and things would be quite broken if it didn't --- should "[a]" match something different than "a" does? Besides, the shortcut path in brackpart() wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
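The old in-bracket behavior described in the commit message can be modeled as plain text substitution. The following is only an illustrative sketch, not the removed lexnest() code: the function name expand_bracket_escapes is hypothetical, and it handles just the three escapes (\d, \s, \w) that the old implementation accepted inside brackets, using the same expansion strings as the old brbackd, brbacks, and brbackw arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: textually expand \d, \s and \w
 * the way the old in-bracket macros did.  Returns 0 on success, or -1 if
 * the output buffer is too small or the escape is not one of those three.
 */
static int
expand_bracket_escapes(const char *in, char *out, size_t outsz)
{
	size_t		used = 0;

	assert(in != NULL && out != NULL);
	if (outsz == 0)
		return -1;

	while (*in != '\0')
	{
		const char *piece;
		char		single[2];
		size_t		len;

		if (in[0] == '\\' && in[1] != '\0')
		{
			switch (in[1])
			{
				case 'd':
					piece = "[:digit:]";
					break;
				case 's':
					piece = "[:space:]";
					break;
				case 'w':
					piece = "[:alnum:]_";	/* the trailing "_" is the trap */
					break;
				default:
					return -1;	/* e.g. \D: no textual expansion works */
			}
			in += 2;
		}
		else
		{
			/* ordinary character: copy it through unchanged */
			single[0] = *in++;
			single[1] = '\0';
			piece = single;
		}

		len = strlen(piece);
		if (used + len + 1 > outsz)
			return -1;
		memcpy(out + used, piece, len);
		used += len;
	}
	out[used] = '\0';
	return 0;
}
```

Feeding this model the input [\w-_] produces [[:alnum:]_-_], in which the tail "_-_" parses as a one-character range. That is the accident that made the old lexer accept a construct the documentation forbids.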
1 parent 6b40d9b commit 2a0af7f

10 files changed, +672 -271 lines changed

doc/src/sgml/func.sgml

Lines changed: 12 additions & 13 deletions
@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     non-ASCII characters to belong to any of these classes.)
     In addition to these standard character
     classes, <productname>PostgreSQL</productname> defines
+    the <literal>word</literal> character class, which is the same as
+    <literal>alnum</literal> plus the underscore (<literal>_</literal>)
+    character, and
     the <literal>ascii</literal> character class, which contains exactly
     the 7-bit ASCII set.
    </para>
@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     matching empty strings at the beginning
     and end of a word respectively.  A word is defined as a sequence
     of word characters that is neither preceded nor followed by word
-    characters.  A word character is an <literal>alnum</literal> character (as
-    defined by the <acronym>POSIX</acronym> character class described above)
-    or an underscore.  This is an extension, compatible with but not
+    characters.  A word character is any character belonging to the
+    <literal>word</literal> character class, that is, any letter, digit,
+    or underscore.  This is an extension, compatible with but not
     specified by <acronym>POSIX</acronym> 1003.2, and should be used with
     caution in software intended to be portable to other systems.
     The constraint escapes described below are usually preferable; they
@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

        <row>
         <entry> <literal>\w</literal> </entry>
-        <entry> <literal>[[:alnum:]_]</literal>
-        (note underscore is included) </entry>
+        <entry> <literal>[[:word:]]</literal> </entry>
        </row>

        <row>
@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

        <row>
         <entry> <literal>\W</literal> </entry>
-        <entry> <literal>[^[:alnum:]_]</literal>
-        (note underscore is included) </entry>
+        <entry> <literal>[^[:word:]]</literal> </entry>
        </row>
       </tbody>
      </tgroup>
     </table>

    <para>
-    Within bracket expressions, <literal>\d</literal>, <literal>\s</literal>,
-    and <literal>\w</literal> lose their outer brackets,
-    and <literal>\D</literal>, <literal>\S</literal>, and <literal>\W</literal> are illegal.
-    (So, for example, <literal>[a-c\d]</literal> is equivalent to
+    The class-shorthand escapes also work within bracket expressions,
+    although the definitions shown above are not quite syntactically
+    valid in that context.
+    For example, <literal>[a-c\d]</literal> is equivalent to
     <literal>[a-c[:digit:]]</literal>.
-    Also, <literal>[a-c\D]</literal>, which is equivalent to
-    <literal>[a-c^[:digit:]]</literal>, is illegal.)
    </para>

    <table id="posix-constraint-escapes-table">

src/backend/regex/re_syntax.n

Lines changed: 4 additions & 9 deletions
@@ -519,15 +519,10 @@ character classes:
 (note underscore)
 .RE
 .PP
-Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
-and `\fB\ew\fR'\&
-lose their outer brackets,
-and `\fB\eD\fR', `\fB\eS\fR',
-and `\fB\eW\fR'\&
-are illegal.
-.VS 8.2
-(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
-Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
+The class-shorthand escapes also work within bracket expressions,
+although the definitions shown above are not quite syntactically
+valid in that context.
+For example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
 .VE 8.2
 .PP
 A constraint escape (AREs only) is a constraint,

src/backend/regex/regc_color.c

Lines changed: 30 additions & 4 deletions
@@ -936,7 +936,16 @@ okcolors(struct nfa *nfa,
 		}
 		else if (cd->nschrs == 0 && cd->nuchrs == 0)
 		{
-			/* parent empty, its arcs change color to subcolor */
+			/*
+			 * Parent is now empty, so just change all its arcs to the
+			 * subcolor, then free the parent.
+			 *
+			 * It is not obvious that simply relabeling the arcs like this is
+			 * OK; it appears to risk creating duplicate arcs.  We are
+			 * basically relying on the assumption that processing of a
+			 * bracket expression can't create arcs of both a color and its
+			 * subcolor between the bracket's endpoints.
+			 */
 			cd->sub = NOSUB;
 			scd = &cm->cd[sco];
 			assert(scd->nschrs > 0 || scd->nuchrs > 0);
@@ -1062,17 +1071,34 @@ colorcomplement(struct nfa *nfa,
 	struct colordesc *cd;
 	struct colordesc *end = CDEND(cm);
 	color		co;
+	struct arc *a;

 	assert(of != from);

 	/* A RAINBOW arc matches all colors, making the complement empty */
 	if (findarc(of, PLAIN, RAINBOW) != NULL)
 		return;

+	/* Otherwise, transiently mark the colors that appear in of's out-arcs */
+	for (a = of->outs; a != NULL; a = a->outchain)
+	{
+		if (a->type == PLAIN)
+		{
+			assert(a->co >= 0);
+			cd = &cm->cd[a->co];
+			assert(!UNUSEDCOLOR(cd));
+			cd->flags |= COLMARK;
+		}
+	}
+
+	/* Scan colors, clear transient marks, add arcs for unmarked colors */
 	for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
-		if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
-			if (findarc(of, PLAIN, co) == NULL)
-				newarc(nfa, type, co, from, to);
+	{
+		if (cd->flags & COLMARK)
+			cd->flags &= ~COLMARK;
+		else if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
+			newarc(nfa, type, co, from, to);
+	}
 }
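
The colorcomplement() change above replaces a per-color findarc() search (roughly colors times arcs work) with a two-pass mark-and-sweep (roughly colors plus arcs). A minimal sketch of that technique follows, under simplifying assumptions: colors are plain array indexes, the out-arcs are given as a list of color numbers, and the names demo_complement, DEMO_MARK, and DEMO_SKIP are hypothetical stand-ins rather than identifiers from the regex code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MARK 0x1			/* transient mark; stand-in for COLMARK */
#define DEMO_SKIP 0x2			/* stand-in for unused or PSEUDO colors */

/*
 * Collect into result[] every color that does not appear in arc_colors[]
 * and is not flagged DEMO_SKIP.  Returns the number of colors collected.
 * On return, all DEMO_MARK bits set by this call are clear again.
 */
static size_t
demo_complement(unsigned char *flags, size_t ncolors,
				const size_t *arc_colors, size_t narcs,
				size_t *result)
{
	size_t		i;
	size_t		n = 0;

	/* Pass 1: mark every color that appears on an out-arc */
	for (i = 0; i < narcs; i++)
	{
		assert(arc_colors[i] < ncolors);
		flags[arc_colors[i]] |= DEMO_MARK;
	}

	/* Pass 2: clear the marks, and emit the colors that were not marked */
	for (i = 0; i < ncolors; i++)
	{
		if (flags[i] & DEMO_MARK)
			flags[i] &= (unsigned char) ~DEMO_MARK;
		else if (!(flags[i] & DEMO_SKIP))
			result[n++] = i;
	}
	return n;
}
```

Clearing the mark in the same pass that emits the result keeps the flag truly transient, so no third cleanup loop is needed and duplicate arc colors in the input cost nothing extra.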

src/backend/regex/regc_lex.c

Lines changed: 16 additions & 150 deletions
@@ -193,83 +193,6 @@ prefixes(struct vars *v)
 	}
 }

-/*
- * lexnest - "call a subroutine", interpolating string at the lexical level
- *
- * Note, this is not a very general facility.  There are a number of
- * implicit assumptions about what sorts of strings can be subroutines.
- */
-static void
-lexnest(struct vars *v,
-		const chr *beginp,		/* start of interpolation */
-		const chr *endp)		/* one past end of interpolation */
-{
-	assert(v->savenow == NULL); /* only one level of nesting */
-	v->savenow = v->now;
-	v->savestop = v->stop;
-	v->now = beginp;
-	v->stop = endp;
-}
-
-/*
- * string constants to interpolate as expansions of things like \d
- */
-static const chr backd[] = {	/* \d */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backD[] = {	/* \D */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbackd[] = {	/* \d within brackets */
-	CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']')
-};
-static const chr backs[] = {	/* \s */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backS[] = {	/* \S */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbacks[] = {	/* \s within brackets */
-	CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']')
-};
-static const chr backw[] = {	/* \w */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr backW[] = {	/* \W */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr brbackw[] = {	/* \w within brackets */
-	CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_')
-};
-
-/*
- * lexword - interpolate a bracket expression for word characters
- * Possibly ought to inquire whether there is a "word" character class.
- */
-static void
-lexword(struct vars *v)
-{
-	lexnest(v, backw, ENDOF(backw));
-}
-
 /*
  * next - get next token
  */
@@ -292,14 +215,6 @@ next(struct vars *v)
 		RETV(SBEGIN, 0);		/* same as \A */
 	}

-	/* if we're nested and we've hit end, return to outer level */
-	if (v->savenow != NULL && ATEOS())
-	{
-		v->now = v->savenow;
-		v->stop = v->savestop;
-		v->savenow = v->savestop = NULL;
-	}
-
 	/* skip white space etc. if appropriate (not in literal or []) */
 	if (v->cflags & REG_EXPANDED)
 		switch (v->lexcon)
@@ -420,32 +335,15 @@ next(struct vars *v)
 					NOTE(REG_UNONPOSIX);
 					if (ATEOS())
 						FAILW(REG_EESCAPE);
-					(DISCARD) lexescape(v);
+					if (!lexescape(v))
+						return 0;
 					switch (v->nexttype)
 					{			/* not all escapes okay here */
 						case PLAIN:
+						case CCLASSS:
+						case CCLASSC:
 							return 1;
 							break;
-						case CCLASS:
-							switch (v->nextvalue)
-							{
-								case 'd':
-									lexnest(v, brbackd, ENDOF(brbackd));
-									break;
-								case 's':
-									lexnest(v, brbacks, ENDOF(brbacks));
-									break;
-								case 'w':
-									lexnest(v, brbackw, ENDOF(brbackw));
-									break;
-								default:
-									FAILW(REG_EESCAPE);
-									break;
-							}
-							/* lexnest done, back up and try again */
-							v->nexttype = v->lasttype;
-							return next(v);
-							break;
 					}
 					/* not one of the acceptable escapes */
 					FAILW(REG_EESCAPE);
@@ -691,49 +589,17 @@ next(struct vars *v)
 		}
 		RETV(PLAIN, *v->now++);
 	}
-	(DISCARD) lexescape(v);
-	if (ISERR())
-		FAILW(REG_EESCAPE);
-	if (v->nexttype == CCLASS)
-	{							/* fudge at lexical level */
-		switch (v->nextvalue)
-		{
-			case 'd':
-				lexnest(v, backd, ENDOF(backd));
-				break;
-			case 'D':
-				lexnest(v, backD, ENDOF(backD));
-				break;
-			case 's':
-				lexnest(v, backs, ENDOF(backs));
-				break;
-			case 'S':
-				lexnest(v, backS, ENDOF(backS));
-				break;
-			case 'w':
-				lexnest(v, backw, ENDOF(backw));
-				break;
-			case 'W':
-				lexnest(v, backW, ENDOF(backW));
-				break;
-			default:
-				assert(NOTREACHED);
-				FAILW(REG_ASSERT);
-				break;
-		}
-		/* lexnest done, back up and try again */
-		v->nexttype = v->lasttype;
-		return next(v);
-	}
-	/* otherwise, lexescape has already done the work */
-	return !ISERR();
+	return lexescape(v);
 }

 /*
  * lexescape - parse an ARE backslash escape (backslash already eaten)
- * Note slightly nonstandard use of the CCLASS type code.
+ *
+ * This is used for ARE backslashes both normally and inside bracket
+ * expressions.  In the latter case, not all escape types are allowed,
+ * but the caller must reject unwanted ones after we return.
  */
-static int						/* not actually used, but convenient for RETV */
+static int
 lexescape(struct vars *v)
 {
 	chr			c;
@@ -775,11 +641,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('d'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'd');
+			RETV(CCLASSS, CC_DIGIT);
 			break;
 		case CHR('D'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'D');
+			RETV(CCLASSC, CC_DIGIT);
 			break;
 		case CHR('e'):
 			NOTE(REG_UUNPORT);
@@ -802,11 +668,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('s'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 's');
+			RETV(CCLASSS, CC_SPACE);
 			break;
 		case CHR('S'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'S');
+			RETV(CCLASSC, CC_SPACE);
 			break;
 		case CHR('t'):
 			RETV(PLAIN, CHR('\t'));
@@ -828,11 +694,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('w'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'w');
+			RETV(CCLASSS, CC_WORD);
 			break;
 		case CHR('W'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'W');
+			RETV(CCLASSC, CC_WORD);
 			break;
 		case CHR('x'):
 			NOTE(REG_UUNPORT);
