Skip to content

Commit b9cf94c

Browse files
committed
Doc: improve documentation around jsonpath regular expressions.
Provide some documentation about the differences between XQuery regular expressions and those supported by Spencer's regex engine. Since SQL now exposes XQuery regexps with the LIKE_REGEX operator, I made this a standalone section designed to help somebody who has to translate a LIKE_REGEX query to Postgres. (Eventually we might extend Spencer's engine to allow precise implementation of XQuery, but not today.) Reference that in the jsonpath docs, provide definitions of the XQuery flag letters, and add a description of the JavaScript-inspired string literal syntax used within jsonpath. Also point out explicitly that backslashes used within like_regex patterns will need to be doubled. This also syncs the docs with the decision implemented in commit d5b90cd to desupport XQuery's 'x' flag for now. Jonathan Katz and Tom Lane Discussion: https://postgr.es/m/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg@mail.gmail.com
1 parent 596cf9e commit b9cf94c

File tree

2 files changed

+235
-11
lines changed

2 files changed

+235
-11
lines changed

doc/src/sgml/func.sgml

Lines changed: 209 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5966,6 +5966,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
59665966

59675967
<!-- end re_syntax.n man page -->
59685968

5969+
<sect3 id="posix-vs-xquery">
5970+
<title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
5971+
5972+
<indexterm zone="posix-vs-xquery">
5973+
<primary><literal>LIKE_REGEX</literal></primary>
5974+
</indexterm>
5975+
5976+
<indexterm zone="posix-vs-xquery">
5977+
<primary>XQuery regular expressions</primary>
5978+
</indexterm>
5979+
5980+
<para>
5981+
Since SQL:2008, the SQL standard includes
5982+
a <literal>LIKE_REGEX</literal> operator that performs pattern
5983+
matching according to the XQuery regular expression
5984+
standard. <productname>PostgreSQL</productname> does not yet
5985+
implement this operator, but you can get very similar behavior using
5986+
the <function>regexp_match()</function> function, since XQuery
5987+
regular expressions are quite close to the ARE syntax described above.
5988+
</para>
5989+
5990+
<para>
5991+
Notable differences between the existing POSIX-based
5992+
regular-expression feature and XQuery regular expressions include:
5993+
5994+
<itemizedlist>
5995+
<listitem>
5996+
<para>
5997+
XQuery character class subtraction is not supported. An example of
5998+
this feature is using the following to match only English
5999+
consonants: <literal>[a-z-[aeiou]]</literal>.
6000+
</para>
6001+
</listitem>
6002+
<listitem>
6003+
<para>
6004+
XQuery character class shorthands <literal>\c</literal>,
6005+
<literal>\C</literal>, <literal>\i</literal>,
6006+
and <literal>\I</literal> are not supported.
6007+
</para>
6008+
</listitem>
6009+
<listitem>
6010+
<para>
6011+
XQuery character class elements
6012+
using <literal>\p{UnicodeProperty}</literal> or the
6013+
inverse <literal>\P{UnicodeProperty}</literal> are not supported.
6014+
</para>
6015+
</listitem>
6016+
<listitem>
6017+
<para>
6018+
POSIX interprets character classes such as <literal>\w</literal>
6019+
(see <xref linkend="posix-class-shorthand-escapes-table"/>)
6020+
according to the prevailing locale (which you can control by
6021+
attaching a <literal>COLLATE</literal> clause to the operator or
6022+
function). XQuery specifies these classes by reference to Unicode
6023+
character properties, so equivalent behavior is obtained only with
6024+
a locale that follows the Unicode rules.
6025+
</para>
6026+
</listitem>
6027+
<listitem>
6028+
<para>
6029+
The SQL standard (not XQuery itself) attempts to cater for more
6030+
variants of <quote>newline</quote> than POSIX does. The
6031+
newline-sensitive matching options described above consider only
6032+
ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
6033+
us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
6034+
(a Windows-style newline), and some Unicode-only characters like
6035+
LINE SEPARATOR (U+2028) as newlines as well.
6036+
Notably, <literal>.</literal> and <literal>\s</literal> should
6037+
count <literal>\r\n</literal> as one character not two according to
6038+
SQL.
6039+
</para>
6040+
</listitem>
6041+
<listitem>
6042+
<para>
6043+
Of the character-entry escapes described in
6044+
<xref linkend="posix-character-entry-escapes-table"/>,
6045+
XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
6046+
and <literal>\t</literal>.
6047+
</para>
6048+
</listitem>
6049+
<listitem>
6050+
<para>
6051+
XQuery does not support
6052+
the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
6053+
for character classes within bracket expressions.
6054+
</para>
6055+
</listitem>
6056+
<listitem>
6057+
<para>
6058+
XQuery does not have lookahead or lookbehind constraints,
6059+
nor any of the constraint escapes described in
6060+
<xref linkend="posix-constraint-escapes-table"/>.
6061+
</para>
6062+
</listitem>
6063+
<listitem>
6064+
<para>
6065+
The metasyntax forms described in <xref linkend="posix-metasyntax"/>
6066+
do not exist in XQuery.
6067+
</para>
6068+
</listitem>
6069+
<listitem>
6070+
<para>
6071+
The regular expression flag letters defined by XQuery are
6072+
related to but not the same as the option letters for POSIX
6073+
(<xref linkend="posix-embedded-options-table"/>). While the
6074+
<literal>i</literal> and <literal>q</literal> options behave the
6075+
same, others do not:
6076+
<itemizedlist>
6077+
<listitem>
6078+
<para>
6079+
XQuery's <literal>s</literal> (allow dot to match newline)
6080+
and <literal>m</literal> (allow <literal>^</literal>
6081+
and <literal>$</literal> to match at newlines) flags provide
6082+
access to the same behaviors as
6083+
POSIX's <literal>n</literal>, <literal>p</literal>
6084+
and <literal>w</literal> flags, but they
6085+
do <emphasis>not</emphasis> match the behavior of
6086+
POSIX's <literal>s</literal> and <literal>m</literal> flags.
6087+
Note in particular that dot-matches-newline is the default
6088+
behavior in POSIX but not XQuery.
6089+
</para>
6090+
</listitem>
6091+
<listitem>
6092+
<para>
6093+
XQuery's <literal>x</literal> (ignore whitespace in pattern) flag
6094+
is noticeably different from POSIX's expanded-mode flag.
6095+
POSIX's <literal>x</literal> flag also
6096+
allows <literal>#</literal> to begin a comment in the pattern,
6097+
and POSIX will not ignore a whitespace character after a
6098+
backslash.
6099+
</para>
6100+
</listitem>
6101+
</itemizedlist>
6102+
</para>
6103+
</listitem>
6104+
</itemizedlist>
6105+
</para>
6106+
6107+
</sect3>
59696108
</sect2>
59706109
</sect1>
59716110

@@ -11739,6 +11878,14 @@ table2-mapping
1173911878
</programlisting>
1174011879
</para>
1174111880
</listitem>
11881+
11882+
<listitem>
11883+
<para>
11884+
There are minor differences in the interpretation of regular
11885+
expression patterns used in <literal>like_regex</literal> filters, as
11886+
described in <xref linkend="jsonpath-regular-expressions"/>.
11887+
</para>
11888+
</listitem>
1174211889
</itemizedlist>
1174311890

1174411891
<sect3 id="strict-and-lax-modes">
@@ -11818,6 +11965,63 @@ table2-mapping
1181811965

1181911966
</sect3>
1182011967

11968+
<sect3 id="jsonpath-regular-expressions">
11969+
<title>Regular Expressions</title>
11970+
11971+
<indexterm zone="jsonpath-regular-expressions">
11972+
<primary><literal>LIKE_REGEX</literal></primary>
11973+
<secondary>in SQL/JSON</secondary>
11974+
</indexterm>
11975+
11976+
<para>
11977+
SQL/JSON path expressions allow matching text to a regular expression
11978+
with the <literal>like_regex</literal> filter. For example, the
11979+
following SQL/JSON path query would case-insensitively match all
11980+
strings in an array that start with an English vowel:
11981+
<programlisting>
11982+
'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
11983+
</programlisting>
11984+
</para>
11985+
11986+
<para>
11987+
The optional <literal>flag</literal> string may include one or more of
11988+
the characters
11989+
<literal>i</literal> for case-insensitive match,
11990+
<literal>m</literal> to allow <literal>^</literal>
11991+
and <literal>$</literal> to match at newlines,
11992+
<literal>s</literal> to allow <literal>.</literal> to match a newline,
11993+
and <literal>q</literal> to quote the whole pattern (reducing the
11994+
behavior to a simple substring match).
11995+
</para>
11996+
11997+
<para>
11998+
The SQL/JSON standard borrows its definition for regular expressions
11999+
from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
12000+
XQuery standard. PostgreSQL does not currently support the
12001+
<literal>LIKE_REGEX</literal> operator. Therefore,
12002+
the <literal>like_regex</literal> filter is implemented using the
12003+
POSIX regular expression engine described in
12004+
<xref linkend="functions-posix-regexp"/>. This leads to various minor
12005+
discrepancies from standard SQL/JSON behavior, which are cataloged in
12006+
<xref linkend="posix-vs-xquery"/>.
12007+
Note, however, that the flag-letter incompatibilities described there
12008+
do not apply to SQL/JSON, as it translates the XQuery flag letters to
12009+
match what the POSIX engine expects.
12010+
</para>
12011+
12012+
<para>
12013+
Keep in mind that the pattern argument of <literal>like_regex</literal>
12014+
is a JSON path string literal, written according to the rules given in
12015+
<xref linkend="datatype-jsonpath"/>. This means in particular that any
12016+
backslashes you want to use in the regular expression must be doubled.
12017+
For example, to match strings that contain only digits:
12018+
<programlisting>
12019+
'$ ? (@ like_regex "^\\d+$")'
12020+
</programlisting>
12021+
</para>
12022+
12023+
</sect3>
12024+
1182112025
<sect3 id="functions-sqljson-path-operators">
1182212026
<title>SQL/JSON Path Operators and Methods</title>
1182312027

@@ -12059,10 +12263,11 @@ table2-mapping
1205912263
<row>
1206012264
<entry><literal>like_regex</literal></entry>
1206112265
<entry>
12062-
Tests pattern matching with POSIX regular expressions
12063-
(see <xref linkend="functions-posix-regexp"/>). Supported flags
12064-
are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
12065-
<literal>x</literal>, and <literal>q</literal>.</entry>
12266+
Tests whether the first operand matches the regular expression
12267+
given by the second operand, optionally with modifications
12268+
described by a string of <literal>flag</literal> characters (see
12269+
<xref linkend="jsonpath-regular-expressions"/>)
12270+
</entry>
1206612271
<entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
1206712272
<entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
1206812273
<entry><literal>"abc", "aBdC", "abdacb"</literal></entry>

doc/src/sgml/json.sgml

Lines changed: 26 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -666,13 +666,32 @@ SELECT jdoc-&gt;'guid', jdoc-&gt;'name' FROM api WHERE jdoc @&gt; '{"tags": ["qu
666666
</itemizedlist>
667667

668668
<para>
669-
An SQL/JSON path expression is an SQL character string literal,
670-
so it must be enclosed in single quotes when passed to an SQL/JSON
671-
query function. Following the JavaScript
672-
conventions, character string literals within the path expression
673-
must be enclosed in double quotes. Any single quotes within this
674-
character string literal must be escaped with a single quote
675-
by the SQL convention.
669+
An SQL/JSON path expression is typically written in an SQL query as an
670+
SQL character string literal, so it must be enclosed in single quotes,
671+
and any single quotes desired within the value must be doubled
672+
(see <xref linkend="sql-syntax-strings"/>).
673+
Some forms of path expressions require string literals within them.
674+
These embedded string literals follow JavaScript/ECMAScript conventions:
675+
they must be surrounded by double quotes, and backslash escapes may be
676+
used within them to represent otherwise-hard-to-type characters.
677+
In particular, the way to write a double quote within an embedded string
678+
literal is <literal>\"</literal>, and to write a backslash itself, you
679+
must write <literal>\\</literal>. Other special backslash sequences
680+
include those recognized in JSON strings:
681+
<literal>\b</literal>,
682+
<literal>\f</literal>,
683+
<literal>\n</literal>,
684+
<literal>\r</literal>,
685+
<literal>\t</literal>,
686+
<literal>\v</literal>
687+
for various ASCII control characters, and
688+
<literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode
689+
character identified by its 4-hex-digit code point. The backslash
690+
syntax also includes two cases not allowed by JSON:
691+
<literal>\x<replaceable>NN</replaceable></literal> for a character code
692+
written with only two hex digits, and
693+
<literal>\u{<replaceable>N...</replaceable>}</literal> for a character
694+
code written with 1 to 6 hex digits.
676695
</para>
677696

678697
<para>

0 commit comments

Comments
 (0)