Skip to content

Commit 0a97edb

Browse files
committed
Doc: improve documentation around jsonpath regular expressions.
Provide some documentation about the differences between XQuery regular expressions and those supported by Spencer's regex engine. Since SQL now exposes XQuery regexps with the LIKE_REGEX operator, I made this a standalone section designed to help somebody who has to translate a LIKE_REGEX query to Postgres. (Eventually we might extend Spencer's engine to allow precise implementation of XQuery, but not today.) Reference that in the jsonpath docs, provide definitions of the XQuery flag letters, and add a description of the JavaScript-inspired string literal syntax used within jsonpath. Also point out explicitly that backslashes used within like_regex patterns will need to be doubled. This also syncs the docs with the decision implemented in commit d5b90cd to desupport XQuery's 'x' flag for now. Jonathan Katz and Tom Lane Discussion: https://postgr.es/m/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg@mail.gmail.com
1 parent e1c8743 commit 0a97edb

File tree

2 files changed

+235
-11
lines changed

2 files changed

+235
-11
lines changed

doc/src/sgml/func.sgml

Lines changed: 209 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
59705970

59715971
<!-- end re_syntax.n man page -->
59725972

5973+
<sect3 id="posix-vs-xquery">
5974+
<title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
5975+
5976+
<indexterm zone="posix-vs-xquery">
5977+
<primary><literal>LIKE_REGEX</literal></primary>
5978+
</indexterm>
5979+
5980+
<indexterm zone="posix-vs-xquery">
5981+
<primary>XQuery regular expressions</primary>
5982+
</indexterm>
5983+
5984+
<para>
5985+
Since SQL:2008, the SQL standard includes
5986+
a <literal>LIKE_REGEX</literal> operator that performs pattern
5987+
matching according to the XQuery regular expression
5988+
standard. <productname>PostgreSQL</productname> does not yet
5989+
implement this operator, but you can get very similar behavior using
5990+
the <function>regexp_match()</function> function, since XQuery
5991+
regular expressions are quite close to the ARE syntax described above.
5992+
</para>
5993+
5994+
<para>
5995+
Notable differences between the existing POSIX-based
5996+
regular-expression feature and XQuery regular expressions include:
5997+
5998+
<itemizedlist>
5999+
<listitem>
6000+
<para>
6001+
XQuery character class subtraction is not supported. An example of
6002+
this feature is using the following to match only English
6003+
consonants: <literal>[a-z-[aeiou]]</literal>.
6004+
</para>
6005+
</listitem>
6006+
<listitem>
6007+
<para>
6008+
XQuery character class shorthands <literal>\c</literal>,
6009+
<literal>\C</literal>, <literal>\i</literal>,
6010+
and <literal>\I</literal> are not supported.
6011+
</para>
6012+
</listitem>
6013+
<listitem>
6014+
<para>
6015+
XQuery character class elements
6016+
using <literal>\p{UnicodeProperty}</literal> or the
6017+
inverse <literal>\P{UnicodeProperty}</literal> are not supported.
6018+
</para>
6019+
</listitem>
6020+
<listitem>
6021+
<para>
6022+
POSIX interprets character classes such as <literal>\w</literal>
6023+
(see <xref linkend="posix-class-shorthand-escapes-table"/>)
6024+
according to the prevailing locale (which you can control by
6025+
attaching a <literal>COLLATE</literal> clause to the operator or
6026+
function). XQuery specifies these classes by reference to Unicode
6027+
character properties, so equivalent behavior is obtained only with
6028+
a locale that follows the Unicode rules.
6029+
</para>
6030+
</listitem>
6031+
<listitem>
6032+
<para>
6033+
The SQL standard (not XQuery itself) attempts to cater for more
6034+
variants of <quote>newline</quote> than POSIX does. The
6035+
newline-sensitive matching options described above consider only
6036+
ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
6037+
us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
6038+
(a Windows-style newline), and some Unicode-only characters like
6039+
LINE SEPARATOR (U+2028) as newlines as well.
6040+
Notably, <literal>.</literal> and <literal>\s</literal> should
6041+
count <literal>\r\n</literal> as one character not two according to
6042+
SQL.
6043+
</para>
6044+
</listitem>
6045+
<listitem>
6046+
<para>
6047+
Of the character-entry escapes described in
6048+
<xref linkend="posix-character-entry-escapes-table"/>,
6049+
XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
6050+
and <literal>\t</literal>.
6051+
</para>
6052+
</listitem>
6053+
<listitem>
6054+
<para>
6055+
XQuery does not support
6056+
the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
6057+
for character classes within bracket expressions.
6058+
</para>
6059+
</listitem>
6060+
<listitem>
6061+
<para>
6062+
XQuery does not have lookahead or lookbehind constraints,
6063+
nor any of the constraint escapes described in
6064+
<xref linkend="posix-constraint-escapes-table"/>.
6065+
</para>
6066+
</listitem>
6067+
<listitem>
6068+
<para>
6069+
The metasyntax forms described in <xref linkend="posix-metasyntax"/>
6070+
do not exist in XQuery.
6071+
</para>
6072+
</listitem>
6073+
<listitem>
6074+
<para>
6075+
The regular expression flag letters defined by XQuery are
6076+
related to but not the same as the option letters for POSIX
6077+
(<xref linkend="posix-embedded-options-table"/>). While the
6078+
<literal>i</literal> and <literal>q</literal> options behave the
6079+
same, others do not:
6080+
<itemizedlist>
6081+
<listitem>
6082+
<para>
6083+
XQuery's <literal>s</literal> (allow dot to match newline)
6084+
and <literal>m</literal> (allow <literal>^</literal>
6085+
and <literal>$</literal> to match at newlines) flags provide
6086+
access to the same behaviors as
6087+
POSIX's <literal>n</literal>, <literal>p</literal>
6088+
and <literal>w</literal> flags, but they
6089+
do <emphasis>not</emphasis> match the behavior of
6090+
POSIX's <literal>s</literal> and <literal>m</literal> flags.
6091+
Note in particular that dot-matches-newline is the default
6092+
behavior in POSIX but not XQuery.
6093+
</para>
6094+
</listitem>
6095+
<listitem>
6096+
<para>
6097+
XQuery's <literal>x</literal> (ignore whitespace in pattern) flag
6098+
is noticeably different from POSIX's expanded-mode flag.
6099+
POSIX's <literal>x</literal> flag also
6100+
allows <literal>#</literal> to begin a comment in the pattern,
6101+
and POSIX will not ignore a whitespace character after a
6102+
backslash.
6103+
</para>
6104+
</listitem>
6105+
</itemizedlist>
6106+
</para>
6107+
</listitem>
6108+
</itemizedlist>
6109+
</para>
6110+
6111+
</sect3>
59736112
</sect2>
59746113
</sect1>
59756114

@@ -11793,6 +11932,14 @@ table2-mapping
1179311932
</programlisting>
1179411933
</para>
1179511934
</listitem>
11935+
11936+
<listitem>
11937+
<para>
11938+
There are minor differences in the interpretation of regular
11939+
expression patterns used in <literal>like_regex</literal> filters, as
11940+
described in <xref linkend="jsonpath-regular-expressions"/>.
11941+
</para>
11942+
</listitem>
1179611943
</itemizedlist>
1179711944

1179811945
<sect3 id="strict-and-lax-modes">
@@ -11872,6 +12019,63 @@ table2-mapping
1187212019

1187312020
</sect3>
1187412021

12022+
<sect3 id="jsonpath-regular-expressions">
12023+
<title>Regular Expressions</title>
12024+
12025+
<indexterm zone="jsonpath-regular-expressions">
12026+
<primary><literal>LIKE_REGEX</literal></primary>
12027+
<secondary>in SQL/JSON</secondary>
12028+
</indexterm>
12029+
12030+
<para>
12031+
SQL/JSON path expressions allow matching text to a regular expression
12032+
with the <literal>like_regex</literal> filter. For example, the
12033+
following SQL/JSON path query would case-insensitively match all
12034+
strings in an array that start with an English vowel:
12035+
<programlisting>
12036+
'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
12037+
</programlisting>
12038+
</para>
12039+
12040+
<para>
12041+
The optional <literal>flag</literal> string may include one or more of
12042+
the characters
12043+
<literal>i</literal> for case-insensitive match,
12044+
<literal>m</literal> to allow <literal>^</literal>
12045+
and <literal>$</literal> to match at newlines,
12046+
<literal>s</literal> to allow <literal>.</literal> to match a newline,
12047+
and <literal>q</literal> to quote the whole pattern (reducing the
12048+
behavior to a simple substring match).
12049+
</para>
12050+
12051+
<para>
12052+
The SQL/JSON standard borrows its definition for regular expressions
12053+
from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
12054+
XQuery standard. PostgreSQL does not currently support the
12055+
<literal>LIKE_REGEX</literal> operator. Therefore,
12056+
the <literal>like_regex</literal> filter is implemented using the
12057+
POSIX regular expression engine described in
12058+
<xref linkend="functions-posix-regexp"/>. This leads to various minor
12059+
discrepancies from standard SQL/JSON behavior, which are cataloged in
12060+
<xref linkend="posix-vs-xquery"/>.
12061+
Note, however, that the flag-letter incompatibilities described there
12062+
do not apply to SQL/JSON, as it translates the XQuery flag letters to
12063+
match what the POSIX engine expects.
12064+
</para>
12065+
12066+
<para>
12067+
Keep in mind that the pattern argument of <literal>like_regex</literal>
12068+
is a JSON path string literal, written according to the rules given in
12069+
<xref linkend="datatype-jsonpath"/>. This means in particular that any
12070+
backslashes you want to use in the regular expression must be doubled.
12071+
For example, to match strings that contain only digits:
12072+
<programlisting>
12073+
'$ ? (@ like_regex "^\\d+$")'
12074+
</programlisting>
12075+
</para>
12076+
12077+
</sect3>
12078+
1187512079
<sect3 id="functions-sqljson-path-operators">
1187612080
<title>SQL/JSON Path Operators and Methods</title>
1187712081

@@ -12113,10 +12317,11 @@ table2-mapping
1211312317
<row>
1211412318
<entry><literal>like_regex</literal></entry>
1211512319
<entry>
12116-
Tests pattern matching with POSIX regular expressions
12117-
(see <xref linkend="functions-posix-regexp"/>). Supported flags
12118-
are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
12119-
<literal>x</literal>, and <literal>q</literal>.</entry>
12320+
Tests whether the first operand matches the regular expression
12321+
given by the second operand, optionally with modifications
12322+
described by a string of <literal>flag</literal> characters (see
12323+
<xref linkend="jsonpath-regular-expressions"/>)
12324+
</entry>
1212012325
<entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
1212112326
<entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
1212212327
<entry><literal>"abc", "aBdC", "abdacb"</literal></entry>

doc/src/sgml/json.sgml

Lines changed: 26 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -666,13 +666,32 @@ SELECT jdoc-&gt;'guid', jdoc-&gt;'name' FROM api WHERE jdoc @&gt; '{"tags": ["qu
666666
</itemizedlist>
667667

668668
<para>
669-
An SQL/JSON path expression is an SQL character string literal,
670-
so it must be enclosed in single quotes when passed to an SQL/JSON
671-
query function. Following the JavaScript
672-
conventions, character string literals within the path expression
673-
must be enclosed in double quotes. Any single quotes within this
674-
character string literal must be escaped with a single quote
675-
by the SQL convention.
669+
An SQL/JSON path expression is typically written in an SQL query as an
670+
SQL character string literal, so it must be enclosed in single quotes,
671+
and any single quotes desired within the value must be doubled
672+
(see <xref linkend="sql-syntax-strings"/>).
673+
Some forms of path expressions require string literals within them.
674+
These embedded string literals follow JavaScript/ECMAScript conventions:
675+
they must be surrounded by double quotes, and backslash escapes may be
676+
used within them to represent otherwise-hard-to-type characters.
677+
In particular, the way to write a double quote within an embedded string
678+
literal is <literal>\"</literal>, and to write a backslash itself, you
679+
must write <literal>\\</literal>. Other special backslash sequences
680+
include those recognized in JSON strings:
681+
<literal>\b</literal>,
682+
<literal>\f</literal>,
683+
<literal>\n</literal>,
684+
<literal>\r</literal>,
685+
<literal>\t</literal>,
686+
<literal>\v</literal>
687+
for various ASCII control characters, and
688+
<literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode
689+
character identified by its 4-hex-digit code point. The backslash
690+
syntax also includes two cases not allowed by JSON:
691+
<literal>\x<replaceable>NN</replaceable></literal> for a character code
692+
written with only two hex digits, and
693+
<literal>\u{<replaceable>N...</replaceable>}</literal> for a character
694+
code written with 1 to 6 hex digits.
676695
</para>
677696

678697
<para>

0 commit comments

Comments
 (0)