Commit e2e46a9

Fix documentation of regular expression character-entry escapes.
The docs claimed that \uhhhh would be interpreted as a Unicode value regardless of the database encoding, but it's never been implemented that way: \uhhhh and \xhhhh actually mean exactly the same thing, namely the character that pg_mb2wchar translates to 0xhhhh. Moreover we were falsely dismissive of the usefulness of Unicode code points above FFFF. Fix that. It's been like this for ages, so back-patch to all supported branches.
1 parent 541ec18 commit e2e46a9

1 file changed, +17 -4 lines changed

doc/src/sgml/func.sgml

+17-4
@@ -4653,7 +4653,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
 <entry> <literal>\e</> </entry>
 <entry> the character whose collating-sequence name
 is <literal>ESC</>,
-or failing that, the character with octal value 033 </entry>
+or failing that, the character with octal value <literal>033</> </entry>
 </row>
 
 <row>
@@ -4679,15 +4679,17 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
 <row>
 <entry> <literal>\u</><replaceable>wxyz</> </entry>
 <entry> (where <replaceable>wxyz</> is exactly four hexadecimal digits)
-the UTF16 (Unicode, 16-bit) character <literal>U+</><replaceable>wxyz</>
-in the local byte ordering </entry>
+the character whose hexadecimal value is
+<literal>0x</><replaceable>wxyz</>
+</entry>
 </row>
 
 <row>
 <entry> <literal>\U</><replaceable>stuvwxyz</> </entry>
 <entry> (where <replaceable>stuvwxyz</> is exactly eight hexadecimal
 digits)
-reserved for a hypothetical Unicode extension to 32 bits
+the character whose hexadecimal value is
+<literal>0x</><replaceable>stuvwxyz</>
 </entry>
 </row>
 
@@ -4736,6 +4738,17 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
 Octal digits are <literal>0</>-<literal>7</>.
 </para>
 
+<para>
+Numeric character-entry escapes specifying values outside the ASCII range
+(0-127) have meanings dependent on the database encoding. When the
+encoding is UTF-8, escape values are equivalent to Unicode code points,
+for example <literal>\u1234</> means the character <literal>U+1234</>.
+For other multibyte encodings, character-entry escapes usually just
+specify the concatenation of the byte values for the character. If the
+escape value does not correspond to any legal character in the database
+encoding, no error will be raised, but it will never match any data.
+</para>
+
 <para>
 The character-entry escapes are always taken as ordinary characters.
 For example, <literal>\135</> is <literal>]</> in ASCII, but
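
Illustration (not part of the patch): the rule stated in the new paragraph can be sketched in Python to see which escape value a given character would need under different encodings. This is only a model of the documented behavior and does not call PostgreSQL. The helper names `escape_value` and `format_escape` are hypothetical, and Python's `euc_jp` codec is assumed to stand in for a non-UTF-8 multibyte database encoding. The "concatenation of byte values" branch follows the paragraph's "usually" and may not hold for every encoding.

```python
def escape_value(ch: str, encoding: str) -> int:
    """Numeric value a \\u / \\U / \\x escape would need for `ch`,
    following the rule described in the patched paragraph (a model,
    not PostgreSQL's actual conversion code)."""
    normalized = encoding.lower().replace("-", "").replace("_", "")
    if normalized == "utf8":
        # UTF-8 database: the escape value is the Unicode code point.
        return ord(ch)
    # Other multibyte encodings (assumption, per the docs' "usually"):
    # the escape value is the character's encoded bytes concatenated.
    return int.from_bytes(ch.encode(encoding), "big")


def format_escape(value: int) -> str:
    """Render the value as \\uwxyz (four hex digits) or
    \\Ustuvwxyz (eight hex digits)."""
    if value <= 0xFFFF:
        return f"\\u{value:04x}"
    return f"\\U{value:08x}"


if __name__ == "__main__":
    hiragana_a = "\u3042"  # HIRAGANA LETTER A
    print(format_escape(escape_value(hiragana_a, "utf-8")))
    print(format_escape(escape_value(hiragana_a, "euc_jp")))
```

Under this model the same character needs a different escape in each encoding, which is why the old claim that `\uhhhh` always denotes a Unicode value was misleading. Values in the ASCII range come out the same either way.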
