Commit b107744

Improve comment in regc_pg_locale.c.
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
1 parent 3fae25c commit b107744

1 file changed, +13 -16 lines changed

src/backend/regex/regc_pg_locale.c

Lines changed: 13 additions & 16 deletions
@@ -21,22 +21,22 @@
 #include "utils/pg_locale.h"
 
 /*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
  *
  * 1. In C/POSIX collations, we use hard-wired code. We can't depend on
  * the <ctype.h> functions since those will obey LC_CTYPE. Note that these
  * collations don't give a fig about multibyte characters.
  *
- * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
- *
- * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
+ * 2. When working in UTF8 encoding, we use the <wctype.h> functions.
  * This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode. On some platforms
+ * as the wchar_t representation of Unicode. (XXX: ICU makes this assumption
+ * even for non-UTF8 encodings, which may be a problem.) On some platforms
  * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
  *
- * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
+ * 3. In all other encodings, we use the <ctype.h> functions for pg_wchar
  * values up to 255, and punt for values above that. This is 100% correct
  * only in single-byte encodings such as LATINn. However, non-Unicode
  * multibyte encodings are mostly Far Eastern character sets for which the
@@ -46,14 +46,11 @@
  * the platform's wchar_t representation matches what we do in pg_wchar
  * conversions.
  *
- * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
- * functions, under exactly the same cases as #2.
- *
- * There is one notable difference between cases 2 and 3: in the "default"
- * collation we force ASCII letters to follow ASCII upcase/downcase rules,
- * while in a non-default collation we just let the library functions do what
- * they will. The case where this matters is treatment of I/i in Turkish,
- * and the behavior is meant to match the upper()/lower() SQL functions.
+ * As a special case, in the "default" collation, (2) and (3) force ASCII
+ * letters to follow ASCII upcase/downcase rules, while in a non-default
+ * collation we just let the library functions do what they will. The case
+ * where this matters is treatment of I/i in Turkish, and the behavior is
+ * meant to match the upper()/lower() SQL functions.
  *
  * We store the active collation setting in static variables. In principle
  * it could be passed down to here via the regex library's "struct vars" data
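
For readers who want to see how the three numbered cases in the revised comment relate, here is a minimal sketch of an upcase dispatcher. It is not the code in regc_pg_locale.c: the names `sketch_strategy`, `sketch_ascii_toupper`, and `sketch_toupper` are hypothetical, and the `is_default` flag is an assumed stand-in for "this is the default collation". It uses the plain `<ctype.h>` and `<wctype.h>` functions rather than any `locale_t` variants, so it only illustrates the branching.

```c
#include <ctype.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical labels for the three cases in the comment. */
typedef enum
{
    SKETCH_C_POSIX,     /* case 1: hard-wired ASCII rules */
    SKETCH_WIDE,        /* case 2: UTF8 encoding, <wctype.h> */
    SKETCH_SINGLE_BYTE  /* case 3: other encodings, <ctype.h> up to 255 */
} sketch_strategy;

/* Hard-wired ASCII upcasing that never consults LC_CTYPE. */
static unsigned int
sketch_ascii_toupper(unsigned int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

static unsigned int
sketch_toupper(unsigned int c, sketch_strategy strategy, bool is_default)
{
    /*
     * Special case: in the default collation, cases 2 and 3 force ASCII
     * letters to follow ASCII rules (this matters for Turkish I/i).
     */
    if (strategy != SKETCH_C_POSIX && is_default && c <= 127)
        return sketch_ascii_toupper(c);

    switch (strategy)
    {
        case SKETCH_C_POSIX:
            return sketch_ascii_toupper(c);
        case SKETCH_WIDE:
            /* Punt for codepoints that a 16-bit wchar_t cannot hold. */
            if (sizeof(wchar_t) >= 4 || c <= 0xFFFF)
                return (unsigned int) towupper((wint_t) c);
            return c;
        case SKETCH_SINGLE_BYTE:
            /* Only values up to 255 are safe to hand to <ctype.h>. */
            if (c <= 255)
                return (unsigned int) toupper((int) c);
            return c;
    }
    return c;
}
```

With a Turkish LC_CTYPE, the library functions may map `i` to a dotted capital I; the early return above is what keeps `i` mapping to `I` in the default collation, while a non-default collation falls through to the library call.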
