Skip to content

Commit cf98467

Browse files
committed
Improve behavior of to_timestamp()/to_date() functions
to_timestamp()/to_date() functions were introduced mainly for Oracle compatibility, and became very popular among PostgreSQL users. However, some behavior of to_timestamp()/to_date() functions are both incompatible with Oracle and confusing for our users. This behavior is related to handling of spaces and separators in non FX (fixed format) mode. This commit reworks this behavior making less confusing, better documented and more compatible with Oracle. Nevertheless, there are still following incompatibilities with Oracle. 1) We don't insist that there are no format string patterns unmatched to input string. 2) In FX mode we don't insist space and separators in format string to exactly match input string. 3) When format string patterns are divided by mix of spaces and separators, we don't distinguish them, while Oracle takes into account only last group of spaces/separators. Discussion: https://postgr.es/m/1873520224.1784572.1465833145330.JavaMail.yahoo%40mail.yahoo.com Author: Artur Zakirov, Alexander Korotkov, Liudmila Mantrova Review: Amul Sul, Robert Haas, Tom Lane, Dmitry Dolgov, David G. Johnston
1 parent 5f08acc commit cf98467

File tree

4 files changed

+280
-26
lines changed

4 files changed

+280
-26
lines changed

doc/src/sgml/func.sgml

Lines changed: 58 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6262,16 +6262,57 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
62626262
<listitem>
62636263
<para>
62646264
<function>to_timestamp</function> and <function>to_date</function>
6265-
skip multiple blank spaces in the input string unless the
6266-
<literal>FX</literal> option is used. For example,
6267-
<literal>to_timestamp('2000&nbsp;&nbsp;&nbsp;&nbsp;JUN', 'YYYY MON')</literal> works, but
6265+
skip multiple blank spaces at the beginning of the input string and
6266+
around date and time values unless the <literal>FX</literal> option is used. For example,
6267+
<literal>to_timestamp('&nbsp;2000&nbsp;&nbsp;&nbsp;&nbsp;JUN', 'YYYY MON')</literal> and
6268+
<literal>to_timestamp('2000 - JUN', 'YYYY-MON')</literal> work, but
62686269
<literal>to_timestamp('2000&nbsp;&nbsp;&nbsp;&nbsp;JUN', 'FXYYYY MON')</literal> returns an error
6269-
because <function>to_timestamp</function> expects one space only.
6270+
because <function>to_timestamp</function> expects a single space only.
62706271
<literal>FX</literal> must be specified as the first item in
62716272
the template.
62726273
</para>
62736274
</listitem>
62746275

6276+
<listitem>
6277+
<para>
6278+
A separator (a space or a non-letter/non-digit character) in the template string of
6279+
<function>to_timestamp</function> and <function>to_date</function>
6280+
matches any single separator in the input string or is skipped,
6281+
unless the <literal>FX</literal> option is used.
6282+
For example, <literal>to_timestamp('2000JUN', 'YYYY///MON')</literal> and
6283+
<literal>to_timestamp('2000/JUN', 'YYYY MON')</literal> work, but
6284+
<literal>to_timestamp('2000//JUN', 'YYYY/MON')</literal>
6285+
returns an error because the number of separators in the input string
6286+
exceeds the number of separators in the template.
6287+
</para>
6288+
<para>
6289+
If <literal>FX</literal> is specified, separator in template string
6290+
matches to exactly one character in input string. Notice we don't insist
6291+
input string character to be the same as template string separator.
6292+
For example, <literal>to_timestamp('2000/JUN', 'FXYYYY MON')</literal>
6293+
works, but <literal>to_timestamp('2000/JUN', 'FXYYYY&nbsp;&nbsp;MON')</literal>
6294+
returns an error because a space second template string space consumed
6295+
letter <literal>J</literal> from the input string.
6296+
</para>
6297+
</listitem>
6298+
6299+
<listitem>
6300+
<para>
6301+
<literal>TZH</literal> template pattern can match a signed number.
6302+
Without the <literal>FX</literal> option, it may lead to ambiguity in
6303+
interpretation of the minus sign, which can also be interpreted as a separator.
6304+
This ambiguity is resolved as follows. If the number of separators before
6305+
<literal>TZH</literal> in the template string is less than the number of
6306+
separators before the minus sign in the input string, the minus sign
6307+
is interpreted as part of <literal>TZH</literal>.
6308+
Otherwise, the minus sign is considered to be a separator between values.
6309+
For example, <literal>to_timestamp('2000 -10', 'YYYY TZH')</literal> matches
6310+
<literal>-10</literal> to <literal>TZH</literal>, but
6311+
<literal>to_timestamp('2000 -10', 'YYYY&nbsp;&nbsp;TZH')</literal>
6312+
matches <literal>10</literal> to <literal>TZH</literal>.
6313+
</para>
6314+
</listitem>
6315+
62756316
<listitem>
62766317
<para>
62776318
Ordinary text is allowed in <function>to_char</function>
@@ -6287,6 +6328,19 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
62876328
string; for example <literal>"XX"</literal> skips two input characters
62886329
(whether or not they are <literal>XX</literal>).
62896330
</para>
6331+
<tip>
6332+
<para>
6333+
Prior to <productname>PostgreSQL</productname> 12, it was possible to
6334+
skip arbitrary text in the input string using non-letter or non-digit
6335+
characters. For example,
6336+
<literal>to_timestamp('2000y6m1d', 'yyyy-MM-DD')</literal> used to
6337+
work. Now you can only use letter characters for this purpose. For example,
6338+
<literal>to_timestamp('2000y6m1d', 'yyyytMMtDDt')</literal> and
6339+
<literal>to_timestamp('2000y6m1d', 'yyyy"y"MM"m"DD"d"')</literal>
6340+
skip <literal>y</literal>, <literal>m</literal>, and
6341+
<literal>d</literal>.
6342+
</para>
6343+
</tip>
62906344
</listitem>
62916345

62926346
<listitem>

src/backend/utils/adt/formatting.c

Lines changed: 98 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -165,6 +165,8 @@ typedef struct
165165
#define NODE_TYPE_END 1
166166
#define NODE_TYPE_ACTION 2
167167
#define NODE_TYPE_CHAR 3
168+
#define NODE_TYPE_SEPARATOR 4
169+
#define NODE_TYPE_SPACE 5
168170

169171
#define SUFFTYPE_PREFIX 1
170172
#define SUFFTYPE_POSTFIX 2
@@ -955,6 +957,7 @@ typedef struct NUMProc
955957
static const KeyWord *index_seq_search(const char *str, const KeyWord *kw,
956958
const int *index);
957959
static const KeySuffix *suff_search(const char *str, const KeySuffix *suf, int type);
960+
static bool is_separator_char(const char *str);
958961
static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
959962
static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
960963
const KeySuffix *suf, const int *index, int ver, NUMDesc *Num);
@@ -1044,6 +1047,16 @@ suff_search(const char *str, const KeySuffix *suf, int type)
10441047
return NULL;
10451048
}
10461049

1050+
static bool
1051+
is_separator_char(const char *str)
1052+
{
1053+
/* ASCII printable character, but not letter or digit */
1054+
return (*str > 0x20 && *str < 0x7F &&
1055+
!(*str >= 'A' && *str <= 'Z') &&
1056+
!(*str >= 'a' && *str <= 'z') &&
1057+
!(*str >= '0' && *str <= '9'));
1058+
}
1059+
10471060
/* ----------
10481061
* Prepare NUMDesc (number description struct) via FormatNode struct
10491062
* ----------
@@ -1319,7 +1332,14 @@ parse_format(FormatNode *node, const char *str, const KeyWord *kw,
13191332
if (*str == '\\' && *(str + 1) == '"')
13201333
str++;
13211334
chlen = pg_mblen(str);
1322-
n->type = NODE_TYPE_CHAR;
1335+
1336+
if (ver == DCH_TYPE && is_separator_char(str))
1337+
n->type = NODE_TYPE_SEPARATOR;
1338+
else if (isspace((unsigned char) *str))
1339+
n->type = NODE_TYPE_SPACE;
1340+
else
1341+
n->type = NODE_TYPE_CHAR;
1342+
13231343
memcpy(n->character, str, chlen);
13241344
n->character[chlen] = '\0';
13251345
n->key = NULL;
@@ -2987,27 +3007,66 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
29873007
int len,
29883008
value;
29893009
bool fx_mode = false;
3010+
/* number of extra skipped characters (more than given in format string) */
3011+
int extra_skip = 0;
29903012

29913013
for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
29923014
{
2993-
if (n->type != NODE_TYPE_ACTION)
3015+
/*
3016+
* Ignore spaces at the beginning of the string and before fields when
3017+
* not in FX (fixed width) mode.
3018+
*/
3019+
if (!fx_mode && (n->type != NODE_TYPE_ACTION || n->key->id != DCH_FX) &&
3020+
(n->type == NODE_TYPE_ACTION || n == node))
3021+
{
3022+
while (*s != '\0' && isspace((unsigned char) *s))
3023+
{
3024+
s++;
3025+
extra_skip++;
3026+
}
3027+
}
3028+
3029+
if (n->type == NODE_TYPE_SPACE || n->type == NODE_TYPE_SEPARATOR)
3030+
{
3031+
if (!fx_mode)
3032+
{
3033+
/*
3034+
* In non FX (fixed format) mode one format string space or
3035+
* separator match to one space or separator in input string.
3036+
* Or match nothing if there is no space or separator in
3037+
* the current position of input string.
3038+
*/
3039+
extra_skip--;
3040+
if (isspace((unsigned char) *s) || is_separator_char(s))
3041+
{
3042+
s++;
3043+
extra_skip++;
3044+
}
3045+
}
3046+
else
3047+
{
3048+
/*
3049+
* In FX mode, on format string space or separator we consume
3050+
* exactly one character from input string. Notice we don't
3051+
* insist that the consumed character match the format's
3052+
* character.
3053+
*/
3054+
s += pg_mblen(s);
3055+
}
3056+
continue;
3057+
}
3058+
else if (n->type != NODE_TYPE_ACTION)
29943059
{
29953060
/*
2996-
* Separator, so consume one character from input string. Notice
2997-
* we don't insist that the consumed character match the format's
2998-
* character.
3061+
* Text character, so consume one character from input string.
3062+
* Notice we don't insist that the consumed character match the
3063+
* format's character.
3064+
* Text field ignores FX mode.
29993065
*/
30003066
s += pg_mblen(s);
30013067
continue;
30023068
}
30033069

3004-
/* Ignore spaces before fields when not in FX (fixed width) mode */
3005-
if (!fx_mode && n->key->id != DCH_FX)
3006-
{
3007-
while (*s != '\0' && isspace((unsigned char) *s))
3008-
s++;
3009-
}
3010-
30113070
from_char_set_mode(out, n->key->date_mode);
30123071

30133072
switch (n->key->id)
@@ -3086,10 +3145,24 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
30863145
n->key->name)));
30873146
break;
30883147
case DCH_TZH:
3089-
out->tzsign = *s == '-' ? -1 : +1;
3090-
3148+
/*
3149+
* Value of TZH might be negative. And the issue is that we
3150+
* might swallow minus sign as the separator. So, if we have
3151+
* skipped more characters than specified in the format string,
3152+
* then we consider prepending last skipped minus to TZH.
3153+
*/
30913154
if (*s == '+' || *s == '-' || *s == ' ')
3155+
{
3156+
out->tzsign = *s == '-' ? -1 : +1;
30923157
s++;
3158+
}
3159+
else
3160+
{
3161+
if (extra_skip > 0 && *(s - 1) == '-')
3162+
out->tzsign = -1;
3163+
else
3164+
out->tzsign = +1;
3165+
}
30933166

30943167
from_char_parse_int_len(&out->tzh, &s, 2, n);
30953168
break;
@@ -3261,6 +3334,17 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
32613334
SKIP_THth(s, n->suffix);
32623335
break;
32633336
}
3337+
3338+
/* Ignore all spaces after fields */
3339+
if (!fx_mode)
3340+
{
3341+
extra_skip = 0;
3342+
while (*s != '\0' && isspace((unsigned char) *s))
3343+
{
3344+
s++;
3345+
extra_skip++;
3346+
}
3347+
}
32643348
}
32653349
}
32663350

0 commit comments

Comments
 (0)