Commit 8d1dadb

Accept XML documents when xmloption = content, as required by SQL:2006+.

Previously we were using the SQL:2003 definition, which doesn't allow this, but that creates a serious dump/restore gotcha: there is no setting of xmloption that will allow all valid XML data. Hence, switch to the 2006 definition.

Since libxml doesn't accept <!DOCTYPE> directives in the mode we use for CONTENT parsing, the implementation is to detect <!DOCTYPE> in the input and switch to DOCUMENT parsing mode. This should not cost much, because <!DOCTYPE> should be close to the front of the input if it's there at all. It's possible that this causes the error messages for malformed input to be slightly different than they were before, if said input includes <!DOCTYPE>; but that does not seem like a big problem.

In passing, buy back a few cycles in parsing of large XML documents by not doing strlen() of the whole input in parse_xml_decl().

Back-patch because dump/restore failures are not nice. This change shouldn't break any cases that worked before, so it seems safe to back-patch.

Chapman Flack (revised a bit by me)

Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com

1 parent 05f110c commit 8d1dadb
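
The detection step described in the commit message can be tried outside the server. The following is a minimal sketch, assuming plain C string functions in place of the libxml helpers used by the patch; the names skip_xml_space and doctype_in_prolog are hypothetical and do not appear in the commit. It expects input that is already positioned past any XML declaration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Skip the four XML whitespace characters (space, tab, CR, LF). */
static const char *
skip_xml_space(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/*
 * Hypothetical stand-in for the scan: return true only if the input
 * reaches "<!DOCTYPE" after nothing but whitespace, comments, and
 * processing instructions.  Anything else, including malformed input,
 * yields false so that the caller keeps parsing as CONTENT.
 */
static bool
doctype_in_prolog(const char *str)
{
    const char *p = str;

    assert(str != NULL);

    for (;;)
    {
        const char *e;

        p = skip_xml_space(p);
        if (*p != '<')
            return false;
        p++;

        if (*p == '!')
        {
            p++;
            if (strncmp(p, "DOCTYPE", 7) == 0)
                return true;
            /* the only other "<!" construct allowed here is a comment */
            if (strncmp(p, "--", 2) != 0)
                return false;
            /* a comment ends at the first "--", which must be followed by '>' */
            p = strstr(p + 2, "--");
            if (p == NULL || p[2] != '>')
                return false;
            p += 3;
            continue;
        }

        /* otherwise it must be a processing instruction */
        if (*p != '?')
            return false;
        p++;
        e = strstr(p, "?>");
        if (e == NULL)
            return false;
        /* reject a target that starts with "xml" in any letter case */
        if (e - p >= 3 &&
            (p[0] == 'x' || p[0] == 'X') &&
            (p[1] == 'm' || p[1] == 'M') &&
            (p[2] == 'l' || p[2] == 'L'))
            return false;
        p = e + 2;
    }
}
```

Calling doctype_in_prolog("<!-- hi--> <!DOCTYPE a><a/>") returns true, while doctype_in_prolog("<!-- hi--> oops <!DOCTYPE a><a/>") returns false, because the stray text means the scan never reaches the DOCTYPE through a valid prolog. Returning false on anything unexpected keeps the check cheap and leaves error reporting to the real parser.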

6 files changed: +271 -29 lines changed

doc/src/sgml/datatype.sgml

+5-13
Original file line number | Diff line number | Diff line change
@@ -4208,9 +4208,11 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
42084208
<para>
42094209
The <type>xml</type> type can store well-formed
42104210
<quote>documents</quote>, as defined by the XML standard, as well
4211-
as <quote>content</quote> fragments, which are defined by the
4212-
production <literal>XMLDecl? content</literal> in the XML
4213-
standard. Roughly, this means that content fragments can have
4211+
as <quote>content</quote> fragments, which are defined by reference
4212+
to the more permissive
4213+
<ulink url="https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#DocumentNode"><quote>document node</quote></ulink>
4214+
of the XQuery and XPath data model.
4215+
Roughly, this means that content fragments can have
42144216
more than one top-level element or character node. The expression
42154217
<literal><replaceable>xmlvalue</replaceable> IS DOCUMENT</literal>
42164218
can be used to evaluate whether a particular <type>xml</type>
@@ -4285,16 +4287,6 @@ SET xmloption TO { DOCUMENT | CONTENT };
42854287
data are allowed.
42864288
</para>
42874289

4288-
<note>
4289-
<para>
4290-
With the default XML option setting, you cannot directly cast
4291-
character strings to type <type>xml</type> if they contain a
4292-
document type declaration, because the definition of XML content
4293-
fragment does not accept them. If you need to do that, either
4294-
use <literal>XMLPARSE</literal> or change the XML option.
4295-
</para>
4296-
</note>
4297-
42984290
</sect2>
42994291

43004292
<sect2>

src/backend/utils/adt/xml.c

+125-16
Original file line number | Diff line number | Diff line change
@@ -141,6 +141,7 @@ static int parse_xml_decl(const xmlChar *str, size_t *lenp,
141141
xmlChar **version, xmlChar **encoding, int *standalone);
142142
static bool print_xml_decl(StringInfo buf, const xmlChar *version,
143143
pg_enc encoding, int standalone);
144+
static bool xml_doctype_in_content(const xmlChar *str);
144145
static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
145146
bool preserve_whitespace, int encoding);
146147
static text *xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt);
@@ -1243,8 +1244,15 @@ parse_xml_decl(const xmlChar *str, size_t *lenp,
12431244
if (xmlStrncmp(p, (xmlChar *) "<?xml", 5) != 0)
12441245
goto finished;
12451246

1246-
/* if next char is name char, it's a PI like <?xml-stylesheet ...?> */
1247-
utf8len = strlen((const char *) (p + 5));
1247+
/*
1248+
* If next char is a name char, it's a PI like <?xml-stylesheet ...?>
1249+
* rather than an XMLDecl, so we have done what we came to do and found no
1250+
* XMLDecl.
1251+
*
1252+
* We need an input length value for xmlGetUTF8Char, but there's no need
1253+
* to count the whole document size, so use strnlen not strlen.
1254+
*/
1255+
utf8len = strnlen((const char *) (p + 5), MAX_MULTIBYTE_CHAR_LEN);
12481256
utf8char = xmlGetUTF8Char(p + 5, &utf8len);
12491257
if (PG_XMLISNAMECHAR(utf8char))
12501258
goto finished;
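
The hunk above swaps an unbounded strlen() for a bounded length, since only enough bytes to decode one character are needed. The following is a minimal sketch of that idea, assuming a portable memchr-based helper in place of the strnlen call the patch uses; sketch_bounded_len and SKETCH_MAX_CHAR_LEN are hypothetical names, and the value 4 is an assumption standing in for MAX_MULTIBYTE_CHAR_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed upper bound on the byte length of one encoded character. */
#define SKETCH_MAX_CHAR_LEN 4

/*
 * Return the length of s, but never look at more than maxlen bytes.
 * This mirrors what strnlen does; memchr stops at the first NUL it finds,
 * so the cost no longer depends on the size of the whole input.
 */
static size_t
sketch_bounded_len(const char *s, size_t maxlen)
{
    const char *nul;

    assert(s != NULL);

    nul = memchr(s, '\0', maxlen);
    return nul ? (size_t) (nul - s) : maxlen;
}
```

For a long input such as "-stylesheet href=...", sketch_bounded_len(s, SKETCH_MAX_CHAR_LEN) returns 4 after inspecting four bytes, where strlen would walk the entire document.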
@@ -1415,6 +1423,88 @@ print_xml_decl(StringInfo buf, const xmlChar *version,
14151423
return false;
14161424
}
14171425

1426+
/*
1427+
* Test whether an input that is to be parsed as CONTENT contains a DTD.
1428+
*
1429+
* The SQL/XML:2003 definition of CONTENT ("XMLDecl? content") is not
1430+
* satisfied by a document with a DTD, which is a bit of a wart, as it means
1431+
* the CONTENT type is not a proper superset of DOCUMENT. SQL/XML:2006 and
1432+
* later fix that, by redefining content with reference to the "more
1433+
* permissive" Document Node of the XQuery/XPath Data Model, such that any
1434+
* DOCUMENT value is indeed also a CONTENT value. That definition is more
1435+
* useful, as CONTENT becomes usable for parsing input of unknown form (think
1436+
* pg_restore).
1437+
*
1438+
* As used below in parse_xml when parsing for CONTENT, libxml does not give
1439+
* us the 2006+ behavior, but only the 2003; it will choke if the input has
1440+
* a DTD. But we can provide the 2006+ definition of CONTENT easily enough,
1441+
* by detecting this case first and simply doing the parse as DOCUMENT.
1442+
*
1443+
* A DTD can be found arbitrarily far in, but that would be a contrived case;
1444+
* it will ordinarily start within a few dozen characters. The only things
1445+
* that can precede it are an XMLDecl (here, the caller will have called
1446+
* parse_xml_decl already), whitespace, comments, and processing instructions.
1447+
* This function need only return true if it sees a valid sequence of such
1448+
* things leading to <!DOCTYPE. It can simply return false in any other
1449+
* cases, including malformed input; that will mean the input gets parsed as
1450+
* CONTENT as originally planned, with libxml reporting any errors.
1451+
*
1452+
* This is only to be called from xml_parse, when pg_xml_init has already
1453+
* been called. The input is already in UTF8 encoding.
1454+
*/
1455+
static bool
1456+
xml_doctype_in_content(const xmlChar *str)
1457+
{
1458+
const xmlChar *p = str;
1459+
1460+
for (;;)
1461+
{
1462+
const xmlChar *e;
1463+
1464+
SKIP_XML_SPACE(p);
1465+
if (*p != '<')
1466+
return false;
1467+
p++;
1468+
1469+
if (*p == '!')
1470+
{
1471+
p++;
1472+
1473+
/* if we see <!DOCTYPE, we can return true */
1474+
if (xmlStrncmp(p, (xmlChar *) "DOCTYPE", 7) == 0)
1475+
return true;
1476+
1477+
/* otherwise, if it's not a comment, fail */
1478+
if (xmlStrncmp(p, (xmlChar *) "--", 2) != 0)
1479+
return false;
1480+
/* find end of comment: find -- and a > must follow */
1481+
p = xmlStrstr(p + 2, (xmlChar *) "--");
1482+
if (!p || p[2] != '>')
1483+
return false;
1484+
/* advance over comment, and keep scanning */
1485+
p += 3;
1486+
continue;
1487+
}
1488+
1489+
/* otherwise, if it's not a PI <?target something?>, fail */
1490+
if (*p != '?')
1491+
return false;
1492+
p++;
1493+
1494+
/* find end of PI (the string ?> is forbidden within a PI) */
1495+
e = xmlStrstr(p, (xmlChar *) "?>");
1496+
if (!e)
1497+
return false;
1498+
1499+
/* we don't check PIs carefully, but do reject "xml" target */
1500+
if (e - p >= 3 && xmlStrncasecmp(p, (xmlChar *) "xml", 3) == 0)
1501+
return false;
1502+
1503+
/* advance over PI, keep scanning */
1504+
p = e + 2;
1505+
}
1506+
}
1507+
14181508

14191509
/*
14201510
* Convert a C string to XML internal representation
@@ -1450,14 +1540,38 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
14501540
/* Use a TRY block to ensure we clean up correctly */
14511541
PG_TRY();
14521542
{
1543+
bool parse_as_document = false;
1544+
int res_code;
1545+
size_t count = 0;
1546+
xmlChar *version = NULL;
1547+
int standalone = 0;
1548+
14531549
xmlInitParser();
14541550

14551551
ctxt = xmlNewParserCtxt();
14561552
if (ctxt == NULL || xmlerrcxt->err_occurred)
14571553
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
14581554
"could not allocate parser context");
14591555

1556+
/* Decide whether to parse as document or content */
14601557
if (xmloption_arg == XMLOPTION_DOCUMENT)
1558+
parse_as_document = true;
1559+
else
1560+
{
1561+
/* Parse and skip over the XML declaration, if any */
1562+
res_code = parse_xml_decl(utf8string,
1563+
&count, &version, NULL, &standalone);
1564+
if (res_code != 0)
1565+
xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
1566+
"invalid XML content: invalid XML declaration",
1567+
res_code);
1568+
1569+
/* Is there a DOCTYPE element? */
1570+
if (xml_doctype_in_content(utf8string + count))
1571+
parse_as_document = true;
1572+
}
1573+
1574+
if (parse_as_document)
14611575
{
14621576
/*
14631577
* Note, that here we try to apply DTD defaults
@@ -1472,23 +1586,18 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
14721586
XML_PARSE_NOENT | XML_PARSE_DTDATTR
14731587
| (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS));
14741588
if (doc == NULL || xmlerrcxt->err_occurred)
1475-
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
1476-
"invalid XML document");
1589+
{
1590+
/* Use original option to decide which error code to throw */
1591+
if (xmloption_arg == XMLOPTION_DOCUMENT)
1592+
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
1593+
"invalid XML document");
1594+
else
1595+
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
1596+
"invalid XML content");
1597+
}
14771598
}
14781599
else
14791600
{
1480-
int res_code;
1481-
size_t count;
1482-
xmlChar *version;
1483-
int standalone;
1484-
1485-
res_code = parse_xml_decl(utf8string,
1486-
&count, &version, NULL, &standalone);
1487-
if (res_code != 0)
1488-
xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
1489-
"invalid XML content: invalid XML declaration",
1490-
res_code);
1491-
14921601
doc = xmlNewDoc(version);
14931602
Assert(doc->encoding == NULL);
14941603
doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
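
The control flow in the xml_parse hunks reduces to two small decisions: which parsing mode to use, and which error to report when the document-mode parse fails. The following is a minimal sketch of just those decisions, with the libxml calls left out; the type SketchXmlOption and both function names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the server's XmlOptionType. */
typedef enum
{
    SKETCH_OPT_DOCUMENT,
    SKETCH_OPT_CONTENT
} SketchXmlOption;

/*
 * Parse as a document when the caller asked for DOCUMENT, or when the
 * caller asked for CONTENT but the prolog scan found a DOCTYPE.
 */
static bool
sketch_parse_as_document(SketchXmlOption requested, bool doctype_found)
{
    assert(requested == SKETCH_OPT_DOCUMENT || requested == SKETCH_OPT_CONTENT);
    return requested == SKETCH_OPT_DOCUMENT || doctype_found;
}

/*
 * The error message follows the option the caller requested, not the
 * mode that was actually used for parsing.
 */
static const char *
sketch_error_message(SketchXmlOption requested)
{
    return requested == SKETCH_OPT_DOCUMENT
        ? "invalid XML document"
        : "invalid XML content";
}
```

This matches the regression output for '<!DOCTYPE a><a/><b/>': under XML OPTION CONTENT the input is parsed in document mode, so the detail line comes from the document parser, yet the reported error is "invalid XML content".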

src/test/regress/expected/xml.out

+46
Original file line number | Diff line number | Diff line change
@@ -532,6 +532,13 @@ LINE 1: EXECUTE foo ('bad');
532532
DETAIL: line 1: Start tag expected, '<' not found
533533
bad
534534
^
535+
SELECT xml '<!DOCTYPE a><a/><b/>';
536+
ERROR: invalid XML document
537+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
538+
^
539+
DETAIL: line 1: Extra content at the end of the document
540+
<!DOCTYPE a><a/><b/>
541+
^
535542
SET XML OPTION CONTENT;
536543
EXECUTE foo ('<bar/>');
537544
xmlconcat
@@ -545,6 +552,45 @@ EXECUTE foo ('good');
545552
<foo/>good
546553
(1 row)
547554

555+
SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
556+
xml
557+
--------------------------------------------------------------------
558+
<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>
559+
(1 row)
560+
561+
SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
562+
xml
563+
------------------------------
564+
<!-- hi--> <!DOCTYPE a><a/>
565+
(1 row)
566+
567+
SELECT xml '<!DOCTYPE a><a/>';
568+
xml
569+
------------------
570+
<!DOCTYPE a><a/>
571+
(1 row)
572+
573+
SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
574+
ERROR: invalid XML content
575+
LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
576+
^
577+
DETAIL: line 1: StartTag: invalid element name
578+
<!-- hi--> oops <!DOCTYPE a><a/>
579+
^
580+
SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
581+
ERROR: invalid XML content
582+
LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
583+
^
584+
DETAIL: line 1: StartTag: invalid element name
585+
<!-- hi--> <oops/> <!DOCTYPE a><a/>
586+
^
587+
SELECT xml '<!DOCTYPE a><a/><b/>';
588+
ERROR: invalid XML content
589+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
590+
^
591+
DETAIL: line 1: Extra content at the end of the document
592+
<!DOCTYPE a><a/><b/>
593+
^
548594
-- Test backwards parsing
549595
CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
550596
CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');

src/test/regress/expected/xml_1.out

+42
Original file line number | Diff line number | Diff line change
@@ -429,11 +429,53 @@ EXECUTE foo ('<bar/>');
429429
ERROR: prepared statement "foo" does not exist
430430
EXECUTE foo ('bad');
431431
ERROR: prepared statement "foo" does not exist
432+
SELECT xml '<!DOCTYPE a><a/><b/>';
433+
ERROR: unsupported XML feature
434+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
435+
^
436+
DETAIL: This functionality requires the server to be built with libxml support.
437+
HINT: You need to rebuild PostgreSQL using --with-libxml.
432438
SET XML OPTION CONTENT;
433439
EXECUTE foo ('<bar/>');
434440
ERROR: prepared statement "foo" does not exist
435441
EXECUTE foo ('good');
436442
ERROR: prepared statement "foo" does not exist
443+
SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
444+
ERROR: unsupported XML feature
445+
LINE 1: SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?...
446+
^
447+
DETAIL: This functionality requires the server to be built with libxml support.
448+
HINT: You need to rebuild PostgreSQL using --with-libxml.
449+
SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
450+
ERROR: unsupported XML feature
451+
LINE 1: SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/...
452+
^
453+
DETAIL: This functionality requires the server to be built with libxml support.
454+
HINT: You need to rebuild PostgreSQL using --with-libxml.
455+
SELECT xml '<!DOCTYPE a><a/>';
456+
ERROR: unsupported XML feature
457+
LINE 1: SELECT xml '<!DOCTYPE a><a/>';
458+
^
459+
DETAIL: This functionality requires the server to be built with libxml support.
460+
HINT: You need to rebuild PostgreSQL using --with-libxml.
461+
SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
462+
ERROR: unsupported XML feature
463+
LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
464+
^
465+
DETAIL: This functionality requires the server to be built with libxml support.
466+
HINT: You need to rebuild PostgreSQL using --with-libxml.
467+
SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
468+
ERROR: unsupported XML feature
469+
LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
470+
^
471+
DETAIL: This functionality requires the server to be built with libxml support.
472+
HINT: You need to rebuild PostgreSQL using --with-libxml.
473+
SELECT xml '<!DOCTYPE a><a/><b/>';
474+
ERROR: unsupported XML feature
475+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
476+
^
477+
DETAIL: This functionality requires the server to be built with libxml support.
478+
HINT: You need to rebuild PostgreSQL using --with-libxml.
437479
-- Test backwards parsing
438480
CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
439481
CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');

src/test/regress/expected/xml_2.out

+46
Original file line number | Diff line number | Diff line change
@@ -512,6 +512,13 @@ LINE 1: EXECUTE foo ('bad');
512512
DETAIL: line 1: Start tag expected, '<' not found
513513
bad
514514
^
515+
SELECT xml '<!DOCTYPE a><a/><b/>';
516+
ERROR: invalid XML document
517+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
518+
^
519+
DETAIL: line 1: Extra content at the end of the document
520+
<!DOCTYPE a><a/><b/>
521+
^
515522
SET XML OPTION CONTENT;
516523
EXECUTE foo ('<bar/>');
517524
xmlconcat
@@ -525,6 +532,45 @@ EXECUTE foo ('good');
525532
<foo/>good
526533
(1 row)
527534

535+
SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
536+
xml
537+
--------------------------------------------------------------------
538+
<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>
539+
(1 row)
540+
541+
SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
542+
xml
543+
------------------------------
544+
<!-- hi--> <!DOCTYPE a><a/>
545+
(1 row)
546+
547+
SELECT xml '<!DOCTYPE a><a/>';
548+
xml
549+
------------------
550+
<!DOCTYPE a><a/>
551+
(1 row)
552+
553+
SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
554+
ERROR: invalid XML content
555+
LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
556+
^
557+
DETAIL: line 1: StartTag: invalid element name
558+
<!-- hi--> oops <!DOCTYPE a><a/>
559+
^
560+
SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
561+
ERROR: invalid XML content
562+
LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
563+
^
564+
DETAIL: line 1: StartTag: invalid element name
565+
<!-- hi--> <oops/> <!DOCTYPE a><a/>
566+
^
567+
SELECT xml '<!DOCTYPE a><a/><b/>';
568+
ERROR: invalid XML content
569+
LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
570+
^
571+
DETAIL: line 1: Extra content at the end of the document
572+
<!DOCTYPE a><a/><b/>
573+
^
528574
-- Test backwards parsing
529575
CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
530576
CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');
