Commit 78f84fe

Accept XML documents when xmloption = content, as required by SQL:2006+.

Previously we were using the SQL:2003 definition, which doesn't allow this, but that creates a serious dump/restore gotcha: there is no setting of xmloption that will allow all valid XML data. Hence, switch to the 2006 definition.

Since libxml doesn't accept <!DOCTYPE> directives in the mode we use for CONTENT parsing, the implementation is to detect <!DOCTYPE> in the input and switch to DOCUMENT parsing mode. This should not cost much, because <!DOCTYPE> should be close to the front of the input if it's there at all. It's possible that this causes the error messages for malformed input to be slightly different than they were before, if said input includes <!DOCTYPE>; but that does not seem like a big problem.

In passing, buy back a few cycles in parsing of large XML documents by not doing strlen() of the whole input in parse_xml_decl().

Back-patch because dump/restore failures are not nice. This change shouldn't break any cases that worked before, so it seems safe to back-patch.

Chapman Flack (revised a bit by me)

Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com

1 parent a169503 commit 78f84fe
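
The "buy back a few cycles" remark in the commit message can be illustrated with a small standalone sketch. It is not PostgreSQL code: `bounded_len` and `PROBE_MAX` are hypothetical names, with `bounded_len` standing in for `strnlen()` and `PROBE_MAX` for `MAX_MULTIBYTE_CHAR_LEN` (assumed here to be 4). The point is only that the probe after `<?xml` needs to look at one character, so the scan can stop after a few bytes no matter how long the document is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_MULTIBYTE_CHAR_LEN (assumed value: 4). */
#define PROBE_MAX 4

/*
 * Same contract as strnlen(): return the length of s, but never examine
 * more than max bytes.  Written as a plain loop so the sketch compiles
 * with any C99 compiler.
 */
static size_t
bounded_len(const char *s, size_t max)
{
    size_t      n = 0;

    assert(s != NULL);
    while (n < max && s[n] != '\0')
        n++;
    return n;
}
```

With an unbounded `strlen()` the cost grows with the size of the whole input; with the bounded version it is capped at `PROBE_MAX` byte comparisons, which is all a decoder for a single UTF-8 character needs to be told about.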

6 files changed, 271 additions and 29 deletions
doc/src/sgml/datatype.sgml

Lines changed: 5 additions & 13 deletions
@@ -4096,9 +4096,11 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
    <para>
     The <type>xml</type> type can store well-formed
     <quote>documents</quote>, as defined by the XML standard, as well
-    as <quote>content</quote> fragments, which are defined by the
-    production <literal>XMLDecl? content</literal> in the XML
-    standard. Roughly, this means that content fragments can have
+    as <quote>content</quote> fragments, which are defined by reference
+    to the more permissive
+    <ulink url="https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#DocumentNode"><quote>document node</quote></ulink>
+    of the XQuery and XPath data model.
+    Roughly, this means that content fragments can have
     more than one top-level element or character node. The expression
     <literal><replaceable>xmlvalue</replaceable> IS DOCUMENT</literal>
     can be used to evaluate whether a particular <type>xml</type>
@@ -4173,16 +4175,6 @@ SET xmloption TO { DOCUMENT | CONTENT };
     data are allowed.
    </para>
 
-   <note>
-    <para>
-     With the default XML option setting, you cannot directly cast
-     character strings to type <type>xml</type> if they contain a
-     document type declaration, because the definition of XML content
-     fragment does not accept them. If you need to do that, either
-     use <literal>XMLPARSE</literal> or change the XML option.
-    </para>
-   </note>
-
   </sect2>
 
   <sect2>

src/backend/utils/adt/xml.c

Lines changed: 125 additions & 16 deletions
@@ -139,6 +139,7 @@ static int parse_xml_decl(const xmlChar *str, size_t *lenp,
                 xmlChar **version, xmlChar **encoding, int *standalone);
 static bool print_xml_decl(StringInfo buf, const xmlChar *version,
                 pg_enc encoding, int standalone);
+static bool xml_doctype_in_content(const xmlChar *str);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
                 bool preserve_whitespace, int encoding);
 static text *xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt);
@@ -1154,8 +1155,15 @@ parse_xml_decl(const xmlChar *str, size_t *lenp,
     if (xmlStrncmp(p, (xmlChar *) "<?xml", 5) != 0)
         goto finished;
 
-    /* if next char is name char, it's a PI like <?xml-stylesheet ...?> */
-    utf8len = strlen((const char *) (p + 5));
+    /*
+     * If next char is a name char, it's a PI like <?xml-stylesheet ...?>
+     * rather than an XMLDecl, so we have done what we came to do and found no
+     * XMLDecl.
+     *
+     * We need an input length value for xmlGetUTF8Char, but there's no need
+     * to count the whole document size, so use strnlen not strlen.
+     */
+    utf8len = strnlen((const char *) (p + 5), MAX_MULTIBYTE_CHAR_LEN);
     utf8char = xmlGetUTF8Char(p + 5, &utf8len);
     if (PG_XMLISNAMECHAR(utf8char))
         goto finished;
@@ -1326,6 +1334,88 @@ print_xml_decl(StringInfo buf, const xmlChar *version,
         return false;
 }
 
+/*
+ * Test whether an input that is to be parsed as CONTENT contains a DTD.
+ *
+ * The SQL/XML:2003 definition of CONTENT ("XMLDecl? content") is not
+ * satisfied by a document with a DTD, which is a bit of a wart, as it means
+ * the CONTENT type is not a proper superset of DOCUMENT. SQL/XML:2006 and
+ * later fix that, by redefining content with reference to the "more
+ * permissive" Document Node of the XQuery/XPath Data Model, such that any
+ * DOCUMENT value is indeed also a CONTENT value. That definition is more
+ * useful, as CONTENT becomes usable for parsing input of unknown form (think
+ * pg_restore).
+ *
+ * As used below in parse_xml when parsing for CONTENT, libxml does not give
+ * us the 2006+ behavior, but only the 2003; it will choke if the input has
+ * a DTD. But we can provide the 2006+ definition of CONTENT easily enough,
+ * by detecting this case first and simply doing the parse as DOCUMENT.
+ *
+ * A DTD can be found arbitrarily far in, but that would be a contrived case;
+ * it will ordinarily start within a few dozen characters. The only things
+ * that can precede it are an XMLDecl (here, the caller will have called
+ * parse_xml_decl already), whitespace, comments, and processing instructions.
+ * This function need only return true if it sees a valid sequence of such
+ * things leading to <!DOCTYPE. It can simply return false in any other
+ * cases, including malformed input; that will mean the input gets parsed as
+ * CONTENT as originally planned, with libxml reporting any errors.
+ *
+ * This is only to be called from xml_parse, when pg_xml_init has already
+ * been called. The input is already in UTF8 encoding.
+ */
+static bool
+xml_doctype_in_content(const xmlChar *str)
+{
+    const xmlChar *p = str;
+
+    for (;;)
+    {
+        const xmlChar *e;
+
+        SKIP_XML_SPACE(p);
+        if (*p != '<')
+            return false;
+        p++;
+
+        if (*p == '!')
+        {
+            p++;
+
+            /* if we see <!DOCTYPE, we can return true */
+            if (xmlStrncmp(p, (xmlChar *) "DOCTYPE", 7) == 0)
+                return true;
+
+            /* otherwise, if it's not a comment, fail */
+            if (xmlStrncmp(p, (xmlChar *) "--", 2) != 0)
+                return false;
+            /* find end of comment: find -- and a > must follow */
+            p = xmlStrstr(p + 2, (xmlChar *) "--");
+            if (!p || p[2] != '>')
+                return false;
+            /* advance over comment, and keep scanning */
+            p += 3;
+            continue;
+        }
+
+        /* otherwise, if it's not a PI <?target something?>, fail */
+        if (*p != '?')
+            return false;
+        p++;
+
+        /* find end of PI (the string ?> is forbidden within a PI) */
+        e = xmlStrstr(p, (xmlChar *) "?>");
+        if (!e)
+            return false;
+
+        /* we don't check PIs carefully, but do reject "xml" target */
+        if (e - p >= 3 && xmlStrncasecmp(p, (xmlChar *) "xml", 3) == 0)
+            return false;
+
+        /* advance over PI, keep scanning */
+        p = e + 2;
+    }
+}
+
 
 /*
  * Convert a C string to XML internal representation
@@ -1361,14 +1451,38 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
     /* Use a TRY block to ensure we clean up correctly */
     PG_TRY();
     {
+        bool        parse_as_document = false;
+        int         res_code;
+        size_t      count = 0;
+        xmlChar    *version = NULL;
+        int         standalone = 0;
+
         xmlInitParser();
 
         ctxt = xmlNewParserCtxt();
         if (ctxt == NULL || xmlerrcxt->err_occurred)
             xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
                         "could not allocate parser context");
 
+        /* Decide whether to parse as document or content */
         if (xmloption_arg == XMLOPTION_DOCUMENT)
+            parse_as_document = true;
+        else
+        {
+            /* Parse and skip over the XML declaration, if any */
+            res_code = parse_xml_decl(utf8string,
+                                      &count, &version, NULL, &standalone);
+            if (res_code != 0)
+                xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+                                    "invalid XML content: invalid XML declaration",
+                                    res_code);
+
+            /* Is there a DOCTYPE element? */
+            if (xml_doctype_in_content(utf8string + count))
+                parse_as_document = true;
+        }
+
+        if (parse_as_document)
         {
             /*
              * Note, that here we try to apply DTD defaults
@@ -1383,23 +1497,18 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
                                  XML_PARSE_NOENT | XML_PARSE_DTDATTR
                                  | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS));
             if (doc == NULL || xmlerrcxt->err_occurred)
-                xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
-                            "invalid XML document");
+            {
+                /* Use original option to decide which error code to throw */
+                if (xmloption_arg == XMLOPTION_DOCUMENT)
+                    xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
+                                "invalid XML document");
+                else
+                    xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
+                                "invalid XML content");
+            }
         }
         else
         {
-            int         res_code;
-            size_t      count;
-            xmlChar    *version;
-            int         standalone;
-
-            res_code = parse_xml_decl(utf8string,
-                                      &count, &version, NULL, &standalone);
-            if (res_code != 0)
-                xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
-                                    "invalid XML content: invalid XML declaration",
-                                    res_code);
-
             doc = xmlNewDoc(version);
             Assert(doc->encoding == NULL);
             doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");

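For readers who want to experiment with the prolog scan outside the backend, here is a rough libc-only approximation of the logic in xml_doctype_in_content. It is a sketch under stated assumptions, not the committed function: `doctype_in_prolog`, `is_xml_space` and `has_xml_target` are hypothetical names, plain `char` strings replace `xmlChar`, and `strncmp`/`strstr` replace the libxml helpers. As in the commit, the caller is assumed to have already skipped any XML declaration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* XML whitespace: space, tab, newline, carriage return. */
static bool
is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* True if a PI body of length len begins with "xml" in any letter case. */
static bool
has_xml_target(const char *p, size_t len)
{
    return len >= 3 &&
        tolower((unsigned char) p[0]) == 'x' &&
        tolower((unsigned char) p[1]) == 'm' &&
        tolower((unsigned char) p[2]) == 'l';
}

/*
 * Return true only if the input consists of whitespace, comments and
 * processing instructions followed by "<!DOCTYPE".  Anything else,
 * including malformed input, yields false.
 */
static bool
doctype_in_prolog(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    for (;;)
    {
        const char *e;

        while (is_xml_space(*p))
            p++;
        if (*p != '<')
            return false;
        p++;

        if (*p == '!')
        {
            p++;
            if (strncmp(p, "DOCTYPE", 7) == 0)
                return true;
            if (strncmp(p, "--", 2) != 0)
                return false;       /* not a comment */
            p = strstr(p + 2, "--");
            if (p == NULL || p[2] != '>')
                return false;       /* unterminated or malformed comment */
            p += 3;                 /* step over "-->" */
            continue;
        }

        if (*p != '?')
            return false;           /* an element or text: no DTD can follow */
        p++;
        e = strstr(p, "?>");
        if (e == NULL)
            return false;           /* unterminated PI */
        if (has_xml_target(p, (size_t) (e - p)))
            return false;           /* targets starting with "xml" are rejected */
        p = e + 2;                  /* step over "?>" */
    }
}
```

A false result does not mean the input is invalid; it only means the caller should proceed with CONTENT parsing and let the real parser report any errors, which is how the regression cases with `oops` before `<!DOCTYPE` end up as "invalid XML content".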
src/test/regress/expected/xml.out

Lines changed: 46 additions & 0 deletions
@@ -515,6 +515,13 @@ LINE 1: EXECUTE foo ('bad');
 DETAIL:  line 1: Start tag expected, '<' not found
 bad
 ^
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  invalid XML document
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  line 1: Extra content at the end of the document
+<!DOCTYPE a><a/><b/>
+                ^
 SET XML OPTION CONTENT;
 EXECUTE foo ('<bar/>');
   xmlconcat
@@ -528,6 +535,45 @@ EXECUTE foo ('good');
  <foo/>good
 (1 row)
 
+SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
+                                xml
+--------------------------------------------------------------------
+ <!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
+             xml
+------------------------------
+  <!-- hi--> <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<!DOCTYPE a><a/>';
+       xml
+------------------
+ <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  line 1: StartTag: invalid element name
+<!-- hi--> oops <!DOCTYPE a><a/>
+                 ^
+SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  line 1: StartTag: invalid element name
+<!-- hi--> <oops/> <!DOCTYPE a><a/>
+                    ^
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  line 1: Extra content at the end of the document
+<!DOCTYPE a><a/><b/>
+                ^
 -- Test backwards parsing
 CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
 CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');

src/test/regress/expected/xml_1.out

Lines changed: 42 additions & 0 deletions
@@ -417,11 +417,53 @@ EXECUTE foo ('<bar/>');
 ERROR:  prepared statement "foo" does not exist
 EXECUTE foo ('bad');
 ERROR:  prepared statement "foo" does not exist
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 SET XML OPTION CONTENT;
 EXECUTE foo ('<bar/>');
 ERROR:  prepared statement "foo" does not exist
 EXECUTE foo ('good');
 ERROR:  prepared statement "foo" does not exist
+SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?...
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/...
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xml '<!DOCTYPE a><a/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!DOCTYPE a><a/>';
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  unsupported XML feature
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 -- Test backwards parsing
 CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
 CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');

src/test/regress/expected/xml_2.out

Lines changed: 46 additions & 0 deletions
@@ -497,6 +497,13 @@ LINE 1: EXECUTE foo ('bad');
 DETAIL:  line 1: Start tag expected, '<' not found
 bad
 ^
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  invalid XML document
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  line 1: Extra content at the end of the document
+<!DOCTYPE a><a/><b/>
+                ^
 SET XML OPTION CONTENT;
 EXECUTE foo ('<bar/>');
   xmlconcat
@@ -510,6 +517,45 @@ EXECUTE foo ('good');
  <foo/>good
 (1 row)
 
+SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>';
+                                xml
+--------------------------------------------------------------------
+ <!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>';
+             xml
+------------------------------
+  <!-- hi--> <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<!DOCTYPE a><a/>';
+       xml
+------------------
+ <!DOCTYPE a><a/>
+(1 row)
+
+SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  line 1: StartTag: invalid element name
+<!-- hi--> oops <!DOCTYPE a><a/>
+                 ^
+SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>';
+                   ^
+DETAIL:  line 1: StartTag: invalid element name
+<!-- hi--> <oops/> <!DOCTYPE a><a/>
+                    ^
+SELECT xml '<!DOCTYPE a><a/><b/>';
+ERROR:  invalid XML content
+LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>';
+                   ^
+DETAIL:  line 1: Extra content at the end of the document
+<!DOCTYPE a><a/><b/>
+                ^
 -- Test backwards parsing
 CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
 CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');
