Commit 5950a98

1. I've now produced an updated version (and called it 0.2) of my XML
parser interface code. It now uses libxml2 instead of expat (though I've left the old code in the tarball). This means *proper* XPath support, and the provided function allows you to wrap your result set in XML tags to produce a new XML document. John Gray
1 parent 44ae35c commit 5950a98

4 files changed

+114
-77
lines changed

contrib/xml/Makefile

Lines changed: 8 additions & 6 deletions
@@ -8,24 +8,22 @@ subdir = contrib/xml
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-override CFLAGS+= $(CFLAGS_SL)
+override CFLAGS+= $(CFLAGS_SL) -g
 
 
 #
 # DLOBJS is the dynamically-loaded object files. The "funcs" queries
 # include CREATE FUNCTIONs that load routines from these files.
 #
-DLOBJS= pgxml$(DLSUFFIX)
+DLOBJS= pgxml_dom$(DLSUFFIX)
 
 
-QUERIES= pgxml.sql
+QUERIES= pgxml_dom.sql
 
 all: $(DLOBJS) $(QUERIES)
 
-# Requires the expat library
-
 %.so: %.o
-	$(CC) -shared -lexpat -o $@ $<
+	$(CC) -shared -lxml2 -o $@ $<
 
 
 %.sql: %.source
@@ -41,3 +39,7 @@ all: $(DLOBJS) $(QUERIES)
 
 clean:
 	rm -f $(DLOBJS) $(QUERIES)
+
+
+
+

contrib/xml/README

Lines changed: 65 additions & 25 deletions
@@ -1,18 +1,35 @@
-This package contains a couple of simple routines for hooking the
-expat XML parser up to PostgreSQL. This is a work-in-progress and all
-very basic at the moment (see the file TODO for some outline of what
-remains to be done).
+This package contains some simple routines for manipulating XML
+documents stored in PostgreSQL. This is a work-in-progress and
+somewhat basic at the moment (see the file TODO for some outline of
+what remains to be done).
 
-At present, two functions are defined, one which checks
-well-formedness, and the other which performs very simple XPath-type
-queries.
+At present, two modules (based on different XML handling libraries)
+are provided.
 
 Prerequisite:
 
+pgxml.c:
 expat parser 1.95.0 or newer (http://expat.sourceforge.net)
 
-I used a shared library version -I'm sure you could use a static
-library if you wished though. I had no problems compiling from source.
+or
+
+pgxml_dom.c:
+libxml2 (http://xmlsoft.org)
+
+The libxml2 version provides more complete XPath functionality, and
+seems like a good way to go. I've left the old versions in there for
+comparison.
+
+Compiling and loading:
+----------------------
+
+The Makefile only builds the libxml2 version.
+
+To compile, just type make.
+
+Then you can use psql to load the two function definitions:
+\i pgxml_dom.sql
+
 
 Function documentation and usage:
 ---------------------------------
@@ -22,10 +39,21 @@ pgxml_parse(text) returns bool
 well-formed or not. It returns NULL if the parser couldn't be
 created for any reason.
 
+pgxml_xpath (XQuery functions) - differs between the versions:
+
+pgxml.c (expat version) has:
+
 pgxml_xpath(text doc, text xpath, int n) returns text
 parses doc and returns the cdata of the nth occurence of
-the "XPath" listed. See below for details on the syntax.
+the "simple path" entry.
 
+However, the remainder of this document will cover the pgxml_dom.c version.
+
+pgxml_xpath(text doc, text xpath, text toptag, text septag) returns text
+evaluates xpath on doc, and returns the result wrapped in
+<toptag>...</toptag> and each result node wrapped in
+<septag></septag>. toptag and septag may be empty strings, in which
+case the respective tag will be omitted.
 
 Example:
 
@@ -49,30 +77,42 @@ descriptions, in case anyone is wondering):
 one can type:
 
 select docid,
-pgxml_xpath(document,'/site/name',1) as sitename,
-pgxml_xpath(document,'/site/location',1) as location
+pgxml_xpath(document,'//site/name/text()','','') as sitename,
+pgxml_xpath(document,'//site/location/text()','','') as location
 from docstore;
 
 and get as output:
 
- docid |          sitename           |  location
--------+-----------------------------+------------
-     1 | Church Farm, Ashton Keynes  | SU04209424
-     2 | Glebe Farm, Long Itchington | SP41506500
-(2 rows)
+ docid |               sitename               |  location
+-------+--------------------------------------+------------
+     1 | Church Farm, Ashton Keynes           | SU04209424
+     2 | Glebe Farm, Long Itchington          | SP41506500
+     3 | The Bungalow, Thames Lane, Cricklade | SU10229362
+(3 rows)
+
+or, to illustrate the use of the extra tags:
 
+select docid as id,
+pgxml_xpath(document,'//find/type/text()','set','findtype')
+from docstore;
 
-"XPath" syntax supported
-------------------------
+ id |                               pgxml_xpath
+----+-------------------------------------------------------------------------
+  1 | <set></set>
+  2 | <set><findtype>Urn</findtype></set>
+  3 | <set><findtype>Pottery</findtype><findtype>Animal bone</findtype></set>
+(3 rows)
 
-At present it only supports paths of the form:
-'tag1/tag2' or '/tag1/tag2'
+Which produces a new, well-formed document. Note that document 1 had
+no matching instances, so the set returned contains no
+elements. document 2 has 1 matching element and document 3 has 2.
 
-The first case will find any <tag2> within a <tag1>, the second will
-find any <tag2> within a <tag1> at the top level of the document.
+This is just scratching the surface because XPath allows all sorts of
+operations.
 
-The real XPath is much more complex (see TODO file).
+Note: I've only implemented the return of nodeset and string values so
+far. This covers (I think) many types of queries, however.
 
+John Gray <jgray@azuli.co.uk> 16 August 2001
 
-John Gray <jgray@azuli.co.uk> 26 July 2001
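
Reviewer note on the README change: the toptag/septag behaviour described for the libxml2 pgxml_xpath can be modelled in a few lines. The sketch below is a plain-Python illustration of the wrapping rule only, not the C code from pgxml_dom.c. The name wrap_results is hypothetical, the input is assumed to be the already-extracted text of each result node, and joining multiple results with no separator when septag is empty is an assumption (the README does not say). XML escaping is not handled.

```python
def wrap_results(values, toptag="", septag=""):
    """Model the README's wrapping rule for a list of result-node strings.

    Hypothetical helper: each value is wrapped in <septag>...</septag> and the
    whole set in <toptag>...</toptag>; an empty tag name means "omit this tag".
    """
    parts = []
    for value in values:
        if septag:
            value = f"<{septag}>{value}</{septag}>"
        parts.append(value)
    body = "".join(parts)  # assumption: results are simply concatenated
    if toptag:
        body = f"<{toptag}>{body}</{toptag}>"
    return body


# The three rows of the README's second example, reproduced from its output table.
print(wrap_results([], "set", "findtype"))
print(wrap_results(["Urn"], "set", "findtype"))
print(wrap_results(["Pottery", "Animal bone"], "set", "findtype"))
# With both tags empty, a single text result comes back bare.
print(wrap_results(["Church Farm, Ashton Keynes"]))
```

Keeping the wrapper element even when nothing matches (the `<set></set>` row) is what lets every row remain a well-formed document, which is the property the README points out.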

contrib/xml/TODO

Lines changed: 40 additions & 45 deletions
@@ -1,67 +1,57 @@
 PGXML TODO List
 ===============
 
-Some of these items still require much more thought! The data model
-for XML documents and the parsing model of expat don't really fit so
-well with a standard SQL model.
+Some of these items still require much more thought! Since the first
+release, the XPath support has improved (because I'm no longer using a
+homemade algorithm!).
 
-1. Generalised XML parsing support
+1. Performance considerations
 
-Allow a user to specify handlers (in any PL) to be used by the parser.
-This must permit distinct sets of parser settings -user may want some
-documents in a database to parsed with one set of handlers, others
-with a different set.
+At present each document is parsed to produce the DOM tree on every query.
 
-i.e. the pgxml_parse function would take as parameters (document,
-parsername) where parsername was the identifier for a collection of
-handler etc. settings.
+Pros:
+Easy
+No persistent memory or storage allocation for parsed trees
+(libxml docs suggest representation of a document might
+be 4 times the size of the text)
 
-"Stub" handlers in the pgxml code would invoke the functions through
-the standard fmgr interface. The parser interface would define the
-prototype for these functions. How does the handler function know
-which document/context has resulted it in being called?
+Cons:
+Slow/ CPU intensive to parse.
+Makes it difficult for PLs to apply libxml manipulations to create
+new documents or amend existing ones.
 
-Mechanism for defining collection of parser settings (in a table? -but
-maybe copied for efficiency into a structure when first required by a
-query?)
 
-2. Support for other parsers
+2. XQuery
 
-Expat may not be the best choice as a parser because a new parser
-instance is needed for each document i.e. all the handlers must be set
-again for each document. Another parser may have a more efficient way
-of parsing a set of documents identically.
+I'm not sure if the addition of XQuery would be best as a function or
+as a new front-end parser. This is one to think about, but with a
+decent implementation of XPath, one of the prerequisites is covered.
 
-3. XPath support
+3. DOM Interfaces
 
-Proper XPath support. I really need to sit down and plough
-through the specification...
+Expose more aspects of the DOM to user functions/ PLs. This would
+allow a procedure in a PL to run some queries and then use exposed
+interfaces to libxml to create an XML document out of the query
+results. I accept the argument that this might be more properly
+performed on the client side.
 
-The very simple text comparison system currently used is too
-basic. Need to convert the path to an ordered list of nodes. Each node
-is an element qualifier, and may have a list of attribute
-qualifications attached. This probably requires lexx/yacc combination.
-(James Clark has written a yacc grammar for XPath). Not all the
-features of XPath are necessarily relevant.
+4. Returning sets of documents from XPath queries.
 
-An option to return subdocuments (i.e. subelements AND cdata, not just
-cdata). This should maybe be the default.
-
-4. Multiple occurences of elements.
-
-This section is all very sketchy, and has various weaknesses.
+Although the current implementation allows you to amalgamate the
+returned results into a single document, it's quite possible that
+you'd like to use the returned set of nodes as a source for FROM.
 
 Is there a good way to optimise/index the results of certain XPath
 operations to make them faster?:
 
-select docid, pgxml_xpath(document,'/site/location',1) as location
-where pgxml_xpath(document,'/site/name',1) = 'Church Farm';
+select docid, pgxml_xpath(document,'//site/location/text()','','') as location
+where pgxml_xpath(document,'//site/name/text()','','') = 'Church Farm';
 
 and with multiple element occurences in a document?
 
-select d.docid, pgxml_xpath(d.document,'/site/location',1)
+select d.docid, pgxml_xpath(d.document,'//site/location/text()','','')
 from docstore d,
-pgxml_xpaths('docstore','document','feature/type','docid') ft
+pgxml_xpaths('docstore','document','//feature/type/text()','docid') ft
 where ft.key = d.docid and ft.value ='Limekiln';
 
 pgxml_xpaths params are relname, attrname, xpath, returnkey. It would
@@ -71,10 +61,15 @@ defined by relname and attrname.
 
 The pgxml_xpaths function could be the basis of a functional index,
 which could speed up the above query very substantially, working
-through the normal query planner mechanism. Syntax above is fragile
-through using names rather than OID.
+through the normal query planner mechanism.
+
+5. Return type support.
+
+Better support for returning e.g. numeric or boolean values. I need to
+get to grips with the returned data from libxml first.
+
 
-John Gray <jgray@azuli.co.uk>
+John Gray <jgray@azuli.co.uk> 16 August 2001
 
 
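
Reviewer note on TODO item 4: pgxml_xpaths is only proposed in the TODO, so the following is a rough Python model of the idea (one key/value row per matching node, usable like a table in FROM), not anything in this commit. The function name xpaths_rows and the sample documents are made up for illustration. Python's ElementTree supports only a subset of XPath and has no text() step, so './/feature/type' stands in for the TODO's '//feature/type/text()'.

```python
import xml.etree.ElementTree as ET


def xpaths_rows(rows, path):
    """Yield one (key, value) pair per element matching path in each document.

    rows is an iterable of (key, xml_text) pairs, standing in for the
    (returnkey, attrname) columns of the relation named in the TODO.
    """
    for key, document in rows:
        root = ET.fromstring(document)
        for node in root.iterfind(path):
            yield key, (node.text or "")


# Made-up sample data; the real docstore contents are not shown in the diff.
docstore = [
    (1, "<site><feature><type>Limekiln</type></feature>"
        "<feature><type>Ditch</type></feature></site>"),
    (2, "<site><name>No features here</name></site>"),
]

ft = list(xpaths_rows(docstore, ".//feature/type"))
print(ft)

# Equivalent of: where ft.key = d.docid and ft.value = 'Limekiln'
matching_docids = [key for key, value in ft if value == "Limekiln"]
print(matching_docids)
```

Producing one row per node rather than one concatenated string per document is what would make the result joinable and, as the TODO suggests, a candidate for indexing.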

contrib/xml/pgxml.source

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@
 CREATE FUNCTION pgxml_parse(text) RETURNS bool
 AS '_OBJWD_/pgxml_DLSUFFIX_' LANGUAGE 'c' WITH (isStrict);
 
-CREATE FUNCTION pgxml_xpath(text,text,int) RETURNS text
+CREATE FUNCTION pgxml_xpath(text,text,text,text) RETURNS text
 AS '_OBJWD_/pgxml_DLSUFFIX_' LANGUAGE 'c' WITH (isStrict);
