Skip to content

Commit 4730946

Browse files
committed
Rewrite GiST documentation into something actually useful.
Christopher Kings-Lynne
1 parent 9bd04c0 commit 4730946

File tree

1 file changed

+260
-110
lines changed

1 file changed

+260
-110
lines changed

doc/src/sgml/gist.sgml

Lines changed: 260 additions & 110 deletions
Original file line numberDiff line numberDiff line change
@@ -1,113 +1,263 @@
11
<!--
2-
$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.12 2003/09/29 18:18:35 momjian Exp $
2+
$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.13 2003/10/31 22:41:21 tgl Exp $
33
-->
44

5-
<Chapter Id="gist">
6-
<DocInfo>
7-
<AuthorGroup>
8-
<Author>
9-
<FirstName>Gene</FirstName>
10-
<Surname>Selkov</Surname>
11-
</Author>
12-
</AuthorGroup>
13-
<Date>Transcribed 1998-02-19</Date>
14-
</DocInfo>
15-
<Title>GiST Indexes</Title>
16-
17-
<Para>
18-
The information about GIST is at
19-
<ULink url="http://GiST.CS.Berkeley.EDU:8000/gist/">http://GiST.CS.Berkeley.EDU:8000/gist/</ULink>
20-
21-
with more on different indexing and sorting schemes at
22-
<ULink url="http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/">http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/</ULink>.
23-
24-
And there is more interesting reading at
25-
<ULink url="http://epoch.cs.berkeley.edu:8000/">http://epoch.cs.berkeley.edu:8000/</ULink> and
26-
<ULink url="http://www.sai.msu.su/~megera/postgres/gist/">http://www.sai.msu.su/~megera/postgres/gist/</ULink>.
27-
</para>
28-
29-
<Para>
30-
<Note>
31-
<Title>Author</Title>
32-
<Para>
33-
This extraction from an email sent by
34-
Eugene Selkov, Jr. (<email>selkovjr@mcs.anl.gov</email>)
35-
contains good information
36-
on GiST. Hopefully we will learn more in the future and update this information.
37-
- thomas 1998-03-01
38-
</Para>
39-
</Note>
40-
</para>
41-
<Para>
42-
Well, I can't say I quite understand what's going on, but at least
43-
I (almost) succeeded in porting GiST examples to linux. The GiST access
44-
method is already in the postgres tree (<FileName>src/backend/access/gist</FileName>).
45-
</para>
46-
<Para>
47-
<ULink url="ftp://s2k-ftp.cs.berkeley.edu/pub/gist/pggist/pggist.tgz">Examples at Berkeley</ULink>
48-
come with an overview of the methods and demonstrate spatial index
49-
mechanisms for 2D boxes, polygons, integer intervals and text
50-
(see also <ULink url="http://gist.cs.berkeley.edu:8000/gist/">GiST at Berkeley</ULink>).
51-
In the box example, we
52-
are supposed to see a performance gain when using the GiST index; it did
53-
work for me but I do not have a reasonably large collection of boxes
54-
to check that. Other examples also worked, except polygons: I got an
55-
error doing
56-
57-
<ProgramListing>
58-
test=> CREATE INDEX pix ON polytmp
59-
test-> USING GIST (p:box gist_poly_ops) WITH (ISLOSSY);
60-
ERROR: cannot open pix
61-
62-
(PostgreSQL 6.3 Sun Feb 1 14:57:30 EST 1998)
63-
</ProgramListing>
64-
</para>
65-
<Para>
66-
I could not get sense of this error message; it appears to be something
67-
we'd rather ask the developers about (see also Note 4 below). What I
68-
would suggest here is that someone of you linux guys (linux==gcc?) fetch the
69-
original sources quoted above and apply my patch (see attachment) and
70-
tell us what you feel about it. Looks cool to me, but I would not like
71-
to hold it up while there are so many competent people around.
72-
</para>
73-
<Para>
74-
A few notes on the sources:
75-
</para>
76-
<Para>
77-
1. I failed to make use of the original (HP-UX) Makefile and rearranged
78-
the Makefile from the ancient postgres95 tutorial to do the job. I tried
79-
to keep it generic, but I am a very poor makefile writer -- just did
80-
some monkey work. Sorry about that, but I guess it is now a little
81-
more portable that the original makefile.
82-
</para>
83-
<Para>
84-
2. I built the example sources right under pgsql/src (just extracted the
85-
tar file there). The aforementioned Makefile assumes it is one level
86-
below pgsql/src (in our case, in pgsql/src/pggist).
87-
</para>
88-
<Para>
89-
3. The changes I made to the *.c files were all about #include's,
90-
function prototypes and typecasting. Other than that, I just threw
91-
away a bunch of unused vars and added a couple parentheses to please
92-
gcc. I hope I did not screw up too much :)
93-
</para>
94-
<Para>
95-
4. There is a comment in polyproc.sql:
96-
97-
<ProgramListing>
98-
-- -- there's a memory leak in rtree poly_ops!!
99-
-- -- CREATE INDEX pix2 ON polytmp USING RTREE (p poly_ops);
100-
</ProgramListing>
101-
102-
Roger that!! I thought it could be related to a number of
103-
<ProductName>PostgreSQL</ProductName> versions
104-
back and tried the query. My system went nuts and I had to shoot down
105-
the postmaster in about ten minutes.
106-
</para>
107-
108-
<Para>
109-
I will continue to look into GiST for a while, but I would also
110-
appreciate
111-
more examples of R-tree usage.
112-
</para>
113-
</Chapter>
5+
<chapter Id="GiST">
6+
<title>GiST Indexes</title>
7+
8+
<sect1 id="intro">
9+
<title>Introduction</title>
10+
11+
<para>
12+
<acronym>GiST</acronym> stands for Generalized Search Tree. It is a
13+
balanced, tree-structured access method, that acts as a base template in
14+
which to implement arbitrary indexing schemes. B+-trees, R-trees and many
15+
other indexing schemes can be implemented in <acronym>GiST</acronym>.
16+
</para>
17+
18+
<para>
19+
One advantage of <acronym>GiST</acronym> is that it allows the development
20+
of custom data types with the appropriate access methods, by
21+
an expert in the domain of the data type, rather than a database expert.
22+
</para>
23+
24+
<para>
25+
Some of the information here is derived from <ulink
26+
url="http://gist.cs.berkeley.edu/">the University of California at
27+
Berkeley's GiST Indexing Project web site</ulink> and Marcel Kornacker's
28+
thesis,
29+
<ulink url="http://citeseer.nj.nec.com/448594.html">Access Methods for
30+
Next-Generation Database Systems</ulink>. The <acronym>GiST</acronym>
31+
implementation in <productname>PostgreSQL</productname> is primarily
32+
maintained by Teodor Sigaev and Oleg Bartunov, and there is more
33+
information on their website: <ulink
34+
url="http://www.sai.msu.su/~megera/postgres/gist/"></>.
35+
</para>
36+
37+
</sect1>
38+
39+
<sect1 id="extensibility">
40+
<title>Extensibility</title>
41+
42+
<para>
43+
Traditionally, implementing a new index access method meant a lot of
44+
difficult work. It was necessary to understand the inner workings of the
45+
database, such as the lock manager and Write-Ahead Log. The
46+
<acronym>GiST</acronym> interface has a high level of abstraction,
47+
requiring the access method implementor to only implement the semantics of
48+
the data type being accessed. The <acronym>GiST</acronym> layer itself
49+
takes care of concurrency, logging and searching the tree structure.
50+
</para>
51+
52+
<para>
53+
This extensibility should not be confused with the extensibility of the
54+
other standard search trees in terms of the data they can handle. For
55+
example, <productname>PostgreSQL</productname> supports extensible B+-trees
56+
and R-trees. That means that you can use
57+
<productname>PostgreSQL</productname> to build a B+-tree or R-tree over any
58+
data type you want. But B+-trees only support range predicates
59+
(<literal>&lt;</literal>, <literal>=</literal>, <literal>&gt;</literal>),
60+
and R-trees only support n-D range queries (contains, contained, equals).
61+
</para>
62+
63+
<para>
64+
So if you index, say, an image collection with a
65+
<productname>PostgreSQL</productname> B+-tree, you can only issue queries
66+
such as <quote>is imagex equal to imagey</quote>, <quote>is imagex less
67+
than imagey</quote> and <quote>is imagex greater than imagey</quote>?
68+
Depending on how you define <quote>equals</quote>, <quote>less than</quote>
69+
and <quote>greater than</quote> in this context, this could be useful.
70+
However, by using a <acronym>GiST</acronym> based index, you could create
71+
ways to ask domain-specific questions, perhaps <quote>find all images of
72+
horses</quote> or <quote>find all over-exposed images</quote>.
73+
</para>
74+
75+
<para>
76+
All it takes to get a <acronym>GiST</acronym> access method up and running
77+
is to implement seven user-defined methods, which define the behavior of
78+
keys in the tree. Of course these methods have to be pretty fancy to
79+
support fancy queries, but for all the standard queries (B+-trees,
80+
R-trees, etc.) they're relatively straightforward. In short,
81+
<acronym>GiST</acronym> combines extensibility along with generality, code
82+
reuse, and a clean interface.
83+
</para>
84+
85+
</sect1>
86+
87+
<sect1 id="implementation">
88+
<title>Implementation</title>
89+
90+
<para>
91+
There are seven methods that an index operator class for
92+
<acronym>GiST</acronym> must provide:
93+
</para>
94+
95+
<variablelist>
96+
<varlistentry>
97+
<term>consistent</term>
98+
<listitem>
99+
<para>
100+
Given a predicate <literal>p</literal> on a tree page, and a user
101+
query, <literal>q</literal>, this method will return false if it is
102+
certain that both <literal>p</literal> and <literal>q</literal> cannot
103+
be true for a given data item.
104+
</para>
105+
</listitem>
106+
</varlistentry>
107+
108+
<varlistentry>
109+
<term>union</term>
110+
<listitem>
111+
<para>
112+
This method consolidates information in the tree. Given a set of
113+
entries, this function generates a new predicate that is true for all
114+
the entries.
115+
</para>
116+
</listitem>
117+
</varlistentry>
118+
119+
<varlistentry>
120+
<term>compress</term>
121+
<listitem>
122+
<para>
123+
Converts the data item into a format suitable for physical storage in
124+
an index page.
125+
</para>
126+
</listitem>
127+
</varlistentry>
128+
129+
<varlistentry>
130+
<term>decompress</term>
131+
<listitem>
132+
<para>
133+
The reverse of the <function>compress</function> method. Converts the
134+
index representation of the data item into a format that can be
135+
manipulated by the database.
136+
</para>
137+
</listitem>
138+
</varlistentry>
139+
140+
<varlistentry>
141+
<term>penalty</term>
142+
<listitem>
143+
<para>
144+
Returns a value indicating the <quote>cost</quote> of inserting the new
145+
entry into a particular branch of the tree. items will be inserted
146+
down the path of least <function>penalty</function> in the tree.
147+
</para>
148+
</listitem>
149+
</varlistentry>
150+
151+
<varlistentry>
152+
<term>picksplit</term>
153+
<listitem>
154+
<para>
155+
When a page split is necessary, this function decides which entries on
156+
the page are to stay on the old page, and which are to move to the new
157+
page.
158+
</para>
159+
</listitem>
160+
</varlistentry>
161+
162+
<varlistentry>
163+
<term>same</term>
164+
<listitem>
165+
<para>
166+
Returns true if two entries are identical, false otherwise.
167+
</para>
168+
</listitem>
169+
</varlistentry>
170+
171+
</variablelist>
172+
173+
</sect1>
174+
175+
<sect1 id="limitations">
176+
<title>Limitations</title>
177+
178+
<para>
179+
The current implementation of <acronym>GiST</acronym> within
180+
<productname>PostgreSQL</productname> has some major limitations:
181+
<acronym>GiST</acronym> access is not concurrent; the
182+
<acronym>GiST</acronym> interface doesn't allow the development of certain
183+
data types, such as digital trees (see papers by Aoki et al); and there
184+
is not yet any support for write-ahead logging of updates in
185+
<acronym>GiST</acronym> indexes.
186+
</para>
187+
188+
<para>
189+
Solutions to the concurrency problems appear in Marcel Kornacker's
190+
thesis; however these ideas have not yet been put into practice in the
191+
<productname>PostgreSQL</productname> implementation.
192+
</para>
193+
194+
<para>
195+
The lack of write-ahead logging is just a small matter of programming,
196+
but since it isn't done yet, a crash could render a <acronym>GiST</acronym>
197+
index inconsistent, forcing a REINDEX.
198+
</para>
199+
200+
</sect1>
201+
202+
<sect1 id="examples">
203+
<title>Examples</title>
204+
205+
<para>
206+
To see example implementations of index methods implemented using
207+
<acronym>GiST</acronym>, examine the following contrib modules:
208+
</para>
209+
210+
<variablelist>
211+
<varlistentry>
212+
<term>btree_gist</term>
213+
<listitem>
214+
<para>B-Tree</para>
215+
</listitem>
216+
</varlistentry>
217+
218+
<varlistentry>
219+
<term>cube</term>
220+
<listitem>
221+
<para>Indexing for multi-dimensional cubes</para>
222+
</listitem>
223+
</varlistentry>
224+
225+
<varlistentry>
226+
<term>intarray</term>
227+
<listitem>
228+
<para>RD-Tree for one-dimensional array of int4 values</para>
229+
</listitem>
230+
</varlistentry>
231+
232+
<varlistentry>
233+
<term>ltree</term>
234+
<listitem>
235+
<para>Indexing for tree-like stuctures</para>
236+
</listitem>
237+
</varlistentry>
238+
239+
<varlistentry>
240+
<term>rtree_gist</term>
241+
<listitem>
242+
<para>R-Tree</para>
243+
</listitem>
244+
</varlistentry>
245+
246+
<varlistentry>
247+
<term>seg</term>
248+
<listitem>
249+
<para>Storage and indexed access for <quote>float ranges</quote></para>
250+
</listitem>
251+
</varlistentry>
252+
253+
<varlistentry>
254+
<term>tsearch and tsearch2</term>
255+
<listitem>
256+
<para>Full text indexing</para>
257+
</listitem>
258+
</varlistentry>
259+
</variablelist>
260+
261+
</sect1>
262+
263+
</chapter>

0 commit comments

Comments
 (0)