|
1 | 1 | <!--
|
2 |
| -$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.12 2003/09/29 18:18:35 momjian Exp $ |
| 2 | +$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.13 2003/10/31 22:41:21 tgl Exp $ |
3 | 3 | -->
|
4 | 4 |
|
5 |
| -<Chapter Id="gist"> |
6 |
| -<DocInfo> |
7 |
| -<AuthorGroup> |
8 |
| -<Author> |
9 |
| -<FirstName>Gene</FirstName> |
10 |
| -<Surname>Selkov</Surname> |
11 |
| -</Author> |
12 |
| -</AuthorGroup> |
13 |
| -<Date>Transcribed 1998-02-19</Date> |
14 |
| -</DocInfo> |
15 |
| -<Title>GiST Indexes</Title> |
16 |
| - |
17 |
| -<Para> |
18 |
| -The information about GIST is at |
19 |
| - <ULink url="http://GiST.CS.Berkeley.EDU:8000/gist/">http://GiST.CS.Berkeley.EDU:8000/gist/</ULink> |
20 |
| - |
21 |
| -with more on different indexing and sorting schemes at |
22 |
| -<ULink url="http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/">http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/</ULink>. |
23 |
| - |
24 |
| -And there is more interesting reading at |
25 |
| -<ULink url="http://epoch.cs.berkeley.edu:8000/">http://epoch.cs.berkeley.edu:8000/</ULink> and |
26 |
| -<ULink url="http://www.sai.msu.su/~megera/postgres/gist/">http://www.sai.msu.su/~megera/postgres/gist/</ULink>. |
27 |
| -</para> |
28 |
| - |
29 |
| -<Para> |
30 |
| -<Note> |
31 |
| -<Title>Author</Title> |
32 |
| -<Para> |
33 |
| -This extraction from an email sent by |
34 |
| -Eugene Selkov, Jr. (<email>selkovjr@mcs.anl.gov</email>) |
35 |
| -contains good information |
36 |
| -on GiST. Hopefully we will learn more in the future and update this information. |
37 |
| -- thomas 1998-03-01 |
38 |
| -</Para> |
39 |
| -</Note> |
40 |
| -</para> |
41 |
| -<Para> |
42 |
| -Well, I can't say I quite understand what's going on, but at least |
43 |
| -I (almost) succeeded in porting GiST examples to linux. The GiST access |
44 |
| -method is already in the postgres tree (<FileName>src/backend/access/gist</FileName>). |
45 |
| -</para> |
46 |
| -<Para> |
47 |
| -<ULink url="ftp://s2k-ftp.cs.berkeley.edu/pub/gist/pggist/pggist.tgz">Examples at Berkeley</ULink> |
48 |
| -come with an overview of the methods and demonstrate spatial index |
49 |
| -mechanisms for 2D boxes, polygons, integer intervals and text |
50 |
| -(see also <ULink url="http://gist.cs.berkeley.edu:8000/gist/">GiST at Berkeley</ULink>). |
51 |
| -In the box example, we |
52 |
| -are supposed to see a performance gain when using the GiST index; it did |
53 |
| -work for me but I do not have a reasonably large collection of boxes |
54 |
| -to check that. Other examples also worked, except polygons: I got an |
55 |
| -error doing |
56 |
| - |
57 |
| -<ProgramListing> |
58 |
| -test=> CREATE INDEX pix ON polytmp |
59 |
| -test-> USING GIST (p:box gist_poly_ops) WITH (ISLOSSY); |
60 |
| -ERROR: cannot open pix |
61 |
| - |
62 |
| -(PostgreSQL 6.3 Sun Feb 1 14:57:30 EST 1998) |
63 |
| -</ProgramListing> |
64 |
| -</para> |
65 |
| -<Para> |
66 |
| -I could not get sense of this error message; it appears to be something |
67 |
| -we'd rather ask the developers about (see also Note 4 below). What I |
68 |
| -would suggest here is that someone of you linux guys (linux==gcc?) fetch the |
69 |
| -original sources quoted above and apply my patch (see attachment) and |
70 |
| -tell us what you feel about it. Looks cool to me, but I would not like |
71 |
| -to hold it up while there are so many competent people around. |
72 |
| -</para> |
73 |
| -<Para> |
74 |
| -A few notes on the sources: |
75 |
| -</para> |
76 |
| -<Para> |
77 |
| -1. I failed to make use of the original (HP-UX) Makefile and rearranged |
78 |
| - the Makefile from the ancient postgres95 tutorial to do the job. I tried |
79 |
| - to keep it generic, but I am a very poor makefile writer -- just did |
80 |
| - some monkey work. Sorry about that, but I guess it is now a little |
81 |
| - more portable that the original makefile. |
82 |
| -</para> |
83 |
| -<Para> |
84 |
| -2. I built the example sources right under pgsql/src (just extracted the |
85 |
| - tar file there). The aforementioned Makefile assumes it is one level |
86 |
| - below pgsql/src (in our case, in pgsql/src/pggist). |
87 |
| -</para> |
88 |
| -<Para> |
89 |
| -3. The changes I made to the *.c files were all about #include's, |
90 |
| - function prototypes and typecasting. Other than that, I just threw |
91 |
| - away a bunch of unused vars and added a couple parentheses to please |
92 |
| - gcc. I hope I did not screw up too much :) |
93 |
| -</para> |
94 |
| -<Para> |
95 |
| -4. There is a comment in polyproc.sql: |
96 |
| - |
97 |
| -<ProgramListing> |
98 |
| --- -- there's a memory leak in rtree poly_ops!! |
99 |
| --- -- CREATE INDEX pix2 ON polytmp USING RTREE (p poly_ops); |
100 |
| -</ProgramListing> |
101 |
| - |
102 |
| - Roger that!! I thought it could be related to a number of |
103 |
| - <ProductName>PostgreSQL</ProductName> versions |
104 |
| - back and tried the query. My system went nuts and I had to shoot down |
105 |
| - the postmaster in about ten minutes. |
106 |
| -</para> |
107 |
| - |
108 |
| -<Para> |
109 |
| -I will continue to look into GiST for a while, but I would also |
110 |
| -appreciate |
111 |
| -more examples of R-tree usage. |
112 |
| -</para> |
113 |
| -</Chapter> |
| 5 | +<chapter Id="GiST"> |
| 6 | +<title>GiST Indexes</title> |
| 7 | + |
| 8 | +<sect1 id="intro"> |
| 9 | + <title>Introduction</title> |
| 10 | + |
| 11 | + <para> |
| 12 | + <acronym>GiST</acronym> stands for Generalized Search Tree. It is a |
| 13 | + balanced, tree-structured access method, that acts as a base template in |
| 14 | + which to implement arbitrary indexing schemes. B+-trees, R-trees and many |
| 15 | + other indexing schemes can be implemented in <acronym>GiST</acronym>. |
| 16 | + </para> |
| 17 | + |
| 18 | + <para> |
| 19 | + One advantage of <acronym>GiST</acronym> is that it allows the development |
| 20 | + of custom data types with the appropriate access methods, by |
| 21 | + an expert in the domain of the data type, rather than a database expert. |
| 22 | + </para> |
| 23 | + |
| 24 | + <para> |
| 25 | + Some of the information here is derived from <ulink |
| 26 | + url="http://gist.cs.berkeley.edu/">the University of California at |
| 27 | + Berkeley's GiST Indexing Project web site</ulink> and Marcel Kornacker's |
| 28 | + thesis, |
| 29 | + <ulink url="http://citeseer.nj.nec.com/448594.html">Access Methods for |
| 30 | + Next-Generation Database Systems</ulink>. The <acronym>GiST</acronym> |
| 31 | + implementation in <productname>PostgreSQL</productname> is primarily |
| 32 | + maintained by Teodor Sigaev and Oleg Bartunov, and there is more |
| 33 | + information on their website: <ulink |
| 34 | + url="http://www.sai.msu.su/~megera/postgres/gist/"></>. |
| 35 | + </para> |
| 36 | + |
| 37 | +</sect1> |
| 38 | + |
| 39 | +<sect1 id="extensibility"> |
| 40 | + <title>Extensibility</title> |
| 41 | + |
| 42 | + <para> |
| 43 | + Traditionally, implementing a new index access method meant a lot of |
| 44 | + difficult work. It was necessary to understand the inner workings of the |
| 45 | + database, such as the lock manager and Write-Ahead Log. The |
| 46 | + <acronym>GiST</acronym> interface has a high level of abstraction, |
| 47 | + requiring the access method implementor to only implement the semantics of |
| 48 | + the data type being accessed. The <acronym>GiST</acronym> layer itself |
| 49 | + takes care of concurrency, logging and searching the tree structure. |
| 50 | + </para> |
| 51 | + |
| 52 | + <para> |
| 53 | + This extensibility should not be confused with the extensibility of the |
| 54 | + other standard search trees in terms of the data they can handle. For |
| 55 | + example, <productname>PostgreSQL</productname> supports extensible B+-trees |
| 56 | + and R-trees. That means that you can use |
| 57 | + <productname>PostgreSQL</productname> to build a B+-tree or R-tree over any |
| 58 | + data type you want. But B+-trees only support range predicates |
| 59 | + (<literal><</literal>, <literal>=</literal>, <literal>></literal>), |
| 60 | + and R-trees only support n-D range queries (contains, contained, equals). |
| 61 | + </para> |
| 62 | + |
| 63 | + <para> |
| 64 | + So if you index, say, an image collection with a |
| 65 | + <productname>PostgreSQL</productname> B+-tree, you can only issue queries |
| 66 | + such as <quote>is imagex equal to imagey</quote>, <quote>is imagex less |
| 67 | + than imagey</quote> and <quote>is imagex greater than imagey</quote>? |
| 68 | + Depending on how you define <quote>equals</quote>, <quote>less than</quote> |
| 69 | + and <quote>greater than</quote> in this context, this could be useful. |
| 70 | + However, by using a <acronym>GiST</acronym> based index, you could create |
| 71 | + ways to ask domain-specific questions, perhaps <quote>find all images of |
| 72 | + horses</quote> or <quote>find all over-exposed images</quote>. |
| 73 | + </para> |
| 74 | + |
| 75 | + <para> |
| 76 | + All it takes to get a <acronym>GiST</acronym> access method up and running |
| 77 | + is to implement seven user-defined methods, which define the behavior of |
| 78 | + keys in the tree. Of course these methods have to be pretty fancy to |
| 79 | + support fancy queries, but for all the standard queries (B+-trees, |
| 80 | + R-trees, etc.) they're relatively straightforward. In short, |
| 81 | + <acronym>GiST</acronym> combines extensibility along with generality, code |
| 82 | + reuse, and a clean interface. |
| 83 | + </para> |
| 84 | + |
| 85 | +</sect1> |
| 86 | + |
| 87 | +<sect1 id="implementation"> |
| 88 | + <title>Implementation</title> |
| 89 | + |
| 90 | + <para> |
| 91 | + There are seven methods that an index operator class for |
| 92 | + <acronym>GiST</acronym> must provide: |
| 93 | + </para> |
| 94 | + |
| 95 | + <variablelist> |
| 96 | + <varlistentry> |
| 97 | + <term>consistent</term> |
| 98 | + <listitem> |
| 99 | + <para> |
| 100 | + Given a predicate <literal>p</literal> on a tree page, and a user |
| 101 | + query, <literal>q</literal>, this method will return false if it is |
| 102 | + certain that both <literal>p</literal> and <literal>q</literal> cannot |
| 103 | + be true for a given data item. |
| 104 | + </para> |
| 105 | + </listitem> |
| 106 | + </varlistentry> |
| 107 | + |
| 108 | + <varlistentry> |
| 109 | + <term>union</term> |
| 110 | + <listitem> |
| 111 | + <para> |
| 112 | + This method consolidates information in the tree. Given a set of |
| 113 | + entries, this function generates a new predicate that is true for all |
| 114 | + the entries. |
| 115 | + </para> |
| 116 | + </listitem> |
| 117 | + </varlistentry> |
| 118 | + |
| 119 | + <varlistentry> |
| 120 | + <term>compress</term> |
| 121 | + <listitem> |
| 122 | + <para> |
| 123 | + Converts the data item into a format suitable for physical storage in |
| 124 | + an index page. |
| 125 | + </para> |
| 126 | + </listitem> |
| 127 | + </varlistentry> |
| 128 | + |
| 129 | + <varlistentry> |
| 130 | + <term>decompress</term> |
| 131 | + <listitem> |
| 132 | + <para> |
| 133 | + The reverse of the <function>compress</function> method. Converts the |
| 134 | + index representation of the data item into a format that can be |
| 135 | + manipulated by the database. |
| 136 | + </para> |
| 137 | + </listitem> |
| 138 | + </varlistentry> |
| 139 | + |
| 140 | + <varlistentry> |
| 141 | + <term>penalty</term> |
| 142 | + <listitem> |
| 143 | + <para> |
| 144 | + Returns a value indicating the <quote>cost</quote> of inserting the new |
| 145 | + entry into a particular branch of the tree. items will be inserted |
| 146 | + down the path of least <function>penalty</function> in the tree. |
| 147 | + </para> |
| 148 | + </listitem> |
| 149 | + </varlistentry> |
| 150 | + |
| 151 | + <varlistentry> |
| 152 | + <term>picksplit</term> |
| 153 | + <listitem> |
| 154 | + <para> |
| 155 | + When a page split is necessary, this function decides which entries on |
| 156 | + the page are to stay on the old page, and which are to move to the new |
| 157 | + page. |
| 158 | + </para> |
| 159 | + </listitem> |
| 160 | + </varlistentry> |
| 161 | + |
| 162 | + <varlistentry> |
| 163 | + <term>same</term> |
| 164 | + <listitem> |
| 165 | + <para> |
| 166 | + Returns true if two entries are identical, false otherwise. |
| 167 | + </para> |
| 168 | + </listitem> |
| 169 | + </varlistentry> |
| 170 | + |
| 171 | + </variablelist> |
| 172 | + |
| 173 | +</sect1> |
| 174 | + |
| 175 | +<sect1 id="limitations"> |
| 176 | + <title>Limitations</title> |
| 177 | + |
| 178 | + <para> |
| 179 | + The current implementation of <acronym>GiST</acronym> within |
| 180 | + <productname>PostgreSQL</productname> has some major limitations: |
| 181 | + <acronym>GiST</acronym> access is not concurrent; the |
| 182 | + <acronym>GiST</acronym> interface doesn't allow the development of certain |
| 183 | + data types, such as digital trees (see papers by Aoki et al); and there |
| 184 | + is not yet any support for write-ahead logging of updates in |
| 185 | + <acronym>GiST</acronym> indexes. |
| 186 | + </para> |
| 187 | + |
| 188 | + <para> |
| 189 | + Solutions to the concurrency problems appear in Marcel Kornacker's |
| 190 | + thesis; however these ideas have not yet been put into practice in the |
| 191 | + <productname>PostgreSQL</productname> implementation. |
| 192 | + </para> |
| 193 | + |
| 194 | + <para> |
| 195 | + The lack of write-ahead logging is just a small matter of programming, |
| 196 | + but since it isn't done yet, a crash could render a <acronym>GiST</acronym> |
| 197 | + index inconsistent, forcing a REINDEX. |
| 198 | + </para> |
| 199 | + |
| 200 | +</sect1> |
| 201 | + |
| 202 | +<sect1 id="examples"> |
| 203 | + <title>Examples</title> |
| 204 | + |
| 205 | + <para> |
| 206 | + To see example implementations of index methods implemented using |
| 207 | + <acronym>GiST</acronym>, examine the following contrib modules: |
| 208 | + </para> |
| 209 | + |
| 210 | + <variablelist> |
| 211 | + <varlistentry> |
| 212 | + <term>btree_gist</term> |
| 213 | + <listitem> |
| 214 | + <para>B-Tree</para> |
| 215 | + </listitem> |
| 216 | + </varlistentry> |
| 217 | + |
| 218 | + <varlistentry> |
| 219 | + <term>cube</term> |
| 220 | + <listitem> |
| 221 | + <para>Indexing for multi-dimensional cubes</para> |
| 222 | + </listitem> |
| 223 | + </varlistentry> |
| 224 | + |
| 225 | + <varlistentry> |
| 226 | + <term>intarray</term> |
| 227 | + <listitem> |
| 228 | + <para>RD-Tree for one-dimensional array of int4 values</para> |
| 229 | + </listitem> |
| 230 | + </varlistentry> |
| 231 | + |
| 232 | + <varlistentry> |
| 233 | + <term>ltree</term> |
| 234 | + <listitem> |
| 235 | + <para>Indexing for tree-like stuctures</para> |
| 236 | + </listitem> |
| 237 | + </varlistentry> |
| 238 | + |
| 239 | + <varlistentry> |
| 240 | + <term>rtree_gist</term> |
| 241 | + <listitem> |
| 242 | + <para>R-Tree</para> |
| 243 | + </listitem> |
| 244 | + </varlistentry> |
| 245 | + |
| 246 | + <varlistentry> |
| 247 | + <term>seg</term> |
| 248 | + <listitem> |
| 249 | + <para>Storage and indexed access for <quote>float ranges</quote></para> |
| 250 | + </listitem> |
| 251 | + </varlistentry> |
| 252 | + |
| 253 | + <varlistentry> |
| 254 | + <term>tsearch and tsearch2</term> |
| 255 | + <listitem> |
| 256 | + <para>Full text indexing</para> |
| 257 | + </listitem> |
| 258 | + </varlistentry> |
| 259 | + </variablelist> |
| 260 | + |
| 261 | +</sect1> |
| 262 | + |
| 263 | +</chapter> |
0 commit comments