|
| 1 | +<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.1 2006/09/14 11:16:27 teodor Exp $ --> |
| 2 | + |
| 3 | +<chapter id="GIN"> |
| 4 | +<title>GIN Indexes</title> |
| 5 | + |
| 6 | + <indexterm> |
| 7 | + <primary>index</primary> |
| 8 | + <secondary>GIN</secondary> |
| 9 | + </indexterm> |
| 10 | + |
| 11 | +<sect1 id="gin-intro"> |
| 12 | + <title>Introduction</title> |
| 13 | + |
| 14 | + <para> |
| 15 | + <acronym>GIN</acronym> stands for Generalized Inverted Index. It is |
| 16 | + an index structure storing a set of (key, posting list) pairs, where |
| 17 | + 'posting list' is a set of rows in which the key occurs. The |
| 18 | + row may contain many keys. |
| 19 | + </para> |
| 20 | + |
| 21 | + <para> |
| 22 | + It is generalized in the sense that a <acronym>GIN</acronym> index |
| 23 | + does not need to be aware of the operation that it accelerates. |
| 24 | + Instead, it uses custom strategies defined for particular data types. |
| 25 | + </para> |
| 26 | + |
| 27 | + <para> |
| 28 | + One advantage of <acronym>GIN</acronym> is that it allows the development |
| 29 | + of custom data types with the appropriate access methods, by |
| 30 | + an expert in the domain of the data type, rather than a database expert. |
| 31 | + This is much the same advantage as using <acronym>GiST</acronym>. |
| 32 | + </para> |
| 33 | + |
| 34 | + <para> |
| 35 | + The <acronym>GIN</acronym> |
| 36 | + implementation in <productname>PostgreSQL</productname> is primarily |
| 37 | + maintained by Teodor Sigaev and Oleg Bartunov, and there is more |
| 38 | + information on their |
| 39 | + <ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>. |
| 40 | + </para> |
| 41 | + |
| 42 | +</sect1> |
| 43 | + |
| 44 | +<sect1 id="gin-extensibility"> |
| 45 | + <title>Extensibility</title> |
| 46 | + |
| 47 | + <para> |
| 48 | + The <acronym>GIN</acronym> interface has a high level of abstraction, |
| 49 | + requiring the access method implementer to only implement the semantics of |
| 50 | + the data type being accessed. The <acronym>GIN</acronym> layer itself |
| 51 | + takes care of concurrency, logging and searching the tree structure. |
| 52 | + </para> |
| 53 | + |
| 54 | + <para> |
| 55 | + All it takes to get a <acronym>GIN</acronym> access method working |
| 56 | + is to implement four user-defined methods, which define the behavior of |
| 57 | + keys in the tree. In short, <acronym>GIN</acronym> combines extensibility |
| 58 | + along with generality, code reuse, and a clean interface. |
| 59 | + </para> |
| 60 | + |
| 61 | +</sect1> |
| 62 | + |
| 63 | +<sect1 id="gin-implementation"> |
| 64 | + <title>Implementation</title> |
| 65 | + |
| 66 | + <para> |
| 67 | + Internally, <acronym>GIN</acronym> consists of a B-tree index constructed |
| 68 | + over keys, where each key is an element of the indexed value |
| 69 | + (element of array, for example) and where each tuple in a leaf page is |
| 70 | + either a pointer to a B-tree over heap pointers (PT, posting tree), or a |
| 71 | + list of heap pointers (PL, posting list) if the tuple is small enough. |
| 72 | + </para> |
| 73 | + |
| 74 | + <para> |
| 75 | + There are four methods that an index operator class for |
| 76 | + <acronym>GIN</acronym> must provide (prototypes are in pseudocode): |
| 77 | + </para> |
| 78 | + |
| 79 | + <variablelist> |
| 80 | + <varlistentry> |
| 81 | + <term>int compare( Datum a, Datum b )</term> |
| 82 | + <listitem> |
| 83 | + <para> |
| 84 | + Compares keys (not indexed values!) and returns an integer less than |
| 85 | + zero, zero, or greater than zero, indicating whether the first key is |
| 86 | + less than, equal to, or greater than the second. |
| 87 | + </para> |
| 88 | + </listitem> |
| 89 | + </varlistentry> |
| 90 | + |
| 91 | + <varlistentry> |
| 92 | + <term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term> |
| 93 | + <listitem> |
| 94 | + <para> |
| 95 | + Returns an array of keys of value to be indexed, nkeys should |
| 96 | + contain the number of returned keys. |
| 97 | + </para> |
| 98 | + </listitem> |
| 99 | + </varlistentry> |
| 100 | + |
| 101 | + <varlistentry> |
| 102 | + <term>Datum* extractQuery(Datum query, uint32 nkeys, |
| 103 | + StrategyNumber n)</term> |
| 104 | + <listitem> |
| 105 | + <para> |
| 106 | + Returns an array of keys of the query to be executed. n contains |
| 107 | + strategy number of operation (see <xref linkend="xindex-strategies">). |
| 108 | + Depending on n, query may be different type. |
| 109 | + </para> |
| 110 | + </listitem> |
| 111 | + </varlistentry> |
| 112 | + |
| 113 | + <varlistentry> |
| 114 | + <term>bool consistent( bool check[], StrategyNumber n, Datum query)</term> |
| 115 | + <listitem> |
| 116 | + <para> |
| 117 | + Returns TRUE if indexed value satisfies query qualifier with strategy n |
| 118 | + (or may satisfy in case of RECHECK mark in operator class). |
| 119 | + Each element of the check array is TRUE if indexed value has a |
| 120 | + corresponding key in the query: if (check[i] == TRUE ) the i-th key of |
| 121 | + the query is present in the indexed value. |
| 122 | + </para> |
| 123 | + </listitem> |
| 124 | + </varlistentry> |
| 125 | + |
| 126 | + </variablelist> |
| 127 | + |
| 128 | +</sect1> |
| 129 | + |
| 130 | +<sect1 id="gin-tips"> |
| 131 | +<title>GIN tips and trics</title> |
| 132 | + |
| 133 | + <variablelist> |
| 134 | + <varlistentry> |
| 135 | + <term>Create vs insert</term> |
| 136 | + <listitem> |
| 137 | + <para> |
| 138 | + In most cases, insertion into <acronym>GIN</acronym> index is slow because |
| 139 | + many GIN keys may be inserted for each table row. So, when loading data |
| 140 | + in bulk it may be useful to drop index and recreate it |
| 141 | + after the data is loaded in the table. |
| 142 | + </para> |
| 143 | + </listitem> |
| 144 | + </varlistentry> |
| 145 | + |
| 146 | + <varlistentry> |
| 147 | + <term>gin_fuzzy_search_limit</term> |
| 148 | + <listitem> |
| 149 | + <para> |
| 150 | + The primary goal of development <acronym>GIN</acronym> indices was |
| 151 | + support for highly scalable, full-text search in |
| 152 | + <productname>PostgreSQL</productname> and there are often situations when |
| 153 | + a full-text search returns a very large set of results. Since reading |
| 154 | + tuples from the disk and sorting them could take a lot of time, this is |
| 155 | + unacceptable for production. (Note that the index search itself is very |
| 156 | + fast.) |
| 157 | + </para> |
| 158 | + <para> |
| 159 | + Such queries usually contain very frequent words, so the results are not |
| 160 | + very helpful. To facilitate execution of such queries |
| 161 | + <acronym>GIN</acronym> has a configurable soft upper limit of the size |
| 162 | + of the returned set, determined by the |
| 163 | + <varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by |
| 164 | + default (no limit). |
| 165 | + </para> |
| 166 | + <para> |
| 167 | + If a non-zero search limit is set, then the returned set is a subset of |
| 168 | + the whole result set, chosen at random. |
| 169 | + </para> |
| 170 | + <para> |
| 171 | + "Soft" means that the actual number of returned results could slightly |
| 172 | + differ from the specified limit, depending on the query and the quality |
| 173 | + of the system's random number generator. |
| 174 | + </para> |
| 175 | + </listitem> |
| 176 | + </varlistentry> |
| 177 | + <variablelist> |
| 178 | + |
| 179 | +</sect1> |
| 180 | + |
| 181 | +<sect1 id="gin-limit"> |
| 182 | + <title>Limitations</title> |
| 183 | + |
| 184 | + <para> |
| 185 | + <acronym>GIN</acronym> doesn't support full scan of index due to it's |
| 186 | + extremely inefficiency: because of a lot of keys per value, |
| 187 | + each heap pointer will returned several times. |
| 188 | + </para> |
| 189 | + |
| 190 | + <para> |
| 191 | + When extractQuery returns zero number of keys, <acronym>GIN</acronym> will |
| 192 | + emit a error: for different opclass and strategy semantic meaning of void |
| 193 | + query may be different (for example, any array contains void array, |
| 194 | + but they aren't overlapped with void one), and <acronym>GIN</acronym> can't |
| 195 | + suggest reasonable answer. |
| 196 | + </para> |
| 197 | + |
| 198 | + <para> |
| 199 | + <acronym>GIN</acronym> searches keys only by equality matching. This may |
| 200 | + be improved in future. |
| 201 | + </para> |
| 202 | +</sect1> |
| 203 | +<sect1 id="gin-examples"> |
| 204 | + <title>Examples</title> |
| 205 | + |
| 206 | + <para> |
| 207 | + The <productname>PostgreSQL</productname> source distribution includes |
| 208 | + <acronym>GIN</acronym> classes for one-dimensional arrays of all internal |
| 209 | + types. The following |
| 210 | + <filename>contrib</> modules also contain <acronym>GIN</acronym> |
| 211 | + operator classes: |
| 212 | + </para> |
| 213 | + |
| 214 | + <variablelist> |
| 215 | + <varlistentry> |
| 216 | + <term>intarray</term> |
| 217 | + <listitem> |
| 218 | + <para>Enhanced support for int4[]</para> |
| 219 | + </listitem> |
| 220 | + </varlistentry> |
| 221 | + |
| 222 | + <varlistentry> |
| 223 | + <term>tsearch2</term> |
| 224 | + <listitem> |
| 225 | + <para>Support for inverted text indexing. This is much faster for very |
| 226 | + large, mostly-static sets of documents. |
| 227 | + </para> |
| 228 | + </listitem> |
| 229 | + </varlistentry> |
| 230 | + |
| 231 | +</chapter> |
0 commit comments