Skip to content

Commit 0ca9907

Browse files
committed
GIN documentation and slightly improving GiST docs.
Thanks to Christopher Kings-Lynne <chris.kingslynne@gmail.com> for initial version and Jeff Davis <pgsql@j-davis.com> for inspection
1 parent 4eef745 commit 0ca9907

File tree

8 files changed

+395
-20
lines changed

8 files changed

+395
-20
lines changed

doc/src/sgml/config.sgml

+15-2
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.85 2006/09/08 15:55:52 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.86 2006/09/14 11:16:27 teodor Exp $ -->
22

33
<chapter Id="runtime-config">
44
<title>Server Configuration</title>
@@ -2172,7 +2172,20 @@ SELECT * FROM parent WHERE key = 2400;
21722172
</para>
21732173
</listitem>
21742174
</varlistentry>
2175-
2175+
2176+
<varlistentry id="guc-gin-fuzzy-search-limit" xreflabel="gin_fuzzy_search_limit">
2177+
<term><varname>gin_fuzzy_search_limit</varname> (<type>integer</type>)</term>
2178+
<indexterm>
2179+
<primary><varname>gin_fuzzy_search_limit</> configuration parameter</primary>
2180+
</indexterm>
2181+
<listitem>
2182+
<para>
2183+
Soft upper limit of the size of the returned set by GIN index. For more
2184+
information see <xref linkend="gin-tips">.
2185+
</para>
2186+
</listitem>
2187+
</varlistentry>
2188+
21762189
</variablelist>
21772190
</sect2>
21782191
</sect1>

doc/src/sgml/filelist.sgml

+2-1
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.46 2006/09/05 03:09:56 momjian Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.47 2006/09/14 11:16:27 teodor Exp $ -->
22

33
<!entity history SYSTEM "history.sgml">
44
<!entity info SYSTEM "info.sgml">
@@ -78,6 +78,7 @@
7878
<!entity catalogs SYSTEM "catalogs.sgml">
7979
<!entity geqo SYSTEM "geqo.sgml">
8080
<!entity gist SYSTEM "gist.sgml">
81+
<!entity gin SYSTEM "gin.sgml">
8182
<!entity planstats SYSTEM "planstats.sgml">
8283
<!entity indexam SYSTEM "indexam.sgml">
8384
<!entity nls SYSTEM "nls.sgml">

doc/src/sgml/geqo.sgml

+3-3
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.36 2006/03/10 19:10:48 momjian Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.37 2006/09/14 11:16:27 teodor Exp $ -->
22

33
<chapter id="geqo">
44
<chapterinfo>
@@ -49,8 +49,8 @@
4949
methods</firstterm> (e.g., nested loop, hash join, merge join in
5050
<productname>PostgreSQL</productname>) to process individual joins
5151
and a diversity of <firstterm>indexes</firstterm> (e.g.,
52-
B-tree, hash, GiST in <productname>PostgreSQL</productname>) as access
53-
paths for relations.
52+
B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
53+
access paths for relations.
5454
</para>
5555

5656
<para>

doc/src/sgml/gin.sgml

+231
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,231 @@
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.1 2006/09/14 11:16:27 teodor Exp $ -->
2+
3+
<chapter id="GIN">
4+
<title>GIN Indexes</title>
5+
6+
<indexterm>
7+
<primary>index</primary>
8+
<secondary>GIN</secondary>
9+
</indexterm>
10+
11+
<sect1 id="gin-intro">
12+
<title>Introduction</title>
13+
14+
<para>
15+
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
16+
an index structure storing a set of (key, posting list) pairs, where
17+
'posting list' is a set of rows in which the key occurs. The
18+
row may contain many keys.
19+
</para>
20+
21+
<para>
22+
It is generalized in the sense that a <acronym>GIN</acronym> index
23+
does not need to be aware of the operation that it accelerates.
24+
Instead, it uses custom strategies defined for particular data types.
25+
</para>
26+
27+
<para>
28+
One advantage of <acronym>GIN</acronym> is that it allows the development
29+
of custom data types with the appropriate access methods, by
30+
an expert in the domain of the data type, rather than a database expert.
31+
This is much the same advantage as using <acronym>GiST</acronym>.
32+
</para>
33+
34+
<para>
35+
The <acronym>GIN</acronym>
36+
implementation in <productname>PostgreSQL</productname> is primarily
37+
maintained by Teodor Sigaev and Oleg Bartunov, and there is more
38+
information on their
39+
<ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>.
40+
</para>
41+
42+
</sect1>
43+
44+
<sect1 id="gin-extensibility">
45+
<title>Extensibility</title>
46+
47+
<para>
48+
The <acronym>GIN</acronym> interface has a high level of abstraction,
49+
requiring the access method implementer to only implement the semantics of
50+
the data type being accessed. The <acronym>GIN</acronym> layer itself
51+
takes care of concurrency, logging and searching the tree structure.
52+
</para>
53+
54+
<para>
55+
All it takes to get a <acronym>GIN</acronym> access method working
56+
is to implement four user-defined methods, which define the behavior of
57+
keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
58+
along with generality, code reuse, and a clean interface.
59+
</para>
60+
61+
</sect1>
62+
63+
<sect1 id="gin-implementation">
64+
<title>Implementation</title>
65+
66+
<para>
67+
Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
68+
over keys, where each key is an element of the indexed value
69+
(element of array, for example) and where each tuple in a leaf page is
70+
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
71+
list of heap pointers (PL, posting list) if the tuple is small enough.
72+
</para>
73+
74+
<para>
75+
There are four methods that an index operator class for
76+
<acronym>GIN</acronym> must provide (prototypes are in pseudocode):
77+
</para>
78+
79+
<variablelist>
80+
<varlistentry>
81+
<term>int compare( Datum a, Datum b )</term>
82+
<listitem>
83+
<para>
84+
Compares keys (not indexed values!) and returns an integer less than
85+
zero, zero, or greater than zero, indicating whether the first key is
86+
less than, equal to, or greater than the second.
87+
</para>
88+
</listitem>
89+
</varlistentry>
90+
91+
<varlistentry>
92+
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
93+
<listitem>
94+
<para>
95+
Returns an array of keys of value to be indexed, nkeys should
96+
contain the number of returned keys.
97+
</para>
98+
</listitem>
99+
</varlistentry>
100+
101+
<varlistentry>
102+
<term>Datum* extractQuery(Datum query, uint32 nkeys,
103+
StrategyNumber n)</term>
104+
<listitem>
105+
<para>
106+
Returns an array of keys of the query to be executed. n contains
107+
strategy number of operation (see <xref linkend="xindex-strategies">).
108+
Depending on n, query may be different type.
109+
</para>
110+
</listitem>
111+
</varlistentry>
112+
113+
<varlistentry>
114+
<term>bool consistent( bool check[], StrategyNumber n, Datum query)</term>
115+
<listitem>
116+
<para>
117+
Returns TRUE if indexed value satisfies query qualifier with strategy n
118+
(or may satisfy in case of RECHECK mark in operator class).
119+
Each element of the check array is TRUE if indexed value has a
120+
corresponding key in the query: if (check[i] == TRUE ) the i-th key of
121+
the query is present in the indexed value.
122+
</para>
123+
</listitem>
124+
</varlistentry>
125+
126+
</variablelist>
127+
128+
</sect1>
129+
130+
<sect1 id="gin-tips">
131+
<title>GIN tips and trics</title>
132+
133+
<variablelist>
134+
<varlistentry>
135+
<term>Create vs insert</term>
136+
<listitem>
137+
<para>
138+
In most cases, insertion into <acronym>GIN</acronym> index is slow because
139+
many GIN keys may be inserted for each table row. So, when loading data
140+
in bulk it may be useful to drop index and recreate it
141+
after the data is loaded in the table.
142+
</para>
143+
</listitem>
144+
</varlistentry>
145+
146+
<varlistentry>
147+
<term>gin_fuzzy_search_limit</term>
148+
<listitem>
149+
<para>
150+
The primary goal of development <acronym>GIN</acronym> indices was
151+
support for highly scalable, full-text search in
152+
<productname>PostgreSQL</productname> and there are often situations when
153+
a full-text search returns a very large set of results. Since reading
154+
tuples from the disk and sorting them could take a lot of time, this is
155+
unacceptable for production. (Note that the index search itself is very
156+
fast.)
157+
</para>
158+
<para>
159+
Such queries usually contain very frequent words, so the results are not
160+
very helpful. To facilitate execution of such queries
161+
<acronym>GIN</acronym> has a configurable soft upper limit of the size
162+
of the returned set, determined by the
163+
<varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by
164+
default (no limit).
165+
</para>
166+
<para>
167+
If a non-zero search limit is set, then the returned set is a subset of
168+
the whole result set, chosen at random.
169+
</para>
170+
<para>
171+
"Soft" means that the actual number of returned results could slightly
172+
differ from the specified limit, depending on the query and the quality
173+
of the system's random number generator.
174+
</para>
175+
</listitem>
176+
</varlistentry>
177+
<variablelist>
178+
179+
</sect1>
180+
181+
<sect1 id="gin-limit">
182+
<title>Limitations</title>
183+
184+
<para>
185+
<acronym>GIN</acronym> doesn't support full scan of index due to it's
186+
extremely inefficiency: because of a lot of keys per value,
187+
each heap pointer will returned several times.
188+
</para>
189+
190+
<para>
191+
When extractQuery returns zero number of keys, <acronym>GIN</acronym> will
192+
emit a error: for different opclass and strategy semantic meaning of void
193+
query may be different (for example, any array contains void array,
194+
but they aren't overlapped with void one), and <acronym>GIN</acronym> can't
195+
suggest reasonable answer.
196+
</para>
197+
198+
<para>
199+
<acronym>GIN</acronym> searches keys only by equality matching. This may
200+
be improved in future.
201+
</para>
202+
</sect1>
203+
<sect1 id="gin-examples">
204+
<title>Examples</title>
205+
206+
<para>
207+
The <productname>PostgreSQL</productname> source distribution includes
208+
<acronym>GIN</acronym> classes for one-dimensional arrays of all internal
209+
types. The following
210+
<filename>contrib</> modules also contain <acronym>GIN</acronym>
211+
operator classes:
212+
</para>
213+
214+
<variablelist>
215+
<varlistentry>
216+
<term>intarray</term>
217+
<listitem>
218+
<para>Enhanced support for int4[]</para>
219+
</listitem>
220+
</varlistentry>
221+
222+
<varlistentry>
223+
<term>tsearch2</term>
224+
<listitem>
225+
<para>Support for inverted text indexing. This is much faster for very
226+
large, mostly-static sets of documents.
227+
</para>
228+
</listitem>
229+
</varlistentry>
230+
231+
</chapter>

doc/src/sgml/indices.sgml

+33-2
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.61 2006/09/13 23:42:26 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.62 2006/09/14 11:16:27 teodor Exp $ -->
22

33
<chapter id="indexes">
44
<title id="indexes-title">Indexes</title>
@@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id);
116116

117117
<para>
118118
<productname>PostgreSQL</productname> provides several index types:
119-
B-tree, Hash, and GiST. Each index type uses a different
119+
B-tree, Hash, GIN and GiST. Each index type uses a different
120120
algorithm that is best suited to different types of queries.
121121
By default, the <command>CREATE INDEX</command> command will create a
122122
B-tree index, which fits the most common situations.
@@ -238,6 +238,37 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
238238
classes are available in the <literal>contrib</> collection or as separate
239239
projects. For more information see <xref linkend="GiST">.
240240
</para>
241+
<para>
242+
<indexterm>
243+
<primary>index</primary>
244+
<secondary>GIN</secondary>
245+
</indexterm>
246+
<indexterm>
247+
<primary>GIN</primary>
248+
<see>index</see>
249+
</indexterm>
250+
GIN is a inverted index and it's usable for values which have more
251+
than one key, arrays for example. Like to GiST, GIN may support
252+
many different user-defined indexing strategies and the particular
253+
operators with which a GIN index can be used vary depending on the
254+
indexing strategy.
255+
As an example, the standard distribution of
256+
<productname>PostgreSQL</productname> includes GIN operator classes
257+
for one-dimentional arrays, which support indexed
258+
queries using these operators:
259+
260+
<simplelist>
261+
<member><literal>&lt;@</literal></member>
262+
<member><literal>@&gt;</literal></member>
263+
<member><literal>=</literal></member>
264+
<member><literal>&amp;&amp;</literal></member>
265+
</simplelist>
266+
267+
(See <xref linkend="functions-array"> for the meaning of
268+
these operators.)
269+
Another GIN operator classes are available in the <literal>contrib</>
270+
tsearch2 and intarray modules. For more information see <xref linkend="GIN">.
271+
</para>
241272
</sect1>
242273

243274

doc/src/sgml/mvcc.sgml

+16-2
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/mvcc.sgml,v 2.58 2006/09/03 01:59:09 momjian Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/mvcc.sgml,v 2.59 2006/09/14 11:16:27 teodor Exp $ -->
22

33
<chapter id="mvcc">
44
<title>Concurrency Control</title>
@@ -987,6 +987,20 @@ UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222;
987987
</para>
988988
</listitem>
989989
</varlistentry>
990+
991+
<varlistentry>
992+
<term>
993+
<acronym>GIN</acronym> indexes
994+
</term>
995+
<listitem>
996+
<para>
997+
Short-term share/exclusive page-level locks are used for
998+
read/write access. Locks are released immediately after each
999+
index row is fetched or inserted. However, note that GIN index
1000+
usually requires several inserts per one table row.
1001+
</para>
1002+
</listitem>
1003+
</varlistentry>
9901004
</variablelist>
9911005
</para>
9921006

@@ -995,7 +1009,7 @@ UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222;
9951009
applications; since they also have more features than hash
9961010
indexes, they are the recommended index type for concurrent
9971011
applications that need to index scalar data. When dealing with
998-
non-scalar data, B-trees are not useful, and GiST indexes should
1012+
non-scalar data, B-trees are not useful, and GiST or GIN indexes should
9991013
be used instead.
10001014
</para>
10011015
</sect1>

doc/src/sgml/ref/create_opclass.sgml

+2-2
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
<!--
2-
$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.15 2006/09/10 17:36:52 tgl Exp $
2+
$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.16 2006/09/14 11:16:27 teodor Exp $
33
PostgreSQL documentation
44
-->
55

@@ -192,7 +192,7 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
192192
<para>
193193
The data type actually stored in the index. Normally this is
194194
the same as the column data type, but some index methods
195-
(only GiST at this writing) allow it to be different. The
195+
(GIN and GiST for now) allow it to be different. The
196196
<literal>STORAGE</> clause must be omitted unless the index
197197
method allows a different type to be used.
198198
</para>

0 commit comments

Comments
 (0)