Skip to content

Commit e6dbcb7

Browse files
committed
Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility. Teodor Sigaev and Oleg Bartunov
1 parent e1bdd07 commit e6dbcb7

32 files changed

+1279
-503
lines changed

doc/src/sgml/datatype.sgml

Lines changed: 20 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="datatype">
44
<title id="datatype-title">Data Types</title>
@@ -3298,18 +3298,17 @@ SELECT * FROM test;
32983298
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
32993299
tsvector
33003300
----------------------------------------------------
3301-
'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
3301+
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
33023302
</programlisting>
33033303

3304-
(As the example shows, the sorting is first by length and then
3305-
alphabetically, but that detail is seldom important.) To represent
3304+
To represent
33063305
lexemes containing whitespace or punctuation, surround them with quotes:
33073306

33083307
<programlisting>
33093308
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
33103309
tsvector
33113310
-------------------------------------------
3312-
'the' ' ' 'lexeme' 'spaces' 'contains'
3311+
' ' 'contains' 'lexeme' 'spaces' 'the'
33133312
</programlisting>
33143313

33153314
(We use dollar-quoted string literals in this example and the next one,
@@ -3320,7 +3319,7 @@ SELECT $$the lexeme ' ' contains spaces$$::tsvector;
33203319
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
33213320
tsvector
33223321
------------------------------------------------
3323-
'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
3322+
'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
33243323
</programlisting>
33253324

33263325
Optionally, integer <firstterm>position(s)</>
@@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
33303329
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
33313330
tsvector
33323331
-------------------------------------------------------------------------------
3333-
'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
3332+
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
33343333
</programlisting>
33353334

33363335
A position normally indicates the source word's location in the
@@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
33693368
select 'The Fat Rats'::tsvector;
33703369
tsvector
33713370
--------------------
3372-
'Fat' 'The' 'Rats'
3371+
'Fat' 'Rats' 'The'
33733372
</programlisting>
33743373

33753374
For most English-text-searching applications the above words would
@@ -3439,6 +3438,19 @@ SELECT 'fat:ab &amp; cat'::tsquery;
34393438
</programlisting>
34403439
</para>
34413440

3441+
<para>
3442+
Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
3443+
to specify prefix matching:
3444+
<programlisting>
3445+
SELECT 'super:*'::tsquery;
3446+
tsquery
3447+
-----------
3448+
'super':*
3449+
</programlisting>
3450+
This query will match any word in a <type>tsvector</> that begins
3451+
with <quote>super</>.
3452+
</para>
3453+
34423454
<para>
34433455
Quoting rules for lexemes are the same as described above for
34443456
lexemes in <type>tsvector</>; and, as with <type>tsvector</>,

doc/src/sgml/gin.sgml

Lines changed: 95 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="GIN">
44
<title>GIN Indexes</title>
@@ -52,15 +52,15 @@
5252
</para>
5353

5454
<para>
55-
All it takes to get a <acronym>GIN</acronym> access method working
56-
is to implement four user-defined methods, which define the behavior of
55+
All it takes to get a <acronym>GIN</acronym> access method working is to
56+
implement four (or five) user-defined methods, which define the behavior of
5757
keys in the tree and the relationships between keys, indexed values,
5858
and indexable queries. In short, <acronym>GIN</acronym> combines
5959
extensibility with generality, code reuse, and a clean interface.
6060
</para>
6161

6262
<para>
63-
The four methods that an index operator class for
63+
The four methods that an operator class for
6464
<acronym>GIN</acronym> must provide are:
6565
</para>
6666

@@ -77,7 +77,7 @@
7777
</varlistentry>
7878

7979
<varlistentry>
80-
<term>Datum* extractValue(Datum inputValue, int32 *nkeys)</term>
80+
<term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
8181
<listitem>
8282
<para>
8383
Returns an array of keys given a value to be indexed. The
@@ -87,8 +87,8 @@
8787
</varlistentry>
8888

8989
<varlistentry>
90-
<term>Datum* extractQuery(Datum query, int32 *nkeys,
91-
StrategyNumber n)</term>
90+
<term>Datum *extractQuery(Datum query, int32 *nkeys,
91+
StrategyNumber n, bool **pmatch)</term>
9292
<listitem>
9393
<para>
9494
Returns an array of keys given a value to be queried; that is,
@@ -100,13 +100,22 @@
100100
to consult <literal>n</> to determine the data type of
101101
<literal>query</> and the key values that need to be extracted.
102102
The number of returned keys must be stored into <literal>*nkeys</>.
103-
If number of keys is equal to zero then <function>extractQuery</>
104-
should store 0 or -1 into <literal>*nkeys</>. 0 means that any
105-
row matches the <literal>query</> and sequence scan should be
106-
produced. -1 means nothing can satisfy <literal>query</>.
107-
Choice of value should be based on semantics meaning of operation with
108-
given strategy number.
103+
If the query contains no keys then <function>extractQuery</>
104+
should store 0 or -1 into <literal>*nkeys</>, depending on the
105+
semantics of the operator. 0 means that every
106+
value matches the <literal>query</> and a sequential scan should be
107+
produced. -1 means nothing can match the <literal>query</>.
108+
<literal>pmatch</> is an output argument for use when partial match
109+
is supported. To use it, <function>extractQuery</> must allocate
110+
an array of <literal>*nkeys</> booleans and store its address at
111+
<literal>*pmatch</>. Each element of the array should be set to TRUE
112+
if the corresponding key requires partial match, FALSE if not.
113+
If <literal>*pmatch</> is set to NULL then GIN assumes partial match
114+
is not required. The variable is initialized to NULL before call,
115+
so this argument can simply be ignored by operator classes that do
116+
not support partial match.
109117
</para>
118+
110119
</listitem>
111120
</varlistentry>
112121

@@ -133,6 +142,39 @@
133142

134143
</variablelist>
135144

145+
<para>
146+
Optionally, an operator class for
147+
<acronym>GIN</acronym> can supply a fifth method:
148+
</para>
149+
150+
<variablelist>
151+
152+
<varlistentry>
153+
<term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
154+
<listitem>
155+
<para>
156+
Compare a partial-match query to an index key. Returns an integer
157+
whose sign indicates the result: less than zero means the index key
158+
does not match the query, but the index scan should continue; zero
159+
means that the index key does match the query; greater than zero
160+
indicates that the index scan should stop because no more matches
161+
are possible. The strategy number <literal>n</> of the operator
162+
that generated the partial match query is provided, in case its
163+
semantics are needed to determine when to end the scan.
164+
</para>
165+
</listitem>
166+
</varlistentry>
167+
168+
</variablelist>
169+
170+
<para>
171+
To support <quote>partial match</> queries, an operator class must
172+
provide the <function>comparePartial</> method, and its
173+
<function>extractQuery</> method must set the <literal>pmatch</>
174+
parameter when a partial-match query is encountered. See
175+
<xref linkend="gin-partial-match"> for details.
176+
</para>
177+
136178
</sect1>
137179

138180
<sect1 id="gin-implementation">
@@ -146,6 +188,33 @@
146188
list of heap pointers (PL, posting list) if the list is small enough.
147189
</para>
148190

191+
<sect2 id="gin-partial-match">
192+
<title>Partial match algorithm</title>
193+
194+
<para>
195+
GIN can support <quote>partial match</> queries, in which the query
196+
does not determine an exact match for one or more keys, but the possible
197+
matches fall within a reasonably narrow range of key values (within the
198+
key sorting order determined by the <function>compare</> support method).
199+
The <function>extractQuery</> method, instead of returning a key value
200+
to be matched exactly, returns a key value that is the lower bound of
201+
the range to be searched, and sets the <literal>pmatch</> flag true.
202+
The key range is then searched using the <function>comparePartial</>
203+
method. <function>comparePartial</> must return zero for an actual
204+
match, less than zero for a non-match that is still within the range
205+
to be searched, or greater than zero if the index key is past the range
206+
that could match.
207+
</para>
208+
209+
<para>
210+
During a partial-match scan, all <literal>itemPointer</>s for matching keys
211+
are OR'ed into a <literal>TIDBitmap</>.
212+
The scan fails if the <literal>TIDBitmap</> becomes lossy.
213+
In this case an error message will be reported with advice
214+
to increase <literal>work_mem</>.
215+
</para>
216+
</sect2>
217+
149218
</sect1>
150219

151220
<sect1 id="gin-tips">
@@ -236,8 +305,14 @@
236305
</para>
237306

238307
<para>
239-
<acronym>GIN</acronym> searches keys only by equality matching. This might
240-
be improved in future.
308+
It is possible for an operator class to circumvent the restriction against
309+
full index scan. To do that, <function>extractValue</> must return at least
310+
one (possibly dummy) key for every indexed value, and
311+
<function>extractQuery</function> must convert an unrestricted search into
312+
a partial-match query that will scan the whole index. This is inefficient
313+
but might be necessary to avoid corner-case failures with operators such
314+
as LIKE. Note however that failure could still occur if the intermediate
315+
<literal>TIDBitmap</> becomes lossy.
241316
</para>
242317
</sect1>
243318

@@ -247,9 +322,11 @@
247322
<para>
248323
The <productname>PostgreSQL</productname> source distribution includes
249324
<acronym>GIN</acronym> operator classes for <type>tsvector</> and
250-
for one-dimensional arrays of all internal types. The following
251-
<filename>contrib</> modules also contain <acronym>GIN</acronym>
252-
operator classes:
325+
for one-dimensional arrays of all internal types. Prefix searching in
326+
<type>tsvector</> is implemented using the <acronym>GIN</> partial match
327+
feature.
328+
The following <filename>contrib</> modules also contain
329+
<acronym>GIN</acronym> operator classes:
253330
</para>
254331

255332
<variablelist>

doc/src/sgml/textsearch.sgml

Lines changed: 17 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="textsearch">
44
<title id="textsearch-title">Full Text Search</title>
@@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
754754
'fat' | 'rat':AB
755755
</programlisting>
756756

757+
Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
758+
759+
<programlisting>
760+
SELECT to_tsquery('supern:*A &amp; star:A*B');
761+
to_tsquery
762+
--------------------------
763+
'supern':*A &amp; 'star':*AB
764+
</programlisting>
765+
766+
Such a lexeme will match any word in a <type>tsvector</> that begins
767+
with the given string.
768+
</para>
769+
770+
<para>
757771
<function>to_tsquery</function> can also accept single-quoted
758772
phrases. This is primarily useful when the configuration includes a
759773
thesaurus dictionary that may trigger on such phrases.
@@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' &amp; !crab');
798812
</programlisting>
799813

800814
Note that <function>plainto_tsquery</> cannot
801-
recognize either Boolean operators or weight labels in its input:
815+
recognize Boolean operators, weight labels, or prefix-match labels
816+
in its input:
802817

803818
<programlisting>
804819
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');

doc/src/sgml/xindex.sgml

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<sect1 id="xindex">
44
<title>Interfacing Extensions To Indexes</title>
@@ -444,6 +444,13 @@
444444
<entry>consistent - determine whether value matches query condition</entry>
445445
<entry>4</entry>
446446
</row>
447+
<row>
448+
<entry>comparePartial - (optional method) compare partial key from
449+
query and key from index, and return an integer less than zero, zero,
450+
or greater than zero, indicating whether GIN should ignore this index
451+
entry, treat the entry as a match, or stop the index scan</entry>
452+
<entry>5</entry>
453+
</row>
447454
</tbody>
448455
</tgroup>
449456
</table>

0 commit comments

Comments
 (0)