- <!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.58 2010/08/20 13:59:45 tgl Exp $ -->
+ <!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.59 2010/08/25 21:42:55 tgl Exp $ -->

<chapter id="textsearch">
<title>Full Text Search</title>

as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
- contains a more <quote>dense</> region of query words is
+ contains a more <quote>dense</> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
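
As an illustration of those stored positions (a sketch using the built-in english configuration; each lexeme carries the word positions that proximity ranking consults):

SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4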
@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... "
<screen>
SELECT ts_headline('english',
'The most common type of search
- is to find all documents containing given query terms
+ is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'));
ts_headline
------------------------------------------------------------
- containing given <b>query</b> terms
+ containing given <b>query</b> terms
and return them in order of their <b>similarity</b> to the
<b>query</b>.
@@ -1166,7 +1166,7 @@ SELECT ts_headline('english',
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
- to_tsquery('query & similarity'),
+ to_tsquery('query & similarity'),
'StartSel = <, StopSel = >');
ts_headline
-------------------------------------------------------
@@ -2064,6 +2064,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
(notice that one token can produce more than one lexeme)
</para>
</listitem>
+ <listitem>
+ <para>
+ a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
+ the original token with a new token to be passed to subsequent
+ dictionaries (a dictionary that does this is called a
+ <firstterm>filtering dictionary</>)
+ </para>
+ </listitem>
<listitem>
<para>
an empty array if the dictionary knows the token, but it is a stop word
@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
+ Normally, the first dictionary that returns a non-<literal>NULL</>
+ output determines the result, and any remaining dictionaries are not
+ consulted; but a filtering dictionary can replace the given word
+ with a modified word, which is then passed to subsequent dictionaries.
+ </para>
+
+ <para>
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
</programlisting>
</para>

+ <para>
+ A filtering dictionary can be placed anywhere in the list, except at the
+ end where it'd be useless. Filtering dictionaries are useful to partially
+ normalize words to simplify the task of later dictionaries. For example,
+ a filtering dictionary could be used to remove accents from accented
+ letters, as is done by the
+ <link linkend="unaccent"><filename>contrib/unaccent</></link>
+ extension module.
+ </para>
+
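
A minimal sketch of how such a filtering dictionary ends up in a configuration, assuming the unaccent dictionary supplied by contrib/unaccent has been installed (names follow that module's documentation):

CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

ALTER TEXT SEARCH CONFIGURATION fr
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4

Because unaccent is a filtering dictionary, it rewrites the token ('Hôtels' becomes 'Hotels') and hands it on to french_stem rather than ending the lookup there.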
<sect2 id="textsearch-stopwords">
<title>Stop Words</title>
@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict (
Here, <literal>english</literal> is the base name of a file of stop words.
The file's full name will be
<filename>$SHAREDIR/tsearch_data/english.stop</>,
- where <literal>$SHAREDIR</> means the
+ where <literal>$SHAREDIR</> means the
<productname>PostgreSQL</productname> installation's shared-data directory,
often <filename>/usr/local/share/postgresql</> (use <command>pg_config
--sharedir</> to determine it if you're not sure).
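
For instance (a sketch; the directory shown is just the example location mentioned above):

$ pg_config --sharedir
/usr/local/share/postgresql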
@@ -2295,85 +2320,82 @@ SELECT * FROM ts_debug('english', 'Paris');
asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
</screen>
</para>
-
+
<para>
- An asterisk (<literal>*</literal>) at the end of definition word indicates
- that definition word is a prefix, and <function>to_tsquery()</function>
- function will transform that definition to the prefix search format (see
- <xref linkend="textsearch-parsing-queries">).
- Notice that it is ignored in <function>to_tsvector()</function>.
+ The only parameter required by the <literal>synonym</> template is
+ <literal>SYNONYMS</>, which is the base name of its configuration file
+ — <literal>my_synonyms</> in the above example.
+ The file's full name will be
+ <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
+ (where <literal>$SHAREDIR</> means the
+ <productname>PostgreSQL</> installation's shared-data directory).
+ The file format is just one line
+ per word to be substituted, with the word followed by its synonym,
+ separated by white space. Blank lines and trailing spaces are ignored.
+ </para>
+
+ <para>
+ The <literal>synonym</> template also has an optional parameter
+ <literal>CaseSensitive</>, which defaults to <literal>false</>. When
+ <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
+ are folded to lower case, as are input tokens. When it is
+ <literal>true</>, words and tokens are not folded to lower case,
+ but are compared as-is.
</para>
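
A sketch of declaring a case-sensitive synonym dictionary (my_synonym_cs is a hypothetical name; my_synonyms is the file base name used in the example above):

CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);

With this definition, 'Paris' and 'paris' are looked up as distinct entries in my_synonyms.syn rather than both being folded to lower case first.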

<para>
- Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
+ An asterisk (<literal>*</literal>) can be placed at the end of a synonym
+ in the configuration file. This indicates that the synonym is a prefix.
+ The asterisk is ignored when the entry is used in
+ <function>to_tsvector()</function>, but when it is used in
+ <function>to_tsquery()</function>, the result will be a query item with
+ the prefix match marker (see
+ <xref linkend="textsearch-parsing-queries">).
+ For example, suppose we have these entries in
+ <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
<programlisting>
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
</programlisting>
- </para>
-
- <para>
- Results:
+ Then we will get these results:
<screen>
- =# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
- =# SELECT ts_lexize('syn','indices');
+ mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
+ mydb=# SELECT ts_lexize('syn','indices');
ts_lexize
-----------
{index}
(1 row)

- =# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
- =# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
- =# SELECT to_tsquery('tst','indices');
+ mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
+ mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
+ mydb=# SELECT to_tsvector('tst','indices');
+ to_tsvector
+ -------------
+ 'index':1
+ (1 row)
+
+ mydb=# SELECT to_tsquery('tst','indices');
to_tsquery
------------
'index':*
(1 row)

- =# SELECT 'indexes are very useful'::tsvector;
+ mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector
---------------------------------
'are' 'indexes' 'useful' 'very'
(1 row)

- =# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
+ mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
t
(1 row)
-
- =# SELECT to_tsvector('tst','indices');
- to_tsvector
- -------------
- 'index':1
- (1 row)
</screen>
</para>
-
- <para>
- The only parameter required by the <literal>synonym</> template is
- <literal>SYNONYMS</>, which is the base name of its configuration file
- — <literal>my_synonyms</> in the above example.
- The file's full name will be
- <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
- (where <literal>$SHAREDIR</> means the
- <productname>PostgreSQL</> installation's shared-data directory).
- The file format is just one line
- per word to be substituted, with the word followed by its synonym,
- separated by white space. Blank lines and trailing spaces are ignored.
- </para>
-
- <para>
- The <literal>synonym</> template also has an optional parameter
- <literal>CaseSensitive</>, which defaults to <literal>false</>. When
- <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
- are folded to lower case, as are input tokens. When it is
- <literal>true</>, words and tokens are not folded to lower case,
- but are compared as-is.
- </para>

</sect2>

<sect2 id="textsearch-thesaurus">