Commit 97c40ce

Allow empty replacement strings in contrib/unaccent.

This is useful in languages where diacritic signs are represented as separate characters; it's also one step towards letting unaccent be used for arbitrary substring substitutions.

In passing, improve the user documentation for unaccent, which was sadly vague about some important details.

Mohammad Alhashash, reviewed by Abhijit Menon-Sen

1 parent 5586327 commit 97c40ce

2 files changed: +54 -11 lines changed

contrib/unaccent/unaccent.c

Lines changed: 23 additions & 6 deletions
@@ -104,11 +104,21 @@ initTrie(char *filename)
 
 			while ((line = tsearch_readline(&trst)) != NULL)
 			{
-				/*
-				 * The format of each line must be "src trg" where src and trg
-				 * are sequences of one or more non-whitespace characters,
-				 * separated by whitespace. Whitespace at start or end of
-				 * line is ignored.
+				/*----------
+				 * The format of each line must be "src" or "src trg", where
+				 * src and trg are sequences of one or more non-whitespace
+				 * characters, separated by whitespace. Whitespace at start
+				 * or end of line is ignored. If trg is omitted, an empty
+				 * string is used as the replacement.
+				 *
+				 * We use a simple state machine, with states
+				 *    0   initial (before src)
+				 *    1   in src
+				 *    2   in whitespace after src
+				 *    3   in trg
+				 *    4   in whitespace after trg
+				 *   -1   syntax error detected (line will be ignored)
+				 *----------
 				 */
 				int state;
 				char *ptr;
@@ -160,7 +170,14 @@ initTrie(char *filename)
 					}
 				}
 
-				if (state >= 3)
+				if (state == 1 || state == 2)
+				{
+					/* trg was omitted, so use "" */
+					trg = "";
+					trglen = 0;
+				}
+
+				if (state > 0)
 					rootTrie = placeChar(rootTrie,
 										 (unsigned char *) src, srclen,
 										 trg, trglen);
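
The state machine described in the comment can be sketched in isolation as follows. This is a minimal illustration and not the code from `unaccent.c`: the loop body of `initTrie` is not shown in this diff, so the transitions are inferred from the state list, and the names `RuleLine` and `parse_rule_line` are hypothetical. It also steps one byte at a time and uses `isspace()`, an assumption made to keep the example self-contained; the real code runs inside PostgreSQL and has to handle multibyte characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Result of parsing one rules-file line (hypothetical type for this sketch). */
typedef struct
{
    const char *src;     /* start of src within the line, or NULL */
    size_t      srclen;  /* length of src in bytes */
    const char *trg;     /* start of trg, or "" when trg was omitted */
    size_t      trglen;  /* length of trg in bytes */
} RuleLine;

/*
 * Parse a line of the form "src" or "src trg".
 * Returns 1 and fills *out when the line yields a rule,
 * 0 when the line is blank or malformed.
 */
static int
parse_rule_line(const char *line, RuleLine *out)
{
    int         state = 0;      /* 0 = initial (before src) */
    const char *p;

    out->src = NULL;
    out->trg = NULL;
    out->srclen = 0;
    out->trglen = 0;

    for (p = line; *p != '\0'; p++)
    {
        if (isspace((unsigned char) *p))
        {
            /* whitespace ends src or trg, and is otherwise ignored */
            if (state == 1)
                state = 2;
            else if (state == 3)
                state = 4;
            continue;
        }

        switch (state)
        {
            case 0:             /* start of src */
                out->src = p;
                out->srclen = 1;
                state = 1;
                break;
            case 1:             /* continue src */
                out->srclen++;
                break;
            case 2:             /* start of trg */
                out->trg = p;
                out->trglen = 1;
                state = 3;
                break;
            case 3:             /* continue trg */
                out->trglen++;
                break;
            default:            /* a third token: syntax error */
                state = -1;
                break;
        }
    }

    if (state == 1 || state == 2)
    {
        /* trg was omitted, so use "" */
        out->trg = "";
        out->trglen = 0;
    }

    /* state 0 is a blank line and -1 a syntax error; both yield no rule */
    return state > 0;
}
```

With this sketch, a line holding only a source sequence ends in state 1 or 2 and is accepted with an empty replacement, which is the behavior the old `state >= 3` test rejected.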

doc/src/sgml/unaccent.sgml

Lines changed: 31 additions & 5 deletions
@@ -45,9 +45,9 @@
   <itemizedlist>
    <listitem>
     <para>
-     Each line represents a pair, consisting of a character with accent
-     followed by a character without accent. The first is translated into
-     the second. For example,
+     Each line represents one translation rule, consisting of a character with
+     accent followed by a character without accent. The first is translated
+     into the second. For example,
 <programlisting>
 &Agrave; A
 &Aacute; A
@@ -57,6 +57,27 @@
 &Aring; A
 &AElig; A
 </programlisting>
+     The two characters must be separated by whitespace, and any leading or
+     trailing whitespace on a line is ignored.
+    </para>
+   </listitem>
+
+   <listitem>
+    <para>
+     Alternatively, if only one character is given on a line, instances of
+     that character are deleted; this is useful in languages where accents
+     are represented by separate characters.
+    </para>
+   </listitem>
+
+   <listitem>
+    <para>
+     As with other <productname>PostgreSQL</> text search configuration files,
+     the rules file must be stored in UTF-8 encoding. The data is
+     automatically translated into the current database's encoding when
+     loaded. Any lines containing untranslatable characters are silently
+     ignored, so that rules files can contain rules that are not applicable in
+     the current encoding.
     </para>
    </listitem>
   </itemizedlist>
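
To illustrate what the documented rules mean for a string, here is a rough sketch of applying a small rule table, including a rule with an empty replacement that deletes a separate accent character. It is an assumption-laden illustration only: `Rule` and `apply_rules` are hypothetical names, the linear scan stands in for the trie that `unaccent.c` builds with `placeChar`, and the input is assumed to be valid UTF-8.

```c
#include <string.h>

/* One translation rule; an empty trg means "delete src" (hypothetical type). */
typedef struct
{
    const char *src;
    const char *trg;
} Rule;

/*
 * Copy "in" to "out", replacing each occurrence of a rule's src with its trg.
 * The first matching rule wins; unmatched bytes are copied unchanged.
 * Returns 0 on success, -1 if "out" (of size outsize) is too small.
 */
static int
apply_rules(const Rule *rules, size_t nrules,
            const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;

    while (*in != '\0')
    {
        const char *piece = in;     /* default: copy one byte unchanged */
        size_t      piecelen = 1;
        size_t      advance = 1;
        size_t      i;

        for (i = 0; i < nrules; i++)
        {
            size_t srclen = strlen(rules[i].src);

            if (srclen > 0 && strncmp(in, rules[i].src, srclen) == 0)
            {
                piece = rules[i].trg;
                piecelen = strlen(rules[i].trg);    /* 0 deletes the match */
                advance = srclen;
                break;
            }
        }

        /* always leave room for the terminating NUL */
        if (used + piecelen + 1 > outsize)
            return -1;

        memcpy(out + used, piece, piecelen);
        used += piecelen;
        in += advance;
    }

    out[used] = '\0';
    return 0;
}
```

For instance, with one rule mapping the UTF-8 bytes of `À` to `A` and another mapping the combining acute accent (U+0301, bytes `0xCC 0x81`) to the empty string, the input `cafe` followed by U+0301 comes out as plain `cafe`.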
@@ -132,8 +153,8 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
 
   <para>
    The <function>unaccent()</> function removes accents (diacritic signs) from
-   a given string. Basically, it's a wrapper around the
-   <filename>unaccent</> dictionary, but it can be used outside normal
+   a given string. Basically, it's a wrapper around
+   <filename>unaccent</>-type dictionaries, but it can be used outside normal
    text search contexts.
   </para>
 
@@ -145,6 +166,11 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
 unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>
 </synopsis>
 
+  <para>
+   If the <replaceable class="PARAMETER">dictionary</replaceable> argument is
+   omitted, <literal>unaccent</> is assumed.
+  </para>
+
   <para>
    For example:
 <programlisting>
