Commit 1073123

Update docs to explain that 7.1 locks down LC_COLLATE and LC_CTYPE at
initdb time. A few copy-editing cleanups, too.
1 parent 671f798 commit 1073123

1 file changed

+52
-45
lines changed

doc/src/sgml/charset.sgml

Lines changed: 52 additions & 45 deletions
@@ -1,4 +1,4 @@
1-
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.5 2000/12/22 21:51:57 petere Exp $ -->
1+
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.6 2001/01/19 04:47:50 tgl Exp $ -->
22

33
<chapter id="charset">
44
<title>Localization</>
@@ -54,15 +54,15 @@
5454
cultural preferences regarding alphabets, sorting, number
5555
formatting, etc. <productname>PostgreSQL</> uses the standard ISO
5656
C and POSIX-like locale facilities provided by the server operating
57-
system. For additional information refer the documentation of your
57+
system. For additional information refer to the documentation of your
5858
system.
5959
</para>
6060

6161
<sect2>
6262
<title>Overview</>
6363

6464
<para>
65-
Locale support is not build into <productname>PostgreSQL</> by
65+
Locale support is not built into <productname>PostgreSQL</> by
6666
default; to enable it, supply the <option>--enable-locale</> option
6767
to the <filename>configure</> script:
6868
<informalexample>
@@ -95,7 +95,7 @@ export LANG=sv_SE
9595

9696
<para>
9797
Occasionally it is useful to mix rules from several locales, e.g.,
98-
use U.S. rules but Spanish messages. To do that a set of
98+
use U.S. collation rules but Spanish messages. To do that a set of
9999
environment variables exist that override the default of
100100
<envar>LANG</> for a particular category:
101101

@@ -141,14 +141,23 @@ export LANG=sv_SE
141141
</para>
142142

143143
<para>
144-
Once you have chosen a set of localization rules this way you must
145-
keep them fixed for any particular database cluster. That means
146-
that the locales that were active when you ran <filename>initdb</>
147-
must be kept the same when you start the postmaster. Otherwise,
148-
the changed sort order can corrupt indexes or make your data
149-
disappear mysteriously. It is currently not possible to change the
150-
locales after database initialization or to use more than one set
151-
of locales for a given database cluster.
144+
Note that the locale behavior is determined by the environment
145+
variables seen by the server, not by the environment of any client.
146+
Therefore, be careful to set these variables before starting the
147+
postmaster.
148+
</para>
149+
150+
<para>
151+
The <envar>LC_COLLATE</> and <envar>LC_CTYPE</> variables affect the
152+
sort order of indexes. Therefore, these values must be kept fixed
153+
for any particular database cluster, or indexes on text columns will
154+
become corrupt. <productname>Postgres</productname> enforces this
155+
by recording the values of <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
156+
that are seen by <command>initdb</>. The server automatically adopts
157+
those two values when it is started; only the other <envar>LC_</>
158+
categories can be set from the environment at server startup.
159+
In short, only one collation order can be used in a database cluster,
160+
and it is chosen at <command>initdb</> time.
152161
</para>
153162
</sect2>
154163

@@ -183,7 +192,10 @@ export LANG=sv_SE
183192
<para>
184193
The only severe drawback of using the locale support in
185194
<productname>PostgreSQL</> is its speed. So use locale only if you
186-
actually need it.
195+
actually need it. It should be noted in particular that selecting
196+
a non-C locale disables index optimizations for <literal>LIKE</> and
197+
<literal>~</> operators, which can make a huge difference in the
198+
speed of searches that use those operators.
187199
</para>
188200
</sect2>
189201

@@ -261,7 +273,7 @@ perl: warning: Falling back to the standard locale ("C").
261273

262274
<para>
263275
<acronym>MB</acronym> also fixes some problems concerning 8-bit single byte
264-
character sets including ISO8859. (I would not say all of problems
276+
character sets including ISO8859. (I would not say all problems
265277
have been fixed. I just confirmed that the regression test ran fine
266278
and a few French characters could be used with the patch. Please let
267279
me know if you find any problem while using 8-bit characters.)
@@ -271,7 +283,7 @@ perl: warning: Falling back to the standard locale ("C").
271283
<title>Enabling MB</title>
272284

273285
<para>
274-
Run configure with a multibyte option:
286+
Run configure with the multibyte option:
275287

276288
<programlisting>
277289
% ./configure --enable-multibyte[=<replaceable>encoding_system</replaceable>]
@@ -383,11 +395,11 @@ perl: warning: Falling back to the standard locale ("C").
383395
% initdb -E EUC_JP
384396
</programlisting>
385397

386-
sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
398+
sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
387399
Note that you can use "--encoding" instead of "-E" if you prefer
388400
to type longer option strings.
389401
If no -E or --encoding option is given, the encoding
390-
specified at the compile time is used.
402+
specified at configure time is used.
391403
</para>
392404

393405
<para>
@@ -397,8 +409,8 @@ perl: warning: Falling back to the standard locale ("C").
397409
% createdb -E EUC_KR korean
398410
</programlisting>
399411

400-
will create a database named "korean" with EUC_KR encoding. The
401-
another way to accomplish this is to use a SQL command:
412+
will create a database named "korean" with EUC_KR encoding.
413+
Another way to accomplish this is to use a SQL command:
402414

403415
<programlisting>
404416
CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
@@ -527,20 +539,11 @@ char *pg_encoding_to_char(int <replaceable>encoding_id</replaceable>)
527539
</para>
528540
</listitem>
529541

530-
<listitem>
531-
<para>
532-
Using <envar>PGCLIENTENCODING</envar>.
533-
534-
If an environment variable <envar>PGCLIENTENCODING</envar> is defined in the
535-
frontend, an automatic encoding translation is done by the backend.
536-
</para>
537-
</listitem>
538-
539542
<listitem>
540543
<para>
541544
Using <command>SET CLIENT_ENCODING TO</command>.
542545

543-
Setting the frontend side encoding can be done a SQL command:
546+
Setting the frontend side encoding can be done by this SQL command:
544547

545548
<programlisting>
546549
SET CLIENT_ENCODING TO 'encoding';
@@ -552,7 +555,7 @@ SET CLIENT_ENCODING TO 'encoding';
552555
SET NAMES 'encoding';
553556
</programlisting>
554557

555-
To query the current the frontend encoding:
558+
To query the current frontend encoding:
556559

557560
<programlisting>
558561
SHOW CLIENT_ENCODING;
@@ -565,6 +568,17 @@ RESET CLIENT_ENCODING;
565568
</programlisting>
566569
</para>
567570
</listitem>
571+
572+
<listitem>
573+
<para>
574+
Using <envar>PGCLIENTENCODING</envar>.
575+
576+
If environment variable <envar>PGCLIENTENCODING</envar> is defined
577+
in the client's environment, that client encoding is automatically
578+
selected when a backend connection is made. (This can subsequently
579+
be overridden using any of the other methods mentioned above.)
580+
</para>
581+
</listitem>
568582
</itemizedlist>
569583
</para>
570584
</sect2>
@@ -588,7 +602,7 @@ RESET CLIENT_ENCODING;
588602
<para>
589603
Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
590604
then some Japanese characters could not be translated into LATIN1. In
591-
this case, a letter cannot be represented in the LATIN1 character set,
605+
this case, a letter that cannot be represented in the LATIN1 character set
592606
would be transformed as:
593607

594608
<programlisting>
@@ -601,7 +615,7 @@ RESET CLIENT_ENCODING;
601615
<title>References</title>
602616

603617
<para>
604-
These are good sources to start learning various kind of encoding
618+
These are good sources to start learning about various kinds of encoding
605619
systems.
606620

607621
<itemizedlist>
@@ -724,8 +738,7 @@ Mar 1, 1998 PL1 released
724738
<para>
725739
<!--
726740
[Here is a good documentation explaining how to use WIN1250 on
727-
Windows/ODBC from Pavel Behal. Please note that Installation step 1)
728-
is not necceary in 6.5.1 - Tatsuo]
741+
Windows/ODBC from Pavel Behal]
729742

730743
Version: 0.91 for PgSQL 6.5
731744
Author: Pavel Behal
@@ -815,20 +828,14 @@ Sorry for my Eglish and C code, I'm not native :-)
815828
<title>WIN1250 on Windows/ODBC</title>
816829
<step>
817830
<para>
818-
Change the three relevant files in the source directories.
819-
</para>
820-
</step>
821-
822-
<step>
823-
<para>
824-
Compile <productname>Postgres</productname> with local enabled
831+
Compile <productname>Postgres</productname> with locale enabled
825832
and the multibyte encoding set to <literal>LATIN2</literal>.
826833
</para>
827834
</step>
828835

829836
<step>
830837
<para>
831-
Set up your instalation. Do not forget to create locale
838+
Set up your installation. Do not forget to create locale
832839
variables in your profile (environment). For example (this may
833840
not be correct for <emphasis>your</emphasis> environment):
834841

@@ -936,16 +943,16 @@ HostCharset <replaceable>host_spec</> <replaceable>host_charset</>
936943
<para>
937944
The <filename>charset.conf</> file is always processed up to the
938945
end, so you can easily specify exceptions from the previous
939-
rules. In the src/data you will find charset.conf example and a few
940-
recoding tables.
946+
rules. In the <filename>src/data/</> directory you will find an
947+
example <filename>charset.conf</> and a few recoding tables.
941948
</para>
942949

943950
<para>
944951
As this solution is based on the client's IP address and character
945952
set mapping there are obviously some restrictions as well. You
946953
cannot use different encodings on the same host at the same
947954
time. It is also inconvenient when you boot your client hosts into
948-
more operating systems. Nevertheless, when these restrictions are
955+
multiple operating systems. Nevertheless, when these restrictions are
949956
not limiting and you do not need multi-byte characters than it is a
950957
simple and effective solution.
951958
</para>
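
Note on the LC_COLLATE paragraph added in this commit: the following is a minimal sketch of why a changed collation can invalidate an index built on text. It uses the standard `sort` utility rather than PostgreSQL itself, and the four sample strings are invented for illustration. Under the C locale the order is plain byte order; under another locale (not run here, because the result depends on which locales are installed) the same strings may come out in a different order, which is the kind of mismatch the recorded `initdb` values are meant to prevent.

```shell
# Sort four sample strings under the C locale, i.e. in plain byte order.
# Assumption: LC_ALL overrides LANG and every LC_* category for this one command.
c_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale order: $c_order"
```

Rerunning the same pipeline with a different `LC_ALL` value installed on your system is a quick way to see whether that locale orders text differently from C.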
