Chapter 2
General Structure
This chapter discusses the fundamental principles governing the design of the Unicode
Standard and presents an informal overview of its main features. The chapter starts by
placing the Unicode Standard in an architectural context by discussing the nature of text
representation and text processing and its bearing on character encoding decisions. Next,
the Unicode Design Principles are introduced—ten basic principles that convey the essence
of the standard. The Unicode Design Principles serve as a tutorial framework for
understanding the Unicode Standard and are a useful starting point for a summary of the
standard's overall nature.
The chapter then moves on to the Unicode character encoding model, introducing the con-
cepts of character, code point, and encoding forms, and diagramming the relationships
between them. This provides an explanation of the encoding forms UTF-8, UTF-16, and
UTF-32 and some general guidelines regarding the circumstances under which one form
would be preferable to another.
The section on Unicode allocation then describes the overall structure of the Unicode
codespace, showing a summary of the code charts and the locations of blocks of characters
associated with different scripts or sets of symbols.
Next, the chapter discusses the issue of writing direction and introduces several special
types of characters important for understanding the Unicode Standard. In particular, the
use of combining characters, the byte order mark, and control characters is explored in some
detail.
Finally, there is an informal statement of the conformance requirements for the Unicode
Standard. This informal statement, with a number of easy-to-understand examples, gives a
general sense of what conformance to the Unicode Standard means. The rigorous, formal
definition of conformance is given in Chapter 3, Conformance.
[Figure 2-1. Text Elements and Characters: a syllable treated as a single text element, and the word "cat" composed of the characters "c", "a", "t".]
The design of the character encoding must provide the characters needed to implement text
processes in the desired languages. These characters may not map directly to any particular set
of text elements that is used by one of these processes.
Any encoding design makes some text processes simpler and others more difficult; a
different encoding design for English, such as case-shift control codes, would simply shift
which processes are easy and which are hard. In designing a new encoding scheme for
complex scripts, such trade-offs must be evaluated and decisions made explicitly, rather
than unconsciously.
For these reasons, the Unicode Standard is not tied to any particular basic text-processing
algorithm. Instead, it provides an encoding that can be used with a
wide variety of algorithms. In particular, sorting and string comparison algorithms cannot
assume that the assignment of Unicode character code numbers provides an alphabetical
ordering for lexicographic string comparison. Culturally expected sorting orders require
arbitrarily complex sorting algorithms. The expected sort sequence for the same characters
differs across languages; thus, in general, no single acceptable lexicographic ordering exists.
See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the standard
default mechanism for comparing Unicode strings.
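As an informal illustration (a Python sketch, not part of the standard), a binary sort by code point values separates uppercase from lowercase and misplaces accented letters; culturally acceptable ordering requires a collation algorithm such as that of UTS #10:

    words = ["apple", "Éclair", "Zebra", "banana"]
    print(sorted(words))                 # code point order: ['Zebra', 'apple', 'banana', 'Éclair']
    print(sorted(words, key=str.lower))  # closer, but still not true collation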
Text processes supporting many languages are often more complex than they are for
English. The character encoding design of the Unicode Standard strives to minimize this
additional complexity, enabling modern computer systems to interchange, render, and
manipulate text in a user’s own script and language—and possibly in other languages as
well.
Universality
The Unicode Standard encodes a single, very large set of characters, encompassing all the
characters needed for worldwide use. This single repertoire is intended to be universal in
coverage, containing all the characters for textual representation in all modern writing sys-
tems, in most historic writing systems for which sufficient information is available to
enable reliable encoding of characters, and symbols used in plain text.
Because the universal repertoire is known and well defined in the standard, it is possible to
specify a rich set of character semantics. By relying on those character semantics, imple-
mentations can provide detailed support for complex operations on text at a reasonable
cost.
The Unicode Standard, by supplying a universal repertoire associated with well-defined
character semantics, does not require the code set independent model of internationaliza-
tion and text handling. That model abstracts away string handling as manipulation of byte
streams of unknown semantics to protect implementations from the details of hundreds of
different character encodings, and selectively late-binds locale-specific character properties
to characters. Of course, it is always possible for code set independent implementations to
retain their model and to treat Unicode characters as just another character set in that con-
text. It is not at all unusual for Unix implementations to simply add UTF-8 as another char-
acter set, parallel to all the other character sets they support. By contrast, the
Unicode approach—because it is associated with a universal repertoire—assumes that
characters and their properties are inherently and inextricably associated. If an internation-
alized application can be structured to work directly in terms of Unicode characters, all lev-
els of the implementation can reliably and efficiently access character storage and be
assured of the universal applicability of character property semantics.
Efficiency
The Unicode Standard is designed to make efficient implementation possible. There are no
escape characters or shift states in the Unicode character encoding model. Each character
code has the same status as any other character code; all codes are equally accessible.
All Unicode encoding forms are self-synchronizing and non-overlapping. This makes ran-
domly accessing and searching inside streams of characters efficient.
By convention, characters of a script are grouped together as far as is practical. Not only is
this practice convenient for looking up characters in the code charts, but it makes imple-
mentations more compact and compression methods more efficient. The common punc-
tuation characters are shared.
Formatting characters are given specific and unambiguous functions in the Unicode Stan-
dard. This design simplifies the support of subsets. To keep implementations simple and
efficient, stateful controls and formatting characters are avoided wherever possible.
from a single character. The distinction between characters and glyphs is illustrated in
Figure 2-2.
Even the letter “a” has a wide variety of glyphs that can represent it. A lowercase Cyrillic
letter also has a variety of glyphs; the second glyph shown in Figure 2-2 is customary for italic in
Russia, while the third is customary for italic in Serbia. Sequences such as “fi” may be
shown with two independent glyphs or with a ligature glyph. Arabic letters are shown with
different glyphs, depending on their position in a word; the glyphs in Figure 2-2 show inde-
pendent, final, initial, and medial forms.
For certain scripts, such as Arabic and the various Indic scripts, the number of glyphs
needed to display a given script may be significantly larger than the number of characters
encoding the basic units of that script. The number of glyphs may also depend on the
orthographic style supported by the font. For example, an Arabic font intended to support
the Nastaliq style of Arabic script may possess many thousands of glyphs. However, the
character encoding employs the same few dozen letters regardless of the font style used to
depict the character data in context.
A font and its associated rendering process define an arbitrary mapping from Unicode
characters to glyphs. Some of the glyphs in a font may be independent forms for individual
characters; others may be rendering forms that do not directly correspond to any single
character.
The process of mapping from characters in the memory representation to glyphs is one
aspect of text rendering. The final appearance of rendered text may also depend on context
(neighboring characters in the memory representation), variations in typographic design
of the fonts used, and formatting information (point size, superscript, subscript, and so
on). The results on screen or paper can differ considerably from the prototypical shape of a
letter or character, as shown in Figure 2-3.
For the Latin script, this relationship between character code sequence and glyph is rela-
tively simple and well known; for several other scripts, it is documented in this standard.
However, in all cases, fine typography requires a more elaborate set of rules than given here.
The Unicode Standard documents the default relationship between character sequences
and glyphic appearance for the purpose of ensuring that the same text content can be
stored with the same, and therefore interchangeable, sequence of character codes.
What the user thinks of as a single character—which may or may not be represented by a
single glyph—may be represented in the Unicode Standard as multiple code points. See
Figure 2-4 for examples.
[Figure 2-3. Unicode characters in the memory representation are mapped to glyphs by a text rendering process.]
Semantics
Characters have well-defined semantics. Character property tables are provided for use in
parsing, sorting, and other algorithms requiring semantic knowledge about the code
points. The properties identified by the Unicode Standard include numeric, spacing, com-
bination, and directionality properties (see Chapter 4, Character Properties). Additional
properties may be defined as needed from time to time. By itself, neither the character
name nor its location in the code table designates its properties.
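As an illustration, these property tables can be queried with Python's unicodedata module, which exposes a version of the Unicode Character Database (the module is an implementation detail, not part of the standard):

    import unicodedata

    for ch in ("5", "\u0301", "\u05D0"):
        print(f"U+{ord(ch):04X}",
              unicodedata.name(ch),
              unicodedata.category(ch),       # General Category, e.g. Nd, Mn, Lo
              unicodedata.bidirectional(ch),  # directionality property
              unicodedata.numeric(ch, None))  # numeric value, if any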
Plain Text
Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a
sequence of Unicode character codes. In contrast, styled text, also known as rich text, is any
text representation consisting of plain text plus added information such as a language iden-
tifier, font size, color, hypertext links, and so on. For example, the text of this book, a mul-
tifont text as formatted by a desktop publishing system, is rich text.
The simplicity of plain text gives it a natural role as a major structural element of rich text.
SGML, RTF, HTML, XML, and TeX are examples of rich text fully represented as plain text
streams, interspersing plain text data with sequences of characters that represent the addi-
tional data structures. They use special conventions embedded within the plain text file,
such as “<p>”, to distinguish the markup or tags from the “real” content. Many popular
word processing packages rely on a buffer of plain text to represent the content, and imple-
ment links to a parallel store of formatting data.
The relative functional roles of both plain and rich text are well established:
• Plain text is the underlying content stream to which formatting can be applied.
• Rich text carries complex formatting information as well as text content.
• Plain text is public, standardized, and universally readable.
• Rich text representation may be implementation-specific or proprietary.
Although some rich text formats have been standardized or made public, the majority of
rich text designs are vehicles for particular implementations and are not necessarily read-
able by other implementations. Given that rich text equals plain text plus added informa-
tion, the extra information in rich text can always be stripped away to reveal the “pure” text
underneath. This operation is often employed, for example, in word processing systems
that use both their own private rich text format and plain text file format as a universal, if
limited, means of exchange. Thus, by default, plain text represents the basic, interchange-
able content of text.
Plain text represents character content only, not its appearance. It can be displayed in a
variety of ways and requires a rendering process to make it visible with a particular appearance.
If the same plain text sequence is given to disparate rendering processes, there is no expec-
tation that rendered text in each instance should have the same appearance. Instead, the
disparate rendering processes are simply required to make the text legible according to the
intended reading. This legibility criterion constrains the range of possible appearances. The
relationship between appearance and content of plain text may be summarized as follows:
Plain text must contain enough information to permit the text to be rendered legibly,
and nothing more.
The Unicode Standard encodes plain text. The distinction between plain text and other
forms of data in the same data stream is the function of a higher-level protocol and is not
specified by the Unicode Standard itself.
Logical Order
Unicode text is stored in logical order in the memory representation, roughly correspond-
ing to the order in which text is typed in via the keyboard. In some circumstances, the order
of characters differs from this logical order when the text is displayed or printed. Where
needed to ensure consistent legibility, the Unicode Standard defines the conversion of Uni-
code text from the memory representation to readable (displayed) text. The distinction
between logical order and display order for reading is shown in Figure 2-5.
When the text in Figure 2-5 is ordered for display, the glyph that represents the first charac-
ter of the English text appears at the left. The logical start character of the Hebrew text,
however, is represented by the Hebrew glyph closest to the right margin. The succeeding
Hebrew glyphs are laid out to the left.
Logical order applies even when characters of different dominant direction are mixed: left-
to-right (Greek, Cyrillic, Latin) with right-to-left (Arabic, Hebrew), or with vertical script.
Properties of directionality inherent in characters generally determine the correct display
order of text. The Unicode bidirectional algorithm specifies how these properties are used
to resolve directional interactions when characters of right-to-left and left-to-right direc-
tionality are mixed. (See Unicode Standard Annex #9, “The Bidirectional Algorithm.”)
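A hedged sketch: the third-party python-bidi package (an assumption here; any implementation of the bidirectional algorithm would serve) derives display order from logically ordered text:

    from bidi.algorithm import get_display

    logical = "abc \u05D0\u05D1\u05D2"   # Latin, then Hebrew alef-bet-gimel, in typing order
    print(get_display(logical))          # the Hebrew run is reversed for left-to-right display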
However, this inherent directionality is occasionally insufficient to render plain text legibly.
This can occur in certain situations when characters of different directionality are mixed.
The Unicode Standard therefore includes characters to specify changes in direction for use
when the inherent directionality of characters is insufficient. The bidirectional algorithm
provides rules that use these directional layout control characters together with the inher-
ent directional properties of characters to provide the correct presentation of text contain-
ing both left-to-right and right-to-left scripts. This allows for exact control of the display
ordering for legible interchange and also ensures that plain text used for simple items like
file names or labels can always be correctly ordered for display.
For the most part, logical order corresponds to phonetic order. The only current exceptions
are the Thai and Lao scripts, which employ visual ordering; in these two scripts, users tra-
ditionally type in visual order rather than phonetic order.
Characters such as the short i in Devanagari are displayed before the characters that they
logically follow in the memory representation. (See Section 9.1, Devanagari, for further
explanation.)
Combining marks (accent marks in the Greek, Cyrillic, and Latin scripts, vowel marks in
Arabic and Devanagari, and so on) do not appear linearly in the final rendered text. In a
Unicode character sequence, all such characters follow the base character that they modify.
For example, the Latin letter “ẍ” is stored as “x” followed by combining “¨”.
Unification
The Unicode Standard avoids duplicate encoding of characters by unifying them within
scripts across languages; characters that are equivalent are given a single code. Common
letters, punctuation marks, symbols, and diacritics are given one code each, regardless of
language, as are common Chinese/Japanese/Korean (CJK) ideographs. (See Section 11.1,
Han.)
It is quite normal for many characters to have different usages, such as comma “,” for either
thousands-separator (English) or decimal-separator (French). The Unicode Standard
avoids duplication of characters due to specific usage in different languages; rather, it
duplicates characters only to support compatibility with base standards. Avoidance of
duplicate encoding of characters is important to avoid visual ambiguity.
There are a few notable instances in the standard where visual ambiguity between different
characters is tolerated, however. For example, in most fonts there is little or no distinction
visible between Latin “o”, Cyrillic “o”, and Greek “o” (omicron). These are not unified
because they are characters from three different scripts, and there are many legacy charac-
ter encodings that distinguish them. As another example, there are three characters whose
glyph is the same uppercase barred D shape, but they correspond to three distinct lower-
case forms. Unifying these uppercase characters would have resulted in unnecessary com-
plications for case mapping.
The Unicode Standard does not attempt to encode features such as language, font, size,
positioning, glyphs, and so forth. For example, it does not preserve language as a part of
character encoding: just as French i grec, German ypsilon, and English wye are all repre-
sented by the same character code, U+0059 “Y”, so too are Chinese zi, Japanese ji, and
Korean ja all represented as the same character code, U+5B57 字.
In determining whether to unify variant ideograph forms across standards, the Unicode
Standard follows the principles described in Section 11.1, Han. Where these principles
determine that two forms constitute a trivial difference, the Unicode Standard assigns a
single code. Otherwise, separate codes are assigned.
There are many characters in the Unicode Standard that could have been unified with exist-
ing visually similar Unicode characters, or that could have been omitted in favor of some
other Unicode mechanism for maintaining the kinds of text distinctions for which they
were intended. However, considerations of interoperability with other standards and sys-
tems often require that such compatibility characters be included in the Unicode Standard.
The status of a character as a compatibility character does not mean that the character is
deprecated in the standard.
Dynamic Composition
The Unicode Standard allows for the dynamic composition of accented forms and Hangul
syllables. Combining characters used to create composite forms are productive. Because the
process of character composition is open-ended, new forms with modifying marks may be
created from a combination of base characters followed by combining characters. For
example, the diaeresis, “¨”, may be combined with all vowels and a number of consonants
in languages using the Latin script and several other scripts.
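A minimal Python sketch of dynamic composition: each sequence below is a base character followed by a combining character and renders as one user-perceived character (note that "a" with diaeresis also exists precomposed as U+00E4, whereas "q" with macron exists only as a combining sequence):

    for seq in ("a\u0308", "q\u0304"):   # a + combining diaeresis; q + combining macron
        print(seq, [f"U+{ord(c):04X}" for c in seq])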
Equivalent Sequences
Some text elements can be encoded either as static precomposed forms or by dynamic
composition. Common precomposed forms such as U+00DC “Ü”
are included for compatibility with current standards. For static pre-
composed forms, the standard provides a mapping to an equivalent dynamically composed
sequence of characters. (See also Section 3.7, Decomposition.) Thus, different sequences of
Unicode characters are considered equivalent. For example, a precomposed character may
be represented as a composed character sequence (see Figure 2-6 and Figure 2-18).
In cases involving two or more sequences considered to be equivalent, the Unicode Stan-
dard does not prescribe one particular sequence as being the correct one; instead, each
sequence is merely equivalent to the others. In Figure 2-6, the sequences on each side of the
arrows express the same content and would be interpreted the same way.
If an application or user attempts to distinguish non-identical sequences which are none-
theless considered to be equivalent sequences, as shown in the examples in Figure 2-6, it
would not be guaranteed that other applications or users would recognize the same distinc-
tions. To prevent introducing interoperability problems between applications, such dis-
tinctions must be avoided wherever possible.
Where a unique representation is required, a normalized form of Unicode text can be used
to eliminate unwanted distinctions. The Unicode Standard defines four normalization
forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization
Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD
decompose characters where possible, while NFC and NFKC compose characters where
possible. For more information, see Unicode Standard Annex #15, “Unicode Normaliza-
tion Forms,” and Section 5.6, Normalization.
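The behavior of these forms can be observed with Python's unicodedata.normalize (an illustration only):

    import unicodedata

    s = "\u00C1"                                  # Á, precomposed
    d = unicodedata.normalize("NFD", s)           # decomposed: A + combining acute
    print([f"U+{ord(c):04X}" for c in d])         # ['U+0041', 'U+0301']
    print(unicodedata.normalize("NFC", d) == s)   # True: composition restores Á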
Decompositions. Precomposed characters are formally known as decomposables, because
they have decompositions to one or more other characters. There are two types of decom-
positions:
• Canonical. The character and its decomposition should be treated as essentially
equivalent.
• Compatibility. The decomposition may remove some information (typically
formatting information) that is important to preserve in particular contexts. By
definition, compatibility decomposition is a superset of canonical decomposi-
tion.
Thus there are three types of characters, based on their decomposition behavior:
• Canonical decomposable. The character has a distinct canonical decomposition.
• Compatibility decomposable. The character has a distinct compatibility decom-
position.
• Nondecomposable. The character has no distinct decomposition, neither canon-
ical nor compatibility. Loosely speaking, these characters are said to have “no
decomposition,” even though technically they decompose to themselves.
Figure 2-7 illustrates these three types.
The solid arrows in Figure 2-7 indicate canonical decompositions, and the dotted arrows
indicate compatibility decompositions. The figure illustrates two important points:
• Decompositions may be to single characters or to sequences of characters.
Decompositions to a single character, also known as singleton decompositions,
are seen for the ohm sign and the halfwidth katakana ka in the figure. Because of
examples like these, decomposable characters in Unicode do not always consist
of obvious, separate parts; one can only know their status by examining the
data tables for the standard.
• There are a very small number of characters that are both canonical and com-
patibility decomposable. The example shown in the figure is for the Greek
hooked upsilon symbol with an acute accent. It has a canonical decomposition
to one sequence and a compatibility decomposition to a different sequence.
For more precise definitions of some of these terms, see Chapter 3, Conformance.
[Figure 2-7. Types of Decomposables. Nondecomposable: a (U+0061). Singleton canonical decompositions: U+2126 ohm sign → U+03A9, and U+FF76 halfwidth katakana ka → U+30AB. Canonical decomposition to a sequence: Á (U+00C1) → U+0041 U+0301. Compatibility decomposition: U+3384 → U+006B U+0041. Both canonical and compatibility decompositions: U+03D3.]
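These decompositions can be confirmed against the data tables; for example, with Python's unicodedata module (illustrative, not a normative source):

    import unicodedata

    print(unicodedata.decomposition("\u2126"))    # '03A9': singleton canonical decomposition of ohm sign
    print(unicodedata.decomposition("\uFF76"))    # '<narrow> 30AB': compatibility decomposition of halfwidth ka
    print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")   # True: singletons do not recompose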
Convertibility
Character identity is preserved for interchange with a number of different base standards,
including national, international, and vendor standards. Where variant forms (or even the
same form) are given separate codes within one base standard, they are also kept separate
within the Unicode Standard. This choice guarantees the existence of a mapping between
the Unicode Standard and base standards.
Accurate convertibility is guaranteed between the Unicode Standard and other standards in
wide usage as of May 1993. In general, a single code point in another standard will corre-
spond to a single code point in the Unicode Standard. Sometimes, however, a single code
point in another standard corresponds to a sequence of code points in the Unicode Stan-
dard, or vice versa. Conversion between Unicode text and text in other character codes
must in general be done by explicit table-mapping processes. (See also Section 5.1,
Transcoding to Other Standards.)
Compatibility Characters
Compatibility characters are those that would not have been encoded except for compati-
bility and round-trip convertibility with other standards. They are variants of characters
that already have encodings as normal (that is, non-compatibility) characters in the Uni-
code Standard. Examples of compatibility characters in this sense include all of the glyph
variants in the Compatibility and Specials Area: halfwidth or fullwidth characters from
East Asian character encoding standards, Arabic contextual form glyphs from preexisting
Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Other
examples include CJK compatibility ideographs, which are generally duplicates of a unified
Han ideograph, legacy alternate format characters such as U+206C inhibit arabic form
shaping, and fixed-width space characters used in old typographical systems.
The Compatibility and Specials Area contains a large number of compatibility characters,
but the Unicode Standard also contains many compatibility characters that do not appear
in that area. These include examples such as U+2163 “IV” roman numeral four, U+2007
figure space, and U+00B2 “2” superscript two.
[Figure 2-8. Abstract and Encoded Characters: the abstract character Å is encoded both as U+00C5 and as U+212B, and is also represented by the sequence U+0041 U+030A.]
When referring to code points in the Unicode Standard, the usual practice is to refer to
them by their numeric value expressed in hexadecimal, with a “U+” prefix. (See Section 0.3,
Notational Conventions.) Encoded characters can also be referred to by their code points
only, but to prevent ambiguity, the official Unicode name of the character is often also
added; this clearly identifies the abstract character that is encoded. For example:
U+0061 latin small letter a
U+10330 gothic letter ahsa
U+201DF cjk unified ideograph-201df
Such citations refer only to the encoded character per se, associating the code point (as an
integral value) with the abstract character that is encoded.
Not all assigned code points represent abstract characters; only Graphic, Format, Control,
and Private-use code points do. Surrogates and Noncharacters are assigned code points but not
assigned to abstract characters. Reserved code points are assignable: any may be assigned in
a future version of the standard. The General Category provides a finer breakdown of
Graphic characters, and is also used to distinguish the other basic types (except between
Noncharacter and Reserved). Other properties defined in the Unicode Character Database
provide for different categorizations of Unicode code points.
Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are
reserved specifically as control codes, for compatibility with the C0 and C1 control codes of
the ISO/IEC 2022 framework. A few of these control codes are given specific interpreta-
tions by the Unicode Standard. (See Section 15.1, Control Codes.)
Noncharacters. Sixty-six code points are not used to encode characters. Noncharacters
consist of U+FDD0..U+FDEF and the last two code points on each plane, including
U+FFFE and U+FFFF on the BMP. (See Section 15.8, Noncharacters.)
Private Use. Three ranges of code points have been set aside for private use. Characters in
these areas will never be defined by the Unicode Standard. These code points can be freely
used for characters of any purpose, but successful interchange requires an agreement
between sender and receiver on their interpretation. (See Section 15.7, Private-Use Charac-
ters.)
Surrogates. 2,048 code points have been allocated for surrogates, which are used in the
UTF-16 encoding form. (See Section 15.5, Surrogates Area.)
Restricted Interchange. Code points that are not assigned to abstract characters are subject
to restrictions in interchange.
• Surrogate code points cannot be conformantly interchanged using Unicode
encoding forms. They do not correspond to Unicode scalar values, and thus do
not have well-formed representations in any Unicode encoding form.
• Noncharacter code points are reserved for internal use, such as for sentinel val-
ues. They should never be interchanged. They do, however, have well-formed
representations in Unicode encoding forms and survive conversions between
encoding forms. This allows sentinel values to be preserved internally across
Unicode encoding forms, even though they are not designed to be used in open
interchange.
• All implementations need to preserve reserved code points because they may
originate in implementations that use a future version of the Unicode Standard.
For example, suppose that one person is using a Unicode 4.0 system and a sec-
ond person is using a Unicode 3.2 system. The first person sends the second
person a document containing some code points newly assigned in Unicode
4.0; these code points were unassigned in Unicode 3.2. The second person may
edit the document, not changing the reserved codes, and send it on. In that case
the second person is interchanging what are, as far as the second person knows,
reserved code points.
Code Point Semantics. The semantics of most code points are established by this standard;
the exceptions are Controls, Private-use, and Noncharacters. Control codes generally have
semantics determined by other standards or protocols (such as ISO/IEC 6429), but there
are a small number of control codes for which the Unicode Standard specifies particular
semantics. See Table 15-1 in Section 15.1, Control Codes, for the exact list of those control
codes. The semantics of private-use characters are outside the scope of the Unicode Stan-
dard; their use is determined by private agreement, as, for example, between vendors. Non-
characters have semantics in internal use only.
Within the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Uni-
code character is to be expressed as a sequence of one or more code units. The Unicode
Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-
bit, and 32-bit units. These are correspondingly named UTF-8, UTF-16, and UTF-32.
(The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transfor-
mation Format.) Each of these three encoding forms is an equally legitimate mechanism
for representing Unicode characters; each has advantages in different environments.
All three encoding forms can be used to represent the full range of encoded characters in
the Unicode Standard; they are thus fully interoperable for implementations that may
choose different encoding forms for various reasons. Each of the three Unicode encoding
forms can be efficiently transformed into either of the other two without any loss of data.
Non-overlap. Each of the Unicode encoding forms is designed with the principle of non-
overlap in mind. This means that if a given code point is represented by a certain sequence
of one or more code units, it is impossible for any other code point to ever be represented
by the same sequence of code units.
To illustrate the problems with overlapping encodings, see Figure 2-9. In this encoding
(Windows code page 932), characters are formed from either one or two code bytes.
Whether a sequence is one or two in length depends on the first byte, so that the values for
lead bytes (of a two-byte sequence) and single bytes are disjoint. However, single-byte val-
ues and trail-byte values can overlap. That means that when someone searches for the char-
acter “D”, for example, they might find it (mistakenly) as the trail byte of a two-byte
sequence, or as a single, independent byte. To find out which alternative is correct, a pro-
gram must look backward through text.
[Figure 2-9. Overlap in Legacy Mixed-Width Encodings (code page 932). Trail and single bytes overlap: <84 44> encodes U+0414, while the single byte <44> encodes U+0044 “D”. Lead and trail bytes overlap: <84 84> encodes U+0442, but each 84 byte could equally be a lead or a trail byte.]
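The overlap can be demonstrated directly; the following Python sketch reproduces the byte values of Figure 2-9 (assuming the cp932 codec is available):

    data = "\u0414".encode("cp932")           # cyrillic capital letter de
    print(data.hex(" "))                      # '84 44': the trail byte equals ASCII 'D'
    print(b"D" in data)                       # True: a naive byte search finds a false match
    print(b"D" in "\u0414".encode("utf-8"))   # False: UTF-8 code unit ranges do not overlap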
The situation is made more complex by the fact that lead and trail bytes can also overlap, as
in the second part of Figure 2-9. This means that the backward scan has to repeat until it
hits the start of the text or hits a sequence that could not exist as a pair as shown in
Figure 2-10. This is not only inefficient, it is extremely error-prone: corruption of one byte
can cause entire lines of text to be corrupted.
[Figure 2-10. Boundaries and Interpretation: in a run such as <?? … 84 84 84 84 84 84 44>, the bytes cannot be segmented into characters (U+0442, U+0414, U+0044 “D”, …) without scanning backward, possibly to the start of the text.]
The Unicode encoding forms avoid this problem, because none of the ranges of values for
the lead, trail, or single code units in any of those encoding forms overlap.
Non-overlap makes all of the Unicode encoding forms well behaved for searching and com-
parison. When searching for a particular character, there will never be a mismatch against
some code unit sequence that represents just part of another character. The fact that all
Unicode encoding forms observe this principle of non-overlap distinguishes them from
many legacy East Asian multibyte character encodings, for which overlap of code unit
sequences may be a significant problem for implementations.
Another aspect of non-overlap in the Unicode encoding forms is that all Unicode charac-
ters have determinate boundaries when expressed in any of the encoding forms. That is, the
edges of code unit sequences representing a character are easily determined by local exam-
ination of code units; there is never any need to scan back indefinitely in Unicode text to
correctly determine a character boundary. This property of the encoding forms has some-
times been referred to as self-synchronization. This property has another very important
implication: corruption of a single code unit corrupts only a single character; none of the
surrounding characters are affected.
For example, when randomly accessing a string, a program can find the boundary of a
character with limited backup. In UTF-16, if a pointer points to a leading surrogate, a sin-
gle backup is required. In UTF-8, if a pointer points to a byte starting with 10xxxxxx (in
binary), one to three backups are required to find the beginning of the character.
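A minimal sketch of that boundary search, assuming well-formed UTF-8 input (the function name is illustrative):

    def utf8_char_start(buf: bytes, i: int) -> int:
        """Back up from an arbitrary offset to the start of the enclosing
        character. Trailing bytes match 10xxxxxx, so at most three backups
        are ever needed in well-formed UTF-8."""
        while i > 0 and (buf[i] & 0xC0) == 0x80:
            i -= 1
        return i

    buf = ("a\u03A9" + chr(0x10384)).encode("utf-8")   # 1-, 2-, and 4-byte sequences
    print(utf8_char_start(buf, len(buf) - 1))          # 3: start of the final character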
Conformance. The Unicode Consortium fully endorses the use of any of the three Unicode
encoding forms as a conformant way of implementing the Unicode Standard. It is impor-
tant not to fall into the trap of trying to distinguish “UTF-8 versus Unicode,” for example.
UTF-8, UTF-16, and UTF-32 are all equally valid and conformant ways of implementing
the encoded characters of the Unicode Standard.
Figure 2-11 shows the three Unicode encoding forms, including how they are related to
Unicode code points.
[Figure 2-11. Unicode encoding forms for U+0041, U+03A9, U+8A9E, and U+10384:
UTF-32: 00000041 000003A9 00008A9E 00010384
UTF-16: 0041 03A9 8A9E D800 DF84
UTF-8: 41 CE A9 E8 AA 9E F0 90 8E 84]
In Figure 2-11, the UTF-32 line shows that each example character can be expressed with
one 32-bit code unit. Those code units have the same values as the code point for the char-
acter. For UTF-16, most characters can be expressed with one 16-bit code unit, whose value
is the same as the code point for the character, but characters with high code point values
require a pair of 16-bit surrogate code units instead. In UTF-8, a character may be
expressed with one, two, three, or four bytes, and the relationship between those byte val-
ues and the code point value is more complex.
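These relationships can be observed with any conformant converter; for example, in Python (big-endian serializations shown for readability; bytes.hex with a separator requires Python 3.8 or later):

    for cp in (0x0041, 0x03A9, 0x8A9E, 0x10384):
        ch = chr(cp)
        print(f"U+{cp:04X}:",
              ch.encode("utf-32-be").hex(" "), "|",
              ch.encode("utf-16-be").hex(" "), "|",
              ch.encode("utf-8").hex(" "))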
UTF-8, UTF-16, and UTF-32 are further described in the subsections that follow. See each
subsection for a general overview of how each encoding form is structured and the general
benefits or drawbacks of each encoding form for particular purposes. For the detailed for-
mal definition of the encoding forms and conformance requirements, see Section 3.9, Uni-
code Encoding Forms.
UTF-32
UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented
directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship
between encoded character and code unit; it is a fixed-width character encoding form. This
makes UTF-32 an ideal form for APIs that pass single character values.
As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code
points in the range 0..10FFFF₁₆—that is, the Unicode codespace. This guarantees interop-
erability with the UTF-16 and UTF-8 encoding forms.
The value of each UTF-32 code unit corresponds exactly to the Unicode code point value.
This situation differs significantly from that for UTF-16 and especially UTF-8, where the
code unit values often change unrecognizably from the code point value. For example,
U+10000 is represented as <00010000> in UTF-32, but it is represented as <F0 90 80 80>
in UTF-8. For UTF-32 it is trivial to determine a Unicode character from its UTF-32 code
unit representation, whereas UTF-16 and UTF-8 representations often require doing a
code unit conversion before the character can be identified in the Unicode code charts.
UTF-32 may be a preferred encoding form where memory or disk storage space for charac-
ters is no particular concern, but where fixed-width, single code unit access to characters is
desired. UTF-32 is also a preferred encoding form for processing characters on most Unix
platforms.
UTF-16
In the UTF-16 encoding form, code points in the range U+0000..U+FFFF are represented
as a single 16-bit code unit; code points in the supplementary planes, in the range
U+10000..U+10FFFF, are instead represented as pairs of 16-bit code units. These pairs of
special code units are known as surrogate pairs. The values of the code units used for surro-
gate pairs are completely disjunct from the code units used for the single code unit repre-
sentations, thus maintaining non-overlap for all code point representations in UTF-16. For
the formal definition of surrogates, see Section 3.8, Surrogates.
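The mapping to a surrogate pair is simple arithmetic; a Python sketch of the computation (the function name is illustrative):

    def to_surrogates(cp):
        """UTF-16 surrogate pair for a supplementary code point."""
        assert 0x10000 <= cp <= 0x10FFFF
        offset = cp - 0x10000
        return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)

    print(tuple(hex(u) for u in to_surrogates(0x10384)))   # ('0xd800', '0xdf84')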
UTF-16 optimizes the representation of characters in the Basic Multilingual Plane
(BMP)—that is, the range U+0000..U+FFFF. For that range, which contains the vast
majority of common-use characters for all modern scripts of the world, each character
requires only one 16-bit code unit, thus requiring just half the memory or storage of the
UTF-32 encoding form. For the BMP, UTF-16 can effectively be treated as if it were a fixed-
width encoding form.
However, for supplementary characters, UTF-16 requires two 16-bit code units. The dis-
tinction between characters represented with one versus two 16-bit code units means that
formally UTF-16 is a variable-width encoding form. That fact can create implementation
difficulties, if not carefully taken into account; UTF-16 is somewhat more complicated to
handle than UTF-32.
UTF-16 may be a preferred encoding form in many environments that need to balance effi-
cient access to characters with economical use of storage. It is reasonably compact, and all
the common, heavily used characters fit into a single 16-bit code unit.
UTF-16 is the historical descendant of the earliest form of Unicode, which was originally
designed to use a fixed-width, 16-bit encoding form exclusively. The surrogates were added
to provide an encoding form for the supplementary characters at code points past U+FFFF.
The design of the surrogates made them a simple and efficient extension mechanism that
works well with older Unicode implementations, and that avoids many of the problems of
other variable-width character encodings. See Section 5.4, Handling Surrogate Pairs in UTF-
16, for more information about surrogates and their processing.
For the purpose of sorting text, note that binary order for data represented in the UTF-16
encoding form is not the same as code point order. This means that a slightly different
comparison implementation is needed for code point order. For more information, see
Section 5.17, Binary Order.
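For example (a Python illustration): U+FF5E is a BMP character whose 16-bit code unit value exceeds the surrogate range, so it sorts after supplementary U+10384 in UTF-16 binary order even though its code point is smaller:

    a, b = "\uFF5E", chr(0x10384)
    print(a.encode("utf-16-be") > b.encode("utf-16-be"))   # True: UTF-16 binary order
    print(ord(a) > ord(b))                                 # False: code point order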
UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, a third encoding form is
specified by the Unicode Standard: UTF-8. It is a variable-width encoding form that pre-
serves ASCII transparency, making use of 8-bit code units.
Much existing software and practice in information technology has long depended on
character data being represented as a sequence of bytes. Furthermore, many of the proto-
cols depend not only on ASCII values being invariant, but must make use of or avoid spe-
cial byte values that may have associated control functions. The easiest way to adapt
Unicode implementations to such a situation is to make use of an encoding form that is
already defined in terms of 8-bit code units and that represents all Unicode characters while
not disturbing or reusing any ASCII or C0 control code value. That is the function of
UTF-8.
UTF-8 is a variable-width encoding form, using 8-bit code units, in which the high bits of
each code unit indicate the part of the code unit sequence to which each byte belongs. A
range of 8-bit code unit values is reserved for the first, or leading, element of a UTF-8 code
unit sequence, and a completely disjunct range of 8-bit code unit values is reserved for the
subsequent, or trailing, elements of such sequences; this convention preserves non-overlap
for UTF-8. Table 3-5 on page 77 shows how the bits in a Unicode code point are distributed
among the bytes in the UTF-8 encoding form. See Section 3.9, Unicode Encoding Forms, for
the full, formal definition of UTF-8.
The UTF-8 encoding form maintains transparency for all of the ASCII code points
(0x00..0x7F). That means Unicode code points U+0000..U+007F are converted to single
bytes 0x00..0x7F in UTF-8, and are thus indistinguishable from ASCII itself. Furthermore,
the values 0x00..0x7F do not appear in any byte for the representation of any other Unicode
code point, so that there can be no ambiguity. Beyond the ASCII range of Unicode, many of
the non-ideographic scripts are represented by two bytes per code point in UTF-8; all non-
surrogate code points between U+0800 and U+FFFF are represented by three bytes; and
supplementary code points above U+FFFF require four bytes.
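The byte counts by range are easy to confirm (an illustrative Python loop):

    for cp in (0x0041, 0x00E9, 0x0800, 0xFFFD, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {len(chr(cp).encode('utf-8'))} byte(s)")   # 1, 2, 3, 3, 4, 4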
UTF-8 is typically the preferred encoding form for HTML and similar protocols, particu-
larly for the Internet. The ASCII transparency helps migration. UTF-8 also has the advan-
tage that it is already inherently byte-serialized, as for most existing 8-bit character sets;
strings of UTF-8 work easily with C or other programming languages, and many existing
APIs that work for typical Asian multibyte character sets adapt to UTF-8 as well with little
or no change required.
In environments where 8-bit character processing is required for one reason or another,
UTF-8 also has the following attractive features as compared to other multibyte encodings:
• The first byte of a UTF-8 code unit sequence indicates the number of bytes to
follow in a multibyte sequence, which allows for very efficient forward parsing
(see the sketch after this list).
• It is also efficient to find the start of a character when beginning from an arbi-
trary location in a byte stream of UTF-8. Programs need to search at most four
bytes backward, and usually much less. It is a simple task to recognize an initial
byte, because initial bytes are constrained to a fixed range of values.
• As with the other encoding forms, there is no overlap of byte values.
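A sketch of that forward parse, assuming well-formed input (the function name is illustrative):

    def utf8_seq_len(lead: int) -> int:
        """Sequence length implied by a UTF-8 lead byte; values 0x80..0xBF
        are trailing bytes and never begin a well-formed sequence."""
        if lead < 0x80:
            return 1   # 0xxxxxxx
        if lead < 0xE0:
            return 2   # 110xxxxx
        if lead < 0xF0:
            return 3   # 1110xxxx
        return 4       # 11110xxx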
See Chapter 5, Implementation Guidelines, for an example where commonly implemented
processes deal with inherently variable-width text elements, owing to user expectations of
the identity of a “character.”
UTF-8 is reasonably compact in terms of the number of bytes used. It is really only at a sig-
nificant size disadvantage when used for East Asian implementations such as Chinese, Jap-
anese, and Korean, which use Han ideographs or Hangul syllables requiring three-byte
code unit sequences in UTF-8. UTF-8 is also significantly less efficient in processing than
the other encoding forms.
A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code
points. This is also, obviously, the same order as for a binary sort of UTF-32 strings.
All three encoding forms give the same results for binary string comparisons or string sort-
ing when dealing only with BMP characters (in the range U+0000..U+FFFF). However,
when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16
binary order does not match Unicode code point order. This can lead to complications
when trying to interoperate with binary sorted lists—for example, between UTF-16 sys-
tems and UTF-8 or UTF-32 systems. However, for data that is sorted according to the con-
ventions of a specific language or locale, rather than using binary order, data will be
ordered the same, regardless of the encoding form.
Byte serialization of the UTF-16 and UTF-32 encoding forms must break up the code units
into two or four bytes, respectively, and the order of those bytes
must be clearly defined. Because of this, and because of the rules for the use of the byte
order mark, the three encoding forms of the Unicode Standard result in a total of seven
Unicode encoding schemes, as shown in Table 2-3.
The endian order entry for UTF-8 in Table 2-3 is marked N/A because UTF-8 code units
are 8 bits in size, and the usual machine issues of endian order for larger code units do not
apply. The serialized order of the bytes must not depart from the order defined by the UTF-
8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may
be encountered in contexts where UTF-8 data is converted from other encoding forms that
use a BOM, or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark”
subsection in Section 15.9, Specials, for more information.
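Python's codecs illustrate the scheme distinction (the BOM byte order shown below assumes a little-endian machine for the plain "utf-16" codec):

    print("A".encode("utf-16").hex(" "))      # 'ff fe 41 00': BOM plus code unit
    print("A".encode("utf-16-be").hex(" "))   # '00 41': declared byte order, no BOM
    print("A".encode("utf-8").hex(" "))       # '41': no byte order issue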
Note that some of the Unicode encoding schemes have the same labels as the three Unicode
encoding forms. This could cause confusion, so it is important to keep the context clear
when using these terms: character encoding forms refer to integral data units in memory or
in APIs, and byte order is irrelevant; character encoding schemes refer to byte-serialized
data, as for streaming I/O or in file storage, and byte order must be specified or determin-
able.
The Internet Assigned Numbers Authority (IANA) maintains a registry of charset names used
on the Internet. Those charset names are very close in meaning to the Unicode character
encoding model’s concept of character encoding schemes, and all of the Unicode character
encoding schemes are in fact registered as charsets. While the two concepts are quite close,
and the names used are identical, some important differences may arise in terms of the
requirements for each, particularly when it comes to handling of the byte order mark. Exer-
cise due caution when equating the two.
Figure 2-12 illustrates the Unicode character encoding schemes, showing how each is
derived from one of the encoding forms by serialization of bytes.
In Figure 2-12, the code units used to express each example character have been serialized
into sequences of bytes. This figure should be compared with Figure 2-11, which shows the
same characters before serialization into sequences of bytes. The “BE” lines show serializa-
tion in big-endian order, whereas the “LE” lines show the bytes reversed into little-endian
order. For UTF-8, the code unit is just an 8-bit byte, so that there is no distinction between
big-endian and little-endian order. UTF-32 and UTF-16 encoding schemes using the byte
order mark are not shown in Figure 2-12, to keep the basic picture regarding serialization of
bytes clearer.
For the detailed formal definition of the Unicode encoding schemes and conformance
requirements, see Section 3.10, Unicode Encoding Schemes. For further general discussion
about character encoding forms and character encoding schemes, both for the Unicode
Standard and as applied to other character encoding standards, see Unicode Technical
Report #17, “Character Encoding Model.” For information about charsets and character
conversion, see Unicode Technical Report #22, “Character Mapping Markup Language
(CharMapML).”
[Figure 2-12. Unicode encoding schemes for U+0041, U+03A9, U+8A9E, and U+10384:
UTF-32BE: 00 00 00 41 00 00 03 A9 00 00 8A 9E 00 01 03 84
UTF-32LE: 41 00 00 00 A9 03 00 00 9E 8A 00 00 84 03 01 00
UTF-16BE: 00 41 03 A9 8A 9E D8 00 DF 84
UTF-16LE: 41 00 A9 03 9E 8A 00 D8 84 DF
UTF-8: 41 CE A9 E8 AA 9E F0 90 8E 84]
Unicode Encoding Forms.) There are a number of techniques for dealing with an isolated
surrogate, such as omitting it, converting it into U+FFFD replacement character to
produce well-formed UTF-16, or simply halting the processing of the string with an error.
For more information on this topic, see Unicode Technical Report #22, “Character Map-
ping Markup Language (CharMapML).”
Planes
The Unicode codespace consists of the numeric values from 0 to 10FFFF₁₆, but in practice
it has proven convenient to think of the codespace as divided up into planes of characters,
each plane consisting of 64K code points. The numerical sense of this is immediately obvi-
ous if one looks at the ranges of code points involved, expressed in hexadecimal. Thus the
lowest plane, the Basic Multilingual Plane, consists of the range 0000₁₆..FFFF₁₆. The next
plane, the Supplementary Multilingual Plane, consists of the range 10000₁₆..1FFFF₁₆, and is
also known as Plane 1, since the most significant hexadecimal digit for all its code positions
is “1”. Plane 2, the Supplementary Ideographic Plane, consists of the range 20000₁₆..2FFFF₁₆,
and so on. Because of these numeric conventions, the Basic Multilingual Plane is also occa-
sionally referred to as Plane 0.
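The plane number is simply the value of the bits above the low 16; a one-line Python illustration:

    def plane(cp: int) -> int:
        """Plane of a code point: 0 for the BMP, 1 for the SMP, and so on."""
        return cp >> 16

    print([plane(cp) for cp in (0x0041, 0x10330, 0x20001, 0xE0001, 0x10FFFD)])   # [0, 1, 2, 14, 16]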
Basic Multilingual Plane. The Basic Multilingual Plane (BMP, or Plane 0) contains all the
common-use characters for all the modern scripts of the world, as well as many historical
and rare characters. By far the majority of all Unicode characters for almost all textual data
can be found in the BMP.
Supplementary Multilingual Plane. The Supplementary Multilingual Plane (SMP, or
Plane 1) is dedicated to the encoding of lesser-used historic scripts, special-purpose
invented scripts, and special notational systems, which either could not be fit into the BMP
or would be of very infrequent usage. Examples of each type include Gothic, Shavian, and
musical symbols, respectively. While few scripts are currently encoded in the SMP in Uni-
code 4.0, there are many major and minor historic scripts that do not yet have their charac-
ters encoded in the Unicode Standard, and many of those will eventually be allocated in the
SMP.
Supplementary Ideographic Plane. The Supplementary Ideographic Plane (SIP, or Plane
2) is the spillover allocation area for those CJK characters that could not be fit in the blocks
set aside for more common CJK characters in the BMP. While there are a small number of
common-use CJK characters in the SIP (for example, for Cantonese usage), the vast major-
ity of Plane 2 characters are extremely rare or of historical interest only.
Details of Allocation
Figure 2-13 gives an overall picture of the allocation areas of the Unicode Standard, with an
emphasis on the identities of the planes.
Plane 2 consists primarily of one big area, starting from the first code point in the plane,
dedicated to more unified CJK character encoding. Then there is a much smaller area,
toward the end of the plane, dedicated to additional CJK compatibility ideographic charac-
ters—which are basically just duplicated character encodings required for round-trip con-
version to various existing legacy East Asian character sets. The CJK compatibility
ideographic characters in Plane 2 are currently all dedicated to round-trip conversion for
the CNS standard and are intended to supplement the CJK compatibility ideographic char-
acters in the BMP, a smaller number of characters dedicated to round-trip conversion for
various Korean, Chinese, and Japanese standards.
Plane 14 contains a small area set aside for language tag characters, and another small area
containing supplementary variation selection characters.
Figure 2-13 also shows that Plane 15 and Plane 16 are allocated, in their entirety, for private
use. Those two planes contain a total of 131,068 characters, to supplement the 6,400 pri-
vate-use characters located in the BMP.
[Figure 2-13. Unicode Allocation. Legend: Graphic; Format or Control; Private Use; Reserved; Tags.]
Figure 2-14 shows the BMP in an expanded format to illustrate the allocation substructure
of that plane in more detail.
[Figure 2-14. Allocation on the BMP. The left column shows the major areas in code point order: alphabetic scripts, symbols, CJK miscellaneous, CJK ideographs, Yi, Hangul, surrogates, and the areas that follow. The right column expands the General Scripts Area: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul Jamo, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Ogham, Runic, Philippine scripts, Khmer, Mongolian, Limbu, and Tai Le.]
The first allocation area in the BMP is the General Scripts Area. It contains a large number
of modern-use scripts of the world, including Latin, Greek, Cyrillic, Arabic, and so on. This
area is shown in expanded form in Figure 2-14. The order of the various scripts can serve as
a guide to the relative positions where these scripts are found in the code charts. Most of the
characters encoded in this area are graphic characters, but all 65 C0 and C1 control codes
are also located here because the first two character blocks in the Unicode Standard are
organized for exact compatibility with the ASCII and ISO/IEC 8859-1 standards.
A Symbols Area follows the General Scripts Area. It contains all kinds of symbols, including
many characters for use in mathematical notation. It also contains symbols for punctua-
tion as well as most of the important format control characters.
Next is the CJK Miscellaneous Area. It contains some East Asian scripts, such as Hiragana
and Katakana for Japanese, punctuation typically used with East Asian scripts, lists of CJK
radical symbols, and a large number of East Asian compatibility characters.
Immediately following the CJK Miscellaneous Area is the CJKV Ideographs Area. It con-
tains all the unified Han ideographs in the BMP. It is subdivided into a block for the Uni-
fied Repertoire and Ordering (the initial block of 20,902 unified Han ideographs) and
another block containing Extension A (an additional 6,582 unified Han ideographs).
The Asian Scripts Area follows the CJKV Ideographs Area. It currently contains only the Yi
script and 11,172 Hangul syllables for Korean.
The Surrogates Area contains only surrogate code points and no encoded characters. See
Section 15.5, Surrogates Area, for more details.
The Private Use Area in the BMP contains 6,400 private-use characters.
Finally, at the very end of the BMP, there is the Compatibility and Specials Area. It contains
many compatibility characters from widely used corporate and national standards that
have other representations in the Unicode Standard. For example, it contains Arabic pre-
sentation forms, whereas the basic characters for the Arabic script are located in the Gen-
eral Scripts Area. The Compatibility and Specials Area also contains a few important
format control characters and other special characters. See Section 15.9, Specials, for more
details.
Note that the allocation order of various scripts and other groups of characters reflects the
historical evolution of the Unicode Standard. While there is a certain geographic sense to
the ordering of the allocation areas for the scripts, this is only a very loose correlation. The
empty spaces will be filled with future script encodings on a space-available basis. The rel-
evant character encoding committees make use of rationally organized roadmap charts to
help them decide where to encode new scripts within the available space, but until the char-
acters for a script are actually standardized, there are no absolute guarantees where future
allocations will occur. In general, implementations should not make assumptions about
where future scripts may be encoded, based on the identity of neighboring blocks of char-
acters already encoded.
Figure 2-15 shows Plane 1 in expanded format to illustrate the allocation substructure of
that plane in more detail.
[Figure 2-15. Allocation on Plane 1: a General Scripts Area at the start of the plane (Linear B, Aegean Numbers, Cypriot near 1 0800) and a Notational Systems Area near 1 D000 (musical notation, Tai Xuan Jing symbols, and mathematical alphanumeric symbols at 1 D400).]
Plane 1 currently has only two allocation areas. There is a General Scripts Area at the begin-
ning of the plane, containing various small historic scripts. Then there is a Notational Sys-
tems Area, which currently contains sets of musical symbols, alphanumeric symbols for
mathematics, and a system of divination symbols similar to those used for the Yijing.
Writing Direction
Several East Asian scripts, such as Chinese and Japanese, are frequently written in vertical
lines running from top to bottom. In vertical layout, letters and words from horizontally
written scripts are generally rotated through ninety-degree angles so that they, too, read
from top to bottom. That is, letters from left-to-right scripts will be rotated clockwise and
letters from right-to-left scripts will be rotated counterclockwise.
In contrast to the bidirectional case, the choice to lay out text either vertically or horizon-
tally is treated as a formatting style. Therefore, the Unicode Standard does not provide
directionality controls to specify that choice.
Other script directionalities are found in historical writing systems. For example, some
ancient Numidian texts are written bottom to top, and Egyptian hieroglyphics can be writ-
ten with varying directions for individual lines.
Early Greek used a system called boustrophedon (literally, “ox-turning”). In boustrophedon
writing, characters are arranged into horizontal lines, but the individual lines alternate
between running right to left and running left to right, the way an ox goes back and forth
when plowing a field. The letter images are mirrored in accordance with the direction of
each individual line.
The historical directionalities are of interest almost exclusively to scholars intent on repro-
ducing the exact visual content of ancient texts. The Unicode Standard does not provide
direct support for them. Fixed texts can, however, be written in boustrophedon or in other
directional conventions by using hard line breaks and directionality overrides.
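As a rough illustration (a Python sketch; the sample text and variable names are arbitrary, and the mirroring of letter images remains a matter for fonts and rendering, not for the character codes), alternate lines can be forced left-to-right and right-to-left with the override controls U+202D (LRO), U+202E (RLO), and U+202C (PDF):

    LRO, RLO, PDF = '\u202D', '\u202E', '\u202C'
    lines = ["THE OX TURNS", "AT THE END OF", "EACH FURROW"]
    # Force every other line to run right to left; hard line breaks
    # keep the alternation fixed regardless of line-wrapping.
    boustrophedon = '\n'.join(
        (LRO if i % 2 == 0 else RLO) + line + PDF
        for i, line in enumerate(lines)
    )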
Multiple Combining Characters
In some instances, more than one diacritical mark is applied to a single base character (see
Figure 2-17). The Unicode Standard does not restrict the number of combining characters
that may follow a base character. The following discussion summarizes the default treat-
ment of multiple combining characters. (For the formal algorithm, see Chapter 3, Con-
formance.)
By default, combining characters placed above a base character will be stacked vertically, starting with the first
encountered in the logical store and continuing for as many marks above as are required by
the character codes following the base character. For combining characters placed below a
base character, the situation is reversed, with the combining characters starting from the
base character and stacking downward.
When combining characters do not interact typographically, the relative ordering of con-
tiguous combining marks cannot result in any visual distinction and thus is insignificant.
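This insignificance can be observed through normalization. In the following Python sketch (illustrative only), an ogonek (below, combining class 202) and an acute accent (above, combining class 230) do not interact typographically, so the two orders are canonically equivalent and normalize identically:

    import unicodedata

    s1 = 'a\u0328\u0301'   # a + COMBINING OGONEK + COMBINING ACUTE ACCENT
    s2 = 'a\u0301\u0328'   # a + COMBINING ACUTE ACCENT + COMBINING OGONEK
    # Canonical reordering sorts the marks by combining class, so both
    # sequences yield the same normalized form.
    assert unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)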
An example of multiple combining characters above the base character is found in Thai,
where a consonant letter can have above it one of the vowels U+0E34 through U+0E37 and,
above that, one of four tone marks U+0E48 through U+0E4B. The order of character codes
that produces this graphic display is base consonant character + vowel character + tone mark
character.
Some specific uses of combining characters override the default stacking behavior by being
positioned horizontally rather than stacking or by ligature with an adjacent nonspacing
mark (see Figure 2-19). When positioned horizontally, the order of codes is reflected by
positioning in the predominant direction of the script with which the codes are used. For
example, in a left-to-right script, horizontal accents would be coded left to right. In
Figure 2-19, the top example is correct and the bottom example is incorrect.
Such override behavior is associated with specific scripts or alphabets. For example, when
used with the Greek script, the “breathing marks” U+0313
(psili) and U+0314 (dasia) require that, when used
together with a following acute or grave accent, they be rendered side-by-side rather than
the accent marks being stacked above the breathing marks. The order of codes here is base
character code + breathing mark code + accent mark code. This example demonstrates that
the logical order of the combining marks remains significant even where the default
stacking behavior is overridden. The correct and incorrect orderings are as follows:
    Correct:   U+03B1 GREEK SMALL LETTER ALPHA
               + U+0313 COMBINING COMMA ABOVE (psili)
               + U+0301 COMBINING ACUTE ACCENT (oxia)
    Incorrect: U+03B1 GREEK SMALL LETTER ALPHA
               + U+0301 COMBINING ACUTE ACCENT (oxia)
               + U+0313 COMBINING COMMA ABOVE (psili)
“Characters” and Grapheme Clusters
End users have differing notions of what constitutes a “character” in their writing system,
and text processes such as collation, regular expression matching, and editing must deter-
mine boundary positions within text. In instances such as these, what the user thinks of as a character may
affect how the collation or regular expression will be defined or how the “characters” will
be counted. Words and other higher-level text elements generally do not split within ele-
ments that a user thinks of as a character, even when the Unicode representation of them
may consist of a sequence of encoded characters. The precise scope of these end-user “char-
acters” depends on the particular written language and the orthography it uses. In addition
to the many instances of accented letters, they may extend to digraphs such as Slovak “ch”,
trigraphs or longer combinations, and sequences using spacing letter modifiers, such as
“kʷ”.
The variety of these end-user perceived characters is quite great—particularly for digraphs,
ligatures, or syllabic units. Furthermore, it depends on the particular language and writing
system that may be involved. Despite this variety, however, there is a core concept of “char-
acters that should be kept together” that can be defined for the Unicode Standard in a lan-
guage-independent way. This core concept is known as a grapheme cluster, and it consists of
any combining character sequence that contains only nonspacing combining marks, or any
sequence of characters that constitutes a Hangul syllable (possibly followed by one or more
nonspacing marks). An implementation operating on such a cluster would almost never
want to break between its elements for rendering, editing, or other such text processes; the
grapheme cluster is treated as a single unit. Unicode Standard Annex #29, “Text Bound-
aries,” provides a complete formal definition of a grapheme cluster and discusses its appli-
cation in the context of editing and other text processes. Implementations also may tailor
the definition of a grapheme cluster, so that under limited circumstances, particular to one
written language or another, the grapheme cluster may more closely pertain to what end
users think of as “characters” for that language.
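As a simplified illustration only (a Python sketch that approximates, rather than implements, the UAX #29 rules; the function name is ours, and Hangul syllable sequences are not handled), a string can be grouped into default clusters by attaching nonspacing marks to the preceding base character:

    import unicodedata

    def simple_clusters(text):
        clusters = []
        for ch in text:
            if clusters and unicodedata.category(ch) == 'Mn':
                clusters[-1] += ch      # attach nonspacing mark to its base
            else:
                clusters.append(ch)     # start a new cluster
        return clusters

    # á encoded as a + COMBINING ACUTE ACCENT counts as one "character".
    assert simple_clusters('a\u0301b') == ['a\u0301', 'b']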
Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as
containing Unicode text. For UTF-16, the sequence FE₁₆ FF₁₆ (or its byte-reversed coun-
terpart, FF₁₆ FE₁₆) is exceedingly rare at the outset of text files that use other character
encodings. The corresponding UTF-8 BOM sequence, EF₁₆ BB₁₆ BF₁₆, is also exceedingly
rare. In either case, it is therefore unlikely to be confused with real text data. The same is
true for both single-byte and multibyte encodings.
Data streams (or files) that begin with the U+FEFF byte order mark are likely to contain Uni-
code characters. It is recommended that applications sending or receiving untyped data
streams of coded characters use this signature. If other signaling methods are used, signa-
tures should not be employed.
Conformance to the Unicode Standard does not require the use of the BOM as such a sig-
nature. See Section 15.9, Specials, for more information on the byte order mark and its use as
an encoding signature.
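As an illustration (a Python sketch; the function name is ours), the predefined BOM constants in the standard library's codecs module can be used to sniff such a signature at the start of an untyped byte stream:

    import codecs

    def sniff_signature(data: bytes):
        # Check the longer UTF-32 BOMs before the UTF-16 BOMs that
        # they begin with, so the match is unambiguous.
        for bom, name in [(codecs.BOM_UTF32_LE, 'utf-32-le'),
                          (codecs.BOM_UTF32_BE, 'utf-32-be'),
                          (codecs.BOM_UTF8,     'utf-8'),
                          (codecs.BOM_UTF16_LE, 'utf-16-le'),
                          (codecs.BOM_UTF16_BE, 'utf-16-be')]:
            if data.startswith(bom):
                return name
        return None   # no signature; fall back to other signaling methods

    assert sniff_signature(b'\xef\xbb\xbfhello') == 'utf-8'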
Control Codes
In addition to the special characters defined in the Unicode Standard for a number of pur-
poses, the standard incorporates the legacy control codes for compatibility with the ISO/
IEC 2022 framework, ASCII, and the various protocols that make use of control codes.
Rather than simply being defined as byte values, however, the legacy control codes are
assigned to Unicode code points: U+0000..U+001F, U+007F..U+009F. Those code points
for control codes must be represented consistently with the various Unicode encoding
forms when they are used with other Unicode characters. For more information on control
codes, see Section 15.1, Control Codes.
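One consequence is that the C1 control codes, like all code points above U+007F, occupy more than one byte in UTF-8. A brief Python illustration:

    # U+0085 (NEL), a C1 control, takes two bytes in UTF-8, not the
    # single byte 0x85 used in 8-bit character sets.
    assert '\u0085'.encode('utf-8') == b'\xc2\x85'
    # C0 controls remain single bytes in UTF-8.
    assert '\u0009'.encode('utf-8') == b'\x09'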
Supported Subsets
The Unicode Standard does not require that an application be capable of interpreting and
rendering all Unicode characters so as to be conformant. Many systems will have fonts only
for some scripts, but not for others; sorting and other text-processing rules may be imple-
mented only for a limited set of languages. As a result, a conformant implementation may
interpret only a subset of the Unicode characters.
The Unicode Standard provides no formalized method for identifying an implemented
subset. Furthermore, such a subset is typically different for different aspects of an imple-
mentation. For example, an application may be able to read, write, and store any Unicode
character, and to sort one subset according to the rules of one or more languages (and the
rest arbitrarily), but have access only to fonts for a single script. The same implementation
may be able to render additional scripts as soon as additional fonts are installed in its envi-
ronment. Therefore, the subset of interpretable characters is typically not a static concept.
Conformance to the Unicode Standard implies that whenever text purports to be unmodi-
fied, uninterpreted code points must not be removed or altered. (See also Section 3.2, Con-
formance Requirements.)
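As an illustration only (a Python sketch; the function names are ours, and unassigned code points serve here merely as a stand-in for "uninterpreted"), a process may transform the characters it interprets while passing all others through untouched:

    import unicodedata

    def is_uninterpreted(ch):
        # Stand-in test: treat unassigned code points (general
        # category Cn) as uninterpreted by this implementation.
        return unicodedata.category(ch) == 'Cn'

    def uppercase_interpreted(text):
        # Transform only what is interpreted; everything else is
        # passed through unmodified, since removing or altering
        # uninterpreted code points would violate conformance when
        # the output purports to be unmodified text.
        return ''.join(ch if is_uninterpreted(ch) else ch.upper()
                       for ch in text)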
There is a third kind of electronic document called a Unicode Standard Annex (UAX),
which is defined in Section 3.2, Conformance Requirements. Unicode Standard Annexes dif-
fer from UTRs and UTSs in that they form an integral part of the Unicode Standard. For a
summary overview of important Unicode Technical Standards, Unicode Technical Reports,
and Unicode Standard Annexes, see Appendix B, Abstracts of Unicode Technical Reports.