Unicode & W3C
Jataayu Software
C. Kumar
January 2007
Agenda
About Jataayu
Unicode & Encoding
W3C Specification for multi-lingual
authoring
Multilingual WEB Address
Indian WEB Sites: an Overview
W3C Activity
About Jataayu
Jataayu was formed with a clear focus on
delivering solutions for wireless data
services
Over 60% of the data traffic in Indian Mobile
Networks for WAP, Mobile WEB and MMS
is handled by Jataayu Products
Mobile Device Solution Division focusing on
wireless data applications like WAP, MMS,
SyncML, IMPS, Email, Web Browsing,
Download
Active participants in OMA, W3C and MWI
Over 350 people strong with offices in UK,
Localization -
Internationalization
Localization (l10n)
Adaptation of the content to meet the
language, cultural and other requirements
of a specific target market
Internationalization (i18n)
Design & Development of the content that
enables easy localization for target
audiences that vary in culture, region or
language.
Mission of W3C i18n Activity is to ensure
the W3C’s formats and protocols are
Need for Unicode
Early character sets were based on 7 bits,
which gave 2^7 (i.e. 128) possible characters
Adding the 8th bit gave a total of 256
possible characters. Still not enough
for all the European languages.
Code page mechanism helped a little
by changing the upper cells (0xA0 to
0xFF), but was very complex.
Addressing the needs of the other
languages requires thousands of
Unicode & Encoding
Unicode, universal character set
contains all the characters needed
for writing the majority of living
languages in use on computers.
Allows for simple display and storage
of multilingual content
An encoding refers to the way that
characters (Unicode code points) are
mapped from the character set to the
actual byte values stored or transmitted.
Unicode & Encoding
UTF-8 (Unicode Transformation
Format)
Variable length 8-bit character
encoding for Unicode
Able to represent any universal
character in the Unicode Standard
Uses one to four bytes to encode a
Unicode symbol
Only one byte is needed to encode
the US-ASCII characters
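The one-to-four byte range can be checked with a short Python sketch. The four sample characters are chosen here for illustration only (one from each length class); they are not prescribed by the slide.

```python
# Sample characters, written as escapes so the source stays ASCII-only.
# Illustrative picks: Latin A, Hebrew alef, a CJK ideograph, and a
# supplementary-plane ideograph.
samples = ["\u0041", "\u05D0", "\u597D", "\U000233B4"]

# Map each code point label to the number of bytes UTF-8 needs for it.
utf8_lengths = {
    f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples
}

for code_point, length in utf8_lengths.items():
    print(code_point, "->", length, "byte(s)")
```

US-ASCII characters stay at one byte, which is why existing ASCII content is already valid UTF-8.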
Unicode & Encoding
UTF-16 (16-bit Unicode Transformation
Format)
Variable length 16-bit character encoding
for Unicode
Uses two or four byte sequence to encode
a Unicode symbol
Two bytes are required to encode each
US-ASCII character
UCS-2 (2-byte Universal Character Set)
Fixed length encoding that always
encodes characters into a single 16-bit
Unicode & Encoding
UCS-4 / UTF-32 (32-bit Unicode
Transformation Format)
Fixed length 32-bit character
encoding for Unicode
Uses 4 bytes for every character, so it
is very space inefficient
Little used in practice with UTF-8 and
UTF-16 being the normal ways of
encoding Unicode Text
http://www.unicode.org/
Unicode & Encoding
Devanagari (0x0900 – 0x097F)
Bengali (0x0980 – 0x09FF)
Tamil (0x0B80 – 0x0BFF)
Kannada (0x0C80 – 0x0CFF)
Code Point   U+0041        U+05D0        U+597D        U+233B4
UTF-8        41            D7 90         E5 A5 BD      F0 A3 8E B4
UTF-16       00 41         05 D0         59 7D         D8 4C DF B4
UTF-32       00 00 00 41   00 00 05 D0   00 00 59 7D   00 02 33 B4
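The byte sequences for the four code points in the table can be reproduced with Python's built-in codecs. This is a minimal sketch; it assumes big-endian byte order without a byte order mark, and the helper names `hex_bytes` and `encode_row` are made up for this example.

```python
code_points = [0x0041, 0x05D0, 0x597D, 0x233B4]

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def encode_row(cp: int) -> dict:
    """Encode one code point in the three Unicode encoding forms."""
    ch = chr(cp)
    return {
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        # The "-be" codecs write big-endian bytes and omit the BOM.
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }

table = {f"U+{cp:04X}": encode_row(cp) for cp in code_points}

for label, row in table.items():
    print(label, row)
```

U+233B4 lies outside the 16-bit range, so UTF-16 represents it with a surrogate pair (D84C DFB4) while the other three code points fit in a single 16-bit unit.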
Unicode & Encoding
An alternate way to represent a
character is by using an escape value
(e.g. &#x5D0; for א)
Not all documents have to be encoded
as Unicode
But documents can only contain
characters defined by Unicode
Standard
Any encoding can be used as long as it
is properly declared and it is the subset
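As an illustration of escape values, the sketch below builds an HTML numeric character reference for U+05D0 and converts it back. The helper name `to_numeric_reference` is hypothetical; `html.unescape` and the `xmlcharrefreplace` error handler are part of the Python standard library.

```python
import html

def to_numeric_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"

alef = "\u05D0"                          # Hebrew letter alef
escaped = to_numeric_reference(alef)     # hexadecimal form
restored = html.unescape(escaped)        # back to the original character

# The standard library can also escape anything outside a target encoding;
# it emits the decimal form of the reference.
ascii_safe = alef.encode("ascii", "xmlcharrefreplace")

print(escaped, restored == alef, ascii_safe)
```

Escapes like this let a document stored in a limited encoding still carry characters that the encoding itself cannot represent.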
Other Encoding
formats …
Shift_JIS (SJIS), character encoding
for the Japanese Language
Single byte character encoding for
the lower-ASCII characters (0x00 –
0x7F)
Double-byte character encoding for
the upper-ASCII bytes
GB2312, character encoding for
simplified Chinese characters
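A quick way to see the single-byte and double-byte behaviour is to encode a few characters with Python's built-in `shift_jis` and `gb2312` codecs. The sample characters below are arbitrary choices for this sketch.

```python
text_ascii = "A"
text_kanji = "\u65E5"   # a Japanese kanji (the character for "day/sun")
text_hanzi = "\u4E2D"   # a simplified Chinese character ("middle")

sjis_ascii = text_ascii.encode("shift_jis")   # stays a single byte
sjis_kanji = text_kanji.encode("shift_jis")   # becomes a two-byte sequence
gb_hanzi = text_hanzi.encode("gb2312")        # also two bytes in GB2312

for name, data in [("SJIS A", sjis_ascii),
                   ("SJIS kanji", sjis_kanji),
                   ("GB2312 hanzi", gb_hanzi)]:
    print(name, len(data), "byte(s)")

# Decoding with the matching codec recovers the original text.
round_trip = sjis_kanji.decode("shift_jis")
```

Decoding these bytes with the wrong codec would yield unrelated characters, which is why the encoding has to be declared.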
W3C Specification -
Encoding
W3C specification for multi-lingual authoring
Encoding of the document needs to be
mentioned, so that the application that consumes
can interpret it.
Meta Tag
<meta http-equiv="Content-type"
content="text/html;charset=UTF-8" />
XML
<?xml version="1.0" encoding="UTF-8"?>
Content-type header returned from the WEB
server should also contain the character
encoding of the document
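On the consuming side, the declared encoding has to be read back out of the Content-type value. The sketch below is one possible way to do that with the standard library's header parser; the function name `charset_from_content_type` is an assumption made for this example.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None when none is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

declared = charset_from_content_type("text/html;charset=UTF-8")
missing = charset_from_content_type("text/html")

print(declared, missing)
```

When the header carries no charset, the consumer has to fall back to the meta tag or XML declaration inside the document.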
W3C Specification -
Language
Author needs to specify the
language of the document (web
page content)
Browser can select an appropriate
font using the lang
attribute
Search Engine can group or filter
results based on the user’s linguistic
preferences (using meta)
Translation tools use to recognize the
W3C Specification -
Language
HTTP Content Language Header
Content-Language: hi
Language Attribute on html tag
<html lang="hi">
<html xml:lang="hi">
Content Language in meta tag
<meta http-equiv="Content-Language"
content="hi" />
Language attribute on embedded
content
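To show how a tool might pick up these declarations, here is a minimal sketch that collects `lang` and `xml:lang` attributes with Python's `html.parser`. The class name `LangCollector` and the sample page are invented for illustration.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for every element that declares one."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lower-cased names.
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.langs.append((tag, value))

# Hypothetical page: Hindi document with one embedded Tamil paragraph.
page = '<html lang="hi"><body><p>...</p><p lang="ta">...</p></body></html>'

collector = LangCollector()
collector.feed(page)
found = collector.langs

print(found)
```

The second entry is the embedded-content case: a paragraph that overrides the document language.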
What value to use for
lang?
IANA (Internet Assigned Numbers
Authority)
Provides a unique value for each
language
It is available in the Subtag value in
the new IANA Language
http://www.iana.org/assignments/language
Hindi – hi, Kannada – kn, Tamil – ta
Bi-directional text
Additional information is required
in addition to the language
attribute to provide support for
right-to-left scripts (like Arabic,
Hebrew, Urdu)
In HTML, dir attribute is used to
specify the direction of the text
The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in
Hebrew.
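One way an authoring tool could suggest a value for the dir attribute is to look at the first strongly directional character of the text. This is a simplified heuristic sketched under that assumption, not the full Unicode Bidirectional Algorithm; the function name `suggest_dir` is hypothetical.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def suggest_dir(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"  # no strong character found: assume left-to-right

hebrew_title = "\u05E4\u05E2\u05D9\u05DC\u05D5\u05EA, W3C"  # Hebrew word, then Latin
hebrew_dir = suggest_dir(hebrew_title)
latin_dir = suggest_dir("W3C")
arabic_dir = suggest_dir("\u0627\u0644")  # two Arabic letters

print(hebrew_dir, latin_dir, arabic_dir)
```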
Multilingual WEB
Address
A Web address is used to point to a
resource on the WEB
Web addresses are typically expressed using
URIs (Uniform Resource Identifiers)
URIs are restricted to a small number of characters
(upper & lower case letters of the English
alphabet, numerals and a few symbols).
Users' expectations and use of the
Internet have outgrown these restrictions.
There is a growing need to use characters
from any language in WEB Addresses.
Multilingual WEB
Address …
A Web address in your own
language and alphabet is easier to
create, memorize, interpret and
relate to. (Ex: http://खोज.com)
Punycode is a way of representing
Unicode code points using only
ASCII characters. (Ex:
http://xn--21bm4l.com)
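The conversion in the example above can be reproduced with Python's built-in `idna` and `punycode` codecs. Note the assumption: the built-in `idna` codec implements the older IDNA 2003 rules, which is enough for this host name but may differ from current registry practice for other names.

```python
# The host name from the example, written with escapes (खोज.com).
domain = "\u0916\u094B\u091C.com"

# The idna codec converts each non-ASCII label to Punycode and adds
# the "xn--" prefix.
ascii_form = domain.encode("idna").decode("ascii")

# Decoding goes the other way, back to the Unicode form.
round_trip = ascii_form.encode("ascii").decode("idna")

# The raw Punycode of a single label, without the "xn--" prefix.
raw_label = "\u0916\u094B\u091C".encode("punycode")

print(ascii_form, raw_label)
```

The ASCII form is what travels through DNS; browsers convert back and forth so the user can keep working with the native-script address.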
Indian Content an
Overview
Most Indian Websites are not using
Unicode
Content is generated within the
ASCII range and served with
proprietary fonts which map the
ASCII character set to Indian
Languages.
Visually it will be fine, but no other
entities will be able to interpret it
For each site, the user may need to
download the proprietary fonts,
Unicode & W3C
Importance
WEB is also moving
towards the mobile
W3C Mobile Web
Initiative (MWI)
defines the best
practices for Mobile
Browsing
Mobile devices cannot
install the required
fonts at run-time, as
is done on the
desktop
If Unicode character
Firefox
Firefox (http://www.getfirefox.com)
Provides extensive support for
Unicode and related fonts
Provides add-ons for typing in
Indian Languages in web pages on
Linux. (Such tools are already
available to Windows XP users
through the language packs)
https://addons.mozilla.org/firefox/5484/author/
W3C i18n activity
Core Working group
Enable universal access to the World Wide
Web by providing adequate support to
other W3C Working Groups
GEO (Guidelines, Education &
Outreach)
Make the internationalization aspects of W3C
technology better understood and more
widely and consistently used
ITS (Internationalization Tag Set)
Develop a set of elements and attributes
Thanks
kumarc@jataayusoft.com
