Chapter 4A Data Encoding and XML
1
BIT
• Binary digIT
• Possible values: 0 or 1
• The smallest data item in a computer
• All the impressive functions performed by
computers involve only the simplest
manipulation of 0s and 1s, e.g.
– Examining a bit’s value
– Setting a bit’s value
– Reversing a bit’s value
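The three bit manipulations listed above can be sketched in Python with bitwise operators. The function names are illustrative assumptions, not part of the slides; positions are counted from 0 at the rightmost bit.

```python
def examine_bit(value, position):
    """Return the bit (0 or 1) found at the given position."""
    return (value >> position) & 1


def set_bit(value, position):
    """Return value with the bit at the given position set to 1."""
    return value | (1 << position)


def reverse_bit(value, position):
    """Return value with the bit at the given position flipped."""
    return value ^ (1 << position)


pattern = 0b01000001  # the bit pattern 0100 0001

print(examine_bit(pattern, 0))                  # rightmost bit
print(format(set_bit(pattern, 1), "08b"))       # bit 1 switched on
print(format(reverse_bit(pattern, 6), "08b"))   # bit 6 flipped
```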
2
MAN VS MACHINE
• Tedious for humans to work with 0s and 1s
• More reasonable for people to work with
characters
– Decimal digits (0..9)
– Letters (A..Z, a..z)
– Special symbols ($, @, %, etc)
3
SOLUTION: DATA MAPPING
• Map characters to bits for
– Data representation
– Data exchange
• Data is
– Conceptually, a sequence of characters
– Physically, a sequence of bytes
• Data encoding
– The mapping from bytes to characters
4
CONSIDERATIONS
• The need to exchange data with other users
working in different environments
• Different languages use different characters
• To allow for consistent data transfer among
computer systems (e.g. using the ftp
command)
5
DATA ENCODING SCHEMES
• ASCII
• EBCDIC
• Unicode
• Extensible Markup Language (XML)
• JavaScript Object Notation (JSON)
6
UNICODE
• A computing industry standard for the consistent
encoding, representation and handling of text
• Unicode 13.0 has a total of 143,859 characters
• Aligned with ISO/IEC 10646:2020
• The Unicode standard defines UTF-8, UTF-16,
UTF-32 and other character encodings
7
UNICODE
• Intent: to create a single character set that included all
writing systems, i.e. a Universal Character Set (UCS)
• Before Unicode: each character maps to some bits
o A → 0100 0001
• In Unicode: a letter maps to a code point
o A → U+0041 (U+ means “Unicode”, numbers are hexadecimal)
o https://home.unicode.org/
• Different encodings exist
o Encodings address how the code points are stored in
memory or represented in an email message
8
ENCODING UNICODE
• UCS-2
o Uses 2 bytes to store the code point
o May be either big-endian or little-endian
• UTF-8
• UTF-16
9
EXAMPLE: Encoding “Hello”
                      H       e       l       l       o
ASCII & ANSI          48      65      6C      6C      6F
Unicode code points   U+0048  U+0065  U+006C  U+006C  U+006F
UCS-2 big-endian      00 48   00 65   00 6C   00 6C   00 6F
UCS-2 little-endian   48 00   65 00   6C 00   6C 00   6F 00
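The byte values for “Hello” can be reproduced with a short Python sketch. Python has no codec named UCS-2, so this sketch assumes the utf-16-be and utf-16-le codecs as stand-ins; they produce the same bytes as UCS-2 for these characters.

```python
text = "Hello"

# Code points, written in the U+XXXX notation
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The same text under three encodings, shown as hexadecimal bytes
ascii_hex = text.encode("ascii").hex(" ").upper()
be_hex = text.encode("utf-16-be").hex(" ").upper()   # big-endian, 2 bytes each
le_hex = text.encode("utf-16-le").hex(" ").upper()   # little-endian, 2 bytes each

print(code_points)
print(ascii_hex)
print(be_hex)
print(le_hex)
```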
10
UTF-8
• A system for storing strings of Unicode code
points in memory using 8-bit bytes
• Every code point from 0 – 127 is stored in a
single byte
– Looks exactly like ASCII
• Only code points 128 and above are stored
using 2, 3 or 4 bytes (the original design allowed up to 6)
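As a rough illustration of the variable length of UTF-8, the sketch below encodes four sample characters and reports how many bytes each one needs. The sample characters are chosen here for illustration only.

```python
# One sample character from each UTF-8 length class
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_counts = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_counts.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ').upper()}")

# Code points 0 to 127 produce exactly the same bytes as ASCII
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)
```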
11
CONSIDERATIONS
• Popular encodings for English text include
Windows-1252 and ISO-8859-1 (Latin-1)
• If there’s no equivalent for the Unicode code
point in the encoding you are using, the
character will be displayed as a question
mark ? or �
• Therefore, you must know what encoding is
used to represent a string in order to interpret
or display it correctly
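A small sketch of what goes wrong when the encoding is unknown or unsuitable. The sample string is an assumption made for illustration: it is first decoded with the wrong encoding, then encoded into ISO-8859-1, which has no euro sign.

```python
original = "café €5"
utf8_bytes = original.encode("utf-8")

# Correct interpretation: decode with the encoding that was used to encode
right = utf8_bytes.decode("utf-8")

# Wrong interpretation: the same bytes read as Windows-1252 become garbage
wrong = utf8_bytes.decode("windows-1252")

# ISO-8859-1 has no euro sign, so it is replaced by a question mark
lossy = original.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

print(right)
print(wrong)
print(lossy)
```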
12
PRESERVING ENCODING
INFORMATION
• In emails, specify in the header:
Content-Type: text/plain; charset="UTF-8"
• In web pages, specify in the meta tag as the very first
thing in the <head> section:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">
. . .
13
UTF-16
• A code unit consists of 2 bytes
• Code points below 65,536 are represented as a
single code unit
• Higher code points are represented as pairs of
code units
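The split between single code units and pairs can be observed with a short sketch. The helper name utf16_code_units is hypothetical; it encodes one character as big-endian UTF-16 and lists the resulting 2-byte code units.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units of a character as hexadecimal strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


below_65536 = utf16_code_units("A")            # U+0041
above_65535 = utf16_code_units("\U0001F600")   # U+1F600

print(below_65536)   # one code unit
print(above_65535)   # a pair of code units (a surrogate pair)
```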
14
SUMMARY ON UNICODE
• The most common choices are UTF-8 and UTF-16
• In terms of file size, UTF-8 is more compact if the text
contains mainly ASCII characters
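• UTF-16 is more compact for text made up mainly of code
points from U+0800 to U+FFFF (e.g. Chinese, Japanese and
Korean text), which need 2 bytes in UTF-16 but 3 in UTF-8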
• Ideally, all the different character encodings will be
replaced by Unicode
• HTML documents may be written in different
encodings
– This should be specified in the HTTP response header
15
XML
16
INTRODUCTION TO XML
• A markup language that is extensible
– Can be modified according to the needs of the data
being recorded
• Describes the structure and content of any
machine-readable information
• Device-independent and system-independent
• Used to create vocabularies of other markup
languages
17
Chapter4A\xmldemo\menuA.xml
XML EXAMPLE
18
19
XML SYNTAX RULES
1. Every XML element must have a closing tag
• However, a self-closing tag is permitted
2. XML tags are case sensitive
3. XML elements must be properly nested
• All elements can have child (sub)elements
4. Every XML document must have a root
element
5. XML elements can have attributes in name-
value pairs
• Each attribute value must be quoted
20
WELL-FORMED XML DOCUMENTS
• A well-formed XML document contains no
syntax errors and satisfies the general
specifications for XML code stated by W3C
• At minimum, an XML document must be well-
formed or it will not be readable by programs
that process XML code
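One way to see the syntax rules in action is to hand small documents to a parser and watch which ones it rejects. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = '<menu><item id="1001"><name>Nasi Lemak</name><spicy/></item></menu>'
bad_nesting = "<menu><item><name>Nasi Lemak</item></name></menu>"
bad_case = "<menu><Item>Nasi Lemak</item></menu>"
bad_quote = "<menu><item id=1001>Nasi Lemak</item></menu>"
no_single_root = "<item>A</item><item>B</item>"

for sample in (good, bad_nesting, bad_case, bad_quote, no_single_root):
    print(is_well_formed(sample), sample)
```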
21
Write an XML file to represent the
following data:
22
Write an XML file to represent
the following data:
Product Name           Items                   Price
Purrfect Gift Basket   Pillow, Blanket         36.00
Stationery Set         Pen, Notebook, Ruler    20.50
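One possible representation of the product data is sketched below; other designs are equally acceptable. The element names (products, product, name, items, item, price) are assumptions, and Python's standard parser is used only to confirm that the document is well-formed and to read the values back.

```python
import xml.etree.ElementTree as ET

products_xml = """<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <name>Purrfect Gift Basket</name>
    <items>
      <item>Pillow</item>
      <item>Blanket</item>
    </items>
    <price>36.00</price>
  </product>
  <product>
    <name>Stationery Set</name>
    <items>
      <item>Pen</item>
      <item>Notebook</item>
      <item>Ruler</item>
    </items>
    <price>20.50</price>
  </product>
</products>"""

# Parse as bytes so the encoding declaration is honoured by the parser
root = ET.fromstring(products_xml.encode("utf-8"))

summary = []
for product in root.findall("product"):
    name = product.findtext("name")
    items = [item.text for item in product.findall("items/item")]
    price = float(product.findtext("price"))
    summary.append((name, items, price))
    print(name, items, price)
```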
23
XML USE
24
XML WITH SOFTWARE
APPLICATIONS & LANGUAGES
• Many software applications (e.g. Excel, Word)
and server languages (e.g. Java, PHP, .NET)
can read and create XML files
• Users can exchange data among applications
and enterprise systems using XML
• XML makes documents universally available
25
XML & DATABASES
26
XML vs DATABASES
Relational database
• based on 2D tables which have
– no hierarchy
– no significant order
XML
• based on hierarchical trees in which
– order is significant
– hierarchy & sequence are used to represent information
27
XML & WEB PAGES
• The structure of XML closely matches the
structure used to display information in HTML
• The data from relational databases must be
converted to appropriate XML hierarchies for
use in web pages
28
XML VOCABULARIES
XML Vocabulary                                 Description
Chemical Markup Language (CML)                 Coding of molecular and chemical information
Extensible Hypertext Markup Language (XHTML)   HTML written as an XML application
Mathematical Markup Language (MathML)          Presentation & evaluation of mathematical equations & operations
Musical Markup Language (MML)                  Display & organization of music notation & lyrics
Really Simple Syndication (RSS)                Distribution of news headlines and syndicated columns
29
XML DOCUMENT STRUCTURE
30
THE PROLOG INCLUDES
• XML declaration
– Indicates that the document is written in the XML language
• Document type declaration (DOCTYPE)
– Optional
31
XML PROLOG EXAMPLE
32
XML DECLARATION
• <?xml ?>
– The first line in an XML document
– Indicates that the document is written in XML
• Provides information on how to interpret the
code
– version="version number"
– encoding="encoding type"
– standalone="yes | no"
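To see how a parser reads the three pieces of information in the declaration, the sketch below uses the XmlDeclHandler hook of Python's standard xml.parsers.expat module. The helper name read_declaration and the sample document are assumptions for illustration.

```python
import xml.parsers.expat


def read_declaration(xml_bytes):
    """Return the version, encoding and standalone values of an XML declaration."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # standalone is 1 for "yes", 0 for "no" and -1 when omitted
        found.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found


document = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><menu/>'
declaration = read_declaration(document)
print(declaration)
```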
33
XML PARSERS
34
FUNCTIONS OF XML PARSERS
35
XML ELEMENTS
• An XML element includes everything from the
element’s start tag to the element’s end tag, e.g.
38
Chapter4A\xmldemo\menuB.xml
39
40
ATTRIBUTES VS ELEMENTS
41
XML TREE STRUCTURE
Root element
Element
42
XML TREE EXAMPLE
[Tree diagram: root element menu with a child element item; values shown in the diagram: 1001, Nasi Lemak Ayam, 10.90, 2.60]
43
ENTITY REFERENCES
• Used to insert characters which either
– have a special meaning in XML or
– are not available on a standard keyboard
• Syntax:
&#nnn; <!-- numeric character reference -->
&entity; <!-- entity reference -->
44
CHARACTER & ENTITY REFERENCES
Symbol   Character Reference   Entity Reference
>        &#62;                 &gt;
<        &#60;                 &lt;
'        &#39;                 &apos;
"        &#34;                 &quot;
&        &#38;                 &amp;
©        &#169;
®        &#174;
™        &#8482;
°        &#176;
£        &#163;
€        &#8364;
¥        &#165;
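A brief sketch of both directions: escaping special characters before they go into a document, and letting the parser turn references back into characters. It uses escape from Python's standard xml.sax.saxutils module; the sample text is an assumption for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Fish & Chips < RM20"

# escape() replaces &, < and > with their entity references
escaped = escape(raw)
print(escaped)

# Numeric character references insert symbols that are hard to type
document = f"<item>{escaped} &#169; &#8364;</item>"

# The parser resolves every reference back to the character it stands for
text = ET.fromstring(document).text
print(text)
```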
45
Chapter4A\xmldemo\menuC.xml
46
47
EMPTY ELEMENTS
• An element with no content, e.g.:
<element></element> or <element />
• Similar to HTML’s empty elements, e.g. <br />
• Usage
– to mark certain sections of the document for programs
reading it
– to reference external documents containing non-
textual data (similar to the HTML <img /> tag)
• Empty elements can have attributes
48
XML’S TEXT CHARACTERS
• XML documents consist only of text
characters
• XML’s text characters are of 3 categories:
– Parsed character data (PCDATA)
– Character data (CDATA)
– White space
49
PCDATA
• Parsed character data
• Comprise the code in the XML document
• Includes characters found in
– The XML declaration
– The opening and closing tags of an element
– Empty element tags
– Character or entity references
– Comments
50
CDATA
• Pure data content, i.e. character data that will
not be processed as code in an XML document
• A CDATA section may be placed wherever character
data can occur within an element, but cannot be nested
within other CDATA sections
• To create a CDATA section:
<![CDATA[
character data
]]>
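The effect of a CDATA section can be checked with a short sketch: the same text is rejected when written directly as element content, but accepted unchanged inside a CDATA section. The note element and its content are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

content = "Use <b> & <i> tags; 5 < 10"

# Written directly, the < and & characters are treated as markup
try:
    ET.fromstring(f"<note>{content}</note>")
    plain_accepted = True
except ET.ParseError:
    plain_accepted = False

# Inside a CDATA section the same characters are left as pure data
cdata_text = ET.fromstring(f"<note><![CDATA[{content}]]></note>").text

print(plain_accepted)
print(cdata_text)
```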
51
Chapter4A\xmldemo\menuD.xml
52
53
Chapter4A\xmldemo\weatherdata.xml
54
55
WHITE SPACE
56
PROCESSING INSTRUCTIONS
• A command that tells the XML parser how to
process the document
• General form:
<?target instruction ?>
– Where target identifies the program or object to
which the processing is directed and instruction is
information that the document passes to the parser for
processing
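As an illustration of the target and instruction parts, the sketch below parses a document that starts with an xml-stylesheet processing instruction and reads both parts back using Python's standard xml.dom.minidom module. The choice of menu.css as the style sheet is an assumption for this example.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="menu.css"?><menu/>'
)

# The processing instruction is the first node, ahead of the root element
instruction = document.firstChild
is_pi = instruction.nodeType == instruction.PROCESSING_INSTRUCTION_NODE

print(is_pi)
print(instruction.target)   # the program or object the instruction is for
print(instruction.data)     # the information passed along for processing
```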
57
FORMATTING XML DATA WITH CSS
58
Chapter4A\xmldemo\menuE.xml
59
Chapter4A\xmldemo\menu.css
60
61
XML NAMESPACES
• Used to avoid element name conflicts
• When using prefixes in XML, a namespace for
the prefix must be defined by using an xmlns
attribute in the start tag of an element:
<element xmlns:prefix="uri">…</element>
• To declare a default namespace, omit the prefix
in the namespace declaration:
<element xmlns="uri">…</element>
– Any descendant element is considered part of this
namespace unless a different namespace is
declared within one of the child elements
62
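A minimal sketch of how namespaces keep two elements with the same name apart. Both namespace URIs and the order document are hypothetical; Python's ElementTree reports each tag as {uri}name.

```python
import xml.etree.ElementTree as ET

order_xml = """<order xmlns="http://example.com/food"
       xmlns:f="http://example.com/furniture">
  <table>Table 5</table>
  <f:table>Dining table</f:table>
</order>"""

root = ET.fromstring(order_xml)

# Unprefixed elements fall into the default namespace;
# the f: prefix places its element in the furniture namespace
tags = [child.tag for child in root]
for tag in tags:
    print(tag)
```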
63
URI
• A Uniform Resource Identifier (URI) is a
string of characters which identifies a
resource
• The most common form of URI is the Uniform
Resource Locator (URL), which identifies a
resource by its location, e.g. a web address
• Another form of URI is the Uniform Resource Name
(URN), which is less common
64
WELL FORMED XML DOCUMENTS
65
Recall: XML SYNTAX RULES
• XML documents must have a root element
• XML elements must have a closing tag
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted
66
VALID XML DOCUMENTS
• A valid XML document must be well formed
and in addition, must conform to a document
type definition or XML schema
• To validate an XML document, use either
▪ A Document Type Definition (DTD) or
▪ An XML Schema
67
DTDs & XML SCHEMAS
2 ways to specify rules for how data in a document
vocabulary should be structured
a. Document Type Definition (DTD)
▪ Defines the structure of the data and very broadly
the types of data allowable
b. XML Schema
▪ more precisely defines the structure of the data and
the specific data restrictions
68
WELL-FORMED & VALID XML
DOCUMENTS
• A well-formed XML document contains no
syntax errors and satisfies the general
specifications for XML code stated by W3C
o At minimum, an XML document must be well-
formed or it will not be readable by programs that
process XML code
• A well-formed XML document that satisfies the
rules of a DTD or XML schema is said to be a
valid document
69
Document Type
Definition (DTD)
70
INTRODUCTION TO DTD
• A DTD is a collection of rules that define the
content and structure of an XML document
• A DTD is used in conjunction with an XML
parser that supports data validation
71
DTD’s USE
• Ensure that all required elements are present in
the document
• Prevent undefined elements from being used in
the document
• Enforce a specific data structure on document
content
• Specify the use of element attributes and define
their permissible values
• Define default values for attributes
• Describe how parsers should access non-XML
or nontextual content
72
DOCTYPE
• A DTD is attached to an XML document using
a DOCTYPE, i.e. a document type declaration
• The DOCTYPE must be added to the document
prolog after the XML declaration and before
the document’s root element
• Each document can only have one DOCTYPE
• The purpose of the DOCTYPE is to either
– Specify the rules of the DTD or
– Provide information to the parser about where those
rules are located
73
DTD STRUCTURE
• DTDs can be placed either within an XML
document or in an external file
• A DOCTYPE can be either an
o internal subset, or
o external subset
74
DTD: EXTERNAL SUBSET
• In the XML document
❑Set the standalone attribute in the XML
declaration to “no” and
❑the DOCTYPE includes an external subset that
indicates the location of the file
• Locations are defined using either a system
identifier or a public identifier
75
SPECIFYING EXTERNAL DTD
LOCATION
• A system identifier specifies the location of the DTD
file:
Syntax: <!DOCTYPE root SYSTEM "uri">
where root is the document’s root element,
uri is the URI of the external file
Example: <!DOCTYPE menu SYSTEM "rules.dtd">
• For a public identifier:
Syntax: <!DOCTYPE root PUBLIC "id" "uri">
Example: <!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
76
Chapter4A\dtddemo\menu.dtd
77
Chapter4A\dtddemo\menuF.dtd
78
ELEMENT DECLARATIONS IN DTDS
• One element declaration for each element type:
<!ELEMENT element_name content_specification>
where content_specification can be
❑(#PCDATA) parsed character data
❑(child) one child element
❑(c1,..,cn) a sequence of child elements c1..cn
❑(c1|..|cn) one of the elements c1..cn
For each component , possible counts can be specified:
❑c exactly one such element
❑c+ one or more
❑c* zero or more
❑c? zero or one
79
DTD EXAMPLE
• A bakeries element has zero or more bakery
elements nested within
80
VARIATIONS IN ELEMENT
DECLARATIONS
• Arbitrary combinations:
<!ELEMENT f ((a|b)*, c+, (d|e))*>
• Elements with mixed content:
<!ELEMENT text (#PCDATA|index|cite)*>
• Elements with empty content:
<!ELEMENT image EMPTY>
• Elements with arbitrary content:
<!ELEMENT thesis ANY>
Note: The symbol | can connect alternative sequences of
tags
81
Write the DTD Element Declaration for the
following element:
• A name consists of an optional title (e.g., “Prof”),
a first name and a last name in that order, or it is
an IP address
<!ELEMENT name ((title?, first, last)|ipaddr)>
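A DTD content model behaves much like a regular expression over the names of an element's children. The sketch below is not a DTD validator (Python's standard library does not include one); it only imitates the content model ((title?, first, last)|ipaddr) with a regular expression. The helper name and the sample documents are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# The content model ((title?, first, last) | ipaddr) rewritten as a
# regular expression over child names, each followed by one space
NAME_MODEL = re.compile(r"((title )?first last |ipaddr )")


def matches_name_model(xml_text):
    """Return True if the children of the element follow the content model."""
    element = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in element)
    return NAME_MODEL.fullmatch(child_names) is not None


with_title = "<name><title>Prof</title><first>Ada</first><last>Lim</last></name>"
without_title = "<name><first>Ada</first><last>Lim</last></name>"
ip_only = "<name><ipaddr>10.0.0.1</ipaddr></name>"
wrong_order = "<name><last>Lim</last><first>Ada</first></name>"
both_forms = "<name><first>Ada</first><last>Lim</last><ipaddr>10.0.0.1</ipaddr></name>"

for sample in (with_title, without_title, ip_only, wrong_order, both_forms):
    print(matches_name_model(sample), sample)
```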
82
ATTRIBUTES IN DTDs
• A bakery may have an attribute kind, a
character string describing the kind of bakery
(e.g. “Gourmet”, “Wedding”, “Commercial”)
<!ELEMENT bakery (name, cake+)>
<!ATTLIST bakery kind CDATA #IMPLIED>
83
EXAMPLE XML DOCUMENT
84
IDs and IDREFs
85
Chapter4A\dtddemo\members.dtd
86
Chapter4A\dtddemo\members.xml
87
Chapter4A\xmldemo\menuG.dtd
88
Chapter4A\xmldemo\menuG.xml
89
90
DTD SHORTCOMINGS
91
XML SCHEMAS
92
XML SCHEMAS
93
Chapter4A\xsddemo\menu.xsd
94
Chapter4A\xsddemo\menuH.xml
95
MAIN XML SCHEMAS CONSTRUCTS
• A simple type definition defines a family of
Unicode text strings
• A complex type definition defines a collection of
requirements for attributes, subelements and
character data
• An element declaration associates an element
name with either a simple type or a complex type
• An attribute declaration associates an attribute
name with a simple type
96
SIMPLE vs COMPLEX TYPES
• Simple type describes text without markup (in
character data and attribute values)
• Complex type describes text that may contain
markup (i.e. elements, attributes and character
data)
97
xs:element
• Used to provide the definition for an XML
element
• Has the attributes name and type where
❑name – the tag-name of the element being defined
❑type – the type of the element which may be
▪ an XML-schema type (e.g. xs:string) or
▪ a custom type defined in the document itself
98
xs:element EXAMPLE
99
COMPLEX TYPES
• xs:complexType is used to describe elements
that consist of subelements
❑Has attribute name that gives a name to the type
• Typical subelement is xs:sequence
• xs:sequence in turn
❑Has a sequence of xs:element subelements
❑Uses minOccurs and maxOccurs attributes to
indicate the number of occurrences of xs:element
▪ Note: the default for minOccurs and maxOccurs is 1
100
xs:attribute
• Used within a complex type to indicate
attributes of elements of that type
• Has the attributes name, type and use where
❑name and type - similar as for xs:element
❑use – either “required” or “optional”
101
xs:attribute EXAMPLE
102
RESTRICTED SIMPLE TYPES
• xs:simpleType can describe enumerations
and range-restricted base types
❑ Has attribute name
• xs:enumeration is a subelement
103
xs:restriction
• Attribute base specifies the simple type to be
restricted (e.g., xs:integer)
• To specify lower and upper bounds on a
numerical range, use the subelements
❑xs:minInclusive or xs:minExclusive
❑xs:maxInclusive or xs:maxExclusive
each with a value attribute
• xs:enumeration is a subelement with the
attribute value that allows enumerated types
104
EXAMPLE: RESTRICTION WITH
ENUMERATION
<xs:simpleType name="licenceType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Learner" />
    <xs:enumeration value="Probationary" />
    <xs:enumeration value="Competent" />
  </xs:restriction>
</xs:simpleType>
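Python's standard library has no XML Schema validator, so the sketch below only imitates what the enumeration means: it reads the permitted values out of the licenceType definition and checks candidate values against them. The xmlns:xs declaration is added so the fragment can be parsed on its own, and the helper name is an assumption.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

schema_fragment = """<xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema"
               name="licenceType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Learner" />
    <xs:enumeration value="Probationary" />
    <xs:enumeration value="Competent" />
  </xs:restriction>
</xs:simpleType>"""

simple_type = ET.fromstring(schema_fragment)

# Collect the value attribute of every xs:enumeration subelement
allowed = [e.get("value") for e in simple_type.iter(f"{{{XS}}}enumeration")]


def is_valid_licence(value):
    """Return True if the value is one of the enumerated licence types."""
    return value in allowed


print(allowed)
print(is_valid_licence("Learner"), is_valid_licence("Expert"))
```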
105
EXAMPLE: RESTRICTION WITH RANGE
<xs:simpleType name="fees">
  <xs:restriction base="xs:float">
    <xs:minInclusive value="30.00" />
    <xs:maxExclusive value="151.00" />
  </xs:restriction>
</xs:simpleType>
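The meaning of the range bounds can be imitated with a short sketch. It is not a schema validator: it reads the bounds from a fees definition written with xs:minInclusive and xs:maxExclusive subelements and applies them by hand. The xmlns:xs declaration is added so the fragment parses on its own, and the helper name is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

schema_fragment = """<xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema"
               name="fees">
  <xs:restriction base="xs:float">
    <xs:minInclusive value="30.00" />
    <xs:maxExclusive value="151.00" />
  </xs:restriction>
</xs:simpleType>"""

restriction = ET.fromstring(schema_fragment).find("xs:restriction", NS)
lower = float(restriction.find("xs:minInclusive", NS).get("value"))
upper = float(restriction.find("xs:maxExclusive", NS).get("value"))


def is_valid_fee(amount):
    """Inclusive lower bound, exclusive upper bound."""
    return lower <= amount < upper


for amount in (29.99, 30.00, 150.99, 151.00):
    print(amount, is_valid_fee(amount))
```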
106