Chapter 4A Data Encoding and XML

Uploaded by

KUEK BOON KANG

Chapter 4A

DATA ENCODING &


XML

1
BIT
• Binary digIT
• Possible values: 0 or 1
• The smallest data item in a computer
• All the impressive functions performed by
computers involve only the simplest
manipulation of 0s and 1s, e.g.
– Examining a bit’s value
– Setting a bit’s value
– Reversing a bit’s value
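The three bit manipulations listed above can be sketched with Python's bitwise operators. The function names are illustrative only, and bit position 0 is assumed to be the least significant bit.

```python
def get_bit(value: int, position: int) -> int:
    """Examine a bit: return 0 or 1 for the bit at `position`."""
    return (value >> position) & 1


def set_bit(value: int, position: int) -> int:
    """Set a bit: force the bit at `position` to 1."""
    return value | (1 << position)


def flip_bit(value: int, position: int) -> int:
    """Reverse a bit: 0 becomes 1 and 1 becomes 0."""
    return value ^ (1 << position)


byte = 0b01000001  # the bit pattern 0100 0001
print(get_bit(byte, 0))                    # 1
print(format(set_bit(byte, 1), "08b"))     # 01000011
print(format(flip_bit(byte, 6), "08b"))    # 00000001
```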

2
MAN VS MACHINE
• Tedious for humans to work with 0s and 1s
• More reasonable for people to work with
characters
– Decimal digits (0..9)
– Letters (A..Z, a..z)
– Special symbols ($, @, %, etc)

3
SOLUTION: DATA MAPPING
• Map characters to bits for
– Data representation
– Data exchange
• Data is
– Conceptually, a sequence of characters
– Physically, a sequence of bytes
• Data encoding
– The mapping between characters and bytes

4
CONSIDERATIONS
• The need to exchange data with other users
working in different environments
• Different languages use different characters
• To allow for consistent data transfer among
computer systems (e.g. using the ftp
command)

5
DATA ENCODING SCHEMES
• ASCII
• EBCDIC
• Unicode
• Extensible Markup Language (XML)
• JavaScript Object Notation (JSON)

6
UNICODE
• A computing industry standard for the consistent
encoding, representation and handling of text
• Unicode 13.0 has a total of 143,859 characters
• Aligned with ISO/IEC 10646:2020
• The Unicode standard defines UTF-8, UTF-16,
UTF-32 and other character encodings

7
UNICODE
• Intent: to create a single character set that included all
writing systems, i.e. a Universal Character Set (UCS)
• Before Unicode: each character maps to some bits
o A → 0100 0001
• In Unicode: a letter maps to a code point
o A → U+0041 (U+ means “Unicode”, numbers are hexadecimal)
o https://home.unicode.org/
• Different encodings exist
o Encodings address how the code points are stored in
memory or represented in an email message

8
ENCODING UNICODE

• UCS-2
o Uses 2 bytes to store the code point
o May be either big-endian or little-endian
• UTF-8
• UTF-16

9
EXAMPLE: Encoding “Hello”
Encoding               H       e       l       l       o
ASCII & ANSI           48      65      6C      6C      6F
Unicode code points    U+0048  U+0065  U+006C  U+006C  U+006F
UCS-2 big-endian       00 48   00 65   00 6C   00 6C   00 6F
UCS-2 little-endian    48 00   65 00   6C 00   6C 00   6F 00
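The byte values for “Hello” can be reproduced with Python's built-in codecs. Python has no codec named UCS-2, so this sketch assumes the UTF-16 codecs as a stand-in; for characters below U+10000 they produce the same bytes.

```python
text = "Hello"

# Code points are numbers; the encodings below decide how they become bytes.
code_points = [f"U+{ord(ch):04X}" for ch in text]

ascii_hex = text.encode("ascii").hex(" ").upper()
big_endian_hex = text.encode("utf-16-be").hex(" ").upper()
little_endian_hex = text.encode("utf-16-le").hex(" ").upper()

print(code_points)         # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
print(ascii_hex)           # 48 65 6C 6C 6F
print(big_endian_hex)      # 00 48 00 65 00 6C 00 6C 00 6F
print(little_endian_hex)   # 48 00 65 00 6C 00 6C 00 6F 00
```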

10
UTF-8
• A system for storing strings of Unicode code
points in memory using 8-bit bytes
• Every code point from 0 – 127 is stored in a
single byte
– Looks exactly like ASCII
• Only code points 128 and above are stored
using 2, 3 or 4 bytes (the original design allowed
up to 6 bytes; RFC 3629 limits UTF-8 to 4)
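A short sketch of how the UTF-8 byte count grows with the code point. The four sample characters are arbitrary choices for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {ch.encode('utf-8').hex(' ')}")

# Code points 0 to 127 produce exactly the same bytes as ASCII.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)  # True
```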

11
CONSIDERATIONS
• Popular encodings for English text include
Windows-1252 and ISO-8859-1 (Latin-1)
• If there’s no equivalent for the Unicode code
point in the encoding you are using, the
character will be displayed as a question
mark ? or �
• Therefore, you must know what encoding is
used to represent a string in order to interpret
or display it correctly
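The effect of using the wrong encoding can be demonstrated in a few lines. This is a minimal sketch; the sample word is an arbitrary choice.

```python
original = "caf\u00e9"                      # "café"
utf8_bytes = original.encode("utf-8")       # stored as UTF-8

# Reading UTF-8 bytes as Latin-1 garbles the accented letter.
misread = utf8_bytes.decode("latin-1")

# A character with no equivalent in the target encoding becomes "?".
no_equivalent = "\u20ac".encode("ascii", errors="replace")

# Bytes that are invalid in the assumed encoding become U+FFFD.
invalid = b"caf\xe9".decode("utf-8", errors="replace")

print(misread)        # cafÃ©
print(no_equivalent)  # b'?'
print(invalid == "caf\ufffd")  # True
```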

12
PRESERVING ENCODING
INFORMATION
• In emails, specify in the header:
Content-Type: text/plain; charset="UTF-8"
• In web pages, specify in the meta tag as the very first
thing in the <head> section:
<html>
<head>
<meta http-equiv="content-type"
content="text/html; charset=UTF-8">
. . .

13
UTF-16
• A code unit consists of 2 bytes
• Code points below 65,536 are represented as a
single code unit
• Higher code points are represented as pairs of
code units
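The single-unit and paired-unit cases can be observed by splitting the UTF-16 bytes into 2-byte code units. This is a sketch; the helper name is illustrative.

```python
def utf16_code_units(ch: str) -> list:
    """Return the UTF-16 (big-endian) code units of a character as hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(utf16_code_units("A"))           # ['0041']          code point below 65,536
print(utf16_code_units("\u20ac"))      # ['20AC']          code point below 65,536
print(utf16_code_units("\U0001F600"))  # ['D83D', 'DE00']  a pair of code units
```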

14
SUMMARY ON UNICODE
• The most common choices are UTF-8 and UTF-16
• In terms of file size, UTF-8 is more compact if the text
contains mainly ASCII characters
• UTF-16 is more efficient at representing higher code
points
• Ideally, all the different character encodings will be
replaced by Unicode
• HTML documents may be written in different
encodings
– This should be specified in the HTTP response header
15
XML

16
INTRODUCTION TO XML
• A markup language that is extensible
– Can be modified according to the needs of the data
being recorded
• Describes the structure and content of any
machine-readable information
• Device-independent and system-independent
• Used to create vocabularies of other markup
languages

17
Chapter4A\xmldemo\menuA.xml

XML EXAMPLE

18
19
XML SYNTAX RULES
1. Every XML element must have a closing tag
• However, a self-closing tag is permitted
2. XML tags are case sensitive
3. XML elements must be properly nested
• All elements can have child (sub)elements
4. Every XML document must have a root
element
5. XML elements can have attributes in name-
value pairs
• Each attribute value must be quoted
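One way to see these rules in action is to feed small documents to a parser and check which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree; the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "well-formed": '<menu><item code="1001">Nasi Lemak</item><br/></menu>',
    "missing closing tag": "<menu><item>Nasi Lemak</menu>",
    "case mismatch": "<menu><Item>Nasi Lemak</item></menu>",
    "improper nesting": "<menu><item><name>Nasi</item></name></menu>",
    "two root elements": "<menu></menu><menu></menu>",
    "unquoted attribute": "<menu><item code=1001>Nasi Lemak</item></menu>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted; each of the others breaks exactly one of the rules listed above.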
20
WELL-FORMED XML DOCUMENTS
• A well-formed XML document contains no
syntax errors and satisfies the general
specifications for XML code stated by W3C
• At minimum, an XML document must be well-
formed or it will not be readable by programs
that process XML code

21
Write an XML file to represent the
following data:

Product Name            Manufacturer    Price
Purrfect Gift Basket    ABC Co          36.00
Stationery Set          Write Well      20.50
22
Write an XML file to represent
the following data:

Product Name            Items                   Price
Purrfect Gift Basket    Pillow, Blanket         36.00
Stationery Set          Pen, Notebook, Ruler    20.50
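One possible answer to the second exercise is sketched below, embedded in a Python string so that it can be parsed and checked. The element names (products, product, name, items, item, price) are assumptions; other designs, such as putting the price in an attribute, are equally acceptable.

```python
import xml.etree.ElementTree as ET

PRODUCTS_XML = """<?xml version="1.0"?>
<products>
  <product>
    <name>Purrfect Gift Basket</name>
    <items>
      <item>Pillow</item>
      <item>Blanket</item>
    </items>
    <price>36.00</price>
  </product>
  <product>
    <name>Stationery Set</name>
    <items>
      <item>Pen</item>
      <item>Notebook</item>
      <item>Ruler</item>
    </items>
    <price>20.50</price>
  </product>
</products>"""

root = ET.fromstring(PRODUCTS_XML)

# Rebuild the table rows from the hierarchy to confirm nothing was lost.
catalog = {
    product.findtext("name"): (
        [item.text for item in product.findall("items/item")],
        float(product.findtext("price")),
    )
    for product in root.findall("product")
}
print(catalog)
```

Repeating the item element is what lets one product hold several items, something a flat table row cannot express directly.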

23
XML USE

• To structure, store and transport information


• A common tool for data transmission among
various applications
• Used across various industries
• Used by many major websites and web
services

24
XML WITH SOFTWARE
APPLICATIONS & LANGUAGES
• Many software applications (e.g. Excel, Word)
and server languages (e.g. Java, PHP, .NET)
can read and create XML files
• Users can exchange data among applications
and enterprise systems using XML
• XML makes documents universally available

25
XML & DATABASES

• Databases store data and XML is widely used
for data interchange
• All major databases (e.g. MySQL, Oracle,
Access, etc) can read and create XML files
• The fact that XML is platform-independent
provides flexibility as technologies change

26
XML vs DATABASES

Relational database
• based on 2D tables which have
  – no hierarchy
  – no significant order

XML
• based on hierarchical trees in which
  – order is significant
  – hierarchy & sequence are used to represent
    information
27
XML & WEB PAGES
• The structure of XML closely matches the
structure used to display information in HTML
• The data from relational databases must be
converted to appropriate XML hierarchies for
use in web pages

28
XML VOCABULARIES
XML Vocabulary                                  Description
Chemical Markup Language (CML)                  Coding of molecular and chemical information
Extensible Hypertext Markup Language (XHTML)    HTML written as an XML application
Mathematical Markup Language (MathML)           Presentation & evaluation of mathematical equations & operations
Musical Markup Language (MML)                   Display & organization of music notation & lyrics
Really Simple Syndication (RSS)                 Distribution of news headlines and syndicated columns

29
XML DOCUMENT STRUCTURE

An XML document consists of 3 parts


• The prolog
• The document body
• The epilog

30
THE PROLOG INCLUDES
• XML declaration: indicates that the document is written in
the XML language
• Processing instructions: optional; provide additional
instructions to be run by programs that read the XML document
• Comment lines: optional
• Document type declaration (DOCTYPE): optional
31
XML PROLOG EXAMPLE

32
XML DECLARATION
• <?xml ?>
– The first line in an XML document
– Indicates that the document is written in XML
• Provides information on how to interpret the
code
– version="version number"
– encoding="encoding type"
– standalone="yes | no"
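The three declaration attributes can be read back with Python's expat bindings, which report them through the XmlDeclHandler callback. This is a sketch; the one-line sample document is made up.

```python
import xml.parsers.expat


def read_xml_declaration(document: bytes) -> dict:
    """Return the version, encoding and standalone values of the XML declaration."""
    found = {}

    def on_declaration(version, encoding, standalone):
        found["version"] = version
        found["encoding"] = encoding
        # expat reports standalone as 1 (yes), 0 (no) or -1 (omitted).
        found["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return found


sample = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<menu/>'
declaration = read_xml_declaration(sample)
print(declaration)
```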

33
XML PARSERS

• A program that reads and interprets an XML
document
• a.k.a. XML processor
• Current versions of all major web browsers
include an XML parser

34
FUNCTIONS OF XML PARSERS

1. Interpret a document’s code and verify that
it satisfies all the XML specifications for
document structure and syntax
2. Interpret PCDATA in the document and
resolve any character or entity references
found within the document
3. Interpret processing instructions (if any) and
carry them out

35
XML ELEMENTS
• An XML element includes everything from the
element’s start tag to the element’s end tag, e.g.

• An element can contain:


– text
– attributes
– other elements (nested elements)
36
XML NAMING RULES
Element names
• Are case-sensitive
• Must start with a letter or underscore
• Cannot start with the letters xml, XML, Xml,
etc
• Can contain letters, digits, hyphens,
underscores, and periods
• Cannot contain spaces
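The naming rules above can be approximated with a regular expression. This sketch is deliberately simplified: it covers only ASCII letters, whereas the XML specification also permits many non-ASCII letters and the colon used for namespace prefixes.

```python
import re

# First character: letter or underscore; then letters, digits, hyphens,
# underscores or periods. Spaces are excluded because they never match.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_element_name(name: str) -> bool:
    """Check a name against the simplified rules listed on the slide."""
    if name.lower().startswith("xml"):
        return False  # names beginning with xml (any case) are reserved
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["item", "_price", "unit-price.v2", "1st", "menu item", "XmlData"]:
    print(candidate, is_valid_element_name(candidate))
```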
37
XML ATTRIBUTES
• XML elements can have attributes just like
HTML
• Attribute values must always be quoted (either
single or double quotes)
• Attributes are used to contain metadata related
to a specific element
– E.g., can assign ID references to elements to
identify XML elements in the same manner as the
id attribute in HTML

38
Chapter4A\xmldemo\menuB.xml

39
40
ATTRIBUTES VS ELEMENTS

• Attributes cannot contain multiple values
whereas elements can
• Attributes cannot contain tree structures
whereas elements can
• Attributes are not easily expandable for future
changes

41
XML TREE STRUCTURE

Root element
  Element
    Element: Text
    Element: Text
    Element: Text
    Element: Text

42
XML TREE EXAMPLE

menu
  item
    name: Nasi Lemak Ayam
    code: 1001
    price (currency = RM): 10.90
    price (currency = USD): 2.60
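The same tree can be written as XML and walked with a parser. The markup below is an assumed reconstruction from the tree diagram (the demo files themselves are not reproduced here), and the outline helper is illustrative.

```python
import xml.etree.ElementTree as ET

MENU_XML = """<menu>
  <item>
    <name>Nasi Lemak Ayam</name>
    <code>1001</code>
    <price currency="RM">10.90</price>
    <price currency="USD">2.60</price>
  </item>
</menu>"""

root = ET.fromstring(MENU_XML)


def outline(element, depth=0):
    """Return one indented line per element, showing attributes and text."""
    attrs = "".join(f' {key}="{value}"' for key, value in element.attrib.items())
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + attrs + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))

# Attributes distinguish the two price elements.
prices = {price.get("currency"): price.text for price in root.iter("price")}
print(prices)  # {'RM': '10.90', 'USD': '2.60'}
```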

43
ENTITY REFERENCES
• Used to insert characters which either
– have a special meaning in XML or
– are not available on a standard keyboard
• Syntax:
&#nnn; <!-- numeric character reference -->
&entity; <!-- entity reference -->

44
CHARACTER & ENTITY REFERENCES
Symbol    Character Reference    Entity Reference
>         &#62;                  &gt;
<         &#60;                  &lt;
'         &#39;                  &apos;
"         &#34;                  &quot;
&         &#38;                  &amp;
©         &#169;
®         &#174;
™         &#8482;
°         &#176;
£         &#163;
€         &#8364;
¥         &#165;
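A parser resolves character and entity references back into the characters they stand for, and a serializer does the reverse. A minimal sketch using Python's standard library, with made-up sample text:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: references in the source become ordinary characters in the text.
note = ET.fromstring("<note>5 &lt; 7 &amp; 7 &#62; 5 &#169; &#8364;</note>")
print(note.text)  # 5 < 7 & 7 > 5 © €

# Writing: special characters must be turned back into references.
escaped = escape("Tom & Jerry <3")
print(escaped)  # Tom &amp; Jerry &lt;3

# ElementTree escapes automatically when it serializes an element.
element = ET.Element("note")
element.text = "a < b"
serialized = ET.tostring(element, encoding="unicode")
print(serialized)  # <note>a &lt; b</note>
```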
45
Chapter4A\xmldemo\menuC.xml

46
47
EMPTY ELEMENTS
• An element with no content, e.g.:
<element></element> or <element />
• Similar to HTML’s empty elements, e.g. <br />
• Usage
– to mark certain sections of the document for programs
reading it
– to reference external documents containing non-
textual data (similar to the HTML <img /> tag)
• Empty elements can have attributes

48
XML’S TEXT CHARACTERS
• XML documents consist only of text
characters
• XML’s text characters are of 3 categories:
– Parsed character data (PCDATA)
– Character data (CDATA)
– White space

49
PCDATA
• Parsed character data
• Comprises the code in the XML document
• Includes characters found in
– The XML declaration
– The opening and closing tags of an element
– Empty element tags
– Character or entity references
– Comments

50
CDATA
• Pure data content, i.e. character data that will
not be processed as code in an XML document
• A CDATA section may be placed anywhere
character data can occur within an element and
cannot be nested within another CDATA section
• To create a CDATA section:
<![CDATA[
character data
]]>
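The difference a CDATA section makes can be checked with a parser: the same text is rejected as markup when written directly, but accepted untouched inside CDATA. The element name and sample text below are made up for illustration.

```python
import xml.etree.ElementTree as ET

SNIPPET = "if (a < b && c > d) { swap(); }"

with_cdata = f"<code><![CDATA[{SNIPPET}]]></code>"
without_cdata = f"<code>{SNIPPET}</code>"

# Inside CDATA the < and & characters are plain data.
parsed_text = ET.fromstring(with_cdata).text
print(parsed_text == SNIPPET)  # True

# Without CDATA the parser tries to read them as markup and fails.
try:
    ET.fromstring(without_cdata)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)  # True
```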

51
Chapter4A\xmldemo\menuD.xml

52
53
Chapter4A\xmldemo\weatherdata.xml

54
55
WHITE SPACE

• Includes all whitespace characters (e.g. space,
newline, tab)
• The whitespace characters in XML elements’
contents are preserved

56
PROCESSING INSTRUCTIONS
• A command that tells the XML parser how to
process the document
• General form:
<?target instruction ?>
– Where target identifies the program or object to
which the processing is directed and instruction is
information that the document passes to the parser for
processing
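A common processing instruction is xml-stylesheet, which points a browser at a style sheet. The sketch below assumes a document that links to menu.css and uses Python's xml.dom.minidom to pull out the target and the instruction text.

```python
from xml.dom import minidom, Node

DOCUMENT = (
    b'<?xml version="1.0"?>\n'
    b'<?xml-stylesheet type="text/css" href="menu.css"?>\n'
    b"<menu/>"
)

dom = minidom.parseString(DOCUMENT)

# Processing instructions in the prolog are children of the document node.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

Note that the XML declaration itself is not reported as a processing instruction, even though it uses the same `<? ?>` delimiters.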

57
FORMATTING XML DATA WITH CSS

58
Chapter4A\xmldemo\menuE.xml

59
Chapter4A\xmldemo\menu.css

60
61
XML NAMESPACES
• Used to avoid element name conflicts
• When using prefixes in XML, a namespace for
the prefix must be defined by using an xmlns
attribute in the start tag of an element:
<element xmlns:prefix="uri">…</element>
• To declare a default namespace, omit the prefix
in the namespace declaration:
<element xmlns="uri">…</element>
– Any descendant element is considered part of this
namespace unless a different namespace is
declared within one of the child elements

62
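Namespaces let two elements share a local name without clashing. In the sketch below both namespace URIs and the element names are hypothetical; ElementTree reports each tag as `{uri}localname`, which makes the distinction visible.

```python
import xml.etree.ElementTree as ET

ORDER_XML = """<order xmlns="http://example.com/order"
       xmlns:f="http://example.com/furniture">
  <table>4</table>
  <f:table>Oak dining table</f:table>
</order>"""

root = ET.fromstring(ORDER_XML)

# The unprefixed element inherits the default namespace;
# the prefixed one belongs to the furniture namespace.
tags = [child.tag for child in root]
print(tags)

# Look up an element by prefix using a prefix-to-URI mapping.
furniture = root.find("f:table", {"f": "http://example.com/furniture"})
print(furniture.text)  # Oak dining table
```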
63
URI
• A Uniform Resource Identifier (URI) is a
string of characters which identifies an Internet
resource
• The most common URI is the Uniform
Resource Locator (URL) which identifies the
location of a resource on the Internet
• Another URI is the Uniform Resource Name
(URN) but this URI is not so common

64
WELL FORMED XML DOCUMENTS

• An XML document with correct syntax is
called well formed
• There are various XML validators to syntax-
check your XML
• Some IDEs (e.g. NetBeans) have a built-in
XML validator

65
Recall: XML SYNTAX RULES
• XML documents must have a root element
• XML elements must have a closing tag
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted

66
VALID XML DOCUMENTS
• A valid XML document must be well formed
and in addition, must conform to a document
type definition or XML schema
• To validate an XML document, use either
▪ A Document Type Definition (DTD) or
▪ An XML Schema

67
DTDs & XML SCHEMAS
2 ways to specify rules for how data in a document
vocabulary should be structured
a. Document Type Definition (DTD)
▪ Defines the structure of the data and very broadly
the types of data allowable
b. XML Schema
▪ more precisely defines the structure of the data and
the specific data restrictions

68
WELL-FORMED & VALID XML
DOCUMENTS
• A well-formed XML document contains no
syntax errors and satisfies the general
specifications for XML code stated by W3C
o At minimum, an XML document must be well-
formed or it will not be readable by programs that
process XML code
• A well-formed XML document that satisfies the
rules of a DTD or XML schema is said to be a
valid document

69
Document Type
Definition (DTD)

70
INTRODUCTION TO DTD
• A DTD is a collection of rules that define the
content and structure of an XML document
• A DTD is used in conjunction with an XML
parser that supports data validation

71
DTD’s USE
• Ensure that all required elements are present in
the document
• Prevent undefined elements from being used in
the document
• Enforce a specific data structure on document
content
• Specify the use of element attributes and define
their permissible values
• Define default values for attributes
• Describe how parsers should access non-XML
or nontextual content
72
DOCTYPE
• A DTD is attached to an XML document using
a DOCTYPE, i.e. a document type declaration
• The DOCTYPE must be added to the document
prolog after the XML declaration and before
the document’s root element
• Each document can only have one DOCTYPE
• The purpose of the DOCTYPE is to either
– Specify the rules of the DTD or
– Provide information to the parser about where those
rules are located
73
DTD STRUCTURE
• DTDs can be placed either within an XML
document or in an external file
• A DOCTYPE can contain either an
o internal subset, or
o external subset

74
DTD: EXTERNAL SUBSET
• In the XML document
❑Set the standalone attribute in the XML
declaration to “no” and
❑the DOCTYPE includes an external subset that
indicates the location of the file
• Locations are defined using either a system
identifier or a public identifier

75
SPECIFYING EXTERNAL DTD
LOCATION
• A system identifier specifies the location of the DTD
file:
Syntax: <!DOCTYPE root SYSTEM "uri">
where root is the document’s root element,
uri is the URI of the external file
Example: <!DOCTYPE menu SYSTEM "rules.dtd">
• For a public identifier:
Syntax: <!DOCTYPE root PUBLIC "id" "uri">
Example: <!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
76
Chapter4A\dtddemo\menu.dtd

77
Chapter4A\dtddemo\menuF.dtd

78
ELEMENT DECLARATIONS IN DTDS
• One element declaration for each element type:
<!ELEMENT element_name content_specification>
where content_specification can be
❑(#PCDATA)   parsed character data
❑(child)     one child element
❑(c1,..,cn)  a sequence of child elements c1..cn
❑(c1|..|cn)  one of the elements c1..cn
For each component c, possible counts can be specified:
❑c    exactly one such element
❑c+   one or more
❑c*   zero or more
❑c?   zero or one

79
DTD EXAMPLE

<!ELEMENT bakeries (bakery*)>
<!ELEMENT bakery (name, cake+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT cake (name, price)>
<!ELEMENT price (#PCDATA)>

• A bakeries element has zero or more bakery
elements nested within
• A bakery has one name and one or more cake
elements
• A cake has a name and a price
• name and price are text

80
VARIATIONS IN ELEMENT
DECLARATIONS
• Arbitrary combinations:
<!ELEMENT f ((a|b)*, c+, (d|e))*>
• Elements with mixed content:
<!ELEMENT text (#PCDATA|index|cite)*>
• Elements with empty content:
<!ELEMENT image EMPTY>
• Elements with arbitrary content:
<!ELEMENT thesis ANY>
Note: The symbol | can connect alternative sequences of
tags
81
Write the DTD Element Declaration for the
following element:
• A name consists of an optional title (e.g., “Prof”),
a first name and a last name in that order, or it is
an IP address
<!ELEMENT name ((title?, first, last)|ipaddr)>

82
ATTRIBUTES IN DTDs
• A bakery may have an attribute kind, a
character string describing the kind of bakery
(e.g. “Gourmet”, “Wedding”, “Commercial”)
<!ELEMENT bakery (name, cake+)>
<!ATTLIST bakery kind CDATA #IMPLIED>

• CDATA: character string type; no tags
• #IMPLIED: attribute is optional; opposite: #REQUIRED

83
EXAMPLE XML DOCUMENT
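A document that would satisfy the bakery declarations might look like the one embedded below, with the DTD attached as an internal subset. The bakery and cake names are invented. Python's standard library parses the DOCTYPE but does not validate against it, so the rules are re-checked by hand in a small function; a validating parser would do this automatically.

```python
import xml.etree.ElementTree as ET

BAKERIES_XML = """<!DOCTYPE bakeries [
  <!ELEMENT bakeries (bakery*)>
  <!ELEMENT bakery (name, cake+)>
  <!ATTLIST bakery kind CDATA #IMPLIED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT cake (name, price)>
  <!ELEMENT price (#PCDATA)>
]>
<bakeries>
  <bakery kind="Wedding">
    <name>Sweet Layers</name>
    <cake><name>Red Velvet</name><price>85.00</price></cake>
  </bakery>
  <bakery>
    <name>Daily Crumb</name>
    <cake><name>Pandan Chiffon</name><price>28.00</price></cake>
    <cake><name>Carrot</name><price>32.50</price></cake>
  </bakery>
</bakeries>"""


def child_tags(element):
    return [child.tag for child in element]


def follows_bakery_rules(root) -> bool:
    """Hand-written check mirroring the element declarations above."""
    if root.tag != "bakeries" or any(tag != "bakery" for tag in child_tags(root)):
        return False
    for bakery in root:
        tags = child_tags(bakery)
        # (name, cake+): one name first, followed by at least one cake
        if len(tags) < 2 or tags[0] != "name" or any(t != "cake" for t in tags[1:]):
            return False
        # (name, price): every cake has exactly a name then a price
        if any(child_tags(cake) != ["name", "price"] for cake in bakery.findall("cake")):
            return False
    return True


root = ET.fromstring(BAKERIES_XML)
print(follows_bakery_rules(root))  # True
```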

84
IDs and IDREFs

• To allow an element to refer to another
element with an ID attribute, give the element
an attribute of type IDREF
• To allow an element to refer to any number of
other elements, give the element an attribute of
type IDREFS
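The idea behind ID, IDREF and IDREFS can be sketched by resolving the references by hand. The document below is hypothetical, and the DTD lines in the comment show the attribute types it assumes; a non-validating parser treats these attributes as plain strings, so the lookup table is built explicitly.

```python
import xml.etree.ElementTree as ET

# Assumed declarations (hypothetical):
#   <!ATTLIST member id      ID     #REQUIRED
#                    mentor  IDREF  #IMPLIED>
#   <!ATTLIST team   players IDREFS #REQUIRED>
CLUB_XML = """<club>
  <member id="m1"><name>Aini</name></member>
  <member id="m2" mentor="m1"><name>Ben</name></member>
  <team players="m1 m2"/>
</club>"""

root = ET.fromstring(CLUB_XML)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(idrefs: str) -> list:
    """Turn a space-separated list of ID references into member names."""
    return [by_id[ref].findtext("name") for ref in idrefs.split()]


ben = by_id["m2"]
print(resolve(ben.get("mentor")))                 # ['Aini']
print(resolve(root.find("team").get("players")))  # ['Aini', 'Ben']
```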

85
Chapter4A\dtddemo\members.dtd

86
Chapter4A\dtddemo\members.xml

87
Chapter4A\xmldemo\menuG.dtd

88
Chapter4A\xmldemo\menuG.xml

89
90
DTD SHORTCOMINGS

• Lack of data types


• Does not support namespaces
• Uses a different syntax from XML
– Need to learn/know an additional syntax

91
XML SCHEMAS

92
XML SCHEMAS

• An XML-based alternative to DTD


• Describes the structure of an XML document
• There are 3 basic schema designs:
– Flat Catalog design (Salami Slice design)
– Russian Doll design
– Venetian Blind design

93
Chapter4A\xsddemo\menu.xsd

94
Chapter4A\xsddemo\menuH.xml

95
MAIN XML SCHEMAS CONSTRUCTS
• A simple type definition defines a family of
Unicode text strings
• A complex type definition defines a collection of
requirements for attributes, subelements and
character data
• An element declaration associates an element
name with either a simple type or a complex type
• An attribute declaration associates an attribute
name with a simple type

96
SIMPLE vs COMPLEX TYPES
• Simple type describes text without markup (in
character data and attribute values)
• Complex type describes text that may contain
markup (i.e. elements, attributes and character
data)

97
xs:element
• Used to provide the definition for an XML
element
• Has the attributes name and type where
❑name – the tag-name of the element being defined
❑type – the type of the element which may be
▪ an XML-schema type (e.g. xs:string) or
▪ a custom type defined in the document itself

98
xs:element EXAMPLE

99
COMPLEX TYPES
• xs:complexType is used to describe elements
that consist of subelements
❑Has attribute name that gives a name to the type
• Typical subelement is xs:sequence
• xs:sequence in turn
❑Has a sequence of xs:element subelements
❑Uses minOccurs and maxOccurs attributes to
indicate the number of occurrences of xs:element
▪ Note: the default for minOccurs and maxOccurs is 1

100
xs:attribute
• Used within a complex type to indicate
attributes of elements of that type
• Has the attributes name, type and use where
❑name and type - similar as for xs:element
❑use – either “required” or “optional”
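The constructs xs:element, xs:complexType, xs:sequence and xs:attribute fit together as in the schema fragment below. The type and element names are assumptions made for illustration. Python's standard library has no schema validator, so the sketch only parses the schema as ordinary XML and reads the occurrence bounds, applying the default of 1 when an attribute is absent.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="itemType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="code" type="xs:integer"/>
      <xs:element name="price" type="xs:decimal" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="available" type="xs:boolean" use="optional"/>
  </xs:complexType>
  <xs:element name="item" type="itemType"/>
</xs:schema>"""

schema = ET.fromstring(SCHEMA)
item_type = schema.find(XS + "complexType")
sequence = item_type.find(XS + "sequence")

# minOccurs and maxOccurs both default to 1 when not written.
occurrences = {
    el.get("name"): (el.get("minOccurs", "1"), el.get("maxOccurs", "1"))
    for el in sequence
}
attributes = {
    at.get("name"): at.get("use") for at in item_type.findall(XS + "attribute")
}
print(occurrences)
print(attributes)
```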

101
xs:attribute EXAMPLE

102
RESTRICTED SIMPLE TYPES
• xs:simpleType can describe enumerations
and range-restricted base types
❑ Has attribute name
• xs:enumeration is a subelement

103
xs:restriction
• Attribute base specifies the simple type to be
restricted (e.g., xs:integer)
• To specify lower and upper bounds on a
numerical range, use the subelements (facets)
❑xs:minInclusive or xs:minExclusive
❑xs:maxInclusive or xs:maxExclusive
each with the attribute value
• xs:enumeration is a subelement with the
attribute value that allows enumerated types

104
EXAMPLE: RESTRICTION WITH
ENUMERATION
<xs:simpleType name="licenceType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Learner" />
    <xs:enumeration value="Probationary" />
    <xs:enumeration value="Competent" />
  </xs:restriction>
</xs:simpleType>

105
EXAMPLE: RESTRICTION WITH RANGE
<xs:simpleType name="fees">
  <xs:restriction base="xs:float">
    <xs:minInclusive value="30.00" />
    <xs:maxExclusive value="151.00" />
  </xs:restriction>
</xs:simpleType>
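Range bounds are written as facet elements inside xs:restriction, each carrying a value attribute. The sketch below reads the two bounds from such a definition and applies them by hand. It is an illustration of what the facets mean, not a schema validator, which the Python standard library does not include.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

FEES_TYPE = """<xs:simpleType name="fees"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:restriction base="xs:float">
    <xs:minInclusive value="30.00"/>
    <xs:maxExclusive value="151.00"/>
  </xs:restriction>
</xs:simpleType>"""

restriction = ET.fromstring(FEES_TYPE).find(XS + "restriction")
low = float(restriction.find(XS + "minInclusive").get("value"))
high = float(restriction.find(XS + "maxExclusive").get("value"))


def is_valid_fee(value: float) -> bool:
    """Inclusive lower bound, exclusive upper bound."""
    return low <= value < high


for fee in (29.99, 30.00, 150.99, 151.00):
    print(fee, is_valid_fee(fee))
```

With these facets, 30.00 is accepted because the lower bound is inclusive, while 151.00 is rejected because the upper bound is exclusive.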

106
