Introduction:
algorithms (recapitulation),
bits, strings
Jesper Larsson
Course, teaching
• Me: Jesper Larsson, string and compression algorithms person; teaching at MAU since 2014; research background in string algorithms and data compression
• You?
• Languages? Spoken and for programming
• Lectures, assignments, computer time
• Not based on a single book, but the material can be found in books
• Small and familiar course
Lectures (preliminary)
1. Intro, bits and strings
2. Bucket and radix sorting
3. Trie (digital tree)
4. Suffix tree
5. Inverted file (search engine data structure) + regular expressions
6. Suffix array
7. Suffix data structure algorithms, supplemental
8. Information theory, codes, entropy coding
9. More on codes and their applications, Ziv-Lempel compression
10. The Burrows–Wheeler transform (BWT)
11. Substring search (KMP, BM, Karp–Rabin)
12. Catching up and summary
I expect that you know
• Sorting and searching as taught in basic algorithms courses: binary search trees, hash tables, {selection, insertion, quick-, merge-}sort
• Programming
• Principles of algorithm analysis, O-notation etc.
Assignments (preliminary)
1. Word frequencies + compact file
2. Radix sorting (general interface)
3. Word frequencies 2 (trie) + search engine
4. Suffix sorting
5. Entropy coding
6. BWT
Today
• Brief recapitulation of algorithm time complexity
• We take a step back: digital representation of information
• Bitwise operations, numbers
(Not what this course is supposed to teach, but you are
going to need this in your programming assignments)
• If time: counting/bucket sort (“key-indexed counting”)
Algorithmic research
• Come up with algorithms for specific problems
• Determine the “speed” of algorithms
• Find “faster” algorithms
• Prove that:
- A specific algorithm does what it’s supposed to
- A specific algorithm has a certain “speed”
- There can be no algorithm for the problem “faster” than a certain “speed”
How many times do you have to turn the crank?
Charles Babbage’s analytical engine, “programmed” by Ada Lovelace
Time complexity of algorithm
• T(N) = a measure for the time it takes to run the program
on an input of size N
• Approximate with “at most proportional to”, O-notation
~, Θ, O
Ex 1. ⅙N³ + 20N + 16 ~ ⅙N³, which is Θ(N³)
Ex 2. ⅙N³ + 100N^(4/3) + 56 ~ ⅙N³ = proportional to N³
Ex 3. ⅙N³ − ½N² + ⅓N ~ ⅙N³ = cubic
Most common measure
f(N) is O(g(N)) means: f(N) is at most proportional to g(N)
f(N) is Θ(g(N)) means: f(N) is precisely proportional to g(N)
g(N) is the “best” function such that f(N) is O(g(N))
Formally
f(N) is O(g(N)) means: ∃ constants N0 and c so that if N > N0, then |f(N)| < c·g(N)
f(N) is Ω(g(N)) means: ∃ constants N0 and c so that if N > N0, then |f(N)| > c·g(N)
f(N) is Θ(g(N)) means: f(N) is both O(g(N)) and Ω(g(N))
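To make the definitions concrete, a small Python check (my addition, not from the slides) that the constant in Ex 1 shows up as the limit of f(N)/N³:

```python
# f(N) = (1/6)N^3 + 20N + 16 from Ex 1 above.
# The ratio f(N)/N^3 approaches 1/6, so f(N) is Theta(N^3).
def f(n):
    return n**3 / 6 + 20 * n + 16

for n in (10, 100, 1000, 10000):
    print(n, f(n) / n**3)
```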
Time for sorting with pairwise compares
• Upper bound: ~N lg N compares (given by mergesort; lg = log₂)
• Lower bound: ?
• Optimal algorithm: ?
Start by considering how many possible orderings there are
Decision tree to find, using compares, which possible ordering of 3 elements is correct.
Tree height = number of compares in the worst case.

a<b?
├ yes: b<c?
│  ├ yes: abc
│  └ no:  a<c?
│     ├ yes: acb
│     └ no:  cab
└ no:  a<c?
   ├ yes: bac
   └ no:  b<c?
      ├ yes: bca
      └ no:  cba

One leaf per possible ordering: 3! = 6 leaves
Decision tree for possible orderings of N values
• N values a1 to aN. Assume they are all different (equal values are a case we would need to manage).
• N! = N · (N−1) · (N−2) · … · 3 · 2 · 1 different orderings
• Tree with compares as internal nodes, with orderings as leaves: at least N! leaves
• Binary tree with h levels: at most 2ʰ leaves
• Worst case time: height h, so 2ʰ ≥ N!
• h ≥ lg(N!) ~ N lg N (by Stirling’s approximation)
• Conclusion: any algorithm must use lg(N!) ~ N lg N compares (worst case)
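A quick numerical look (a sketch of mine, not from the slides) at how close lg(N!) is to N lg N:

```python
import math

# lg(N!) computed via the log-gamma function: lgamma(n+1) = ln(n!).
def lg_factorial(n):
    return math.lgamma(n + 1) / math.log(2)

# The ratio lg(N!) / (N lg N) tends to 1 as N grows (Stirling).
for n in (10, 1000, 100000):
    print(n, lg_factorial(n) / (n * math.log2(n)))
```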
Stupid question?
• Well known: computers at machine level represent
everything using only 0 and 1
• So how can computers process and output graphics,
audio, or even text, which don’t look like 0s and 1s?
• What does “only 0 and 1” even mean?
• (How to explain this to someone without comp sci
knowledge?)
Blondinrikard Fröberg, Listen closely https://flic.kr/p/tRbAcU
στρατός (formerly known as Michelangelo_MI), At the end of the track https://flic.kr/p/4wMSNh
Sound
• A sequence of amplitude values in binary representation
• Parameters: frequency, bits per sample…
Pictures
• Bit-mapped: PNG, JPEG, GIF, …
• Drawing instructions (vector): PDF, Postscript, …
(Figure: xkcd color survey results, http://blog.xkcd.com/2010/05/03/color-survey-results/)
Pictures
• Colors: RGB
• 3 one-byte numbers: 256×256×256 = 16777216 different colors
D. B. Gaston, Arabian Nights text cropped https://flic.kr/p/5QvRXv
7-bit ASCII

International (e.g. Scandinavian) characters
• Replace some glyphs: [ ] \ { }
• Use 8 bits: Latin-1 (ISO 8859-1)
• Replace some for new chars (€ etc.): Latin-9 (ISO 8859-15)
• Microsoft variant: Windows-1252
• Unicode multibyte: UTF-8 de-facto standard?
UTF-8
Bits needed   Bytes in UTF-8   Byte 1     Byte 2     Byte 3     Byte 4
7             1                0xxxxxxx
8–11          2                110xxxxx   10xxxxxx
12–16         3                1110xxxx   10xxxxxx   10xxxxxx
17–21         4                11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
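The table can be checked directly in Python (my example; the characters are chosen one per row of the table):

```python
# Code points needing 7, 8, 14, and 17 bits land in 1, 2, 3, and 4
# UTF-8 bytes respectively.
for ch in ("A", "é", "€", "𝄞"):
    b = ch.encode("utf-8")
    print(ch, hex(ord(ch)), len(b), [f"{byte:08b}" for byte in b])
```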
Two approaches (which are
both kind of the same)
• Recursively split the universe in two parts. Denote the
parts 1 and 0. Binary tree with what we want to represent
in the leaves
• Assign numbers to everything we want to represent.
Encode in base 2 (binary numbers)
Game: 20 questions
• How many ”messages” are possible with 20 questions =
bits?
• What if some messages are (much) more likely than
others?
• (How many bits would we need to distinguish any
message ever written by a human?)
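For the first question, the count is simply 2²⁰ (a one-liner of mine, not from the slides):

```python
# 20 yes/no questions distinguish 2^20 different messages.
print(2 ** 20)  # 1048576
```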
Claude Shannon
X has probability p.
Optimal number of bits to represent X: log₂(1/p) bits
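Shannon’s formula in a couple of lines of Python (my sketch):

```python
import math

# Optimal number of bits to represent an outcome with probability p.
def optimal_bits(p):
    return math.log2(1 / p)

print(optimal_bits(1 / 2))    # 1.0  (a fair coin flip)
print(optimal_bits(1 / 256))  # 8.0  (one byte's worth)
```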
Unary number representation
Positional number system
أبو عبد الله محمد بن موسى الخوارزمي
Abū ʿAbdallāh Muḥammad ibn Mūsā al-Khwārizmī
Indian positional system → “Arabic” numbers
782 = 2·10⁰ + 8·10¹ + 7·10²
Binary number
representation
decimal:
782 = 2·10⁰ + 8·10¹ + 7·10²
binary:
1100001110 = 0·2⁰ + 1·2¹ + 1·2² + 1·2³ + 0·2⁴ + 0·2⁵ + 0·2⁶ + 0·2⁷ + 1·2⁸ + 1·2⁹
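The binary expansion above, evaluated in Python (my sanity check, not from the slides):

```python
# Sum the powers of two selected by the bits of 1100001110.
bits = "1100001110"
value = sum(int(b) << i for i, b in enumerate(reversed(bits)))
print(value)      # 782
print(bin(782))   # 0b1100001110
```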
“8 Questions” for unsigned
numbers (octets or “bytes”)
00000000: 0
00000001: 1
00000010: 2
00000011: 3
00000100: 4
⋮
01111110: 126
01111111: 127
10000000: 128
10000001: 129
10000010: 130
⋮
11111110: 254
11111111: 255
Signed 8-bit numbers (octets
or “bytes”): two’s complement
00000000: 0
00000001: 1
00000010: 2
00000011: 3
00000100: 4
⋮
01111110: 126
01111111: 127
10000000: −128
10000001: −127
10000010: −126
⋮
11111110: −2
11111111: −1
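Two’s complement can be mimicked in Python, which has unbounded ints, by subtracting 256 from patterns with the top bit set (my sketch):

```python
# Interpret an 8-bit pattern as a signed (two's complement) value.
def to_signed8(byte):
    return byte - 256 if byte >= 128 else byte

print(to_signed8(0b01111111))  # 127
print(to_signed8(0b10000000))  # -128
print(to_signed8(0b11111111))  # -1
```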
Floating-point representation
0.250244140625:
Single precision, 32 bits: 1 sign, 8 exponent, 23 mantissa
0 01111101 00000000010000000000000
sign 0: positive
exponent: 01111101 = 125; subtract 127 (exponent bias): −2
mantissa: 1·2⁰ (implicit first 1 bit) + 0·2⁻¹ + 0·2⁻² + 0·2⁻³ + ⋯ + 0·2⁻⁹ + 1·2⁻¹⁰ + 0·2⁻¹¹ + ⋯ = 1.0009765625
value: 1.0009765625 · 2⁻² = 0.250244140625
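The same decoding in Python, using `struct` to get at the raw bits (my sketch, not from the slides):

```python
import struct

# Pull out the IEEE 754 single-precision fields of 0.250244140625.
bits = struct.unpack(">I", struct.pack(">f", 0.250244140625))[0]
sign     = bits >> 31
exponent = (bits >> 23) & 0xFF       # 125, i.e. 125 - 127 = -2
mantissa = bits & 0x7FFFFF           # fraction bits, implicit leading 1

value = (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(f"{bits:032b}")  # 00111110100000000010000000000000
print(value)           # 0.250244140625
```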
Bitwise operations
• and & ∧ ·
• or | ∨ +
• xor ^ ⊻ ⊕
• not ~ ¬ ¯
• shift left: <<
• shift right: >> (sign-extending), >>> (zero-filling, e.g. in Java)
a b a&b
0 0 0
0 1 0
1 0 0
1 1 1
a & b is true only if both a and b are true
a b a|b
0 0 0
0 1 1
1 0 1
1 1 1
a ∨ b is true if at least one of a and b is true
a b a^b
0 0 0
0 1 1
1 0 1
1 1 0
a ⊕ b is true when a and b are not equal
• Setting bit to 1: or 1 (or 0 means don't change)
• Setting bit to 0: and 0 (and 1 means don't change)
• Flipping bit: xor 1 (xor 0 means don't change)
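The three rules as Python expressions (my example, applied to bit 3 of an arbitrary byte):

```python
x = 0b10100101             # bit 3 happens to be 0 here
print(bin(x | (1 << 3)))   # set bit 3:   0b10101101
print(bin(x & ~(1 << 3)))  # clear bit 3: 0b10100101 (unchanged)
print(bin(x ^ (1 << 3)))   # flip bit 3:  0b10101101
```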
decimal: 9 + 5 = 14 decimal: 11 + 7 = 18
binary: 1001 + 101 = 1110 binary: 1011 + 111 = 10010
carry C
A
B
Each row is an operation with three input bits and two output bits
outi = Ai ^ Bi ^ Ci
Ci+1 = Ai & Bi | Ci & (Ai ^ Bi)
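The full-adder equations chained into a ripple-carry adder (my sketch; the fixed 16-bit width is an arbitrary choice):

```python
# Add two nonnegative ints bit by bit using only the bitwise
# operations out = a ^ b ^ c and c' = a & b | c & (a ^ b).
def add(a, b, width=16):
    result, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result

print(add(9, 5))   # 14
print(add(11, 7))  # 18
```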
Groups of bits
• Groups of 3 bits: octal
• Groups of 4 bits: hex[adecimal]
• Groups of 8 bits: byte “strings”
(One strange representation: IPv4 32-bit address: “192.168.10.199”)
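In Python (my example), the built-in conversions show the same grouping for 2022:

```python
n = 2022
print(bin(n))  # 0b11111100110
print(oct(n))  # 0o3746  (bits in groups of 3)
print(hex(n))  # 0x7e6   (bits in groups of 4)
```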
Let’s calculate
• 2022 · 666 = 1346652
• In binary (grouped in bytes): 111|11100110 · 10|10011010 = 10100|10001100|01011100

Split each factor into base-256 digits:
2022 & 0xff = 230    2022 >>> 8 = 7    → 2022 = (7, 230)
666 & 0xff = 154     666 >>> 8 = 2     → 666 = (2, 154)

        7 230
      × 2 154

Multiply by the low digit 154:
154·230 & 0xff = 92           154·230 >>> 8 = 138 (carry)
(154·7 + 138) & 0xff = 192    (154·7 + 138) >>> 8 = 4
→ partial product: 4 192 92

Multiply by the high digit 2:
2·230 & 0xff = 204    2·230 >>> 8 = 1 (carry)
2·7 + 1 = 15
→ partial product: 15 204, shifted one digit left

Add the partial products:
(192 + 204) & 0xff = 140    (192 + 204) >>> 8 = 1 (carry)
1 + 4 + 15 = 20
→ result: 20 140 92

20·256² + 140·256¹ + 92·256⁰ = 1346652 = 666·2022
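The byte-by-byte procedure above, generalized into a Python sketch (mine, not from the slides; Python’s `>>` on nonnegative ints plays the role of the slides’ `>>>`):

```python
# Schoolbook multiplication in base 256: split into byte "digits",
# multiply digit by digit, propagate carries with & 0xff and >> 8.
def mul_base256(a, b):
    da = [(a >> 8 * i) & 0xff for i in range(3)]  # least significant first
    db = [(b >> 8 * i) & 0xff for i in range(3)]
    prod = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        carry = 0
        for j, y in enumerate(db):
            t = prod[i + j] + x * y + carry
            prod[i + j] = t & 0xff
            carry = t >> 8
        prod[i + len(db)] += carry
    return sum(d << 8 * k for k, d in enumerate(prod))

print(mul_base256(2022, 666))  # 1346652
```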
Endianness
• Little-endian: byte 0 at address 0, byte 1 at address 1, …
Makes sense because it's analogous to bit 0 being the least significant bit
• Big-endian: byte 0 at the last address of the word, most significant byte at address 0…
Makes sense because you can sort (unsigned) numbers as if they were strings
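The two byte orders, shown with Python’s `struct` (my example):

```python
import struct

# The 32-bit value 0x01020304 laid out in memory in both byte orders.
n = 0x01020304
print(struct.pack("<I", n).hex())  # little-endian: 04030201
print(struct.pack(">I", n).hex())  # big-endian:    01020304
```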
Counting multi-byte ints
A first go at sorting in less than O(N log N)
Next lecture!