Algorithms and Data Structures II
(WS 20/21)
Course Notes
Christian Schulz
Heidelberg University
Contents

Preface
1 Algorithm Engineering
1.1 A Brief “History” of Algorithm Engineering
1.2 Models
1.3 Design
1.4 Analysis
1.5 Implementation
1.6 Experiments
1.7 Algorithm Libraries
1.8 Instances
1.9 Applications
1.10 Conclusions
7 Linear Programming
8 Parallel Algorithms
8.1 Introduction
8.2 Message-Passing Model/Paradigm
8.3 Communication Cost in Parallel Machines
8.4 Performance metrics
8.4.1 Execution Time
8.4.2 Speedup
8.4.3 Superlinear speedup
8.4.4 Efficiency
8.4.5 Example: Adding n numbers using n processing elements
8.4.6 Example: Adding n numbers using p processing elements
8.5 Prefix Sums on a Hypercube Network
8.6 Quicksort
8.6.1 Sequential Algorithm
8.6.2 Parallelizing Quicksort
8.6.3 Parallel Formulation for a Message-passing Platform
8.7 Mergesort
9 External Algorithms
9.1 Introduction
9.2 The external memory model
9.3 Stacks and Queues
9.4 Multiway Merge Sort
9.5 Internal work of Multiway Mergesort
9.6 Experiments
9.7 External Priority Queues
9.8 Semiexternal Kruskal’s Algorithm
10 Online Algorithms
10.1 Competitive Analysis
10.1.1 Example: Ski Rental
10.2 Paging – Deterministic Algorithms
10.2.1 Longest Forward Distance
10.2.2 Competitive Paging Algorithms
10.2.3 Conservative Paging Algorithms
10.3 Randomized Online Algorithms
10.3.1 Randomized Marking Algorithms
10.3.2 Lower Bound for Randomized Paging Algorithms
11 Geometry
11.1 Application Domains
11.2 Basic Elements and Definitions
11.3 Data Structures
11.4 Line Segment Intersection - Case Study
11.4.1 Orthogonal Line Segment Intersection
11.4.2 A more Complex Case
11.4.3 The Arbitrary Case
11.4.4 Further Improvements
11.5 Convex Hull
11.5.1 Graham Scan
11.6 Smallest Enclosing Circle
11.6.1 Analysis
11.7 Range Search
11.7.1 1D Range Search
11.7.2 2D Range Search
11.7.3 Bitvectors
Bibliography
Preface
The present course notes collect and elaborate on material for the course “Algorithms and Data Structures II” at Heidelberg University. Some parts of the script were first published in winter term 2010/2011 for a course “Algorithmen II” at KIT and have been expanded with additional material from an “Algorithms and Data Structures II” course that I gave in Vienna. The initial version of the writeup is due to Timo Bingmann, Johannes Fischer, Robert Geisberger, Moritz Kobitzsch, Vitaly Osipov, Peter Sanders, Dennis Schieferdecker, Christian Schulz (me), and Johannes Singler. Nevertheless, the current state of the course notes is still preliminary – in software jargon we would call it an α-version. Also note that the topics presented here sometimes go beyond what was presented during the lectures. This additional material is not required to successfully complete the course; it is given to broaden your algorithmic horizon. Little of the material here is original, so kindly follow the citations to the original sources for more detailed information. We will extend the chapters of the script while the course is running. Also note that some chapters go far beyond the material that we covered in the lecture.

The chapters in these notes directly correspond to the chapters on the lecture slides. For chapters 2–4 on Advanced Data Structures, Shortest Paths and Applications of DFS we refer to the textbook “Algorithms and Data Structures: The Basic Toolbox” by Kurt Mehlhorn and Peter Sanders [38], which is also available online and for free (see https://people.mpi-inf.mpg.de/~mehlhorn/Toolbox.html).
Chapter 1
Algorithm Engineering
Figure 1.1: Algorithm engineering as a cycle of design, analysis, implementation, and experimentation, driven by falsifiable hypotheses (via induction and deduction) and complemented by realistic models, performance guarantees, algorithm libraries, real inputs, applications, and application engineering.
Algorithmics – the systematic development of efficient algorithms – lies at the heart of every computer application and therefore has pivotal influence on the effective development of reliable and resource-conserving technology. We only mention a few spectacular examples.
Fast search in the huge data space of the internet (e.g. using Google) has
changed the way we handle knowledge. This was made possible with full-text
search algorithms that are harvesting matching information out of petabytes of data
within fractions of a second and by ranking algorithms that process graphs with
billions of nodes in order to filter relevant information out of heaps of result data.
Less visible yet similarly important are algorithms for the efficient distribution
and caching of frequently accessed data under massive load fluctuations or even
distributed denial of service attacks.
One of the most far-reaching results of the last years was the ability to read the human genome. Algorithmics was decisive for the early success of this project [55]. Rather than just processing the data coming out of the lab, algorithmic considerations shaped the implementation of the applied shotgun sequencing process.
The list of areas where sophisticated algorithms play a key role could be continued arbitrarily: computer graphics, image processing, geographic information systems, cryptography, planning in production, logistics and transportation, …
How is algorithmic innovation transferred to applications? Traditionally, algo-
rithmics used the methodology of algorithm theory which stems from mathematics:
algorithms are designed using simple models of problem and machine. Main results
are provable performance guarantees for all possible inputs. This approach often
leads to elegant, timeless solutions that can be adapted to many applications. The
hard performance guarantees lead to reliably high efficiency even for types of inputs
that were unknown at implementation time. From the point of view of algorithm
theory, taking up and implementing an algorithmic idea is part of application devel-
opment. Unfortunately, it can be universally observed that this mode of transferring
results is a slow process. With growing requirements for innovative algorithms,
this causes growing gaps between theory and practice: Realistic hardware with its
parallelism, memory hierarchies etc. is diverging from traditional machine models.
Applications grow more and more complex. At the same time, algorithm theory
develops more and more elaborate algorithms that may contain important ideas but
are usually not directly implementable. Furthermore, real-world inputs are often
far away from the worst case scenarios of the theoretical analysis. In extreme cases,
promising algorithmic approaches are neglected because a mathematical analysis
would be difficult.
Since the early 1990s it therefore became more and more apparent that al-
gorithmics cannot restrict itself to theory. So, what else should algorithmicists
do? Experiments play a pivotal role here. Algorithm engineering (AE) is therefore sometimes equated with experimental algorithmics. However, in this chapter we argue that this view is too limited. First of all, to do experiments, you also have to implement algorithms. This is often as interesting and revealing as the experiments themselves, needs its own set of techniques, and is an important interface to
software engineering. Furthermore, it makes little sense to view design and analysis
on the one hand and implementation and experimentation on the other hand as
separate activities. Rather, a feedback loop of design, analysis, implementation,
and experimentation that leads to new design ideas materializes as the central
process of algorithmics.
This cycle is quite similar to the cycle of theory building and experimental
validation in Popper’s scientific method [45]. We can learn several things from
this comparison. First, this cycle is driven by falsifiable hypotheses validated
by experiments – an experiment cannot prove a hypothesis but it can support it.
However, such support is only meaningful if there are conceivable outcomes of
experiments that prove the hypothesis wrong. Hypotheses can come from creative
ideas or result from inductive reasoning stemming from previous experiments.
Thus we see a fundamental difference to the deductive reasoning predominant in
algorithm theory. Experiments have to be reproducible, i.e., other researchers have
to be able to repeat an experiment to the extent that they draw the same conclusions
or uncover mistakes in the previous experimental setup.
There are further aspects of AE as a methodology for algorithmics, outside
the main cycle. Design, analysis and evaluation of algorithms are based on some
model of the problem and the underlying machine. Since gaps between theory
and practice often relate to these models, they are an important aspect of AE.
Since we aim at practicality, applications are an important aspect. However
we choose to view applications as being outside the methodology of AE since
it would otherwise become too open ended and because often one algorithm
can be used for quite diverse applications. Also, every new application will
have its own requirements and techniques some of which may be abstracted
away for algorithmic treatment. Still, in order to reduce gaps between theory
and practice, as many interactions as possible between the application and the
activities of AE should be taken into account: Applications are the basis for realistic
models, they influence the kind of analysis we do, they put constraints on useful
implementations, and they supply realistic inputs and other design parameters for
experiments. On the other hand, the results of analysis and experiments influence
the way an algorithm is used (is it fast enough for real time or interactive use?, …)
and implementations may be the basis for software used in applications. Indeed,
we may view application engineering as a separate process living in both AE
and a concrete application domain where methods from both areas are used to
adapt an algorithm to a particular application. Application engineering bridges
remaining unavoidable gaps between experimental implementations and production
quality code. Note that there are important differences between these two kinds of
code: fast development, efficiency, and instrumentation for experiments are very
important for AE, while thorough testing, maintainability, simplicity, and tuning for
particular classes of inputs are more important for the applications. Furthermore,
the algorithm engineers may not even know all the applications for which their
algorithms will be used. Hence, algorithm libraries of highly tested codes with
clear simple user interfaces are an important link between AE and applications.
Figure 1.1 summarizes the resulting schema for AE as a methodology for
algorithmics. The following sections will describe the activities in more detail. We
give examples of challenges and results that are a more or less random sample
biased to results we know well.
1.2 Models
A big difficulty for defining models for problems and machines is that (apparently)
only complex models are adequate images of reality whereas only simple models
lead to simple, widely usable, portable, and analyzable algorithms. Therefore, AE
must simultaneously and carefully abstract from application problems and refine
theoretical models.
A successful example for a machine model is the external memory model (or I/O model) [1, 56, 39], which is a careful refinement of the von Neumann model [42]. Instead of a uniform memory, there are two levels of memory: a fast memory of limited size M and a slow memory that is accessed in blocks of size B. While merely counting I/O steps in this model can become a highly theoretical game, we get an abstraction useful for AE if we additionally take internal work into account and if we are careful to use the right values for the parameters M and B. (A common pitfall when applying the I/O model to disks is to look for a natural physical block size. This can lead to values – e.g. 512 bytes, the size of a decoding unit – that are four orders of magnitude away from the value that should be chosen: a value where the data transfer takes about as long as the average latency for a small block.) Algorithms good in the I/O model are often good in practice, although the model oversimplifies aspects like rotational delays, seek time, disk data density depending on the track used, cache replacement strategies [35], flexible block sizes, etc. Sometimes it would even be counterproductive to be too clever. For example, a program carefully tuned to minimize rotational delays and seek time might experience severe performance degradation as soon as another application accesses the disk.
An example for application modelling is the simulation of traffic flows. While microscopic simulations that take the actual behavior of every car into account are currently limited to fairly small subnetworks, it may soon become possible to simulate an entire country by only looking at the paths taken by each car.
1.3 Design
As in algorithm theory, we are interested in efficient algorithms. However, in AE,
it is equally important to look for simplicity, implementability, and possibilities for
code reuse. Furthermore, efficiency means not just asymptotic worst case efficiency,
but we also have to look at the constant factors involved and at the performance
for real-world inputs. In particular, some theoretically efficient algorithms have similar best case and worst case behavior, whereas the algorithms used in practice perform much better on all but contrived examples. An interesting example are maximum flow algorithms, where the asymptotically best algorithm [21] is much worse in practice than theoretically inferior algorithms [13, 34].
1.4 Analysis
Even simple and proven practical algorithms are often difficult to analyze and
this is one of the main reasons for gaps between theory and practice. Thus, the
analysis of such algorithms is an important aspect of AE. For example, randomized
algorithms are often simpler and faster than their best deterministic competitors
but even simple randomized algorithms are often difficult to analyze.
Many complex optimization problems are attacked using meta heuristics like
(randomized) local search or evolutionary algorithms. Algorithms of this type
are simple and easily adaptable to the problem at hand. However, only very few
such algorithms have been successfully analyzed (e.g. [57]) although performance
guarantees would be of great theoretical and practical value.
An important open problem is partitioning of graphs into approximately equal
sized blocks such that few edges are cut. This problem has many applications,
e.g., in scientific computing. Currently available algorithms with performance
guarantees are too slow for practical use. Practical methods first contract the graph
while preserving its basic structure until only few nodes are left, compute an initial
solution on this coarse representation, and then improve by local search. These
algorithms, e.g., [20] are very fast and yield good solutions in many situations yet
no performance guarantees are known.
An even more famous example for local search is the simplex algorithm for
linear programming. Simple variants of the simplex algorithm need exponential
time for specially constructed inputs. However, in practice, a linear number
of iterations suffices. So far, subexponential expected runtime bounds are known only for impracticable variants. However, Spielman and Teng were able to show that even small random perturbations of the coefficients of a linear program suffice to make the expected run time of the simplex algorithm polynomial [53].
This concept of smoothed analysis is a generalization of average case analysis
and an interesting tool of AE also outside the simplex algorithm. Beier and
Vöcking were able to show polynomial smoothed complexity for an important
family of NP-hard problems [7]. For example, this result explains why the knapsack
problem can be efficiently solved in practice and has also helped to improve the
best knapsack solvers. There are interesting interrelations between smoothed
complexity, approximation algorithms, and pseudopolynomial algorithms, which together form an interesting approach to practical solutions of NP-hard problems.
1.5 Implementation
Implementation only appears to be the most clearly prescribed and boring activity in
the cycle of AE. One reason is that there are huge semantic gaps between abstractly
formulated algorithms, imperative programming languages, and real hardware. A
typical example for this semantic gap is the implementation of an O(nm log n)
matching algorithm in [36]. Its abstract description requires a sophisticated data
structure whose efficient implementation only succeeded in [36].
An extreme example for the semantic gap are geometric algorithms which
are often designed assuming exact arithmetics with real numbers and without
considering degenerate cases. The robustness of geometric algorithms has therefore
become an important branch of AE [12, 8, 6].
Even the implementation of relatively simple basic algorithms can be challeng-
ing. You often have to compare several candidates based on small constant factors
in their execution time. Since even small implementation details can make a big
difference, the only reliable way is to highly tune all competitors and run them on
several architectures. It can even be advisable to compare the generated machine
code (e.g. [47, 51],[11]).
Often only implementations give convincing evidence of the correctness and
result quality of an algorithm. For example, an algorithm for planar embedding
[23] was the standard reference for 20 years although this paper only contains
a vague description of how an algorithm for planarity testing can be generalized.
Several attempts at a more detailed description contained errors (e.g. [32]). This
was only noticed during the first correct implementation [33]. Similarly, for a long
time nobody succeeded in implementing the famous algorithm for computing three-
connected components from [24]. Only an implementation in 2000 [22] uncovered
and corrected an error. For the related problem of computing a maximal planar
subgraph there was a series of publications in prominent conferences uncovering
errors in the previous paper and introducing new ones – until it turned out that the
proposed underlying data structure is inadequate for the problem [26].
An important consequence for planning AE projects is that important imple-
mentations cannot usually be done as bachelor or master theses but require the
very best students or long term attendance by full time researchers or scientific
programmers.
1.6 Experiments
Meaningful experiments are the key to closing the cycle of the AE process. For
example, experiments on crossing minimization in [27] showed that previous
theoretical results were too optimistic so that new algorithms became interesting.
Experiments can also have a direct influence on the analysis. For example,
reconstructing a curve from a set of measured points is a fundamental variant of an
important family of image processing problems. In [4] an apparently expensive
method based on the travelling salesman problem is investigated. Experiments
indicated that “reasonable” inputs lead to easy instances of the travelling salesman
problem. This observation was later formalized and proven. A quite different
example of the same effect is the astonishing observation that arbitrary access
patterns to data blocks on disk arrays can be almost perfectly balanced when two
redundant copies of each block are placed on random disks [49].
Compared to the natural sciences, AE is in the privileged situation that it can
perform many experiments with relatively little effort. However, the other side of
the coin is highly nontrivial planning, evaluation, archiving, postprocessing, and
interpretation of results. The starting point should always be falsifiable hypotheses
on the behavior of the investigated algorithms which stem from the design, analysis,
implementation, or from previous experiments. The result is a confirmation,
falsification, or refinement of the hypothesis. The results complement the analytic
performance guarantees, lead to a better understanding of the algorithms, and
provide ideas for improved algorithms, more accurate analysis, or more efficient
implementation.
Successful experimentation involves a lot of software engineering. Modular
implementations allow flexible experiments. Clever use of tools simplifies the eval-
uation. Careful documentation and version management help with reproducibility
– a central requirement of scientific experiments, which is challenging due to the frequent new versions of software and hardware.
1.8 Instances

Collections of realistic problem instances for benchmarking have proven crucial for improving algorithms. There are interesting collections for a number of NP-hard problems like the travelling salesman problem (see http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/), the Steiner tree problem, satisfiability, set covering, or graph partitioning. In particular for the first three problems the benchmarks have helped enable astonishing breakthroughs. Using deep mathematical insights into the structure of the problems one can now compute optimal solutions even for large, realistic instances of the travelling salesman problem [5] and of the Steiner tree problem [44]. It is a bit odd that similar benchmarks for problems that are polynomially solvable are sometimes more difficult to obtain. For route planning in road networks, realistic inputs became available in 2005 [50], enabling a revolution with speedups of up to six orders of magnitude over Dijkstra’s algorithm and a perspective for many applications [14]. In string algorithms and data compression, real-world data is also no problem. But for many typical graph problems like flows, random inputs are still common.
1.9 Applications
We could discuss many important applications where algorithms play a major role
and a lot of interesting work remains to be done. Since this would go beyond
the scope of this paper, we only want to mention a few: Bioinformatics (e.g.
sequencing, folding, docking, phylogenetic trees, DNA chip evaluations, reaction
networks); information retrieval (indexing, ranking); algorithmic game theory;
traffic information, simulation and planning for cars, buses, trains, and air traffic;
geographic information systems; communication networks; machine learning; real
time scheduling.
The effort for implementing algorithms for a particular application usually
lies somewhere between the effort for experimental evaluation and for algorithm
libraries depending on the context.
An important goal for AE should be to help shape the applications (as in the example of genome sequencing mentioned in the introduction) rather than act as an ancillary science for other disciplines like physics, biology, mechanical engineering, …
1.10 Conclusions
We hope to have demonstrated that AE is a “round” methodology for the devel-
opment of efficient algorithms which simplifies their practical use. We want to
stress, however, that it is not our intention to abolish algorithm theory. The saying
that there is nothing as practical as good theory remains true for algorithmics
because an algorithm with proven performance guarantees has a degree of gen-
erality, reliability, and predictability that cannot be obtained with any number of
experiments. However, this does not contradict the proposed methodology since it views algorithm theory as a subset of AE, making it even richer by asking additional interesting kinds of questions (e.g. simplicity of algorithms, care for constant factors, …).

Chapter 5

Maximum Flows and Matchings
The maximum flow problem is very easy to state: In a capacitated network, we wish
to send as much flow as possible between two special nodes, a source node s and a
sink node t, without exceeding the capacity of any edge. In this chapter we discuss
a number of algorithms for solving the maximum flow problem. These algorithms
are of two types:
1. Augmenting path algorithms which incrementally augment flow along
paths from the source node to the sink node.
2. Preflow-push algorithms that flood the network so that some nodes have
excesses (or buildup of flow). These algorithms incrementally relieve flow
from nodes with excesses by sending flow from the node forward toward the
sink node or backward toward the source node.
Definition 1
We define a network to be a directed weighted graph G = (V, E) with distinct
vertices s and t in G and the restriction that s has no incoming edges and t has
no outgoing edges. We call s a source node and t a sink node. For an edge e, we
call its weight $c_e$ the capacity of e (nonnegative!). An (s, t)-flow or simply flow is a function $f: E \to \mathbb{R}_{\geq 0}$ ($f_e := f(e)$) satisfying the capacity constraints and the flow conservation constraints:

1. $0 \leq f_e \leq c_e \quad \forall e \in E$

2. $\sum_{e:\, \mathrm{source}(e)=v} f_e \;=\; \sum_{e:\, \mathrm{target}(e)=v} f_e \quad \forall v \in V \setminus \{s, t\}$
The capacity constraints state that the flow across any edge is bounded by its
capacity, and the latter constraints state that for each node different from s and
t, the total incoming flow equals the total outgoing flow. We further define the
value of a flow as $\mathrm{val}(f) := \sum_{e:\,\mathrm{source}(e)=s} f_e$, the total outgoing flow from s. Our goal is to find a flow with
maximum value. An example can be found in Figure 5.1.
Figure 5.1: An example network in which capacities are marked black and the current flow values on the edges are marked blue. The total flow value is 18, which is maximal. We also see a cut in the network induced by the colors.
5.2 Augmenting Path Algorithms

The basic idea behind augmenting path algorithms is to iteratively identify a path
from s to t such that each edge has some spare capacity. On this path, the edges
with the smallest spare capacity are saturated, i.e. the flow along this path is
incremented by this capacity. This way the flow is incrementally augmented along
paths from the source node to the sink node. To be more precise we need the
definition of a residual graph:
Definition 2
Given a network G = (V, E, c) and a flow f , we define the residual network to be
Gf = (V, Ef , cf ). The residual network Gf with respect to a flow f has the same
node set as G. Every edge of Gf is induced by an edge of G and has so-called
residual capacity. Let e = (u, v) be an arbitrary edge of G. If f (e) < ce then e
is also an edge of Gf. Its residual capacity is $c^f_e = c_e - f(e)$. If f(e) > 0 then $e^{\mathrm{rev}} = (v, u)$ is an edge of Gf. Its residual capacity is $c^f_{e^{\mathrm{rev}}} = f(e)$. Note that both can be the case, i.e. e ∈ Ef and $e^{\mathrm{rev}}$ ∈ Ef. An example is given in Figure 5.2.
In other words, the residual capacity $c^f_{e=(u,v)}$ of an edge e is the maximum additional
flow that can be sent from node u to node v using the edges (u, v) and (v, u).
Sending flow from u to v via the edge (v, u) means to decrease the flow that is on
this edge, i.e. decreasing the flow from v to u.
Figure 5.2: On the left hand side we see the original graph with a flow f shown (capacities black, flow blue). On the right hand side we see the corresponding residual network. In this graph the highlighted path is an augmenting path with residual capacity $\Delta_f = \min\{4, 2, 10, 4, 4\} = 2$.
We now describe one of the simplest and most intuitive algorithms for solving the maximum flow problem. This algorithm is known as the augmenting path algorithm or Ford–Fulkerson algorithm. We refer to a directed path from the source
to the sink in the residual network as an augmenting path. We define the residual
capacity ∆f of an augmenting path p as the minimum residual capacity of any
edge in the path. An example for an augmenting path can be found in Figure 5.2.
Observe that, by definition, the capacity of an augmenting path is always positive.
Consequently, whenever the network contains an augmenting path, we can send
additional flow from the source to the sink. The augmenting path algorithm is
essentially based on this simple observation. The algorithm proceeds by identifying
augmenting paths and augmenting the flow along these paths until the network contains no such path. Pseudocode for this algorithm can be found in Figures 5.3 and 5.4. For integer capacities, the algorithm runs in $O(m \cdot \mathrm{val}(f))$ time. An example run of
the algorithm can be found in Figure 5.5.
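To make the description concrete, the following is a minimal, illustrative C++ sketch of the augmenting path algorithm with BFS path selection (the Edmonds–Karp variant). It is not the pseudocode of Figures 5.3/5.4; all names and the adjacency representation are our own choices. Forward and reverse edges are stored pairwise so that edges[id ^ 1] is the reverse of edges[id], keeping the residual network implicit:

#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

struct MaxFlow {
    struct Edge { int to; int64_t cap; };      // cap = remaining residual capacity
    std::vector<Edge> edges;                   // edges[id ^ 1] is the reverse of edges[id]
    std::vector<std::vector<int>> adj;         // adjacency lists of edge ids
    explicit MaxFlow(int n) : adj(n) {}

    void addEdge(int u, int v, int64_t c) {
        adj[u].push_back((int)edges.size()); edges.push_back({v, c});
        adj[v].push_back((int)edges.size()); edges.push_back({u, 0}); // reverse edge
    }

    int64_t run(int s, int t) {
        int64_t flow = 0;
        for (;;) {
            std::vector<int> pred(adj.size(), -1);   // edge id used to reach a node
            std::queue<int> q; q.push(s);
            while (!q.empty() && pred[t] == -1) {    // BFS in the residual network
                int u = q.front(); q.pop();
                for (int id : adj[u]) {
                    int v = edges[id].to;
                    if (edges[id].cap > 0 && v != s && pred[v] == -1) {
                        pred[v] = id; q.push(v);
                    }
                }
            }
            if (pred[t] == -1) return flow;          // no augmenting path left
            // residual capacity of the path = minimum residual edge capacity
            int64_t delta = INT64_MAX;
            for (int v = t; v != s; v = edges[pred[v] ^ 1].to)
                delta = std::min(delta, edges[pred[v]].cap);
            for (int v = t; v != s; v = edges[pred[v] ^ 1].to) {
                edges[pred[v]].cap -= delta;         // push delta units forward
                edges[pred[v] ^ 1].cap += delta;     // and grow the reverse capacity
            }
            flow += delta;                           // augment by delta
        }
    }
};

The pairwise edge storage is the standard trick used by most flow implementations: updating a residual capacity and its reverse is O(1) via the index xor.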
Figure 5.5: An example run of the augmenting path algorithm on the network of Figure 5.1; the flow is augmented by 10, 4, 2, and 2 along four successive augmenting paths, giving val(f) = 18.
Definition 3
An s-t-cut is a partition of V into S and T = V \S with s ∈ S and t ∈ T . The
capacity of this cut is the sum of the weights of all edges starting in S and ending
in T :
$$c(S) := c(S, T = V \setminus S) := \sum_{u \in S,\, v \in T} c_{(u,v)}$$
The next lemma relates flows and cuts: the capacity of any (s, t)-cut is an upper
bound for the value of any (s, t)-flow. Conversely, the value of any (s, t)-flow is a
lower bound for the capacity of any (s, t)-cut.
Lemma 1
Let f be any (s, t)-flow and let S be any (s, t)-cut. Then
val(f) ≤ c(S).
Proof.
By flow conservation, val(f) equals the net flow across the cut:
$$\mathrm{val}(f) = \sum_{(u,v) \in E \cap (S \times T)} f_{(u,v)} - \sum_{(v,u) \in E \cap (T \times S)} f_{(v,u)} \;\leq\; \sum_{(u,v) \in E \cap (S \times T)} c_{(u,v)} = c(S).$$
Lemma 3
The algorithm of Ford and Fulkerson is correct.
Proof.
Clearly the algorithm computes a feasible flow f (invariant). At termination of the algorithm, there is no augmenting path left in Gf. Let S be the set of nodes reachable from s in Gf and T = V \ S. Then $c^f_{(u,v)} = 0$ for all (u, v) ∈ E ∩ (S × T) and hence f(v,u) = 0. That means that the cut S is saturated and hence S proves the maximality of f.
Figure 5.6: By iteratively augmenting the flow along the two blue paths, we need val(f) = 200 path augmentations.
Figure 5.7: A network on nodes s, a, b, c, d, t with an irrational capacity $r = \frac{\sqrt{5}-1}{2}$. Let $p_0 = \langle s, c, b, t\rangle$, $p_1 = \langle s, a, b, c, d, t\rangle$, $p_2 = \langle s, c, b, a, t\rangle$, $p_3 = \langle s, d, c, b, t\rangle$. Then the sequence of augmenting paths $p_0 (p_1, p_2, p_1, p_3)^*$ is an infinite sequence of positive flow augmentations and the flow value does not converge to the maximum value 9.
Dinic’s algorithm works in phases. It computes exact distance labels d(v) – the distance from v to the sink t in Gf – and forms the layered network Lf, which consists of those edges (u, v) in Gf that satisfy the condition d(u) = d(v) + 1. Observe that by definition every path from the source to the sink in the layered network Lf is a shortest path in Gf. Observe further that some edge in Lf might not be contained in any path from the source to the sink. Dinic’s algorithm now computes a blocking flow fb in Lf.
Definition 5
A blocking flow fb in Lf is a flow such that each path $p = \langle s, \dots, t \rangle$ in Lf contains a saturated edge e ∈ p, i.e. $f_b(e) = c^f(e)$.
When a blocking flow fb has been established in a network, Dinic’s algorithm adds fb to f, recomputes the exact distance labels, forms a new layered network, and repeats these computations. The algorithm terminates when, while forming the new layered network, it finds that the source is not connected to the sink. It is possible to show that every time Dinic’s algorithm forms a new layered network, the distance label of the source node strictly increases. Consequently, Dinic’s algorithm forms at most n layered networks and runs in O(n²m) time, since, as shown below, a blocking flow can be computed in O(nm). Pseudocode of this algorithm can be found in Figure 5.8.
We now explain how to compute blocking flows in the layered network Lf. The main idea is to repeatedly use depth first search to find augmenting paths in the layered network. The augmentation of flow along an edge (u, v) reduces the residual capacity of edge (u, v) and increases the residual capacity of the reverse edge (v, u). However, each edge of the layered network is admissible, and therefore Dinic’s algorithm does not add reverse edges to the layered network. Consequently, the length of every augmenting path is d(s), and in an augmenting path we have d(u) = d(v) + 1 for every edge (u, v), i.e. the level decreases with each edge. These facts allow us to determine an augmenting path in the layered network, on average, in O(n) time. Pseudocode for the algorithm can be found in Figure 5.9.

Figure 5.9: Pseudocode for computing a blocking flow. The main idea is to repeatedly use DFS to find augmenting paths.
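Since Figures 5.8 and 5.9 are not reproduced here, the following is a compact, illustrative C++ sketch of Dinic’s algorithm (our own code, not the lecture’s pseudocode). For simplicity it labels nodes by BFS distance from the source rather than to the sink; the two conventions are symmetric. The DFS advances only along edges going from one layer to the next (“extend”), backtracks at dead ends (“retreat”), augments when it reaches t (“breakthrough”), and the per-node iterator it[] ensures every edge is abandoned at most once per phase:

#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

struct Dinic {
    struct Edge { int to; int64_t cap; };     // cap = remaining residual capacity
    std::vector<Edge> edges;                  // edges[id ^ 1] is the reverse of edges[id]
    std::vector<std::vector<int>> adj;
    std::vector<int> dist, it;                // layer labels and per-node DFS iterators
    explicit Dinic(int n) : adj(n), dist(n), it(n) {}

    void addEdge(int u, int v, int64_t c) {
        adj[u].push_back((int)edges.size()); edges.push_back({v, c});
        adj[v].push_back((int)edges.size()); edges.push_back({u, 0});
    }

    bool bfs(int s, int t) {                  // build the layered network
        std::fill(dist.begin(), dist.end(), -1);
        std::queue<int> q; dist[s] = 0; q.push(s);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int id : adj[u])
                if (edges[id].cap > 0 && dist[edges[id].to] == -1) {
                    dist[edges[id].to] = dist[u] + 1; q.push(edges[id].to);
                }
        }
        return dist[t] != -1;                 // false: source no longer connected to sink
    }

    int64_t dfs(int u, int t, int64_t f) {    // find one augmenting path ("extend")
        if (u == t) return f;                 // "breakthrough"
        for (int &i = it[u]; i < (int)adj[u].size(); ++i) {
            int id = adj[u][i]; Edge &e = edges[id];
            if (e.cap > 0 && dist[e.to] == dist[u] + 1)
                if (int64_t d = dfs(e.to, t, std::min(f, e.cap))) {
                    e.cap -= d; edges[id ^ 1].cap += d; return d;
                }
        }
        return 0;                             // "retreat": dead end in this phase
    }

    int64_t maxflow(int s, int t) {
        int64_t flow = 0;
        while (bfs(s, t)) {                   // at most n phases
            std::fill(it.begin(), it.end(), 0);
            while (int64_t d = dfs(s, t, INT64_MAX)) flow += d;  // blocking flow
        }
        return flow;
    }
};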
Theorem 4
On a unit network, i.e. with unit edge capacities, Dinic’s algorithm makes at most O(√m) rounds.
Proof.
Consider the layered network after √m iterations. The distance label of the source is then at least √m, so there are at least √m layers, while the layered network has at most m edges. The average number of edges between two consecutive layers is therefore at most m/√m = √m. Hence, there must be at least one layer i with at most √m edges. Now consider the cut S = {v | d(v) < i}. It has capacity at most √m. Since each blocking flow augmentation increases the flow by at least one, after √m additional blocking flow augmentations a maximum flow has been found. Hence, no more than 2√m = O(√m) blocking flow augmentations are needed.
Figure: An example run of the blocking flow computation on a layered network, showing the extend, retreat, and breakthrough steps of the DFS.
5.3 Preflow-Push Algorithms

Figure: A network in which the source s is connected to a node u by a common chain of l edges, and u is connected to the sink t by l different unit-capacity paths of length 2; every augmenting path must traverse the same s–u chain.

Had we first sent l units of flow from s to node u and then sent 1 unit of flow along l different paths of length 2, we would have saved the repetitive computations in traversing the common set of edges. This is the essential idea underlying the preflow-push algorithms.
Definition 6
A preflow f is a flow where the flow conservation constraint is relaxed to
$$\mathrm{excess}(v) := \underbrace{\sum_{(u,v) \in E} f_{(u,v)}}_{\text{inflow}} \;-\; \underbrace{\sum_{(v,w) \in E} f_{(v,w)}}_{\text{outflow}} \;\geq\; 0\,.$$
This implies that for a preflow the total inflow may exceed the total outflow. A node v is called active if and only if excess(v) > 0. As an exception, s and t are never active. The residual network Gf with respect to a preflow f is defined as in the case of a flow.
Procedure genericPreflowPush(G = (V, E), f)
    forall e = (s, v) ∈ E do push(e, cap(e))    // saturate all source edges
    d(s) := n
    d(v) := 0 for all other nodes
    while ∃ v ∈ V \ {s, t} : excess(v) > 0 do   // active node
        if ∃ e = (v, w) ∈ Ef : d(w) < d(v) then // eligible edge
            choose some δ ≤ min{excess(v), $c^f_e$}
            push(e, δ)                           // no new steep edges
        else d(v)++                              // relabel; no new steep edges

The obvious choice for δ is min{excess(v), $c^f_e$}. We call an edge e = (v, w) ∈ Gf steep if d(w) < d(v) − 1, i.e., if it reaches down by two or more levels.
It might be instructive to visualize the generic preflow-push algorithm in terms
of a physical network: edges represent flexible water pipes, nodes represent joints
and the distance function measures how far nodes are above the ground. In this
network we wish to send water from the source to the sink. In addition, we visualize
pushed flow as water flowing downhill. Initially, we move the source node upward,
and water flows to its neighbors. In general, water flows downhill towards the sink;
however, occasionally flow becomes trapped locally at a node that has no downhill
neighbors. At this point we move the node upward, and again water flows downhill
toward the sink. As we continue to move nodes upward, the remaining excess
eventually flows back toward the source. The algorithm terminates when all the
water flows either into the sink or flows back to the source.
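For concreteness, here is an illustrative C++ sketch of the generic preflow-push procedure above, using the FIFO rule for selecting active nodes and δ = min{excess(v), c^f_e} (data structures and names are our own choices, not code from the lecture):

#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

struct PreflowPush {
    struct Edge { int to; int64_t cap; };
    std::vector<Edge> edges;                 // edges[id ^ 1] is the reverse of edges[id]
    std::vector<std::vector<int>> adj;
    std::vector<int64_t> excess;
    std::vector<int> d;                      // distance labels
    std::queue<int> active;                  // FIFO rule for active nodes
    int n;
    explicit PreflowPush(int n) : adj(n), excess(n), d(n, 0), n(n) {}

    void addEdge(int u, int v, int64_t c) {
        adj[u].push_back((int)edges.size()); edges.push_back({v, c});
        adj[v].push_back((int)edges.size()); edges.push_back({u, 0});
    }

    void push(int id, int64_t delta) {       // move delta units across edge id
        edges[id].cap -= delta; edges[id ^ 1].cap += delta;
        excess[edges[id ^ 1].to] -= delta;   // tail loses excess,
        excess[edges[id].to] += delta;       // head gains excess
    }

    int64_t run(int s, int t) {
        d[s] = n;
        for (int id : adj[s])                // saturate all edges leaving s
            if (edges[id].cap > 0) {
                push(id, edges[id].cap);
                if (edges[id].to != t) active.push(edges[id].to);
            }
        while (!active.empty()) {
            int v = active.front(); active.pop();
            while (excess[v] > 0) {          // discharge v
                bool pushed = false;
                for (int id : adj[v]) {
                    Edge &e = edges[id];
                    if (e.cap > 0 && d[e.to] < d[v]) {          // eligible edge
                        int64_t delta = std::min(excess[v], e.cap);
                        bool wasInactive = (excess[e.to] == 0);
                        push(id, delta); pushed = true;
                        if (wasInactive && e.to != s && e.to != t) active.push(e.to);
                        if (excess[v] == 0) break;
                    }
                }
                if (!pushed) ++d[v];         // relabel
            }
        }
        return excess[t];                    // value of the maximum flow
    }
};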
In the following we show that an arbitrary preflow-push algorithm finds a maximum flow in time O(n²m). We divide this into two steps. The first step shows that if the algorithm terminates then it has computed a maximum flow. Then we bound the running time of the algorithm.
Lemma 5
The algorithm maintains a preflow and does not generate steep edges. The nodes s and t stay on levels n and 0, respectively.
Proof.
The algorithm clearly maintains a preflow. After the initialization, each edge in Gf either connects two nodes on level zero or connects a node on level zero to node s on level n. Thus, there are no steep edges (there are not even any eligible edges). A relabeling of a node v does not create a steep edge since a node is only relabeled if there are no eligible edges out of it. A push across an edge e = (v, w) ∈ Gf may add the edge (w, v) to Gf. However, this edge is not even eligible. Only active nodes are relabeled and only nodes different from s and t can be active. Thus s and t stay on levels n and 0, respectively.

The preceding lemma has an interesting interpretation. Since there are no steep edges,
any path from v to t must have length (=number of edges) at least d(v) and any
path from v to s must have length at least d(v) − n. Thus, d(v) is a lower bound
on the distance from v to t and d(v)−n is a lower bound on the distance from v to s.
The next lemma shows that active nodes can always reach s in the residual
network (since they must be able to send their excess back to s). It has the important
consequence that d-labels are bounded by 2n − 1.
Lemma 6
If v is active then there is a path from v to s in Gf.
Proof.
The following is an equivalent invariant: there is an s–v path with nonzero flow to any active node v. Suppose now that push((u, w), δ) would destroy this invariant for a path ⟨s, . . . , w, u, . . . , v⟩. Then the path ⟨s, . . . , u, . . . , v⟩ reestablishes it.
Lemma 7
No distance label ever reaches 2n (∀v ∈ V : d(v) < 2n).
Proof.
Assume that a node v is moved to level 2n = d(v). Since only active nodes are relabeled, the lemma above implies the existence of a path (and hence a simple path) in Gf from a node on level 2n to s (which is on level n). This path has at most n − 1 edges. Since the algorithm does not generate steep edges, each edge of the path descends by at most one level, so d(v) ≤ d(s) + n − 1 = 2n − 1 < 2n, which is a contradiction.
Lemma 8
When genericPreflowPush terminates f is a maximum flow.
Proof.
When the algorithm terminates, there are no active nodes, i.e. ∀v ∈ V \ {s, t} :
excess(v) = 0, and hence the algorithm terminates with a flow. Call it f . In Gf
there can be no path from s to t since any such path must contain a steep edge
(since s is on level n, t is on level 0). Thus, f is a maximum flow (Max-Flow
Min-Cut Theorem).
Lemma 9
There are at most 2n² relabels.
Proof.
No distance label ever reaches 2n (∀v ∈ V : d(v) < 2n). That means each node v is relabeled at most 2n times. Hence, we have at most |V| · 2n = 2n² relabel operations.
Lemma 10
There are at most nm saturating pushes.
Proof.
A saturating push across an edge e = (v, w) ∈ Gf removes e from Gf. We claim that v has to be relabeled at least twice before the next push across e and hence there can be at most n saturating pushes across any edge. To see the claim, observe that only a push across $e^{\mathrm{rev}}$ can add e to Gf again. Since pushes occur only across eligible edges, w must be relabeled at least twice after the saturating push across e and before the next push across $e^{\mathrm{rev}}$. Similarly, it takes two relabels of v before e becomes eligible again.
It is more difficult to bound the number of non-saturating pushes. It depends
heavily on which active node is selected for pushing and on which edge is selected
for pushing. Recall that a non-saturating push reduces excess(v) to zero, i.e.
deactivates a node.
Lemma 11
The number of non-saturating pushes is O(n²m) if always δ = min{excess(v), $c^f_e$} is chosen.
We will show, using a potential function φ, that:
1. φ ≥ 0, and φ = 0 initially.
…
Theorem 13
The arbitrary preflow-push algorithm finds a maximum flow in time O(n²m).
Proof.
Lemma 8 gives us partial correctness. The time for initialization is bounded by O(n + m). To maintain the set of active nodes we use, for example, a stack or a FIFO queue, and we use reverse edge pointers to implement push. Lemma 9 bounds the number of relabel operations by 2n². The number of saturating pushes is at most nm (Lemma 10) and the number of non-saturating pushes is in O(n²m) (Lemma 11). Lemma 12 states that the overall search time for eligible edges is in O(nm). Summing up yields a total time in O(n²m).

There are multiple ways to improve this running time. The first one is the FIFO rule: the active nodes are kept in a queue and the first node in the queue is always selected. When a node is relabeled or activated it is added to the rear of the queue. The number of non-saturating pushes is O(n³) when the FIFO rule is used. This bound is due to Goldberg.
Another variant follows the Highest-Level rule: an active node on the highest level, i.e., with the maximal d-value, is selected. Observe that when a maximal level active node is relabeled, it will be the unique maximal active node after the relabel. Thus, this rule guarantees that when a node is selected, pushes out of the node will be performed until the node becomes inactive. The number of non-saturating pushes is O(n²√m) when the Highest-Level rule is used. This rule can be implemented using a bucket priority queue.
Lemma 14
When the Highest-Level rule is used, the number of non-saturating pushes is O(n²√m).
Proof.
Again we use a potential function argument. Let K = √m; this choice of K will become clear at the end of the proof. For a node v, let φ(v) denote its contribution to the potential φ. One shows:
1. there are at most 4n² phases and at most 4n²K non-saturating pushes in all cheap phases together;
…
Suppose now that we have shown (1) to (5). (3) and (4) imply that the total increase of φ is at most (2n² + mn)n/K and hence the total decrease can be at most this number plus n²/K by (2). The number of non-saturating pushes in expensive phases is therefore bounded by (2n³ + n² + mn²)/K. Together with (1) we conclude that the total number of non-saturating pushes is at most 4n²K + (2n³ + n² + mn²)/K = O(n²√m) for our choice K = √m.
What is the best case running time of the algorithm? The running time is Ω(n²) if Ω(n) nodes need to be lifted above level n, which is usually the case. The best case behavior of the other parts of the algorithm is O(m) and hence the cost of relabeling dominates the best case running time. In this section we describe several heuristics that frequently reduce the time spent in relabeling nodes and, as a side-effect, reduce the time spent in all other operations. The heuristics turn the preflow-push algorithm into a highly effective algorithm for solving flow problems.
The Local Relabeling Heuristic: The local relabeling heuristic applies whenever a node v is relabeled. It increases the dist-value of v to
$$d(v) := 1 + \min\{\, d(w) : (v, w) \in G_f \,\}.$$
Observe that v is active whenever it is relabeled and that an active node has at least one outgoing edge in Gf. The expression above is therefore well defined. When v is relabeled, none of the outgoing edges is eligible and hence d(w) ≥ d(v) for all (v, w) ∈ Gf. Thus the local relabeling heuristic increases d(v) by at least one; it may increase it by more than one. The correctness of the heuristic follows from the following alternative description: when a node is relabeled, continue to relabel it until there is an eligible edge out of it.
The Global Relabeling Heuristic: The global relabeling heuristic updates the distance values of all nodes. It sets d(v) := Gf.reverseBFS(t) for all nodes that can reach t in Gf (nodes that cannot reach t get another d-value, which is not defined here but in the LEDA book of Mehlhorn/Näher). A breadth-first search requires time O(m). The heuristic should therefore not be applied too frequently. We apply it initially and after every O(m) edge inspections. In this way Ω(m) time is spent between applications of the global relabeling heuristic and hence the worst case running time is increased by at most a constant factor. The best case can improve significantly.
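As an illustration, the reverse BFS can be sketched on top of the hypothetical PreflowPush struct from above (reusing its includes). The sketch is simplified relative to the LEDA treatment: nodes that cannot reach t simply keep their old labels, which is safe here because labels are only ever increased.

// Global relabeling heuristic: recompute d(v) as the exact BFS distance
// from v to t in the residual network (reverse BFS from t). Labels only
// increase, so no steep edges are created.
void globalRelabel(PreflowPush &pp, int t) {
    std::vector<int> dist(pp.n, -1);
    std::queue<int> q; dist[t] = 0; q.push(t);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int id : pp.adj[v]) {
            // edge id goes v -> u; if its reverse has residual capacity,
            // then (u, v) is an edge of the residual network Gf
            int u = pp.edges[id].to;
            if (pp.edges[id ^ 1].cap > 0 && dist[u] == -1) {
                dist[u] = dist[v] + 1; q.push(u);
            }
        }
    }
    for (int v = 0; v < pp.n; ++v)            // never decrease a label
        if (dist[v] != -1 && v != t) pp.d[v] = std::max(pp.d[v], dist[v]);
}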
Figure 5.14: An example run of the arbitrary preflow-push algorithm (12 pushes).
5.4 Minimum Cost Flows

In the minimum cost flow problem every edge e additionally has a cost cost(e) per unit of flow and every node v a supply supply(v), and we look for a feasible flow of minimum total cost:

minimize $\sum_{e \in E} \mathrm{cost}(e) \cdot f(e)$
subject to:
∀v ∈ V : excess(v) = −supply(v)   // flow conservation constraints
∀e ∈ E : f(e) ≤ c(e)              // capacity constraints
Lemma 15
A feasible flow f is optimal iff there is no cycle C in Gf with $\mathrm{cost}_f(C) = \sum_{e \in C} \mathrm{cost}_f(e) < 0$.
Proof.
not here (see [3], p. 307)
To check feasibility, we reduce to a maximum flow problem:
• Add a vertex s and an edge (s, v) with capacity supply(v) for every node v with supply(v) > 0.
• Add a vertex t and an edge (v, t) with capacity −supply(v) for every node v with supply(v) < 0.
Call the resulting network G∗ and compute a maximum flow f in it. If f saturates the edges leaving s then f is feasible for G. Otherwise there cannot be a feasible flow f′, because f′ could easily be converted into a flow in G∗ with larger value.
Theorem 17
The min-cost flow problem can be solved in time $O(mn \log n + m^2 \log \max_{e \in E} c(e))$.
Proof.
For details take the courses in optimization or network flows.
5.5 Matching Problems

Theorem 18
A maximum weighted matching can be found in time O(nm + n² log n) [Gabow 1992].
Proof.
Not here.
Theorem 19
There is an O(m) time algorithm that finds a matching of weight at least $\max_{\text{matching } M} w(M)/2$, i.e. a 1/2-approximation algorithm [18].
Proof.
The algorithm is as follows:
M′ := ∅
invariant: M′ is a set of simple paths
while E ≠ ∅ do                             // find heavy simple paths
    select any v ∈ V with degree(v) > 0    // select a starting node
    while degree(v) > 0 do                 // extend path greedily
        (v, w) := heaviest edge leaving v  // (*)
        M′ := M′ ∪ {(v, w)}
        remove v from the graph
        v := w
return any matching M ⊆ M′ with w(M) ≥ w(M′)/2
// handle one path at a time, e.g., consider the two ways to take every other edge
We now have to prove the approximation ratio. Let M∗ denote a maximum weight matching. It suffices to show that w(M′) ≥ w(M∗). To this end, we assign each edge to the incident node that is deleted first. All edges e∗ in the maximum weight matching M∗ are assigned to different nodes (otherwise it would not be a matching). Consider any edge e∗ ∈ M∗ and assume it is assigned to node v. Since e∗ is assigned to v, it was still available in line (*), i.e. it could have been selected in this line. Hence, there is an edge e ∈ M′ assigned to v with w(e) ≥ w(e∗).
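A possible C++ rendering of this greedy scheme (an illustrative sketch, not code from the lecture): paths are grown via the heaviest incident edge, and the final “two ways to take every other edge” step is realized by alternately assigning path edges to two candidate matchings and returning the heavier one, which guarantees weight at least w(M′)/2. Since nodes are removed once processed, the total work is O(m):

#include <algorithm>
#include <vector>

struct WEdge { int u, v; double w; };

std::vector<WEdge> halfApproxMatching(int n, const std::vector<WEdge>& edges) {
    std::vector<std::vector<int>> inc(n);        // incident edge ids per node
    for (int i = 0; i < (int)edges.size(); ++i) {
        inc[edges[i].u].push_back(i);
        inc[edges[i].v].push_back(i);
    }
    std::vector<bool> removed(n, false);
    std::vector<WEdge> M[2];                     // the two alternations
    double weight[2] = {0, 0};
    for (int start = 0; start < n; ++start) {
        int v = start, side = 0;
        while (!removed[v]) {
            // heaviest edge leaving v whose other endpoint still exists  (*)
            int best = -1;
            for (int id : inc[v]) {
                int other = (edges[id].u == v) ? edges[id].v : edges[id].u;
                if (!removed[other] && other != v &&
                    (best == -1 || edges[id].w > edges[best].w))
                    best = id;
            }
            removed[v] = true;                   // remove v from the graph
            if (best == -1) break;               // degree(v) = 0: path ends
            M[side].push_back(edges[best]);      // alternate sides along the path
            weight[side] += edges[best].w;
            side ^= 1;
            v = (edges[best].u == v) ? edges[best].v : edges[best].u;
        }
    }
    return weight[0] >= weight[1] ? M[0] : M[1]; // the heavier alternation
}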
Lemma 20
There is an $O(m \log \frac{1}{\varepsilon})$ time algorithm that computes a $(\frac{2}{3} - \varepsilon)$-approximation of the maximum weight matching. It has the following outline:
M := ∅
repeat $\Theta(n \log \frac{1}{\varepsilon})$ times: …

We now turn to maximum cardinality matching in bipartite graphs. Given a bipartite graph G = (L ∪ R, E), we construct the flow network
$$G' = \bigl(\{s, t\} \cup L \cup R,\; \{(s, l) : l \in L\} \cup E_{L \to R} \cup \{(r, t) : r \in R\}\bigr).$$
Here EL→R are all edges E directed from L to R. The capacity of every edge
is set to one. First notice that if G has a matching of size k then G0 has a flow of
value k. To see this let M be a matching of size k. Now we construct a flow that
sends one unit along each edge in M . This flow has value k.
On the other hand if G0 has a flow of value k then G has a matching of size k.
To prove this consider a flow f of value k. First observe that f is an integral flow.
Thus each edge has flow 1 or 0. Consider the set M of edges from L to R that
have flow 1. M has k edges because the value of the flow is equal to the number of
non-zero flow edges crossing the cut (L ∪ {s}, R ∪ {t}). Furthermore, each vertex
has at most one edge in M incident upon it (since all edges have capacity 1).
Thus, to find a maximum cardinality matching in G, we construct G′, find a maximum flow in G′, and construct the matching M as above. To solve the unit-network max-flow problem on G′ we can use Dinic’s algorithm, which yields an $O((n + m)\sqrt{n})$ algorithm for the maximum cardinality bipartite matching problem.
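As a usage sketch, the reduction can be implemented directly on top of the Dinic class sketched earlier in this chapter (illustrative node numbering: L-nodes are 0..nl−1, R-nodes are nl..nl+nr−1, followed by s and t; all capacities are one):

#include <utility>
#include <vector>

int bipartiteMatching(int nl, int nr,
                      const std::vector<std::pair<int, int>>& edgesLR) {
    Dinic d(nl + nr + 2);
    int s = nl + nr, t = nl + nr + 1;
    for (int l = 0; l < nl; ++l) d.addEdge(s, l, 1);
    for (int r = 0; r < nr; ++r) d.addEdge(nl + r, t, 1);
    for (auto [l, r] : edgesLR) d.addEdge(l, nl + r, 1); // E directed L -> R
    return (int)d.maxflow(s, t);  // size of a maximum matching
}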
Chapter 6

Generic Approaches to Optimization

Please refer to Chapter 12 of the “Algorithms and Data Structures: The Basic Toolbox” book as well as Chapters 15, 16 and 35 of “Introduction to Algorithms” by Cormen, Leiserson, Rivest and Stein. These chapters cover dynamic programming as well as greedy and branch-and-bound algorithms.
Chapter 7
Linear Programming
Please refer to Chapter 12 of the “Algorithms and Data Structures: The Basic Toolbox” book as well as Chapter 29 of “Introduction to Algorithms” by Cormen,
Leiserson, Rivest and Stein.
Chapter 8
Parallel Algorithms
8.1 Introduction
Numerous programming languages and libraries have been developed for explicit
parallel programming. These differ in their view of the address space that they
make available to the programmer, the degree of synchronization imposed on
concurrent activities, and the multiplicity of programs. The message-passing
programming paradigm is one of the oldest and most widely used approaches for
programming parallel computers. Its roots can be traced back to the early days of
parallel processing, and its wide-spread adoption can be attributed to the fact that
it imposes minimal requirements on the underlying hardware.
process). But except for this degenerate case, most processes execute the same
code. SPMD programs can be loosely synchronous or completely asynchronous.
8.3 Communication Cost in Parallel Machines

The time needed to transfer a message between two nodes can be described by the following parameters:

• Startup time Tstart : The startup time is the time required to handle a message
at the sending and receiving nodes. This includes the time to prepare the
message (adding header, trailer, and error correction information), the time to
execute the routing algorithm, and the time to establish an interface between
the local node and the router. This delay is incurred only once for a single
message transfer.
• Per-hop time Thop : After a message leaves a node, it takes a finite amount
of time to reach the next node in its path. The time taken by the header of a
message to travel between two directly-connected nodes in the network is
called the per-hop time. It is also known as node latency. The per-hop time
is directly related to the latency within the routing switch for determining
which output buffer or channel the message should be forwarded to.
• Per-byte transfer time Tbyte : If the channel bandwidth is r bytes per second, then each byte takes time Tbyte = 1/r to traverse the link. This time is called the per-byte transfer time. It includes network as well as buffering overheads.
This implies that in order to optimize the cost of message transfers, we would need
to:
• Communicate in bulk. That is, instead of sending small messages and paying a startup cost for each one, we want to aggregate small messages into a
single large message and amortize the startup latency across a larger message.
This is because on typical platforms such as clusters and message-passing
machines, the value of Tstart is much larger than those of Thop or Tbyte .
While the first two objectives are relatively easy to achieve, the task of minimiz-
ing distance of communicating nodes is difficult, and in many cases an unnecessary
burden on the algorithm designer. This is a direct consequence of the following
characteristics of parallel platforms and paradigms:
• The per-hop time Thop is typically dominated either by the startup latency Tstart for small messages or by the per-byte component ℓ · Tbyte for large messages of length ℓ. Since the maximum number of hops in most networks is relatively small,
the per-hop time can be ignored with little loss in accuracy.
All of these observations point to a simpler cost model in which the cost of transferring a message of ℓ bytes between two nodes on a network is given by
$$T_{\text{comm}} = T_{\text{start}} + \ell \cdot T_{\text{byte}}\,.$$
However, it is important to note that our basic cost model is valid only for
uncongested networks. Architectures have varying thresholds for when they get
congested; i.e., a linear array has a much lower threshold for congestion than a
hypercube. Furthermore, different communication patterns congest a given network
to different extents. Consequently, our simplified cost model is valid only as long
as the underlying communication pattern does not congest the network.
8.4.2 Speedup
When evaluating a parallel system, we are often interested in knowing how much
performance gain is achieved by parallelizing a given application over a sequential
implementation. Speedup is a measure that captures the relative benefit of solving
a problem in parallel. It is defined as the ratio of the time taken to solve a problem
on a single processing element to the time required to solve the same problem on a
parallel computer with p identical processing elements. We denote speedup by the
symbol S.
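In symbols (our notation for the definition above): if $T_{\text{seq}}$ denotes the running time of the best sequential algorithm and $T_{\text{par}}(p)$ that of the parallel algorithm on p processing elements, then
$$S = \frac{T_{\text{seq}}}{T_{\text{par}}(p)}\,.$$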
For a given problem, more than one sequential algorithm may be available, but
all of these may not be equally suitable for parallelization. When a serial computer
is used, it is natural to use the sequential algorithm that solves the problem in the
least amount of time. Given a parallel algorithm, it is fair to judge its performance
with respect to the fastest sequential algorithm for solving the same problem on
a single processing element. Sometimes, the asymptotically fastest sequential
algorithm to solve a problem is not known, or its runtime has a large constant
that makes it impractical to implement. In such cases, we take the fastest known algorithm that would be a practical choice for a serial computer to be the best sequential algorithm.
8.4.4 Efficiency
Only an ideal parallel system containing p processing elements can deliver a
speedup equal to p. In practice, ideal behavior is not achieved because while
executing a parallel algorithm, the processing elements cannot devote 100% of
their time to the computations of the algorithm. Efficiency is a measure of the
fraction of time for which a processing element is usefully employed; it is defined
as the ratio of speedup to the number of processing elements. In an ideal parallel
system, speedup is equal to p and efficiency is equal to one. In practice, speedup is
less than p and efficiency is between zero and one, depending on the effectiveness
with which the processing elements are utilized. We denote efficiency by the
symbol E. Mathematically, it is given by
$$E = \frac{S}{p}$$
8.4.5 Example: Adding n numbers using n processing elements
Each step shown in Figure 8.1 consists of one addition and the communication of a single word. The addition can be performed in some constant time, say Tc, and the communication of a single word can be performed in time Tstart + Tbyte.
Therefore, the addition and communication operations take a constant amount of
time. Thus,
T (p) = Θ(log n)
Since the problem can be solved in Θ(n) time on a single processing element, its
speedup is
S = \Theta\left(\frac{n}{\log n}\right)
From this equation and the preceding definition, the efficiency of the algorithm for
adding n numbers on n processing elements is
E = \frac{\Theta(n/\log n)}{n} = \Theta\left(\frac{1}{\log n}\right)
8.4.6 Example: Adding n numbers using p processing elements
Figure 8.2: Computing the sum of 16 numbers using four processing elements.
the efficiency is
E = \frac{T_{seq}(n)}{p \, (T_{seq}(n/p) + \Theta(\log p))} = \frac{1}{1 + \Theta((p \log p)/n)} = 1 - \Theta\left(\frac{p \log p}{n}\right)
Figure 8.3: Computing the sum of 16 numbers using four processing elements.
8.6 Quicksort
This section examines the quicksort algorithm, which has an average complexity of
Θ(n log n). Quicksort is one of the most common sorting algorithms for sequential
computers because of its simplicity, low overhead, and optimal average complexity.
8.6.1 Sequential Algorithm
Algorithm 2 QUICKSORT(A, q, r)
if q < r then
x := A[q]
s := q
for i := q + 1 to r do
if A[i] ≤ x then
s := s + 1
swap(A[s], A[i])
end if
end for
swap(A[q], A[s])
QUICKSORT(A, q, s − 1)
QUICKSORT(A, s + 1, r)
end if
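For concreteness, here is a direct C++ transcription of Algorithm 2 (our own sketch, not part of the original notes; indices are 0-based and the pivot is excluded from the left recursive call):

#include <utility>
#include <vector>

void quicksort(std::vector<int>& A, int q, int r) {
    if (q >= r) return;
    int x = A[q];                     // pivot: first element of the range
    int s = q;                        // A[q+1..s] holds elements <= pivot
    for (int i = q + 1; i <= r; ++i)
        if (A[i] <= x)
            std::swap(A[++s], A[i]); // grow the "<= pivot" prefix
    std::swap(A[q], A[s]);           // move pivot to its final position s
    quicksort(A, q, s - 1);
    quicksort(A, s + 1, r);
}

Called as quicksort(A, 0, A.size() - 1), this sorts A in expected O(n log n) time.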
subsequences of ⌈n/2⌉ and ⌊n/2⌋ elements. In this case, the run time is given by the recurrence relation T(n) = 2T(n/2) + Θ(n), whose solution is T(n) = Θ(n log n). Such a balanced split yields an optimal algorithm. Although quicksort can have O(n²) worst-
case complexity, its average complexity is significantly better; the average number
of compare-exchange operations needed by quicksort for sorting a randomly-
ordered input sequence is 1.44n log n, which is asymptotically optimal. There
are several ways to select pivots. For example, the pivot can be the median of
a small number of elements of the sequence, or it can be an element selected at
random. Some pivot selection strategies have advantages over others for certain
input sequences.
8.6.2 Parallelizing Quicksort
In a first, naive parallel formulation, the two recursive calls can be executed by different processes. Since one process must partition the original array A[1 . . . n], the run time of this formulation is bounded below by Ω(n).
The main limitation of the previous parallel formulation is that it performs the
partitioning step serially. As we will see in subsequent formulations, performing
partitioning in parallel is essential in obtaining an efficient parallel quicksort. To
see why, consider the recurrence equation T (n) = 2T (n/2) + Θ(n), which gives
the complexity of quicksort for optimal pivot selection. The term Θ(n) is due to
the partitioning of the array. Compare this complexity with the overall complexity
of the algorithm, Θ(n log n). From these two complexities, we can think of the
quicksort algorithm as consisting of Θ(log n) steps, each requiring time Θ(n), namely that of splitting the array. Therefore, if the partitioning step is performed in time
Θ(1), using Θ(n) processes, it is possible to obtain an overall parallel run time
of Θ(log n). Hence, parallelizing the partitioning step has the potential to yield a
significantly faster parallel formulation.
In the previous paragraph, we hinted that we could partition an array of size
n into two smaller arrays in time Θ(1) by using Θ(n) processes. However, this is
difficult for most parallel computing models. The only known algorithms are for
the abstract PRAM models. Because of communication overhead, the partitioning
step takes longer than Θ(1) on realistic shared-address-space and message-passing
parallel computers. In the following section we present a parallel formulation for a
message-passing platform.
8.6.3 Parallel Formulation for a Message-passing Platform
Algorithm 3 theoQSort(d, i, p)
if p = 1 then
return
end if
r := random element from 0 . . . p − 1 {same value in entire partition}
v := d@r {broadcast pivot}
f := (d ≤ v) {1 if d is on left side, 0 otherwise}
j := \sum_{k=0}^{i} f@k {prefix sum, count elements on left side}
p′ := j@(p − 1) {broadcast, result is border index}
if f then
send d to PE j − 1
else
send d to PE p′ + i − j {i − j = \sum_{k=0}^{i} (d@k > v)}
end if
receive d
if i < p′ then
join left partition; qsort(d, i, p′)
else
join right partition; qsort(d, i − p′, p − p′)
end if
Each of the expected Θ(log p) levels of recursion performs a broadcast and a prefix sum costing Θ(log p) each, so the expected running time is T(p) = Θ(log² p).
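The following MPI sketch (our own illustration under simplifying assumptions: exactly one integer d per PE, blocking collectives, and communicator splitting standing in for "joining" a partition) shows how one level of theoQSort maps to message-passing primitives:

#include <cstdlib>
#include <mpi.h>

void qsortLevel(int& d, MPI_Comm comm) {
    int i, p;
    MPI_Comm_rank(comm, &i);
    MPI_Comm_size(comm, &p);
    if (p == 1) return;
    int r = 0;
    if (i == 0) r = std::rand() % p;              // PE 0 draws the pivot PE
    MPI_Bcast(&r, 1, MPI_INT, 0, comm);           // same value in entire partition
    int v = d;
    MPI_Bcast(&v, 1, MPI_INT, r, comm);           // broadcast pivot d@r
    int f = (d <= v) ? 1 : 0;                     // 1 if d belongs to the left side
    int j;
    MPI_Scan(&f, &j, 1, MPI_INT, MPI_SUM, comm);  // inclusive prefix sum
    int pp = j;                                   // p' = j@(p-1): border index
    MPI_Bcast(&pp, 1, MPI_INT, p - 1, comm);
    int dest = f ? j - 1 : pp + i - j;            // target PE of element d
    int nd;
    MPI_Sendrecv(&d, 1, MPI_INT, dest, 0, &nd, 1, MPI_INT,
                 MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
    d = nd;
    MPI_Comm sub;                                 // join left or right partition
    MPI_Comm_split(comm, i < pp ? 0 : 1, i, &sub);
    qsortLevel(d, sub);
    MPI_Comm_free(&sub);
}

Since only PE 0 draws r and broadcasts it, no coordinated seeding of the random number generator is needed.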
8.7 Mergesort
Given k sorted sequences S1, . . . , Sk and a global rank m, we are asked to find splitting positions i1, . . . , ik such that i1 + . . . + ik = m and ∀j, j′ ∈ 1 . . . k : Sj[ij − 1] ≤ Sj′[ij′]. This is similar to selection, although here we are not interested in the element of global rank m, but rather in the splitting positions. We call finding the splitting positions multiway partitioning. Finding the respective element is trivial after multiway partitioning: just take the smallest element right of a split. Merging two sorted sequences S1 and S2 is usually understood as
efficiently coalescing them into one sorted sequence. Its generalization to multiple
sequences S1 , . . . , Sk is called multiway merging.
It is often used instead of repeated binary merging, since it is more I/O- and cache-efficient. The high “branching factor” allows us to perform a considerable amount of work per item. This compensates for the costs of accessing the data in the (shared or external) memory.
Figure 8.5: Schema of parallel multiway merging, in this case merging a specified
number of elements only.
Figure 8.6: Schema of parallel multiway mergesort. For shared memory, the
distribution depicted in the middle row takes place only virtually, i. e. the PEs read
the data from where it is already, there is no copying. The final copying step is
optional, only necessary if the result is expected in the same place as the input.
Chapter 9
External Algorithms
9.1 Introduction
Massive data sets arise naturally in many domains. Spatial data bases of geographic
information systems like GoogleEarth and NASA’s World Wind store terabytes of
geographically-referenced information that includes the whole Earth. In computer
graphics one has to visualize huge scenes using only a conventional workstation
with limited memory. Billing systems of telecommunication companies evaluate
terabytes of phone call log files. One is interested in analyzing huge network
instances like a web graph or a phone call graph. Search engines like Google and
Yahoo provide fast text search in their data bases indexing billions of web pages.
A precise simulation of the Earth's climate needs to manipulate petabytes of data. These examples are only a sample of numerous applications which have to process huge amounts of data.
For economical reasons, it is not feasible to build all of the computer’s memory
of the fastest type or to extend the fast memory to dimensions that could hold all
relevant data. Instead, modern computer architectures contain a memory hierarchy
of increasing size, decreasing speed and costs from top to bottom: On top, we
have the registers integrated in the CPU, a number of caches, main memory and
finally disks, which are often referred to as external memory as opposed to internal
memory.
The internal memories of computers can keep only a small fraction of the large
data sets. During processing applications need to access the external memory (e. g.
hard disks) very frequently. One such access can be about 106 times slower than a
main memory access. Therefore, disk accesses (I/Os) become the main bottleneck.
The reason for this high latency is the mechanical nature of the disk access.
Figure 9.1 shows the schematic construction of a hard disk. The time needed
for finding the data position on the disk is called seek time or (seek) latency and
averages to about 3-10 ms for modern disks. The seek time depends on the surface
data density and the rotational speed and can hardly be reduced because of the
mechanical nature of hard disk technology, which still remains the best way to
store massive amounts of data. Note that after finding the required position on the
surface, the data can be transferred at a higher speed which is limited only by the
surface data density and the bandwidth of the interface connecting CPU and hard
disk. This speed is called sustained throughput and achieves up to 80 MByte/s
nowadays. In order to amortize the high seek latency one reads or writes the data
in chunks (blocks). The block size is balanced when the seek latency is a fraction of the sustained transfer time for the block. Good results are achieved with blocks containing a full track. For older low-density disks of the early 90's the track capacities were
about 16-64 KB. Nowadays, disk tracks have a capacity of several megabytes.
Operating systems implement the so called virtual memory mechanism that
provides an additional working space for an application, mapping an external
memory file (page file) to virtual main memory addresses. This idea supports the
Random Access Machine model in which a program has an infinitely large main
memory with uniform random access cost. Since the memory view is unified in
operating systems supporting virtual memory, the application does not know where
its working space and program code are located: in the main memory or (partially)
swapped out to the page file. For many applications and algorithms with non-linear
access patterns, these remedies are not useful and even counterproductive: the swap file is accessed very frequently; the program code can be swapped out in favor of data blocks; the swap file is highly fragmented and thus many random input/output operations (I/Os) are needed even for scanning.
• I/O complexity: the number of I/O steps should be minimized (the main
metric),
The PDM model has become the standard theoretical model for designing and
analyzing I/O-efficient algorithms.
[Figure: the external memory model: a CPU with fast internal memory of size M, connected to external memory accessed in blocks of size B.]
There are some “golden rules” that can guide the process of designing I/O-efficient algorithms: Unstructured memory access is often very expensive as it comes with 1 I/O per operation whereas we want 1/B I/Os for an efficient algorithm. Instead, we want to scan the external memory, always loading the next due block of size B in one step and processing it internally. An optimal scan will only cost
scan(N) := \Theta\left(\frac{N}{D \cdot B}\right)
I/Os. If the data is not stored in a way that allows linear scanning, we can often use sorting to reorder and then scan it. As we will see in Section 9.4, external sorting can be implemented with
sort(N) := \Theta\left(\frac{N}{D \cdot B} \cdot \log_{M/B} \frac{N}{B}\right)
I/Os.
A simple example of this technique is the following task: We want to reorder
the elements in an array A into an array B using a given “rank” stored in array C.
This should be done in an I/O efficient way.
int[1..N] A, B, C;
for i := 1 to N do B[C[i]] := A[i];
The literal implementation would have worst case costs of Ω(N) I/Os. For N = 10⁶, this would take ≈ 10 000 seconds ≈ 3 hours. Using the sort-and-scan technique, we can lower this to sort(N) and the algorithm would finish in less than a second.
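A minimal in-memory sketch of the sort-and-scan pattern (our own illustration; std::sort stands in for the external multiway merge sort of Section 9.4):

#include <algorithm>
#include <utility>
#include <vector>

std::vector<int> permute(const std::vector<int>& A, const std::vector<int>& C) {
    const std::size_t N = A.size();
    std::vector<std::pair<int, int>> tmp(N);     // (rank, value) pairs
    for (std::size_t i = 0; i < N; ++i)
        tmp[i] = {C[i], A[i]};                   // scan: O(scan(N)) I/Os
    std::sort(tmp.begin(), tmp.end());           // externally: sort(N) I/Os
    std::vector<int> B(N);
    for (std::size_t i = 0; i < N; ++i)
        B[i] = tmp[i].second;                    // scan: write in rank order
    return B;
}

All accesses are now sequential, so an external implementation needs only sort(N) plus O(scan(N)) I/Os instead of up to N random ones.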
An External Stack
In the case of a stack, the buffer is just an internal memory array of 2B elements
that at any time contains the k set elements most recently inserted, where k ≤ 2B.
Remove operations can now be implemented using no I/Os, except for the case
where the buffer has run empty. In this case a single I/O is used to retrieve the
block of B elements most recently written to external memory.
One way of looking at this is that external memory is used to implement a stack
with blocks as data elements. In other words: The “macroscopic view” in external
memory is the same as the “microscopic view” in internal memory.
Returning to external stacks, the above means that at least B remove operations
are made for each I/O reading a block. Insertions use no I/Os except when the
buffer runs full. In this case a single I/O is used to write the B least recent elements
to a block in external memory. Summing up, both insertions and deletions are done in 1/B I/Os, in the amortized sense. This is the best performance we could hope for when storing or retrieving a sequence of data items much larger than internal memory, since no more than B items can be read or written in one I/O. A desired goal in many external memory data structures is that when reporting a sequence of elements, only O(1/B) I/Os are used per element.
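A minimal sketch of such an external stack (our own illustration; the vector "disk" is an in-memory stand-in for the block device, and each call to writeBlock/readBlock represents one I/O):

#include <cassert>
#include <vector>

template <typename T, std::size_t B>
class ExternalStack {
    std::vector<T> buffer;                 // internal buffer, at most 2B items
    std::vector<std::vector<T>> disk;      // stand-in for blocks in external memory
    void writeBlock() {                    // one I/O: spill the B oldest items
        disk.emplace_back(buffer.begin(), buffer.begin() + B);
        buffer.erase(buffer.begin(), buffer.begin() + B);
    }
    void readBlock() {                     // one I/O: fetch the last written block
        buffer.insert(buffer.begin(), disk.back().begin(), disk.back().end());
        disk.pop_back();
    }
public:
    void push(const T& x) {
        if (buffer.size() == 2 * B) writeBlock();  // at most one I/O per B pushes
        buffer.push_back(x);
    }
    T pop() {
        if (buffer.empty() && !disk.empty()) readBlock();
        assert(!buffer.empty());
        T x = buffer.back();
        buffer.pop_back();
        return x;
    }
};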
An External Queue
To implement an efficient queue we use two buffers of size B, a read buffer and
a write buffer. Remove operations work on the read buffer until it is empty, in
which case the least recently written external memory block is read into the buffer.
(If there are no elements in external memory, the contents of the write buffer are transferred to the read buffer.) Insertions are done to the write buffer, which when full is written to external memory. Similar to before, we get at most 1/B I/Os per operation.
9.4 Multiway Merge Sort
External sorting proceeds in two stages: first, sorted runs are formed; then the runs are merged. Merging k runs at a time (instead of two) processes the data in a single pass resulting in a smaller number of merge phases. We only have to keep one block (containing the currently smallest elements) per run in main memory. We maintain a priority queue containing the smallest elements of each run in the current merging step to efficiently keep track of the overall smallest element. Figure 9.4 exemplifies a 4-way merging.
Figure 9.4: Run formation and 4-way merging of the example input “make things as simple as possible but no simpler”: chunks of M elements are sorted internally into runs; during merging, one buffer block per run plus an output block reside in internal memory.
Every element is read/written twice for forming the runs (in blocks of size B)
and twice for every merging phase. Access granularity is blocks. This leads to the
following (asymptotically optimal) total number of I/Os:
\frac{2n}{B}\left(1 + \lceil \log_k \#\text{runs} \rceil\right) = \frac{2n}{B}\left(1 + \left\lceil \log_{M/B} \frac{n}{M} \right\rceil\right) =: \text{sort}(n) \qquad (9.1)
9.5 Internal work of Multiway Mergesort
During run formation, I/O and computation can be overlapped by using two threads and space for two runs in memory: During initialization, runs 1 and 2 are read and i is set to 1. Thread A sorts runs in memory and writes them to disk. Thread B will wait until run i is finished (and thread A works on i + 1) and reads the next run i + 2 into the freed space. The thread doing the more intense work will never wait for the other one.
A similar result can be achieved during the merging step: We have an I/O thread (more, if D > 1) and a merging thread. Merging is I/O bound if and only if y > DB/2, where y is the number of elements merged during one I/O step¹ and DB/2 is the number of elements that can be moved in and out of the merger during one I/O step.
Consider the I/O bound case. We can show that the I/O thread never blocks:
Consider figure 9.5. The region in which the I/O thread would block is colored
dark grey. Here, there are too few elements in the output buffers (writing them to
disk would be inefficient) and the prefetch and merge buffers are still full. The
blocking region can only be reached from the light grey areas: Merge y elements
into the output buffers and either refill the overlap buffers with DB elements²
(left arrow) or write DB items from the output buffers to disk (right arrow). But
these two areas themselves cannot be reached at all by the two possible operations;
therefore, the I/O thread never blocks.
Analogously, the merging thread does not block in the compute bound case.
Here we have two zones of possible blocking: More than 2DB − y elements in
the output buffers or less than kB + y elements in the merge and prefetch buffers³.
Again, the constraints on possible transitions in this diagram lead to a single region
from where the blocking areas can be reached which is itself unreachable.
¹ Note that y ≤ DB in all interesting cases as otherwise we have an I/O-bound algorithm but merging is so fast that there is always enough room to fetch D blocks.
² At the same time, y elements are removed by merging, resulting in a total increase of DB − y.
³ y elements are not sufficient in general because these may not be the smallest; but if there are more than kB elements in merge and prefetch buffers, at least one element can be written to the output buffer.
Figure 9.5: State diagrams of the buffer fill levels (number of elements in the output buffers versus in the merge and prefetch buffers) for the I/O-bound case (left) and the compute-bound case (right); the blocking regions are unreachable.
Overlapping is not always possible. Consider the k runs in figure 9.6. While merging them, there is no work to do for the prefetching thread. Then, the current blocks of each run will run out of elements simultaneously and merging can only continue after these buffers are refilled.
As internal work influences running time, we need a fast solution for the most
compute intense step during merging: A Tournament Tree (or Loser Tree) is a
specialized data structure for finding the smallest element of all runs, see Figure 9.7.
For k = 2K , it is a complete binary tree with K levels, where each leaf contains
the currently smallest element of one run. Each internal node contains the ’loser’
(i.e., the greater) of the ’competition’ between its two child nodes. Above the root
node, we store the global winner along with a pointer to the corresponding run.
After writing this element to the output buffer, we simply have to move the next
element of its run up until there is a new global winner. Compared to general
purpose data structures like binary heaps, we get exactly log k comparisons (no
hidden constant factors). Similar to the implicit search trees we used for Sample
Sort, Tournament Trees can be implemented as arrays where finding the parent
node simply maps to an index shift to the right. The inner loop for moving from
leaf to root can be unrolled and contains predictable load instructions and index
computations allowing exploitation of instruction parallelism, see Algorithm 9.8.
Figure 9.7: A tournament tree on eight runs: a deleteMin removes the global winner, then insertNext moves the next element of that run up the tree until a new global winner is determined.
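The following C++ sketch (our own illustration, not the notes' Algorithm 9.8) shows a loser tree merging k sorted runs, where k is a power of two and INT64_MAX serves as a sentinel for exhausted runs:

#include <cstdint>
#include <utility>
#include <vector>

struct LoserTree {
    static constexpr int64_t SENTINEL = INT64_MAX;
    int k;                        // number of runs, a power of two
    std::vector<int> loser;       // loser[1..k-1]: losing run; loser[0]: winner
    std::vector<int64_t> key;     // current head element of each run

    LoserTree(int k_, const std::vector<int64_t>& heads)
        : k(k_), loser(k_, -1), key(heads) {
        for (int run = 0; run < k; ++run) replay(run);  // initial tournament
    }
    // After key[run] changed, replay the path from its leaf to the root:
    // exactly log2(k) comparisons, no hidden constant factors.
    void replay(int run) {
        for (int node = (run + k) / 2; node > 0; node /= 2) {
            if (loser[node] == -1) { loser[node] = run; return; }  // init only
            if (key[loser[node]] < key[run]) std::swap(loser[node], run);
        }
        loser[0] = run;           // overall winner above the root
    }
};

std::vector<int64_t> kWayMerge(const std::vector<std::vector<int64_t>>& runs) {
    int k = runs.size();
    std::vector<std::size_t> pos(k, 0);
    std::vector<int64_t> heads(k);
    for (int r = 0; r < k; ++r)
        heads[r] = runs[r].empty() ? LoserTree::SENTINEL : runs[r][0];
    LoserTree t(k, heads);
    std::vector<int64_t> out;
    while (t.key[t.loser[0]] != LoserTree::SENTINEL) {
        int r = t.loser[0];
        out.push_back(t.key[r]);                 // deleteMin
        ++pos[r];                                // insertNext from run r
        t.key[r] = pos[r] < runs[r].size() ? runs[r][pos[r]]
                                           : LoserTree::SENTINEL;
        t.replay(r);
    }
    return out;
}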
9.6 Experiments
Experiments on Multiway Merge Sort were performed in 2001 on a 2 × 2GHz
Xeon × 2 threads machine (Intel IV with Netburst) with several 66 MHz PCI-
buses, 4 fast IDE controllers (Promise Ultra100 TX2) and 8 fast IDE disks (IBM
IC35L080AVVA07). This inexpensive (mid 2002) setup gave a high I/O-bandwidth
of 360 MB/s. The keys consisted of 16 GByte random 32 bit integers, run size was
256 MByte, block size B was 2MB (if not otherwise mentioned).
Figure 9.9 shows the running time for different element sizes (for a constant
total data volume of 16 GByte). The smaller the elements, the costlier becomes
internal work, especially during run formation (there are more elements to sort).
With a high I/O throughput and intelligent prefetching algorithms, I/O wait time never makes up for more than half of the total running time. This proves the point that overlapping and tuning of internal work are important.
Figure 9.9: Running times of run formation and merging and the I/O wait times in both phases, as a function of the element size [byte].
Figure 9.10: Sort time [ns/byte] as a function of the block size [KByte], for 16 GBytes and for 128 GBytes with one or two merge phases.
What is a good block size B? An intuitive approach would link B to the size
of a physical disk block. However, figure 9.10 shows that B is no technology
constant but a tuning parameter: A larger B is better (as it reduces the amortized
costs of O(1/B) I/Os per element), as long as the resulting smaller k still allows
for a single merge phase (see curve for 128GB).
9.7 External Priority Queues
Figure 9.12: Overview of the complete data structure for R = 3 merge groups
DeleteMin: The smallest elements of the deletion buffer and insertion buffer are
compared, and the smaller one is deleted and returned. If this empties the deletion
buffer, it is refilled from the group buffers using an R-way merge. Before the
refill, group buffers with less than m′ elements are refilled from the sequences in
their group (if the group is nonempty). Figure 9.13 gives an example of a deletion
operation.
DeleteMin works correctly provided the data structure fulfills the heap property,
i.e., elements in the group buffers are not smaller than elements in the deletion
buffer, and in turn, elements in a sorted sequence are not smaller than the elements
in the respective group buffer. Maintaining this invariant is the main difficulty for
implementing insertion.
Insert: New elements are inserted into the insert heap. When its size reaches
m, its elements are sorted (e.g. using merge sort or heap sort). The result is then
merged with the concatenation of the deletion buffer and the group buffer 1. The
smallest resulting elements replace the deletion buffer and group buffer 1. The
remaining elements form a new sequence of length at most m. The new sequence
is finally inserted into a free slot of group G1 . If there is no free slot initially, G1 is
emptied by merging all its sequences into a single sequence of size at most km,
which is then put into G2 . The same strategy is used recursively to free higher level
groups when necessary. When group GR overflows, R is incremented and a new
group is created. When a sequence is moved from one group to the other, the heap
property may be violated. Therefore, when G1 through Gi have been emptied, the
group buffers 1 through i + 1 are merged, and put into G1 . An example insertion
operation is shown in Figures 9.14 and 9.15.
(a) Deletion of two elements empties insert heap and deletion buffer
(b) Every group fills its buffer via k-way merging; the deletion buffer is filled from the group buffers via R-way merging
For cached memory, where the speed of internal computation matters, it is also
crucial how to implement the operation of k-way merging. How this can be done
in an efficient way is described in the chapter about Sorting (9.5).
(a) Inserting element 3 leads to overflow of insert heap: it
is merged with deletion buffer and group buffer 1 and then
inserted into group 1
(b) Overflow in group 1: all old elements are merged and
inserted in next group
(c) Overflow in group 2: all old elements are merged and inserted
in next group
(d) Group buffers are invalid now: merge them and insert the result into group 1
Analysis
We will now give a sketch of the I/O analysis of our priority queues. Let I denote the number of insertions, which is also an upper bound on the number of deleteMin operations. First note that group Gi can overflow at most every m(k^i − 1) insertions: The only complication is the slot in group G1 used for invalid group buffers. Nevertheless, when groups G1 through Gi contain k sequences each, this can only happen after at least
\sum_{j=1}^{i} m(k-1)k^{j-1} = m(k^i - 1)
insertions.
Now consider the I/Os performed for an element moving on the following canonical data path: It is first inserted into the insert buffer and then written to a sequence in group G1 in a batched manner, i.e., 1/B I/Os are charged to the insertion of this element. Then it is involved in emptying groups until it arrives in group GR. For each emptying operation, the element is involved in one batched read and one batched write, i.e., it is charged with 2(R − 1)/B I/Os for the emptying operations. Eventually, the element is read into group buffer R, yielding a charge of 1/B I/Os. All in all, we get a charge of 2R/B I/Os for each insertion.
What remains to be shown (and is omitted here) is that the remaining I/Os
only contribute lower order terms or replace I/Os done on the canonical path. For
example, we save I/Os when an element is extracted before it reaches the last group.
We use the costs charged for this to pay for swapping the group buffers in and out.
Eventually, we have O(sort(I)) I/Os.
In a similar fashion, we can show that I operations inflict I log I key comparisons on average. As for sorting, this is a good measure for the internal work, since in efficient implementations of priority queues for the comparison model, this is where most of the internal work is spent.
Experiments
We now present the results of some experiments conducted to compare our sequence heap with other priority queue implementations. Random 32 bit integers were used as keys with another 32 bits of associated information. The operation sequence used was (Insert deleteMin Insert)^N (deleteMin Insert deleteMin)^N. The choice of this sequence is nontrivial as it can have measurable influence (factor two and more) on the performance. Figure 9.17 shows this: Here we have the sequence (Insert (deleteMin Insert)^s)^N (deleteMin (Insert deleteMin)^s)^N for several values of s. For larger s, the performance gets better when N is large enough. This can be explained with a “locality effect”: New elements tend to be smaller than most old elements (the smallest of the old elements have long been removed before). Therefore, many elements never make it into group G1, let alone the groups for longer sequences. Since most work is performed while emptying groups, this work is saved, so the instances with small s should come close to the worst case. To make clear that sequence heaps are nevertheless still much better than binary or 4-ary heaps, Figure 9.17 additionally contains their timings for s = 0.
The parameters chosen for the experiments were m′ = 32, m = 256 and k = 128 on all machines tried. While there were better settings for individual machines, these global values gave near optimal performance in all cases.
Figure 9.17: Time per operation as a function of N (up to 2²³) for the sequence heap with s ∈ {0, 1, 4, 16}, and for binary and 4-ary heaps at s = 0.
Chapter 10
Online Algorithms
Online algorithms are a class of algorithms which process their input in a serial
fashion, where future information is only revealed in parts. The algorithms must
fully process each part and produce a partial output before receiving further input.
No information is known about the further input sequence. Therefore, online
algorithms are restricted to making decisions based solely on past events. The
challenge is to design algorithms that perform well under this lack of knowledge
about the future.
In contrast, many algorithms previously developed in these course notes require
the whole input to be available at the beginning and do extensive computation on
the entirety of the problem data. These are called offline algorithms.
However, many practical problems have an intrinsic online nature, for example the ski rental and paging problems discussed below.
10.1 Competitive Analysis
Definition 7
Given an optimization problem with input I, for which an offline algorithm OPT
can compute an optimal solution using a cost measure OPT(I), then an online
algorithm ALG is called c-competitive if there is a constant α such that for all
finite input sequences I
ALG(I) ≤ c · OPT(I) + α
The competitive ratio of ALG is given by
c = \sup_I \frac{ALG(I)}{OPT(I)}
10.1.1 Example: Ski Rental
Suppose renting skis costs 50 euro per trip, buying a pair costs 300 euro, and the number of future skiing trips is unknown. As a first simple online algorithm we suggest buying skis on the sixth trip. Using competitive analysis we can calculate the competitive ratio for this algorithm:
• If you make five or fewer skiing trips, you pay the optimal cost of 50 euro per trip, and thus the ratio to an optimal offline algorithm is \frac{5 \cdot 50}{5 \cdot 50} = 1 for these inputs.
• If you make six or more skiing trips, you pay 250 + 300 = 550 euros in total. However, an optimal offline algorithm can take advantage of its information about the future and pays only 300 euro. For these inputs the ratio is \frac{550}{300} = \frac{11}{6}, thus our simple algorithm is 11/6-competitive.
We now show that no better online algorithm can be devised for this problem.
• Suppose we buy skis earlier, say on trip x < 6. For fewer than x trips, the ratio stays 1, but for exactly x trips an optimal algorithm pays 50x euros while the online algorithm pays 50(x − 1) + 300. Thus it has a competitive ratio of
\frac{50x + 250}{50x} = 1 + \frac{5}{x} \ge 2
• Suppose we buy skis later, say on trip x > 6. For exactly x trips an optimal offline algorithm pays only 300 euros while the online algorithm again pays 50(x − 1) + 300 euro. The competitive ratio in this case is
\frac{50x + 250}{300} = \frac{5 + x}{6} \ge 2
So both buying earlier (x < 6) and buying later (x > 6) yield an algorithm with a competitive ratio of at least 2. All other algorithms, like renting skis while already possessing a pair or buying two pairs, are obviously inferior, and thus buying on the sixth trip is the optimal algorithm in the absence of complete knowledge.
The ski rental example is prototypical of a class of problems in which you
must decide whether to pay smaller repeating costs or a large one-time cost, which
reduces or eliminates the repeating costs. Other examples are whether to buy
a BahnCard or monthly tickets for a public transport system. The idea behind
optimal online algorithms for these problems is to not pay the large one-time cost
until you would have paid the same amount in smaller repeating costs.
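A small C++ sketch of this break-even rule applied to the ski rental numbers above (our own illustration):

#include <algorithm>
#include <iostream>

// cost of the online strategy "buy on trip t" when n trips actually happen
long onlineCost(long n, long t, long rent, long buy) {
    return n < t ? n * rent : (t - 1) * rent + buy;
}
long offlineCost(long n, long rent, long buy) {
    return std::min(n * rent, buy);    // the offline algorithm knows n
}

int main() {
    const long rent = 50, buy = 300;
    const long t = buy / rent + 1;     // break-even rule: buy on trip 6
    for (long n : {5L, 6L, 100L})
        std::cout << "n=" << n
                  << " online=" << onlineCost(n, t, rent, buy)
                  << " offline=" << offlineCost(n, rent, buy) << '\n';
}

For n = 6 this prints online cost 550 versus offline cost 300, i.e. the ratio 11/6 derived above.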
10.2 Paging – Deterministic Algorithms
In the paging problem, a fast memory (cache) holds k pages of a large, slow main memory. Each request names a page; if it is not in the cache, a page fault occurs and the page must be loaded, evicting another page if the cache is full. Paging is a natural online problem as future memory access patterns are generally not known in advance. Table 10.1 contains the most basic and well-known deterministic paging algorithms.
Table 10.1: Basic deterministic paging algorithms.
LRU (Least Recently Used): evicts the page whose last request lies farthest in the past.
FIFO (First In First Out): evicts the page that has been in cache longest.
LIFO (Last In First Out): evicts the page that was loaded most recently.
LFU (Least Frequently Used): evicts the page that has been requested least often.
LFD (Longest Forward Distance): evicts the page whose next request is farthest in the future.
FWF (Flush When Full): on a page fault with a full cache, evicts all pages.
All algorithms given above except LFD are online algorithms. LFD is an
offline algorithm because it requires knowledge of the future page access order.
Furthermore, all these algorithms except FWF are called demand paging: they
never evict a page unless there is a page fault. One can show that any page
replacement algorithm can be modified to be demand paging without adding a cost
penalty. For FWF this leads to an algorithm which evicts an arbitrary page when
the cache is full.
Figure 10.2 shows a snapshot of LRU and LFD working on an example page
request sequence. The current cache, with k = 3 in both examples, is depicted
in the second row. Conceptually, LRU keeps a backwards reference to the last time each cached page was used. In the current state shown, the red page will be evicted to make room for the requested blue page, as the red page's most recent access lies farthest in the past. LFD, on the other hand, is an offline
algorithm and can therefore make use of the future request sequence. Each page in
the cache references its next access position. When a page must be evicted, LFD
can pick the page which has its next request farthest in the future. In the example,
access to the green page causes a page to be evicted and because the yellow one
will be requested latest of all in the cache, LFD will remove the yellow one.
A further interesting topic is how efficiently the replacement strategies can
be implemented. Simplicity is a key factor for paging algorithms because long
calculations would impact the fast memory’s access performance on cache misses.
However, for competitive analysis of the algorithms with page faults as cost metric,
the implementation complexity is immaterial.
Figure 10.2: LRU (left) and LFD (right) processing an example request sequence σ.
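As an aside on implementation, LRU admits O(1) work per request using a doubly linked list combined with a hash map (a sketch of ours, not from the notes):

#include <list>
#include <unordered_map>

class LruCache {
    std::size_t k;                                    // cache size
    std::list<int> order;                             // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> pos;
    long faults = 0;
public:
    explicit LruCache(std::size_t k_) : k(k_) {}
    long pageFaults() const { return faults; }
    void request(int page) {
        auto it = pos.find(page);
        if (it != pos.end()) {                        // hit: move page to front
            order.splice(order.begin(), order, it->second);
            return;
        }
        ++faults;                                     // page fault
        if (order.size() == k) {                      // evict least recently used
            pos.erase(order.back());
            order.pop_back();
        }
        order.push_front(page);
        pos[page] = order.begin();
    }
};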
Theorem 21
LFD is an optimal offline algorithm for paging.
Proof.
We prove this theorem by showing that any optimal offline algorithm OPT can be
modified to act like LFD on any given page request sequence σ without increasing
the number of page faults.
This is shown via inductive application of the following claim: given any paging algorithm ALG and a page request sequence σ, we can construct an offline algorithm ALG′ᵢ for any i = 1, . . . , |σ| so that the following three properties are satisfied:
(i) ALG′ᵢ processes the first i − 1 paging requests exactly as ALG does.
(ii) If the i-th request causes a page fault, then ALG′ᵢ evicts the page with longest forward distance from cache memory. So for the i-th request the new algorithm behaves the same as LFD.
(iii) The cost in page faults of ALG′ᵢ does not increase: ALG′ᵢ(σ) ≤ ALG(σ).
If this claim holds, the theorem can be proven by applying it n = |σ| times to any optimal offline algorithm OPT. Start by applying the claim to OPT with i = 1 to obtain OPT′₁, which acts like LFD on the first request. Then apply the claim to OPT′₁ with i = 2 to obtain OPT″₂, and so on. After n steps the algorithm OPT⁽ⁿ⁾ₙ acts identically to LFD on the request sequence σ. The number of page faults of OPT⁽ⁿ⁾ₙ has not increased due to (iii), nor can it have decreased as OPT was already optimal.
To finish the proof, it remains to show the claim. Given a paging algorithm ALG, we first construct ALG′ᵢ. Figure 10.3 shows an illustration of this construction.
For the first i − 1 requests, both algorithms must behave identically. Therefore, after the (i − 1)-th request the cached page sets are identical. Only after handling the i-th request for page x might these sets differ, as ALG′ᵢ will pick the page a with longest forward distance to evict. We denote the cached page set of ALG after i requests by X ∪ {a} and that of ALG′ᵢ by X ∪ {b}, where X is the common page set of size k − 1. If a = b, then there was no page fault or ALG already behaves like ALG′ᵢ for the i-th page request, and all further page requests can be handled
Figure 10.3: Example showing the cached page sets of ALG and ALG′ᵢ.
identically. We therefore assume a ≠ b. For the requests following the i-th, ALG′ᵢ shall mimic ALG, except when either page a or b is requested. This strategy is possible because both algorithms' page sets share k − 1 pages. If at some point ALG evicts a, then ALG′ᵢ can evict b instead, and from there on both are identical. Otherwise b will be requested before a, as a has the longest forward distance. When b is requested, ALG must evict a page r to make room for b, while ALG′ᵢ still has b in cache. If r = a, then both page sets are equal from this point on. Otherwise ALG now has X̃ ∪ {a} cached while ALG′ᵢ has X̃ ∪ {r} with a common page set X̃. Again ALG′ᵢ can continue like ALG, until eventually a is requested. At this time ALG′ᵢ can evict r to make room for a, after which both page sets are identical.
The constructed algorithm ALG′ᵢ satisfies conditions (i) and (ii). The difficulty lies in showing (iii): ALG′ᵢ may not suffer more page faults than ALG. An additional page fault can only occur before both page sets are identical again. Both algorithms behave the same except for requests to pages a and b, so both incur the same page faults for accesses to other pages. If a ≠ b, then ALG suffers a page fault when b is accessed before a. ALG then evicts r, while ALG′ᵢ still holds r. The page r then takes the role of b. This can happen multiple times, further increasing ALG's cost. Eventually a is requested, for which ALG′ᵢ suffers a page fault and can then evict r. This page fault is covered by the previous one when r was evicted by ALG. Thus ALG′ᵢ never suffers more page faults than ALG.
We now show that LIFO is not competitive. Let k be the cache size and P = {p1, . . . , pk, pk+1} the pages in the slow main memory. When processing the request sequence
σ = p1, p2, . . . , pk, (pk+1, pk)^m
for some integer m, LIFO will first cache pages p1, . . . , pk. For the (k + 1)-st request LIFO will evict pk to make room for pk+1, which it will remove again on the (k + 2)-nd request, and so on. Thus LIFO will incur page faults on every request after the k-th one, in total 2m, whereas LFD only suffers k + 1 page faults in total. We can therefore provide for every constant c a request sequence with 2m > c(k + 1), which violates Definition 7.
Similarly, LFU is not competitive. Again let k be the cache size and P = {p1, . . . , pk, pk+1} the pages in the slow main memory. For every integer m we can consider the request sequence
σ = p1^m, p2^m, . . . , pk−1^m, (pk, pk+1)^m
in which each of the first k − 1 pages is requested m times, followed by m alternations of pk and pk+1. LFU will load the pages p1, . . . , pk−1 for the first k − 1 page faults. After the first (k − 1)m requests however, LFU will suffer a page fault for each of the 2m following requests. For the first of these page faults, pk, the cache still has one empty slot, but following this each request forces LFU to evict a page: pk when pk+1 is requested and vice versa. Since m can be made arbitrarily large and LFD requires only k + 1 page faults, the competitive ratio of LFU is unbounded.
Following this sobering result, we show another theorem, which establishes a
lower bound on the competitive ratio for any online paging algorithm.
Theorem 22
No deterministic online paging algorithm can achieve a better competitive ratio
than k, where k is the cache size.
Proof.
Let ALG be an online paging algorithm working on a cache of size k. We show there is an arbitrarily long request sequence σ for which ALG(σ) ≥ k · LFD(σ).
Assume there are k + 1 pages in slow memory and that the cache already holds k pages. Thus there is always a page not in the cache. Since ALG is deterministic, a malicious page request sequence can be constructed for which each request results in a page fault: in each step a page not in cache is requested. For such a sequence ALG(σ) = |σ|, whereas LFD incurs at most one page fault on every k requests, i.e. LFD(σ) ≤ |σ|/k. This is due to LFD being able to keep the next k requested pages in cache at each position in the sequence. Thus ALG(σ) ≥ k · LFD(σ), i.e. ALG(σ)/LFD(σ) ≥ k.
Definition 8
A paging algorithm is called conservative if it incurs at most k page faults on any
request sequence containing at most k distinct pages.
LRU, FIFO and even FWF are conservative. LFU and LIFO are not conservative; they are not even competitive. The request sequences in the proofs of non-competitiveness above are good starting points for the reader to verify this observation.
We now show that all conservative paging algorithms have at least the following
competitive ratio.
Theorem 23
Any conservative online algorithm ALG, which works on a cache of size k, is k/(k − h + 1)-competitive to any optimal offline algorithm OPT with h ≤ k pages of cache.
Proof.
Let σ be a fixed page request sequence. We first partition this sequence into phases:
phase 0 is the empty sequence and for i > 0 let phase i be the longest sequence of
page requests following phase i − 1 that contains at most k distinct pages. Such a partition exists for every request sequence.
[Figure: the k-phase partition of σ into phases 1, 2, . . . ; each phase i contains at most k distinct pages, and at its start OPT holds h pages in cache including the first page f of the phase.]
10.3 Randomized Online Algorithms
Definition 9
A randomized online algorithm ALG is called c-competitive against an oblivious
adversary if there is a constant α such that for all finite input sequences σ
E(ALG(σ)) ≤ c · OPT(σ) + α
where E(ALG(σ)) is the expected value of the random variable ALG(σ) over all
possible random choices in ALG.
10.3.1 Randomized Marking Algorithms
Definition 10
We call a paging algorithm a marking algorithm if its operations can formally be described using a one-bit mark for each page of the slow memory. Initially, with an empty cache, all pages are unmarked. On request of a page, this page is marked and loaded into cache memory, if it is not already there. Only unmarked pages may be evicted if the algorithm needs to remove a page from cache. If all pages are marked and there is a page fault, then all marked pages are unmarked and one of them is evicted.
The marking/unmarking phases of this definition are similar but not identical to the k-phase partition used in the proof of Theorem 23. Marking algorithms differ only in the decision which of the unmarked pages is evicted. FWF is maybe the most primitive marking algorithm, as it just evicts all unmarked pages on a page fault requiring page eviction. More elaborately, LRU is also a marking algorithm: it picks the unmarked page least recently used. FIFO is not a marking algorithm.
For competitive analysis we target RMARK, which is the marking algorithm
that evicts an unmarked page chosen uniformly at random when required.
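A sketch of RMARK in C++ (our own illustration of Definition 10 with random eviction; interface names are ours):

#include <cstdlib>
#include <unordered_map>
#include <vector>

class RMark {
    std::size_t k;                               // cache size
    std::unordered_map<int, bool> cache;         // page -> marked?
    long faults = 0;
public:
    explicit RMark(std::size_t k_) : k(k_) {}
    long pageFaults() const { return faults; }
    void request(int page) {
        if (cache.find(page) == cache.end()) {   // page fault
            ++faults;
            if (cache.size() == k) {
                std::vector<int> unmarked;
                for (auto& [pg, marked] : cache)
                    if (!marked) unmarked.push_back(pg);
                if (unmarked.empty()) {          // all marked: unmark everything
                    for (auto& [pg, marked] : cache) {
                        marked = false;
                        unmarked.push_back(pg);
                    }
                }
                // evict an unmarked page chosen uniformly at random
                cache.erase(unmarked[std::rand() % unmarked.size()]);
            }
            cache[page] = false;
        }
        cache[page] = true;                      // mark the requested page
    }
};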
Figure 10.6: Worst-case for RMARK: new and old pages in a k-phase partition
Theorem 24
The randomized paging algorithm RMARK is (2Hk )-competitive against an obliv-
ious adversary.
Proof.
We again look at the k-phase partition of a fixed request sequence σ and analyze
the expected number of page faults in phase i for RMARK. At the beginning of
phase i there are k pages in cache as requested in phase i − 1. We call these pages
old, whereas pages that are requested in phase i, but not in phase i − 1, are labeled
as new. Every page in phase i is either old or new, let mi be the number of new
pages. Access to each of the mi new pages must result in a page fault, whereas
old pages may still be in the cache from the previous phase. Consecutive accesses
to the same old or new pages do not incur any further page faults and we can
therefore concentrate on distinct page requests. The worst possible order of page
requests for the online algorithm is first all new pages followed by the old pages
(see figure 10.6), because each new page request evicts an old page which might
be used later. The online algorithm, however, cannot take into account which of
the old pages will be requested.
So there are mi unavoidable page faults for the new pages, which are marked
and therefore cannot be evicted in this phase. We need to determine the expected
value of page faults for the k − mi old pages. For this consider the probability that
the j-th old page is still in the cache when it is first requested: since it is unmarked
it must be one of the k − mi − (j − 1) unmarked old pages in the cache (there
are only k − mi slots for old pages and j − 1 are already taken). Dividing this
by the total number of unmarked old pages k − (j − 1) in this request sequence
(including those not requested) yields the probability of a cache hit for the j-th old
page. So, the probability for the j-th old page to cause a page fault is
1 - \frac{k - m_i - (j-1)}{k - (j-1)} = \frac{m_i}{k - (j-1)}
To get the expected number of faults in phase i, we sum over all old pages:
m_i + \sum_{j=1}^{k-m_i} \frac{m_i}{k-j+1} = m_i + m_i \left( \frac{1}{k} + \frac{1}{k-1} + \dots + \frac{1}{m_i+1} \right) = m_i (H_k - H_{m_i} + 1) \le m_i H_k
If we say the request sequence σ has N phases, then we get
E(RMARK(σ)) \le H_k \sum_{i=1}^{N} m_i
We now have an upper bound for the cost of RMARK, to complete the competitive
analysis we require a lower bound on OPT(σ). Here we need to look at two
adjacent phases i − 1 and i. Since there are mi new pages in phase i, there must be
at least k + mi distinct pages in the request sequence consisting of phases i − 1 and
i. Thus in phase i − 1 and i even OPT must incur at least mi page faults. Hence,
we have the following lower bound.
OPT(σ) \ge \frac{1}{2} \sum_{i=1}^{N} m_i
So with RMARK we have a simple randomized algorithm with a competitive
ratio of 2Hk . Note that Hk is usually a lot smaller than k (for example H8 ≈ 2.72),
therefore even this crude randomized algorithm can be better in practice than LRU
or FIFO, and probably is even easier to implement.
Figure 10.7: Requests to marked (M) and unmarked (U) pages within one phase; each request to an unmarked page in sub-phase j costs ALG at least 1/(k − j + 1) in expectation.
Theorem 25
No randomized online paging algorithm working with cache size k and slow
memory size N ≥ k + 1 can have a better competitive ratio than Hk against an
oblivious adversary.
Proof.
Let ALG be a randomized online algorithm with k cache pages and N = k + 1 pages in slow memory. Since an oblivious adversary must construct the request sequence in advance, we must devise a method which yields a sequence on which every ALG has expected cost at least Hk times the optimal one; the competitive ratio, as a worst-case measure over all sequences, is then at least Hk for all possible algorithms. The adversary knows about the algorithm's randomization mechanism and can therefore maintain a probability vector (p1, . . . , pN), where pi is the probability for page i of the slow memory not to be in the current cache set. This vector can be calculated by the adversary offline, using the probability distributions in ALG for each position in the request sequence iteratively, while \sum_{i=1}^{N} p_i = 1 holds in each step. Thus the expected cost for a request to page i is pi.
We continue to construct a malicious sequence based on the k-phase partition schema used in the proof of Theorem 23. In each phase ALG will suffer at least Hk expected page faults, while OPT can have k − 1 pages in cache and incurs only 1 page fault. This then results in a competitive ratio of at least Hk. To construct the phases,
the adversary uses a “reverse” marking algorithm and thus must store the set of
currently marked pages.
Each phase starts with all pages unmarked. During the phase all k pages will
be accessed at least once and thereby marked. The trick is to use the probability
vector to increase the expected number of page faults between two accesses to
unmarked pages. This is possible because there are k + 1 pages in total.
Each k-phase is composed of k sub-phases, which in turn are composed of zero
or more requests to marked pages already in cache and terminate with one request
to an unmarked page, which may or may not be in cache (see figure 10.7). At the
start of the j-th sub-phase there are exactly k − (j − 1) unmarked pages. Each sub-phase increases the number of marked pages by one; at the end of the k-th sub-phase all pages become unmarked again and a new (super-)phase begins with one marked and k unmarked pages (N = k + 1). If we can ensure that the expected cost of ALG for sub-phase j is 1/(k − j + 1), then the total expected cost for phase i would be
\sum_{j=1}^{k} \frac{1}{k-j+1} = \frac{1}{k} + \frac{1}{k-1} + \dots + \frac{1}{1} = H_k \qquad (10.1)
At the start of the j-th sub-phase the set of marked pages M has cardinality j, while the set of unmarked pages U contains (k + 1) − j items. Consider γ := \sum_{i \in M} p_i, which is the total probability of any marked page not being in the cache. If γ = 0, there is an unmarked page a ∈ U with p_a \ge \frac{1}{k+1-j} as |U| = k + 1 − j. By requesting this page the expected cost can be met and the sub-phase ended. Otherwise, with γ > 0, there exists a marked page m with p_m > 0. Let ε := p_m and request this page as the first in the sub-phase. Then generate subsequent requests to marked pages using the following loop:
While the expected cost for ALG is less than \frac{1}{k+1-j} and γ > ε, request the marked page x with maximal probability p_x.
The expected cost for ALG increases with each iteration of the loop as p_x > 0, so the loop must terminate. (Actually, each additional request contributes at least γ/|M| > ε/|M| > 0 cost, because γ > ε.) If the loop terminates with expected cost \ge \frac{1}{k+1-j}, then the adversary can pick any unmarked page to end the sub-phase. If, however, the loop terminates with expected cost < \frac{1}{k+1-j}, then γ ≤ ε must have been the reason. In this case the adversary must request the unmarked page b with maximal p_b, for which p_b \ge \frac{1-γ}{|U|} = \frac{1-γ}{k+1-j}. The total expected cost for ALG is in this case ε + p_b, which is
ε + p_b \ge ε + \frac{1-γ}{k+1-j} \ge ε + \frac{1-ε}{k+1-j} \ge \frac{1}{k+1-j}
Thus the total expected cost for any sub-phase is at least \frac{1}{k+1-j} and therefore for the whole phase at least H_k (due to eq. 10.1). The variable ε serves as an arbitrary bound on the number of marked pages requested.
In each phase the expected cost for ALG is at least Hk , while the cost for OPT
is at most 1. Thus the competitive ratio of ALG is at least Hk . Since we did not
make any assumption about how ALG works on the page request sequence, this
lower bound holds for all randomized online paging algorithms.
This closes the chapter on competitive analysis of online algorithms. We have shown many analytic results that tend to be very pessimistic about competitive performance. But keep in mind that this performance is always measured against a much more powerful, optimal offline algorithm, whose results are often not easy to come by either. In practice the best paging algorithms are LRU and RMARK (see figure 8.1 in [10]).
In spite of these disadvantages, competitive analysis is a solid foundation for giving guarantees on the worst-case performance of online algorithms.
Chapter 11
Geometry
11.2 Basic Elements and Definitions
Definition 11
A point p is defined as a vector in R^d. We use p.i to refer to the i-th element of the vector. In two- as well as three-dimensional space, we also use p.x, p.y and p.z to refer to the coordinates.
Definition 12
Given two points a and b, we define a line segment ab as {α · a + (1 − α) · b|α ∈ [0, 1]}.
Definition 13
A polygon in d-dimensional space is a set of line segments given as a list of points P = p1, . . . , pn with pn = p1, where the segments pi pi+1 for i ∈ [1, n − 1] define the outline of the polygon.
We call a polygon convex if for any two points a and b in P the line segment
ab is contained in P .
As a first approach to computing all pairs of intersecting segments in a set S of n line segments, we can simply test every pair:
foreach {s, t} ⊆ S do
if s ∩ t ≠ ∅ then
output {s, t}
The algorithm obviously performs in Θ(n²), which is far too slow for large data sets. Naturally the question arises: what is the best we can do? Since we have to look at each segment at least once and have to output every intersection, we get a minimal complexity of Ω(n + k), with k denoting the number of existing intersections.
Keeping a sorted range T, the at most n insert and remove operations can be performed within O(log n) each. Finding the boundaries of the interval containing the intersections can be done in O(log n), the output of the intersections in O(k_s) with k_s denoting the number of intersections on the horizontal segment s. Overall this yields a running time of O(n \log n + \sum_s k_s) = O(n \log n + k).
Lemma 26
Let si and sj be two non-horizontal segments whose interiors intersect in a single point p, and assume there is no third segment passing through p. Then there is an event point above p where si and sj become adjacent and are tested for intersection.
Proof.
Choose the sweep line ℓ in such a way that there is no event located on ℓ and no event between ℓ and the horizontal line through p. Then si and sj are adjacent in T for this position of ℓ. Since T = ⟨⟩ at the beginning of the algorithm, there exists an event where si and sj become adjacent in T and are tested for intersection.
This observation yields an improved version of our algorithm. We define two functions to aid our algorithm description. The first is called findNewEvent and calculates whether there is an intersection event between two given line segments. Due to the insert into Q, findNewEvent runs in O(1 + log n). The second function performs the event processing step. We distinguish between three versions of handleEvent for the three different types of events.
Due to the n line segments, we have to handle n start and n finish events. The
number of intersection events is linear in the number of intersections as can be
proven by an interpretation as a planar graph and by using Euler's formula. For a
detailed analysis we refer to [? ].
The initialization of our algorithm performs in O(n · log(n)), the event loop in
O((n + k) · log(n)). Thus the overall running time is in O((n + k) · log(n)).
This can be interpreted as a sweep-line with an infinitely small slope. With this we
can once more adapt our algorithm:
To finally also allow overlapping segments, we can compute the line g(s) for
any segment s ∈ S that contains s. Sorting S along g(s) enables us to solve a
1-dimensional overlapping problem for each segment.
Function findNewEvent(s, t, p)
if s and t intersect at a point p′ below p then
if p′ ∉ Q then Q.insert(p′)
11.5 Convex Hull
We assume the input points p1, . . . , pn to be given in lexicographic order, i.e., pi precedes pj if pi.x < pj.x, or pi.x = pj.x and pi.y < pj.y. If the points are not given in this order, we can sort them in O(sort(n)) additional time.
For this algorithm, we first split the set of points along the line segment p1 pn .
Now the algorithm can be implemented as two convex hull calculations that process
the lower-hull, which represents the hull of points below p1 pn , and the upper-hull
separately (illustrated in Figure 11.4). In the following we will focus on the
upper-hull. The algorithm for the lower-hull can be formulated equivalently.
Function upperHull(p1, . . . , pn)
L := ⟨pn, p1, p2⟩ : Stack of Point
invariant L is the upper hull of ⟨pn, p1, . . . , pi⟩
for i := 3 to n do
while ¬rightTurn(L.secondButLast, L.last, pi) do
L.pop
L := L ◦ ⟨pi⟩
return L
The running time of the algorithm upperHull itself lies in O(n): the for loop obviously runs for O(n) iterations, and the while loop adds only O(n) work in total, as each point can be deleted from L at most once. Thus the overall running time is dominated by the sorting and the algorithm performs in O(n log n).
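A compact C++ version of the upper-hull scan (our own sketch; it omits the pn sentinel of the pseudocode and instead checks the stack size):

#include <vector>

struct Point { double x, y; };

// true iff a, b, c make a right (clockwise) turn
static bool rightTurn(const Point& a, const Point& b, const Point& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x) < 0;
}

// expects the points sorted lexicographically by (x, y)
std::vector<Point> upperHull(const std::vector<Point>& p) {
    std::vector<Point> L;                 // stack of hull points
    for (const Point& q : p) {
        // pop while the last two hull points and q do not make a right turn
        while (L.size() >= 2 && !rightTurn(L[L.size() - 2], L.back(), q))
            L.pop_back();
        L.push_back(q);
    }
    return L;
}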
In three-dimensional space, the number of vertices does not directly imply a linear complexity of the convex hull. Still, one can show that the number of edges and the number of faces are both in O(n); we refer to [? ] for a proof. For even higher dimensions, the output complexity rises to O(n^⌊d/2⌋).
11.6 Smallest Enclosing Ball
The smallest enclosing ball (sEB) problem asks for the ball of smallest radius that contains a given point set P ⊆ R^d. The following randomized algorithm computes it; Q is the set of points known to lie on the boundary, initially empty.
Function smallestEnclosingBallWithPoints(P, Q)
if |P| = 1 ∨ |Q| = d + 1 then return ball(Q)
pick random x ∈ P
B := smallestEnclosingBallWithPoints(P \ {x}, Q)
if x ∈ B then return B
return smallestEnclosingBallWithPoints(P \ {x}, Q ∪ {x})
Note the base of our recursion: a sphere for only one point can be given as the point itself without any radius. Also, a set of d + 1 points on its border uniquely defines a sphere in d-dimensional space. In Q we store the set of border points that define the current sEB.
To prove our algorithm correct we have to prove two claims.
The first states:
Lemma 27
Smallest enclosing balls (sEBs) are unique.
Proof.
Assume there exist two sEBs B1 ≠ B2 of P. Since B1 and B2 are both sEBs, P ⊆ B1 ∧ P ⊆ B2 holds. This directly induces P ⊆ B1 ∩ B2 ⊆ sEB(B1 ∩ B2) =: B with radius(B) < radius(B1). This cannot hold, since B1 is a sEB.
Lemma 28
If a point x is not in the sEB of (P \ {x}, Q), then x is on the border of sEB(P).
Proof.
We show the contraposition: Assume x is not on the border of sEB(P). Then sEB(P) = sEB(P \ {x}) = B due to the uniqueness of sEBs, and hence x ∈ sEB(P \ {x}).
11.6.1 Analysis
We have given a probabilistic algorithm for the smallest enclosing ball problem
and proven it correct. In this section we want to analyse the expected running time
for the algorithm.
We start by defining the base of our recurrence. As already mentioned above, our recursion T(p, q) ends with T(p, d + 1) = T(1, q) = 0.
For T(p, q) the following recurrence holds:
T(p, q) \le 1 + T(p - 1, q) + P[x \notin B] \cdot T(p, q + 1)
as we only have to execute the second recursive call whenever x ∉ B. Now let us inspect P[x ∉ B] for a fixed subset of points p1, . . . , pi and its sEB B. If we remove a point pj, when does B change? This happens only if we remove one of the d + 1 points that uniquely define B. If there are more than d + 1 points on the boundary of B, the smallest enclosing ball cannot change after the removal of one point. Thus, since we already know q points to be on the boundary of B, the number of points that could change the sEB is at most d + 1 − q. Therefore
T(p, q) \le 1 + T(p - 1, q) + \frac{d+1-q}{p} \cdot T(p, q + 1)
holds.
Using this recurrence, we can calculate the expected running time for different values of d. For example, d = 1 yields an expected running time of 3 · n, whereas d = 2 results in a running time of 10 · n. In general, T(p, 0) = O((d + 1)! · n) holds.
11.7 Range Search
Definition 15
For a given range of elements X = ⟨x1, . . . , xn⟩, a Wavelet Tree is recursively defined as follows. The tree keeps a bitvector b with b[i] = 1 iff xi > ⌊n/2⌋. Furthermore, it contains two children, each recursively defined as the Wavelet Tree of ⟨xi : xi ≤ ⌊n/2⌋⟩ as the left child and ⟨xi − ⌊n/2⌋ : xi > ⌊n/2⌋⟩ as the right child, respectively. This recursion is stopped as soon as n ≤ n0, where n0 is a tuning parameter. In this case, the sequence is stored directly. A sample wavelet tree is shown in Figure 11.7.
Figure 11.7: A sample Wavelet Tree with the bitvectors of its nodes.
Definition 16
The Dominance Counting Problem is defined as follows: Given a sequence of elements X = ⟨x1, . . . , xn⟩ as well as a pair (x, y), find |{xi : xi ∈ X ∧ xi ≥ x ∧ i ≥ y}| (dominance criterion).
Using Wavelet Trees, we can solve this problem with Algorithm 11.8. In every recursion step, the number yr of one-bits among the first y positions is calculated using the rank function of the bitvector; how this can be accomplished in constant time is discussed below. If the desired value x lies in the left child, we know that at least ⌈n/2⌉ − yr elements fulfill the dominance criterion, namely all elements of the right child at positions ≥ y. The rest can be counted recursively with (x, y − yr) as input for the left child, since the yr elements that went to the right child are not present in the left child and the position y has to be adjusted accordingly. The recursion for x > ⌊n/2⌋ works in a similar manner, descending into the right child with (x, yr). For an illustration, the reader can turn to Figure 11.9.
If we implement rank to work in constant time (see below), the algorithm runs in O(log n), as we only follow one recursive path and the recursion depth is O(log n).
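Continuing the sketch above, a hedged implementation of this query; positions are 0-based here, so the query counts elements with value ≥ x at position ≥ y.

// Count the elements x_i >= x with position i >= y (Definition 16).
int dominance(const WaveletTree& t, int x, int y) {
    if (t.lo == t.hi)                   // leaf: every element has value t.lo
        return t.lo >= x ? t.n - y : 0;
    int mid = t.lo + (t.hi - t.lo) / 2;
    int yr = t.rank(y);                 // elements before position y that went right
    if (x <= mid)                       // x lies in the left half: every right
        return (t.rank(t.n) - yr)       //   element at position >= y dominates
             + dominance(*t.left, x, y - yr);
    return dominance(*t.right, x, yr);  // x lies in the right half
}

Here t.rank(t.n) − yr plays the role of ⌈n/2⌉ − yr above: it is the number of elements in the right child whose original position is at least y.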
Figure 11.9: Example query for dominance counting using a Wavelet Tree
Using Algorithm 11.10, we can also perform range counting queries on integer
values. The algorithm is illustrated in Figure 11.11.
Definition 17
The One-Sided Reporting problem is defined as follows: given a set P = {(x1, y1), …, (xn, yn)} and a value y, find P ∩ ([1, n] × [y, n]).
Having solved the One-Sided Reporting problem, we are ready to report all elements dominating an (x, y) pair with Algorithm 11.13.
Definition 18
The 2D Range Search problem is defined as follows: given a set of points P = {p1, …, pn} ⊆ R² and an axis-parallel rectangle Q = [x, x′] × [y, y′], find P ∩ Q.
We can reduce the general problem to the integer problem on [1, n] × [1, n]. This is achieved by replacing each point (xi, yi) ∈ P with (rank(xi), rank(yi)), where rank(xi) denotes the rank of xi in the sorted sequence of all x-coordinates of P, and analogously for yi. The replacement can be computed by sorting the xi and the yi respectively, in combination with a binary search for every xi, yi in the resulting sorted ranges. Thus, with O(n log n) preprocessing, we can answer general range queries using the integer range counting methods discussed above.
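A small C++ sketch of this rank-space reduction (the names are our own): each coordinate is mapped to its rank among all coordinates of the same axis via one sort and one binary search per point.

#include <algorithm>
#include <utility>
#include <vector>

struct Pt { double x, y; };

// Map every point to its rank-space coordinates in [1, n] x [1, n].
std::vector<std::pair<int, int>> toRankSpace(const std::vector<Pt>& pts) {
    std::vector<double> xs, ys;
    for (const Pt& p : pts) { xs.push_back(p.x); ys.push_back(p.y); }
    std::sort(xs.begin(), xs.end());    // O(n log n)
    std::sort(ys.begin(), ys.end());
    std::vector<std::pair<int, int>> out;
    for (const Pt& p : pts) {           // O(log n) binary search per coordinate
        int rx = int(std::lower_bound(xs.begin(), xs.end(), p.x) - xs.begin()) + 1;
        int ry = int(std::lower_bound(ys.begin(), ys.end(), p.y) - ys.begin()) + 1;
        out.push_back({rx, ry});
    }
    return out;
}

The query rectangle [x, x′] × [y, y′] has to be mapped with the same binary searches before it is handed to the integer data structure.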
...
...
11.7.3 Bitvectors
Bitvectors, as used in the range query algorithms discussed above, store a sequence of bits and offer a set of distinct operations. For our use case, the interesting operation is rank(j), which returns the number of set bits among the first j positions.
To support the rank operation in constant time, we precompute the rank of every B-th position, where B can be seen as a tuning parameter. Then rank(j) can be computed as the sum of the precomputed value for block ⌊j/B⌋ and the number of set bits in v[B · ⌊j/B⌋ .. j]. If we choose B = Θ(log n), we use O(n) precomputation time and O(n) additional bits. This process is illustrated in Figure 11.15.
The constant-time lookup within a block can be realized either by a lookup table or by the population count instruction, which on some systems (e.g. those supporting SSE 4.2) counts the number of set bits in a machine word.
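A hedged C++20 sketch with the practical choice B = 64, i.e. one machine word per block, so that the in-block part is a single popcount; the class name and layout are our own.

#include <bit>
#include <cstdint>
#include <vector>

struct RankBitvector {
    std::vector<uint64_t> words;      // the bits, 64 per word
    std::vector<uint32_t> blockRank;  // precomputed: ones before each block

    explicit RankBitvector(std::vector<uint64_t> w) : words(std::move(w)) {
        blockRank.resize(words.size() + 1, 0);
        for (std::size_t i = 0; i < words.size(); ++i)
            blockRank[i + 1] = blockRank[i] + std::popcount(words[i]);
    }

    // rank(j): number of set bits among the positions 0 .. j-1
    uint32_t rank(std::size_t j) const {
        std::size_t w = j / 64, r = j % 64;  // block index and in-block offset
        uint64_t low = r ? (words[w] & ((uint64_t(1) << r) - 1)) : 0;
        return blockRank[w] + std::popcount(low);
    }
};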
In a similar manner, we can calculate the position of the i-th set bit, b.select(i), in constant time. Furthermore, it is possible to reach the information-theoretically optimal space of n + o(n) bits.
[Figure 11.15: rank computation with precomputed block ranks 0, 4, 8, 11 for the blocks 10010011, 10001110, 00001110, 10100010; a query into the third block yields rank = 8 + 2 = 10]