Principles of Distributed Database System 2020
Distributed Databases (Học viện Công nghệ Bưu chính Viễn thông)
Principles of Distributed
Database Systems
Fourth Edition
The first two editions of this book were published by: Pearson Education, Inc.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To our families
and our parents
M.T.Ö. and P.V.
Preface
The first edition of this book appeared in 1991 when the technology was new and
there were not too many products. In the Preface to the first edition, we had quoted
Michael Stonebraker who claimed in 1988 that in the following 10 years, centralized
DBMSs would be an “antique curiosity” and most organizations would move
towards distributed DBMSs. That prediction has certainly proved to be correct, and
a large proportion of the systems in use today are either distributed or parallel—
commonly referred to as scale-out systems. When we were putting together the
first edition, undergraduate and graduate database courses were not as prevalent
as they are now; so the initial version of the book contained lengthy discussions
of centralized solutions before introducing their distributed/parallel counterparts.
Times have certainly changed on that front as well, and now, it is hard to find
a graduate student who does not have at least some rudimentary knowledge of
database technology. Therefore, a graduate-level textbook on distributed/parallel
database technology needs to be positioned differently today. That was our objective
in this edition while maintaining the many new topics we introduced in the third
edition. The main revisions introduced in this fourth edition are the following:
1. Over the years, the motivations and the environment for this technology have
somewhat shifted (Web, cloud, etc.). In light of this, the introductory chapter
needed a serious refresh. We revised the introduction with the aim of a more
contemporary look at the technology.
2. We have added a new chapter on big data processing to cover distributed storage
systems, data stream processing, MapReduce and Spark platforms, graph
analytics, and data lakes. With the proliferation of these systems, systematic
treatment of these topics is essential.
3. Similarly, we addressed the growing influence of NoSQL systems by devoting
a new chapter to them. This chapter covers the four types of NoSQL (key-value
stores, document stores, wide column systems, and graph DBMSs), as well as
NewSQL systems and polystores.
4. We have combined the database integration and multidatabase query processing
chapters from the third edition into a uniform chapter on database integration.
polystores, and parallel DBMS chapters and provided extremely detailed comments
that improved those chapters considerably. Tamer’s students Anil Pacaci, Khaled
Ammar and postdoc Xiaofei Zhang provided extensive reviews of the big data
chapter, and texts from their publications are included in this chapter. The NoSQL,
NewSQL, and polystores chapter includes text from publications of Boyan Kolev
and Patrick’s student Carlyna Bondiombouy. Jim Webber reviewed the section
on Neo4j in that chapter. The characterization of graph analytics systems in that
chapter is partially based on Minyang Han’s master’s thesis where he also proposes
the GiraphUC approach that is discussed in that chapter. Semih Salihoglu and Lukasz
Golab also reviewed and provided very helpful comments on parts of this chapter.
Alon Halevy provided comments on the WebTables discussion in Chap. 12. The data
quality discussion in web data integration is contributed by Ihab Ilyas and Xu Chu.
Stratos Idreos was very helpful in clarifying how database cracking can be used as
a partitioning approach and provided text that is included in Chap. 2. Renan Souza
and Fabian Stöter reviewed the entire book.
The third edition of the book introduced a number of new topics that carried over
to this edition, and a number of colleagues were very influential in writing those
chapters. We would like to, once again, acknowledge their assistance since their
impact is reflected in the current edition as well. Renée Miller, Erhard Rahm, and
Alon Halevy were critical in putting together the discussion on database integration,
which was reviewed thoroughly by Avigdor Gal. Matthias Jarke, Xiang Li, Gottfried
Vossen, Erhard Rahm, and Andreas Thor contributed exercises to this chapter.
Hubert Naacke contributed to the section on heterogeneous cost modeling and Fabio
Porto to the section on adaptive query processing. Data replication (Chap. 6) could
not have been written without the assistance of Gustavo Alonso and Bettina Kemme.
Esther Pacitti also contributed to the data replication chapter, both by reviewing
it and by providing background material; she also contributed to the section on
replication in database clusters in the parallel DBMS chapter. Peer-to-peer data
management owes a lot to the discussions with Beng Chin Ooi. The section of this
chapter on query processing in P2P systems uses material from the PhD work of
Reza Akbarinia and Wenceslao Palma, while the section on replication uses material
from the PhD work of Vidal Martins.
We thank our editor at Springer, Susan Lagerstrom-Fife, for pushing this project
within Springer and also pushing us to finish it in a timely manner. We missed almost
all of her deadlines, but we hope the end result is satisfactory.
Finally, we would be very interested to hear your comments and suggestions
regarding the material. We welcome any feedback, but we would particularly like to
receive feedback on the following aspects:
1. Any errors that may have remained despite our best efforts (although we hope
there are not many);
2. Any topics that should no longer be included and any topics that should be added
or expanded;
3. Any exercises that you may have designed that you would like to be included in
the book.
Contents
1 Introduction
1.1 What Is a Distributed Database System?
1.2 History of Distributed DBMS
1.3 Data Delivery Alternatives
1.4 Promises of Distributed DBMSs
1.4.1 Transparent Management of Distributed and Replicated Data
1.4.2 Reliability Through Distributed Transactions
1.4.3 Improved Performance
1.4.4 Scalability
1.5 Design Issues
1.5.1 Distributed Database Design
1.5.2 Distributed Data Control
1.5.3 Distributed Query Processing
1.5.4 Distributed Concurrency Control
1.5.5 Reliability of Distributed DBMS
1.5.6 Replication
1.5.7 Parallel DBMSs
1.5.8 Database Integration
1.5.9 Alternative Distribution Approaches
1.5.10 Big Data Processing and NoSQL
1.6 Distributed DBMS Architectures
1.6.1 Architectural Models for Distributed DBMSs
1.6.2 Client/Server Systems
1.6.3 Peer-to-Peer Systems
1.6.4 Multidatabase Systems
1.6.5 Cloud Computing
1.7 Bibliographic Notes
References
Index
The original version of this book was revised. The correction to this book is available at
https://doi.org/10.1007/978-3-030-26253-2_13