Distributed Query Processing
Group Members
M. Arslan BCS07183033
Adnan Ali BCS07183037
Amina Sajid BCS07183040
Motivation
• Cost and scalability: network of off-shelf
machines
• Integration of different software vendors
(with own DBMS)
• Integration of legacy systems
• Applications inherently distributed, such as
workflow or collaborative-design
• State-of-the-art distributed information
technologies (e-businesses)
Distribute Query Processing
• Query Processing Basics
– centralized query processing
– distributed query processing
Problem Statement
• Input: Query such as „Biological objects in
study A referenced in a literature in journal Y“.
• Output: Answer
• Objectives:
– response time, throughput, first answers, little IO, ...
• Centralized vs. Distributed Query Processing
– same basic problem
– but, more and different parameters, such(data sites
or available machine power) and objectives
Steps in Query Processing
• Input: Declarative Query
– SQL, XQuery, ...
• Step 1: Translate Query into Algebra
– Tree of operators (query plan generation)
• Step 2: Optimize Query
– Tree of operators (logical) - also select partitions of table
– Tree of operators (physical) – also site annotations
– (Compilation)
• Step 3: Execution
– Interpretation; Query result generation
Algebra
A.d
A.a =
SELECT A.d
B.b,
FROM A, B
A.c =
WHERE A.a = B.b
35
AND A.c = 35 X
A B
– relational algebra for SQL very well understood
– algebra for XQuery mostly understood
Query Optimization
A.d A.d
A.a =
B.b,
hashjoin
A.c =
35
X B.b
index
A B B
A.c
– logical, e.g., push down cheap predicates
– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan
Basic Query Optimization
• Classical Dynamic Programming algorithm
– Performs join order optimization
– Input : Join query on n relations
– Output : Best join order
Query Execution
John
A.d
(John, 35, CS)
hashjoin (CS)
(AS)
(John, 35, CS) B.b
(Mary, 35, EE) (Edinburgh, CS,5.0)
(Edinburgh, AS, 6.0)
index
B
A.c
– library of operators (hash join, merge join, ...)
– exploit indexes and clustering in database
– pipelining (iterator model)
Summary : Centralized Queries
• Basic SQL (SPJG, nesting) well understood
• Very good extensibility
– spatial joins, time series, UDF, xquery, etc.
• Current problems
– Better statistics : cost model for optimization
– Physical database design expensive & complex
• Some Trends
– interactiveness during execution
– approximate answers, top-k
– self-tuning capabilities (adaptive; robust; etc.)
Distributed Query Processing: Basics
• Idea:
Extension of centralized query processing. (System
R* et al. in 80s)
• What is different?
– extend physical algebra: send&receive operators
– other metrics : optimize for response time
– resource vectors, network interconnect matrix
– caching and replication
– less predictability in cost model (adaptive algos)
– heterogeneity in data formats and data models
Issues in Distributed Databases
• Plan enumeration
– The time and space complexity of traditional dynamic
programming algorithm is very large
– Iterative Dynamic Programming (heuristic for large
queries)
• Cost Models
– Classic Cost Model
– Response Time Model
– Economic Models
Query Execution Techniques for
Distributed Databases
• Row Blocking
• Multi-cast optimization
• Multi-threaded execution
• Joins with horizontal partitioning
• Semi joins
• Top n queries
Query Execution Techniques for DD
• Row Blocking –
– SEND and RECEIVE operators in query plan
to model communication
– Implemented by TCP/IP, UDP, etc.
– Ship tuples in block-wise fashion (batch);
smooth burstiness
Query Execution Techniques for DD
• Multi-cast Optimization
– Location of sending/receiving may affect
communication costs; forwarding versus multi-casting
• Multi-threaded execution
– Several threads for operators at the same site (intra-
query parallelism)
– May be useful to enable concurrent reads for diverse
machines (while continuing query processing)
– Must consider if resources warrant concurrent operator
execution (say two sorts each needing all memory)
Query Execution Techniques for DD
• Joins with Data (horizontal) partitioning:
– Hash-based partitioning to conduct joins on independent partitions
• Semi Joins :
– Reduce communication costs; Send only “join keys” instead of
complete tuples to the site to extract relevant join partners
• Double-pipelined hash joins :
– Non-blocking join operators to deliver first results quickly; fully
exploit pipelined parallelism, and reduce overall response time
• Top n queries :
– Isloate top n tuples quickly and only perform other expensive
operations (like sort, join, etc) on those few (use “stop” operators)