0% found this document useful (0 votes)
26 views

Lecture5 -Query_Processing 1

Query Processing

Uploaded by

amirosama2121
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
26 views

Lecture5 -Query_Processing 1

Query Processing

Uploaded by

amirosama2121
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Principles of Distributed Database

Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
◼ Introduction
◼ Distributed and parallel database design
◼ Distributed data control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multidatabase Systems
◼ Parallel Database Systems
◼ Peer-to-Peer Data Management
◼ Big Data Processing
◼ NoSQL, NewSQL and Polystores
◼ Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
◼ Distributed Query Processing
❑ Query Decomposition and Localization
❑ Join Ordering
❑ Distributed Query Optimization
❑ Adaptive Query Processing

© 2020, M.T. Özsu & P. Valduriez 3


Query Processing in a DDBMS

◼ Generally, a query in distributed DBMS require data from


multiple sites, and this is called transmission of data that
causes communication costs.

◼ Query processing in DBMS is different from centralized


DBMS due to the communication cost of data transfer
over the network.

◼ The transmission cost is low when the sites are


connected through high-speed network and is quite
significant in another network.

© 2020, M.T. Özsu & P. Valduriez 4


Query Processing in a DDBMS

◼ In distributed query processing, the data transfer


cost of distributed query processing means:

❑ Cost of transferring intermediate files to other sites for


processing and
❑ Cost of transferring the ultimate result file to the
location where the results required

© 2020, M.T. Özsu & P. Valduriez 5


Distributed DBMS Environment

• If s1 request a query and


needs data from s2 and
s3.
• It is decided to execute
the query at s3.
• 1st communication cost is
transferring the data from
s2 to s3 → then s3 will
execute the query and
get the result.
• 2nd communication cost is
transferring the result
from s3 to s1

© 2020, M.T. Özsu & P. Valduriez 6


Query Processing in a DDBMS

High level user query

Query
Processor

Low-level data manipulation


commands for D-DBMS

© 2020, M.T. Özsu & P. Valduriez 7


Query Processing Components

◼ Query language
❑ SQL: “intergalactic dataspeak”

◼ Query execution
❑ The steps that one goes through in executing high-level
(declarative) user queries.

◼ Query optimization
❑ How do we determine the “best” execution plan?

◼ We assume a homogeneous D-DBMS

© 2020, M.T. Özsu & P. Valduriez 8


Selecting Alternatives
Find the names of employee who are
managing a project

Strategy 1

SELECT ENAME
FROM EMP , ASG
WHERE emp.no=asg.no and RESP = "Manager"

Strategy 1
 ENAME(RESP=“Manager”EMP.ENO=ASG.ENO(EMP×ASG))

© 2020, M.T. Özsu & P. Valduriez 9


Selecting Alternatives
Strategy 2

SELECT ENAME
FROM EMP NATURAL JOIN ASG
WHERE RESP = "Manager"

Strategy 2
 ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))
Strategy 2 avoids Cartesian product, and consumes less
computing resources, so may be “better”

© 2020, M.T. Özsu & P. Valduriez 10


Selecting Alternatives
In a distributed system,

◼ Relational algebra is not enough to express execution


strategies. It must be supplemented with operators for
exchanging data between sites.

◼ The distributed query processor must also select the


best sites to process data , and possibly the way data
should be transformed.

© 2020, M.T. Özsu & P. Valduriez 11


What is the Problem?

Site 1 Site 2 Site 3 Site 4 Site 5


ASG1=ENO≤“E3”(ASG) ASG2= ENO>“E3”(ASG) EMP1= ENO≤“E3”(EMP) EMP2= ENO>“E3”(EMP) Result

Strategy A Strategy B
© 2020, M.T. Özsu & P. Valduriez 12
Cost of Alternatives

Assume
◼ size(EMP) = 400 row,
size(ASG) = 1000
◼ tuple access cost = 1
unit (1 operation or 1s);
◼ tuple transfer cost = 10
units
◼ There are 20 managers
in relation ASG
◼ Assume that the data is
uniformly distributed
among sites

© 2020, M.T. Özsu & P. Valduriez 13


Cost of Alternatives

◼ Strategy A
❑ produce ASG': (10+10) tuple access cost 20
❑ transfer ASG' to the sites of EMP: (10+10)
tuple transfer cost 200
❑ produce EMP': (10+10) tuple access cost
2 40
❑ transfer EMP' to result site: (10+10) tuple
transfer cost 200
Total Cost 460

◼ Strategy B
❑ transfer EMP to site 5: 400 tuple transfer
cost 4,000
❑ transfer ASG to site 5: 1000 tuple transfer
cost 10,000
❑ produce ASG': 1000 tuple access (apply
condition) 1,000
❑ join EMP and ASG': 400 20(manager) tuple
access cost 8,000
Total Cost 23,000

© 2020, M.T. Özsu & P. Valduriez 14


Query Optimization Objectives
◼ Minimize a cost function
❑ I/O cost + CPU cost + communication cost
❑ These might have different weights in different distributed
environments
◼ Wide area networks
❑ Communication cost may dominate or vary much
◼ Bandwidth
◼ Speed
◼ Protocol overhead
◼ Local area networks
❑ Communication cost not that dominant, so total cost function
should be considered
◼ Can also maximize throughput
© 2020, M.T. Özsu & P. Valduriez 15
Complexity of Relational Operations

Operation Complexity

Select
Project O(n)
◼ Assume (without duplicate elimination)

❑ Relations of cardinality n Project


(with duplicate elimination) O(n  log n)
❑ Sequential scan
Group

Join
Semi-join O(n  log n)
Division
Set Operators

Cartesian Product O(n2)

© 2020, M.T. Özsu & P. Valduriez 16


Types Of Optimizers

◼ Exhaustive search
❑ Cost-based
❑ Optimal
❑ Combinatorial complexity in the number of relations
◼ Heuristics
❑ Not optimal
❑ Regroup common sub-expressions
❑ Perform selection, projection first
❑ Replace a join by a series of semijoins
❑ Reorder operations to reduce intermediate relation size
❑ Optimize individual operations

© 2020, M.T. Özsu & P. Valduriez 17


Optimization Granularity

◼ Single query at a time


❑ Cannot use common intermediate results

◼ Multiple queries at a time


❑ Efficient if many similar queries

❑ Decision space is much larger

© 2020, M.T. Özsu & P. Valduriez 18


Optimization Timing

◼ Static : optimization is done at query compilation time


❑ Compilation ➔ optimize prior to the execution
❑ Difficult to estimate the size of the intermediate resultserror
propagation
❑ Can amortize over many executions
◼ Dynamic: proceeds at query execution time
❑ Run time optimization
❑ Exact information on the intermediate relation sizes
❑ Have to re-optimize for multiple executions
◼ Hybrid: tradeoff between both
❑ Compile using a static algorithm
❑ If the error in estimate sizes > threshold, re-optimize at run time

© 2020, M.T. Özsu & P. Valduriez 19


Statistics

◼ Relation
❑ Cardinality
❑ Size of a tuple
❑ Fraction of tuples participating in a join with another relation
◼ Attribute
❑ Cardinality of domain
❑ Actual number of distinct values
◼ Simplifying assumptions
❑ Independence between different attribute values
❑ Uniform distribution of attribute values within their domain

© 2020, M.T. Özsu & P. Valduriez 20


Optimization Decision Sites

◼ Centralized
❑ Single site determines the “best” schedule
❑ Simple
❑ Need knowledge about the entire distributed database
◼ Distributed
❑ Cooperation among sites to determine the schedule
❑ Need only local information
❑ Cost of cooperation
◼ Hybrid
❑ One site determines the global schedule
❑ Each site optimizes the local subqueries

© 2020, M.T. Özsu & P. Valduriez 21


Network Topology

◼ Wide area networks (WAN) – point-to-point


❑ Characteristics
◼ Relatively low bandwidth (compared to local CPU/IO)
◼ High protocol overhead
❑ Communication cost may dominate; ignore all other cost factors
❑ Global schedule to minimize communication cost
❑ Local schedules according to centralized query optimization
◼ Local area networks (LAN)
❑ Communication cost not that dominant
❑ Total cost function should be considered
❑ Broadcasting can be exploited (joins)
❑ Special algorithms exist for star networks

© 2020, M.T. Özsu & P. Valduriez 22


Questions?

© 2020, M.T. Özsu & P. Valduriez 23

You might also like