CH 1 Query Processing

Chapter one
Query processing and Optimization
1
Query Processing and Optimization
Query Processing: refers to the range of
activities involved in extracting data from a database.
Query optimization: is the process of
choosing a suitable execution strategy for processing
a query.
+ Before optimizing the query it is represented in an
internal or intermediate form using two data
structures:-
Query tree: A tree data structure that corresponds to a relational
algebra expression. It represents the input relations of the query as leaf nodes of
the tree, and represents the relational algebra operations as internal nodes.
Query graph: A graph data structure that corresponds to a
relational calculus expression. It does not indicate an order on which operations
to perform first. There is only a single graph corresponding to each query.
2
Query Processing (cont…)
3
Cont…
Scanner: The scanner identifies the language
tokens such as SQL Keywords, attribute
names, and relation names in the text of the
query.
Parser: The parser checks the query syntax to
determine whether it is formulated according
to the syntax rules of the query language.
Validation: The query must be validated by
checking that all attributes and relation names
are valid and semantically meaningful names
in the schema of the particular database being
queried.
4
Cont…
Query Optimization: The process of choosing
a suitable execution strategy for processing a
query.
◦ This module has the task of producing an execution
plan.
Query Code Generator: It generates the code
to execute the plan.
Runtime Database Processor: It has the task of
running the query code whether in compiled or
interpreted mode.
◦ If a runtime error results an error message is
generated by the runtime database processor
5
Cont…
Query block:
◦ The basic unit that can be translated into the
algebraic operators
A query block contains a single
SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING
clause if these are part of the block.
Nested queries within a query are
identified as separate query blocks.
6
Cont…
SELECT LNAME, FNAME

FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
SELECT LNAME, FNAME SELECT MAX (SALARY)

FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5
πLNAME, FNAME (σSALARY>C(EMPLOYEE)) ℱMAX SALARY (σDNO=5 (EMPLOYEE))
7
Transformation Rules for Relational Algebra
Operations:
1. Cascade of : A conjunctive selection condition can be

broken up into a cascade (sequence) of individual 
operations:
◦  c1 AND c2 AND ... AND cn(R) = c1 (c2 (...(cn(R))...) )
2. Commutatively of :
◦ The  operation is commutative:
◦ c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of  operations,
all but the last one can be ignored:
◦ List1 (List2 (...(Listn(R))...) ) = List1(R)
8
Cont…
4. Commuting  with :
◦ If the selection condition c involves only the
attributes A1, ..., An in the projection list, the
two operations can be commuted:
 A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
5. Commuting of and X: both are

commutative
R cS = S c R, RXS=SXR
9
Cont…
6. Commuting  with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:
◦ L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))
7. Converting a (, x) sequence into : If the condition c of a

 that follows a x Corresponds to a join condition, convert
the (, x) sequence into a as follows:
(C (R x S)) = (R C S)
10
Basic algorithms for executing query
operations
 External sorting:
◦ Refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in
main memory, such as most database files
 External sorting uses Sort-Merge strategy:
◦ Starts by sorting small subfiles (runs) of the main file
and then merges the sorted runs, creating larger sorted
subfiles that are merged in turn
◦ Sorting phase:
 Number subfiles (runs) nR = (b/nB)
◦ Merging phase:
 Degree of mergin(dM) = Min (nB-1, nR); nP = (logdM(nR))
11
Cont…
◦ nR: number of initial runs;
◦ b: number of file blocks;
◦ nB: available buffer space;
◦ dM: degree of merging;
◦ P: number of passes
 The size of a run and number of initial run depends on the
number of file blocks (b) and available buffer space (nB)
 Example: if nB=5 blocks and size of the file=1024 blocks,
◦ nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
◦ Hence, after the sort phase, 205 sorted runs are stored as
temporary subfiles on disk
12
Cont…
 In the merging phase, the sorted runs are merged during one or
more passes.
 The degree of merging (dM) is the number of runs that can be
merged in each pass.
◦ dM=min((nB-1) and nR))
 The number of passes=[(logdM (nR))]
 In each pass, one buffer block is needed to hold one block from each
of the runs being merged and one block is needed for containing one
block of the merge result
 In the above example, dM=4(four way merging)
 Hence, the 205 initial sorted runs would be merged into:
◦ 52 at the end of the first pass
◦ 13 at the end of the second pass
◦ 4 at the end of the third pass
◦ 1 at the end of the fourth pass
13
Cont…
Exercise
 A file of 4096 blocks is to be sorted with an available
buffer space of 64 blocks. How many passes will be
needed in the merge phase of the external sort-merge
algorithm?
14
Implementing the SELECT Operation
There are many options for executing a SELECT
operation
Some options depend on the file having specific
access paths and may apply only to certain types of
selection conditions
Examples:
◦ (OP1):  SSN='123456789' (EMPLOYEE)
◦ (OP2):  DNUMBER>5(DEPARTMENT)
◦ (OP3):  DNO=5(EMPLOYEE)
◦ (OP4):  DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)
◦ (OP5):  ESSN=123456789 AND PNO=10(WORKS_ON)
15
Cont…
 Search Methods for implementing Simple Selection:
◦ S1 Linear search (brute force):
 Retrieve every record in the file, and test whether its
attribute values satisfy the selection condition.
◦ S2 Binary search:
 If the selection condition involves an equality comparison
on a key attribute on which the file is ordered, binary
search (which is more efficient than linear search) can be
used.
 An example is OP1 if SSN is the ordering attribute for
EMPLOYEE file
◦ S3 Using a primary index or hash key to retrieve a single
record:
 If the selection condition involves an equality comparison
on a key attribute with a primary index (or a hash key), use
the primary index (or the hash key) to retrieve the record.
 For Example, OP1 use primary index to retrieve the
record
16
Cont…
Search Methods for implementing Simple Selection
◦ S4 Using a primary index to retrieve multiple
records:
 If the comparison condition is >, ≥, <, or ≤ on a
key field with a primary index, use the index to
find the record satisfying the corresponding
equality condition, then retrieve all subsequent
records in the (ordered) file. (see OP2)
◦ S5 Using a clustering index to retrieve multiple
records:
 If the selection condition involves an equality
comparison on a non-key attribute with a
clustering index, use the clustering index to
retrieve all the records satisfying the selection
condition. (See OP3)
17
cont.…
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record
if the indexing field is a key (has unique values) or to
retrieve multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection
• S7: Conjunctive selection using an individual index :
 If an attribute involved in any single simple condition
in the conjunctive condition has an access path that
permits the use of one of the methods S2 to S5, use
that condition to retrieve the records and then check
whether each retrieved record satisfies the remaining
simple conditions in the conjunctive condition
18
Cont…
 Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
 If an access path exists, the method corresponding to
that access path is used;
 Otherwise, the “brute force” linear search approach of
method S1 is used
◦ For conjunctive selection conditions, whenever more
than one of the attributes involved in the conditions have
an access path, query optimization should be done to
choose the access path that retrieves the fewest records
in the most efficient way
19
Cont…
 Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND
◦ Compared to conjunctive selection, It is much harder to
process and optimize
Example: ’(EMPLOYEE)
DNO=5 OR SALARY>3000 OR SEX=‘F
◦ Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions
◦ Hence, if any of the individual conditions does not have
an access path, we are compelled to use the brute force
approach
20
Implementing the JOIN Operation:
◦ The join operation is one of the most time

consuming operation in query processing
◦ Join
 two–way join: a join on two files
 e.g. R A=B S
 multi-way joins: joins involving more than two files
 e.g. R A=B S C=D T
Examples
◦ (OP6): EMPLOYEE DNO=DNUMBER DEPARTMENT
◦ (OP7): DEPARTMENT MGRSSN=SSN EMPLOYEE
21
Implementing the JOIN Operation(cont…)
Methods for implementing joins:
◦ J1 Nested-loop join (brute force):
 For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
◦ J2 Single-loop join (Using an access structure
to retrieve the matching records):
 If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
22
Implementing the JOIN Operation (cont…)
Methods for implementing joins:
◦ J3 Sort-merge join:
 If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
 Both files are scanned in order of the join attributes,
matching the records that have the same values for A
and B.
 In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes
23
Cont…
Implementing the JOIN Operation

(cont...):
Factors affecting JOIN performance
◦ Available buffer space
◦ Join selection factor
◦ Choice of inner VS outer relation
24
Cont…
 Implementing the JOIN Operation (contd.):
 Other types of JOIN algorithms
 Partition hash join
◦ Partitioning phase:
 Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes:
 R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm
 Minimum number of in-memory buffers needed for the
partitioning phase: M+1.
 A disk sub-file is created per partition to store the tuples
for that partition.
◦ Joining or probing phase:
 Involves M iterations, one per partitioned file.
 Iteration i involves joining partitions Ri and Si.
25
Algorithms for PROJECT and SET Operations
 Algorithm for PROJECT operations ()

 <attribute list>(R)
1. If <attribute list> has a key of relation R, extract all
tuples from R with only the values for the attributes in
<attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
 Methods to remove duplicate tuples

1. Sorting
2. Hashing
26
Using Heuristics in Query Optimization
 Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the
files involved in the query
 The main heuristic is to apply first the operations that
reduce the size of intermediate results
◦ E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
27
Cont…
 Heuristic Optimization of Query Trees:
◦ The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
◦ The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
 Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON,
PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND
ESSN=SSN AND BDATE >
‘1957-12-31’;
28
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q on slide 28
29
Moving the select operation down the tree
30
Applying the more restrictive select operation first
31
Replacing Cartesian product and select with join
32
Moving project operations down the query tree
33
Using Heuristic in query optimization(cont…)
The query graph representation for queries
given in slide 28.
 [lName]
P Pnumber=pno W ESSn=SSn E
Pname=‘Aquaris’ BDate>1957-12-31
Aquaris
1957-12-31
34
Using Heuristics in Query Optimization
 Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and project operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
35
Using Selectivity and Cost Estimates in Query Optimization
 Cost-based query optimization:
◦ Estimate and compare the costs of executing a query using different
execution strategies and choose the strategy with the lowest cost
estimate
 Cost Components for Query Execution
1. Access cost to secondary storage: Cost of transferring data
blocks between secondary storage and main memory buffers.
2. Computation cost: Cost of performing in-memory operations on

the records within the data buffers during query execution.
3. Memory usage cost: Cost pertaining to the number of main
memory buffers needed during query execution.
4. Communication cost: cost of shipping the query and its results
from the database site to the site or terminal
Note: Different database systems may focus on
different cost components.
36
Using Selectivity and Cost Estimates in Query Optimization
 Catalog Information Used in Cost Functions

◦ Information about the size of a file
 number of records (tuples) (r),
 record size (R),
 number of blocks (b)
 blocking factor (bfr)
b=r/bfr
◦ Information about indexes and indexing attributes of a file
 Number of levels (x) of each multilevel index
 Number of first-level index blocks (bI1)
 Number of distinct values (d) of an attribute
 Selectivity (sl) of an attribute
37
Semantic Query Optimization
 Semantic Query Optimization:
◦ Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
 Consider the following SQL query,
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E M
WHEREE.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
 Explanation:
◦ Suppose that we had a constraint on the database schema that stated
that no employee can earn more than his or her direct supervisor. If
the semantic query optimizer checks for the existence of this
constraint, it need not execute the query at all because it knows that
the result of the query will be empty.
38

CH 1 Query Processing

Uploaded by

Copyright:

Available Formats

CH 1 Query Processing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CH 1 Query Processing

Uploaded by

Copyright:

Available Formats

Chapter one

Query processing and Optimization

SELECT LNAME, FNAME

SELECT LNAME, FNAME SELECT MAX (SALARY)

πLNAME, FNAME (σSALARY>C(EMPLOYEE)) ℱMAX SALARY (σDNO=5 (EMPLOYEE))

1. Cascade of : A conjunctive selection condition can be

5. Commuting of and X: both are

7. Converting a (, x) sequence into : If the condition c of a

◦ The join operation is one of the most time

Implementing the JOIN Operation

 Algorithm for PROJECT operations ()

 Methods to remove duplicate tuples

2. Computation cost: Cost of performing in-memory operations on

 Catalog Information Used in Cost Functions

You might also like