CH 1 Query Processing
CH 1 Query Processing
CH 1 Query Processing
1
Query Processing and Optimization
Query Processing: refers to the range of
activities involved in extracting data from a database.
Query optimization: is the process of
choosing a suitable execution strategy for processing
a query.
+ Before optimizing the query it is represented in an
internal or intermediate form using two data
structures:-
Query tree: A tree data structure that corresponds to a relational
algebra expression. It represents the input relations of the query as leaf nodes of
the tree, and represents the relational algebra operations as internal nodes.
Query graph: A graph data structure that corresponds to a
relational calculus expression. It does not indicate an order on which operations
to perform first. There is only a single graph corresponding to each query.
2
Query Processing (cont…)
3
Cont…
Scanner: The scanner identifies the language
tokens such as SQL Keywords, attribute
names, and relation names in the text of the
query.
Parser: The parser checks the query syntax to
determine whether it is formulated according
to the syntax rules of the query language.
Validation: The query must be validated by
checking that all attributes and relation names
are valid and semantically meaningful names
in the schema of the particular database being
queried.
4
Cont…
Query Optimization: The process of choosing
a suitable execution strategy for processing a
query.
◦ This module has the task of producing an execution
plan.
Query Code Generator: It generates the code
to execute the plan.
Runtime Database Processor: It has the task of
running the query code whether in compiled or
interpreted mode.
◦ If a runtime error results an error message is
generated by the runtime database processor
5
Cont…
Query block:
◦ The basic unit that can be translated into the
algebraic operators
A query block contains a single
SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING
clause if these are part of the block.
Nested queries within a query are
identified as separate query blocks.
6
Cont…
7
Transformation Rules for Relational Algebra
Operations:
8
Cont…
4. Commuting with :
◦ If the selection condition c involves only the
attributes A1, ..., An in the projection list, the
two operations can be commuted:
A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
(C (R x S)) = (R C S)
10
Basic algorithms for executing query
operations
External sorting:
◦ Refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in
main memory, such as most database files
External sorting uses Sort-Merge strategy:
◦ Starts by sorting small subfiles (runs) of the main file
and then merges the sorted runs, creating larger sorted
subfiles that are merged in turn
◦ Sorting phase:
Number subfiles (runs) nR = (b/nB)
◦ Merging phase:
Degree of mergin(dM) = Min (nB-1, nR); nP = (logdM(nR))
11
Cont…
◦ nR: number of initial runs;
◦ b: number of file blocks;
◦ nB: available buffer space;
◦ dM: degree of merging;
◦ P: number of passes
The size of a run and number of initial run depends on the
number of file blocks (b) and available buffer space (nB)
Example: if nB=5 blocks and size of the file=1024 blocks,
◦ nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
◦ Hence, after the sort phase, 205 sorted runs are stored as
temporary subfiles on disk
12
Cont…
In the merging phase, the sorted runs are merged during one or
more passes.
The degree of merging (dM) is the number of runs that can be
merged in each pass.
◦ dM=min((nB-1) and nR))
The number of passes=[(logdM (nR))]
In each pass, one buffer block is needed to hold one block from each
of the runs being merged and one block is needed for containing one
block of the merge result
In the above example, dM=4(four way merging)
Hence, the 205 initial sorted runs would be merged into:
◦ 52 at the end of the first pass
◦ 13 at the end of the second pass
◦ 4 at the end of the third pass
◦ 1 at the end of the fourth pass
13
Cont…
Exercise
A file of 4096 blocks is to be sorted with an available
buffer space of 64 blocks. How many passes will be
needed in the merge phase of the external sort-merge
algorithm?
14
Implementing the SELECT Operation
There are many options for executing a SELECT
operation
Some options depend on the file having specific
access paths and may apply only to certain types of
selection conditions
Examples:
◦ (OP1): SSN='123456789' (EMPLOYEE)
◦ (OP2): DNUMBER>5(DEPARTMENT)
◦ (OP3): DNO=5(EMPLOYEE)
◦ (OP4): DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)
◦ (OP5): ESSN=123456789 AND PNO=10(WORKS_ON)
15
Cont…
Search Methods for implementing Simple Selection:
◦ S1 Linear search (brute force):
Retrieve every record in the file, and test whether its
attribute values satisfy the selection condition.
◦ S2 Binary search:
If the selection condition involves an equality comparison
on a key attribute on which the file is ordered, binary
search (which is more efficient than linear search) can be
used.
An example is OP1 if SSN is the ordering attribute for
EMPLOYEE file
◦ S3 Using a primary index or hash key to retrieve a single
record:
If the selection condition involves an equality comparison
on a key attribute with a primary index (or a hash key), use
the primary index (or the hash key) to retrieve the record.
For Example, OP1 use primary index to retrieve the
record
16
Cont…
Search Methods for implementing Simple Selection
◦ S4 Using a primary index to retrieve multiple
records:
If the comparison condition is >, ≥, <, or ≤ on a
key field with a primary index, use the index to
find the record satisfying the corresponding
equality condition, then retrieve all subsequent
records in the (ordered) file. (see OP2)
◦ S5 Using a clustering index to retrieve multiple
records:
If the selection condition involves an equality
comparison on a non-key attribute with a
clustering index, use the clustering index to
retrieve all the records satisfying the selection
condition. (See OP3)
17
cont.…
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record
if the indexing field is a key (has unique values) or to
retrieve multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection
• S7: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition
in the conjunctive condition has an access path that
permits the use of one of the methods S2 to S5, use
that condition to retrieve the records and then check
whether each retrieved record satisfies the remaining
simple conditions in the conjunctive condition
18
Cont…
Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
If an access path exists, the method corresponding to
that access path is used;
Otherwise, the “brute force” linear search approach of
method S1 is used
◦ For conjunctive selection conditions, whenever more
than one of the attributes involved in the conditions have
an access path, query optimization should be done to
choose the access path that retrieves the fewest records
in the most efficient way
19
Cont…
Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND
◦ Compared to conjunctive selection, It is much harder to
process and optimize
Example: ’(EMPLOYEE)
DNO=5 OR SALARY>3000 OR SEX=‘F
◦ Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions
◦ Hence, if any of the individual conditions does not have
an access path, we are compelled to use the brute force
approach
20
Implementing the JOIN Operation:
21
Implementing the JOIN Operation(cont…)
Methods for implementing joins:
◦ J1 Nested-loop join (brute force):
For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
◦ J2 Single-loop join (Using an access structure
to retrieve the matching records):
If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
22
Implementing the JOIN Operation (cont…)
Methods for implementing joins:
◦ J3 Sort-merge join:
If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
Both files are scanned in order of the join attributes,
matching the records that have the same values for A
and B.
In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes
23
Cont…
24
Cont…
Implementing the JOIN Operation (contd.):
Other types of JOIN algorithms
Partition hash join
◦ Partitioning phase:
Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes:
R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm
Minimum number of in-memory buffers needed for the
partitioning phase: M+1.
A disk sub-file is created per partition to store the tuples
for that partition.
◦ Joining or probing phase:
Involves M iterations, one per partitioned file.
Iteration i involves joining partitions Ri and Si.
25
Algorithms for PROJECT and SET Operations
26
Using Heuristics in Query Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the
files involved in the query
The main heuristic is to apply first the operations that
reduce the size of intermediate results
◦ E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
27
Cont…
Heuristic Optimization of Query Trees:
◦ The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
◦ The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON,
PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND
ESSN=SSN AND BDATE >
‘1957-12-31’;
28
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q on slide 28
29
Steps in converting a query tree during heuristic optimization:
Moving the select operation down the tree
30
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first
31
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join
32
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree
33
Using Heuristic in query optimization(cont…)
The query graph representation for queries
given in slide 28.
[lName]
P Pnumber=pno W ESSn=SSn E
Pname=‘Aquaris’ BDate>1957-12-31
Aquaris
1957-12-31
34
Using Heuristics in Query Optimization
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and project operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
35
Using Selectivity and Cost Estimates in Query Optimization
Cost-based query optimization:
◦ Estimate and compare the costs of executing a query using different
execution strategies and choose the strategy with the lowest cost
estimate
Cost Components for Query Execution
1. Access cost to secondary storage: Cost of transferring data
blocks between secondary storage and main memory buffers.
36
Using Selectivity and Cost Estimates in Query Optimization
37
Semantic Query Optimization
Semantic Query Optimization:
◦ Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
Consider the following SQL query,
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E M
WHEREE.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
Explanation:
◦ Suppose that we had a constraint on the database schema that stated
that no employee can earn more than his or her direct supervisor. If
the semantic query optimizer checks for the existence of this
constraint, it need not execute the query at all because it knows that
the result of the query will be empty.
38