Unit-5 Query Processing and Optimization

Download as pdf or txt
Download as pdf or txt
You are on page 1of 40

Unit-5

Query Processing
and
Optimization
Steps in Query Processing
Parser checks the syntax of Translator translates the
query and verifies attribute query into its internal
name and relation name form (relational algebra)

Parser and Relational algebra


Query
translator expression

Choose best execution plan


Optimizer
Execute the query-evaluation
plan and returns output
Evaluation
Query output Execution plan
engine

Database Catalog
Data Statistics about Data
Steps in Query Processing
Steps in Query Processing
Steps of query processing are:
1. Query decomposition.
2. Query optimization.
3. Code generation
4. Runtime query execution.
Steps in Query Processing
1.Query Decomposition
Its include scanning, parsing and validating process.
Scanning: The scanner finds the language tokens like SQL keywords,
attribute names, and relation names in the text of the query.
Parsing: Parser checks the syntax of query to find out whether it is
written according to the syntax rules of the query language.
Validating:
 Query validation is done by checking that all attribute and relation
names are valid.
 An internal representation of the query is then shown as a tree data
structure called a query tree.
 An internal representation of the query is then shown as a graph
data structure called a query graph.
Steps in Query Processing
2. Query optimization
 Query optimization is one of the most important tasks of a relational
DBMS.

 It generates alternative plans and chooses the best plan with the
least estimated cost.

 This module is used to find out an execution plan which is the


execution strategy to retrieve the result of the query from the
database files.

 A query can have many possible execution strategies each with


different performances the process of selecting a reasonably efficient
execution plan known as query optimization.
Steps in Query Processing
3. Code generation

 It generates the code to execute the execution plan selected by


query optimization block.

4. Runtime query execution.

 The above generated code will be accepted by runtime query


processor which executes the generated code and produces the
query result.

 This is desired query output that user wants.


Measures of Query Cost
 Cost is generally measured as the total time required to
execute a statement/query.
 Factors contribute to time cost
1. Disk accesses (time to process a data request and retrieve the
required data from the storage device)
2. CPU time to execute a query (Computation Cost)
3. Network communication cost
4. Memory usage cost
5. Number of different resources used like printer disk access etc.
6. Data transfer rate
 Disk access is the predominant (major) cost, since disk
access is slow as compared to in-memory operation.
 Cost to write a block is greater than cost to read a block
because data is read back after being written to ensure
that the write was successful.
Selection operation
 Symbol:  (Sigma)
 Notation:  condition(Relation)
 Operation: Selects tuples from a relation that satisfy a given
condition.
RollNo Name Branch SPI
101 Raj CE 8
102 Meet ME 9
103 Harsh EE 8
104 Punit CE 9

 Branch=‘CE’ (Student)
RollNo Name Branch SPI
101 Raj CE 8
104 Punit CE 9
Search algorithm for selection operation
1. Linear search (A1)
2. Binary search (A2)
Linear search (A1)
 It scans each blocks and tests all records to see whether they
satisfy the selection condition.
• Cost of linear search (worst case) = br
br denotes number of blocks containing records from relation r
 If the selection condition is there on a (primary) key
attribute, then system can stop searching if the required
record is found.
• cost of linear search (best case) = (br /2)
 Linear search can be applied regardless of
• selection condition or
• ordering of records in the file (relation)
 This algorithm is slower than binary search algorithm.
Binary search (A2)
 Generally, this algorithm is used if selection is an equality
comparison on the (primary) key attribute and file
(relation) is ordered (sorted) on (primary) key attribute.
 cost of binary search = [log2(br)]
• br denotes number of blocks containing records from relation r
 If the selection is on non (primary) key attribute then
multiple block may contains required records, then the cost
of scanning such blocks need to be added to the cost
estimate.
 This algorithm is faster than linear search algorithm.
Evaluation of expressions
 Expression may contain more than one operations, solving
expression will be difficult if it contains more than one
expression.
 Cust_Name ( Balance<2500 (account) customer )
 To evaluate such expression we need to evaluate each
operation one by one in appropriate order.
3  Cust_Name

2
Bottom to top
Execution

1  Balance<2500 customer

account
Evaluation of expressions
 Methods for evaluating an entire expression tree are:
1. Materialization
2. Pipelining
Materialization
 Materialization evaluates the expression tree of the
relational algebra operation from the bottom and
performs the innermost or leaf-level operations first.
 The intermediate result of each operation is materialized
(store in temporary relation) and becomes input for
subsequent (next) operations.
 The cost of materialization is the sum of the individual
operations plus the cost of writing the intermediate results
to disk.
 The problem with materialization is that
• it creates lots of temporary relations
• it performs lots of I/O operations
Pipelining
 In pipelining, operations form a queue, and results are
passed from one operation to another as they are
calculated.
 To reduce number of intermediate temporary relations, we
pass results of one operation to the next operation in the
pipelines.
 Combining operations into a pipeline eliminates the cost of
reading and writing temporary relations.
 Pipelines can be executed in two ways:
1. Demand driven (System makes repeated requests for tuples from the
operation at the top of pipeline)
2. Producer driven (Operations do not wait for request to produce
tuples, but generate the tuples eagerly.)
Query optimization
 It is a process of selecting the most efficient query
evaluation plan from the available possible plans.
 Cust_Name ( Balance<2500 (Account) Customer )

Efficient plan 2 records 4 records

 Cust_Name ( Balance<2500 (Account Customer ))

Customer 4 records Account 4 records

Cid Ano Cust_name Ano Balance


C01 A01 Raj A01 3000
C02 A02 Meet A02 1000
C03 A03 Harsh A03 2000
C04 A04 Punit A04 4000
Approaches to Query Optimization
1. Exhaustive Search Optimization
• Generates all possible query plans and then the best plan is
selected.
• It provides best solution.
2. Heuristic Based Optimization
 Heuristic based optimization uses rule-based optimization
approaches for query optimization.
• Performs select and project operations before join operations. This
is done by moving the select and project operations down the query
tree. This reduces the number of tuples available for join.
• Avoid cross-product operation because they result in very large-sized
intermediate tables.
• This algorithms do not necessarily produce the best query plan.
Transformation of relational expressions
 Two relational algebra expressions are said to be equivalent if
the two expressions generate the same set of tuples.
 Example:
Customer Account
Cid Ano Cust_name Ano Balance Cust_name
C01 A01 Raj A01 3000 Meet
C02 A02 Meet A02 1000 Harsh
C03 A03 Harsh A03 2000
C04 A04 Punit A04 4000

 Cust_Name ( Balance<2500 (Account Customer ))

 Cust_Name ( Balance<2500 (Account) Customer )


Transformation of relational expressions
1. Combined selection operation can be divided into sequence of
individual selections. This transformation is called cascade of σ.
Customer Output
Cid Ano Cust_name Balance Cid Ano Cust_name Balance
C01 1 Raj 3000 C02 2 Meet 1000
C02 2 Meet 1000
C03 3 Harsh 2000
C04 4 Punit 4000

σAno<3 Λ Balance<2000 (Customer) = σAno<3 (σBalance<2000 (Customer))

σθ1Λθ2 (E) = σθ1(σθ2 (E))


Transformation of relational expressions
2. Selection operations are commutative.

Customer Output
Cid Ano Cust_name Balance Cid Ano Cust_name Balance
C01 1 Raj 3000 C02 2 Meet 1000
C02 2 Meet 1000
C03 3 Harsh 2000
C04 4 Punit 4000

σAno<3 (σBalance<2000 (Customer)) = σBalance<2000 (σAno<3 (Customer))

σθ1(σθ2 (E) = σθ2(σθ1 (E))


Transformation of relational expressions
3. If more than one projection operation is used in expression
then only the outer projection operation is required. So
skip all the other inner projection operation.
Customer Output
Cid Ano Cust_name Balance Cust_name
C01 1 Raj 3000 Raj
C02 2 Meet 1000 Meet
C03 3 Harsh 2000 Harsh
C04 4 Punit 4000 Punit

∏Cust_name (∏Ano, Cust_name (Customer)) = ∏Cust_name (Customer)

∏L1 (∏L2 (… (∏Ln (E))…)) = ∏L1 (E)


Transformation of relational expressions
4. Selection operation can be joined with cartesian product
and theta join.
Customer Balance Output
Cid Ano Cust_name Ano Balance Cid Ano Cust_name Balance
C01 1 Raj 1 3000 C01 1 Raj 3000
C02 2 Meet 2 1000 C02 2 Meet 1000
C03 3 Harsh 3 2000
C04 4 Punit 4 4000

σAno<3 (Customer Balance) = (Customer) σAno<3 (Balance)


σθ (E1 E2) = E1 θ E2

σθ1 (E1 θ2 E2) = E1 θ1Λθ2 E2


Transformation of relational expressions
5. Theta operations are commutative.

Customer Balance Output


Cid Ano Cust_name Ano Balance Cid Ano Cust_name Balance
C01 1 Raj 1 3000 C01 1 Raj 3000
C02 2 Meet 2 1000 C02 2 Meet 1000
C03 3 Harsh 3 2000
C04 4 Punit 4 4000

(Balance) σAno<3 (Customer) = (Customer) σAno<3 (Balance)

E1 θ E2 = E2 θ E1
Transformation of relational expressions
6. Natural join operations are associative.
(E1 E2) E3 = E1 (E2 E3)
7. Selection operation distribute over theta join operation
under the following condition
• When all the attributes in the selection condition θ0 involves only the
attributes of the one of the expression (says E1) being joined.
σθ0 (E1 θ E2) = (σθ0 (E1)) θ E2
• When the selection condition θ1 involves only the attributes of E1 and
θ2 involves only the attributes of E2.
σθ1Λθ2 (E1 θ E2) = (σθ1(E1) θ (σθ2 (E2)))
Transformation of relational expressions
8. Set operations union and intersection are commutative.
set difference is not commutative Union Intersect
Customer Employee Output Output
Cust_name Emp_name Name Name
Raj Meet Raj Meet
Meet Suresh Meet
Suresh

Customer U Employee = Employee U Customer

Customer ∩ Employee = Employee ∩ Customer

E1 U E2 = E2 U E1
E1 ∩ E2 = E2 ∩ E1
Transformation of relational expressions
9. Set operations union and intersection are associative.
Union Intersect
Customer Employee Student Output Output
Cust_name Emp_name Emp_name Name Name
Raj Meet Raj Raj Meet
Meet Suresh Meet Meet
Suresh

(Customer U Employee) U Student = Customer U (Employee U Student)

(Customer ∩ Employee) ∩ Student = Customer ∩ (Employee ∩ Student)

(E1 U E2) U E3 = E1 U (E2 U E3)


(E1 ∩ E2) ∩ E2 = E1 ∩ (E2 ∩ E3)
Transformation of relational expressions
10.Selection operation distributes over U, ∩ and –.
σθ(E1 – E2) = σθ(E1) – σθ(E2)
similarly selection operation is distributed for U and ∩ also.
Sorting
 Several of the relational operations, such as joins, can be
implemented efficiently if the input relations are first
sorted.
 We can sort a relation by building an index on the relation
and then using that index to read the relation in sorted order.
 Such a process orders the relation only logically rather than
physically.
 Hence reading of tuples in the sorted order may lead to disk
access for each record, which can be very expensive.
 So it is desirable to order the records physically.
Sorting
 Sorting of relation that fit into main memory, standard
sorting techniques such as quick-sort can be used.
 Sorting of relations that do not fit in main memory is called
external sorting.
 Most commonly used algorithm for this type of sorting is
external sort merge algorithm.
External Sort-Merge (Example)
 Blocks=3 24 19 14
24 2
19 24 16
19 3
31 31 19
31 7
33 14 24
33 14
14 16 31
14 14
16 33 33
16 16
16 16 3 16
2
21 21 16 19
3
3 3 21 21
7
2 24
2 2 14
7 create merge merge 31
7 7 16
14 runs pass-1 pass-2 33
14 14 21
initial relation runs runs sorted output
External Sort-Merge Algorithm
 Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
Repeatedly do the following till the end of the relation:
1) Read M blocks of relation into memory
2) Sort the in-memory blocks
3) Write sorted data to run Ri; then increment i.
Let the final value of i be N

2. Merge the runs (next slide)…..


External Sort-Merge
2. Merge the runs (N-way merge). We assume (for now) that N <
M.
1) Use N blocks of memory to buffer input runs, and 1 block to buffer
output. Read the first block of each run into its buffer page
2) repeat
• Select the first record (in sort order) among all buffer pages
• Write the record to the output buffer. If the output buffer is full write it to
disk.
• Delete the record from its input buffer page.
If the buffer page becomes empty then
read the next block (if any) of the run into the buffer.
3) until all input buffer pages are empty:
External Sort-Merge
 If N  M, several merge passes are required.
• In each pass, contiguous groups of M - 1 runs are merged.
• A pass reduces the number of runs by a factor of M -1, and creates runs
longer by the same factor.
 E.g. If M=11, and there are 90 runs, one pass reduces the number of
runs to 9, each 10 times the size of the initial runs
• Repeated passes are performed till all runs have been merged into one.
Sum (Nested loop join)
 Assuming worst case memory availability and the following
given statistics for the relations customer and depositor
• Number of records of customer: 10,000 (ncustomer)
• Number of records of depositor: 5,000 (ndepositor)
• Number of blocks of customer: 400 (bcustomer)
• Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with depositor as outer relation
2. with customer as outer relation
Sum (Nested loop join)
 Assuming worst case memory availability and the following
given statistics for the relations customer and depositor
• Number of records of customer: 10,000 (ncustomer)
• Number of records of depositor: 5,000 (ndepositor)
• Number of blocks of customer: 400 (bcustomer)
• Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with depositor as outer relation
No. of blocks access = ndepositor * bcustomer + bdepositor
= 5000 *400 + 100
= 2000100
Sum (Nested loop join)
 Assuming worst case memory availability and the following
given statistics for the relations customer and depositor
• Number of records of customer: 10,000 (ncustomer)
• Number of records of depositor: 5,000 (ndepositor)
• Number of blocks of customer: 400 (bcustomer)
• Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with customer as outer relation
No. of blocks access = ncustomer * bdepositor + bcustomer
= 10000 *100 + 400
= 1000400
Sum (Nested loop join)
 Assuming best case memory availability and the following
given statistics for the relations customer and depositor
• Number of records of customer: 10,000 (ncustomer)
• Number of records of depositor: 5,000 (ndepositor)
• Number of blocks of customer: 400 (bcustomer)
• Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
No. of blocks access = bdepositor + bcustomer
= 100 + 400
= 500
Join Operations
 There are several different algorithms that can be used to
implement joins
1. Nested-Loop Join
2. Block Nested-Loop Join
3. Index Nested-Loop Join
4. Sort-Merge Join
5. Hash-Join
Cost of computing for all joins
 R is called the outer and S the inner relation of the join.
• Number of records of R: (NR)
• Number of records of S: (NS)
• Number of blocks of R: (BR)
• Number of blocks of S: (BS)
Join Worst Case Best Case
Nested-Loop Join B R + NR ∗ B S BR + B S
Block Nested-Loop Join BR + B R ∗ BS BR + B S
Index Nested-Loop Join B R + NR ∗ c
Merge Join BR + B S
Hash-Join 3 ∗ (BR + BS)

• c is the cost of a single selection on S using the join condition.

You might also like