Query Processing, Optimization &
Indexing
Lecture Agenda
⚫ Query Processing in DBMS
⚫ Query Parsing & Optimization
⚫ Execution Plans & Join Algorithms
⚫ Indexing & Index Structures
⚫ B-Trees, Hash Indexes
⚫ Unique, Composite, and Covering Indexes
What is Query Processing?
⚫ Query processing is the series of steps a database
management system (DBMS) follows to interpret and
execute a user's request (query) to retrieve or
manipulate data stored in a database.
⚫ Translation of SQL queries into low-level instructions
Query Processing Phases
Query Processing Involves multiple stages: Parsing,
Optimization, Execution
⚫ Parsing: Analyzing SQL syntax and semantics
⚫ Optimization: Finding the most efficient execution
plan
⚫ Execution: Running the query using the selected plan
Query Parsing
⚫ Query parsing is the process of analyzing a query
written in SQL (or another query language) to
understand its syntax and structure.
⚫ The parser checks whether the query follows the
correct grammar of the query language. If the syntax is
correct, it translates the query into a parse tree or
abstract syntax tree (AST), which is a tree
representation of the query's logical structure.
Steps in Query Parsing
⚫ Lexical Analysis: The query string is divided into
tokens (e.g., keywords, operators, identifiers).
⚫ Syntax Analysis: The tokens are checked against the
grammar of the query language to ensure correct
structure.
⚫ Semantic Analysis: The parser checks for semantic
errors (e.g., referencing non-existent tables or
columns).
⚫ Query Rewrite: Some queries are rewritten for
optimization purposes (e.g., transforming a subquery
into a join).
Query Optimization
⚫ Query Optimization is the process where the database
system evaluates multiple ways to execute a query and
chooses the most efficient one, usually based on cost
estimates (e.g., time, memory, or I/O operations).
⚫ It happens after parsing and translating the query into a
logical plan, but before actual execution.
⚫ Why Optimize? Reduce I/O, CPU, memory use
Types of Query Optimization
⚫ Query optimization can be categorized in several ways
based on how and when the optimization occurs
⚫ Key Types are
1. Heuristic query optimization
2. Cost-based Query optimization
Heuristic Query Optimization
⚫ Uses a set of rules of thumb (heuristics) to transform
and simplify the query into a more efficient form
without considering cost estimates.
Key Techniques:
⚫ Apply selection operations as early as possible (push
down selections).
⚫ Combine selections and projections to reduce the
number of columns/rows.
⚫ Use smaller tables first in joins.
⚫ Reorder joins and other operations based on known
patterns.
Heuristic optimization
Advantages:
⚫ Fast and simple
⚫ Good for basic improvements
Disadvantage:
Doesn’t always find the most efficient plan
Cost-Based Query Optimization
⚫ Evaluates multiple possible query execution plans using
statistics (e.g., table size, indexes, row selectivity) and chooses
the one with the lowest estimated cost.
Key Steps:
⚫ Generate All Possible Execution Plans: The optimizer
generates all possible ways to execute the query using
different combinations of joins, scans, and sorting.
⚫ Estimate the Cost of Each Plan: Each plan’s cost is
estimated based on factors such as:
⚫ Number of I/O operations: Reading and writing data.
⚫ CPU time: Time spent processing data.
⚫ Memory usage: Storage used during query processing.
⚫ Select the Best Plan: The plan with the lowest cost is
selected as the optimal execution plan.
Cost Based optimization
Factors Affecting Query Cost:
⚫ Table Size: Larger tables require more I/O
operations to read.
⚫ Index Availability: If indexes are available, they
can speed up data retrieval.
⚫ Join Methods: Different join algorithms (e.g.,
nested loop, merge join, hash join) have different
costs.
⚫ Sort Operations: Sorting data can be expensive if
it requires additional memory or disk I/O.
Cost Based optimization
Advantage:
⚫ More accurate and efficient plans
⚫ Adapts to real data distribution
Disadvantage:
⚫ More time-consuming than heuristic optimization
⚫ Depends heavily on up-to-date statistics
Query Execution
⚫ A Query Execution Plan (QEP) is a step-by-step
strategy chosen by the database management system
(DBMS) to execute a SQL query efficiently. After
optimization, the DBMS selects the best plan and uses
it to retrieve or modify the data.
⚫ Generated by the optimizer
⚫ Describes how tables are accessed and joined
Execution Plan
⚫ Access Paths
How data will be accessed: full table scan, index scan, etc.
⚫ Join Methods
How tables will be joined: nested loop, hash join, merge
join.
⚫ Join Order
In what sequence tables will be joined.
⚫ Selection/Projection
When and how WHERE and SELECT clauses are applied.
⚫ Intermediate Steps
Temporary tables, sorting, filtering, grouping.
⚫ Estimated Costs
Estimated time, CPU usage, I/O operations, number of
rows processed at each step.
Example: Optimized Query Execution
Plan
⚫ Student(student_id, name, age, course_id)
⚫ Course(course_id, course_name)
SELECT s.name, c.course_name
FROM Student s
JOIN Course c ON s.course_id = c.course_id
WHERE s.age > 20;
Execution Plan
Step Operation Table Method Notes
Use index on
1 Index Scan Student B-Tree Index age to filter
early
2 Filter Student - s.age > 20
Student + Join on
3 Hash Join Hash Table
Course course_id
Load all
Table Access Full Table
4 Course course
(Full) Scan
records
Output
s.name,
5 Projection Result -
c.course_na
me
How to View Execution Plans
⚫ MYSQL EXPLAIN Select
Join Algorithms Overview
The type of join used in the execution plan of a query
depends on various factors, such as:
⚫ The size of the tables involved
⚫ Availability of indexes
⚫ Join condition (e.g., equality vs inequality)
⚫ The database engine's optimizer
Selection depends on: Input size, Sort order, Join
condition
Types of Join Algorithms
Types of Join Algorithms:
⚫ Nested Loop Join,
⚫ Hash Join,
⚫ Merge Join
Nested Loop Join
⚫ Brute-force method: Compare every pair
⚫ Time Complexity: O(n × m)
⚫ Best For: Small datasets or indexed lookups
How it works: For each row in the outer table, the DBMS
searches for matching rows in the inner table.
Best when: One table is small, or there's an index on the join
column of the inner table.
⚫ Example in plan:
Nested Loop
-> Index Scan on Student
-> Index Lookup on Course
Hash Join
⚫ Build Phase: Hash smaller table
⚫ Probe Phase: Scan larger table and match using hash
⚫ Efficient for equi-joins
How it works:
⚫ Build a hash table on the smaller table using the join key.
⚫ Scan the larger table and use the hash table to find matches.
Best when: No indexes, large tables, and equality joins.
Example in plan:
Hash Join
-> Seq Scan on Student
-> Hash on Course
Merge Join
⚫ Works on sorted inputs
⚫ Time Complexity: O(n + m)
⚫ Efficient for large, pre-sorted data
How it works: Both tables are sorted on the join key,
and then merged together like in merge sort.
Best when: Tables are already sorted or sorting is
efficient.
Example in plan:
Merge Join
-> Sort on Student.course_id
-> Sort on Course.course_id
Example
Consider Schema
Student(student_id, name, age, course_id)
Course(course_id, course_name)
Query:
Select the names of students enrolled in course of
database
Query optimization & Joins Example
⚫ Plan 1:
Select s.name from student
Where course_id
IN (select course_id from course where course_name =
Database)
⚫ Plan 2:
Select s.name from student s JOIN course c
ON s.course_id = c.course_id
Where c.course_name = ‘Database’;
Query optimization & Joins Example
⚫ Plan 3:
WITH registration AS(select course_id from course
where course_name = ‘Database’)
Select s.name from student s JOIN registration r ON
s.course_id = r.course_id;
Join Algorithm Comparison
⚫ Nested Loop: Small tables, O(n × m)
⚫ Hash Join: Medium/Large, EQ joins, O(n + m)
⚫ Merge Join: Sorted data, O(n + m)
Join Type Best For Common in execution
plan when
Nested Loop Small + Large Index exists on inner
Join (with index) table
Hash Join Large + Large No index, but equality
(equality only) join
Merge Join Sorted data Data already sorted or
sorted easily
Class Assignment
⚫ Optimize this query:
⚫ SELECT E.name, D.name FROM Employees E JOIN
Departments D ON E.dept_id = D.id WHERE E.salary
> 70000;
⚫ Suggest 2 execution plans, Recommend suitable
indexes