
Nazarbayev University

SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov

Lecture
Relational Database Design
Functional Dependency (FD)
Relational Database Design

Very poor design


Relational Database Design

Single Relation:
1. Attribute semantics (logical grouping to form base relation(s))
2. Data redundancy (redundant values, columns, and relations)
3. Null values (unknown, inapplicable, no information)

Multiple Relations (relationship types, connected only by foreign keys):
1. Spurious tuples

Problems:
 Large unnecessary data retrieval
 Inefficient data management – anomalies in Update, Delete, Insert commands
 Inefficient data access
 Dependency and data loss
 Spurious data generation anomaly

Consequences:
 data inconsistency
 data loss
 poor DML and DQL performance
 insert rejection or null values
Database Normalization

Solutions:
1. Data Decomposition (lossless joins)

Non-additive property

Losslessness property

Functional dependencies preservation
2. Functional Dependency
3. Multivalued Dependency
Database Normalization
Normalization – using keys and FDs, normalize a relation into normal form by decomposing the relation.

Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)

UNF → 1NF → 2NF → 3NF → BCNF → 4NF → 5NF → …


Functional Dependency (FD)
Used to identify inter-attribute (or domain) relationships
 Single or multiple attributes relationships
 Used in normal forms
 A type of constraint
 Must be true for any instance r of relation R

* Foreign key – used to connect tables


FD in Product
 model → {type, maker}
 type ↛ {model, maker}
 maker ↛ {model, type}
 {model, type} → maker
 {model, maker} → type
 {maker, type} ↛ model

Example of maker ↛ {model, type} (product types per maker):

 A – {pc, laptop}
 B – {pc, laptop}
 C – pc
 D – {pc, printer}
 E – {pc, laptop, printer}
 F – laptop
 G – laptop
 H – printer

Makers A, B, D, and E each produce more than one product type, so maker does not determine type (or model).
FD in Product
 model → {type, maker}
 type ↛ {model, maker}
 maker ↛ {model, type}
 {model, type} → maker
 {model, maker} → type
 {maker, type} ↛ model

Candidate keys: model
Super keys: {model, type}, {model, maker}
Foreign keys: none

* Since there is only one candidate key:
 the primary key of table Product is model.
 there are no alternative keys.
* Any key (except FKs) in R functionally determines every other attribute in R.
Key Types
Key – a set of attributes X such that for any distinct tuples t1 and t2 in R, t1[X] != t2[X]
 i.e. a key value identifies one tuple only

Composite key – two or more attributes

Candidate key – a minimal determinant of all attributes in a relation

Primary key – a candidate key selected from all candidate keys
Alternative key – complement of the primary key (the remaining candidate keys)
Super key – a not necessarily minimal determinant of all attributes in a relation

Prime attribute(s) – attribute(s) that are a member of any of the candidate keys of relation R
Non-prime attribute(s) – complement of the prime attributes

* Foreign key – used in inter-table relationships


Functional Dependency (FD)
X – a single attribute or set of attributes of relation R
Y – a single attribute or set of attributes of relation R

X → Y means X uniquely identifies Y (a single tuple).
In other words, Y is functionally dependent on X.

Iff for each value x ∈ R.X there is only one value y ∈ R.Y, then X → Y.

If X → Y, then for any tuples t1 and t2 in R, if t1[X] = t2[X], then t1[Y] = t2[Y] must hold.

* X → Y is different from Y → X
* X and Y are not necessarily keys
* If X is a key of R, then X → Y for any Y in R
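The tuple-based definition above can be checked mechanically against a concrete instance of R. A minimal sketch in Python (the `fd_holds` helper and the toy `product` rows are illustrative, not from the lecture):

```python
def fd_holds(rows, X, Y):
    """Check X -> Y on a concrete instance: whenever two tuples agree on
    all attributes in X, they must also agree on all attributes in Y."""
    seen = {}
    for t in rows:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if key in seen and seen[key] != val:
            return False  # two tuples agree on X but differ on Y
        seen[key] = val
    return True

# Toy Product instance: model determines type and maker,
# but maker does not determine model.
product = [
    {"maker": "A", "model": 1001, "type": "pc"},
    {"maker": "A", "model": 2002, "type": "laptop"},
    {"maker": "B", "model": 1002, "type": "pc"},
]
print(fd_holds(product, ["model"], ["type", "maker"]))  # True
print(fd_holds(product, ["maker"], ["model"]))          # False
```

Note that an instance can only refute an FD, not prove it: the dependency is a constraint that must hold for every legal instance of R.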
FD in Product
R – relation Product
{A1, A2, A3} – columns of Product (maker, model, type), respectively

 model → {type, maker}    A2 → {A3, A1}
 type ↛ {model, maker}    A3 ↛ {A2, A1}
 maker ↛ {model, type}    A1 ↛ {A2, A3}
 {model, type} → maker    {A2, A3} → A1
 {model, maker} → type    {A2, A1} → A3
 {maker, type} ↛ model    {A1, A3} ↛ A2
Inference Rules of FD
R – relation Product
{A1, A2, A3, A4, A5} – columns of Product, respectively

Functional Dependencies (given):
A1 → {A3, A4}
A2 → A4
{A1, A2} → A5

Splitting Rule:
A1 → {A3, A4}
is
A1 → A3
A1 → A4

Combining Rule:
A1 → A3
A1 → A4
is
A1 → {A3, A4}
Inference Rules of FD
R – relation Product
{A1, A2, A3, A4, A5} – columns of Product, respectively

Functional Dependencies (given):
A1 → {A3, A4}
A2 → A4
{A1, A2} → A5

Closure Rule:
{A1}+ = {A1, A3, A4}
{A2}+ = {A2, A4}
{A1, A2}+ = {A1, A2, A3, A4, A5}
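The closures above can be computed with the standard fixed-point algorithm: starting from X, repeatedly add the right side of every FD whose left side is already contained in the result. A sketch in Python (the `closure` helper name and the FD encoding are assumptions for illustration):

```python
def closure(attrs, fds):
    """Compute the attribute closure X+ by repeatedly applying every FD
    whose left side is already contained in the running result."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# FDs from the slide: A1 -> {A3, A4}, A2 -> A4, {A1, A2} -> A5
fds = [({"A1"}, {"A3", "A4"}),
       ({"A2"}, {"A4"}),
       ({"A1", "A2"}, {"A5"})]
print(sorted(closure({"A1"}, fds)))        # ['A1', 'A3', 'A4']
print(sorted(closure({"A1", "A2"}, fds)))  # ['A1', 'A2', 'A3', 'A4', 'A5']
```

The loop terminates because the result set only grows and is bounded by the relation's attributes.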
Inference Rules of FD (exercise)
R – relation Product
{A1, A2, A3, A4, A5, A6} – columns of Product, respectively

Functional Dependencies (given):
A1 → A3
A2 → {A4, A5}
{A1, A2} → A6

Closure Rule:
{A1}+ = {??}
{A2}+ = {??}
{A1, A2}+ = {??}
Finding a Key for a Relation
R – relation Product
{A1, A2, A3} – columns of Product, respectively

A2 → {A3, A1}
A3 ↛ {A2, A1}
A1 ↛ {A2, A3}

{A2, A3} → A1
{A2, A1} → A3
{A1, A3} ↛ A2

{A2}+ = {A1, A2, A3}

Thus, A2 is a key.

Find all possible candidate keys:
 Is X a candidate key (X+ contains all attributes and X is minimal)?
 Is X a super key (X+ contains all attributes)?
Finding a Key for a Relation (exercise)
R – relation Product
{A1, A2, A3, A4, A5, A6} – columns of Product, respectively

A1 → {A2, A3}
{A3, A4} → {A5, A6}
A2 → A6

{A1, A4}+ = {??}

Find all possible candidate keys:
 Is X a candidate key?
 Is X a super key?
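Both questions can be answered by searching attribute subsets smallest-first with the closure test: X is a super key if X+ contains all attributes, and a candidate key if it is additionally minimal. A sketch in Python (the helper names are illustrative):

```python
from itertools import combinations

def closure(attrs, fds):
    """Fixed-point computation of the attribute closure X+."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(all_attrs, fds):
    """Enumerate subsets smallest-first; a candidate key is a minimal X
    with X+ = all attributes (supersets of known keys are skipped)."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            x = set(combo)
            if any(k <= x for k in keys):  # contains a smaller key: not minimal
                continue
            if closure(x, fds) == all_attrs:
                keys.append(x)
    return keys

# FDs from the first key-finding slide: A2 -> {A1, A3}
attrs = {"A1", "A2", "A3"}
fds = [({"A2"}, {"A1", "A3"})]
print(candidate_keys(attrs, fds))  # [{'A2'}]
```

The subset enumeration is exponential in the number of attributes, which is acceptable for the small relations used in these exercises.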
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Reflexivity: If Y ⊆ X, then X → Y holds (a trivial FD).
Suppose Y ⊆ X, then for any tuples t1 and t2 in R,
if t1[X] = t2[X], then t1[Y] = t2[Y] must hold.

Example: since Y ⊆ XY, XY → Y is a trivial functional dependency.
For an FD of the form XY → YZ, removing the common attributes Y from the right side
gives XY → Z (a nontrivial functional dependency).
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Augmentation: If X → Y, then WX → WY holds.
Suppose X → Y, then for any tuples t1 and t2 in R:
1) if t1[WX] = t2[WX], then
2) t1[X] = t2[X] and t1[W] = t2[W]
3) t1[Y] = t2[Y] because X → Y
4) t1[WY] = t2[WY] by combining 2) and 3)

Thus, from X → Y we can derive WX → WY.
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Transitivity: If X → Y and Y → Z, then X → Z holds.
Suppose X → Y and Y → Z, then for any tuples t1 and t2 in R
such that t1[X] = t2[X], we must have t1[Y] = t2[Y] and t1[Z] = t2[Z].

If X ⊇ Y or X ⊇ Z, the reflexivity rule is applied.
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Decomposition: If X → YZ, then X → Y and X → Z hold.
Suppose X → YZ,
then YZ → Y because Y ⊆ YZ, thus X → Y by transitivity.
Similarly, YZ → Z because Z ⊆ YZ, thus X → Z by transitivity.
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Union: If X → Y and X → Z, then X → YZ holds.
Suppose X → Y and X → Z,
then X → XY by augmentation (augment X → Y with X: XX → XY), and
XY → ZY by augmentation (augment X → Z with Y),
thus X → YZ by transitivity.
Inference Rules of FD
Computation Rules (Armstrong’s axioms):
 Reflexivity (trivial)
 Augmentation
 Transitivity

Derived Rules:
 Decomposition (split, projection)
 Union (combine)
 Pseudo-transitivity

Pseudo-transitivity: If X → Y and WY → Z, then WX → Z holds.
Suppose X → Y and WY → Z,
then WX → WY by augmentation,
thus WX → Z by transitivity.
Types of FD

 Trivial FD – if Y ⊆ X, then X → Y is a trivial functional dependency (e.g. XY → Y, since Y ⊆ XY).

 Nontrivial FD:

Full FD – if X → Y and, for every attribute A ∈ X, (X − A) → Y does not hold, then X → Y is a full functional dependency.
i.e. if removal of any attribute from X violates X → Y, then X → Y is a full functional dependency.

Partial FD – if ZX → Y and X → Y also holds, then ZX → Y is a partial functional dependency.

Transitive FD – if X → Y and Y → Z, then X → Z is a transitive functional dependency.
Nazarbayev University
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov

Lecture
Normalization and Normal Forms
Multivalued Dependency (MVD)
1st Normal Form
R – relation Product
{A1,A2,A3,A4,A5,A6,A7,A8,A9,A10} - columns of Product, respectively

A1 → {A3, A4}
A2 → {}
A3 → {A4}
A4 → {}
A5 → {A2, A6, A7, A8, A9, A10}
A6 → {}
A7 → {}
A8 → {}
A9 → {}
A10 → {}
{A1, A5} → {A10}   * Makers have only a single office in one city.

{A1, A5}+ = {A1, A2, A3, A4, A5, A6, A7, A8, A9, A10}

Prime attributes: maker, model

Candidate keys: {maker, model} (a composite key)
Primary key: {maker, model}
Alternative keys: none
1st Normal Form

A relation is in first normal form (1NF) if:


1) Atomic values only in attributes

No composite attributes

No multivalued attributes

No nested relations
2) Primary Key is identified
Anomalies
1. Data redundancy
2. Data inconsistency
3. Spurious data generation

Integer (4 bytes)
Double (8 bytes)
String (~16 bytes)

‘Product’ record = 96 bytes (integer x4, double x2, string x4)


‘Product’ relation = 960 bytes (records x10)
2nd Normal Form
A relation R is in second normal form (2NF) if:
 R is in 1NF
 every non-prime attribute in R is fully (not partially) functionally dependent on all attributes of the primary key of R.

i.e. no partial functional dependencies ( if ZX → Y and X → Y in FDs, then decompose X → Y into a new table)

(Recursively) eliminate partial functional dependencies by decomposing R.

R – relation Product
{A1,A2,A3,A4,A5,A6,A7,A8,A9,A10} - columns of Product, respectively

A1 → {A3,A4}
A2 → {}
A3 → {A4}
A4 → {}
A5 → {A2,A6,A7,A8,A9,A10}
A6 → {}
A7 → {}
A8 → {}
A9 → {}
A10 → {}
{A1,A5} → {A10}

{A1,A5}+ = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10}
2nd Normal Form
3rd Normal Form

A relation R is in third normal form (3NF) if:

 R is in 2NF

 every non-prime attribute in R is directly (non-transitively) dependent on the primary key of R.

i.e. no transitive functional dependencies (if X → Y and Y → Z in FDs, then decompose Y → Z into a new table)

(Recursively) eliminate transitive functional dependencies by decomposing R.

 In every X → Y in FDs, X must be a super key OR Y must be prime attribute(s)

R – relation Product
{A1,A2,A3,A4,A5,A6,A7,A8,A9,A10} - columns of Product, respectively

A1 → {A3,A4}
A2 → {}
A3 → {A4}
A4 → {}
A5 → {A2,A6,A7,A8,A9,A10}
A6 → {}
A7 → {}
A8 → {}
A9 → {}
A10 → {}
{A1,A5} → {A10}

{A1,A5}+ = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10}
3rd Normal Form

Integer (4 bytes)
Double (8 bytes)
String (~16 bytes)

‘Product’ record = 96 bytes (integer x4, double x2, string x4)


‘Product’ relation = 960 bytes (records x10)

‘DETAILS’ record = 44 bytes


‘DETAILS’ relation = 396 bytes
‘PRODUCT’ record = 32 bytes
‘PRODUCT’ relation = 96 bytes
‘PRODUCT_DETAILS’ record = 24 bytes
‘PRODUCT_DETAILS’ relation = 240 bytes
‘REGION’ record = 32 bytes
‘REGION’ relation = 64 bytes
TOTAL: ~796 bytes (without indexes, views, etc.)

Difference 164 bytes

1. Anomaly elimination
2. Lossless join recovery
3. FD preservation
Boyce-Codd Normal Form (BCNF)
A relation R is in BCNF if:

 R is in 3NF

 in every nontrivial X → Y in FDs, X is a super key (unlike 3NF, there is no exception for Y being prime attribute(s))

R = {title, theater, city}

Functional Dependencies (given):

theater → city ← FD satisfies 3NF but not BCNF (theater is not a super key, but city is a prime attribute)

{city, title} → theater ← FD satisfies BCNF ({city, title} is a candidate key, hence a super key)

Candidate Keys:

{theater, title}

{city, title}

Decompose:

R1 {theater, city} and R2 {city, title}

R1 {theater, title} and R2 {city, title}

R1 {theater, city} and R2 {theater, title}
Database Normalization
Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)

UNF → 1NF → 2NF → 3NF → BCNF → 4NF → 5NF → …


Multivalued Dependency (MVD)
Independence of attribute(s).
Occurs in many-to-many relationships.

Suppose X ⊆ R, Y ⊆ R, and Z = R − X − Y.

X →→ Y is a multivalued dependency if, for tuples t1, t2, t3, t4 in R:

 t1[X] = t2[X] = t3[X] = t4[X]
 t1[Y] = t3[Y] and t2[Z] = t3[Z]
 t2[Y] = t4[Y] and t1[Z] = t4[Z]

i.e. whenever t1 and t2 agree on X, there exist tuples t3 and t4 that swap their Y and Z values.

R = {title, theater, city}

Candidate Keys:
{theater, city, title}

Multivalued Dependencies (given):

city →→ title
city →→ theater
4th Normal Form

A relation R is in 4NF if:

 R is in BCNF
 for every nontrivial MVD X →→ Y, X is a super key.

R = {title, theater, city}

Multivalued Dependencies (given):

city →→ title
city →→ theater

Candidate Keys:
{theater, city, title}

Decompose:
R1 {city, title} and R2 {city, theater}

* both relations are in 4NF and in BCNF, since every FD is also an MVD.

* if R is in 4NF, then it is in BCNF. However, R can be in BCNF but not in 4NF.
Nazarbayev University
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov

Lecture
Introduction to Query Optimization
Relational Database System Architecture Overview
Query Performance

SQL → RA, etc.: 0.534 ms
0.179 ms
16736.147 ms (or 16.74 sec)

Total: 16736.860 ms or 16.74 sec
TPCH Database
Selectivity (percentage %)
High selectivity – fewer rows to scan and filter.

SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN';

TPCH dataset:
1. Partsupp: 8000
2. Part: 2000
3. Supplier: 100
4. Nation: 25
5. Region: 5

Result |T’| = 1 tuple (4% of data, high selectivity)
|T’| = |T| × 0.04 = 1
Selectivity (multiple selection predicates)
High selectivity – fewer rows to scan and filter.

SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN' AND n_comment LIKE '%haggle%';

Result |T’| = 0 tuples (0% of data, high selectivity)
|T’| = |T| × 0.0 = 0

Result-1 |T1’| = 1 tuple (4% of data on n_name, high selectivity)
|T1’| = |T| × 0.04 = 1
Result-2 |T2’| = 4 tuples (16% of data on n_comment, ~high selectivity)
|T2’| = |T| × 0.16 = 4
Selectivity ↔ Cardinality
High selectivity – fewer rows to scan and filter.

SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN';

Result |T’| = 1 tuple (4% of data, high selectivity)
|T’| = |T| × 0.04 = 1 (cardinality)
Cardinality: Single Table

SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN';

Result |T’| = 1 tuple
Cardinality: Two Tables

SELECT COUNT(*)
FROM nation, region
WHERE n_regionkey = r_regionkey
AND r_name = 'EUROPE';

Result |T’| = 5 tuples
Cardinality: Multiple Tables

SELECT COUNT(*)
FROM nation, supplier, region
WHERE n_nationkey = s_nationkey
AND n_regionkey = r_regionkey;

Result |T’| = 100 tuples
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey;

Note: the order doesn’t matter in SQL unless the order is forced to the DB server.

|T’| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows (only intermediate tables)
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey;

Note: the order doesn’t matter in SQL unless the order is forced to the DB server.

|T’| = 100 + 100 + 8,000 + 8,000 = 16,200 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Predicates
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey

AND r_name = 'ASIA' AND n_name = 'JAPAN';

Note: the order doesn’t matter in SQL unless the order is forced to the DB server.

|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Push-Down
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey

AND r_name = 'ASIA' AND n_name = 'JAPAN';

Note: the order doesn’t matter in SQL unless the order is forced to the DB server.

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows


Cardinality in Join Ordering with Projection Push-Down
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey

AND r_name = 'ASIA' AND n_name = 'JAPAN';

Note: the order doesn’t matter in SQL unless the order is forced to the DB server.

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows (matters in bytes)


Comparing Query Plans

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows

|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680 rows


Cost Function

n – number of tables

Primitive cost function that sums the cardinalities of base and intermediate tables.

Why might this function be insufficient?


Cost Function

|T’| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows |T’| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows

n – number of tables

Primitive cost function that sums the cardinalities of base and intermediate tables.

Why might this function be insufficient?


Join Ordering in Query Plan

Left-deep tree, Right-deep tree, Zig-zag tree, Bushy tree
Query Plan Search Space

3 tables ≤ 3! (three factorial) = 6 plans


4 tables ≤ 4! = 24 plans
5 tables ≤ 5! = 120 plans

10 tables ≤ 10! = 3,628,800 plans

We cannot execute all query plans first and only then say which plan processes the query fastest.
Query Plan Enumeration Approaches

3 tables ≤ 3! (three factorial) = 6 plans


4 tables ≤ 4! = 24 plans
5 tables ≤ 5! = 120 plans

10 tables ≤ 10! = 3,628,800 plans

Approach #1: Permutation-based representation – includes cross-joins in the search space
Approach #2: Graph problem – cross-joins can be avoided

Query Plan Enumeration Algorithms

1. Dynamic Programming algorithm


a) Top-down version
b) Bottom-up version

2. Greedy approach
Query Plan Selection

(Query plan tree with unknown cardinalities at every node; n – number of tables)

|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680

How does the optimizer know the cardinalities in advance?


True Cardinality vs Estimated Cardinality

(Query plan tree with unknown cardinalities at every node; n – number of tables)

|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680

Metadata and statistics are used for estimation (stored in the Database Catalog)

Referential constraints are used for deriving upper bounds
Selection Operator Estimation: Point condition (= sign)

SELECT COUNT(*) FROM nation WHERE n_name = 'JAPAN';

|T’| ≈ 25 / 25 = 1 tuple
True result: 1 tuple, D(n_name) = 25

SELECT COUNT(*) FROM nation WHERE n_regionkey = 3;

|T’| ≈ 25 / 5 = 5 tuples
True result: 5 tuples, D(n_regionkey) = 5

* regionkey 3 is “EUROPE”

Notation:
T’ – output table
|T’| – size of result table
T – input table
|T| – size of input table
D(T.a) – number of distinct values in column a of table T

D(T.a) needs to be stored or estimated.
Some databases periodically estimate D(T.a) based on randomly chosen sample data (e.g. 1%).
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM nation WHERE n_regionkey = 3;

|T’| ≈ 25 / 5 = 5 tuples

True result: 5 tuples, D(n_regionkey) = 5

Assumption: uniform distribution

In real world data, this assumption is often violated, thus there are errors in the estimations.
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM supplier WHERE s_nationkey = 1;

|T’| ≈ 500 / 25 = 20 tuples

True result: 14 tuples, D(s_nationkey) = 25

* nationkey 1 is “ARGENTINA”

Assumption: uniform distribution

In real-world data, this assumption is often violated, thus there are errors in the estimations.
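Under the uniformity assumption, the point-condition estimate is just |T| / D(T.a). A sketch in Python (the helper name is illustrative):

```python
def estimate_point_selection(table_size, distinct_values):
    """Point-predicate estimate under the uniformity assumption:
    |T'| ~ |T| / D(T.a)."""
    return table_size / distinct_values

# Slide examples (TPCH):
print(estimate_point_selection(25, 25))   # n_name = 'JAPAN'  -> 1.0 (true: 1)
print(estimate_point_selection(25, 5))    # n_regionkey = 3   -> 5.0 (true: 5)
print(estimate_point_selection(500, 25))  # s_nationkey = 1   -> 20.0 (true: 14)
```

The third case shows the estimation error introduced when the column's values are not uniformly distributed.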
Join Operator Estimation: Two-way join

SELECT COUNT(*) FROM nation, supplier
WHERE n_nationkey = s_nationkey;

TPCH dataset:
nation: 25
supplier: 500

True result: 500 tuples
D(n_nationkey) = 25, D(s_nationkey) = 25
Join Operator Estimators

Two-way join estimator:

|T1 ⋈ T2| ≈ (|T1| × |T2|) / max(D(T1.a), D(T2.a)) tuples

Assumptions: uniformity, inclusion, and independence (in multi-way joins)
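Assuming the estimator intended here is the standard textbook formula |T1| × |T2| / max(D(T1.a), D(T2.a)), it can be sketched as:

```python
def estimate_join_size(t1_size, t2_size, d1, d2):
    """Two-way equi-join estimate under the uniformity and inclusion
    assumptions: |T1 join T2| ~ |T1| * |T2| / max(D(T1.a), D(T2.a))."""
    return t1_size * t2_size / max(d1, d2)

# nation join supplier on nationkey: D = 25 on both sides
print(estimate_join_size(25, 500, 25, 25))  # 500.0, matching the true result
```

Dividing by the larger distinct count reflects the inclusion assumption: every join value of the column with fewer distinct values is assumed to appear on the other side.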
Join Algorithms
Join Algorithms
Algorithms Category #1:
 Sorting-based methods
 Hash-based methods
 Index-based methods

Algorithms Category #2:


 One-pass algorithms
 Two-pass algorithms
 Many-pass algorithms
(Example) Nested-loop Join algorithm

Runtime Complexity: O(|R| × |S|) tuple comparisons, for input tables R and S

Memory Complexity: O(1) pages – only the current pages of the inputs need to be in memory
(Example) Nested-loop Join algorithm

Product.Pages = {1, 2, 3, 4, 5, 6} – stored on the hard drive; pages are scanned one at a time into main memory.

FOR each page p in Product.Pages DO:
    FOR each tuple t1 in p DO:
        FOR each tuple t2 in PC DO:
            IF t1 matches t2 (join condition) THEN:
                output (t1, t2)

Example:
If a page ≈ 512 MB and each page can contain 5 tuples, then 6 pages × 512 MB ≈ 3 GB of data.
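The page-oriented pseudocode above corresponds to the following tuple-level sketch in Python (the `nested_loop_join` helper and the toy tables are illustrative; a real implementation iterates over disk pages):

```python
def nested_loop_join(outer, inner, pred):
    """Nested-loop join: for every outer tuple, scan the whole inner
    table and emit the pairs that satisfy the join predicate."""
    out = []
    for t1 in outer:          # outer loop: one pass over Product
        for t2 in inner:      # inner loop: full scan of PC per outer tuple
            if pred(t1, t2):
                out.append((t1, t2))
    return out

product = [{"model": 1001, "type": "pc"}, {"model": 2002, "type": "laptop"}]
pc = [{"model": 1001, "speed": 2.66}]
result = nested_loop_join(product, pc, lambda a, b: a["model"] == b["model"])
print(len(result))  # 1
```

The double loop makes the O(|R| × |S|) runtime explicit, while the memory footprint beyond the output stays constant.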
Decisions on Join Operator Algorithms

Algorithms:
 Nested-loop
 Merge-join
 Hash-join

Building Table: 10 GB data; Probing Table: 2 GB data

Which table is good to keep in the RAM?

Which join operator to choose?
Cardinality in Join Ordering with Join Operators
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey

AND r_name = 'ASIA' AND n_name = 'JAPAN';

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows


Decisions on Scan Operator Algorithms

Algorithms:
 Sequential Scan
 Index Scan
 Parallel Scan

Building Table: 10 GB data; Probing Table: 2 GB data

SELECT COUNT(*) FROM partsupp, supplier
WHERE ps_suppkey = s_suppkey AND ps_supplycost = 771;

Which scan operator to choose?

Cardinality in Join Ordering with Scan Operators
SELECT COUNT(*)

FROM partsupp, supplier, nation, part, region

WHERE p_partkey = ps_partkey

AND ps_suppkey = s_suppkey

AND s_nationkey = n_nationkey

AND n_regionkey = r_regionkey

AND r_name = 'ASIA' AND n_name = 'JAPAN';

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows


Complexity Estimation of Relational Algebra Operators

(Query plan tree with unknown operator costs at every node)

|T’| = 1 + 1 + 4 + 4 + 320 + 320 = 652 rows


Categorizing Relational Algebra Operators
Tuple-at-a-time (single input table):
 Selection operator
 Projection operator

Full-relation (single input table):


 GroupBy and OrderBy operator
 Distinct operator
 Aggregation operator
 Scan operator

Full-relation (multiple input tables):


 Set operator
 Join operator
Decisions by Query Optimizer

Inputs to the decisions:
1. Cost Function – accuracy
2. Search Space
3. Query Plan enumeration algorithms

Decisions:
1. Join Order selection
2. Query Plan Enumeration algorithm selection
3. Relational Algebra Operator selection
4. Operator Fusion, Memory Allocation, etc.

Search space dimensions:
1. Tree shapes (binary; left-deep, right-deep, bushy, zig-zag)
2. Ordering of join and other operators
3. New Relational Algebra Operators
4. Different Algorithms
5. Different Index-based Algorithms
6. Different Storage-based Algorithms (I/O)
7. Different Processor-based Algorithms
8. Parallel and Distributed Algorithms
Databases and Information Retrieval

Big Data:
 Hardware changes and their limits
 Data transfer

Other Optimizations:
 Correct and Efficient query writing
 Indexes
 Materialized Views
 Normalization

Requirements:
 Accuracy
 Speed

Workloads:
 Transactional
 Analytical
 Machine and Deep Learning
CSCI 341
Database Systems

Transactions
Transaction Processing Concepts
Department of Computer Science,
School of Engineering and Digital Sciences
Nazarbayev University
(Fall 2024)
What is a transaction?
A transaction is a sequence of one or more (SQL) operations
(read – retrieval; write – insert, update, or delete) treated as a unit.
§ Transactions appear to run in isolation
§ If the system fails, each transaction’s changes are
reflected either entirely or not at all
• A transaction (set of operations) may be stand-alone, specified
in a high-level language like SQL and submitted interactively, or
may be embedded within a program.
• Transaction boundaries: Begin and End transaction.
• An application program may contain several transactions
separated by the Begin and End transaction boundaries.
Transaction Processing
Motivated by two independent requirements

§Concurrent database access for efficient


performance
§Resilience to system failures, being able to
guarantee the consistency and recover the
database
Transactions
Concurrent Database Access

(Diagram: many applications concurrently submitting statements – Select, Update, Create Table, Drop Index, Delete, Help, etc. – to the DBMS, which manages the shared data.)
Transactions
Concurrency Goal

Execute sequences of SQL statements so that they appear to be running in isolation

Simple solution: execute them in isolation (serially)

But we want to enable concurrency whenever it is safe to do so

Why?
Motivation for Concurrent Executions
• Multiple transactions are allowed to run concurrently
in the system.
• Ensuring transaction isolation while permitting
concurrent execution is difficult but necessary for
performance reasons.
• increased processor and disk utilization, leading
to better transaction throughput (the ave. no. of
transactions completed in a given time): one
transaction can be using the CPU while another is
reading from or writing to the disk.
• reduced average response time for transactions
(average time taken to complete a transaction):
short transactions need not wait behind long ones.
How can things go wrong?
• Suppose we have a table of bank accounts which
contains the balance of the account. A deposit of
$50 to a particular account # 1234 would be written
as:
update Accounts
set balance = balance + $50
where account# = '1234';

• We will see that this entails a read and a write of


the account’s balance.
• What happens if two people deposit money onto the
same account at the same time?
Concurrent deposits

• This SQL update code is represented as a sequence
of read and write operations on “data items”:

Deposit 1 (T1):          Deposit 2 (T2):
read(X.bal)              read(X.bal)
X.bal := X.bal + $50     X.bal := X.bal + $100
write(X.bal)             write(X.bal)

• Here, X is the data item representing the account
with account# 1234.
A “bad” concurrent execution
• But only one “action” (e.g. a read or a write) can
happen at a time, and there are a variety of ways
in which the two deposits could be simultaneously
executed. BAD! (time runs downward):

T1: read(X.bal)
T2: read(X.bal)
T1: X.bal := X.bal + $50
T2: X.bal := X.bal + $100
T1: write(X.bal)
T2: write(X.bal)
A “good” execution
• The previous execution would have been fine if
the two accounts were different (i.e. one were
X and one were Y).
• The following execution is a serial execution,
and executes one transaction after the other. GOOD! (time runs downward):

T1 (Deposit 1): read(X.bal)
T1: X.bal := X.bal + $50
T1: write(X.bal)
T2 (Deposit 2): read(X.bal)
T2: X.bal := X.bal + $100
T2: write(X.bal)
Good executions
• An execution is “good” if it is serial (i.e. the
transactions are executed one after the other) or
serializable (i.e. equivalent to some serial execution):

Deposit 1: read(X.bal)
Deposit 3: read(Y.bal)
Deposit 1: X.bal := X.bal + $50
Deposit 3: Y.bal := Y.bal + $10
Deposit 1: write(X.bal)
Deposit 3: write(Y.bal)

• This execution is equivalent to executing Deposit 1
then Deposit 3, or vice versa.
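Whether an interleaving like the ones above is conflict-serializable can be tested with a precedence graph: add an edge T → T' whenever an operation of T conflicts (same item, at least one write) with a later operation of T', and check for cycles. A sketch in Python (the schedule encoding is an assumption for illustration):

```python
def conflict_serializable(schedule):
    """schedule is a list of (txn, op, item) with op in {'R', 'W'}.
    Build the precedence graph and test it for cycles; an acyclic
    graph means the schedule is conflict-serializable."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))  # t1's op conflicts with a later op of t2

    def reachable(src, dst, seen=None):
        seen = seen or set()
        for a, b in edges:
            if a == src and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                if reachable(b, dst, seen):
                    return True
        return False

    txns = {t for t, _, _ in schedule}
    return not any(reachable(t, t) for t in txns)  # no cycle through any txn

# The "bad" interleaving of the two deposits on account X:
bad = [("T1", "R", "X"), ("T2", "R", "X"),
       ("T1", "W", "X"), ("T2", "W", "X")]
# A serial execution of the same transactions:
good = [("T1", "R", "X"), ("T1", "W", "X"),
        ("T2", "R", "X"), ("T2", "W", "X")]
print(conflict_serializable(bad))   # False
print(conflict_serializable(good))  # True
```

The Deposit 1 / Deposit 3 schedule touches disjoint items, so its precedence graph has no edges at all, which is why it is equivalent to either serial order.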
Schedule 1
• Let T1 transfer $50 from A to B, and T2 transfer 10% of the
balance from A to B.
• A serial schedule in which T1 is followed by T2.

Schedule 2
• A serial schedule in which T2 is followed by T1.

Schedule 3
• Let T1 and T2 be the transactions defined previously.
The following schedule is not a serial schedule, but it is
equivalent to Schedule 1.

In Schedules 1, 2 and 3, the sum A + B is preserved.

Schedule 4
• The following concurrent schedule does not
preserve the value of (A + B).
Desirable Properties of Transactions

ACID Properties:
Atomicity
Consistency
Isolation
Durability

ACID properties should be enforced by the concurrency control


and recovery methods of the DBMS.
SIMPLE MODEL OF A DATABASE

• A database – a collection of named data items

• Granularity of data locking –
a field,
a record,
a whole disk block, or
a table

• The transaction concept is independent of granularity

• Basic operations are read and write.
SIMPLE MODEL OF A DATABASE
(for purposes of discussing transactions)
read_item(X) command includes the following steps:
1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not
already in some main memory buffer).
3. Copy item X from the buffer to the program variable named X.

write_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not
already in some main memory buffer).
3. Copy item X from the program variable named X into its correct location in
the buffer.
4. Store the updated block from the buffer back to disk (either immediately or
at some later point in time).
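These steps can be sketched as a toy buffer manager; the dict-based block layout and helper names below are invented purely for illustration:

```python
# Toy model: the "disk" is a dict of numbered blocks, each block a dict of items.
disk = {0: {'X': 100, 'Y': 50}}
buffers = {}                        # main-memory buffer pool: block_no -> block copy

def block_of(item):
    """Step 1 of both operations: find the disk block holding the item."""
    return next(bno for bno, blk in disk.items() if item in blk)

def read_item(item):
    bno = block_of(item)
    if bno not in buffers:          # step 2: bring the block into a buffer
        buffers[bno] = dict(disk[bno])
    return buffers[bno][item]       # step 3: copy item into the program variable

def write_item(item, value):
    bno = block_of(item)
    if bno not in buffers:          # steps 1-2, as for read_item
        buffers[bno] = dict(disk[bno])
    buffers[bno][item] = value      # step 3: program variable -> buffer
    disk[bno] = dict(buffers[bno])  # step 4: flush the block (here: immediately)

X = read_item('X')
write_item('X', X + 50)
print(read_item('X'))               # 150
```

A real DBMS would defer the flush in step 4 and let the recovery manager decide when dirty buffers reach disk.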
Why is Concurrency Control needed?

1. The temporary update (or dirty read) problem
2. The unrepeatable read problem
3. The lost update problem
4. The incorrect summary problem
Transactions
1. The temporary update (or dirty read) problem (WR conflicts)

    T1: R(A), W(A),                  R(B), W(B), Abort
    T2:             R(A), W(A), C

One source of anomalies is that T2 could read a DB object A that has been
modified by another transaction T1, which has not yet committed.
The value of A read by T2 is called dirty, because it was written by a
transaction T1 that has not committed yet; hence, this is known as the
dirty read problem.
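A minimal replay of this schedule (the variable names are mine, not the slides'); tracking the committed value separately lets T1's abort be undone, but T2 has already acted on data that never officially existed:

```python
def dirty_read_demo():
    """T1 writes A but later aborts; T2 reads the uncommitted value."""
    committed_a = 100
    a = committed_a
    a = a - 40                 # T1: W(A), not yet committed
    t2_saw = a                 # T2: R(A) -- dirty read of 60
    a = committed_a            # T1: Abort -> restore the committed value
    return t2_saw, a

print(dirty_read_demo())       # (60, 100): T2 read a value that was never committed
```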
Transactions

2. The unrepeatable read problem (RW conflicts)

    T1: R(A),                  R(A), W(A), C
    T2:       R(A), W(A), C

This problem may occur when a transaction T1 reads
an item twice and the item is changed by another
transaction T2 between the two reads.

Hence, T1 receives different values for its two reads of
the same item.
For example, the unrepeatable read problem may occur during an airline
reservation transaction: a customer inquires about seat availability on
several flights. When the customer decides on a particular flight, the
transaction reads the number of seats on that flight a second time before
completing the reservation, and T1 now sees a different number of seats
available.
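The same anomaly in miniature, assuming seat counts are plain integers (the function is my own illustration, not from the slides):

```python
def unrepeatable_read_demo(seats):
    """T1 reads the seat count twice; T2 books a seat and commits in between."""
    first = seats              # T1: R(A)
    seats = seats - 1          # T2: R(A), W(A), C -- another booking
    second = seats             # T1: R(A) again
    return first, second

print(unrepeatable_read_demo(10))   # (10, 9): T1's two reads disagree
```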
3. The lost update problem (WW conflicts)

    Deposit 1 (T1)            Deposit 2 (T2)
    read(X.bal)
                              read(X.bal)
    X.bal := X.bal + $50
                              X.bal := X.bal + $10
    write(X.bal)
                              write(X.bal)

The update that adds $50 is lost!


4. The incorrect summary problem

May occur when one transaction is calculating an
aggregate summary function over a number of database
items while other transactions are updating some of
these items.

For example, computing the average salary of all employees
while another transaction updates some of the salaries: the
aggregate may see some values before the update and others
after it, so the summary corresponds to no consistent state
of the database.
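A deterministic replay of the anomaly (my own illustration): the scan below sums salaries row by row while an interleaved update moves $100 between a row already scanned and one not yet scanned, so the total matches neither the old database state nor the new one:

```python
def incorrect_summary_demo(salaries):
    """T1 sums salaries one row at a time; T2 interleaves after row 0."""
    total = 0
    rows = list(salaries)
    for i, s in enumerate(rows):
        total += s
        if i == 0:                      # T2 interleaves here
            rows[2] = rows[2] + 100     # raise a not-yet-scanned salary
            rows[0] = rows[0] - 100     # cut an already-scanned one
    return total

print(incorrect_summary_demo([100, 100, 100]))  # 400: old sum and new sum are both 300
```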
What causes a Transaction to fail?
(why Transaction Recovery may be necessary?)

1. A computer failure (system crash): A hardware or software error


occurs in the computer system during transaction execution.
2. A transaction or system error: Some operation in the transaction may
cause it to fail, such as integer overflow or division by zero. Transaction
failure may also occur because of erroneous parameter values or
because of a logical programming error. In addition, the user may
interrupt the transaction during its execution.
3. Local errors or exception conditions detected by the transaction:
-certain conditions necessitate cancellation of the transaction. For
example, data for the transaction may not be found. A condition, such
as insufficient account balance in a banking database, may cause a
transaction, such as a fund withdrawal from that account, to be
canceled.
-a programmed abort in the transaction causes it to fail.
What causes a Transaction to fail?
(why Transaction/Data Recovery may be necessary)

4. Concurrency control enforcement: The concurrency control


method may decide to abort the transaction, to be
restarted later, because it violates serializability or because
several transactions are in a state of deadlock.
5. Disk failure: Some disk blocks may lose their data because
of a read or write malfunction or because of a disk
read/write head crash. This may happen during a read or a
write operation of the transaction.
6. Physical problems and catastrophes: This refers to an
endless list of problems that includes power or air-
conditioning failure, fire, theft, sabotage, overwriting disks
or tapes by mistake, and mounting of a wrong tape by the
operator.
Additional Operations In Transaction Management
For recovery purposes, the recovery manager keeps track of the following operations:
start_transaction: marks the beginning of transaction execution.
read or write: specify read or write operations on the database items executed as
a part of a transaction.
end_transaction: specifies that read and write transaction operations have ended
and marks the end limit of transaction execution.
commit_transaction: signals a successful end of the transaction so that any
changes (updates) executed by the transaction can be safely committed to the
database and will not be undone.
rollback (or abort): signals that the transaction has ended unsuccessfully, so that
any changes or effects that the transaction may have applied to the database must
be undone.
Recovery Techniques Use The Following Operations:
undo: Similar to rollback except that it applies to a single operation rather than to
a whole transaction.
redo: specifies that certain transaction operations must be redone to ensure that
all the operations of a committed transaction have been applied successfully to
the database.
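A minimal sketch of undo and redo over a log of before-images and after-images; the record layout `(txn, item, old, new)` is invented for illustration and is not a real DBMS log format:

```python
# The "database" and the log; each log record holds both images of a write.
db = {'A': 100}
log = []

def write(txn, item, new):
    log.append((txn, item, db[item], new))   # record before- and after-image
    db[item] = new

def undo(record):
    txn, item, old, new = record
    db[item] = old                            # restore the before-image

def redo(record):
    txn, item, old, new = record
    db[item] = new                            # reapply the after-image

write('T1', 'A', 150)
undo(log[-1])          # T1 aborted: A is back to 100
redo(log[-1])          # crash after T1 committed: reapply, A = 150
print(db['A'])         # 150
```

Undo reverses a single operation of an aborted transaction; redo reapplies an operation of a committed transaction whose effect may not have reached disk before a crash.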
