Normalization
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov
Lecture
Relational Database Design
Functional Dependency (FD)
Relational Database Design
Single Relation:
1. Attribute semantics (logical grouping to form base relation(s))
Relationship types
2. Data redundancy (redundant values, columns, and relations)
Only Foreign Keys
3. Null values (unknown, inapplicable, no information)
Multiple Relations:
1. Spurious tuples
Example Database
Large, unnecessary data retrieval
Inefficient data management - anomalies under the Update, Delete, and Insert commands
Consequences:
data inconsistency
data loss
poor DML and DQL performance
insert rejection or null values
Inefficient data access
Solutions:
1. Data Decomposition (lossless joins)
Non-additive (lossless) join property
Functional dependency preservation
2. Functional Dependency
3. Multivalued Dependency
Database Normalization
Normalization – decomposing a relation, guided by its keys and FDs, into relations that satisfy a normal form.
Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)
* X → Y is different from Y → X
Splitting Rule:
A1 → {A3, A4}
is equivalent to
A1 → A3
A1 → A4
Combining Rule:
A1 → A3
A1 → A4
is equivalent to
A1 → {A3, A4}
Inference Rules of FD
R - relation Product
{A1, A2, A3} - columns of Product, respectively
Closure Rule:
{A1}+ = {A1, A3, A4}
{A2}+ = {A2, A4}
{A1, A2}+ = {A1, A2, A3, A4, A5}
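The closure {X}+ can be computed with a simple fixed-point loop over the FDs. Below is a minimal Python sketch; the FD set it uses is an assumption chosen only to be consistent with the closures listed above (A1 → {A3, A4}, A2 → A4, {A1, A2} → A5).

def closure(attrs, fds):
    """attrs: iterable of attribute names; fds: list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:                        # repeat until no FD adds a new attribute
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Assumed FD set, consistent with the closures listed on the slide.
fds = [({"A1"}, {"A3", "A4"}),
       ({"A2"}, {"A4"}),
       ({"A1", "A2"}, {"A5"})]

print(sorted(closure({"A1"}, fds)))        # ['A1', 'A3', 'A4']
print(sorted(closure({"A2"}, fds)))        # ['A2', 'A4']
print(sorted(closure({"A1", "A2"}, fds)))  # ['A1', 'A2', 'A3', 'A4', 'A5']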
Inference Rules of FD (exercise)
R - relation Product
{A1, A2, A3} - columns of Product, respectively
Closure Rule:
{A1}+ = {??}
{A2}+ = {??}
{A1, A2}+ = {??}
Finding a Key for a Relation
R - relation Product
{A1, A2, A3} - columns of Product, respectively
A2 → {A3, A1}
A3 ↛ {A2, A1} (↛ means the FD does not hold)
A1 ↛ {A2, A3}
{A2, A3} → A1
{A2, A1} → A3
{A1, A3} ↛ A2
A1 → {A2, A3}
{A3, A4} → {A5,A6}
A2 → A6
{A1,A4}+ = {??}
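An attribute set X is a candidate key when {X}+ contains every attribute of the relation and no proper subset of X has that property. The minimal Python sketch below applies this test to the second exercise; the schema {A1, ..., A6} is an assumption inferred from the attributes its FDs mention, and the closure helper is repeated so the snippet runs on its own.

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

ATTRS = {"A1", "A2", "A3", "A4", "A5", "A6"}   # assumed schema for the exercise
FDS = [({"A1"}, {"A2", "A3"}),
       ({"A3", "A4"}, {"A5", "A6"}),
       ({"A2"}, {"A6"})]

def is_candidate_key(x):
    x = set(x)
    if not closure(x, FDS) >= ATTRS:                       # must be a superkey ...
        return False
    return all(closure(x - {a}, FDS) < ATTRS for a in x)   # ... and minimal

print(sorted(closure({"A1", "A4"}, FDS)))  # ['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
print(is_candidate_key({"A1", "A4"}))      # True: {A1, A4} is a candidate key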
Lecture
Normalization and Normal Forms
Multivalued Dependency (MVD)
1st Normal Form
R – relation Product
{A1,A2,A3,A4,A5,A6,A7,A8,A9,A10} - columns of Product, respectively
A1 → {A3,A4}
A2 → {}
A3 → {A4}
A4 → {}
A5 → {A2,A6,A7,A8,A9,A10}
A6 → {}
A7 → {}
A8 → {}
A9 → {}
A10 → {}
{A1,A5} → {A10} * Makers have only a single office in one city.
{A1,A5}+ = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10}
Integer (4 bytes)
Double (8 bytes)
String (~16 bytes)
2nd Normal Form
3rd Normal Form
1. Anomaly elimination
2. Lossless join recovery
3. FD preservation
Boyce-Codd Normal Form (BCNF)
A relation R is in BCNF if:
R is in 3NF, and
for every nontrivial FD X → Y that holds in R, X is a superkey of R.
(3NF is weaker: it also accepts X → Y when X is not a superkey, provided Y consists of prime attributes.)
Candidate Keys:
{theater, title}
{city, title}
Decompose:
R1 {theater, city} and R2 {city, title}
R1 {theater, title} and R2 {city, title}
R1 {theater, city} and R2 {theater, title}
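The FDs of this theater/city/title example are not listed above; a commonly used set that is consistent with the stated candidate keys is theater → city and {city, title} → theater. Under that assumption, the Python sketch below checks each FD against the BCNF and 3NF conditions.

# Hedged sketch: the FD set is assumed, chosen to match the candidate keys
# {theater, title} and {city, title} listed on the slide.
ATTRS = {"theater", "city", "title"}
FDS = [({"theater"}, {"city"}),            # assumed: a theater lies in one city
       ({"city", "title"}, {"theater"})]   # assumed: one theater per (city, title)
PRIME = {"theater", "city", "title"}       # union of the slide's candidate keys

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

for lhs, rhs in FDS:
    bcnf_ok = closure(lhs, FDS) >= ATTRS   # BCNF: every determinant is a superkey
    nf3_ok = bcnf_ok or rhs <= PRIME       # 3NF also accepts a prime-attribute RHS
    print(sorted(lhs), "->", sorted(rhs), "| BCNF:", bcnf_ok, "| 3NF:", nf3_ok)

# Under these assumed FDs, theater -> city satisfies 3NF but violates BCNF, and the
# textbook BCNF step on that FD produces R1 {theater, city} and R2 {theater, title}
# (losing the ability to check {city, title} -> theater within a single relation).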
Database Normalization
Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)
Suppose X ⊆ R, Y ⊆ R, and Z = R - X - Y.
A multivalued dependency X ↠ Y holds in R if the set of Y-values associated with a given X-value is independent of the Z-values; equivalently, whenever two tuples agree on X, swapping their Y components yields tuples that are also in R.
Candidate Keys:
{theater, city, title}
Decompose:
R1 {city, title} and R2 {city, theater}
* If R is in 4NF, then it is also in BCNF. However, a relation can be in BCNF but not in 4NF.
Nazarbayev University
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov
Lecture
Introduction to Query Optimization
Relational Database System Architecture Overview
Query Performance
Example query runtime: 0.179 ms
TPCH dataset table cardinalities:
1. Partsupp: 8,000
2. Part: 2,000
3. Supplier: 100
4. Nation: 25
5. Region: 5
SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN';
Result |T'| = 1 tuple
Cardinality: Two Tables
SELECT COUNT(*)
FROM nation, region
WHERE n_regionkey = r_regionkey
AND r_name = 'EUROPE';
Result |T'| = 5 tuples
Cardinality: Multiple Tables
SELECT COUNT(*)
FROM nation, supplier, region
WHERE n_nationkey = s_nationkey
AND n_regionkey = r_regionkey;
Result |T'| = 100 tuples
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows (only intermediate tables)
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 100 + 100 + 8,000 + 8,000 = 16,200 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Predicates
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Push-Down
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
n – number of tables
Primitive cost function that sums the cardinalities of base and intermediate tables.
|T'| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows
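As a small worked illustration of this primitive cost function, the sketch below sums the per-step cardinalities of the two join orders discussed above; the individual sizes are taken from the slides and treated as given estimates, not recomputed.

def plan_cost(cardinalities):
    """Primitive cost: sum of the listed base/intermediate table cardinalities."""
    return sum(cardinalities)

# Intermediate-result sizes from the two join orders worked out above.
join_order_a = [8_000, 8_000, 8_000, 8_000]
join_order_b = [100, 100, 8_000, 8_000]

print(plan_cost(join_order_a))   # 32000 rows
print(plan_cost(join_order_b))   # 16200 rows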
Plan tree shapes: bushy tree, left-deep tree, right-deep tree, zig-zag tree
Query Plan Search Space
We cannot execute every candidate query plan first and then report which one processes the query fastest; the plan must be chosen from estimated costs.
Query Plan Enumeration Approaches
2. Greedy approach
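As a hedged sketch (not necessarily the exact variant covered in the lecture), one common greedy heuristic repeatedly joins the pair of relations with the smallest estimated result; estimate below is a hypothetical estimator callback, e.g. built from the selection and join estimators introduced later.

from itertools import combinations

def greedy_join_order(cardinalities, estimate):
    """cardinalities: dict relation-name -> estimated size.
    estimate(a, b, sizes) -> estimated size of joining a and b.
    Returns the chosen join steps as (a, b, estimated_size) tuples."""
    sizes = dict(cardinalities)
    steps = []
    while len(sizes) > 1:
        # pick the cheapest pair according to the estimator
        a, b = min(combinations(sizes, 2),
                   key=lambda pair: estimate(pair[0], pair[1], sizes))
        joined_size = estimate(a, b, sizes)
        steps.append((a, b, joined_size))
        sizes[f"({a} JOIN {b})"] = joined_size   # treat the result as a new relation
        del sizes[a], sizes[b]
    return steps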
Query Plan Selection
[Figure: candidate query plan trees whose per-node cardinalities/costs are still unknown (?)]
n – number of tables
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM nation WHERE n_regionkey = 3;
* regionkey 3 is “EUROPE”
Estimate: |T'| ≈ |T| / D(T.a) = 25 / 5 = 5 tuples
True result: 5 tuples, D(n_regionkey) = 5
Legend:
T – input table, |T| – size of input table
T' – output table, |T'| – size of result table
D(T.a) – number of distinct values in column a of table T
Some databases periodically estimate D(T.a) based on randomly chosen sample data (e.g. 1%).
The estimate assumes the values of column a are uniformly distributed; in real-world data this assumption is often violated, so there are errors in the estimations.
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM supplier WHERE s_nationkey = 1;
* nationkey 1 is “ARGENTINA”
In real-world data the uniformity assumption is often violated, so there are errors in the estimations.
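A minimal Python sketch of this point-condition estimate, |T'| ≈ |T| / D(T.a), which relies on the uniform-distribution assumption discussed above. The value D(s_nationkey) = 25 used for the supplier query is itself an assumption (one distinct value per TPCH nation), since the slide does not state it.

def estimate_point_selection(table_size, distinct_values):
    """Estimated output size of  WHERE a = constant  under the uniformity assumption."""
    return table_size / distinct_values

print(estimate_point_selection(25, 5))    # nation, n_regionkey = 3   -> 5.0 tuples
print(estimate_point_selection(100, 25))  # supplier, s_nationkey = 1 -> 4.0 tuples
                                          # (D(s_nationkey) = 25 is assumed)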
Join Operator Estimation: Two-way join
SELECT COUNT(*) FROM nation, supplier
WHERE n_nationkey = s_nationkey;
Computer dataset: nation: 25, supplier: 500
True result: 500 tuples
D(n_nationkey) = 25, D(s_nationkey) = 25
Join Operator Estimators
Two-way join estimator, for a join of T1 and T2 on T1.a = T2.a:
|T'| ≈ (|T1| × |T2|) / max(D(T1.a), D(T2.a)) tuples
(T – input table, |T| – size of input table; T' – output table, |T'| – size of result table; D(T.a) – number of distinct values in column a of table T)
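The sketch below evaluates this two-way join estimator on the nation and supplier example above; with the given sizes and distinct counts it reproduces the stated true result of 500 tuples.

def estimate_two_way_join(size1, size2, d1, d2):
    """Estimated cardinality of T1 JOIN T2 ON T1.a = T2.a."""
    return (size1 * size2) / max(d1, d2)

# Computer dataset figures from the example: |nation| = 25, |supplier| = 500,
# D(n_nationkey) = D(s_nationkey) = 25.
print(estimate_two_way_join(25, 500, 25, 25))   # 500.0 tuples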
Runtime Complexity:
Memory Complexity:
(Example) Nested-loop Join algorithm
[Figure: page-by-page scan of the input tables]
Join algorithms:
Nested-loop join
Merge join
Hash join
[Figure: building table and probing table, 10 GB data and 2 GB data]
Scan algorithms:
Sequential Scan
Index Scan
Parallel Scan
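To complement the page-based illustration, here is a minimal tuple-at-a-time nested-loop join in Python; it is a simplification (real systems work page by page through buffers), and the sample rows are made up for the example.

def nested_loop_join(outer, inner, outer_key, inner_key):
    """For every row of the outer input, scan the whole inner input for matches."""
    result = []
    for o in outer:                  # outer scan
        for i in inner:              # full inner scan per outer row
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result

# Made-up TPCH-style rows, for illustration only.
supplier = [{"s_suppkey": 7, "s_nationkey": 1}]
nation = [{"n_nationkey": 1, "n_name": "ARGENTINA"}]
print(nested_loop_join(supplier, nation, "s_nationkey", "n_nationkey"))
# [{'s_suppkey': 7, 's_nationkey': 1, 'n_nationkey': 1, 'n_name': 'ARGENTINA'}]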
Workloads
Transactional
Analytical
Machine and Deep Learning
CSCI 341
Database Systems
Transactions
Transaction Processing Concepts
Department of Computer Science,
School of Engineering and Digital Sciences
Nazarbayev University
(Fall 2024)
What is a transaction?
A transaction is a sequence of one or more (SQL) operations (read - retrieval; write - insert, update, or delete) treated as a unit.
§ Transactions appear to run in isolation.
§ If the system fails, each transaction’s changes are reflected either entirely or not at all.
• A transaction (a set of operations) may be stand-alone, specified in a high-level language like SQL and submitted interactively, or it may be embedded within a program.
• Transaction boundaries: Begin and End transaction.
• An application program may contain several transactions separated by the Begin and End transaction boundaries.
Transaction Processing
Motivated by two independent requirements: concurrent database access and resilience to system failures.
[Figure: transactions submitted to the DBMS, which manages the data]
Concurrency Goal
Why?
Motivation for Concurrent Executions
• Multiple transactions are allowed to run concurrently in the system.
• Ensuring transaction isolation while permitting concurrent execution is difficult but necessary for performance reasons:
• increased processor and disk utilization, leading to better transaction throughput (the average number of transactions completed in a given time): one transaction can be using the CPU while another is reading from or writing to the disk.
• reduced average response time for transactions (the average time taken to complete a transaction): short transactions need not wait behind long ones.
How can things go wrong?
• Suppose we have a table of bank accounts which contains the balance of each account. A deposit of $50 to a particular account #1234 would be written as:
update Accounts
set balance = balance + $50
where account# = '1234';
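If two such deposits run concurrently without transaction isolation, their reads and writes can interleave so that one update overwrites the other. The following deliberately simplified Python illustration (the $200 starting balance and the in-memory account table are assumptions made for the example) shows the resulting lost update; the serial schedules on the next slides avoid it.

# Both deposits read the balance before either has written its update back.
balance = {"1234": 200}                    # hypothetical account table

read_by_deposit1 = balance["1234"]         # Deposit 1 reads 200
read_by_deposit2 = balance["1234"]         # Deposit 2 also reads 200

balance["1234"] = read_by_deposit1 + 50    # Deposit 1 writes 250
balance["1234"] = read_by_deposit2 + 100   # Deposit 2 overwrites it with 300

print(balance["1234"])   # 300: the $50 deposit is lost; a serial execution gives 350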
GOOD!
Deposit 1                          Deposit 2
read(X.bal)
X.bal := X.bal + $50
write(X.bal)
                                   read(X.bal)
                                   X.bal := X.bal + $100
                                   write(X.bal)
(time flows downward)
Good executions
• An execution is “good” if it is serial (i.e., the transactions are executed one after the other) or serializable (i.e., equivalent to some serial execution).
Deposit 1                          Deposit 3
read(X.bal)
                                   read(Y.bal)
X.bal := X.bal + $50
                                   Y.bal := Y.bal + $10
write(X.bal)
                                   write(Y.bal)
Schedule 2
• A serial schedule where T2 is followed by T1
Schedule 3
• Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
Desirable Properties of Transactions
ACID Properties:
Atomicity
Consistency
Isolation
Durability
One source of anomalies is that T2 could read a DB object A that has been modified by another transaction T1 which has not yet committed.
The value of object A read by T2 is dirty, because it was written by a transaction (T1) that has not yet committed; hence, this problem is also known as the dirty read problem.
Transactions