Normalization
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov
Lecture
Relational Database Design
Functional Dependency (FD)
Relational Database Design
Single Relation:
1. Attribute semantics (logical grouping to form base relation(s))
Relationship types
2. Data redundancy (redundant values, columns, and relations)
Only Foreign Keys
3. Null values (unknown, inapplicable, no information)
Multiple Relations:
1. Spurious tuples
Example Database
Large, unnecessary data retrieval
Inefficient data management - anomalies under the Update, Delete, and Insert commands
Consequences:
data inconsistency
data loss
poor DML and DQL performance
insert rejection or null values
Inefficient data access
Solutions:
1. Data Decomposition (lossless joins)
Non-additive (lossless) join property
Functional dependency preservation
2. Functional Dependency
3. Multivalued Dependency
Database Normalization
Normalization – decomposing a relation, guided by its keys and FDs, into relations that satisfy a normal form.
Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)
* X → Y is different from Y → X
Splitting Rule:
A1 → {A3, A4}
is equivalent to
A1 → A3
A1 → A4
Combining Rule:
A1 → A3
A1 → A4
is equivalent to
A1 → {A3, A4}
Inference Rules of FD
R - relation Product
{A1, A2, A3} - columns of Product, respectively
Closure Rule:
{A1}+ = {A1, A3, A4}
{A2}+ = {A2, A4}
{A1, A2}+ = {A1, A2, A3, A4, A5}
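The closure {X}+ can be computed with a simple fixed-point loop over the FDs. Below is a minimal Python sketch; the FD set it uses is an assumption chosen only to be consistent with the closures listed above (A1 → {A3, A4}, A2 → A4, {A1, A2} → A5).

def closure(attrs, fds):
    """attrs: iterable of attribute names; fds: list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:                        # repeat until no FD adds a new attribute
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Assumed FD set, consistent with the closures listed on the slide.
fds = [({"A1"}, {"A3", "A4"}),
       ({"A2"}, {"A4"}),
       ({"A1", "A2"}, {"A5"})]

print(sorted(closure({"A1"}, fds)))        # ['A1', 'A3', 'A4']
print(sorted(closure({"A2"}, fds)))        # ['A2', 'A4']
print(sorted(closure({"A1", "A2"}, fds)))  # ['A1', 'A2', 'A3', 'A4', 'A5']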
Inference Rules of FD (exercise)
R - relation Product
{A1, A2, A3} - columns of Product, respectively
Closure Rule:
{A1}+ = {??}
{A2}+ = {??}
{A1, A2}+ = {??}
Finding a Key for a Relation
R - relation Product
{A1, A2, A3} - columns of Product, respectively
A2 → {A3, A1}
A3 ↛ {A2, A1} (↛ means the FD does not hold)
A1 ↛ {A2, A3}
{A2, A3} → A1
{A2, A1} → A3
{A1, A3} ↛ A2
A1 → {A2, A3}
{A3, A4} → {A5,A6}
A2 → A6
{A1,A4}+ = {??}
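An attribute set X is a candidate key when {X}+ contains every attribute of the relation and no proper subset of X has that property. The minimal Python sketch below applies this test to the second exercise; the schema {A1, ..., A6} is an assumption inferred from the attributes its FDs mention, and the closure helper is repeated so the snippet runs on its own.

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

ATTRS = {"A1", "A2", "A3", "A4", "A5", "A6"}   # assumed schema for the exercise
FDS = [({"A1"}, {"A2", "A3"}),
       ({"A3", "A4"}, {"A5", "A6"}),
       ({"A2"}, {"A6"})]

def is_candidate_key(x):
    x = set(x)
    if not closure(x, FDS) >= ATTRS:                       # must be a superkey ...
        return False
    return all(closure(x - {a}, FDS) < ATTRS for a in x)   # ... and minimal

print(sorted(closure({"A1", "A4"}, FDS)))  # ['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
print(is_candidate_key({"A1", "A4"}))      # True: {A1, A4} is a candidate key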
Lecture
Normalization and Normal Forms
Multivalued Dependency (MVD)
1st Normal Form
R – relation Product
{A1,A2,A3,A4,A5,A6,A7,A8,A9,A10} - columns of Product, respectively
A1 → {A3,A4}
A2 → {}
A3 → {A4}
A4 → {}
A5 → {A2,A6,A7,A8,A9,A10}
A6 → {}
A7 → {}
A8 → {}
A9 → {}
A10 → {}
{A1,A5} → {A10} * Makers have only a single office in one city.
{A1,A5}+ = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10}
Integer (4 bytes)
Double (8 bytes)
String (~16 bytes)
2nd Normal Form
3rd Normal Form
1. Anomaly elimination
2. Lossless join recovery
3. FD preservation
Boyce-Codd Normal Form (BCNF)
A relation R is in BCNF if:
R is in 3NF, and
for every nontrivial FD X → Y that holds in R, X is a superkey of R.
(3NF is weaker: it also accepts X → Y when X is not a superkey, provided Y consists of prime attributes.)
Candidate Keys:
{theater, title}
{city, title}
Decompose:
R1 {theater, city} and R2 {city, title}
R1 {theater, title} and R2 {city, title}
R1 {theater, city} and R2 {theater, title}
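The FDs of this theater/city/title example are not listed above; a commonly used set that is consistent with the stated candidate keys is theater → city and {city, title} → theater. Under that assumption, the Python sketch below checks each FD against the BCNF and 3NF conditions.

# Hedged sketch: the FD set is assumed, chosen to match the candidate keys
# {theater, title} and {city, title} listed on the slide.
ATTRS = {"theater", "city", "title"}
FDS = [({"theater"}, {"city"}),            # assumed: a theater lies in one city
       ({"city", "title"}, {"theater"})]   # assumed: one theater per (city, title)
PRIME = {"theater", "city", "title"}       # union of the slide's candidate keys

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

for lhs, rhs in FDS:
    bcnf_ok = closure(lhs, FDS) >= ATTRS   # BCNF: every determinant is a superkey
    nf3_ok = bcnf_ok or rhs <= PRIME       # 3NF also accepts a prime-attribute RHS
    print(sorted(lhs), "->", sorted(rhs), "| BCNF:", bcnf_ok, "| 3NF:", nf3_ok)

# Under these assumed FDs, theater -> city satisfies 3NF but violates BCNF, and the
# textbook BCNF step on that FD produces R1 {theater, city} and R2 {theater, title}
# (losing the ability to check {city, title} -> theater within a single relation).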
Database Normalization
Normal Forms:
1st Normal Form (1NF)
2nd Normal Form (2NF)
3rd Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
4th Normal Form (4NF)
5th Normal Form (5NF)
Suppose X ⊆ R, Y ⊆ R, and Z = R - X - Y.
A multivalued dependency X ↠ Y holds in R if the set of Y-values associated with a given X-value is independent of the Z-values; equivalently, whenever two tuples agree on X, swapping their Y components yields tuples that are also in R.
Candidate Keys:
{theater, city, title}
Decompose:
R1 {city, title} and R2 {city, theater}
* If R is in 4NF, then it is also in BCNF. However, a relation can be in BCNF but not in 4NF.
Nazarbayev University
SEDS CS
CSCI 341, Fall 2024
Instructor: Yesdaulet Izenov
Lecture
Introduction to Query Optimization
Relational Database System Architecture Overview
Query Performance
Example query runtime: 0.179 ms
TPCH dataset table cardinalities:
1. Partsupp: 8,000
2. Part: 2,000
3. Supplier: 100
4. Nation: 25
5. Region: 5
SELECT COUNT(*)
FROM nation
WHERE n_name = 'JAPAN';
Result |T'| = 1 tuple
Cardinality: Two Tables
SELECT COUNT(*)
FROM nation, region
WHERE n_regionkey = r_regionkey
AND r_name = 'EUROPE';
Result |T'| = 5 tuples
Cardinality: Multiple Tables
SELECT COUNT(*)
FROM nation, supplier, region
WHERE n_nationkey = s_nationkey
AND n_regionkey = r_regionkey;
Result |T'| = 100 tuples
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows (only intermediate tables)
Cardinality: Join Ordering in Query Plan
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 100 + 100 + 8,000 + 8,000 = 16,200 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Predicates
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
|T’| = 100 + 100 + 8,000 + 8,000 + 2,160 + 320 = 18,680 rows (only intermediate tables)
Cardinality in Join Ordering with Selection Push-Down
SELECT COUNT(*)
Note: the order in which tables are listed in SQL does not matter unless a join order is explicitly forced on the DB server.
n – number of tables
Primitive cost function that sums the cardinalities of base and intermediate tables.
|T'| = 8,000 + 8,000 + 8,000 + 8,000 = 32,000 rows
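As a small worked illustration of this primitive cost function, the sketch below sums the per-step cardinalities of the two join orders discussed above; the individual sizes are taken from the slides and treated as given estimates, not recomputed.

def plan_cost(cardinalities):
    """Primitive cost: sum of the listed base/intermediate table cardinalities."""
    return sum(cardinalities)

# Intermediate-result sizes from the two join orders worked out above.
join_order_a = [8_000, 8_000, 8_000, 8_000]
join_order_b = [100, 100, 8_000, 8_000]

print(plan_cost(join_order_a))   # 32000 rows
print(plan_cost(join_order_b))   # 16200 rows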
Plan tree shapes: bushy tree, left-deep tree, right-deep tree, zig-zag tree
Query Plan Search Space
We cannot execute every candidate query plan first and then report which one processes the query fastest; the plan must be chosen from estimated costs.
Query Plan Enumeration Approaches
2. Greedy approach
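As a hedged sketch (not necessarily the exact variant covered in the lecture), one common greedy heuristic repeatedly joins the pair of relations with the smallest estimated result; estimate below is a hypothetical estimator callback, e.g. built from the selection and join estimators introduced later.

from itertools import combinations

def greedy_join_order(cardinalities, estimate):
    """cardinalities: dict relation-name -> estimated size.
    estimate(a, b, sizes) -> estimated size of joining a and b.
    Returns the chosen join steps as (a, b, estimated_size) tuples."""
    sizes = dict(cardinalities)
    steps = []
    while len(sizes) > 1:
        # pick the cheapest pair according to the estimator
        a, b = min(combinations(sizes, 2),
                   key=lambda pair: estimate(pair[0], pair[1], sizes))
        joined_size = estimate(a, b, sizes)
        steps.append((a, b, joined_size))
        sizes[f"({a} JOIN {b})"] = joined_size   # treat the result as a new relation
        del sizes[a], sizes[b]
    return steps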
Query Plan Selection
[Figure: candidate query plan trees whose per-node cardinalities/costs are still unknown (?)]
n – number of tables
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM nation WHERE n_regionkey = 3;
* regionkey 3 is “EUROPE”
Estimate: |T'| ≈ |T| / D(T.a) = 25 / 5 = 5 tuples
True result: 5 tuples, D(n_regionkey) = 5
Legend:
T – input table, |T| – size of input table
T' – output table, |T'| – size of result table
D(T.a) – number of distinct values in column a of table T
Some databases periodically estimate D(T.a) based on randomly chosen sample data (e.g. 1%).
The estimate assumes the values of column a are uniformly distributed; in real-world data this assumption is often violated, so there are errors in the estimations.
Selection Operator Estimation: Point condition (= sign)
SELECT COUNT(*) FROM supplier WHERE s_nationkey = 1;
* nationkey 1 is “ARGENTINA”
In real-world data the uniformity assumption is often violated, so there are errors in the estimations.
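A minimal Python sketch of this point-condition estimate, |T'| ≈ |T| / D(T.a), which relies on the uniform-distribution assumption discussed above. The value D(s_nationkey) = 25 used for the supplier query is itself an assumption (one distinct value per TPCH nation), since the slide does not state it.

def estimate_point_selection(table_size, distinct_values):
    """Estimated output size of  WHERE a = constant  under the uniformity assumption."""
    return table_size / distinct_values

print(estimate_point_selection(25, 5))    # nation, n_regionkey = 3   -> 5.0 tuples
print(estimate_point_selection(100, 25))  # supplier, s_nationkey = 1 -> 4.0 tuples
                                          # (D(s_nationkey) = 25 is assumed)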
Join Operator Estimation: Two-way join
SELECT COUNT(*) FROM nation, supplier
WHERE n_nationkey = s_nationkey;
Computer dataset: nation: 25, supplier: 500
True result: 500 tuples
D(n_nationkey) = 25, D(s_nationkey) = 25
Join Operator Estimators
Two-way join estimator, for a join of T1 and T2 on T1.a = T2.a:
|T'| ≈ (|T1| × |T2|) / max(D(T1.a), D(T2.a)) tuples
(T – input table, |T| – size of input table; T' – output table, |T'| – size of result table; D(T.a) – number of distinct values in column a of table T)
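The sketch below evaluates this two-way join estimator on the nation and supplier example above; with the given sizes and distinct counts it reproduces the stated true result of 500 tuples.

def estimate_two_way_join(size1, size2, d1, d2):
    """Estimated cardinality of T1 JOIN T2 ON T1.a = T2.a."""
    return (size1 * size2) / max(d1, d2)

# Computer dataset figures from the example: |nation| = 25, |supplier| = 500,
# D(n_nationkey) = D(s_nationkey) = 25.
print(estimate_two_way_join(25, 500, 25, 25))   # 500.0 tuples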
Runtime Complexity:
Memory Complexity:
(Example) Nested-loop Join algorithm
[Figure: page-by-page scan of the input tables]
Join algorithms:
Nested-loop join
Merge join
Hash join
[Figure: building table and probing table, 10 GB data and 2 GB data]
Scan algorithms:
Sequential Scan
Index Scan
Parallel Scan
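To complement the page-based illustration, here is a minimal tuple-at-a-time nested-loop join in Python; it is a simplification (real systems work page by page through buffers), and the sample rows are made up for the example.

def nested_loop_join(outer, inner, outer_key, inner_key):
    """For every row of the outer input, scan the whole inner input for matches."""
    result = []
    for o in outer:                  # outer scan
        for i in inner:              # full inner scan per outer row
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result

# Made-up TPCH-style rows, for illustration only.
supplier = [{"s_suppkey": 7, "s_nationkey": 1}]
nation = [{"n_nationkey": 1, "n_name": "ARGENTINA"}]
print(nested_loop_join(supplier, nation, "s_nationkey", "n_nationkey"))
# [{'s_suppkey': 7, 's_nationkey': 1, 'n_nationkey': 1, 'n_name': 'ARGENTINA'}]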
Workloads
Transactional
Analytical
Machine and Deep Learning
CSCI 341
Database Systems
Transactions
Transaction Processing Concepts
Department of Computer Science,
School of Engineering and Digital Sciences
Nazarbayev University
(Fall 2024)
What is a transaction?
A transaction is a sequence of one or more (SQL) operations (read - retrieval; write - insert, update, or delete) treated as a unit.
§ Transactions appear to run in isolation.
§ If the system fails, each transaction’s changes are reflected either entirely or not at all.
• A transaction (a set of operations) may be stand-alone, specified in a high-level language like SQL and submitted interactively, or it may be embedded within a program.
• Transaction boundaries: Begin and End transaction.
• An application program may contain several transactions separated by the Begin and End transaction boundaries.
Transaction Processing
Motivated by two independent requirements: concurrent database access and resilience to system failures.
[Figure: transactions submitted to the DBMS, which manages the data]
Concurrency Goal
Why?
Motivation for Concurrent Executions
• Multiple transactions are allowed to run concurrently in the system.
• Ensuring transaction isolation while permitting concurrent execution is difficult but necessary for performance reasons:
• increased processor and disk utilization, leading to better transaction throughput (the average number of transactions completed in a given time): one transaction can be using the CPU while another is reading from or writing to the disk.
• reduced average response time for transactions (the average time taken to complete a transaction): short transactions need not wait behind long ones.
How can things go wrong?
• Suppose we have a table of bank accounts which contains the balance of each account. A deposit of $50 to a particular account #1234 would be written as:
update Accounts
set balance = balance + $50
where account# = '1234';
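If two such deposits run concurrently without transaction isolation, their reads and writes can interleave so that one update overwrites the other. The following deliberately simplified Python illustration (the $200 starting balance and the in-memory account table are assumptions made for the example) shows the resulting lost update; the serial schedules on the next slides avoid it.

# Both deposits read the balance before either has written its update back.
balance = {"1234": 200}                    # hypothetical account table

read_by_deposit1 = balance["1234"]         # Deposit 1 reads 200
read_by_deposit2 = balance["1234"]         # Deposit 2 also reads 200

balance["1234"] = read_by_deposit1 + 50    # Deposit 1 writes 250
balance["1234"] = read_by_deposit2 + 100   # Deposit 2 overwrites it with 300

print(balance["1234"])   # 300: the $50 deposit is lost; a serial execution gives 350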
GOOD!
Deposit 1                          Deposit 2
read(X.bal)
X.bal := X.bal + $50
write(X.bal)
                                   read(X.bal)
                                   X.bal := X.bal + $100
                                   write(X.bal)
(time flows downward)
Good executions
• An execution is “good” if it is serial (i.e., the transactions are executed one after the other) or serializable (i.e., equivalent to some serial execution).
Deposit 1                          Deposit 3
read(X.bal)
                                   read(Y.bal)
X.bal := X.bal + $50
                                   Y.bal := Y.bal + $10
write(X.bal)
                                   write(Y.bal)
Schedule 2
• A serial schedule where T2 is followed by T1
Schedule 3
• Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
Desirable Properties of Transactions
ACID Properties:
Atomicity
Consistency
Isolation
Durability
One source of anomalies is that T2 could read a DB object A that has been modified by another transaction T1 which has not yet committed.
The value of object A read by T2 is dirty, because it was written by a transaction (T1) that has not yet committed; hence, this problem is also known as the dirty read problem.
Transactions