1904001-DBMS Notes 5 Units
Person Relation
PID Industry
101 Infosys
102 TCS
Advantages:
The relational model allows the designer to focus on logical design rather than
physical design, so relational models are conceptually simple to understand
Required information can be retrieved easily using a simple query language
Relational databases are easy to maintain and use
Disadvantages
It requires powerful hardware and large data storage devices
It may lead to slower processing times
A poorly designed schema leads to a poor database implementation
Database System Architecture
A database system is partitioned into modules that deal with each of the responsibilities
of the overall system. The database system architecture is shown in the figure.
Database system architecture is broadly classified into two modules:
1. Storage Manager
2. Query Processor
Storage Manager
The storage manager is very important because a database typically requires a large
amount of storage space.
The storage manager is a program module that provides the interface between the
low-level data stored in the database and the application programs and queries
submitted to the system.
The storage manager is responsible for interacting with the file manager. It translates
the various DML (Data Manipulation Language) statements into low-level file system
commands. Thus the storage manager is responsible for storing, retrieving, and
updating data in the database.
The storage manager components include-
1. Authorization and integrity manager: which tests for the satisfaction of
integrity constraints and checks the authority of users to access the data
2. Transaction Manager: ensures that the database remains in a consistent
state despite system failures, and that concurrent transactions execute
without conflicting.
3. File Manager: File manager manages the allocation of space on disk storage
and data structures used to represent information stored on the disk.
4. Buffer Manager: responsible for fetching data from disk storage into main
memory and deciding what data to cache in main memory. The buffer
manager is a critical part of the system, since it enables the database to
handle data sizes that are much larger than the size of main memory.
Storage Manager implements several data structures such as
Data files: store the database itself
Data Dictionary: stores the metadata about the structure of the database
Indices: used to provide fast access to data items
Introduction to relational database -Relational model and keys
RELATIONAL MODEL
A relational database consists of a collection of tables, each of which is assigned a
unique name. A table is an entity set, and a row is an entity.
For each attribute there is a set of permitted values, called the domain of the attribute.
For example, in the account table shown in the figure, the domain of the attribute
bname is the set of all branch names.
Let D1 denote the set of all account numbers, D2 the set of all branch names, and D3
the set of all balances. Then the relation account will be a subset of D1 × D2 × D3.
Account
Accno bname Bal
A101 Tambaram 10000
D102 T. Nagar 50000
E101 MM Nagar 10000
A108 Velachery 20000
A120 Anna Nagar 40000
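As an illustration, the account relation could be declared in SQL as follows; this is
only a sketch, with column types assumed from the sample data rather than given in
the notes.
CREATE TABLE account (
    accno CHAR(4) PRIMARY KEY,   -- values drawn from domain D1
    bname VARCHAR(20),           -- values drawn from domain D2
    bal   NUMBER(10,2)           -- values drawn from domain D3
);
INSERT INTO account VALUES ('A101', 'Tambaram', 10000);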
Relational Algebra
Relational Algebra is a procedural language in which user instructs the system to
perform a sequence of operations on the database to compute the desired result. It
consists of a set of operations that take one or two relations as input and produce a
new relation as their result.
Select, project, and rename are called unary operations, because they operate on
only one relation. Union, set difference, and Cartesian product operate on pairs of
relations and are therefore called binary operations.
Select
Select operation selects tuple that satisfy a given predicate
loan
Lno bname Amount
102 Tambaram 12000
120 Tambaram 14000
100 MM Nagar 60000
400 Velachery 20000
500 Anna Nagar 40000
Example 1: “To select those tuples of the loan relation where the branch is Tambaram”
σ is used to denote selection; the predicate appears as a subscript to σ:
σ bname = "Tambaram" (loan)
Output:
Lno bname Amount
102 Tambaram 12000
120 Tambaram 14000
Example 2: “To select those tuples of the loan relation where a loan of more than
12000 is made by the Tambaram branch”
σ bname = "Tambaram" ∧ amount > 12000 (loan)
Output:
Lno bname Amount
120 Tambaram 14000
Project
The project operation is a unary operation that returns its argument relation with
certain attributes left out.
Example 1: “To list all lno’s and amounts of the loan relation” is
π lno, amount (loan)
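For comparison, the two relational-algebra examples above can be written in SQL as
shown below (a sketch against the loan table):
-- Selection: tuples of loan where the branch is Tambaram
SELECT * FROM loan WHERE bname = 'Tambaram';
-- Projection: only the lno and amount columns (DISTINCT mirrors set semantics)
SELECT DISTINCT lno, amount FROM loan;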
Entity-Relationship model and ER Diagrams
The entity-relationship model provides a conceptual view of a database. It represents
the overall structure of a database. The ER model is very useful in mapping the
meanings and interactions of real-world entities onto a conceptual schema.
ER Diagrams
Note: - Entity is shown by rectangular box and attributes are shown in oval.
The primary keys are underlined.
An entity set is a collection of related entities of the same type. For example,
an entity set student contains student entities. All entities in an entity set share
the same set of attributes, which distinguish the set from other entity sets. As
shown in the figure, the student entity set has attributes such as SRegno, Sname,
and address, where SRegno is the primary key used to identify the entities in the
student entity set.
Types of Attributes
Simple and Composite Attributes: Attributes that have not been divided into
subparts are called simple attributes; for example, SRegno in the student entity set.
Attributes that have been divided into subparts are called composite attributes.
Single-valued and Multivalued Attributes: An attribute having a single value for a
particular entity is called single-valued; for example, SRegno in the student
entity set. An attribute having a set of values for a particular entity is called
multivalued; for example, a student can have more than one email
address.
Derived Attribute: The value of this type of attribute can be derived from the
values of other related attributes. For example, employee experience can be
calculated from the date of joining the particular company. As another
example, a student's age can be derived from the date of birth.
Enhanced ER Model
Extended ER features
Specialization
The process of designating subgroupings within an entity set is called
specialization. Specialization is represented by a triangle labelled "ISA", which
stands for "is a". It is a top-down approach. For example, as shown in the figure, a
Person entity is either an employee or a customer. The ISA relationship is also
referred to as a superclass-subclass relationship.
The entity set employee would have all the attributes of Person and an additional
attribute Salary. Similarly, an entity set customer would have credit-rating as
additional attribute.
Generalization
It is a simple inversion of specialization. There are some similarities between the
customer entity set and the employee entity set: they have several attributes that are
the same across the two entity sets. Synthesizing these commonalities into a single
higher-level entity is known as generalization. It is a bottom-up approach.
Aggregation
Aggregation is a feature of the entity-relationship model that allows a relationship set
to participate in another relationship set. It is an abstraction through which
relationships are treated as higher-level entities.
For example, we treat the relationship set writes and the entity sets Author and
Book as a higher-level entity set called writes, indicated by a dashed box around the
aggregation, as shown in the figure.
ER-to-Relational Mapping
Consider the relation schema given below. Design and draw an ER diagram that
captures the information of this schema.
Employee(empno,name,office,age)
Books(isbn,title,authors,publisher)
Loan(empno,isbn,date)
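One possible SQL declaration of this schema is sketched below; the column types,
and the choice of (empno, isbn, loan date) as Loan's key, are assumptions rather than
part of the exercise.
CREATE TABLE Employee (
    empno  NUMBER(6) PRIMARY KEY,
    name   VARCHAR(30),
    office VARCHAR(20),
    age    NUMBER(3)
);
CREATE TABLE Books (
    isbn      CHAR(13) PRIMARY KEY,
    title     VARCHAR(100),
    authors   VARCHAR(100),
    publisher VARCHAR(50)
);
CREATE TABLE Loan (
    empno     NUMBER(6) REFERENCES Employee(empno),
    isbn      CHAR(13)  REFERENCES Books(isbn),
    loan_date DATE,   -- the schema's "date"; renamed because DATE is a reserved word
    PRIMARY KEY (empno, isbn, loan_date)
);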
Construct an ER diagram for a car insurance company whose customers own one or
more cars each. Each car has associated with it zero to any number of recorded
accidents. Each insurance policy covers one or more cars and has one or more
premium payments associated with it. Each payment is for a particular period of
time and has an associated due date and the date when the payment was received.
UNIT 2 SQL FUNDAMENTALS
Relational Algebra
Relational algebra is a procedural query language, which takes instances of relations
as input and yields instances of relations as output. It uses operators to perform
queries. An operator can be either unary or binary. They accept relations as their
input and yield relations as their output. Relational algebra is performed recursively
on a relation and intermediate results are also considered relations.
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss all these operations in the following sections.
Select Operation (σ)
It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a
propositional logic formula which may use connectives like and, or, and not. These
terms may use relational operators like =, ≠, ≥, <, >, ≤.
For example −
σ subject = "database" (Books)
Output − Selects tuples from Books where subject is 'database'.
σ subject = "database" and price = "450" (Books)
Output − Selects tuples from Books where subject is 'database' and price is 450.
σ subject = "database" and price = "450" or year > "2010" (Books)
Output − Selects tuples from Books where subject is 'database' and price is 450, or
those books published after 2010.
Project Operation (∏)
It projects column(s) that satisfy a given predicate.
Notation − ∏ A1, A2, ..., An (r)
Where A1, A2, ..., An are attribute names of relation r.
Duplicate rows are automatically eliminated, as relation is a set.
For example −
∏ subject, author (Books)
Output − Selects and projects the columns named subject and author from the
relation Books.
Union Operation (∪)
It performs binary union between two given relations and is defined as −
r ∪ s = { t | t ∈ r or t ∈ s}
Notation − r U s
Where r and s are either database relations or relation result set (temporary relation).
For a union operation to be valid, the following conditions must hold −
r, and s must have the same number of attributes.
Attribute domains must be compatible.
Duplicate tuples are automatically eliminated.
∏ author (Books) ∪ ∏ author (Articles)
Output − Projects the names of the authors who have written either a book or an
article or both.
Set Difference (−)
The result of set difference operation is tuples, which are present in one relation but
are not in the second relation.
Notation − r − s
Finds all the tuples that are present in r but not in s.
∏ author (Books) − ∏ author (Articles)
Output − Provides the names of authors who have written books but not articles.
SQL fundamentals
SQL is a standard query language for requesting information from relational
databases. The original version, called SEQUEL (Structured English Query Language),
was designed at IBM in 1975, and in 1979 Oracle Corporation introduced
SQL as a commercial database system.
SQL language has several components
Data Definition Language (DDL): the DDL commands allow us to define a
relational schema and to delete or modify it.
Data Manipulation Language (DML): provides constructs for querying
information from the relational database and for inserting, deleting, and
modifying tuples in the database.
Data Control Language (DCL): controls user access to database objects.
Transaction Control Language (TCL): provides commands for specifying the
beginning and end of transactions.
Data Types
The type of information to be stored in a relation is defined by the datatypes of the
attributes during the table creation itself.
SQL supports following data types:
char(n): n specifies the size of the variable. If the size is 10, the value occupies
exactly 10 characters. The maximum size is 255; the default size is 1.
varchar(n): column values must be less than or equal to the size n.
number(n): n specifies the number of digits.
number(n,d): the number consists of n digits, of which d digits are to the right of
the decimal point. For example, number(5,2) can store 300.50 exactly.
date: used to store fixed-length date values. The date function SYSDATE returns
the current date and time.
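A small sketch tying these types together; the student table and its columns are
illustrative only.
CREATE TABLE student (
    regno CHAR(10),      -- exactly 10 characters
    sname VARCHAR(30),   -- at most 30 characters
    marks NUMBER(3),     -- up to 3 digits
    fees  NUMBER(7,2),   -- e.g. 30050.50
    doj   DATE           -- SYSDATE can supply the current date/time
);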
SQL COMMANDS
SQL commands are used to perform CRUD operations, that is, to create, read,
update, and delete the values stored in the relational database.
DDL COMMANDS
DDL commands are used to create the overall schema of a table; they are also used
to add a new column to a table and to modify an existing column.
Command Description
CREATE Used to create tables, views, indexes, synonyms, and other objects in the database
ALTER Used to modify an existing database object
DROP Used to delete a table
RENAME Used to rename an existing table
TRUNCATE Used to delete the contents of an existing table
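One example statement per DDL command, as an Oracle-style sketch (the dept table
is hypothetical):
CREATE TABLE dept (deptno NUMBER(2) PRIMARY KEY, dname VARCHAR(20));
ALTER TABLE dept ADD (location VARCHAR(20));   -- add a new column
RENAME dept TO department;
TRUNCATE TABLE department;   -- removes all rows, keeps the structure
DROP TABLE department;       -- removes the table itself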
DML COMMANDS
DML commands are used to query and manipulate existing database objects
Command Description
SELECT Used to retrieve information from one or more tables based on a given condition
INSERT Used to insert values into a table
UPDATE Used to modify records
DELETE Used to delete records
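And one example per DML command, again using the hypothetical student table
sketched earlier:
INSERT INTO student (regno, sname) VALUES ('2021CS001', 'Anita');
SELECT regno, sname FROM student WHERE sname = 'Anita';
UPDATE student SET sname = 'Anitha' WHERE regno = '2021CS001';
DELETE FROM student WHERE regno = '2021CS001';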
Data Control Language(DCL)
Command Description
GRANT Used to give privileges to user for
access/modification
REVOKE Used to withdraw granted privileges from
the user
Transaction Control Language(TCL)
Command Description
COMMIT Used to save changes made permanently
to the database
ROLLBACK Used to undo the last operation
SAVEPOINT Used to temporarily save a transaction so
that we can roll back to that point
whenever required
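The TCL commands fit together as in this sketch (same hypothetical student table):
INSERT INTO student (regno, sname) VALUES ('2021CS002', 'Ravi');
SAVEPOINT before_update;
UPDATE student SET sname = 'Ravi K' WHERE regno = '2021CS002';
ROLLBACK TO before_update;   -- undoes only the UPDATE
COMMIT;                      -- makes the INSERT permanent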
Advanced SQL features-Triggers
Triggers are SQL code blocks that are automatically executed in response to certain
events on a particular table. They are used to maintain the integrity of the data. A
trigger in SQL works like a real-world trigger: when a gun trigger is pulled, a bullet
is fired. How is this related to triggers in SQL? To understand, let's consider a
hypothetical situation.
John is the marketing officer in a company. When a new customer's data is entered
into the company's database, he has to send a welcome message to each new customer.
For one or two customers John can do it manually, but what if the count is more than
a thousand? In such scenarios triggers come in handy.
Thus John can easily create a trigger which will automatically send a welcome
email to new customers once their data is entered into the database.
Always remember that there cannot be two triggers with the same action time and
event for one table. For example, we cannot have two BEFORE UPDATE triggers for a
table, but we can have a BEFORE UPDATE and a BEFORE INSERT trigger, or
a BEFORE UPDATE and an AFTER UPDATE trigger.
Before we dive further into the fundamentals of triggers, I would suggest
understanding the concepts of SQL basics and normalization so that you get a better
grip on triggers in SQL.
Syntax and Example
Let's now look at the syntax of a trigger.
Create Trigger Trigger_Name
(Before | After) [Insert | Update | Delete]
on [Table_Name]
[for each row | for each column]
[trigger_body]
Now let me break down this syntax and explain each and every part in detail.
Create Trigger
These two keywords are used to specify that a trigger block is going to be
declared.
Trigger_Name
It specifies the name of the trigger. The trigger name has to be unique.
( Before | After )
This specifies when the trigger will be executed. It tells us the time at which the
trigger is initiated, i.e, either before the ongoing event or after.
Before Triggers are used to update or validate record values before they’re saved
to the database.
After Triggers are used to access field values that are set by the system and to
effect changes in other records. The records that activate the after trigger are
read-only. We cannot use After trigger if we want to update a record because it
will lead to read-only error.
[ Insert | Update | Delete ]
These are the DML operations and we can use either of them in a given trigger.
on [ Table_Name ]
We need to mention the table name on which the trigger is being applied. Don’t
forget to use on keyword and also make sure the selected table is present in the
database.
1. Row-level trigger gets executed before or after any column value of a
row changes
2. Column Level Trigger gets executed before or after the specified
column changes
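Putting the pieces together, here is a minimal sketch of John's welcome-mail trigger
in MySQL-flavoured SQL; the customers table, its email column, and the
welcome_log table are all assumptions made for illustration:
CREATE TABLE welcome_log (email VARCHAR(50), queued_on DATE);

CREATE TRIGGER trg_welcome_mail
AFTER INSERT ON customers
FOR EACH ROW
  INSERT INTO welcome_log VALUES (NEW.email, CURDATE());
Each time a row is inserted into customers, this row-level trigger fires once and
records the new customer's email for the mailing job.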
Nested Queries
In nested queries, a query is written inside another query; the result of the inner
query is used in the execution of the outer query. We will use the STUDENT, COURSE,
and STUDENT_COURSE tables for understanding nested queries, as shown in the
example after the tables below.
STUDENT
COURSE
C_ID C_NAME
C1 DSA
C2 Programming
C3 DBMS
STUDENT_COURSE
S_ID C_ID
S1 C1
S1 C3
S2 C1
S3 C2
S4 C2
S4 C3
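A sketch of a nested query against these tables: the inner query finds the C_ID of the
DBMS course, and the outer query then lists the students enrolled in it.
SELECT s_id
FROM   student_course
WHERE  c_id IN (SELECT c_id
                FROM   course
                WHERE  c_name = 'DBMS');
On the data above this returns S1 and S4.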
The join examples below use Customers and Orders tables; a fragment of the Orders
table:
OrderID CustomerID OrderDate
10308 2 1996-09-18
10309 37 1996-09-19
10310 77 1996-09-20
Notice that the "CustomerID" column in the "Orders" table refers to the
"CustomerID" in the "Customers" table. The relationship between the two tables
above is the "CustomerID" column.
Then, we can create the following SQL statement (that contains an INNER JOIN),
that selects records that have matching values in both tables:
Example
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;
This returns one row for every order whose CustomerID has a matching customer,
combining the selected columns from both tables.
left_join()
return all rows from x, and all columns from x and y. Rows in x with no match
in y will have NA values in the new columns. If there are multiple matches
between x and y, all combinations of the matches are returned.
right_join()
return all rows from y, and all columns from x and y. Rows in y with no match
in x will have NA values in the new columns. If there are multiple matches
between x and y, all combinations of the matches are returned.
full_join()
return all rows and all columns from both x and y. Where there are not matching
values, returns NA for the one missing.
Filtering joins keep cases from the left-hand data.frame:
semi_join()
return all rows from x where there are matching values in y, keeping just columns
from x.
A semi join differs from an inner join because an inner join will return one row
of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join()
return all rows from x where there are not matching values in y, keeping just columns
from x.
Grouping
Groups are ignored for the purpose of joining, but the result preserves the grouping
of x.
Examples
# "Mutating" joins combine variables from the LHS and RHS
band_members %>% inner_join(band_instruments)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)

# "Filtering" joins keep cases from the LHS
band_members %>% semi_join(band_instruments)
band_members %>% anti_join(band_instruments)

# To suppress the join-variable message, supply `by` explicitly
# (good practice in production code)
band_members %>% inner_join(band_instruments, by = "name")

# Use a named `by` if the join variables have different names
band_members %>% full_join(band_instruments2, by = c("name" = "artist"))
# Note that only the key from the LHS is kept
Join-Functions
Join Two Tbls Together
These are generic functions that dispatch to individual tbl methods - see the method
documentation for details of individual data sources. x and y should usually be from
the same data source, but if copy is TRUE, y will automatically be copied to the same
source as x.
Usage
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
semi_join(x, y, by = NULL, copy = FALSE, ...)
anti_join(x, y, by = NULL, copy = FALSE, ...)
Arguments
x, y
tbls to join
by
a character vector of variables to join by. If NULL, the default, *_join() will do a
natural join, using all variables with common names across the two tables. A message
lists the variables so that you can check they're right (to suppress the message, simply
explicitly list the variables that you want to join).
To join by different variables on x and y use a named vector. For example, by = c("a"
= "b") will match x.a to y.b.
copy
If x and y are not from the same data source, and copy is TRUE, then y will be copied
into the same src as x. This allows you to join tables across srcs, but it is a potentially
expensive operation so you must opt into it.
suffix
If there are non-joined duplicate variables in x and y, these suffixes will be added to
the output to disambiguate them. Should be a character vector of length 2.
...
other parameters passed onto methods, for instance, na_matches to control
how NA values are matched. See join.tbl_df for more.
CARTESIAN JOIN: The CARTESIAN JOIN is also known as CROSS JOIN. In a
CARTESIAN JOIN each row of one table is joined to every row of another table.
This usually happens when the matching column or WHERE condition is not
specified.
In the absence of a WHERE condition the CARTESIAN JOIN behaves like a
CARTESIAN PRODUCT, i.e., the number of rows in the result set is the
product of the numbers of rows of the two tables.
In the presence of a WHERE condition this JOIN functions like an INNER
JOIN.
Generally speaking, a cross join is similar to an inner join whose
join condition always evaluates to true.
Syntax:
SELECT table1.column1 , table1.column2, table2.column1...
FROM table1
CROSS JOIN table2;
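A worked sketch with two assumed tables: if sizes has 2 rows and colours has 3 rows,
the cross join below returns 2 × 3 = 6 rows, one per combination.
SELECT sizes.size_name, colours.colour_name
FROM sizes
CROSS JOIN colours;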
UNIT 3 NORMALIZATION
Functional Dependencies
A functional dependency is a constraint between two sets of attributes in a relation;
it describes the relationship between attributes. If A and B are attributes of relation R,
then B is functionally dependent on A, written
A → B,
if each value of A is associated with exactly one value of B, where A and B may each
consist of one or more attributes.
First Normal Form (1NF)
Two employees (John and Lessa) have two mobile numbers each, so the
company stored them in the same field.
To make the table comply with 1NF, we should store the data like this:
Empid Empname Empaddress Empmobile
101 Henna Delhi 8912312390
102 John Mumbai 8812121212
102 John Mumbai 9900012222
103 Ruth Chennai 7778881212
104 Lessa Bangalore 9990000123
104 Lessa Bangalore 8123450987
emp_dept table:
Empid Empdept
1001 Production and planning
1001 Stores
Functional dependencies:
emp_id → emp_nationality
emp_dept → {dept_type, dept_no_of_emp}
Candidate keys:
For the first table: emp_id
For the second table: emp_dept
For the third table: {emp_id, emp_dept}
This is now in BCNF, since in both functional dependencies the left-hand side is a
key.
Multi-valued Dependencies and Fourth Normal Form
Fourth Normal Form (4NF)
• Fourth normal form eliminates independent multivalued (many-to-many)
relationships between columns.
• To be in Fourth Normal Form,
– a relation must first be in Boyce-Codd Normal Form, and
– it may not contain more than one independent multi-valued
attribute.
Example 1 (Not in 4NF)
Scheme {Manager, Child, Employee}
1. Primary Key {Manager, Child, Employee}
2. Each manager can have more than one child
3. Each manager can supervise more than one employee
4. 4NF Violated
Manager Child Employee
Godson David Jessy
John Beulah Mary
John NULL Ganesh
Manager Child
Godson David
John Beulah
Manager Employee
Godson Jessy
John Mary
John Ganesh
Example 2:
Fifth Normal Form (5NF)
Fifth normal form is satisfied when all tables are broken into as many tables as
possible in order to avoid redundancy. Once a relation is in fifth normal form, it
cannot be decomposed into smaller relations without changing the facts or the
meaning.
BCNF: A relational schema R is considered to be in Boyce–Codd normal
form (BCNF) if and only if the relation is in 3NF and, for every one of its
dependencies X → Y, one of the following conditions holds true:
X → Y is a trivial functional dependency (i.e., Y is a subset of
X);
X is a superkey of the schema. In particular, if Y is a prime attribute,
then X cannot be a non-prime attribute.
Example 1:
Let us consider the following student enrolment table for the subject
SRegisterNo Subject Name Instructor
101 Cloud Computing Anita
101 Python Rajesh
102 HTML Anita
103 PHP David
104 Python Joel
Here one student can register for multiple subjects; for example, the student with
register number 101 can enrol for Cloud Computing as well as Python. Similarly,
multiple instructors can teach the same subject; for example, Python is taught by both
Rajesh and Joel.
The above table holds the following dependencies:
(SRegisterNo, Subject) → Instructor
Instructor → Subject
Instructor is not a superkey, and this non-prime attribute determines the prime
attribute Subject. This must not occur under BCNF; hence the above table is not in
BCNF.
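One possible lossless BCNF decomposition, sketched in SQL (types assumed): each
resulting relation now has a key on the left side of its only dependency. Note that this
decomposition does not preserve the dependency (SRegisterNo, Subject) → Instructor,
a known trade-off of BCNF.
CREATE TABLE instructor_subject (
    instructor VARCHAR(30) PRIMARY KEY,   -- Instructor -> Subject
    subject    VARCHAR(30)
);
CREATE TABLE student_instructor (
    sregisterno NUMBER(6),
    instructor  VARCHAR(30) REFERENCES instructor_subject(instructor),
    PRIMARY KEY (sregisterno, instructor)
);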
Join Dependencies and Fifth Normal Form
A relation is in 5NF if it is in 4NF, contains no join dependency, and all joins are
lossless.
5NF is satisfied when all the tables are broken into as many tables as possible in
order to avoid redundancy.
5NF is also known as project-join normal form (PJ/NF).
Example
In the table above, John takes both the Computer and Math classes for Semester 1 but
does not take the Math class for Semester 2. In this case, the combination of all these
fields is required to identify valid data.
Suppose we add a new Semester 3 but do not yet know the subject or who will be
teaching it, so we leave Lecturer and Subject as NULL. But all three columns together
act as the primary key, so we cannot leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2
& P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
UNIT 4 TRANSACTION PROCESSING AND CONCURRENCY
CONTROL
Transaction Concepts
A transaction is a logical unit of work that represents real-world events of an
organisation or enterprise, whereas concurrency control is the management of
concurrent transaction execution. Transaction processing systems execute database
transactions against large databases with hundreds of concurrent users, for example,
railway and air reservation systems, banking systems, credit card processing, stock
market monitoring, and supermarket inventory and checkout systems.
TRANSACTION :
A transaction is a logical unit of work of database processing that includes one or
more database access operations.
A transaction can be defined as an action or series of actions carried out by a
single user or application program to perform operations that access the contents of
the database. The operations can include retrieval (Read), insertion (Write), deletion,
and modification. A transaction must be either completed or aborted.
It can either be embedded within an application program or can be specified
interactively via a high-level query language such as SQL. Its execution preserves
the consistency of the database.
Each transaction should access shared data without interfering with other
transactions, and whenever a transaction successfully completes its execution, its
effect should be permanent.
This basic abstraction frees the database application programmer from the following
concerns :
Inconsistencies caused by conflicting updates from concurrent users.
Partially completed transactions in the event of systems failure.
User-directed undoing of transactions.
A transaction is a sequence of READ and WRITE actions that are grouped together to
form a database access. A transaction may consist of a simple SELECT operation to
generate a list of table contents, or it may consist of a series of related UPDATE
command sequences.
A transaction can include the following basic database access operations:
Operation Description
Read(X) Reads a database item X into a program variable
Write(X) Writes the value of a program variable into the database item X
A transaction that changes the contents of the database must move the database from
one consistent state to another. A consistent database state is one in which all data
integrity constraints are satisfied.
To ensure database consistency, every transaction must begin with the database in a
known consistent state.
ACID properties
PROPORTIES
Atomicity-either all operations of the transaction are properly executed in the
database or none are executed.
Consistency-the execution of a transaction must preserve the consistency of the
database; in the transfer example below, the sum of A and B must be unchanged by
the transaction, since without consistency money could be created or destroyed.
Execution of a transaction in isolation with no other transaction executing
concurrently preserves the consistency of the database.
Isolation-even though multiple transactions execute concurrently, the system
guarantees that, for every pair of transactions Ti and Tj,
*either Tj started execution after Ti finished
(or)
* Tj finished execution before Ti started.
Durability-after a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
Requirements of ACID Property:
Let us consider a transaction Ti in which A (Ajay) wants to send ₹1000 to his friend B (Shiva):
Ti: read(A)
A:=A-1000
Write(A)
Read(B)
B:=B+1000
Write(B)
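The same transfer written as SQL, a sketch assuming an account table with accno
and bal columns:
UPDATE account SET bal = bal - 1000 WHERE accno = 'A';   -- write(A)
UPDATE account SET bal = bal + 1000 WHERE accno = 'B';   -- write(B)
COMMIT;   -- if a failure occurs before COMMIT, ROLLBACK restores both balances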
Atomicity:
Suppose a system failure happens after write(A) but before write(B); then the values
of A and B are A = ₹4000 and B = ₹1000. The system destroyed ₹1000 as a result of
this failure.
If atomicity is ensured, either all actions of a transaction are reflected in the database
or none are.
The basic idea behind ensuring atomicity is that the system keeps track of the old
values of any data on which the transaction performs a write.
If the transaction does not complete its execution, the database system restores the
old values to make it appear as though the transaction never executed.
Ensuring atomicity is the responsibility of the database system itself; it is handled by
the transaction-management component.
Consistency:
The consistency requirement here is that the sum of A and B be unchanged by the
execution of the transaction.
Without the consistency requirement, money could be created or destroyed by the
transaction.
Ensuring the consistency of an individual transaction is the responsibility of the
programmer who codes it.
Isolation:
If several transactions execute concurrently, their operations may interleave in
some undesirable way, resulting in an inconsistent state.
A trivial way to avoid this concurrency problem is to execute transactions serially,
that is, one after the other.
Durability:
Durability guarantees that once a transaction completes successfully, all the updates
that it carried out on the database persist, even if there is a system failure after the
transaction completes execution.
A system failure may result in loss of data in main memory, but data written to disk
are never lost. Writing the updates carried out by the transaction to disk is therefore
sufficient to enable the database to reconstruct the updates when the database system
is restarted after a failure.
Transaction state:
A transaction must be in one of the following states:
Active:
The initial state; the transaction stays in this state while it is executing.
Partially committed:
After the final statement has been executed.
Failed:
After the discovery that normal execution can no longer proceed.
Aborted:
After the transaction has been rolled back and the database has been restored to its
state prior to the start of the transaction.
Committed:
After successful completion.
Concurrency Control and Need for Concurrency
What is Concurrency Control?
Concurrency Control in Database Management System is a procedure of managing
simultaneous operations without conflicting with each other. It ensures that Database
transactions are performed concurrently and accurately to produce correct results
without violating data integrity of the respective Database.
Concurrent access is quite easy if all users are just reading data. There is no way they
can interfere with one another. Though for any practical Database, it would have a
mix of READ and WRITE operations and hence the concurrency is a challenge.
DBMS Concurrency Control is used to address such conflicts, which mostly occur
with a multi-user system. Therefore, Concurrency Control is the most important
element for proper functioning of a Database Management System where two or more
database transactions are executed simultaneously, which require access to the same
data.
Potential problems of Concurrency
Here are some issues which you are likely to face while using the DBMS
Concurrency Control method:
Lost Updates occur when multiple transactions select the same row and update
the row based on the value selected
Uncommitted dependency issues occur when the second transaction selects a row
which is updated by another transaction (dirty read)
Non-Repeatable Read occurs when a second transaction is trying to access the
same row several times and reads different data each time.
Incorrect Summary issues occur when one transaction takes a summary over the
values of all the instances of a repeated data item while a second transaction updates
a few instances of that specific data item. In that situation, the resulting summary
does not reflect a correct result.
Why use Concurrency method?
Reasons for using a concurrency control method in a DBMS:
To apply isolation through mutual exclusion between conflicting transactions
To resolve read-write and write-write conflict issues
To preserve database consistency
The system needs to control the interaction among the concurrent transactions;
this control is achieved using concurrency-control schemes.
Concurrency control helps to ensure serializability
Example
Assume that two people go to electronic kiosks at the same time to buy a movie
ticket for the same movie and the same show time.
However, there is only one seat left for that movie show in that particular theatre.
Without concurrency control in DBMS, it is possible that both moviegoers will end up
purchasing a ticket. However, concurrency control method does not allow this to
happen. Both moviegoers can still access information written in the movie seating
database. But concurrency control only provides a ticket to the buyer who has
completed the transaction process first.
Locking protocols and Two phase locking
One way to ensure serializability is to require that data items be accessed in a
mutually exclusive manner; that is, while one transaction is accessing a data item, no
other transaction can modify that data item.
The most common method used to implement this requirement is to allow a
transaction to access a data item only if it is currently holding a lock on that item.
The transaction acquires the lock before accessing the item and releases it after it has
committed. While a transaction holds a lock, no other transaction is allowed to
acquire a conflicting lock and must therefore wait for the lock to be released.
There are two methods in which data item may be locked.
Shared mode lock (S)-the transaction can read the data item but cannot
write it.
Exclusive mode lock (X)-the transaction can perform both read and write
operations on the data item.
The concurrency-control manager is responsible for granting locks to transactions.
It checks the compatibility matrix shown in the figure; if the requested lock is
compatible with the locks currently held, the manager grants it.
     S   X
S    T   F
X    F   F
Figure: Lock Compatibility Matrix (T = compatible, F = incompatible)
Shared mode is compatible with shared mode but not with exclusive mode. At any
time several shared-mode locks can be held on a particular data item, whereas an
exclusive-mode lock request has to wait until the currently held shared-mode locks
are released.
A transaction requests a shared-mode lock on item Q by executing Lock-S(Q), and an
exclusive-mode lock by executing Lock-X(Q).
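In SQL these lock modes are usually requested implicitly, but they can also be taken
explicitly; a hedged Oracle-style sketch on the assumed account table:
LOCK TABLE account IN SHARE MODE;   -- table-level shared lock (like Lock-S)
SELECT bal FROM account
  WHERE accno = 'A' FOR UPDATE;     -- row-level exclusive lock (like Lock-X)
Both locks are held until the transaction commits or rolls back.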
To illustrate the lock-based protocol, consider again transaction T1, in which Ajay (A)
wants to send ₹1000 to his friend Shiva (B). The corresponding locking sequence
would be as follows.
Note: the initial account balances are A = ₹5000 and B = ₹1000.
T1: Lock-X(A)
Read(A)
A:=A-1000;
Write(A)
Unlock(A)
Lock-X(B)
B:=B+1000;
Write(B);
Unlock(B)
T1: Lock-X(A); Read(A); A := A - 1000; Write(A); Unlock(A)
T2: Lock-S(A); Read(A); Unlock(A); Lock-S(B); Read(B); Unlock(B); Display(A+B)
T1: Lock-X(B); B := B + 1000; Write(B); Unlock(B)
Because T1 unlocks A early, T2 reads A after the debit but B before the credit, so
Display(A+B) shows ₹1000 less than the true total.
SQL Facilities for Concurrency and Recovery
Anomaly Serializable is not the same as Serializable: it is necessary, but not
sufficient, for a Serializable schedule to be free of all three phenomena types (dirty
read, non-repeatable read, and phantom).
Two Phase Locking Protocol
A protocol that ensures serializability is the two-phase locking protocol. It requires
that each transaction issue lock and unlock requests in two phases:
1. Growing phase-during this phase the transaction may obtain locks but may not
release any lock.
2. Shrinking phase-the transaction may release locks but may not obtain any new
locks.
A transaction is initially in the growing phase and acquires locks as needed. Once
the transaction releases a lock, it enters the shrinking phase and can issue no
further lock requests.
Limitation of Two Phase Locking Protocol
Deadlock-deadlock is not prevented by the two-phase locking protocol.
Cascading rollback-cascading rollback may occur under the two-phase
locking protocol.
Let us consider the following schedule
T1: Lock-X(A); read(A); Lock-X(B); read(B); write(A); Unlock(A)
T2: Lock-X(A); read(A); write(A); Unlock(A)
T3: Lock-X(A); read(A)
The value of data item A written by T1 is read by T2, and the value written by T2 is
read by T3. A failure of transaction T1 therefore forces the failure of T2 and T3;
hence we need to perform a cascading rollback of T2 and T3.
Types of Two Phase Locking
Strict two-phase locking protocol-requires not only that locking be two-phase,
but also that all exclusive-mode locks taken by a transaction be held until
that transaction commits.
Rigorous two-phase locking protocol-stricter than the strict protocol: here all
locks, not only exclusive-mode locks, must be held until the transaction
commits.
Lock Conversion
There are two modes of conversion
Upgrade conversion-conversion from shared to exclusive mode
is known as upgrade conversion. It can take place only in the
growing phase.
Downgrade conversion-conversion from exclusive mode to
shared mode is known as downgrade conversion. It can take
place only in the shrinking phase.
Time Stamp Based Protocol
This protocol ensures serializability. It selects an ordering among transactions in
advance using timestamps: each transaction Ti is assigned a timestamp TS(Ti) by the
database system before Ti starts its execution. If Ti has been assigned a timestamp
and a new transaction Tj enters the system, then
TS(Ti) < TS(Tj)
and Ti appears before Tj in the ordering.
W-timestamp(Q)-denotes the largest timestamp of any transaction that executed
write(Q) successfully.
R-timestamp(Q)-denotes the largest timestamp of any transaction that executed
read(Q) successfully.
UNIT 5 IMPLEMENTATION TECHNIQUES
RAID
RAID (redundant array of independent disks) is a way of storing the same data in
different places on multiple hard disks or solid-state drives to protect data in the case
of a drive failure. There are different RAID levels, however, and not all have the goal
of providing redundancy.
How RAID works
RAID works by placing data on multiple disks and allowing input/output (I/O)
operations to overlap in a balanced way, improving performance. Because using
multiple disks lowers the mean time between failures (MTBF) of the array as a
whole, storing data redundantly is what increases fault tolerance.
RAID arrays appear to the operating system (OS) as a single logical drive. RAID
employs the techniques of disk mirroring and disk striping. Mirroring copies
identical data onto more than one drive. Striping partitions the data and spreads it
over multiple disk drives: each drive's storage space is divided into units ranging
from a sector (512 bytes) up to several megabytes, and the stripes of all the disks are
interleaved and addressed in order.
Ordered Indices
A clustered index is an index whose search key also defines the sequential order of
the file.
There are two types of ordered indices:
Dense Index:
In a dense index, an index entry appears for every search-key value. The index
record contains the search-key value and a pointer to the actual record.
Sparse Index:
In a sparse index, an index entry appears for only some of the search-key values.
To locate a record, we find the index entry with the largest search-key value that is
less than or equal to the search-key value for which we are looking.
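In SQL an index is created declaratively; whether it is maintained as dense or sparse
is up to the DBMS. A sketch on the account relation from Unit 1:
CREATE INDEX account_bname_idx ON account (bname);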
B+ Tree Index Files
In a B+ tree, search-key values in the leaf nodes are kept in sorted order.
Algorithm for Insertion Operation
Step 1: Find the correct leaf node L.
Step 2: Add the entry to leaf node L.
If L has enough space, we are done.
Else L must be split (into L and a new node L1):
Allocate the new node
Redistribute the entries evenly
Copy up the middle key
Insert an index entry pointing to L1 into the parent of L
Step 3: To split an index node, redistribute the entries evenly and push up the middle
key. Apply step 3 recursively.
Step 4: The tree can grow due to node splitting; a root split increases the height.
Let us insert the following keys into a B+ tree (order of the tree 3):
5, 3, 4, 9, 7, 15, 14, 21, 22, 23
B tree Index files
B-Tree is a self-balancing search tree. In most of the other self-balancing search
trees (like AVL and Red-Black Trees), it is assumed that everything is in main
memory. To understand the use of B-Trees, we must think of the huge amount of
data that cannot fit in main memory. When the number of keys is high, the data is
read from disk in the form of blocks. Disk access time is very high compared to the
main memory access time. The main idea of using B-Trees is to reduce the number
of disk accesses. Most of the tree operations (search, insert, delete, max, min, ..etc )
require O(h) disk accesses where h is the height of the tree. B-tree is a fat tree. The
height of B-Trees is kept low by putting maximum possible keys in a B-Tree node.
Generally, the B-Tree node size is kept equal to the disk block size. Since the height
of the B-tree is low so total disk accesses for most of the operations are reduced
significantly compared to balanced Binary Search Trees like AVL Tree, Red-Black
Tree, ..etc.
Time Complexity of B-Tree:
1. Search O(log n)
2. Insert O(log n)
3. Delete O(log n)
We can see in the diagram that all the leaf nodes are at the same level, and every
non-leaf node has no empty subtree and has one fewer key than its number of
children.
Static Hashing and Dynamic Hashing
As a file grows in size, performance degrades under the sequential file
organization mechanism, and it becomes impractical to search through every index
level to reach the desired data.
Hashing is an effective technique to calculate the direct location of a data record on the
disk without using an index structure.
Hashing uses a hash function, applied to the search key, to generate the address of a
data record.
There are two main types of hashing techniques
1. Static Hashing
2. Dynamic Hashing
Static Hashing
In static hashing, when a search-key value is provided, the hash function always
computes the same address. For example, if the mod-10 hash function is used, it
generates only 10 values; the number of buckets is fixed at 10, and the output address
is always the same for a given key.
Let us consider we want to create the table to show the list of available database
management tools.
With the hash function h(k) = k mod 10, the tools are stored in the buckets as
follows, where k is the search-key value (here, the serial number) and 10 is the
number of buckets.
Tools table would be
The hash function maps each key to a bucket: h(10) = 10 mod 10 = 0, so Oracle is
stored in bucket 0. Similarly, h(19) = 19 mod 10 = 9, so CouchDB is stored in
bucket 9.
Optimization
In this step, query evaluation plans are prepared from the relational algebra
expressions.
The query cost of each evaluation plan is calculated using information from the
database, such as the number of tuples in each relation and the size of the
tuples.
Evaluation
Query evaluation engine takes a query evaluation plan, executes that plan and
returns the answers to the query.
For example, select salary from instructor where salary < 75000;
The above query can be translated into either of the following expressions:
σ salary<75000 (π salary (instructor))
π salary (σ salary<75000 (instructor))
To evaluate a query, in addition to relational algebra expression we need to give
instructions specifying how to evaluate each operation.
Evaluation Primitive
Relational algebra operation annotated with instructions on how to evaluate it is
called an evaluation primitive
A sequence of primitive operations that can be used to evaluate a query is a query
evaluation plan
π salary
   |
σ salary<75000
   |
instructor
The query evaluation engine takes the query evaluation plan and executes the plan
and returns the answers to the query.
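Most systems let you inspect the plan the engine chose. An Oracle-style sketch,
assuming the instructor table exists:
EXPLAIN PLAN FOR
  SELECT salary FROM instructor WHERE salary < 75000;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- display the chosen evaluation plan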