1904001-DBMS Notes 5 Units


UNIT 1 INTRODUCTION TO DATABASE

Purpose of Database System and Views of data


Definition: A database management system (DBMS) is a collection of interrelated data and a
set of programs that allow users to access and modify those data.
The primary goal of a DBMS is to provide a way to store and retrieve the required
information from the database in a convenient and efficient manner.
Views of data
A major purpose of a database system is to provide users with an abstract view of
data; that is, the system hides certain details of how the data are stored and
maintained. This view of the system helps the user to retrieve the data more efficiently.
Data Abstraction
Data abstraction means retrieving only the required data and hiding the background
details.
To simplify user interaction with the system, there are three levels of abstraction:
 Physical Level
 Logical Level
 View Level
Physical Level
The lowest level of abstraction describes how the data are actually stored.
This level describes complex low-level data structures.
Logical Level
The next higher level of abstraction describes what data are stored in the
database and what relationships exist among those data.
View Level
The highest level of abstraction describes only part of the entire database. This
level simplifies the user's interaction with the system. It can provide
multiple views of the same database.
For example, let us consider the relation
Doctor (Doc_id, DName, Qualification, Salary), which at the physical level is
represented as follows:

Doc_id         integer    2 bytes
DName          character  4 bytes
Qualification  character  6 bytes
Salary         character  4 bytes

Many database systems hide these lowest-level storage details and provide an abstract
view of data to the user.
The type definition of the Doctor relation is decided at the logical level.
At the view level, the user is allowed to access only a specific view of the record. For
instance, a patient can view the name of the doctor, or the qualification of the doctor, but
cannot access the doctor's salary.
Instances and Schemas
Schema
Overall design of the database is called the database schema. That means we
specify the attributes needed and the data type of each attribute. For example, the
schema for an Airport record can be:

Code  City  State

Instance
The collection of information stored in the database at a particular moment is
called an instance. For example, the following is an instance of the Airport relation:

Code  City     State
200   Chennai  Tamil Nadu
300   Kochi    Kerala

A database has several schemas, based on the levels of abstraction:

 Physical schema: describes the database design at the physical level
 Logical schema: describes the database design at the logical level
 Subschema: a database may have several schemas at the view level, which are
called subschemas; they describe different views of the database
Data Models
A data model gives us an idea of how the final system will look after its
complete implementation. It defines the data elements and the relationships between
the data elements. Data models are used to show how data is stored, connected,
accessed and updated in the database management system. Here, we use a set of
symbols and text to represent the information so that members of the organization
can communicate and understand it. Though there are many data models in use
nowadays, the Relational model is the most widely used. Apart from the
Relational model, there are many other types of data models, which are described
below. Some of the data models in DBMS are:
 Hierarchical Model
 Network Model
 Entity-Relationship Model
 Relational Model
 Object-Oriented Data Model
 Object-Relational Data Model
 Flat Data Model
 Semi-Structured Data Model
 Associative Data Model
 Context Data Model
Hierarchical Model
The Hierarchical Model was the first DBMS model. This model organises the data in a
hierarchical tree structure. The hierarchy starts from the root, which holds the root data, and
then expands in the form of a tree, adding child nodes to parent nodes. This
model easily represents some real-world relationships, such as food recipes or the
sitemap of a website.
Relational Model
The relational model uses a collection of tables to represent both data and the relationships
among those data.
A table is known as a relation. A relation can have one or more columns, and each
column has a unique name.
For example, the following tables illustrate the relational model through the relationship
between the Person and Employee relations.
Person Rakesh lives in Mumbai and works in the company Infosys. The
relationship between these two relations is maintained by the PID column.

Person relation:
PID  Name    City
101  Rakesh  Mumbai
102  Sham    Delhi

Employee relation:
PID  Industry
101  Infosys
102  TCS
Advantages:
 The relational model allows the designer to focus on logical design rather than
physical design; hence relational models are conceptually simple to
understand
 The required information can be retrieved easily using a simple query language
 They are easy to maintain and use
Disadvantages
 It requires powerful hardware and large data storage devices
 May lead to slower processing time
 Poorly designed systems lead to poor implementation of database systems
Database System Architecture
A database system is partitioned into modules that deal with each of the responsibilities
of the overall system.
The database system architecture is broadly classified into two modules:
1. Storage Manager
2. Query Processor
Storage Manager
The storage manager is very important because a database typically requires a large amount
of storage space.
The storage manager is a program module that provides the interface between the low-level
data stored in the database and the application programs and queries submitted
to the system.
The storage manager is responsible for interaction with the file manager. It
translates the various DML (Data Manipulation Language) statements into low-level
file-system commands. Thus the storage manager is responsible for storing, retrieving and
updating data in the database.
The storage manager components include-
1. Authorization and integrity manager: tests for the satisfaction of
integrity constraints and checks the authority of users to access the data.
2. Transaction manager: ensures that the database remains in a
consistent state despite system failures and that concurrent transaction
execution proceeds without conflicts.
3. File manager: manages the allocation of space on disk storage
and the data structures used to represent information stored on disk.
4. Buffer manager: responsible for fetching data from disk storage
into main memory and deciding what data to cache in main
memory. The buffer manager is a critical part of the system, since it enables the
database to handle data sizes that are much larger than the size of main
memory.
The storage manager implements several data structures, such as:
 Data files: store the database itself
 Data dictionary: stores the metadata about the structure of the
database
 Indices: used to provide fast access to data items
Introduction to relational database -Relational model and keys
RELATIONAL MODEL
A relational database consists of a collection of tables, each of which is assigned a unique
name. A table corresponds to an entity set, and a row corresponds to an entity.
For each attribute there is a set of permitted values, called the domain of the attribute.
For example, in the account table shown below, the domain of the attribute bname is the set
of all branch names.
Let D1 denote the set of all account numbers, D2 the set of all branch names and D3 the
set of all balances. Then the relation account will be a subset of D1 × D2 × D3.

Account
Accno  bname       Bal
A101   Tambaram    10000
D102   T Nagar     50000
E101   MM Nagar    10000
A108   Velachery   20000
A120   Anna Nagar  40000

Relational Algebra
Relational Algebra is a procedural language in which user instructs the system to
perform a sequence of operations on the database to compute the desired result. It
consists of a set of operations that take one or two relations as input and produce a
new relation as their result.
Select, project and rename are called unary operations, because they operate
on only one relation. Union, set difference and Cartesian product operate on
pairs of relations and are called binary operations.
Select
The select operation selects tuples that satisfy a given predicate.
loan
Lno  bname       Amount
102  Tambaram    12000
120  Tambaram    14000
100  MM Nagar    60000
400  Velachery   20000
500  Anna Nagar  40000

Example 1: "Select those tuples of the loan relation where the branch is Tambaram."
σ is used to denote selection; the predicate appears as a subscript to σ:
σbname = "Tambaram" (loan)
Output:

Lno  bname     Amount
102  Tambaram  12000
120  Tambaram  14000

Example 2: "Select those tuples of the loan relation where the loan amount is more than
12000 and the loan is made by the Tambaram branch."

σbname = "Tambaram" ∧ amount > 12000 (loan)

Output:

Lno  bname     Amount
120  Tambaram  14000

Project
The project operation is a unary operation that returns its argument relation with certain
attributes left out.
Example 1: "List all lno's and amounts of the loan relation."
πlno, amount (loan)
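For comparison, the same two operations can be written in SQL (a sketch, assuming a loan table with columns lno, bname and amount as above):

SELECT * FROM loan WHERE bname = 'Tambaram';   -- selection (σ)
SELECT lno, amount FROM loan;                  -- projection (π)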
Entity-Relationship model and ER Diagrams
The entity-relationship model provides a conceptual view of a database. It represents the overall
structure of a database. The ER model is very useful in mapping the meanings and
interactions of real-world entities onto a conceptual schema.
ER Diagrams

(ER diagram notation: a rectangle denotes an entity set, an oval an attribute, and a diamond a relationship set.)


The ER model consists of two basic concepts:
1. Entity Set
Entity: An entity is an object in the real world that is distinguishable from all
other objects. An entity has a set of attributes or properties that are used to
identify it.

Note: An entity set is shown as a rectangular box and attributes are shown in ovals.
The primary keys are underlined.

 An entity set is a collection of related entities, all of the same type. For example,
an entity set student contains student entities. Each entity in the entity set
shares the same set of attributes, which are used to distinguish it from entities of
other entity sets. As shown in the figure, the student entity set has attributes such as
SRegno, Sname and address, where SRegno is the primary key used to identify the
entities in the student entity set.

 Relationship set: A relationship is an association among several entities.
A relationship set is a set of relationships of the same type. Let us consider the two
entity sets student and company; we can define a relationship set Placed to
denote the students placed in a particular company.

Types of Attributes
 Simple and composite attributes: Attributes that have not been divided into
subparts are called simple attributes, for example SRegno in the student entity set.
Attributes that have been divided into subparts are called composite attributes.
 Single-valued and multivalued attributes: An attribute having a single value for a
particular entity is called a single-valued attribute, for example SRegno in the
student entity set. Attributes having a set of values for a particular entity are called
multivalued attributes; for example, a student can have more than one email
address.
 Derived attribute: The value of this type of attribute can be derived from the
values of other related attributes. For example, an employee's experience can be
calculated from the date of joining the particular company. As another
example, a student's age can be derived from the date of birth.
Enhanced ER Model
Extended ER features
Specialization
The process of designating subgroupings within an entity set is called
specialization. Specialization is represented by a triangle. It is a top-down approach.
The label "ISA" stands for "is a". For example, as shown in the figure, a Person entity is
either an employee or a customer. The ISA relationship is also referred to as the
superclass-subclass relationship.

The entity set employee would have all the attributes of Person and an additional
attribute Salary. Similarly, the entity set customer would have credit-rating as an
additional attribute.
Generalization
It is a simple inversion of specialization. There are some similarities between the
customer entity set and the employee entity set: they have several attributes that are the
same across the two entity sets. These commonalities, synthesized into a single
higher-level entity set, are known as generalization. It is a bottom-up approach.
Aggregation
Aggregation is a feature of the entity-relationship model that allows a relationship set to
participate in another relationship set. It is an abstraction through which relationships are
treated as higher-level entities.
For example, we treat the relationship set writes and the entity sets Author and
Book as a higher-level entity set called writes, indicated by a dashed box around the
aggregation, as shown in the figure.
ER-to-Relational Mapping
Consider the relation schema given below. Design and draw an ER diagram that
captures the information of this schema.
Employee(empno, name, office, age)
Books(isbn, title, authors, publisher)
Loan(empno, isbn, date)

Construct an ER diagram for a car insurance company whose customers own one or
more cars each. Each car has associated with it zero to any number of recorded
accidents. Each insurance policy covers one or more cars and has one
or more premium payments associated with it. Each payment is for a particular period
of time and has an associated due date and the date when the payment was received.
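As an illustration of how the first schema above maps to tables, here is a minimal SQL sketch (the column types are assumptions; only the attribute names are given in the exercise):

CREATE TABLE Employee (
    empno  INTEGER PRIMARY KEY,
    name   VARCHAR(30),
    office VARCHAR(20),
    age    INTEGER
);

CREATE TABLE Books (
    isbn      VARCHAR(13) PRIMARY KEY,
    title     VARCHAR(50),
    authors   VARCHAR(100),
    publisher VARCHAR(30)
);

CREATE TABLE Loan (
    empno    INTEGER REFERENCES Employee(empno),
    isbn     VARCHAR(13) REFERENCES Books(isbn),
    loandate DATE,                 -- "date" renamed, since DATE is a reserved word
    PRIMARY KEY (empno, isbn)
);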
UNIT 2 SQL FUNDAMENTALS
Relational Algebra
Relational algebra is a procedural query language, which takes instances of relations
as input and yields instances of relations as output. It uses operators to perform
queries. An operator can be either unary or binary. They accept relations as their
input and yield relations as their output. Relational algebra is performed recursively
on a relation and intermediate results are also considered relations.
The fundamental operations of relational algebra are as follows −
 Select
 Project
 Union
 Set difference
 Cartesian product
 Rename
We will discuss all these operations in the following sections.
Select Operation (σ)
It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a propositional
logic formula which may use connectors like and, or, and not. These terms may use
relational operators like =, ≠, ≥, <, >, ≤.
For example −

σsubject = "database"(Books)

Output − Selects tuples from books where subject is 'database'.

σsubject = "database" and price = "450"(Books)

Output − Selects tuples from books where subject is 'database' and 'price' is 450.

σsubject = "database" and price = "450" or year > "2010"(Books)

Output − Selects tuples from books where subject is 'database' and 'price' is 450 or
those books published after 2010.
Project Operation (∏)
It projects column(s) that satisfy a given predicate.
Notation − ∏A1, A2, ..., An (r)
Where A1, A2, ..., An are attribute names of relation r.
Duplicate rows are automatically eliminated, as relation is a set.
For example −

∏subject, author (Books)

Selects and projects columns named as subject and author from the relation Books.
Union Operation (∪)
It performs binary union between two given relations and is defined as −

r ∪ s = { t | t ∈ r or t ∈ s}

Notation − r U s
Where r and s are either database relations or relation result set (temporary relation).
For a union operation to be valid, the following conditions must hold −
 r and s must have the same number of attributes.
 Attribute domains must be compatible.
 Duplicate tuples are automatically eliminated.

∏ author (Books) ∪ ∏ author (Articles)

Output − Projects the names of the authors who have either written a book or an
article or both.
Set Difference (−)
The result of set difference operation is tuples, which are present in one relation but
are not in the second relation.
Notation − r − s
Finds all the tuples that are present in r but not in s.

∏ author (Books) − ∏ author (Articles)

Output − Provides the name of authors who have written books but not articles.
SQL fundamentals
SQL is a standard query language for requesting information from relational
databases. The original version, called SEQUEL (Structured English Query Language),
was designed at IBM in the mid-1970s, and in 1979 Oracle Corporation introduced
SQL as a commercial database system.
The SQL language has several components:
 Data Definition Language (DDL): The DDL commands allow us to define
relational schemas and to delete or modify them.
 Data Manipulation Language (DML): Provides constructs for querying
information from the relational database and is used to insert, delete and modify
tuples in the database.
 Data Control Language (DCL): Controls user access to the database objects.
 Transaction Control Language (TCL): Provides commands for specifying the
beginning and ending of transactions.
Data Types
The type of information to be stored in a relation is defined by the data types of the
attributes at table-creation time.
SQL supports the following data types:
 char(n): n specifies the size of the variable. If the size is 10, then values must have
exactly 10 characters. The maximum size is 255; the default size is 1.
 varchar(n): all column values must be less than or equal to the size n of the
variable.
 number(n): n specifies the size of the number variable.
 number(n,d): the number consists of n digits, and d of the n digits are to the right of
the decimal point. For example, number(5,2) can store a value such as 300.50
exactly.
 date: used to store fixed-length date values. The date function SYSDATE returns
the current date and time.
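A minimal sketch tying these types together (the table and column names are illustrative assumptions):

CREATE TABLE doctor (
    doc_id        NUMBER(4) PRIMARY KEY,   -- fixed-size number
    dname         VARCHAR(30),             -- variable-length string, up to 30 characters
    qualification CHAR(6),                 -- fixed-length string, exactly 6 characters
    salary        NUMBER(8,2),             -- 8 digits, 2 to the right of the decimal point
    joined_on     DATE DEFAULT SYSDATE     -- SYSDATE supplies the current date and time
);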
SQL COMMANDS
SQL commands are used to perform CRUD operation that means used to create, read,
update and delete operation with the values stored on the relational database.
DDL COMMANDS
DDL commands are used to create the overall schema of a table, to add new
columns to the database and to modify existing columns.

Command    Description
CREATE     Used to create tables, views, indexes, synonyms and other objects in the database
ALTER      Used to modify an existing database object
DROP       Used to delete a table
RENAME     Used to rename an existing table
TRUNCATE   Used to delete the contents of an existing table
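For instance, a short DDL sketch using the Airport schema from Unit 1:

CREATE TABLE airport (
    code  NUMBER(3) PRIMARY KEY,
    city  VARCHAR(20),
    state VARCHAR(20)
);
ALTER TABLE airport ADD (country VARCHAR(20));  -- add a new column
RENAME airport TO airports;                     -- rename the table
TRUNCATE TABLE airports;                        -- remove all rows, keep the structure
DROP TABLE airports;                            -- remove the table itself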
DML COMMANDS
DML commands are used to query and manipulate the existing database objects.

Command  Description
SELECT   Used to retrieve information from one or more tables based on the given condition
INSERT   Used to insert values into a table
UPDATE   Used to modify records
DELETE   Used to delete records
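A short sketch against the airport table created above:

INSERT INTO airport VALUES (200, 'Chennai', 'Tamil Nadu');
UPDATE airport SET city = 'Kochi' WHERE code = 300;
SELECT code, city FROM airport WHERE state = 'Tamil Nadu';
DELETE FROM airport WHERE code = 200;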
Data Control Language (DCL)

Command  Description
GRANT    Used to give privileges to a user for access/modification
REVOKE   Used to withdraw granted privileges from a user
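For example (the user name clerk is an illustrative assumption):

GRANT SELECT, INSERT ON airport TO clerk;   -- allow clerk to read and insert rows
REVOKE INSERT ON airport FROM clerk;        -- withdraw the insert privilege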
Transaction Control Language (TCL)

Command    Description
COMMIT     Used to save the changes made permanently to the database
ROLLBACK   Used to undo the last operation
SAVEPOINT  Used to temporarily save a transaction so that we can roll back to that point whenever required
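A sketch of the three commands working together:

INSERT INTO airport VALUES (400, 'Madurai', 'Tamil Nadu');
SAVEPOINT before_update;                        -- mark a point to roll back to
UPDATE airport SET city = 'Trichy' WHERE code = 400;
ROLLBACK TO before_update;                      -- undo only the work after the savepoint
COMMIT;                                         -- make the remaining work (the insert) permanent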
Advanced SQL features-Triggers
Triggers are SQL code blocks that are automatically executed in response to certain
events on a particular table. They are used to maintain the integrity of the data. A
trigger in SQL works similarly to a real-world trigger: when a gun's trigger is pulled,
a bullet is fired. How is this related to triggers in SQL? To understand this, let's
consider a hypothetical situation.

John is the marketing officer in a company. When new customer data is entered into
the company's database, he has to send a welcome message to each new customer. If
it is one or two customers, John can do it manually, but what if the count is more than
a thousand? In such a scenario, triggers come in handy.
John can easily create a trigger which will automatically send a welcome
email to new customers once their data is entered into the database.
Always remember that there cannot be two triggers with the same action time and event
for one table. For example, we cannot have two BEFORE UPDATE triggers for a
table, but we can have a BEFORE UPDATE and a BEFORE INSERT trigger, or
a BEFORE UPDATE and an AFTER UPDATE trigger.
Before diving further into the fundamentals of triggers, it helps to understand
the concepts of SQL basics and normalization, to get a better grip on triggers in SQL.
Syntax and Example
Let's now look at the syntax of a trigger.

Create Trigger Trigger_Name
(Before | After) [Insert | Update | Delete]
on [Table_Name]
[for each row | for each column]
[trigger_body]
Now let me break down this syntax and explain each and every part in detail.
 Create Trigger
These two keywords are used to specify that a trigger block is going to be
declared.
 Trigger_Name
It specifies the name of the trigger. The trigger name has to be unique.
 ( Before | After )
This specifies when the trigger will be executed. It tells us the time at which the
trigger is initiated, i.e, either before the ongoing event or after.
 Before triggers are used to update or validate record values before they're saved
to the database.
 After triggers are used to access field values that are set by the system and to
effect changes in other records. The records that activate an after trigger are
read-only; we cannot use an after trigger to update the activating record, because
that would lead to a read-only error.
 [ Insert | Update | Delete ]
These are the DML operations and we can use either of them in a given trigger.
 on [ Table_Name ]
We need to mention the table name on which the trigger is being applied. Don’t
forget to use on keyword and also make sure the selected table is present in the
database.
1. A row-level trigger is executed before or after any column value of a
row changes.
2. A column-level trigger is executed before or after the specified
column changes.
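A minimal sketch of a row-level trigger for John's scenario above (MySQL-style syntax; the customer and welcome_queue tables and their columns are illustrative assumptions):

CREATE TRIGGER queue_welcome_mail
AFTER INSERT ON customer
FOR EACH ROW
    -- NEW refers to the row that was just inserted
    INSERT INTO welcome_queue (email, queued_on)
    VALUES (NEW.email, NOW());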
Nested Queries
In nested queries, a query is written inside another query. The result of the inner query
is used in the execution of the outer query. We will use the STUDENT, COURSE and
STUDENT_COURSE tables for understanding nested queries.

STUDENT
S_ID  S_NAME  S_ADDRESS  S_PHONE     S_AGE
S1    RAM     DELHI      9455123451  18
S2    RAMESH  GURGAON    9652431543  18
S3    SUJIT   ROHTAK     9156253131  20
S4    SURESH  DELHI      9156768971  18

COURSE
C_ID  C_NAME
C1    DSA
C2    Programming
C3    DBMS

STUDENT_COURSE
S_ID  C_ID
S1    C1
S1    C3
S2    C1
S3    C2
S4    C2
S4    C3

There are mainly two types of nested queries:

 Independent nested queries: In independent nested queries, query execution
starts from the innermost query and proceeds to the outermost query. The execution
of the inner query is independent of the outer query, but the result of the inner query
is used in the execution of the outer query. Operators like IN, NOT IN, ANY and ALL
are used in writing independent nested queries.
 Co-related nested queries: In co-related nested queries, the output of the inner
query depends on the row currently being processed by the outer query. For example,
if we want to find the S_NAME of STUDENTs who are enrolled in C_ID 'C1',
it can be done with the help of a co-related nested query, as shown in the sketch below.
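Sketches of both styles against the tables above (standard SQL):

-- Independent nested query: the inner query runs once, on its own
SELECT S_NAME
FROM   STUDENT
WHERE  S_ID IN (SELECT S_ID FROM STUDENT_COURSE WHERE C_ID = 'C1');

-- Co-related nested query: the inner query is re-evaluated for each outer row
SELECT S.S_NAME
FROM   STUDENT S
WHERE  EXISTS (SELECT 1 FROM STUDENT_COURSE SC
               WHERE  SC.S_ID = S.S_ID AND SC.C_ID = 'C1');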
Joins-Inner and Join Outer
SQL JOIN
A JOIN clause is used to combine rows from two or more tables, based on a related
column between them.
Let's look at a selection from the "Orders" table:

OrderID  CustomerID  OrderDate
10308    2           1996-09-18
10309    37          1996-09-19
10310    77          1996-09-20

Then, look at a selection from the "Customers" table:

CustomerID  CustomerName                        ContactName     Country
1           Alfreds Futterkiste                 Maria Anders    Germany
2           Ana Trujillo Emparedados y helados  Ana Trujillo    Mexico
3           Antonio Moreno Taquería             Antonio Moreno  Mexico

Notice that the "CustomerID" column in the "Orders" table refers to the
"CustomerID" in the "Customers" table. The relationship between the two tables
above is the "CustomerID" column.
Then, we can create the following SQL statement (that contains an INNER JOIN),
that selects records that have matching values in both tables:
Example
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;
and it will produce something like this:

OrderID  CustomerName                        OrderDate
10308    Ana Trujillo Emparedados y helados  9/18/1996
10365    Antonio Moreno Taquería             11/27/1996
10383    Around the Horn                     12/16/1996
10355    Around the Horn                     11/15/1996
10278    Berglunds snabbköp                  8/12/1996

Different Types of SQL JOINs


Here are the different types of the JOINs in SQL:
 (INNER) JOIN: Returns records that have matching values in both tables
 LEFT (OUTER) JOIN: Returns all records from the left table, and the
matched records from the right table
 RIGHT (OUTER) JOIN: Returns all records from the right table, and the
matched records from the left table
 FULL (OUTER) JOIN: Returns all records when there is a match in either
left or right table
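For example, a LEFT JOIN version of the earlier query returns every customer, with NULLs in the order columns for customers who have no orders (a sketch against the same tables):

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;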
Join Types
Currently dplyr supports four types of mutating joins and two types of filtering joins.
Mutating joins combine variables from the two data.frames:
inner_join()
return all rows from x where there are matching values in y, and all columns
from x and y. If there are multiple matches between x and y, all combinations of the
matches are returned.

left_join()
return all rows from x, and all columns from x and y. Rows in x with no match
in y will have NA values in the new columns. If there are multiple matches
between x and y, all combinations of the matches are returned.
right_join()
return all rows from y, and all columns from x and y. Rows in y with no match
in x will have NA values in the new columns. If there are multiple matches
between x and y, all combinations of the matches are returned.
full_join()
return all rows and all columns from both x and y. Where there are not matching
values, returns NA for the one missing.
Filtering joins keep cases from the left-hand data.frame:
semi_join()
return all rows from x where there are matching values in y, keeping just columns
from x.
A semi join differs from an inner join because an inner join will return one row
of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join()
return all rows from x where there are not matching values in y, keeping just columns
from x.
Grouping
Groups are ignored for the purpose of joining, but the result preserves the grouping
of x.
Examples
# "Mutating" joins combine variables from the LHS and RHS
band_members %>% inner_join(band_instruments)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)

# "Filtering" joins keep cases from the LHS
band_members %>% semi_join(band_instruments)
band_members %>% anti_join(band_instruments)

# To suppress the join-column message, supply `by` explicitly.
# This is good practice in production code.
band_members %>% inner_join(band_instruments, by = "name")

# Use a named `by` if the join variables have different names.
# Note that only the key from the LHS is kept.
band_members %>% full_join(band_instruments2, by = c("name" = "artist"))
Join-Functions
Join Two Tbls Together
These are generic functions that dispatch to individual tbl methods - see the method
documentation for details of individual data sources. x and y should usually be from
the same data source, but if copy is TRUE, y will automatically be copied to the same
source as x.
Usage
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
semi_join(x, y, by = NULL, copy = FALSE, ...)
anti_join(x, y, by = NULL, copy = FALSE, ...)
Arguments
x, y
tbls to join
by
a character vector of variables to join by. If NULL, the default, *_join() will do a
natural join, using all variables with common names across the two tables. A message
lists the variables so that you can check they're right (to suppress the message, simply
explicitly list the variables that you want to join).
To join by different variables on x and y use a named vector. For example, by = c("a"
= "b") will match x.a to y.b.
copy
If x and y are not from the same data source, and copy is TRUE, then y will be copied
into the same src as x. This allows you to join tables across srcs, but it is a potentially
expensive operation so you must opt into it.
suffix
If there are non-joined duplicate variables in x and y, these suffixes will be added to
the output to disambiguate them. Should be a character vector of length 2.
...
other parameters passed onto methods, for instance, na_matches to control
how NA values are matched. See join.tbl_df for more.
CARTESIAN JOIN: The CARTESIAN JOIN is also known as CROSS JOIN. In a
CARTESIAN JOIN, each row of one table is joined to every row of another
table. This usually happens when the matching column or WHERE condition is not
specified.
 In the absence of a WHERE condition, the CARTESIAN JOIN behaves like
a CARTESIAN PRODUCT, i.e., the number of rows in the result set is the
product of the numbers of rows of the two tables.
 In the presence of a WHERE condition, this JOIN functions like an INNER
JOIN.
 Generally speaking, a cross join is similar to an inner join whose
join condition always evaluates to true.
Syntax:
SELECT table1.column1 , table1.column2, table2.column1...
FROM table1
CROSS JOIN table2;
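With the three-row Customers and Orders selections shown earlier, a CROSS JOIN would return 3 × 3 = 9 rows:

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
CROSS JOIN Orders;   -- every customer paired with every order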
UNIT 3 NORMALIZATION
Functional Dependencies
A functional dependency is a constraint between two sets of attributes in a relation;
it describes the relationship between attributes in a relation.
If A and B are attributes of relation R, then B is functionally dependent on A if each
value of A is associated with exactly one value of B, where A and B may each consist
of one or more attributes. This is written

A → B

The value of the A component of a tuple uniquely determines the value of the B component.

When a functional dependency is present, the dependency is specified as a constraint
between the attributes.
Consider the following relational schema EMP-PROJ
SSN   Pnumber  Duration   Ename           Pname             Plocation
1234  105      32 Months  Smith, Kumar    SmartCity         Trichy
1587  205      56 Months  Sheela, Ravi    IOT               Chennai
1698  567      48 Months  Ramesh, Sugand  Waste Management  Madurai

The following functional dependencies are present:

1. SSN → Ename
2. Pnumber → {Pname, Plocation}
3. {SSN, Pnumber} → Duration

Functional dependency (1) specifies that the value of an employee's social security
number (SSN) determines the employee name (Ename).
Functional dependency (2) specifies that the value of a project's number (Pnumber)
uniquely determines the project name and project location.
By functional dependency (3), a combination of SSN and Pnumber values uniquely
determines the duration for which the employee works on the project.
Inference Rule for FD
In practice it is impossible to specify all functional dependencies for a given
situation explicitly.
For example, if each department has one manager, then dept-no uniquely
determines Mgr-Ssn (dept-no → Mgr-Ssn), and a manager has a unique phone
number (Mgr-Ssn → Mgr-Phone).
These two dependencies together imply that dept-no → Mgr-Phone. This is an
inferred FD and need not be stated explicitly in addition to the two given functional
dependencies.
It is useful to define a concept called closure that includes all possible dependencies
that can be inferred from the given set of functional dependencies.
Closure Set of Functional Dependency
Definition: The set of all functional dependencies that are implied by a given set of
functional dependencies X is called the closure of X, written X+.
A set of inference rules is needed to compute X+ from X.
 Reflexivity rule: If B is a subset of A (B ⊆ A), then A → B. A functional dependency
A → B is said to be trivial if B ⊆ A; otherwise it is nontrivial.
 Augmentation rule: Adding the same set of attributes to both the left-hand and
right-hand sides of a dependency results in another valid dependency: if A → B
holds, then AC → BC holds.
 Transitivity rule: If A → B and B → C, then A → C.
 Decomposition rule: If A → BC, then A → B and A → C. The rule says that we can
remove attributes from the right-hand side of a dependency; applying this rule
repeatedly can decompose the functional dependency A → BC into the set of
dependencies A → B and A → C.
 Union rule: If A → B and A → C hold, then A → BC. The union rule is just the
opposite of the decomposition rule: we can combine the set of dependencies
(A → B, A → C) into the single functional dependency A → BC.
 Pseudotransitivity rule: If A → B and CB → D hold, then AC → D.
Non-loss Decomposition
Decomposition is used to remove redundancy, anomalies and inconsistency from a
database by dividing a table into multiple tables.
There are two types of decomposition:
1. Non-loss or lossless decomposition
2. Lossy decomposition
Lossless decomposition
A decomposition is said to be lossless if it is feasible to reconstruct the relation R from
the decomposed tables using joins.
Let us consider an example
<EmployeeInfo>
EMPID  ENAME  EMPAGE  COMPANYNAME  DEPTID    DEPTNAME
E001   JACOB  28      ACCENTURE    DEPT0021  TESTING
E002   HENRY  32      ACCENTURE    DEPT0025  DEVELOPMENT
E003   TOM    33      INFOSYS      DEPT0030  SUPPORT

Decompose the above table into two tables, <EMPDETAILS> and <DEPTDETAILS>, as follows:

<EMPDETAILS>
EMPID  ENAME  EMPAGE  COMPANYNAME
E001   JACOB  28      ACCENTURE
E002   HENRY  32      ACCENTURE
E003   TOM    33      INFOSYS

<DEPTDETAILS>
EMPID  DEPTID    DEPTNAME
E001   DEPT0021  TESTING
E002   DEPT0025  DEVELOPMENT
E003   DEPT0030  SUPPORT

If we then perform a natural join on the above two relations, it is possible to get the
original relation back. Such a decomposition is said to be lossless.
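A sketch of the rejoin in SQL (table names as above):

SELECT *
FROM EMPDETAILS NATURAL JOIN DEPTDETAILS;   -- joins on the common column EMPID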
Lossy Decomposition
As the name suggests, when a relation is decomposed into two or more relation
schemas, the loss of information is unavoidable when the original relation is retrieved.
Let us consider an example:
<EmployeeInfo>
EMPID  ENAME  EMPAGE  COMPANYNAME  DEPTID    DEPTNAME
E001   JACOB  28      ACCENTURE    DEPT0021  TESTING
E002   HENRY  32      ACCENTURE    DEPT0025  DEVELOPMENT
E003   TOM    33      INFOSYS      DEPT0030  SUPPORT

Decompose the above table into two tables, <EMPDETAILS> and <DEPTDETAILS>, as follows:

<EMPDETAILS>
EMPID  ENAME  EMPAGE  COMPANYNAME
E001   JACOB  28      ACCENTURE
E002   HENRY  32      ACCENTURE
E003   TOM    33      INFOSYS

<DEPTDETAILS>
DEPTID    DEPTNAME
DEPT0021  TESTING
DEPT0025  DEVELOPMENT
DEPT0030  SUPPORT

It is not possible to reconstruct the original relation by joining the above tables, since
EMPID is not part of <DEPTDETAILS>. Therefore the above decomposition is lossy.
First ,Second ,Third Normal Form
Normalization
 Normalization is a process of organizing the data in a database to avoid data
redundancy.
 Redundancy causes several anomalies, such as the insertion anomaly, deletion
anomaly and update anomaly.
Example: Suppose a manufacturing company stores the employee details in a table
named employee that has four attributes: emp_id for storing the employee's id,
emp_name for storing the employee's name, emp_address for storing the employee's
address and emp_dept for storing the details of the department in which the employee
works.

Empid  Empname  Empaddress  Empdept
120    Reshma   Mumbai      D001
120    Reshma   Mumbai      D002
124    Swetha   Delhi       D880
160    Ruth     Chennai     D990
160    Ruth     Chennai     D005
Update anomaly: In the above table we have two rows for employee Reshma, as she
belongs to two departments of the company. If we want to update the address of
Reshma, we have to update it in both rows, or the data will become
inconsistent.
Insert anomaly: Suppose a new employee joins the company who is under training
and currently not assigned to any department; we would not be able to insert the
data into the table if the empdept field doesn't allow nulls.
Delete anomaly: Suppose at some point the company closes the department
D880; deleting the rows having empdept as D880 would also delete the
information of employee Swetha, since she is assigned only to this department.
First normal form (1NF)
 1NF is a relation in which the intersection of each row and column contains one and
only one value. That means the attribute values permitted by 1NF are single atomic
values.
 Example: Suppose a company wants to store the names and contact details of its
employees. It creates a table that looks like this:

Empid  Empname  Empaddress  Empmobile
101    Henna    Delhi       8912312390
102    John     Mumbai      8812121212, 9900012222
103    Ruth     Chennai     7778881212
104    Lessa    Bangalore   9990000123, 8123450987

 Two employees (John and Lessa) have two mobile numbers, so the
company stored them in the same field.
To make the table comply with 1NF we should have the data like this:
Empid  Empname  Empaddress  Empmobile
101    Henna    Delhi       8912312390
102    John     Mumbai      8812121212
102    John     Mumbai      9900012222
103    Ruth     Chennai     7778881212
104    Lessa    Bangalore   9990000123
104    Lessa    Bangalore   8123450987

Second normal form (2NF)


 A table is said to be in 2NF if both of the following conditions hold:
 The table is in 1NF (first normal form)
 No non-prime attribute is dependent on a proper subset of any
candidate key of the table
 An attribute that is not part of any candidate key is known as a non-prime
attribute.
 Example: Suppose a college wants to store the data of instructors and the
subjects they teach. They create a table that looks like this. Since an instructor
can teach more than one subject, the table can have multiple rows for the same
instructor.

InstructorId  Subject                      InstructorAge
111           Data Structures              38
111           Database Management Systems  38
222           Data Analytics               35
333           Cloud Computing              40
333           Data Mining                  40

 The table is in 1NF because each attribute has atomic values. However, it is
not in 2NF because the non-prime attribute InstructorAge is dependent on
InstructorId alone, which is a proper subset of the candidate key
{InstructorId, Subject}. This violates the rule for 2NF, which says "no non-prime
attribute may be dependent on a proper subset of any candidate key of the table".
 To make the table comply with 2NF we can break it into two tables like this:
Instructor details table:
InstructorId  InstructorAge
111           38
222           35
333           40

Instructor subject table:

InstructorId  Subject
111           Data Structures
111           Database Management Systems
222           Data Analytics
333           Cloud Computing
333           Data Mining

Third Normal form (3NF)


 A table design is said to be in 3NF if both of the following conditions hold:
 The table must be in 2NF
 Transitive functional dependencies of non-prime attributes on any
super key should be removed
 An attribute that is not part of any candidate key is known as a non-prime
attribute.
Example: Suppose a company wants to store the complete address of each employee.
They create a table named employee_details that looks like this:

 Super keys: {emp_id}, {emp_id, emp_name}, {emp_id, emp_name,
emp_zip} ... and so on

Empid  Empname  Empzip  Empstate  Empcity  Empdistrict
1001   John     282005  UP        Agra     Dayal Bagh
1002   Ajeet    222008  TN        Chennai  M-City
1006   Lora     282007  TN        Chennai  Urrapakkam
1101   Lilly    292008  UK        Pauri    Bhagwan
1201   Steve    222999  MP        Gwalior  Ratan

Candidate keys: {emp_id}
Non-prime attributes: all attributes except emp_id are non-prime, as they are not part
of any candidate key.

 Here, emp_state, emp_city and emp_district depend on emp_zip, and
emp_zip depends on emp_id. That makes the non-prime attributes (emp_state,
emp_city and emp_district) transitively dependent on the super key (emp_id). This
violates the rule of 3NF.
 To make this table comply with 3NF we have to break it into two
tables to remove the transitive dependency:

employee table:
Empid  Empname  Empzip
1001   John     282005
1002   Ajeet    222008
1006   Lora     282007
1101   Lilly    292008
1201   Steve    222999

employee_zip table:
Empzip  Empstate  Empcity  Empdistrict
282005  UP        Agra     Dayal Bagh
222008  TN        Chennai  M-City
282007  TN        Chennai  Urrapakkam
292008  UK        Pauri    Bhagwan
222999  MP        Gwalior  Ratan
Boyce Codd Normal Form
BCNF: It is an advanced version of 3NF, which is why it is also referred to as 3.5NF.
BCNF is stricter than 3NF. A table complies with BCNF if it is in 3NF and, for
every functional dependency X → Y, X is a super key of the table.
Example: Suppose there is a company wherein employees work in more than one
department. They store the data like this:

Empid  Empnationality  Empdept                       DeptNo  No of employees
1001   Indian          Production and planning       D001    200
1001   Indian          Stores                        D001    250
1002   American        Design and technical support  D134    100
1002   American        Purchasing department         D134    600

Functional dependencies in the table above:

emp_id → emp_nationality
emp_dept → {dept_type, dept_no_of_emp}
 Candidate key: {emp_id, emp_dept}
 The table is not in BCNF, as neither emp_id nor emp_dept alone is a key.
 To make the table comply with BCNF we can break it into three tables like
this:
 emp_nationality table:
Empid  Empnationality
1001   Indian
1002   American

emp_dept table:

Empdept                       DeptNo  No of employees
Production and planning       D001    200
Stores                        D001    250
Design and technical support  D134    100
Purchasing department         D134    600
emp_dept_mapping table:

Empid  Empdept
1001   Production and planning
1001   Stores
1002   Design and technical support
1002   Purchasing department

 Functional dependencies:
emp_id → emp_nationality
emp_dept → {dept_type, dept_no_of_emp}
 Candidate keys:
For the first table: emp_id
For the second table: emp_dept
For the third table: {emp_id, emp_dept}
This is now in BCNF, as in both of the functional dependencies the left-hand side is a
key.
Multi-valued Dependencies and Fourth Normal Form
Fourth Normal Form (4NF)
• Fourth normal form eliminates independent multi-valued relationships
between columns.
• To be in fourth normal form,
– a relation must first be in Boyce-Codd normal form;
– a given relation may not contain more than one multi-valued
attribute.
Example 1 (Not in 4NF)
Scheme: {Manager, Child, Employee}
1. Primary key: {Manager, Child, Employee}
2. Each manager can have more than one child
3. Each manager can supervise more than one employee
4. 4NF violated

Manager  Child   Employee
Godson   David   Jessy
John     Beulah  Mary
John     NULL    Ganesh

4NF – Decomposition (steps to reach 4NF)

1. Move the two multi-valued relations to separate tables
2. Identify a primary key for each of the new entities

Old scheme: {Manager, Child, Employee}

New scheme: {Manager, Child}
New scheme: {Manager, Employee}

Manager  Child
Godson   David
John     Beulah

Manager  Employee
Godson   Jessy
John     Mary
John     Ganesh

Fifth Normal Form (5NF)
Fifth normal form is satisfied when all tables are broken into as many tables as
possible in order to avoid redundancy. Once it is in fifth normal form it cannot be
broken into smaller relations without changing the facts or the meaning.
BCNF: A relational schema R is considered to be in Boyce–Codd normal
form (BCNF) if and only if the relation is in 3NF and, for every one of its dependencies
X → Y, one of the following conditions holds true:
 X → Y is a trivial functional dependency (i.e., Y is a subset of
X)
 X is a super key for the schema. That means that if Y is a prime attribute,
then X cannot be a non-prime attribute.
Example 1:
Let us consider the following student enrolment table for subjects:

SRegisterNo  Subject Name     Instructor
101          Cloud Computing  Anita
101          Python           Rajesh
102          HTML             Anita
103          PHP              David
104          Python           Joel
Here one student can register for multiple subjects. For example, the student with register
number 101 can enrol for Cloud Computing as well as Python. Similarly, multiple
instructors can teach the same subject: for example, the subject Python can be
taught by both of the instructors Rajesh and Joel.
The above table holds the following dependencies:
(SRegisterNo, Subject) → Instructor
Instructor → Subject
Here Instructor is not a super key; moreover, the non-prime attribute Instructor
determines the prime attribute Subject. This is not allowed in BCNF.
Hence the above table is not in BCNF.
Join Dependencies and Fifth Normal Form
 A relation is in 5NF if it is in 4NF, does not contain any join dependency, and
all joins are lossless.
 5NF is satisfied when all the tables are broken into as many tables as possible in
order to avoid redundancy.
 5NF is also known as project-join normal form (PJ/NF).
Example

SUBJECT    LECTURER  SEMESTER
Computer   Anshika   Semester 1
Computer   John      Semester 1
Math       John      Semester 1
Math       Akash     Semester 2
Chemistry  Praveen   Semester 1

In the above table, John takes both Computer and Math classes for Semester 1 but
doesn't take a Math class for Semester 2. In this case, a combination of all these fields
is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject or
who will be taking that subject, so we leave Lecturer and Subject as NULL. But all
three columns together act as a primary key, so we can't leave the other two columns
blank.
So to bring the above table into 5NF, we can decompose it into three relations P1, P2
and P3:
P1
SEMESTER    SUBJECT
Semester 1  Computer
Semester 1  Math
Semester 1  Chemistry
Semester 2  Math

P2
SUBJECT    LECTURER
Computer   Anshika
Computer   John
Math       John
Math       Akash
Chemistry  Praveen

P3
SEMESTER    LECTURER
Semester 1  Anshika
Semester 1  John
Semester 1  Praveen
Semester 2  Akash
UNIT 4 TRANSACTION PROCESSING AND CONCURRENCY
CONTROL
Transaction Concepts
A transaction is a logical unit of work that represents real-world events of an
organisation or enterprise, whereas concurrency control is the management of
concurrent transaction execution. Transaction processing systems execute database
transactions with large databases and hundreds of concurrent users; examples include
railway and air reservation systems, banking systems, credit card processing, stock
market monitoring, and supermarket inventory and checkouts.
TRANSACTION:
A transaction is a logical unit of work of database processing that includes one or
more database access operations.
A transaction can be defined as an action or series of actions carried out by a
single user or application program to perform operations for accessing the contents of
the database. The operations can include retrieval (read), insertion (write), deletion
and modification. A transaction must be either completed or aborted.
It can either be embedded within an application program or be specified
interactively via a high-level query language such as SQL. Its execution preserves
the consistency of the database.
Each transaction should access shared data without interfering with the other
transactions, and whenever a transaction successfully completes its execution, its
effect should be permanent.
This basic abstraction frees the database application programmer from the following
concerns :
 Inconsistencies caused by conflicting updates from concurrent users.
 Partially completed transactions in the event of systems failure.
 User-directed undoing of transactions.
A transaction is a sequence of READ and WRITE actions that are grouped together to
form a database access. A transaction may consist of a simple SELECT operation to
generate a list of table contents, or it may consist of a series of related UPDATE
command sequences.
A transaction can include the following basic database access operations:

Operation  Description
Retrieve   To retrieve data stored in a database
Insert     To store new data in the database
Delete     To delete existing data from the database
Update     To modify existing data in the database
Commit     To save the work done permanently
Rollback   To undo the work done

A transaction that changes the contents of the database must alter the database from
one consistent state to another. A consistent database state is one in which all data
integrity constraints are satisfied.
To ensure database consistency, every transaction must begin with the database in a
known consistent state.
ACID properties
PROPERTIES
Atomicity: either all operations of the transaction are properly executed in the
database or none are.
Consistency: execution of a transaction in isolation, with no other transaction
executing concurrently, preserves the consistency of the database. In the fund-transfer
example below, consistency means that the sum of A and B is unchanged by the
execution of the transaction; without consistency, money could be created or destroyed
by the transaction.
Isolation: even though multiple transactions execute concurrently, the system
guarantees that for every pair of transactions Ti, Tj,
* either Tj started execution after Ti finished,
(or)
* Tj finished execution before Ti started.
Durability: after a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
Requirements of ACID Property:
Let us consider a transaction Ti in which A (Ajay) wants to send ₹1000 to his friend B (Shiva):
Ti: read(A)
    A := A - 1000
    write(A)
    read(B)
    B := B + 1000
    write(B)
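The same transfer expressed in SQL (a sketch; the account table and its owner and balance columns are illustrative assumptions):

UPDATE account SET balance = balance - 1000 WHERE owner = 'Ajay';
UPDATE account SET balance = balance + 1000 WHERE owner = 'Shiva';
COMMIT;   -- both updates become permanent together; a failure before COMMIT rolls both back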
Atomicity:
Suppose a system failure happens after write(A) but before write(B); then the values of
A and B are A = ₹4000 and B = ₹1000, and the system has destroyed ₹1000 as a result
of this failure.
If atomicity is present, all actions of the transaction are reflected in the database, or
none are.
The basic idea behind ensuring atomicity is that the system keeps track of the old values
of any data on which the transaction performs a write.
If the transaction does not complete its execution, the database system restores the old
values to make it appear as if the transaction never executed.
Ensuring atomicity is the responsibility of the system itself; it is handled by the
transaction-management component.
Consistency:
The consistency requirement here is that the sum of A and B be unchanged by the
execution of the transaction.
Without the consistency requirement, money could be created or destroyed by the
transaction.
The programmer who codes the transaction is responsible for maintaining its
consistency.
Isolation:
If several transactions are executed concurrently, their operations may interleave in
some undesirable way resulting in an inconsistent state.
To avoid the problem of concurrency transaction is to execute serially in any order
that is one after other.
Durability:
Durability guarantees that once a transaction completes successfully, all the update
that it carried out on the database persist, even if there is a system failure after the
transaction completes execution.
Failure of system may result loss of data in main memory but data written in disk are
never lost. By writing the updates carried out by transaction is in to disk is sufficient
to enable the database to reconstruct the updates when the database system is restarted
after failure
Transaction state:
A transaction must be in one of the following states:
Active:
The initial state; the transaction stays in this state while it is executing.
Partially committed:
After the final statement has been executed.
Failed:
After the discovery that normal execution can no longer proceed.
Aborted:
After the transaction has been rolled back and the database has been restored to its
state prior to the start of the transaction.
Committed:
After successful completion.
Concurrency Control and Need for Concurrency
What is Concurrency Control?
Concurrency Control in Database Management System is a procedure of managing
simultaneous operations without conflicting with each other. It ensures that Database
transactions are performed concurrently and accurately to produce correct results
without violating data integrity of the respective Database.
Concurrent access is quite easy if all users are just reading data. There is no way they
can interfere with one another. Though for any practical Database, it would have a
mix of READ and WRITE operations and hence the concurrency is a challenge.
DBMS Concurrency Control is used to address such conflicts, which mostly occur
with a multi-user system. Therefore, Concurrency Control is the most important
element for proper functioning of a Database Management System where two or more
database transactions are executed simultaneously, which require access to the same
data.
Potential problems of Concurrency
Here are some issues you are likely to face while using the DBMS
concurrency control method:
 Lost updates occur when multiple transactions select the same row and update
the row based on the value selected.
 Uncommitted dependency issues occur when a second transaction selects a row
that has been updated, but not yet committed, by another transaction (a dirty read).
 Non-repeatable reads occur when a second transaction accesses the
same row several times and reads different data each time.
 Incorrect summary issues occur when one transaction takes a summary over the
values of all the instances of a repeated data item while a second transaction updates
a few instances of that specific data item. In that situation, the resulting summary
does not reflect a correct result.
Why use a concurrency control method?
Reasons for using a concurrency control method in DBMS:
 To apply isolation through mutual exclusion between conflicting transactions
 To resolve read-write and write-write conflict issues
 To preserve database consistency during concurrent execution
 The system needs to control the interaction among the concurrent transactions;
this control is achieved using concurrency-control schemes
 Concurrency control helps to ensure serializability
Example
Assume that two people go to electronic kiosks at the same time to buy a movie
ticket for the same movie and the same show time.
However, there is only one seat left for the movie show in that particular theatre.
Without concurrency control in DBMS, it is possible that both moviegoers will end up
purchasing a ticket. However, a concurrency control method does not allow this to
happen. Both moviegoers can still access the information written in the movie seating
database, but concurrency control provides a ticket only to the buyer who has
completed the transaction process first.
Locking protocols and Two phase locking
One way to ensure serializability is to ensure that data items are accessed in a mutually
exclusive manner; that is, while one transaction is accessing a data item, no other
transaction can modify that data item.
The most common method to implement this requirement is to allow a transaction to
access a data item only if it is currently holding a lock on that item. A transaction
acquires a lock before accessing a data item and releases the lock when it has finished
with it. While a transaction holds a lock, no other transaction is allowed to
acquire a conflicting lock on that item and must therefore wait for the lock to be released.
There are two modes in which a data item may be locked:
 Shared-mode lock (S): the transaction can read the data item but can't perform a
write operation.
 Exclusive-mode lock (X): the transaction can perform both read and write operations
on the data item.
The concurrency-control manager is responsible for granting locks to transactions.
It checks the compatibility matrix shown below; if the requested lock is compatible
with the locks currently held by other transactions, the manager grants the lock.

     S  X
S    T  F
X    F  F

Figure: Lock compatibility matrix

Shared mode is compatible with shared mode but not with exclusive mode. At any
time several shared-mode locks can be held on a particular data item, whereas an
exclusive-mode lock request has to wait until the currently held shared-mode locks are
released.
A transaction requests a shared-mode lock by executing Lock-S(Q).
A transaction requests an exclusive-mode lock by executing Lock-X(Q).
To illustrate a lock-based protocol, let us consider our example: in transaction T1,
Ajay (A) wants to send ₹1000 to his friend Shiva (B). The corresponding locked version
of T1 would be
Note: the initial balances of Ajay's and Shiva's accounts are A = ₹5000 and
B = ₹1000.
T1: Lock-X(A)
    Read(A)
    A := A - 1000
    Write(A)
    Unlock(A)
    Lock-X(B)
    B := B + 1000
    Write(B)
    Unlock(B)

T1                    T2
Lock-X(A)
Read(A)
A := A - 1000
Write(A)
Unlock(A)
                      Lock-S(A)
                      Read(A)
                      Unlock(A)
                      Lock-S(B)
                      Read(B)
                      Unlock(B)
                      Display(A+B)
Lock-X(B)
B := B + 1000
Write(B)
Unlock(B)
In the above schedule, transaction T2 displays ₹5000, which is inconsistent (the sum of
Ajay's and Shiva's accounts should be ₹6000). We lost ₹1000 because transaction T1
unlocked data item A too early.
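Two-phase locking prevents this anomaly: every transaction acquires all of its locks before releasing any (a growing phase followed by a shrinking phase). A sketch of T1 rewritten to follow 2PL:

T1: Lock-X(A)
    Read(A)
    A := A - 1000
    Write(A)
    Lock-X(B)        -- still in the growing phase; no lock released yet
    B := B + 1000
    Write(B)
    Unlock(A)        -- shrinking phase begins only after all locks are held
    Unlock(B)

Now T2 cannot read A between T1's two updates, so it can no longer display the inconsistent sum.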
Deadlock and Transaction Recovery
Deadlock can be defined as “A system is in deadlock state if there exists a set of
transactions such that every transaction in the set is waiting for another transaction in
the set”
There are four conditions for a deadlock
 Mutual exclusive condition: There must be at least one resource that cannot be
used by more than one transaction at a time
 Hold and wait condition: One transaction holding a data item can be requested by
another transaction in the system
 No preemption condition: A data item cannot be taken from a transaction, only
the transaction can release a data item that is being held by it.
 Circular wait condition: A condition where one transaction is waiting for a data
item is being held by second transaction and second transaction is waiting for
third transaction and so on and the last transaction is waiting for the first
transaction. Thus makes circular chain of waiting.
Deadlock can be handled using two techniques: deadlock prevention, and deadlock
detection and recovery.
 Deadlock Prevention: this technique ensures that the system will never enter a
deadlock state. It requires that each transaction locks all needed data items
before it begins its execution.
Disadvantages of the deadlock prevention scheme:
 It is often hard to predict, before the transaction begins, which data items need
to be locked.
 Data item utilization may be very low, since many data items might be locked
but remain unused for a long time.
There are two timestamp-based schemes used for deadlock prevention:
i. Wait-Die (non-preemptive)
When a transaction T1 requests a data item currently held by
transaction T2, T1 is allowed to wait only if it has a timestamp smaller
(older) than that of T2. Otherwise, T1 is rolled back (dies).
ii. Wound-Wait (preemptive)
When a transaction T1 requests a data item currently held by
transaction T2, T1 is allowed to wait only if it has a timestamp larger
(younger) than that of T2. Otherwise, T2 is rolled back (T1 wounds T2).
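The two decision rules can be captured in a few lines of Python (a sketch; a smaller
timestamp means the transaction started earlier, i.e. is older):

    def wait_die(ts_requester, ts_holder):
        """Wait-Die: an older requester waits; a younger one dies."""
        return "wait" if ts_requester < ts_holder else "rollback requester"

    def wound_wait(ts_requester, ts_holder):
        """Wound-Wait: a younger requester waits; an older one wounds the holder."""
        return "wait" if ts_requester > ts_holder else "rollback holder"

    # T1 (timestamp 5) requests an item held by T2 (timestamp 9):
    print(wait_die(5, 9))    # wait            -- T1 is older, so it may wait
    print(wound_wait(5, 9))  # rollback holder -- T1 is older, so it wounds T2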
Advantages:
1. Easy to implement.
2. Works well if the transactions are short.
 Deadlock Detection and Recovery: here the system allows deadlocks to occur,
detects them, and then recovers. Recovery requires three actions.
1. Selection of a victim transaction: we must determine which transaction to
roll back to break the deadlock. We should roll back the transaction that
incurs the minimum cost.
Many factors may determine the cost of a rollback:
a. The number of data items used by the transaction
b. The number of data items the transaction still needs to complete its
execution
c. The number of transactions involved in the rollback
2. Rollback
Once we have decided to roll back a particular transaction, we must determine
how far it should be rolled back.
There are two techniques:
1. Total rollback – abort the transaction and then restart it.
2. Partial rollback – it is more effective to roll the transaction back only
as far as necessary to break the deadlock.
3. Starvation
The same transaction may always be picked as a victim. As a result, this
transaction never completes its designated task; this is starvation. The
solution is to include the number of rollbacks in the cost factor.
Save Points and Isolation Levels
In order to maintain consistency, a database follows the ACID properties. Among
these four properties (Atomicity, Consistency, Isolation and Durability), Isolation
determines how the effects of a transaction are visible to other users and systems.
It means that a transaction should execute as if it were the only transaction
accessing the resources of the database system.
Isolation levels define the degree to which a transaction must be isolated from the
data modifications made by any other transaction in the database system. Transaction
isolation levels are defined in terms of the following phenomena:
 Dirty Read – a dirty read occurs when a transaction reads data that has not
yet been committed. For example, say transaction T1 updates a row and leaves
the change uncommitted; meanwhile, transaction T2 reads the updated row. If T1
then rolls back, T2 has read data that is considered never to have existed.
 Non-Repeatable Read – a non-repeatable read occurs when a transaction
reads the same row twice and gets a different value each time. For example,
suppose transaction T1 reads some data. Due to concurrency, another transaction
T2 updates the same data and commits. If T1 now rereads the data, it retrieves
a different value.
 Phantom Read – a phantom read occurs when the same query is executed twice
but the two executions retrieve different sets of rows. For example, suppose
transaction T1 retrieves a set of rows that satisfy some search criteria.
Transaction T2 then inserts new rows that match T1's search criteria. If T1
re-executes the statement that reads the rows, it gets a different set of rows
this time.
Based on these phenomena, the SQL standard defines four isolation levels:
1. Read Uncommitted – the lowest isolation level. At this level, one
transaction may read changes made by other transactions that are not yet
committed, thereby allowing dirty reads. Transactions are not isolated from
each other.
2. Read Committed – this isolation level guarantees that any data read was
committed at the moment it was read, so it does not allow dirty reads. The
transaction holds a read or write lock on the current row, preventing other
transactions from reading, updating or deleting it.
3. Repeatable Read – a more restrictive level. The transaction holds read
locks on all rows it references and write locks on all rows it inserts,
updates, or deletes. Since other transactions cannot read, update or delete
these rows, non-repeatable reads are avoided (phantom reads remain possible).
4. Serializable – the highest isolation level. A serializable execution is one
in which concurrently executing transactions appear to execute serially, so
all three phenomena are prevented.
The table below depicts the relationship between the isolation levels and the read
phenomena:

Isolation Level      Dirty Read     Non-Repeatable Read   Phantom Read
Read Uncommitted     Possible       Possible              Possible
Read Committed       Not possible   Possible              Possible
Repeatable Read      Not possible   Not possible          Possible
Serializable         Not possible   Not possible          Not possible
Note that "anomaly serializable" is not the same as serializable: being free of all
three phenomena is necessary, but not sufficient, for a schedule to be truly
serializable.
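For concreteness, the isolation level can be set per transaction with standard SQL.
Below is a minimal Python sketch using a DB-API connection (the connect() string is
a placeholder and assumes a reachable PostgreSQL server with the psycopg2 driver;
the SET TRANSACTION syntax shown is the SQL-standard form):

    import psycopg2  # assumes PostgreSQL and the psycopg2 driver are available

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()

    # The isolation level must be set before the transaction's first statement.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT salary FROM instructor WHERE salary < 75000")
    rows = cur.fetchall()
    conn.commit()  # ends the transaction; the next one uses the default level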
SQL Facilities for Concurrency and Recovery
Two Phase Locking Protocol
One protocol that ensures serializability is the two phase locking protocol. This
protocol requires that each transaction issue lock and unlock requests in two phases:
1. Growing phase – during this phase a transaction may obtain locks but may not
release any lock.
2. Shrinking phase – a transaction may release locks but may not obtain any new
locks.
A transaction is initially in the growing phase and acquires locks as needed. Once
the transaction releases a lock, it enters the shrinking phase and can issue no
further lock requests.
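A transaction can enforce the two-phase rule on itself with a little bookkeeping.
The sketch below (our own illustrative names, not a real DBMS API) rejects any lock
request issued after the first unlock:

    class TwoPhaseTransaction:
        """Tracks the growing/shrinking phases of a single transaction."""

        def __init__(self):
            self.shrinking = False   # becomes True at the first unlock
            self.held = set()

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violation: lock requested after an unlock")
            self.held.add(item)

        def unlock(self, item):
            self.shrinking = True    # entering the shrinking phase
            self.held.discard(item)

    t = TwoPhaseTransaction()
    t.lock("A"); t.lock("B")         # growing phase
    t.unlock("A")                    # shrinking phase begins
    try:
        t.lock("C")                  # not allowed under 2PL
    except RuntimeError as e:
        print(e)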
Limitations of the Two Phase Locking Protocol
 Deadlock – deadlock is not prevented by the two phase locking protocol.
 Cascading rollback – cascading rollback may occur under the two phase
locking protocol.
Let us consider the following schedule:
T1              T2              T3
Lock-X(A)
read(A)
Lock-X(B)
read(B)
write(A)
Unlock(A)
                Lock-X(A)
                read(A)
                write(A)
                Unlock(A)
                                Lock-X(A)
                                read(A)
The value of data item A written by T1 is read by T2, and the value written by T2 is
read by T3. A failure of transaction T1 therefore forces the failure of T2 and T3 as
well; hence we need to perform a cascading rollback of T2 and T3.
Types of Two Phase Locking
 Strict two phase locking protocol – this protocol requires not only that the
locking be two phase, but also that all exclusive mode locks taken by a
transaction be held until that transaction commits.
 Rigorous two phase locking protocol – this is stricter than the strict two
phase locking protocol: all locks (not only exclusive mode locks) must be
held until the transaction commits.
Lock Conversion
There are two modes of conversion
 Upgrade Conversion-Conversion from shared to exclusive mode
is known as upgrade conversion. It can take place only in the
growing phase
 Downgrade Conversion- Conversion from exclusive mode to
shared mode is known as downgrade conversion. It can take
place only in the shrinking phase
Time Stamp Based Protocol
This protocol ensures serializability by selecting an ordering among the
transactions in advance using timestamps. Each transaction Ti is assigned a
timestamp:
Ti ---> TS(Ti)
The timestamp is assigned by the database system before transaction Ti starts its
execution. If Ti has been assigned a timestamp and a new transaction Tj then enters
the system, we have
TS(Ti) < TS(Tj)
and Ti appears before Tj in the serialization order.
In addition, each data item Q is associated with two timestamps:
W-timestamp(Q) – denotes the largest timestamp of any transaction that executed
write(Q) successfully.
R-timestamp(Q) – denotes the largest timestamp of any transaction that executed
read(Q) successfully.
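The notes stop at the two data-item timestamps; for completeness, the standard
timestamp-ordering tests built on them can be sketched in Python as follows (this is
the textbook rule, not code from any particular system):

    class DataItem:
        def __init__(self):
            self.r_ts = 0   # R-timestamp(Q): largest TS of a successful read
            self.w_ts = 0   # W-timestamp(Q): largest TS of a successful write
            self.value = None

    def tso_read(q, ts):
        """Ti reads Q: rejected if a younger transaction already wrote Q."""
        if ts < q.w_ts:
            return "rollback"           # Ti would read an overwritten value
        q.r_ts = max(q.r_ts, ts)
        return q.value

    def tso_write(q, ts, value):
        """Ti writes Q: rejected if a younger transaction already read or wrote Q."""
        if ts < q.r_ts or ts < q.w_ts:
            return "rollback"
        q.value, q.w_ts = value, ts
        return "ok"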
UNIT 5 IMPLEMENTATION TECHNIQUES
RAID
RAID (redundant array of independent disks) is a way of storing the same data in
different places on multiple hard disks or solid-state drives to protect data in the case
of a drive failure. There are different RAID levels, however, and not all have the goal
of providing redundancy.
How RAID works
RAID works by placing data on multiple disks and allowing input/output (I/O)
operations to overlap in a balanced way, improving performance. Although using
multiple disks lowers the array's overall mean time between failures (MTBF), storing
data redundantly increases its fault tolerance.
RAID arrays appear to the operating system (OS) as a single logical drive. RAID
employs the techniques of disk mirroring and disk striping. Mirroring copies
identical data onto more than one drive. Striping partitions each drive's storage
space into units and spreads data over multiple disk drives; the units range from
a sector (512 bytes) up to several megabytes. The stripes of all the disks are
interleaved and addressed in order.
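The interleaving can be made concrete with a small Python sketch of RAID 0 style
striping (illustrative only; real controllers work in firmware with configurable
stripe sizes):

    def stripe_location(logical_block, n_disks):
        """Map a logical block number to (disk, stripe) under simple striping:
        consecutive blocks go to consecutive disks, round-robin."""
        return logical_block % n_disks, logical_block // n_disks

    # With 4 disks, blocks 0..7 are interleaved across all drives:
    for b in range(8):
        disk, stripe = stripe_location(b, 4)
        print(f"block {b} -> disk {disk}, stripe {stripe}")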
[Figure: a five-tray RAID hard drive]
Disk mirroring and disk striping can also be combined in a RAID array.
In a single-user system where large records are stored, the stripes are typically set up
to be small (perhaps 512 bytes) so that a single record spans all the disks and can be
accessed quickly by reading all the disks at the same time.
In a multi-user system, better performance requires a stripe wide enough to hold the
typical or maximum size record, allowing overlapped disk I/O across drives.
RAID controller
A RAID controller is a device used to manage hard disk drives in a storage array. It
can be used as a level of abstraction between the OS and the physical disks,
presenting groups of disks as logical units. Using a RAID controller can improve
performance and help protect data in case of a crash.
A RAID controller may be hardware- or software-based. In a hardware-based
RAID product, a physical controller manages the array. The controller can also be
designed to support drive formats such as SATA and SCSI. A physical RAID
controller can also be built into a server's motherboard.
With software-based RAID, the controller uses the resources of the hardware system,
such as the central processor and memory. While it performs the same functions as a
hardware-based RAID controller, software-based RAID controllers may not enable as
much of a performance boost and can affect the performance of other applications on
the server.
If a software-based RAID implementation isn't compatible with a system's boot-up
process, and hardware-based RAID controllers are too costly, firmware or
driver-based RAID is another potential option.
Firmware-based RAID controller chips are located on the motherboard, and all
operations are performed by the CPU, similar to software-based RAID. However,
with firmware, the RAID system is only implemented at the beginning of the boot
process. Once the OS has loaded, the controller driver takes over RAID functionality.
A firmware RAID controller isn't as pricy as a hardware option, but it puts more strain
on the computer's CPU. Firmware-based RAID is also called hardware-assisted
software RAID, hybrid model RAID and fake RAID.
File Organization and Organization of Records in files
A database is mapped into a number of different files that are maintained by the
underlying operating system. A file is logically organized as a sequence of records.
These records are mapped onto disk blocks.
There are two approaches to mapping a database to files:
1. Fixed-Length Records
2. Variable-Length Records
Fixed Length Records
In the fixed-length record approach, the database uses several different files, and
each file stores records of only one fixed length.
Let us consider a file of instructor records for our University database.
type instructor=record
ID varchar (5);
name varchar (20);
deptname varchar (20);
salary numeric (8,2);
end
Assume that each character occupies 1 byte and that numeric (8,2) occupies 8 bytes.
One record then takes:
ID requires 5 × 1 = 5 bytes
Name requires 20 × 1 = 20 bytes
Deptname requires 20 × 1 = 20 bytes
Salary requires 8 bytes
Thus our instructor record is 53 bytes long. We can then use the first 53 bytes for
the first record, the next 53 bytes for the second record, and so on (see figure).
ID     Name             Deptname                                    Salary
8101   Raguv.S          Computer Science Engineering                8500
8105   Vanthi.B         Computer Science Engineering                85000
8104   Meenakhi.R       Information Technology                      85000
8103   Komala James     Electronics and Communication Engineering   85000
8100   SenthilKumar.M   Computer Science Engineering                75000
8200   Saravanan.R      Mechanical Department                       70000
8400   Gayathri.M       Electronics and Communication Engineering   80000
8500   Anuradha.S       Electrical and Electronics Engineering      60000
Figure: File containing instructor records
There are two problems with this simple approach:
1. Unless the block size happens to be a multiple of 53, some records will cross
block boundaries; that is, part of the record will be stored in one block and
part in another. Reading or writing such a record would then require two block
accesses.
2. It is difficult to delete a record from this structure. The space occupied by
the deleted record must be filled with some other record of the file.
The first problem can be avoided by storing only as many whole records as the block
can hold and leaving any remaining bytes of each block unused.
When a record is deleted, we can move the record that came after it into the space
occupied by the deleted record, and so on, until every record after the deleted one
has moved ahead.
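The 53-byte layout can be tried out with Python's struct module (a sketch; the '='
prefix disables alignment padding so the size matches the byte counts above, and
numeric(8,2) is approximated here by an 8-byte double):

    import struct

    # 5-byte ID, 20-byte name, 20-byte deptname, 8-byte salary = 53 bytes
    RECORD = struct.Struct("=5s20s20sd")
    assert RECORD.size == 53

    def pack_instructor(id_, name, dept, salary):
        # struct pads short byte strings with NULs up to the field width
        return RECORD.pack(id_.encode(), name.encode(), dept.encode(), salary)

    rec = pack_instructor("8101", "Raguv.S", "CSE", 8500.0)

    # Record i lives at byte offset 53 * i in the file:
    with open("instructor.dat", "wb") as f:
        f.write(rec)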
Indexing and Hashing
An index is a data structure used to quickly locate and access data in the database
in an effective manner. Indexes are essential to speed up search operations on files
of records.
There are two types of indices:
Ordered indices – indexing is based on a sorted ordering of the values.
Hash indices – indexing is based on a uniform distribution of values across a range
of buckets; the address of a bucket is obtained using a hash function.
Ordered Indices
Primary Index and Clustered Index
An index that is created on a relation using the primary key of that relation is
termed a primary index.
A clustered index is an index whose search key also defines the sequential order of
the file.
There are two types of ordered indices:
Dense Index:
In a dense index, an index entry appears for every search key value. The index
record contains the search key value and a pointer to the actual record.
Sparse Index:
In a sparse index, an index entry appears for only some of the search key values.
To locate a record, we find the index entry with the largest search key value that
is less than or equal to the search key value for which we are looking.
For example, a sparse index may contain the entries Andhra Pradesh, Kerala and
Tamil Nadu, each pointing to the first record with that search key in the
sequential file below:

State            Capital              Population
Andhra Pradesh   Hyderabad            356,152
Assam            Dispur               26,655,528
Kerala           Thiruvananthapuram   31,844,374
Tamil Nadu       Chennai              62,405,67
B+ Tree Index Files
A B+ tree index is a balanced tree in which every path from the root to a leaf is of
the same length. Only the leaf nodes have direct pointers to the records of the
relation.
Structure of a B+ Tree
A node contains up to n-1 search key values K1, K2, ..., Kn-1 and up to n pointers
P1, P2, ..., Pn. The search key values within a leaf node are kept in sorted order.
Algorithm for the Insertion Operation
Step 1: Find the correct leaf node L (see the sketch after this list).
Step 2: Enter the data into leaf node L.
 If L has enough space, we are done.
 Else we must split L (into L and a new node L1):
Allocate a new node
Redistribute the entries evenly
Copy up the middle key
Insert an index entry pointing to L1 into the parent of L
Step 3: To split an index node, redistribute the entries evenly and push up the
middle key. Apply step 3 recursively.
Step 4: The tree can grow due to node splitting; a root split increases the height.
As an exercise, consider inserting the following keys into a B+ tree of order 3:
5, 3, 4, 9, 7, 15, 14, 21, 22, 23
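Step 1 of the algorithm, descending from the root to the correct leaf, can be
sketched as follows (a minimal sketch assuming a node object with keys, children and
is_leaf attributes of our own invention):

    import bisect

    def find_leaf(node, key):
        """Descend from the root to the leaf that should contain `key`.
        At each internal node, follow the child whose key range covers `key`."""
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, key)  # first key > search key
            node = node.children[i]
        return node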
B tree Index files
B-Tree is a self-balancing search tree. In most of the other self-balancing search
trees (like AVL and Red-Black Trees), it is assumed that everything is in main
memory. To understand the use of B-Trees, we must think of the huge amount of
data that cannot fit in main memory. When the number of keys is high, the data is
read from disk in the form of blocks. Disk access time is very high compared to the
main memory access time. The main idea of using B-Trees is to reduce the number
of disk accesses. Most of the tree operations (search, insert, delete, max, min,
etc.) require O(h) disk accesses, where h is the height of the tree. A B-tree is a
"fat" tree: its height is kept low by putting the maximum possible number of keys in
each node. Generally, the B-tree node size is kept equal to the disk block size.
Since the height of the B-tree is low, the total number of disk accesses for most
operations is reduced significantly compared to balanced binary search trees such as
AVL trees and red-black trees.
Time Complexity of B-Tree:

Sr. No.   Algorithm   Time Complexity
1         Search      O(log n)
2         Insert      O(log n)
3         Delete      O(log n)

Here n is the total number of elements in the B-tree.
Properties of a B-Tree:
1. All leaves are at the same level.
2. A B-Tree is defined by the term minimum degree 't'. The value of t depends
upon the disk block size.
3. Every node except the root must contain at least t-1 keys. The root may
contain a minimum of 1 key.
4. All nodes (including the root) may contain at most 2t-1 keys.
5. The number of children of a node is equal to the number of keys in it plus 1.
6. All keys of a node are sorted in increasing order. The child between two keys
k1 and k2 contains all keys in the range from k1 to k2.
7. A B-Tree grows and shrinks from the root, unlike a binary search tree, which
grows and shrinks downward.
8. Like other balanced binary search trees, the time complexity to search, insert
and delete is O(log n).
Following is an example of a B-Tree of minimum degree 5. Note that in practical
B-Trees, the value of the minimum degree is much greater than 5.
[Figure: example B-Tree of minimum degree 5]
We can see in the diagram that all the leaf nodes are at the same level and that
every non-leaf node has no empty subtree and has one key fewer than its number of
children.
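Search in a B-Tree follows directly from properties 5 and 6; here is a minimal
sketch (nodes represented by a small class of our own, not tied to any library):

    class BTreeNode:
        def __init__(self, keys, children=None):
            self.keys = keys                 # sorted list of keys
            self.children = children or []   # empty for a leaf

    def btree_search(node, key):
        """Return True if `key` is in the subtree rooted at `node`.
        Costs one node (disk block) per level, i.e. O(h) accesses."""
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:                # reached a leaf: not found
            return False
        return btree_search(node.children[i], key)

    root = BTreeNode([10, 20],
                     [BTreeNode([5]), BTreeNode([15]), BTreeNode([25])])
    print(btree_search(root, 15))  # True
    print(btree_search(root, 8))   # False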
Static Hashing and Dynamic Hashing
As the file size grows in size performance will degrade in the sequential file
organization mechanism. Moreover, it is impossible to search all the index values
through all its level and then reach the destination to retrieve the desired data.
Hashing is an effective technique to calculate the direct location of a data record on the
disk without using index structure.
Hashing uses hash functions with search keys as parameters to generate the address of a
data record.
There are two main types of hashing techniques
1. Static Hashing
2. Dynamic Hashing

Static Hashing

In static hashing, when a search-key value is provided, the hash function always
computes the same address. For example, if mod-10 hash function is used, then it shall
generate only 10values. That means bucket size is fixed it is 10. The output address
shall always be same for that function.
Let us consider we want to create the table to show the list of available database
management tools.
With the use of hash function h(k)= k mod 10 the tools are stored in the bucket as
follows:
Where k is search key values here serial number and 10 is the bucket size
The tools table would be:

S.No   Database Tool
10     Oracle
11     IBM DB2
12     SQL Server
13     MySQL
14     MongoDB
15     Neo4j
16     MS Access
17     SQLite
18     PostgreSQL
19     CouchDB
The hash function maps each key to a bucket: h(10) = 10 mod 10 = 0, so Oracle is
stored in bucket 0. Similarly, h(19) = 19 mod 10 = 9, so CouchDB is stored in
bucket 9.
Handling of Bucket Overflows
Bucket overflow can occur for several reasons:
 Insufficient buckets – the number of buckets is too small for the number of
records.
 Skew – some buckets are assigned more records than others, so a bucket can
overflow even while other buckets still have space. This is called skew.
Overflow is commonly handled by chaining: when a bucket is full, an overflow bucket
is allocated and linked to it.
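A minimal Python sketch of static hashing with overflow chaining follows
(list-based buckets stand in for disk buckets; the names and the extra sample tool
are ours, for illustration only):

    N_BUCKETS = 10
    BUCKET_CAPACITY = 2                       # records per primary bucket

    buckets = [[] for _ in range(N_BUCKETS)]  # one list per bucket

    def h(k):
        return k % N_BUCKETS                  # the mod-10 hash function above

    def insert(k, value):
        bucket = buckets[h(k)]
        bucket.append((k, value))             # chaining: the list simply grows
        if len(bucket) > BUCKET_CAPACITY:
            print(f"bucket {h(k)} overflowed; records chained to overflow page")

    insert(10, "Oracle")    # h(10) = 0 -> bucket 0
    insert(19, "CouchDB")   # h(19) = 9 -> bucket 9
    insert(20, "DuckDB")    # hypothetical extra key: h(20) = 0 -> bucket 0 (skew)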
Query optimization using Heuristics and Cost Estimation
Distributed Database
Query processing refers to the range of activities involved in extracting data from
a database. The basic steps involved in query processing are:
 Parsing and translation
 Optimization
 Evaluation
Parsing and translation
In this step the translator translates the query into a usable internal form based
on relational algebra expressions.
Let us consider the following query:
Select salary from instructor where salary < 75000;
During this step the syntax of the query is checked; the corrected query can then be
sent for further processing.
Optimization
In this step query evaluation plans are prepared from all the equivalent relational
algebra expressions. The query cost of each evaluation plan is calculated using
information from the database, such as the number of tuples in each relation, the
size of the tuples, etc.
Evaluation
The query evaluation engine takes a query evaluation plan, executes that plan and
returns the answers to the query.
For example, the query Select salary from instructor where salary < 75000; can be
translated into either of the following expressions:
σ salary<75000 (π salary (Instructor))
π salary (σ salary<75000 (Instructor))
To evaluate a query, in addition to the relational algebra expression, we need to
give instructions specifying how to evaluate each operation.
Evaluation Primitive
A relational algebra operation annotated with instructions on how to evaluate it is
called an evaluation primitive. A sequence of primitive operations that can be used
to evaluate a query is a query evaluation plan. For example:

π salary
   |
σ salary<75000
   |
Instructor

The query evaluation engine takes the query evaluation plan, executes the plan and
returns the answers to the query.
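To see that the two expressions above are equivalent plans, here is a small Python
sketch evaluating both orders over an in-memory relation (toy data of our own; a
real optimizer would also estimate the cost of each order):

    instructor = [
        {"ID": "8101", "name": "Raguv.S", "salary": 8500},
        {"ID": "8100", "name": "SenthilKumar.M", "salary": 75000},
        {"ID": "8200", "name": "Saravanan.R", "salary": 70000},
    ]

    def select(rows, pred):            # sigma: keep rows satisfying the predicate
        return [r for r in rows if pred(r)]

    def project(rows, attrs):          # pi: keep only the named attributes
        return [{a: r[a] for a in attrs} for r in rows]

    # Plan 1: sigma(pi(instructor))      Plan 2: pi(sigma(instructor))
    plan1 = select(project(instructor, ["salary"]),
                   lambda r: r["salary"] < 75000)
    plan2 = project(select(instructor, lambda r: r["salary"] < 75000),
                    ["salary"])
    assert plan1 == plan2              # same answer; the optimizer picks the cheaper
    print(plan2)                       # [{'salary': 8500}, {'salary': 70000}]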