DBMS Notes
Database System Applications: A Historical Perspective, File Systems versus a DBMS, the Data
Introduction to Database Design: Database Design and ER Diagrams, Entities, Attributes, and
Entity Sets, Relationships and Relationship Sets, Additional Features of the ER Model,
Conceptual Design With the ER Model
Data:
Data refers to known facts or figures that have implicit meaning. It can also be defined as the
representation of facts, concepts, or instructions in a formal manner suitable for understanding
and processing. Data can be represented using alphabets (A-Z, a-z), digits (0-9), and special
characters (+, -, #, $, etc.), e.g., 25, "ajit", etc.
Information:
Information is processed data on which decisions and actions are based. Information can be
defined as organized and classified data that provides meaningful values. E.g.: "The age of Ravi
is 25".
The traditional file-oriented approach to information processing has, for each application, a
separate master file and its own set of personal files. In the file-oriented approach the programs
are dependent on the files, and the files in turn become dependent upon the programs. This
approach has several disadvantages:
1) Data redundancy and inconsistency: The same information may be written in several files.
This redundancy leads to higher storage and access costs. It may also lead to data inconsistency;
that is, the various copies of the same data may no longer agree. For example, a changed
customer address may be reflected in one file but not elsewhere in the system.
2) Difficulty in accessing data: The conventional file processing system does not allow data to
be retrieved in a convenient and efficient manner according to user choice.
3) Data isolation: Because data are scattered in various files, and files may be in different
formats, writing new application programs to retrieve the appropriate data is difficult.
4) Integrity Problems: Developers enforce data validation in the system by adding appropriate
code in the various application programs. However when new constraints are added, it is difficult
to change the programs to enforce them.
5) Concurrent access: In the file processing system it is not possible to access the same file for
transactions at the same time.
6) Security problems: The file processing system provides no security to protect the data from
unauthorized user access.
Database:
A database is a collection of interrelated data. The data can be stored in the form of tables. A
database can be of any size and varying complexity.
Example: a customer database consists of fields such as cname, cno, and ccity.
Popular DBMS software products include:
MySQL
Oracle
IBM DB2
SQLite
Advantages of DBMS:
Concurrent Access: A DBMS schedules concurrent access to the data in such a manner that users
can think of the data as being accessed by only one user at a time.
Data Security: The DBA, who has the ultimate responsibility for the data in the DBMS, can
ensure that proper access procedures are followed, including proper authentication schemes for
access to the DBMS and additional checks before permitting access to sensitive data.
Backup and Recovery: The DBMS provides backup and recovery subsystems which create
automatic backups of the data, protect against hardware and software failures, and restore the
data if required.
Data Independence: In a database system, the database management system provides the
interface between the application programs and the data. When changes are made to the data
representation, the metadata maintained by the DBMS is changed, but the DBMS continues to
provide the data to application programs in the previously used way. The DBMS handles the task
of transforming the data wherever necessary.
Applications of DBMS:
Library Management System: There are thousands of books in a library, so it is very difficult
to keep a record of all the books in a copy or register. A DBMS is therefore used to maintain all
the information related to book issue dates, the name of the book, the author, and the availability
of the book.
Banking: For customer information, account activities, payments, deposits, loans, etc.
Airlines: For reservations and schedule information.
Universities: For storing information about all instructors, students, departments, course
offerings, colleges, and grades.
Telecommunication: It helps to keep call records, monthly bills, maintaining balances, etc.
Manufacturing: It is used for the management of supply chain and for tracking production of
items.
For example, a distribution centre should keep track of the product units supplied into the
centre as well as the products delivered out from the distribution centre each day.
HR Management: For information about employees, salaries, payroll, deduction, generation of
paychecks, etc.
Online Shopping:
Online shopping has become a big trend these days. No one wants to go to shops and waste
time; everyone wants to shop from home. All of these products are added and sold only with
the help of a DBMS. Purchase information, invoices, bills and payments are all handled with
the help of a DBMS.
History of DBMS
1960’s:
From the earliest days of computers, storing and manipulating data have been a major application
focus. The first general-purpose DBMS, called the Integrated Data Store, was designed by
Charles Bachman at General Electric in the early 1960s. It formed the basis for the network data
model, which was later standardized by the Conference on Data Systems Languages
(CODASYL).
In the late 1960s, IBM developed the Information Management System (IMS) DBMS, used even
today in many major installations. IMS formed the basis for an alternative data representation
framework called the hierarchical data model. The SABRE system for making airline
reservations was jointly developed by American Airlines and IBM at about the same time, and it
allowed people to access the same data through a computer network.
In the 1970s, Edgar Codd, at IBM, proposed a new data representation framework called the
relational model. It sparked the rapid development of several DBMSs based on the relational
model.
In the 1980s, the relational model consolidated its position as the dominant DBMS paradigm, and
database systems continued to gain widespread use. The SQL query language for relational
databases was developed at IBM. SQL was standardized in the late 1980s, and the current
standard, SQL:1999, was adopted by ANSI and ISO.
In the late 1980s and 1990s, advances were made in many areas of database systems. Several
vendors (e.g., IBM DB2, Oracle 8) extended their systems with the ability to store new data types
such as images and text. Specialized systems were developed by numerous vendors for creating
data warehouses, consolidating data from several databases, and carrying out specialized analysis.
The Internet age has perhaps influenced data models even more. Data models were developed
using object-oriented programming features and embedded in markup and scripting languages
such as Hyper Text Markup Language (HTML) for queries. With huge amounts of data available
online, the DBMS is gaining even more importance as more data is brought online and made ever
more accessible through computer networking.
In the early years of computing, punched cards were used in unit record machines for input, data
storage, and processing of that data. Both data and computer programs were entered offline. This
input method is similar to voting machines nowadays. It was fast for entering and retrieving data,
but not for manipulating or editing it.
After that era, there was the introduction of the file type entries for data, then the DBMS as
hierarchical, network, and relational.
Components of DBMS
The database management system can be divided into five major components:
1. Hardware
2. Software
3. Data
4. Procedures
5. Database Access Language
Hardware:
When we say Hardware, we mean computer, hard disks, I/O channels for data, and any other
physical component involved before any data is successfully stored into the memory.
When we run Oracle or MySQL on our personal computer, then our computer's Hard Disk, our
Keyboard using which we type in all the commands, our computer's RAM, ROM all become a
part of the DBMS hardware.
Software
This is the main component, as this is the program which controls everything. The DBMS
software is more like a wrapper around the physical database, which provides us with an easy-to-
use interface to store, access and update data.
The DBMS software is capable of understanding the Database Access Language and interpreting
it into actual database commands to execute them on the database.
Data:
Data is the resource for which the DBMS was designed. The motive behind the creation of the
DBMS was to store and utilize data.
In a typical database, both the user's saved data and metadata are stored.
Metadata is data about the data. This is information stored by the DBMS to better understand
the data stored in it.
For example: When I store my Name in a database, the DBMS will store when the name was
stored in the database, what is the size of the name, is it stored as related data to some other data,
or is it independent, all this information is metadata.
Procedures:
Procedures refer to general instructions for using a database management system. This includes
procedures to set up and install a DBMS, to log in and log out of the DBMS software, to manage
databases, to take backups, to generate reports, etc.
Database Access Language
Database Access Language is a simple language designed to write commands to access, insert,
update and delete data stored in any database.
A user can write commands in the Database Access Language and submit them to the DBMS for
execution, where they are translated and executed. Using the access language, users can create
new databases and tables, insert data, fetch stored data, update data, and delete data.
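For illustration, a few typical commands in SQL, the most common database access language (the Students table and its columns here are only an assumed example):
CREATE TABLE Students (SID VARCHAR(10), NAME CHAR(10), AGE INTEGER);
INSERT INTO Students (SID, NAME, AGE) VALUES ('101', 'Ramesh', 19);
SELECT NAME, AGE FROM Students WHERE AGE > 18;
UPDATE Students SET AGE = 20 WHERE SID = '101';
DELETE FROM Students WHERE SID = '101';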
Users
Database Administrators: The Database Administrator, or DBA, is the one who manages the
complete database management system. The DBA takes care of the security of the DBMS, its
availability, managing license keys, managing user accounts and access, etc.
End Users: These days all modern applications, web or mobile, store user data. How do you
think they do it? Applications are programmed in such a way that they collect user data and
store it in DBMS systems running on their servers. End users are the ones who store, retrieve,
update and delete data.
Data Model:
A data model describes the data, the data semantics, and the consistency constraints of the data.
It provides the conceptual tools for describing the design of a database at each level of data
abstraction. Data models are used to describe the structure of the database.
Hierarchical Model
In a hierarchical database model, data is organized in a tree-like structure. The hierarchy starts
from the root data and expands like a tree, adding child nodes to the parent nodes. Data is
represented using parent-child relationships. In a hierarchical DBMS a parent may have many
children, but a child has only one parent.
This model efficiently describes many real-world relationships like index of a book, recipes etc.
In the hierarchical model, data is organized into a tree-like structure with a one-to-many
relationship between two different types of data; for example, one department can have many
courses, many professors and, of course, many students.
Network Model
The network database model allows each child to have multiple parents. It helps address the
need to model more complex relationships, such as the orders/parts many-to-many relationship.
In this model, entities are organized in a graph which can be accessed through several paths.
Relational model
The relational DBMS is the most widely used DBMS model because it is one of the easiest. This
model is based on normalizing data into the rows and columns of tables. In the relational model,
data is stored in fixed structures (relations) and manipulated using SQL.
Ex: an instance of a student relation (shown in the relational model unit below) contains four
attributes (sid, name, age, gpa) and five tuples (records or rows).
Object-Oriented Model
In the object-oriented model, data is stored in the form of objects. The structure, called a class,
defines the data within it. This model defines a database as a collection of objects which store
both data member values and operations.
Entity Relationship Model: The entity-relationship (E-R) data model uses a collection of basic
objects, called entities, and relationships among these objects. An entity is a “thing” or “object”
in the real world that is distinguishable from other objects. A relationship is an association
among several entities. For example, a depositor relationship associates a customer with each
account that she has
A database system is a collection of interrelated data and a set of programs that allow users to
access and modify these data. A major purpose of a database system is to provide users with an
abstract view of the data. That is, the system hides certain details of how the data are stored and
maintained.
Data Abstraction
Database systems are made up of complex data structures. To ease user interaction with the
database, the developers hide internal, irrelevant details from users. This process of hiding
irrelevant details from the user is called data abstraction.
A data definition language (DDL) is used to define the external and conceptual schemas
Physical level: This is the lowest level of data abstraction. The physical schema summarizes
how the relations described in the conceptual schema are actually stored on secondary storage
devices such as disks and tapes.
Access methods such as sequential or random access, and file organization methods such as B+
trees and hashing, are used at this level.
Suppose we need to store the details of an employee: the blocks of storage and the amount of
memory used for this purpose are kept hidden from the user.
Conceptual level: The next-higher level of abstraction describes what data are stored in the
database, and what relationships exist among those data. The conceptual schema describes all
relations that are stored in the database. In our sample university database, these relations contain
information about entities, such as students and faculty, and about relationships, such as
students’ enrollment in courses.
View level:
External schema allows data access to be customized (and authorized) at the level of individual
users or groups of users. Any given database has exactly one conceptual schema and one
physical schema because it has just one set of stored relations, but it may have several external
schemas. Each external schema consists of a collection of one or more views and relations from
the conceptual schema. A view is conceptually a relation, but the records in a view are not stored
in the DBMS, they are computed using a definition for the view, in terms of relations stored in
the DBMS.
For example, we might want to allow students to find out the names of faculty members teaching
courses, as well as course enrollments.
This can be done by defining the following view: Courseinfo (cid: string, fname: string,
enrollment: integer)
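A sketch of how such a view might be defined in SQL, assuming hypothetical base relations Courses(cid, fid, enrollment) and Faculty(fid, fname) in the conceptual schema:
-- Courseinfo is computed from the assumed base relations; it is not stored
CREATE VIEW Courseinfo (cid, fname, enrollment) AS
SELECT C.cid, F.fname, C.enrollment
FROM Courses C, Faculty F
WHERE C.fid = F.fid;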
A user can treat a view just like a relation and ask questions about the records in the view. Even
though the records in the view are not stored explicitly, they are computed as needed. We did not
include Courseinfo in the conceptual schema because we can compute Courseinfo from the
relations in the conceptual schema, and to store it in addition would be redundant. Such
redundancy, in addition to the wasted space, could lead to inconsistencies.
Data Independence:
A very important advantage of using a DBMS is that it offers data independence. The ability to
modify the schema definition at one level without affecting the schema definition at the next
higher level is called data independence. Data independence is achieved through the use of the
three levels of data abstraction.
There are two levels of data independence; they are Physical data independence and Logical data
independence.
Physical data independence is the ability to modify the physical schema without causing
application programs to be rewritten. Modifications at the physical level are occasionally
necessary to improve performance. It means we change the physical storage/level without
affecting the conceptual or external view of the data.
Logical data independence is the ability to modify the logical schema without causing
application program to be rewritten. Modifications at the logical level are necessary whenever
the logical structure of the database is altered.
For example, suppose that the Faculty relation in our university database is replaced by the
following two relations:
Faculty public(fid: string, fname: string, office: integer)
Faculty private(fid: string, sal: real)
The Courseinfo view relation can be redefined in terms of Faculty public and Faculty private,
which together contain all the information in Faculty, so that a user who queries Courseinfo will
get the same answers as before.
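Continuing the earlier sketch (same assumed Courses relation; Faculty public is written as Faculty_public in SQL), the view could be redefined in a system that supports CREATE OR REPLACE VIEW (e.g., MySQL or Oracle) so that queries on Courseinfo keep working unchanged:
-- only the view definition changes; applications that query Courseinfo are unaffected
CREATE OR REPLACE VIEW Courseinfo (cid, fname, enrollment) AS
SELECT C.cid, FP.fname, C.enrollment
FROM Courses C, Faculty_public FP
WHERE C.fid = FP.fid;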
The design of a database at the physical level is called the physical schema; how the data is
stored in blocks of storage is described at this level.
The design of a database at the logical level is called the logical schema; programmers and
database administrators work at this level. At this level data can be described as certain types of
data records stored in data structures; however, the internal details, such as the implementation
of the data structures, are hidden at this level (they are available at the physical level).
Design of database at view level is called view schema. This generally describes end user
interaction with database systems.
DBMS Instance
Definition of instance: The data stored in database at a particular moment of time is called
instance of database. Database schema defines the variable declarations in tables that belong to a
particular database; the value of these variables at a moment of time is called the instance of that
database.
For example, let's say we have a single table Student in the database; today the table has 100
records, so today the instance of the database has 100 records. Let's say we are going to add
another 100 records to this table by tomorrow, so the instance of the database tomorrow will
have 200 records in the table. In short, the data stored in the database at a particular moment is
called the instance; it changes over time as we add or delete data from the database.
Database Architecture
The architecture of a database system is greatly influenced by the underlying computer system
on which the database runs. Database systems can be centralized, client-server, parallel
(multi-processor), or distributed. The architecture of a database application is divided into two
types: two-tier architecture and three-tier architecture.
Two-tier architecture: In Two-tier architecture the application resides at the client machine,
where it invokes database system functionality at the server machine through query language
statements. Application program interface standards like ODBC and JDBC are used for
interaction between the client and the server.
Example: Transaction performed by the bank employee after filling the deposit form by the
customer. Another example is railway reservation made by the employee in reservation counter.
Three-tier architecture: In three-tier architecture, the client machine acts as merely a front end
and does not contain any direct database calls. Instead, the client end communicates with an
application server, usually through a forms interface. The application server in turn
communicates with a database system to access data. The business logic of the application,
which says what actions to carry out under what conditions, is embedded in the application
server, instead of being distributed across multiple clients. Three-tier applications are more
appropriate for large applications, and for applications that run on the Worldwide Web.
Example: Amount transferred by the user through application program (Net banking) without
physically going to the bank. Another example is railway reservation made by the user from his
location without going to the reservation counter.
Three-tier architecture provides scalability when a larger number of clients or users interact with
the database.
Naive users are unsophisticated users who interact with the system by invoking one of the
application programs that have been written by the application programmer previously.
For example, a user who wishes to find her account balance over the World Wide Web. Such a
user may access a form, where she enters her account number. An application program at the
Web server then retrieves the account balance, using the given account number, and passes this
information back to the user.
The typical user interface for naive users is a forms interface, where the user can fill in
appropriate fields of the form. Naive users may also simply read reports generated from the
database.
Application programmers are computer professionals who write application programs.
Application programmers can choose from many tools to develop user interfaces. Rapid
application development (RAD) tools are tools that enable an application programmer to
construct forms and reports without writing a program.
There are also special types of programming languages that combine imperative control
structures (for example, for loops, while loops and if-then-else statements) with statements of the
data manipulation language. These languages, sometimes called fourth-generation languages,
often include special features to facilitate the generation of forms and the display of data on the
screen. Most major commercial database systems include a fourth generation language.
Sophisticated users interact with the system without writing programs. Instead, they form their
requests in a database query language. They submit each such query to a query processor, whose
function is to break down DML statements into instructions that the storage manager
understands. Analysts who submit queries to explore data in the database fall in this category.
Online analytical processing (OLAP) tools simplify analysts’ tasks by letting them view
summaries of data in different ways. For instance, an analyst can see total sales by region (for
example, North, South, East, and West), or by product, or by a combination of region and
product (that is, total sales of each product in each region). The tools also permit the analyst to
select specific regions, look at data in more detail (for example, sales by city within a region) or
look at the data in less detail (for example, aggregate products together by category). Another
class of tools for analysts is data mining tools, which help them find certain kinds of patterns in
data.
Specialized users are sophisticated users who write specialized database applications that do not
fit into the traditional data processing framework. Among these applications are computer-aided
design systems, knowledge-base and expert systems, systems that store data with complex data
types (for example, graphics data and audio data), and environment-modeling systems.
Database Administrator
One of the main reasons for using DBMSs is to have central control of both the data and the
programs that access those data. A person who has such central control over the system is called
a database administrator (DBA).
Schema definition: The DBA creates the original database schema by executing a set of data
definition statements in the DDL.
Schema and physical-organization modification: The DBA carries out changes to the schema
and physical organization to reflect the changing needs of the organization, or to alter the
physical organization to improve performance.
Granting of authorization for data access: By granting different types of authorization, the
database administrator can regulate which parts of the database various users can access. The
authorization information is kept in a special system structure that the database system consults
whenever someone attempts to access the data in the system.
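For example, in SQL the DBA can grant and revoke privileges; the table and user names below are assumed for illustration:
GRANT SELECT ON Students TO clerk1;
GRANT SELECT, UPDATE ON Students TO registrar;
REVOKE SELECT ON Students FROM clerk1;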
Routine maintenance:
Periodically backing up the database, either onto tapes or onto remote servers, to prevent
loss of data in case of disasters such as flooding.
Ensuring that enough free disk space is available for normal operations, and upgrading
disk space as required.
Monitoring jobs running on the database and ensuring that performance is not degraded
by very expensive tasks submitted by some users.
Storage Manager
A storage manager is a program module that provides the interface between the low level data
stored in the database and the application programs and queries submitted to the system. The
storage manager is responsible for the interaction with the file manager. The raw data are stored
on the disk using the file system, which is usually provided by a conventional operating system.
The storage manager translates the various DML statements into low-level file-system
commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in
the database.
Authorization and integrity manager: Tests for the satisfaction of integrity constraints and
checks the authority of users to access data.
Transaction manager: Ensures that the database remains in a consistent (correct) state despite
system failures, and that concurrent transaction executions proceed without conflicting.
File manager: Manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
Buffer manager: It is responsible for fetching data from disk storage into main memory, and
deciding what data to cache in main memory. The buffer manager is a critical part of the
database system, since it enables the database to handle data sizes that are much larger than the
size of main memory.
Query evaluation engine: Executes the low-level instructions generated by the DML compiler.
Introduction to Database Design
Database Design: The database design process can be divided into six steps.
Requirements Analysis: The very first step in designing a database application is to understand
what data is to be stored in the database, what applications must be built on top of it, and what
operations are most frequent and subject to performance requirements. In other words, we must
find out what the users want from the database.
This is usually an informal process that involves discussions with user groups, a study of the
current operating environment and how it is expected to change, analysis of any available
documentation on existing applications that are expected to be replaced or complemented by the
database, and so on. Several methodologies have been proposed for organizing and presenting
the information gathered in this step, and some automated tools have been developed to support
this process.
Conceptual Database Design: The information gathered in the requirements analysis step is
used to develop a high-level description of the data to be stored in the database, along with the
constraints that are known to hold over this data. This step is often carried out using the ER
model, or a similar high-level data model, and is discussed in the rest of this chapter.
Logical Database Design: We must choose a DBMS to implement our database design, and
convert the conceptual database design into a database schema in the data model of the chosen
DBMS. We will only consider relational DBMSs, and therefore, the task in the logical design
step is to convert an ER schema into a relational database schema.
Schema Refinement: The fourth step in database design is to analyze the collection of relations
in our relational database schema to identify potential problems, and to refine it. In contrast to
the requirements analysis and conceptual design steps, which are essentially subjective, schema
refinement can be guided by some elegant and powerful theory.
Physical Database Design: In this step we must consider typical expected workloads that our
database must support and further refine the database design to ensure that it meets desired
performance criteria. This step may simply involve building indexes on some tables and
clustering some tables, or it may involve a substantial redesign of parts of the database schema
obtained from the earlier design steps.
Security Design: In this step, we identify different user groups and different roles played by
various users (e.g., the development team for a product, the customer support representatives, the
product manager). For each role and user group, we must identify the parts of the database that
they must be able to access and the parts of the database that they should not be allowed to
access, and take steps to ensure that they can access only the necessary parts.
Entity-Relationship Model:
An Entity–relationship model (ER model) describes the structure of a database with the help of
a diagram, which is known as Entity Relationship Diagram (ER Diagram). An ER model is a
design or blueprint of a database that can later be implemented as a database.
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects,
called entities, and relationships among these objects. It develops a conceptual design for the
database.
The E-R data model employs three basic notions: entity sets, relationship sets, and attributes.
For example, suppose we design a school database. In this database, the student will be an
entity with attributes like address, name, id, age, etc.
Fig: ER Model
Entity Sets:
An entity is a “thing” or “object” in the real world that is distinguishable from all other objects.
An entity has a set of properties, and the values for some set of properties may uniquely identify
an entity.
An entity set is a set of entities of the same type that share the same properties, or attributes. An
entity set is represented by a rectangle, and an attribute is represented by an oval.
For example, the Employees entity set could use name, social security number (ssn), and parking
lot (lot) as attributes.
Fig: Employees entity set with attributes name, ssn, and lot.
An entity that does not have its own key attribute, and is uniquely identified by the key attribute
of another entity set, is called a weak entity. It is represented using a double rectangle.
For example: the weak entity Transactions is identified using the ATM ID in ATM transactions.
The attributes have been simple; that is, they are not divided into subparts. Composite attributes,
on the other hand, can be divided into subparts (that is, other attributes).
The attributes in our examples all have a single value for a particular entity. For instance, the
CustID attribute for a specific customer entity refers to only one customer. Such attributes are
said to be single valued.
There may be instances where an attribute has a set of values for a specific entity. Consider a
customer entity set with the attribute Custphone. A customer may have more than one phone
number. This type of attribute is said to be multivalued. A multivalued attribute is modeled in
the ER diagram using a double ellipse.
Derived attribute: The value for this type of attribute can be derived from the values of other
related attributes or entities. Derived attribute is modeled in ER using dotted ellipse.
Suppose that the student entity set has an attribute age, which indicates the student's age. If the
student entity set also has an attribute date-of-birth, we can calculate age from date-of-birth;
hence age is a derived attribute.
Relationship Sets:
Consider the relationship set Works_In, in which each relationship indicates a department in
which an employee works. Note that several relationship sets might involve the same entity sets.
For example, we could also have a Manages relationship set involving Employees and
Departments.
A relationship can also have descriptive attributes. Descriptive attributes are used to record
information about the relationship, rather than about any one of the participating entities; for
example, we may wish to record that Attishoo works in the pharmacy department as of January
1991. This information is captured by adding an attribute, since, to Works_In.
Binary Relationship:
When two entity sets participate in a relationship, the relationship is called a binary relationship.
For example, Student is enrolled in Course.
Ternary Relationship:
When three entity sets participate in a relationship, the relationship is called a ternary
relationship.
The number of times an entity of an entity set participates in a relationship set is known as
cardinality. Cardinality can be of different types: one-to-one, one-to-many, many-to-one, and
many-to-many.
1. Key constraints:
A key is a minimal set of attribute values of an entity that uniquely identifies the entity.
Keys also help to identify relationships uniquely, and thus distinguish relationships from each
other.
For example, consider the relationship set called Manages between the Employees and
Departments entity sets, such that each department has at most one manager, although a single
employee is allowed to manage more than one department. The restriction that each department
has at most one manager is a key constraint. This constraint is represented in the ER diagram
using an arrow from Departments to Manages. Here ssn is the key attribute of the Employees
entity set, and did is the key attribute of the Departments entity set.
Fig: Manages relationship set between Employees (ssn, name, lot) and Departments (did, dname,
budget), with descriptive attribute since and an arrow from Departments to Manages showing the
key constraint.
Participation Constraint:
Participation Constraint is applied on the entity participating in the relationship set.
1. Total Participation – Each entity in the entity set must participate in the relationship. If
each student must enroll in a course, the participation of student will be total. Total
participation is shown by double line in ER diagram.
2. Partial Participation – The entity in the entity set may or may not participate in the
relationship. If some courses are not enrolled by any of the student, the participation of
course will be partial.
The diagram depicts the ‘Enrolled in’ relationship set with Student Entity set having total
participation and Course Entity set having partial participation.
Weak entities:
An entity that can be identified uniquely only by considering some of its attributes in conjunction
with the primary key of another entity is called a weak entity.
It must satisfy the following restrictions:
The owner entity set and the weak entity set must participate in a one-to-many relationship set.
This relationship set is called the identifying relationship set of the weak entity set. The weak
entity set must have total participation in the identifying relationship set.
For example, a Dependents entity can be identified uniquely only if we take the key of the
owning entity Employees. Here the participation of Dependents in the Policy relationship is
total; it indicates that each Dependents entity appears in at most one Policy relationship.
Fig: Weak entity set Dependents (pname, age) identified through the Policy relationship (with
attribute cost) with owner entity set Employees (ssn, name, lot).
Class Hierarchies:
Class hierarchies are used to classify the entities of an entity set into subclasses. A class
hierarchy is represented using the ISA ("is a") relationship and can be viewed in two ways.
Generalization: Generalization is the process of extracting common properties from a set of
entities and creating a generalized entity from them. It is a bottom-up approach in which two or
more entities can be generalized to a higher-level entity if they have some attributes in common.
For example,
STUDENT and FACULTY can be generalized to a higher level entity called PERSON as shown
in Figure 1. In this case, common attributes like P_NAME, P_ADD become part of higher entity
(PERSON) and specialized attributes like S_FEE become part of specialized entity (STUDENT).
Aggregation: Aggregation allows us to indicate that the relationship set participates in another
relationship set. An ER diagram is not capable of representing relationship between an entity and
a relationship which may be required in some scenarios. In those cases, a relationship with its
corresponding entities is aggregated into a higher level entity.
For Example, Employee working for a project may require some machinery. So, REQUIRE
relationship is needed between relationship WORKS_FOR and entity MACHINERY. Using
aggregation, WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is
aggregated into single entity and relationship REQUIRES is created between aggregated entity
and MACHINERY.
Example 2: suppose that we have an entity set called Projects, and each project entity is
sponsored by one or more departments. A department that sponsors a project might assign
employees to monitor the sponsorship.
Fig: Aggregation — Employees (ssn, name, lot) monitor (with attribute until) the Sponsors
relationship (since) between Projects (pid, started_on, pbudget) and Departments (did, dname,
budget).
Conceptual Design with the ER Model:
The main design choices are:
1. Should a concept be modeled as an entity set or as an attribute?
2. Should a concept be modeled as an entity set or as a relationship set?
3. What are the relationship sets and their participating entity sets? Should we use binary or
ternary relationships?
Sometimes it is not clear whether a property should be modeled as an attribute or an entity set.
If we have to record only one value, it can be modeled as an attribute. It should be modeled as
an entity set if we have to record more than one value, or if we want to capture the structure of
the property in the ER model.
For example, consider modeling the period during which an employee works for a department:
the Works_In relationship records the interval during which an employee works for a department.
Fig: Works_In relationship with from and to attributes, between Employees (ssn, name, lot) and
Departments (did, dname, budget).
Suppose an employee works in a department for more than one period. The problem is that we
want to record several values of the descriptive attributes for each instance of the Works_In
relationship. We can instead define Duration as an entity set associated with the Works_In
relationship.
Fig: Duration (from, to) modeled as an entity set associated with the Works_In relationship
between Employees (ssn, name, lot) and Departments (did, dname, budget).
Consider the relationship Manages; suppose each department manager is given a discretionary
budget (dbudget).
If the discretionary budget is a sum that covers all departments managed by an employee, then
every Manages relationship that involves a given employee will hold the same value in the
dbudget field, leading to redundant storage of the same information.
This problem can be addressed by adding a new entity set called Managers in an ISA hierarchy
with the Employees entity set, with dbudget as an attribute of Managers and since as an attribute
of the Manages relationship.
Fig: Managers (dbudget) as an ISA subclass of Employees (ssn, name, lot), managing
Departments (did, dname, budget) via Manages (since).
Consider an ER diagram in which an employee can own several policies, each policy can be
owned by several employees, each dependent can be covered by several policies, and a
Dependents entity must be deleted if the owning Employees entity is deleted.
Fig: Policies owned by Employees (Purchaser relationship) and covering Dependents
(Beneficiary relationship).
Using aggregation we can model the concept that a project can be sponsored by any number of
departments, a department can sponsor one or more projects, and each sponsorship is monitored
by one or more employees.
Fig: Aggregation — Employees monitor (until) the Sponsors relationship between Projects
(started_on) and Departments.
If we do not need to record the until attribute of Monitors, we might instead use a ternary
relationship; however, we then cannot express the constraint that each sponsorship is monitored
by at most one employee.
Fig: Sponsors as a ternary relationship among Employees, Projects (pid, started_on, pbudget),
and Departments (did, dname, budget).
ER Diagrams to practice:
ER diagram of Bank Management System
Introduction to the Relational Model: Integrity constraint over relations, enforcing integrity
constraints, querying relational data, logical data base design, introduction to views,
destroying/altering tables and views.
Relational Algebra, Tuple relational Calculus, Domain relational calculus
Relational Model:
The Relational Model was developed by E. F. Codd for IBM in 1970 to model data in the form
of relations or tables
Relational Model represents how data is stored in relational databases. A relational database
stores data in the form of Relations (Tables).
After designing the conceptual model of the database using ER, we need to convert the
conceptual model into the relational model, which can be implemented using any RDBMS, such
as SQL Server, MySQL, or Oracle.
RDBMS stands for Relational Database Management System. The RDBMS is the basis for SQL
and for all modern database systems like MySQL, SQL Server, IBM DB2, and Microsoft Access.
It is based on the relational model.
Ex: student relation consists of attributes (sid, name, age, phone)
A relation model can be represented as a relation or table, each table consists of rows and
columns.
Attribute: column of a relation is called as attribute or field.
Ex: S_id, name, age, phone.
Tuple: Each row of a table is called as record or tuple.
Domain: Each attribute has a domain; it specifies the type and range of values allowed to be
stored in the attribute.
Ex: Student (sid numeric(10), name char(10), age numeric(2), phone_no numeric(10))
The main construct for representing data in relational model is relation or table. A relation
consists of relation schema and relation instance. Relation instance is a table, that is set of tuples
or records, and relation schema describes the attribute or column name and domain of a relation.
Relation Schema: It specifies the relation's name, the name of each field (or column, or
attribute), and the domain of each field. A domain is referred to in a relation schema by the
domain name and has a set of associated values.
This says, for instance, that the field named sid has a domain named string. The set of values
associated with domain string is the set of all character strings.
Relation instance: Instance of a relation is a set of tuples, also called records, in which each
tuple has the same number of fields as the relation schema. A relation instance can be thought of
as a table in which each tuple is a row, and all rows have the same number of fields.
Ex: an instance of a student relation.
Sid Name Age Gpa
1201 Suresh 20 3.7
1202 Ramesh 21 3.6
1203 Maesh 19 3.6
1204 Naresh 20 3.8
1205 Haresh 21 3.7
The above relation contains four attributes sid, name, age,gpa and five tuples or records or rows.
Degree of a relation: the number of attributes of a relation is called the degree of the relation.
The degree of the above student relation is 4.
Cardinality of a relation: the number of tuples (records or rows) of a relation is called the
cardinality of the relation. The cardinality of the above student relation is 5.
Column: the set of values of an attribute.
Null value: an unknown or unavailable value is called a null value.
Key attribute: is an attribute which is used for unique identification of each row of a table.
Properties of a Relation:
1. The relation has a name that is distinct from all other relation names in the relational schema.
2. Each cell of relation should contain only one value.
3. Each attribute of a relation should be distinct.
4. Each tuple of relation should be distinct.
5. The order of attribute has no significance.
SQL: SQL (Structured Query Language) is the most widely used language for creating and
manipulating relations in a relational database. It was standardized by ANSI in 1986.
SQL provides database language statements to read, store and update the data in the database.
ALTER: The ALTER TABLE statement is used to add, delete, or modify columns, and to add
or drop various constraints on an existing table.
For example, a SQL statement to change the data type of the column named "DateOfBirth" in
the "Persons" table is shown below.
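One possible statement (the exact syntax and the target data type vary by RDBMS; DATE is assumed here):
-- MySQL syntax
ALTER TABLE Persons MODIFY COLUMN DateOfBirth DATE;
-- SQL Server syntax
ALTER TABLE Persons ALTER COLUMN DateOfBirth DATE;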
TRUNCATE: It is used to delete all the rows from the table and free the space
containing the table.
Syntax:
TRUNCATE TABLE table_name;
Example:
TRUNCATE TABLE EMPLOYEE;
INSERT: The INSERT statement is a SQL query. It is used to insert data into the row of
a table.
Syntax:
INSERT INTO TABLE_NAME
(col1, col2, col3,.... col N)
VALUES (value1, value2, value3, .... valueN);
Or
INSERT INTO TABLE_NAME
VALUES (value1, value2, value3, .... valueN);
For example:
INSERT INTO javatpoint (Author, Subject) VALUES ('Sonoo', 'DBMS');
UPDATE: This command is used to update or modify the value of a column in the table.
Syntax:
UPDATE table_name
SET column_name1 = value1, ..., column_nameN = valueN
[WHERE CONDITION];
For example:
UPDATE students
SET User_Name = 'Sonoo'
WHERE Student_Id = '3' ;
Select: The SQL SELECT command is used to fetch data from the MySQL database.
Syntax
SELECT field1, field2,...fieldN
FROM table_name1, table_name2...
[WHERE Clause]
[OFFSET M ][LIMIT N]
Ex:
SELECT * from tutorials_tbl
+-------------+----------------+-----------------+-----------------+
| tutorial_id | tutorial_title | tutorial_author | submission_date |
+-------------+----------------+-----------------+-----------------+
| 1 | Learn PHP | John Poul | 2007-05-21 |
| 2 | Learn MySQL | Abdul S | 2007-05-21 |
| 3 | JAVA Tutorial | Sanjay | 2007-05-21 |
+-------------+----------------+-----------------+-----------------+
Integrity constrains over relations:
An integrity constraint is a condition specified on a database schema and restricts the data that
can be stored in an instance of database.
If a database instance satisfies all the integrity constraints specified on the database schema,
then it is a legal instance. Integrity constraints are used to maintain the quality of the information.
DBMS enforces the integrity constraints to permit only the legal instances to be stored in the
database.
Integrity constraints are specified and enforced on database
1. When the DBA or end user defines a database schema.
2. When a database application is run, the DBMS checks for violations and disallows changes
to the data that violate the specified ICs.
For example no two students have the same id value.
Key constraints:
A key constraint is a statement that a certain minimal set of attributes of a relation uniquely
identifies a tuple of the relation. A set of attributes that uniquely identifies a tuple in a relation is
called a candidate key for the relation. A relation may have several candidate keys; the database
designer identifies a primary key from among the candidate keys.
Unique key: it allows only unique values to be stored in the database; it does not allow duplicate
values to be stored.
Example: CREATE TABLE Students (SID VARCHAR(10), NAME CHAR(10), AGE
INTEGER, EMAIL VARCHAR(20), UNIQUE (EMAIL));
SID NAME AGE EMAIL
101 Ramesh 19 ramesh@gmail.com
102 Suresh 20 sursresh@gmail.com
103 Mahesh 18 mahesh@gmail.com
104 Hareesh 19
EMAIL attribute is defined as unique key for the above relation; it may allow null values but
does not allow the duplicate values.
NOT NULL: is a key constraint; it does not allow NULL values to be stored in a relation.
Example: CREATE TABLE Students (SID VARCHAR(10), NAME CHAR(10) NOT NULL,
AGE INTEGER, EMAIL VARCHAR(20), UNIQUE (EMAIL), CONSTRAINT StudentsKey
PRIMARY KEY (SID));
SID NAME AGE EMAIL
101 Ramesh 19 ramesh@gmail.com
102 Suresh 20 sursresh@gmail.com
103 Mahesh 18 mahesh@gmail.com
104 Ramesh 19 Ramesh123@gmail.com
Primary key: is a key constraint that allows only unique, not-null values to be stored in a relation.
It will not allow duplicate or null values to be stored in the relation.
Example: CREATE TABLE Students (SID VARCHAR(10), NAME CHAR(10), AGE
INTEGER, EMAIL VARCHAR(20), UNIQUE (EMAIL), CONSTRAINT StudentsKey
PRIMARY KEY (SID));
SID NAME AGE EMAIL
101 Ramesh 19 ramesh@gmail.com
102 Suresh 20 sursresh@gmail.com
103 Mahesh 18 mahesh@gmail.com
104 Hareesh 19 Hareesh@gmail.com
SID is the primary key for the above relation; it does not allow the null values and duplicate
values.
Foreign key: a foreign key is used to link two tables or relations. The information stored in one
relation is linked to the information stored in another relation. If one of the relations is modified,
the other must be checked, and modified if necessary, to keep the data consistent. A key
constraint involving two relations is called a foreign key constraint. The foreign key in the
referencing relation must match the primary key of the referenced relation.
For example, in addition to Students we have another relation called Enrolled; to ensure that only
bona fide students can enroll in courses, any value appearing in the SID field of an instance of
Enrolled should also appear in the SID field of the Students relation.
Example: CREATE TABLE Students (SID VARCHAR(10), NAME CHAR(10), AGE
INTEGER, EMAIL VARCHAR(20), UNIQUE (EMAIL), CONSTRAINT StudentsKey
PRIMARY KEY (SID));
Ex: CREATE TABLE Enrolled (SID VARCHAR(10), CID CHAR(10), GRADE CHAR(10),
PRIMARY KEY (SID, CID),
FOREIGN KEY (SID) REFERENCES Students);
Fig: Employees (ssn, name, lot) and Departments related by the Manages relationship (since);
with the key constraint, this can be translated into the single relation Dept_Mgr below.
CREATE TABLE Dept_Mgr (ssn CHAR(11), did INTEGER, since DATE, dname
CHAR(10), budget REAL, PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES
Employees);
Relation Dept_Mgr:
ssn did since dname budget
Fig: Works_In relationship (since) and weak entity Dependents (pname, age) related to
Employees (ssn, name, lot) through Policy (cost).
Figure: Class Hierarchy — Hourly_Emps (hourly_wages, hours_worked) and Contract_Emps
(contract_id) as ISA subclasses of Employees (ssn, name, lot).
Employees (ssn, name, lot)
Hourly_Emps (hourly_wages, hours_worked, ssn)
Contract_Emps (contract_id, ssn)
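A possible SQL translation of this hierarchy (the column types are assumed; each subclass relation is keyed by ssn, which also references Employees):
CREATE TABLE Employees (ssn CHAR(11) PRIMARY KEY, name CHAR(30), lot INTEGER);
CREATE TABLE Hourly_Emps (
hourly_wages REAL,
hours_worked INTEGER,
ssn CHAR(11) PRIMARY KEY,
FOREIGN KEY (ssn) REFERENCES Employees (ssn));
CREATE TABLE Contract_Emps (
contract_id INTEGER,
ssn CHAR(11) PRIMARY KEY,
FOREIGN KEY (ssn) REFERENCES Employees (ssn));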
7. Translating an ER diagram with aggregation into relations:
Fig: Aggregation — Employees monitor (until) the Sponsors relationship (since) between
Projects (pid, started_on, pbudget) and Departments (did, dname, budget).
The Employees, Projects, and Departments entity sets and the Sponsors relationship set are
mapped into relations as follows: Projects(pid, started_on, pbudget), Employees(ssn, name, lot),
Sponsors(did, dname, budget, pid, since). For the Monitors relationship set we create a relation
with the following attributes: the key attribute of Employees (ssn), the key attributes of Sponsors
(pid, did), and the descriptive attribute of Monitors (until):
Monitors(ssn, pid, did, until)
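A possible SQL definition of Monitors (the column types are assumed, and it presumes the Sponsors relation is keyed on (pid, did)):
CREATE TABLE Monitors (
ssn CHAR(11),
pid INTEGER,
did INTEGER,
until DATE,
PRIMARY KEY (ssn, pid, did),
FOREIGN KEY (ssn) REFERENCES Employees (ssn),
FOREIGN KEY (pid, did) REFERENCES Sponsors (pid, did));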
The multivalued attribute is represented by a separate table:
It is not possible to represent it in a single column, so we should create a new table for a
multivalued attribute of the ER model.
Student (Student_ID, Name, DOB, Door, Street, City, State, Pin, Course_Id)
Attends(Student_ID, Course_Id)
Relational Model:
There are some points for converting the ER diagram to the table:
In the given ER diagram, LECTURE, STUDENT, SUBJECT and COURSE form individual
tables.
In the STUDENT entity, STUDENT_NAME and STUDENT_ID form the column of STUDENT
table. Similarly, COURSE_NAME and COURSE_ID form the column of COURSE table and so
on.
Querying Relational Data
A relational database query (query, for short) is a question about the data, and the
answer consists of a new relation containing the result. For example, we might want to
find all students younger than 20.
A query language is a specialized language for writing queries.
SQL is the most popular commercial query language for a relational DBMS. We can
retrieve rows corresponding to students who are younger than 20 with the following SQL
query:
SELECT *
FROM Students S
WHERE S.age < 20
The symbol * means that we retain all fields of selected tuples in the result. S is a variable
that takes on the value of each tuple in Students, one tuple after the other. The condition
S.age < 20 in the WHERE clause specifies that we want to select only tuples in which the
age field has a value less than 20.
SID NAME AGE EMAIL
101 Ramesh 19 ramesh@gmail.com
103 Mahesh 18 mahesh@gmail.com
104 Hareesh 19 Hareesh@gmail.com
Student relation:
sid name age marks mobileno
101 Ramesh 21 89 2222555222
102 Mahesh 22 98 8890333322
103 Hareesh 30 92 1233421123
104 Suresh 19 95 1234567321
Enrollment table
sid name
102 Mahesh
103 Hareesh
104 Suresh
AND: The AND operator displays a record if all the conditions separated by AND are TRUE.
OR: The OR operator displays a record if any of the conditions separated by OR is TRUE.
Example: find the students who got marks between 80 and 90 and whose age is less than 20 (see
the query sketch after the BETWEEN example below).
IN: The IN operator allows you to specify multiple values in a WHERE clause.
NOT: The NOT operator displays a record if the condition(s) is NOT TRUE.
BETWEEN: The BETWEEN operator selects values within a given range. The values can be
numbers, text, or dates.
The BETWEEN operator is inclusive: begin and end values are included.
Example:
SELECT * FROM Student WHERE Marks BETWEEN 70 AND 100;
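As a sketch of the AND example mentioned above (students whose marks are between 80 and 90 and whose age is less than 20), using the Student relation shown earlier:
SELECT * FROM Student WHERE (Marks BETWEEN 80 AND 90) AND Age < 20;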
Enrollment
sid cid grade
101 C1 A
102 C2 B
We can also combine information in the Students and Enrolled relations. If we want to
obtain the names of all students who obtained an A and the id of the course in which they
got an A, we could write the following query:
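One possible form of this query, written against the Student and Enrollment tables shown above:
SELECT S.name, E.cid
FROM Student S, Enrollment E
WHERE S.sid = E.sid AND E.grade = 'A';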
LIKE: The LIKE operator is used in a WHERE clause to search for a specified pattern in a
column.
There are two wildcards often used in conjunction with the LIKE operator: the percent sign (%),
which represents zero or more characters, and the underscore (_), which represents a single
character.
Example queries:
SELECT * FROM Student WHERE Name LIKE 'n%' OR Mobileno LIKE '9%';
Views: a view is a virtual table whose rows are not explicitly stored in the database, but are
computed as needed from view definition. A view also has rows and columns as they are in a
real table in the database. We can create a view by selecting fields from one or more tables
present in the database. A View can either have all the rows of a table or specific rows based
on certain condition.
Syntax:
CREATE VIEW view_name AS
SELECT column1, column2.....
FROM table_name
WHERE condition;
view_name: Name for the View
table_name: Name of the table
Condition: Condition to select rows
Example:
Consider students and enrolled relations. Suppose if we want to find names, id of students who
got B grade in some course with course id.
CREATE VIEW B-Students (name,sid,course) AS
SELECT S.sname, S.sid, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.Grade=’B’;
Just like table query, we can query the view to view the data.
SELECT * FROM Viewname;
SELECT * FROM B-Students;
UPDATING VIEWS
There are certain conditions needed to be satisfied to update a view. If any one of these
conditions is not met, then we will not be allowed to update the view.
1. The SELECT statement which is used to create the view should not include GROUP
BY clause or ORDER BY clause.
2. The SELECT statement should not have the DISTINCT keyword.
3. The View should have all NOT NULL values.
4. The view should not be created using nested queries or complex queries.
5. The view should be created from a single table. If the view is created using multiple
tables then we will not be allowed to update the view.
We can use the CREATE OR REPLACE VIEW statement to add or remove fields from a
view.
Syntax:
CREATE OR REPLACE VIEW view_name AS
SELECT column1,coulmn2,..
FROM table_name
WHERE condition;
For example, if we want to update the view MarksView and add the field AGE to this View
from StudentMarks Table, we can do this as:
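A sketch of such a statement (the StudentMarks columns NAME, MARKS, and AGE are assumed here):
CREATE OR REPLACE VIEW MarksView AS
SELECT NAME, MARKS, AGE
FROM StudentMarks;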
We can insert a row in a view in the same way as we do in a table. We can use the INSERT
INTO statement of SQL to insert a row in a view. Syntax:
INSERT INTO view_name(column1, column2 , column3,..)
VALUES(value1, value2, value3..);
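For example, assuming a view DetailsView(NAME, ADDRESS) defined over a student details table, a row can be inserted as follows (the later DELETE example removes this same row):
INSERT INTO DetailsView (NAME, ADDRESS)
VALUES ('Suresh', 'Gurgaon');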
Deleting a row from a View:
Deleting rows from a view is also as simple as deleting rows from a table. We can use the
DELETE statement of SQL to delete rows from a view. Also, deleting a row from a view first
deletes the row from the actual table, and the change is then reflected in the view.
Syntax:
DELETE FROM view_name
WHERE condition;
Example:
In this example we will delete the last row from the view DetailsView which we just added in
the above example of inserting rows.
DELETE FROM DetailsView
WHERE NAME="Suresh";
Destroying Views
A view can be dropped using the DROP VIEW command, which is just like DROP TABLE.
for example, we would use the following command:
DROP VIEW Viewname;
Relational Algebra:
Selection (σ): the selection operator is used to select tuples from a relation. It is denoted by
sigma (σ).
Ex: find the sailors whose rating is greater than 8 from the instance of the Sailors relation S2:
σ rating>8 (S2)
Projection (π): the projection operator is used to select attributes (columns) from a relation. It is
denoted by pi (π). It can be represented as π a1,a2,…,an (R), where π is the projection operator,
a1, a2, … are attributes of the relation, and R is the relation.
Π sname, rating(S2)
sname Rating
yuppy 9
Lubber 8
guppy 5
Rusty 10
Ex: find the sname and rating of sailors whose rating is greater than 8 from the instance of
Sailors S2: π sname,rating(σ rating>8(S2))
Union: R∪S returns a relation instance containing all tuples that occur in either R or S (or both).
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Fig: S1∪S2
Intersection: R∩S returns a relation instance containing all tuples that occur in both R and S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
sid sname rating age
31 Lubber 8 55.5
58 Rusty 10 35.0
Fig: S1∩S2
Set-difference: R−S returns a relation instance containing all tuples that occur in R but not in S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Cross-product: R×S returns a relation instance whose schema contains all the fields of R (in the
same order as they appear in R) followed by all the fields of S (in the same order as they appear
in S). The result of R×S contains one tuple ⟨r, s⟩ (the concatenation of tuples r and s) for each
pair of tuples r ∈ R, s ∈ S. The cross-product operation is sometimes called the Cartesian product.
Table: Enrolled (E)
Sid Cid
S1 C1
S2 C1
S1 C2
S3 C2
Table: Course (C)
Cid
C1
C2
E/C= Π sid (E)- Π sid(( Π sid (E) X C) – E).
Sid Cid
Π sid (E) X C =
S1 C1
S1 C2
S2 C1
S2 C2
S3 C1
S3 C2
( Π sid (E) X C) – E is
Sid Cid
S2 C1
S3 C1
Sid
S2
Π sid (E) is S3
sid
S1
S2
S3
Find the boats which have at least two reservations by different sailors.
ρ(Reservations, Π sid,bid,bname,bcolor (Sailors ⋈ Reserves ⋈ Boats))
ρ(Reservationpairs(1→sid1, 2→bid1, 3→bname1, 4→bcolor1, 5→sid2, 6→bid2, 7→bname2, 8→bcolor2), Reservations X Reservations)
Π bname1 (σ (sid1 <> sid2) ∧ (bid1 = bid2) (Reservationpairs))
Relational calculus: it is a non procedural language, it provides only description of the query,
but does not provide the methods to solve it. That is it explains what to do, not how to do.
It has two variants
Tuple relational calculus: a tuple variable is a variable that takes the tuples of a particular relation schema as values. Every value assigned to a given tuple variable has the same number and type of fields. In TRC a query is expressed as {T | P(T)}, where T is a tuple variable and P(T) is a predicate, i.e. the conditions used to fetch T. It returns the set of tuples T such that the predicate P(T) is true for T.
P(T) may be built from various conditions joined with
AND (∧), OR (∨), NOT (¬), and implies (⇒).
It also uses quantifiers:
∃ T ∈ r (Q(T)): there exists a tuple T in relation r such that predicate Q(T) is true.
∀ T ∈ r (Q(T)): Q(T) is true for all tuples T in relation r.
In relational calculus, ∃ and ∀ are said to bind the variable they quantify. A variable is free in a formula or subformula if that (sub)formula does not contain an occurrence of a quantifier that binds it. Every variable in a TRC formula appears in a subformula that is atomic.
Syntax of TRC queries:
Let Rel be a relation name, R and S tuple variables, a an attribute of R, and b an attribute of S. The atomic formulas are:
R ∈ Rel
R.a op S.b (here op is one of <, >, =, <=, >=, <>)
R.a op constant
Tuple relational calculus queries on the Sailors, Boats and Reserves relations:
3. Find the sailor name, boat id and reservation date for each reservation.
{P| ∃ R є Reserves ∃ S є Sailors ( R.sid= S.sid ∩ P.bid= R.bid ∩ P. day=R. day ∩
P.sname= S.sname)}
4. Find the names of sailors who have reserved boat 103.
{P|∃ S є Sailors ∃ R є Reserves(R.sid= S.sid ∩ R.bid= 103 ∩ P.sname=S.sname)}
5. Find the names of sailors who have reserved a red boat.
{P|∃ S є Sailors ∃ R є Reserves(R.sid= S.sid ∩ P.sname=S.sname) ∩ ∃ B є
boats(B.bid=R.bid ∩ B.color=’red’)}
6. Find the names of sailors who have reserved at least two boats.
{P|∃ S є Sailors ∃ R1 є Reserves ∃ R2 є Reserves( S.sid=R1.sid ∩ R1.sid= R2.sid ∩
R1.bid != R2.bid ∩ P.sname=S.sname)}
7. Find the names of sailors who have reserved all boats.
{P|∃ S є Sailors ∀ B є boats ∃ R є Reserves(S.sid= R.sid ∩ R.bid=B.bid ∩
P.sname=S.sname)}
8. Find the names of sailors who have reserved all red boats.
{P|∃ S є Sailors ∀ B є boats (B.color=’red’ ⇒(∃ R є Reserves (S.sid= R.sid ∩
R.bid=B.bid))}
Or
{P|∃ S є Sailors ∀B є boats(B.color !=’red’ ∪(∃ R є Reserves (S.sid= R.sid ∩
R.bid=B.bid))}
Domain relational calculus: a domain variable is variable that ranges over the values in the
domain of some attribute.
DRC query can be expressed as {(x1, x2,……, xn )|P(x1, x2,……, xn )}, where x1, x2,……, xn represents
the domain variables, P(x1, x2,……, xn ) represents the condition or formula equivalent to the
predicate calculus.
The predicate calculus formula as TRC it contains:
1. Set of all comparison operators
2. Set of connectivity operators like AND,OR,NOT
3. Set of Quantifiers.
Domain relational calculus Queries on above tables:
1. Find all sailors with a rating above 7.
{(I,N,T,A)| <I,N,T,A> є Sailors ∩ T>7}
2. Find the names of sailors who have reserved boat 103.
{<N> | ∃ I,T,A <I,T,N,A> є Sailors ∩ ∃ Ir, Br, D <Ir, Br, D> є Reserves ( Ir= I ∩ Br=103)}
3. Find the names of sailors who have reserved a red boat.
{<N> | ∃ I,T,A <I,T,N,A> є Sailors ∩ ∃ Ir, Br, D <Ir, Br, D> є Reserves (Ir = I) ∩ ∃ Bn <Br, Bn, ’red’> є Boats}
4. Find the names of sailors who have reserved at least two boats.
{<N>|∃ I,T,A <I,T,N,A> є Sailors ∩ ∃ Br1, Br2, D1, D2(I,Br1, D1) є Reserves ∩ (I,Br2, D2) є
Reserves ∩ Br1!= Br2)}
5. Find the names of sailors who have reserved all boats.
{<N> | ∃ I,T,A <I,T,N,A> є Sailors ∩ ∀ B, Bn, C (¬(<B, Bn, C> є Boats) ∪ ∃ <Ir, Br, D> є Reserves (I = Ir ∩ Br = B))}
6. Find names of sailors who have reserved all red boats.
{(I,N,T,A)| <I,N,T,A> є Sailors ∩ ∀ (B, Bn, C) є Boats (c=’red’ ⇒ ∃ Ir, Br, D <Ir, Br, D> є
Reserves(I=Ir ∩ Br=B)}
Unit-III
SQL: QUERIES, CONSTRAINTS, TRIGGERS: form of basic SQL query, UNION,
INTERSECT, and EXCEPT, Nested Queries, aggregation operators, NULL values, complex
integrity constraints in SQL, triggers and active data bases.
Schema Refinement: Problems caused by redundancy, decompositions, problems related to
decomposition, reasoning about functional dependencies, FIRST, SECOND, THIRD normal
forms, BCNF, lossless join decomposition, multi-valued dependencies, FOURTH normal form,
FIFTH normal form.
SQL: SQL (Structured Query Language) is the most widely used commercial database language. It was developed at IBM in the 1970s. It has several aspects:
DDL(Data Definition Language): this subset of SQL supports the creation, deletion,
modification of tables and views, allows to specify the integrity constraints on tables.
DML(Data Manipulation Language): it allows the user to pose the queries and to insert, delete
and modify the rows on a table.
Embedded and Dynamic SQL: Embedded SQL features allow SQL code to be called from a host language such as C or COBOL. Dynamic SQL allows a query to be constructed at run time.
Trigger: the SQL:1999 standard supports triggers, which are actions executed by the DBMS whenever changes to the database meet the conditions specified in the trigger.
Security: SQL provides the mechanisms to control access to data objects such as tables and
views.
Transaction management: it allows a user to explicitly control how a transaction is to be
executed.
Form of a Basic SQL Query:
SQL is a programming language for Relational Databases. It is designed over relational algebra
and tuple relational calculus. SQL comprises both data definition and data manipulation
languages. Using the data definition properties of SQL, one can design and modify database
schema, whereas data manipulation properties allows SQL to store and retrieve data from
database.
The basic form of SQL query is
SELECT [Distinct] Select-list
FROM From-list
WHERE Condition;
Example: SQL query to get the names of students who got marks above 60.
SELECT sname
FROM student
WHERE marks > 60;
This query corresponds to a relational algebra expression involving selection, projection, and cross-product.
The SELECT clause specifies which columns to be retained in the result.
Select-list: it specifies the list of column names.
The FROM clause specifies the cross product of tables
From-list: list of table names.
WHERE clause specifies the selection condition on the tables mentioned in the FROM clause.
Conceptual evaluation strategy:
1. Compute the cross-product of the tables in the from-list.
2. Delete the rows of the cross-product that fail the condition in the WHERE clause.
3. Delete the columns that do not appear in the select-list.
4. If DISTINCT is specified, eliminate duplicate rows.
In this unit we write the queries using the following table definitions.
Sailors (sid: integer, sname: string, rating: integer, age: real)
Boats ( bid: integer, bname: string, color: string)
Reserves( sid: integer, bid: integer, day: date)
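These schemas could be created as follows (a sketch only; the exact data types and key constraints shown here are assumptions, not given in the notes):
CREATE TABLE Sailors (sid INTEGER PRIMARY KEY,
sname CHAR(30),
rating INTEGER,
age REAL);
CREATE TABLE Boats (bid INTEGER PRIMARY KEY,
bname CHAR(30),
color CHAR(20));
CREATE TABLE Reserves (sid INTEGER REFERENCES Sailors,
bid INTEGER REFERENCES Boats,
day DATE,
PRIMARY KEY (sid, bid, day));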
Fig: instance S3 of Sailors
sid sname rating age
22 Dustin 7 45.0
29 Brutus 1 33.0
31 Lubber 8 55.5
32 Andy 8 25.5
58 Rusty 10 35.0
64 Horatio 7 35.0
71 Zorba 10 16.0
74 Horatio 9 35.0
85 Art 3 25.5
95 Bob 3 63.5
Fig: instance R2 of Reserves
sid bid day
22 101 10/10/98
22 102 10/10/98
22 103 10/8/98
22 104 10/7/98
31 102 11/10/98
31 103 11/6/98
31 104 11/12/98
64 101 9/5/98
64 102 9/8/98
74 103 9/8/98
Example: Find the names of sailors who have reserved a red boat.
SELECT S.sname
FROM Sailors S
WHERE S.sid IN ( SELECT R.sid
FROM Reserves R
WHERE R.bid IN ( SELECT B.bid
FROM Boats B
WHERE B.color = 'red' ) );
Example: Find the names of sailors who have not reserved a red boat.
SELECT S.sname
FROM Sailors S
WHERE S.sid NOT IN ( SELECT R.sid
FROM Reserves R
WHERE R.bid IN ( SELECT B.bid
FROM Boats B
WHERE B.color = 'red' ) );
Correlated Nested Queries:
Correlated sub queries are used for row-by-row processing. Each sub query is executed once for
every row of the outer query.
Example: Find the names of sailors who have reserved boat number 103.
SELECT S.sname
FROM Sailors S
WHERE EXISTS ( SELECT *
FROM Reserves R
WHERE R.bid = 103 AND R.sid = S.sid)
Set-Comparison Operators:
SQL supports the set operators EXISTS, IN, and UNIQUE; it also supports op ANY and op ALL, where op is one of the arithmetic comparison operators (<, <=, =, <>, >=, >). SOME is also available, but it is just a synonym for ANY.
The ANY and ALL operators are used with a WHERE or HAVING clause.
The ANY operator returns true if any of the subquery values meet the condition.
The ALL operator returns true if all of the subquery values meet the condition.
SOME: This operator is used to compare a value with a single column set of values returned by
the subquery. The SOME operator in SQL must match at least one value in a subquery and that
value must be preceded by comparison operators.
Generally we will use this SOME operator in WHERE clause to check whether the required
column values are matching with the set of values returned by subquery or not.
Example: write a query to find the sailors whose rating is greater than 8.
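The stated query needs no subquery (WHERE rating > 8 is enough), so as a sketch of op ANY consider the closely related query "find sailors whose rating is greater than the rating of some sailor called Horatio" on the Sailors instance above:
SELECT S.sid, S.sname
FROM Sailors S
WHERE S.rating > ANY (SELECT S2.rating
FROM Sailors S2
WHERE S2.sname = 'Horatio');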
INNER JOIN: The INNER JOIN creates a new result table by combining column values of two
tables (table1 and table2) based upon the join-predicate.
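A sketch using the CUSTOMERS and ORDERS tables of the RIGHT JOIN example below:
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
INNER JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;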
RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left
table. This means that if the ON clause matches 0 (zero) records in the left table; the join will
still return a row in the result, but with NULL in each column from the left table.
This means that a right join returns all the values from the right table, plus matched values from
the left table or NULL in case of no matching join predicate
SQL> SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
RIGHT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL JOIN: combines the results of both left and right outer joins.
The joined table will contain all records from both the tables and fill in NULLs for missing
matches on either side.
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
FULL JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
This would produce the following result
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
Complex integrity constraints in SQL:
Integrity constraints over a single table:
We can specify complex integrity constraints over a single table using table constraints, which have the form CHECK conditional-expression. When a row is inserted into the table or an existing row is modified, the conditional expression in the CHECK constraint is evaluated. If it evaluates to false, the command is rejected.
Example:
Ensure that rating must be an integer in the range 1 to 10 only.
CREATE TABLE Sailors( sid INTEGER,
Sname CHAR(10),
Rating INTEGER,
Age REAL, PRIMARY KEY (Sid),
CHECK (rating >=1 AND rating <=10));
Restrict values:
Example: enforce the constraint that Interlake boats cannot be reserved
CREATE TABLE Reserves (sid INTEGER,
bid INTEGER,
day DATE,
FOREIGN KEY (sid) REFERENCES Sailors,
FOREIGN KEY (bid) REFERENCES Boats,
CONSTRAINT noInterlakes
CHECK ('Interlake' <> (SELECT B.bname
FROM Boats B
WHERE B.bid = Reserves.bid)));
Domain Constraints and distinct types:
A user can define a new domain using the CREATE DOMAIN statement, which uses check
constraints.
CREATE DOMAIN ratingval INTEGER DEFAULT 1
CHECK (VALUE >=1 AND VALUE <=10)
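The new domain can then be used like a built-in type when defining columns, for example (a sketch):
CREATE TABLE Sailors (sid INTEGER,
sname CHAR(10),
rating ratingval,
age REAL,
PRIMARY KEY (sid));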
Assertions:
Table constraints are associated with a single table, although the conditional expression in the
CHECK clause can refer to other tables. Table constraints are required to hold only if the
associated table is non-empty. Thus, when a constraint involves two or more tables, the table constraint mechanism is not quite what is desired. To cover such situations, SQL supports the creation of assertions. Assertions are constraints not associated with any one table.
For example, suppose that we wish to enforce the constraint that the number of
boats plus the number of sailors should be less than 100.
When a constraint involves two or more tables assertions are used.
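A sketch of such an assertion (the constraint name smallClub is only illustrative):
CREATE ASSERTION smallClub
CHECK ((SELECT COUNT (S.sid) FROM Sailors S)
+ (SELECT COUNT (B.bid) FROM Boats B) < 100);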
Example: Database contains the CUSTOMERS table (id, name, age, address, salary),
creates a row-level trigger for the customers table to display the salary difference between
the old values and new values, that would fire for INSERT or UPDATE or DELETE
operations performed on the CUSTOMERS table.
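A sketch of such a trigger in Oracle PL/SQL syntax (the WHEN condition and the exact column names are assumptions):
CREATE OR REPLACE TRIGGER display_salary_changes
BEFORE DELETE OR INSERT OR UPDATE ON customers
FOR EACH ROW
WHEN (NEW.id > 0)
DECLARE
sal_diff NUMBER;
BEGIN
-- compute and print the difference between the old and new salary values
sal_diff := :NEW.salary - :OLD.salary;
dbms_output.put_line('Old salary: ' || :OLD.salary);
dbms_output.put_line('New salary: ' || :NEW.salary);
dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/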
Drop Trigger: we can remove the trigger on database using drop command.
Syntax: DROP TRIGGER [IF EXISTS] [schema_name.]trigger_name
Example: DROP TRIGGER IF EXISTS display_salary_changes;
Active Databases:
Active Database is a database consisting of set of triggers. These databases are very difficult to be
maintained because of the complexity that arises in understanding the effect of these triggers. In
such database, DBMS initially verifies whether the particular trigger specified in the statement that
modifies the database is activated or not, prior to executing the statement.
If the trigger is active then DBMS executes the condition part and then executes the action part
only if the specified condition is evaluated to true. It is possible to activate more than one trigger
within a single statement.
In such a situation, the DBMS processes each of the triggers in some arbitrary order. The execution of the action part of a trigger may activate other triggers, or even the same trigger that initiated the action. A trigger that activates itself in this way is called a 'recursive trigger'. The DBMS executes such chains of triggers.
For example, consider the Hourly_Emps relation. Here ssn is the key, and the hourly_wages attribute is determined by the rating attribute: for a given rating value, there is only one permissible hourly_wages value. This integrity constraint is an example of a functional dependency. It leads to possible redundancy in the relation Hourly_Emps.
Redundant storage: the rating value 8 corresponds to the hourly wage 10, and this association is repeated three times; the rating value 5 corresponds to the hourly wage 7, and this association is repeated two times.
Problems related to decomposition: decomposing a relation schema can create more problems than it solves. When we decompose a table into a number of smaller tables, we have to understand what the purpose of the decomposition is and what problems may arise if we decompose carelessly (in particular, lossy joins and loss of dependencies).
Example: Relation R(ABC) is decomposed into R1(AB) and R2(BC) check whether the
decomposition is lossy or loss less decomposition.
R(ABC)=
A B C
1 2 1
2 2 2
3 3 2
R1(AB)=
A B
1 2
2 2
3 3
R2(BC)=
B C
2 1
2 2
3 2
R1 X R2=
A B B C
1 2 2 1
1 2 2 2
1 2 3 2
2 2 2 1
2 2 2 2
2 2 3 2
3 3 2 1
3 3 2 2
3 3 3 2
R1 ⋈ R2 =
A B C
1 2 1
1 2 2
2 2 1
2 2 2
3 3 2
R1 ⋈ R2 != R, This decomposition is lossy decomposition.
Example: Relation R(ABC) is decomposed into R1(AB) and R2(AC) check whether the
decomposition is lossy or loss less decomposition.
R(ABC)=
A B C
1 1 1
2 1 2
3 2 1
4 3 2
R(ABC) is decomposed into R1(AB) and R2(AC)
R 1=
A B
1 1
2 1
3 2
4 3
R2=
A C
1 1
2 2
3 1
4 2
R1 ⋈ R2 = R
A B C
1 1 1
2 1 2
3 2 1
4 3 2
This decomposition is loss less.
If a relation schema R with a set of FDs F is decomposed into R1 and R2 with FD sets F1 and F2, the decomposition is said to be dependency preserving if the closure of F equals the closure of F1 ∪ F2.
Example:
Consider a table with two columns Employee_Id and Employee_Name.
{Employee_id, Employee_Name} → Employee_Id is a trivial functional dependency as
Employee_Id is a subset of {Employee_Id, Employee_Name}.
Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial dependencies too.
2. Non-trivial functional dependency
A → B has a non-trivial functional dependency if B is not a subset of A.
Example:
ID → Name,
Name → DOB
Functional Dependency Set: Functional Dependency set or FD set of a relation is the set of all
FDs present in the relation.
Armstrong Axioms: The Armstrong axioms refer to a set of inference rules. These are used to find the closure of a set of functional dependencies (F+) from a given set of functional dependencies (F).
Given set of functional dependencies (F) = {A→B, A→C, CG→H, CG→I, B→H}
Here we have
A→B, and we also have B→H, then we can write A→H (transitivity)
CG→H and CG→I, then we can write CG→HI (union)
A→C and CG→H, then we can write AG→H (pseudo-transitivity)
A→C and CG→I, then we can write AG→I (pseudo-transitivity)
Closure of the set of functional dependencies (F+) = {A→B, A→C, CG→H, CG→I, B→H, A→H, CG→HI, AG→H, AG→I}
Example: Relation R has attributes R(ABCDEF); the set of functional dependencies of the relation is F = {A→B, A→C, CD→E, CD→F, B→E}. Find the closure of the set of functional dependencies for relation R.
Given set of functional dependencies (F) = {A→B, A→C, CD→E, CD→F, B→E}
Here we have
A→B and A→C, then we can write A→BC (union)
CD→E and CD→F, then we can write CD→EF (union)
A→B and B→E, then we can write A→E (transitivity)
A→C and CD→E, then we can write AD→E (pseudo-transitivity)
A→C and CD→F, then we can write AD→F (pseudo-transitivity)
Then the closure of the set of functional dependencies (F+) = {A→B, A→C, CD→E, CD→F, B→E, A→BC, CD→EF, A→E, AD→F, AD→E}
Minimal cover (canonical cover, or irreducible set) of functional dependencies:
To find the minimal set of functional dependencies we should follow these three rules:
1. The right-hand side of every FD should be a single attribute.
2. Remove the extraneous attributes from the left-hand side of each FD.
3. Remove the redundant FDs.
Example1: find the minimal cover of F = {A→B, B→C, AB→C}.
Step 2: in AB→C, attribute B is extraneous, because from attribute A we can get attribute B, but from attribute B we can't get attribute A.
After removing the extraneous attribute B, the functional dependency AB→C can be written as A→C.
After removing the extraneous attribute B the FDs are
A→B,
B→C,
A→C
Step 3: we should remove the redundant FDs. A→C is redundant (it follows from A→B and B→C by transitivity), so from
A→B,
B→C,
A→C
the minimal cover is
A→B,
B→C
Example2: find the canonical cover for the given set of functional dependencies
G = {A→C, AB→C, C→DI, CD→I, EC→AB, EI→C, A→E}.
For AB→C:
Closure of A: {A}+ = {A, C, D, I, E, B}
Closure of B: {B}+ = {B}
Here attribute B is extraneous, so AB→C reduces to A→C.
A→C
CD→I
To find the extraneous attribute we should find the closure of C and the closure of D:
Closure of D: {D}+ = {D}
Closure of C: {C}+ = {C, D, I}
Here there is no extraneous attribute.
EI→C
To find the extraneous attribute we should find the closure of E and the closure of I:
Closure of E: {E}+ = {E}
Closure of I: {I}+ = {I}
Here there is no extraneous attribute.
EC→B
To find the extraneous attribute we should find the closure of E and the closure of C:
Closure of E: {E}+ = {E}
Closure of C: {C}+ = {C, D, I}
Here there is no extraneous attribute.
Example3: find the canonical cover for the given set of functional dependencies
F = {A→D, E→AD, BC→AD, C→B}.
Example4: Find the canonical cover of F = {A→BC, B→CE, A→E, AC→H, D→B}.
Normalization
Normalization is the process of organizing the data in the database. It is used to minimize the
redundancy from a relation or set of relations. Normalization divides the larger table into the
smaller table and links them using relationship.
Types of Normal Forms
First Normal Form (1NF):
A relation will be in 1NF if it contains only atomic values. It states that an attribute of a table cannot hold multiple values: it must hold only single-valued attributes, and it does not allow multi-valued attributes, composite attributes, or their combinations.
The decomposition of the STUDENT table into 1NF is shown in the figure (not reproduced here).
Second Normal Form (2NF): a relation is in 2NF if it is in 1NF and no non-prime attribute depends on a proper subset of any candidate key (i.e. there is no partial dependency). Consider a relation with attributes (STUD_NO, COURSE_NO, COURSE_FEE):
STUD_NO COURSE_NO COURSE_FEE
1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000
Note that there are many courses having the same course fee.
Here,
COURSE_FEE cannot alone decide the value of COURSE_NO or STUD_NO;
COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO;
COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO;
Hence,
COURSE_FEE would be a non-prime attribute, as it does not belong to the only candidate key {STUD_NO, COURSE_NO};
But, COURSE_NO -> COURSE_FEE , i.e., COURSE_FEE is dependent on COURSE_NO,
which is a proper subset of the candidate key. Non-prime attribute COURSE_FEE is dependent on
a proper subset of the candidate key, which is a partial dependency and so this relation is not in
2NF.
To convert the above relation to 2NF, we need to split the table into two tables such as :
Table 1: STUD_NO, COURSE_NO and Table 2: COURSE_NO, COURSE_FEE
Table 1: STUD_NO, COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5
Table 2: COURSE_NO, COURSE_FEE
C1 1000
C2 1500
C4 2000
C3 1000
C5 2000
NOTE: 2NF tries to reduce the redundant data being stored. For instance, if there are 100 students taking the C1 course, we don't need to store its fee as 1000 for all 100 records; instead we can store it once in the second table as the course fee for C1 being 1000.
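The decomposition can be realised with two tables, for example (a sketch; the table names and data types are assumptions):
CREATE TABLE Student_Course (STUD_NO INTEGER,
COURSE_NO CHAR(5),
PRIMARY KEY (STUD_NO, COURSE_NO));
CREATE TABLE Course_Fee (COURSE_NO CHAR(5) PRIMARY KEY,
COURSE_FEE INTEGER);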
Example 2: Let's assume, a school can store the data of teachers and the subjects they teach. In a
school, a teacher can teach more than one subject.
TEACHER table
TEACHER_DETAIL table:
TEACHER_SUBJECT table:
Teacher ID Subject
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
Third Normal Form (3NF):
A relation is in 3NF if at least one of the following conditions holds for every non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute i.e., each element of Y is part of some candidate key.
Transitive dependency – If A->B and B->C are two FDs then A->C is called transitive
dependency.
Example 1 – In relation STUDENT
STUD_ID STUD_NAME STUD_STATE STUD_COUNTRY STUD_AGE
1 RAM HARYANA INDIA 20
2 RAM PUNJAB INDIA 19
3 SURESH PUNJAB INDIA 21
FD set: {STUD_ID -> STUD_NAME, STUD_ID -> STUD_STATE, STUD_STATE ->
STUD_COUNTRY, STUD_ID -> STUD_AGE}
Candidate Key: {STUD_ID}
For this relation, STUD_ID -> STUD_STATE and STUD_STATE -> STUD_COUNTRY are
true. So STUD_COUNTRY is transitively dependent on STUD_ID. It violates the third normal
form.
To convert it in third normal form, we will decompose the relation STUDENT into two relations.
STUDENT (STUD_ID, STUD_NAME, STUD_STATE, STUD_AGE)
STUD_ID STUD_NAME STUD_STATE STUD_AGE
1 RAM HARYANA 20
2 RAM PUNJAB 19
3 SURESH PUNJAB 21
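As a sketch, the two relations of this 3NF decomposition could be created as follows (the relation name STATE_COUNTRY and the data types are assumptions):
CREATE TABLE STUDENT (STUD_ID INTEGER PRIMARY KEY,
STUD_NAME CHAR(20),
STUD_STATE CHAR(20),
STUD_AGE INTEGER);
CREATE TABLE STATE_COUNTRY (STUD_STATE CHAR(20) PRIMARY KEY,
STUD_COUNTRY CHAR(20));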
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 INDIA
364 UK
EMP_DEPT table:
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
264 Designing
264 Testing
364 Stores
364 Developing
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys: EMP_ID for the EMP_COUNTRY table, EMP_DEPT for the EMP_DEPT table, and {EMP_ID, EMP_DEPT} for the EMP_DEPT_MAPPING table.
Now this is in BCNF because the left-hand side of both functional dependencies is a key.
Fourth normal form (4NF):
A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
For a dependency A → B, if for a single value of A multiple values of B exist, then the relation is said to have a multi-valued dependency.
Example
STUDENT RELATION
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities; hence there is no relationship between COURSE and HOBBY.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE, STUDENT_HOBBY.
Student_Course
STU_ID COURSE
21 COMPUTER
21 MATH
34 CHEMISTRY
74 BIOLOGY
59 PHYSICS
Student_Hobby
STU_ID HOBBY
21 DANCING
21 SINGING
34 DANCING
74 CRICKET
59 HOCKEY
Fifth normal form (5NF):
A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining should be lossless.
5NF is satisfied when all the tables are broken into as many tables as possible in
order to avoid redundancy.
5NF is also known as Project-join normal form (PJ/NF)
In the table above (not reproduced here), John takes both Computer and Math classes for Semester 1, but he doesn't take the Math class for Semester 2. In this case, a combination of all three fields is required to identify a valid record. Suppose we add a new semester, Semester 3, but do not yet know the subject and who will be taking that subject, so we leave Lecturer and Subject as NULL. But all three columns together act as the primary key, so we can't leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
Table P1
SEMESTER SUBJECT
Semester 1 COMPUTER
Semester 1 MATH
Semester 2 MATH
Semester 1 CHEMISTRY
Table P2
SUBJECT LECTURER
COMPUTER Ansika
COMPUTER John
MATH John
MATH Akash
CHEMISTRY Praveen
Table P3
SEMESTER LECTURER
Semester 1 Ansika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
Example: Relation R has attributes R(ABCDEF) and the set of FDs {AB→C, C→D, C→E, E→F, F→A}. Check the highest normal form of relation R.
2NF: the L.H.S. of every FD should be a candidate key, or the R.H.S. should be a prime attribute; the relation should not contain any partial dependency (an FD whose L.H.S. is a proper subset of some candidate key and whose R.H.S. is a non-prime attribute).
It is not in 2NF because C is a proper subset of a candidate key and D is a non-prime attribute (C→D).
The highest normal form of the given relation R is first normal form (1NF).
Example: check the highest normal form of the Student relation R; the set of functional dependencies F is {Rollno→Name, Rollno→Voterid, Voterid→Age, Voterid→Rollno}.
To check the highest normal form, we can check from highest to lowest.
BCNF: the L.H.S. of every functional dependency is a candidate key or super key of the relation.
Candidate keys of the relation are {Rollno, Voterid}.
The above table is in BCNF, since the L.H.S. of every functional dependency is a candidate key.
Example: R(A, B, C) with the set of FDs F = {AB→C, C→A}. Show that R is in 3NF but not in BCNF.
The above relation is in 3NF because AB is a candidate key in the FD AB→C, and A is a prime attribute in the FD C→A of relation R.
The above relation is not in BCNF because attribute C is not a candidate key (or super key) in the FD C→A of relation R.
Example: find the highest normal form of the relation R(A, B, C, D, E) with the set of FDs F = {AC→BE, BC→D, B→E}.
Step 1. As we can see, (AC)+ = {A, C, B, E, D}, but none of its proper subsets can determine all attributes of the relation, so AC will be a candidate key. A and C can't be derived from any other attribute of the relation, so there will be only one candidate key, {AC}.
Step 2. Prime attributes are those attribute which are part of candidate key {A, C} in this
example and others will be non-prime {B, D, E} in this example.
Step 3. The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or
composite attribute.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not a proper
subset of candidate key AC) and AC->BE is in 2nd normal form (AC is candidate key) and B->E
is in 2nd normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is a
prime attribute) and in B->E (neither B is a super key nor E is a prime attribute) but to satisfy 3rd
normal for, either LHS of an FD should be super key or RHS should be prime attribute.
So the highest normal form of relation will be 2nd Normal form.
Example: Find the highest normal form in R (A, B, C, D, E) under following functional
dependencies.
ABC→D, CD→AE
1) It is always a good idea to start checking from BCNF, then 3 NF and so on.
2) If any functional dependency satisfied a normal form then there is no need to check for
lower normal form.
For example, ABC –> D is in BCNF (Note that ABC is a super key), so no need to check this
dependency for lower normal forms.
Candidate keys in the given relation are {ABC, BCD}
BCNF: ABC -> D is in BCNF. Let us check CD -> AE, CD is not a super key so this dependency
is not in BCNF. So, R is not in BCNF.
3NF: ABC -> D we don’t need to check for this dependency as it already satisfied BCNF. Let us
consider CD -> AE. Since E is not a prime attribute, so the relation is not in 3NF.
2NF: In 2NF, we need to check for partial dependency. CD which is a proper subset of a candidate
key and it determine E, which is non-prime attribute. So, given relation is also not in 2 NF. So, the
highest normal form is 1 NF.
UNIT –IV
Syllabus: Transaction Concept, Transaction State, Implementation of Atomicity and Durability,
Concurrent Executions, Serializability, Recoverability, Implementation of Isolation, Testing for
serializability, Lock Based Protocols, Timestamp Based Protocols, Validation-Based Protocols,
Multiple Granularity, Recovery and Atomicity, Log–Based Recovery, Recovery with Concurrent
Transactions.
Transaction:
A set of logically related operations is known as transaction. The main operations of a transaction are:
Read (A): Read operations Read (A) or R (A) reads the value of A from the database and stores it
in a buffer in main memory.
Write (A): Write operation Write (A) or W (A) writes the value back to the database from buffer.
Let us take a debit transaction from an account which consists of following operations:
1. R(A);
2. A=A-1000;
3. W(A);
Assume A’s value before starting of transaction is 5000.
The first operation reads the value of A from database and stores it in a buffer.
Second operation will decrease its value by 1000. So buffer will contain 4000.
Third operation will write the value from buffer to database. So A’s final value will be
4000.
But it may also be possible that transaction may fail after executing some of its operations. The
failure can be because of hardware, software or power etc.
For example, if debit transaction discussed above fails after executing operation 2, the value of A
will remain 5000 in the database which is not acceptable by the bank.
To avoid this, Database has two important operations:
Commit: After all instructions of a transaction are successfully executed, the changes made by
transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the changes made
by transaction are undone.
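In SQL, the debit transaction above could be sketched as follows (the Account table and its columns are assumptions used only for illustration):
START TRANSACTION;
UPDATE Account
SET balance = balance - 1000
WHERE acc_no = 'A';
-- if every statement succeeded, make the change permanent
COMMIT;
-- if something went wrong instead, undo the change with
-- ROLLBACK;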
Properties of a transaction: To ensure the consistency of database each transaction must follow
these properties. These are called ACID properties.
Atomicity: it ensures that either the entire transaction takes place at once or doesn’t happen at all.
There is no midway i.e. transactions do not occur partially. Each transaction is considered as one
unit and either runs to completion or is not executed at all.
It involves the following two operations.
Abort: If a transaction aborts, changes made to database are not visible.
Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
Consider the following transaction T consisting of T1 and T2: Transfer of 100 from account X to
account Y.
If the transaction fails after completion of T1 but before completion of T2.( say, after write(X) but
before write(Y)), then amount has been deducted from X but not added to Y. This results in an
inconsistent database state. Therefore, the transaction must be executed in entirety in order to
ensure correctness of database state.
Consistency:
This means that integrity constraints must be maintained so that the database is consistent before
and after the transaction. It refers to the correctness of a database. Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, database is consistent. Inconsistency occurs in case T1 completes but T2 fails. As a
result T is incomplete.
Isolation:
This property ensures that multiple transactions can occur concurrently without leading to the
inconsistency of database state. Transactions occur independently without interference. Changes
occurring in a particular transaction will not be visible to any other transaction until that particular
change in that transaction is written to memory or has been committed. This property ensures that
the execution of transactions concurrently will result in a state that is equivalent to a state achieved
if these were executed serially in some order.
Let X = 50,000 and Y = 500.
Consider two transactions T and T''.
Suppose T has been executed till Read(Y) and then T'' starts. As a result, interleaving of operations takes place, due to which T'' reads the correct value of X but an incorrect value of Y, and the sum computed by
T'': (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of transaction
T: (X + Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a loss of 50 units. Hence, transactions must take
place in isolation and changes should be visible only after they have been made to the main
memory.
Durability:
This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if a system
failure occurs. These updates now become permanent and are stored in non-volatile memory. The
effects of the transaction, thus, are never lost.
The ACID properties, in totality, provide a mechanism to ensure correctness and consistency of a
database in a way such that each transaction is a group of operations that acts a single unit,
produces consistent results, acts in isolation from other operations and updates that it makes are
durably stored.
States of Transactions
Active: This is the initial state of every transaction, all the operations of transaction is being
executed in this state.
Partially Committed: When a transaction executes its final operation, it is said to be in a
partially committed state.
Failed: A transaction is said to be in a failed state if any of the checks made by the database
recovery system fails. A failed transaction can no longer proceed further.
Aborted: If any of the checks fails and the transaction has reached a failed state, then the
recovery manager rolls back all its write operations on the database to bring the database back to
its original state where it was prior to the execution of the transaction. Transactions in this state
are called aborted.
The database recovery module can select one of the two operations after a transaction aborts
Re-start the transaction
Kill the transaction
Committed: If a transaction executes all its operations successfully, it is said to be committed.
All its effects are now permanently established on the database system.
Implementation of Atomicity and Durability:
The Recovery Manager is responsible for ensuring Atomicity and Durability.
Atomicity is guaranteed by undoing the actions of the transactions that did not commit
(aborted). Durability is guaranteed by making sure that all actions of committed transactions
survive crashes and failures.
The recovery-management component of a database system can support atomicity and durability
by a variety of schemes, one of the simplest schemes is called Shadow copy.
Shadow copy:
In the shadow-copy scheme, a transaction that wants to update the database first creates a
complete copy of the database. All updates are done on the new database copy, leaving the
original copy, the shadow copy, untouched. If at any point the transaction has to be aborted, the
system merely deletes the new copy. The old copy of the database has not been affected.
This scheme is based on making copies of the database, called shadow copies, assumes that only
one transaction is active at a time. The scheme also assumes that the database is simply a file on
disk. A pointer called db-pointer is maintained on disk; it points to the current copy of the
database.
If the transaction completes, it is committed as follows:
First, the operating system is asked to make sure that all pages of the new copy of the database have been written out to disk. (Unix systems use the fsync command for this purpose.)
After the operating system has written all the pages to disk, the database system updates the
pointer db-pointer to point to the new copy of the database; the new copy then becomes the
current copy of the database. The old copy of the database is then deleted.
Fig: Shadow copy technique for atomicity and durability
The transaction is said to have been committed at the point where the updated db pointer is
written to disk.
How the technique handles transaction failures:
If the transaction fails at any time before db-pointer is updated, the old contents of the
database are not affected. We can abort the transaction by just deleting the new copy of
the database.
Once the transaction has been committed, all the updates that it performed are in the
database pointed to by db pointer.
Thus, either all updates of the transaction are reflected, or none of the effects are
reflected, regardless of transaction failure.
How the technique handles system failures:
Suppose that the system fails at any time before the updated db-pointer is written to disk.
Then, when the system restarts, it will read db-pointer and will thus see the original contents
of the database, and none of the effects of the transaction will be visible on the database.
Next, suppose that the system fails after db-pointer has been updated on disk. Before the
pointer is updated, all updated pages of the new copy of the database were written to disk.
Again, we assume that, once a file is written to disk, its contents will not be damaged even if
there is a system failure. Therefore, when the system restarts, it will read db-pointer and will
thus see the contents of the database after all the updates performed by the transaction.
Implementation of Isolation
Isolation determines how transaction integrity is visible to other users and systems. It means that a
transaction should take place in a system in such a way that it is the only transaction that is
accessing the resources in a database system.
Isolation levels define the degree to which a transaction must be isolated from the data
modifications made by any other transaction in the database system. A transaction isolation level is
defined by the following phenomena.
Dirty Read – A dirty read is the situation when a transaction reads data that has not yet been committed. For example, let's say transaction 1 updates a row and leaves it uncommitted; meanwhile, transaction 2 reads the updated row. If transaction 1 rolls back the change, transaction 2 will have read data that is considered never to have existed.
Non Repeatable read – Non Repeatable read occurs when a transaction reads same row twice,
and get a different value each time. For example, suppose transaction T1 reads data. Due to
concurrency, another transaction T2 updates the same data and commit, Now if transaction T1
rereads the same data, it will retrieve a different value.
A=200
T1 T2
R(A)
W(A)
R(A)
A=A+100
W(A)
COMMIT
R(A)
Phantom Read – Phantom Read occurs when two same queries are executed, but the rows
retrieved by the two, are different. For example, suppose transaction T1 retrieves a set of rows
that satisfy some search criteria. Now, Transaction T2 generates some new rows that match the
search criteria for transaction T1. If transaction T1 re-executes the statement that reads the rows,
it gets a different set of rows this time.
Based on these phenomena, The SQL standard defines four isolation levels:
Read Uncommitted: Read Uncommitted is the lowest isolation level. In this level, one
transaction may read not yet committed changes made by other transaction, thereby allowing
dirty reads. In this level, transactions are not isolated from each other.
Read Committed: This isolation level guarantees that any data read is committed at the
moment it is read. Thus it does not allow dirty read. The transaction holds a read or write
lock on the current row, and thus prevent other transactions from reading, updating or
deleting it.
Repeatable Read: This is a more restrictive isolation level. The transaction holds read locks on all rows it references and write locks on all rows it inserts, updates, or deletes. Since other transactions cannot read, update or delete these rows, it avoids non-repeatable reads.
Serializable: This is the highest isolation level. A serializable execution is defined to be an execution of operations in which concurrently executing transactions appear to be executing serially.
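An application chooses one of these levels per transaction; in SQL this is done with SET TRANSACTION, for example (the exact syntax varies slightly across DBMSs):
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- or, for the strictest behaviour:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;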
The table given below clearly depicts the relationship between isolation levels, read phenomena and locks (table not reproduced here):
Schedules in DBMS
The sequence of operation from one transaction to another transaction is known as schedule. It is
used to preserve the order of the operation in each of the individual transaction in a concurrent
execution of multiple transactions.
There are two types of Schedules
1. Serial schedule
2. Non serial schedule
Serial schedule
Schedules in which the transactions are executed non-interleaved. In the serial schedule, when
the first transaction completes its cycle, then the next transaction is executed.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
R(B)
W(B)
R(A)
R(B)
This is a serial schedule since the transactions perform serially in the order T1 T2
Non serial schedule:
This is a type of Scheduling where the operations of multiple transactions are interleaved. This
might lead to a rise in the concurrency problem. in the non-serial schedule, the other transaction
proceeds without waiting for the previous transaction to complete. The transactions are executed
in a non-serial manner, keeping the end result correct and same as the serial schedule.
T1 T2
R(A)
W(A)
W(B)
R(A)
R(B)
R(B)
Non-Serializable schedules:
The non-serializable schedule is divided into two types, Recoverable and Non-recoverable
Schedule.
Recoverable Schedule:
Schedules in which transactions commit only after all transactions whose changes they read
commit are called recoverable schedules. In other words, if some transaction Tj is reading value
updated or written by some other transaction Ti, then the commit of Tj must occur after the
commit of Ti.
Example – Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
R(A)
Commit
commit
This is a recoverable schedule since T1 commits before T2, that makes the value read by T2 correct.
Recoverable schedules are three types:
Cascading roll back Schedule: A schedule in which failure in one transaction leads to the rolling
back or aborting other dependent transactions, then such scheduling is referred to as Cascading
rollback or cascading abort
Example:
The table below shows a schedule with two transactions, T1 reads and writes A and that value
is read and written by T2. But later on, T1 fails. So we have to rollback T1. Since T2 has read
the value written by T1, it should also be rollbacked. As it has not committed, we can rollback T2
as well. So it is recoverable with cascading rollback.
Therefore, if Tj is reading value updated by Ti and commit of Tj is delayed till commit of Ti, the
schedule is called recoverable with cascading rollback.
Cascadeless Schedule: schedules in which transactions read values only after all transactions whose changes they are going to read have committed are called cascadeless schedules. This avoids the situation where a single transaction abort leads to a series of transaction rollbacks. A strategy to prevent cascading aborts is to disallow a transaction from reading uncommitted changes from another transaction in the same schedule.
In other words, if some transaction Tj wants to read value updated or written by some other
transaction Ti, then the commit of Tj must read it after the commit of Ti.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
Commit
R(A)
Commit
This schedule is cascadeless. Since the updated value of A is read by T2 only after the updating
transaction i.e. T1 commits.
Strict Schedule:
A schedule is strict if for any two transactions Ti, Tj, if a write operation of Ti precedes a
conflicting operation of Tj (either read or write), then the commit or abort event of Ti also
precedes that conflicting operation of Tj.
In other words, Tj can read or write updated or written value of Ti only after Ti commits/aborts.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
commit
W(A)
R(A)
Commit
This is a strict schedule since T2 reads and writes A which is written by T1 only after the commit of
T1
Non-Recoverable Schedule:
When Tj is reading the value updated by Ti and Tj is committed before committing of Ti, the
schedule will be irrecoverable.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
R(A)
Commit
abort
T2 read the value of A written by T1, and committed. T1 later aborted, therefore the value read by
T2 is wrong, but since T2 committed, this schedule is non-recoverable.
Serializability:
When multiple transactions are running concurrently then there is a possibility that the database
may be left in an inconsistent state. Serializability is a concept that helps us to check
which schedules are serializable. A serializable schedule is the one that always leaves the
database in consistent state.
A non-serial schedule of n number of transactions is said to be serializable schedule, if it is
equivalent to the serial schedule of those n transactions.
There are two types of Serializability.
1. Conflict Serializability
2. View Serializability
Conflict Serializability:
A schedule is called conflict serializable if it can be transformed into a serial schedule by
swapping non-conflicting operations. Two operations are said to be conflicting if all conditions
satisfy:
They belong to different transactions
They operate on the same data item
At Least one of them is a write operation
Example1: check whether the given schedule is conflict serializable or non conflict serializable.
T1 T2
R(A)
R(B)
R(A)
R(B)
W(B)
W(A)
To convert this schedule into a serial schedule we must have to swap the R(A) operation of
transaction T2 with the W(A) operation of transaction T1. However we cannot swap these two
operations because they are conflicting operations, thus we can say that this given schedule is not
Conflict Serializable.
Example2: check whether the given schedule is conflict serializable or non conflict serializable.
T1 T2
R(A)
R(A)
R(B)
W(B)
R(B)
W(A)
We finally got a serial schedule after swapping all the non-conflicting operations so we can say
that the given schedule is Conflict Serializable.
Precedence graph:
A precedence graph, also known as serialization graph, it is used to test Conflict Serializability of
a schedule. It is a directed Graph (V, E) consisting of a set of nodes V = {T1, T2, T3……….Tn} and
a set of directed edges E = {e1, e2, e3………………em}.
The graph contains one node for each Transaction Ti. An edge ei is of the form Tj –> Tk where Tj is
the starting node of ei and Tk is the ending node of ei. An edge ei is constructed between nodes Tj to
Tk if one of the operations in Tj appears in the schedule before some conflicting operation in Tk .
The Algorithm can be written as:
1. Create a node T in the graph for each participating transaction in the schedule.
2. For the conflicting operation read_item(X) and write_item(X) – If a Transaction
Tj executes a read_item (X) after Ti executes a write_item (X), draw an edge from Ti to
Tj in the graph.
3. For the conflicting operation write_item(X) and read_item(X) – If a Transaction
Tj executes a write_item (X) after Ti executes a read_item (X), draw an edge from Ti to
Tj in the graph.
4. For the conflicting operation write_item(X) and write_item(X) – If a Transaction
Tj executes a write_item (X) after Ti executes a write_item (X), draw an edge from Ti to
Tj in the graph.
5. The Schedule S is serializable if there is no cycle in the precedence graph.
Example: Consider the schedule S: R1(x), R1(y), W2(x),W1(x),R2(y) and check whether
schedule S, conflict serializable or non conflict serializable .
The given schedule is
S: R1(x), R1(y), W2(x), W1(x), R2(y)
T1 T2
R(x)
R(y)
W(x)
W(x)
R(y)
T1 T2
2. For the conflicting pair r1(x) w2(x), where r1(x) happens before w2(x), draw an edge
from T1 to T2..
T1 T2
3. For the conflicting pair w2(x) w1(x), where w2(x) happens before w1(x), draw an edge
from T2 to T1.
T1 T2
Since the graph is cyclic, we can conclude that it is not conflict serializable
Example2: check whether the given schedule S1: R1(x), R3(y),W2(y) R3(x), W2(x) is conflict
serializable or non conflict serializable
Sol: the given schedule S1: R1(x), R3(y),W2(y) R3(x), W2(x)
The graph for this schedule has the edges T1 → T2 (since r1(x) happens before w2(x)) and T3 → T2 (since r3(y) happens before w2(y), and r3(x) before w2(x)).
Since the graph is acyclic, we can conclude that the schedule is conflict serializable (an equivalent serial order is T1, T3, T2).
View Serializability:
View Serializability is used to check whether a given schedule is view serializable or not. To
check that, we need to check whether the given schedule is view equivalent to its serial
schedule.
View Equivalent:
Two schedules S1 and S2 are said to be view equivalent, if they satisfy all the following
conditions:
Initial Read: the first read operation on a data item is called initial read. An initial read of both
schedules must be the same. For example, there are two schedules S1 and S2. In schedule S1, if a
transaction T1 is reading the data item A, then in S2, transaction T1 should also read A.
Update Read: If in schedule S1, the transaction T1 is reading a data item updated by T2
then in schedule S2 also, T1 should read the value after the write operation of T2 on same
data item. For example, In schedule S1, T3 performs a read operation on A after the write
operation on A by T2 then in S2, T3 should read the A after T2 performs write on A
Above two schedules are not view equal because, in S1, T3 is reading A updated by T2
and in S2, T3 is reading A updated by T1.
Final Write: the final write operation on each data item must match in both schedules. For example, if in schedule S1 the final write operation on A is done by transaction T2, then in S2 also transaction T2 must perform the final write on A.
Above two schedules is view equal because Final write operation in S1 is done by T3 and in S2,
the final write operation is also done by T3.
Example: check whether the given non serial schedule (S1) is view serializable schedule or not.
T1 T2
S1
R(X)
W(X)
R(X)
W(X)
R(Y)
W(Y) R(Y)
W(Y)
Sol: The schedule (S1) is the given schedule, and schedule (S2) is some serial schedule for the
given schedule (S1). If the schedule (S1) is view equivalent to serial schedule (S2) then this
schedule is said to be view serializable schedule.
T1 T2
T1 T2
S1
S2
R(X) R(X)
W(X) W(X)
R(X) R(Y)
W(X) W(Y)
R(Y) R(X)
W(Y) W(X)
R(Y) R(Y)
W(Y) W(Y)
In the transaction process, a system usually allows executing more than one transaction
simultaneously. This process is called a concurrent execution.
Advantages: Advantages of concurrent execution of a transaction
Problems of concurrency
Several problems can occur when concurrent transactions are run in an uncontrolled manner,
such type of problems is known as concurrency problems.
There are following different types of problems or conflicts which occur due to concurrent
execution of transaction:
Lost update problem (Write – Write conflict)
This type of problem occurs when two transactions in database access the same data item and
have their operations in an interleaved manner that makes the value of some database item
incorrect.
If there are two transactions T1 and T2 accessing the same data item value and then update it,
then the second record overwrites the first record.
Example: Let's take the value of A as 100.
Concurrency Control
Concurrency control is the procedure in DBMS for managing simultaneous operations without
conflicting with each another. It is used to address such conflicts which mostly occur with a
multi-user system. It helps you to make sure that database transactions are performed
concurrently without violating the data integrity of respective databases.
A lock is a data variable which is associated with a data item. This lock signifies that operations
that can be performed on the data item. Locks help synchronize access to the database items by
concurrent transactions.
All lock requests are made to the concurrency-control manager. Transactions proceed only once
the lock request is granted.
There are two types of lock:
Shared Lock (S): It is also known as a read-only lock. With a shared lock, the data item can only be read by the transaction. It can be shared between transactions, because while a transaction holds a shared lock on a data item it cannot update that data item.
Exclusive Lock (X): With an exclusive lock, the data item can be both read and written by the transaction. This lock is exclusive, so multiple transactions cannot modify the same data simultaneously.
Two-Phase Locking (2PL): according to this protocol each transaction acquires and releases locks in two phases:
Growing phase: In the growing phase, a new lock on a data item may be acquired by the transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction may be released, but no new locks can be acquired.
The upgrading of lock( from S(a) to X(a) ) is allowed in Growing Phase and downgrading of lock
(from X(a) to S(a)) must be done in shrinking phase.
This protocol assures Serializability, if all the transactions are serialized in order of their lock
points (the point where a transaction acquires its final lock), but there are still some drawbacks.
The drawbacks of 2-PL:
Cascading Rollback is possible under 2-PL.
Deadlocks and Starvation is possible.
Conservative Two-Phase Locking: In this variant all the locks must be acquired before the operations of the transaction start. It avoids deadlock, but cascading rollback is not solved.
T T1 T2
1 Lock X(A)
2 Lock X(A)
3 Lock X(B)
4 Unlock (A)
5 Lock X(C)
6 Unlock (B)
7 Unlock (A)
8 Unlock (C)
Transaction T1 is in
Transaction T2 is in
1. Check the following condition whenever a transaction Ti issues a Read (X) operation:
Where,
TS(Ti) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the Read time-stamp of data-item X.
W_TS(X) denotes the Write time-stamp of data-item X.
Thomas' Write Rule
This rule states if TS(Ti) < W-timestamp(X), then the operation is rejected and Ti is rolled back.
Time-stamp ordering rules can be modified to make the schedule view serializable.
Instead of making Ti rolled back, the 'write' operation itself is ignored.
Example: there are four transactions are running concurrently, time stamp values of these
transactions are 10, 20, 30, and 40.
T1 T2 T3 T4
R1(A)
W1(A)
R3(A)
R2(A)
Abort
W4(A)
Sol: Initially Read time stamp and write time stamp of A is zero i.e. (RTS (A), and WTS (A) = 0)
Then it executes this read operation R1 (A) and updates the RTS (A) =10.
After execution read operation RTS (A) =10, WTS (A) =0.
Then transaction T1 issuing the write operation on data item A
Now TS (T1) =10, RTS (A) =10, WTS (A) =0.
TS (T1) > WTS (A) (that is 10 > 0)
So it execute this write operation, and updates the WTS (A) = 10
After execution write operation RTS (A) =10, WTS (A) = 10.
Then transaction T3 issuing read operation on data item A with time stamp value 30,
Now TS (T3) = 30, RTS (A) =10 and WTS (A) = 10
WTS (A) <=TS (T3) (that is 10 <= 30)
Then it executes this read operation on data item A and updates the RTS (A) = 30
After execution of read operation RTS (A) =30 and WTS (A) = 10
Then transaction T2 issuing read operation on data item A with time stamp value 20
Now TS (T2) =20, RTS (A) =30 and WTS (A) = 10
RTS (A) >TS (T2) (that is 30 >20)
Here it rejects the read operation of transaction T2.
Then transaction T4 issuing write operation on data item A with time stamp value 40
Now TS (T4) = 40, RTS (A) =30 and WTS (A) = 10
WTS (A) <=TS (T4) (that is 10 < 40)
Then it executes this write operation on data item A and updates the WTS (A) = 40
After execution of write operation
RTS (A) =30 and WTS (A) = 40
Validation Based Protocol:
This protocol is also called the optimistic concurrency control technique. It is used in DBMS to perform concurrency control without locking.
It is called optimistic because very little interference between transactions is expected, and therefore no checking is done while the transaction is being executed. Until the end of the transaction is reached, updates are not applied directly to the database; all updates are applied to local copies of the data items kept for the transaction.
At the end of transaction execution, a validation phase checks whether any of the transaction's updates violate serializability. If there is no violation of serializability, the transaction is committed and the database is updated; otherwise, the transaction is rolled back and then restarted.
The transaction is executed in the following three phases:
1. Read phase: In this phase, transaction T reads the values of the various data items and stores them in temporary local variables. It performs all of its write operations on these temporary variables, without updating the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the actual data to check whether serializability would be violated.
3. Write phase: If the transaction passes validation, the temporary results are written to the database; otherwise the transaction is rolled back.
Here each phase has the following different timestamps:
Start(Ti): It contains the time when Ti started its execution.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation
phase.
Finish(Ti): It contains the time when Ti finishes its write phase.
The idea behind optimistic concurrency control is to do all the checks at once; hence transaction execution proceeds with a minimum of overhead until the validation phase is reached. If there is not much interference among transactions, most of them will be validated successfully; otherwise their results will be discarded and they will be restarted later.
Validation based protocol is useful when conflicts are rare.
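A rough sketch (not from the notes) of the validation test, assuming each transaction is represented by a record carrying the three timestamps defined above plus a read set and a write set; the Txn tuple and its field names are invented for this example.

from collections import namedtuple

Txn = namedtuple("Txn", "start validation finish read_set write_set")

def validate(ti, older):
    # 'older' holds the transactions that entered validation before ti.
    for tj in older:
        if tj.finish < ti.start:
            continue                                   # tj finished before ti started
        if tj.finish < ti.validation and not (tj.write_set & ti.read_set):
            continue                                   # tj wrote nothing that ti read
        return False                                   # possible conflict: restart ti
    return True                                        # ti may enter its write phase

t1 = Txn(start=1, validation=4, finish=5, read_set={"A"}, write_set={"A"})
t2 = Txn(start=3, validation=6, finish=None, read_set={"B"}, write_set={"B"})
print(validate(t2, [t1]))   # True: t1 wrote only A, which t2 never read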
Multiple Granularity Locking:
Granularity is the size of the data item that is allowed to be locked. Multiple granularity means hierarchically breaking up the database into blocks, which can be represented graphically as a tree in which lower level nodes are nested within the higher level blocks (nodes). When a node is locked explicitly, all the nodes below it are locked implicitly.
In this scheme, locks are acquired from the root node down to the leaf nodes and released from the leaf nodes up to the root node. In this protocol, to lock a lower level node, no upper level node on the path to it should be locked by some other transaction in a conflicting mode; and to acquire a lock on an upper level node, its descendant nodes should not be locked by some other transaction in a conflicting mode.
For example, consider a tree that consists of four levels of nodes. The highest level represents the entire database. Below it are nodes of type area; the database consists of exactly these areas. Each area has child nodes called files, and no file can span more than one area. Finally, each file has child nodes called records. As before, a file consists of exactly those records that are its child nodes, and no record can be present in more than one file. Hence, the levels starting from the top are:
database
area
file
record
Figure – Multi Granularity tree Hierarchy
For example, if transaction Ti gets an explicit lock on file Fb in exclusive mode, then it has an
implicit lock in exclusive mode on all the records belonging to that file. It does not need to lock
the individual records of Fb explicitly. This is the main difference between Tree Based
Locking and Hierarchical locking for multiple granularities.
Intention Mode Locks –
To lock data items at different levels, three additional lock modes are used with multiple granularity, in addition to the S and X lock modes:
Intention-Shared (IS): indicates that explicit locking is being done at a lower level of the tree, but only with shared locks.
Intention-Exclusive (IX): indicates that explicit locking is being done at a lower level with exclusive or shared locks.
Shared & Intention-Exclusive (SIX): the sub-tree rooted at that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
The compatibility matrix for these lock modes is described below (yes = compatible, no = conflict):
        IS    IX    S     SIX   X
IS      yes   yes   yes   yes   no
IX      yes   yes   no    no    no
S       yes   no    yes   no    no
SIX     yes   no    no    no    no
X       no    no    no    no    no
The multiple-granularity locking protocol uses the intention lock modes to ensure serializability. It requires that a transaction Ti that attempts to lock a node must follow these rules:
1. Transaction Ti must follow the lock-compatibility matrix.
2. Transaction Ti must lock the root of the tree first, and it can lock it in any mode.
3. Transaction Ti can lock a node in S or IS mode only if Ti currently has the parent of the
node locked in either IX or IS mode.
4. Transaction Ti can lock a node in X, SIX, or IX mode only if Ti currently has the parent
of the node locked in either IX or SIX mode.
5. Transaction Ti can lock a node only if Ti has not previously unlocked any node (i.e., Ti is
two phase).
6. Transaction Ti can unlock a node only if Ti currently has none of the children of the node
locked.
Example: In multiple granularity locking, transactions T1 and T2 are running concurrently. Transaction T1 wants to read the record ra2, and transaction T2 wants to modify the record ra3.
T1 reads ra2: in order to read ra2, T1 needs to obtain a shared (S) lock on it. Before doing so, it traverses from the root down to that record and acquires an IS lock on each upper level node on the path (database, area, file), provided no conflicting lock is already held on those nodes.
T2 modifies ra3: to modify the record, T2 needs an exclusive (X) lock on it. To obtain the exclusive lock on ra3, it must acquire IX locks on the upper level nodes. Here the upper level nodes are already locked in intention-shared (IS) mode by T1, and IX is compatible with IS, so T2 can acquire the IX locks, then the X lock on ra3, and modify the record.
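The example can be sketched in Python as follows; the compatibility matrix is the standard one for these five modes, while the function names below are made up for this illustration.

# Lock compatibility matrix for multiple granularity locking.
COMPATIBLE = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested, held_modes):
    # A lock can be granted only if it is compatible with every mode
    # currently held on the same node by other transactions.
    return all(COMPATIBLE[requested][held] for held in held_modes)

def required_parent_modes(requested):
    # Rules 3 and 4 of the protocol: S/IS need IS or IX on the parent,
    # while X/SIX/IX need IX or SIX on the parent.
    return {"S": {"IS", "IX"}, "IS": {"IS", "IX"}}.get(requested, {"IX", "SIX"})

print(can_grant("IX", ["IS"]))   # True  -> T2 may lock the file in IX while T1 holds IS
print(can_grant("X",  ["IS"]))   # False -> an X lock must wait while any IS lock is held
print(required_parent_modes("X"))  # {'IX', 'SIX'}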
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations, which
are atomic in nature. But according to ACID properties of DBMS, atomicity of transactions as a
whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions, which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled
back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe.
Log-based recovery works as follows −
The log file is kept on a stable storage media.
When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
When the transaction modifies an item X, it writes a log record as follows −
<Tn, X, V1, V2>
It records that Tn has changed the value of X from V1 to V2.
When more than one transaction are being executed in parallel, the logs are interleaved. At the
time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and stored
permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it
puts the transaction in undo-list.
All the transactions in the undo-list are then undone, i.e. their changes are rolled back using the old values recorded in the log. All the transactions in the redo-list are redone, i.e. their changes are re-applied using the new values recorded in the log, so that the database reflects every committed transaction.
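A minimal sketch of how the recovery system builds the redo-list and undo-list from the log; checkpoints are ignored for brevity, and the tuple format of the log records is an assumption made for this example.

def classify(log_records):
    # log_records is the stable log read oldest-first, e.g.
    # ("start", "T1"), ("update", "T1", "X", old, new), ("commit", "T1"), ...
    started, finished = set(), set()
    for rec in log_records:
        kind, txn = rec[0], rec[1]
        if kind == "start":
            started.add(txn)
        elif kind in ("commit", "abort"):
            finished.add(txn)
    redo_list = [t for t in started if t in finished]       # re-apply their updates
    undo_list = [t for t in started if t not in finished]   # roll their updates back
    return redo_list, undo_list

log = [("start", "T1"), ("update", "T1", "X", 5, 9), ("commit", "T1"),
       ("start", "T2"), ("update", "T2", "Y", 3, 4)]
print(classify(log))   # (['T1'], ['T2'])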
UNIT – V
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary Indexes, Index data
Structures, Hash Based Indexing, Tree base Indexing, Comparison of File Organizations, Indexes and Performance Tuning,
Intuitions for tree Indexes, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic Index Structure.
Primary Storage
It is the storage area that offers quick access to the stored data. We also know primary storage as volatile storage, because this type of memory does not store data permanently: as soon as the system suffers a power cut or a crash, the data is lost. Main memory and cache are the types of primary storage.
o Main Memory: It is the memory that operates on the data available on the storage medium. The main memory handles each instruction of the computer machine. This type of memory can store gigabytes of data on a system, but it is usually too small (and too costly) to hold an entire large database. The main memory also loses its whole content if the system shuts down because of a power failure or other reasons.
o Cache: It is one of the costliest storage media, but on the other hand it is the fastest one. A cache is a tiny storage medium which is usually maintained by the computer hardware. While designing algorithms and query processors for the data structures, designers take cache effects into account.
Secondary Storage
Secondary storage it is also called as online storage. It is the storage area that allows the user to
save and store data permanently. This type of memory does not lose the data due to any power
failure or system crash. That's why we also call it non-volatile storage.
There are some commonly described secondary storage media which are available in almost every
type of computer system:
o Flash Memory: Flash memory stores data in USB (Universal Serial Bus) keys which are plugged into the USB slots of a computer system. These USB keys help transfer data to a computer system, although their storage capacities vary. Unlike the main memory, flash memory is non-volatile, so stored data is not lost due to a power cut or other reasons. This type of storage is commonly used in server systems for caching frequently used data; this gives the system high performance, and flash can hold larger amounts of data than the main memory.
o Magnetic Disk Storage: This type of storage media is also known as online storage media. A magnetic disk is used for storing data for a long time, and it is capable of storing an entire database. It is the responsibility of the computer system to bring the data from disk into main memory for access; if the system performs any operation on the data, the modified data must be written back to the disk. The great advantage of a magnetic disk is that its data is not affected by a system crash or power failure, but a disk failure can easily ruin or destroy the stored data.
Tertiary Storage
It is the storage type that is external from the computer system. It has the slowest speed. But it is
capable of storing a large amount of data. It is also known as Offline storage. Tertiary storage is
generally used for data backup. There are following tertiary storage devices available:
o Optical Storage: An optical storage can store megabytes or gigabytes of data. A Compact Disk
(CD) can store 700 megabytes of data with a playtime of around 80 minutes. On the other hand, a
Digital Video Disk or a DVD can store 4.7 or 8.5 gigabytes of data on each side of the disk.
o Tape Storage: It is a cheaper storage medium than disks. Generally, tapes are used for archiving or backing up data. Tape provides slow access because data is accessed sequentially from the start; thus tape storage is also known as sequential-access storage. Disk storage is known as direct-access storage, as we can directly access the data from any location on the disk.
Hard disk drives are the most common secondary storage devices in present computer systems.
These are called magnetic disks because they use the concept of magnetization to store
information. Hard disks consist of metal disks coated with magnetizable material. These disks are
placed vertically on a spindle. A read/write head moves in between the disks and is used to
magnetize or de-magnetize the spot under it. A magnetized spot can be recognized as 0 (zero) or
1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk plate has
many concentric circles on it, called tracks. Every track is further divided into sectors. A sector
on a hard disk typically stores 512 bytes of data.
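A small back-of-the-envelope calculation, using made-up geometry numbers, shows how the raw capacity of such a disk follows from the number of surfaces, tracks, sectors and the 512-byte sector size:

# Hypothetical disk geometry (the numbers are invented for illustration).
platter_surfaces = 8            # recording surfaces
tracks_per_surface = 100_000    # concentric tracks on each surface
sectors_per_track = 500         # sectors on each track
bytes_per_sector = 512          # typical sector size mentioned above

capacity_bytes = platter_surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(capacity_bytes / 10**9, "GB")   # ~204.8 GB for this made-up geometry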
RAID stands for Redundant Array of Independent Disks. It is a technology which is used to connect multiple secondary storage devices for increased performance, data redundancy or both. It gives you the ability to survive one or more drive failures, depending upon the RAID level used.
It consists of an array of disks in which multiple disks are connected to achieve different goals.
o Level 0: Non-redundant: RAID level 0 provides data striping, i.e., the data is split into blocks that are placed across multiple disks.
o This level doesn't provide fault tolerance but increases the system performance. It doesn't contain any error detection mechanism.
o In this level, failure of any one disk results in complete data loss of the array.
For example, a RAID Level 0 solution might consist of only four data disks. Independent of the number of data disks, the effective space utilization of a RAID Level 0 system is always 100 percent.
Level 1: Mirrored:
This level is called mirroring of data as it copies the data from drive 1 to drive 2. It provides 100%
redundancy in case of a failure. Instead of having one copy of the data, two identical copies of the
data on two different disks are maintained. This type of redundancy is often called mirroring.
Only half space of the drive is used to store the data. The other half of drive is just a mirror to the
already stored data.
Level 5:
RAID 5 writes whole data blocks onto different disks, but the parity bits generated for data block
stripe are distributed among all the data disks rather than storing them on a different dedicated
disk.
Level 6: RAID 6 is an extension of level 5. In this level, two independent parities are generated
and stored in distributed fashion among multiple disks. Two parities provide additional fault
tolerance. This level requires at least four disk drives to implement RAID.
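The parity idea used by RAID levels 5 and 6 can be illustrated with a short sketch: the parity block of a stripe is the bitwise XOR of its data blocks, so any single lost block can be rebuilt from the surviving blocks. The byte strings below are arbitrary example data.

def parity(blocks):
    # Parity of a stripe = byte-wise XOR of all blocks in the stripe.
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0, d1, d2 = b"\x0f\x0f", b"\xf0\x00", b"\x33\x3c"
p = parity([d0, d1, d2])

# If one disk (say the one holding d1) fails, its block is rebuilt by
# XOR-ing the surviving data blocks with the parity block.
rebuilt_d1 = parity([d0, d2, p])
print(rebuilt_d1 == d1)    # True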
File Organization:
A database consists of a huge amount of data. The data is grouped within tables in an RDBMS, and each table has related records. A user sees the data as stored in the form of tables, but in reality this huge amount of data is stored on physical storage in the form of files.
A file is a named collection of related information that is recorded on secondary storage such as magnetic disks, magnetic tapes and optical disks.
File Organization refers to the logical relationships among various records that constitute the
file, particularly with respect to the means of identification and access to any specific record. In
simple terms, Storing the files in certain order is called file Organization. File Structure refers to
the format of the label and data blocks and of any logical control record.
Sequential File Organization:
This method is the easiest method of file organization. In this method, files are stored sequentially. It can be implemented in two ways:
1. Pile file method: Suppose we have records R1, R3 and so on up to R9 and R8 in a sequence (a record is nothing but a row in a table). If we want to insert a new record R2 in the sequence, it is simply placed at the end of the file, in the order of arrival.
2. Sorted file method: Suppose there is a pre-existing sorted sequence of records R1, R3 and so on up to R6 and R7. If a new record R2 has to be inserted, it is first inserted at the end of the file and then the sequence is sorted again.
Pros of sequential file organization
o It is a fast and efficient method for handling huge amounts of data.
o In this method, files can be easily stored on cheaper storage media like magnetic tapes.
o It is simple in design and requires little effort to store the data.
o This method is used when most of the records have to be accessed, as in grade calculation of students, generating salary slips, etc.
o This method is also used for report generation and statistical calculations.
Heap File Organization:
In this method, records are placed in any data block that has free space; there is no ordering of records. Suppose we have five records R1, R3, R6, R4 and R5 in a heap file and we want to insert a new record R2. If data block 3 is full, then R2 will be inserted into any data block selected by the DBMS, say data block 1.
If we want to search, update or delete data in a heap file organization, we need to traverse the data from the start of the file until we find the requested record.
If the database is very large, then searching, updating or deleting a record will be time-consuming, because there is no sorting or ordering of records; we may need to check all the data until we reach the requested record.
Hash File Organization uses the computation of a hash function on some fields of the record. The hash function's output determines the location of the disk block where the record is to be placed.
When a record has to be retrieved using the hash key columns, the address is generated from the hash function and the whole record is read from that address. In the same way, when a new record has to be inserted, the address is generated using the hash key and the record is inserted directly at that address. The same process is applied in the case of delete and update.
In this method, there is no effort spent on searching and sorting the entire file; each record is stored at a location computed from its key.
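A toy sketch of hash file organization, in which a hypothetical block_for function plays the role of the hash function that maps a key to a block address:

# A toy hash file: each record is placed in the block chosen by a hash of its key.
NUM_BLOCKS = 8
blocks = [[] for _ in range(NUM_BLOCKS)]

def block_for(key):
    return hash(key) % NUM_BLOCKS        # the hash function decides the block address

def insert(record):                      # record is assumed to be a dict with a "key" field
    blocks[block_for(record["key"])].append(record)

def fetch(key):
    return [r for r in blocks[block_for(key)] if r["key"] == key]

insert({"key": 101, "name": "Ravi"})
insert({"key": 205, "name": "Ajit"})
print(fetch(205))    # the whole record is retrieved directly from its block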
Indexing
If the records in the file are in sorted order, then searching will become very fast. But, in
most of the cases, records are placed in the file in the order in which they are inserted, so new
records are inserted at the end of the file. It indicates, the records are not in sorted order. In
order to make searching faster in the files with unsorted records, indexing is used.
Indexing is a data structure technique which allows you to quickly retrieve records from a database file. An index is a small table having only two columns. The first column contains a copy of the primary or candidate key of the table. The second column contains a set of disk block addresses where the record with that specific key value is stored.
Indexing in DBMS can be of the following types:
Indexing
i. Primary Index
If the index is created by using the primary key of the table, then it is known as primary
indexing.
As primary keys are unique and are stored in a sorted manner, the performance of the
searching operation is quite efficient.
The primary index can be classified into two types: dense index and sparse index.
Dense index
If every record in the table has one index entry in the index table, then it is called dense
index.
In this, the number of records (rows) in the index table is same as the number of records
(rows) in the main table.
As every record has one index entry, searching becomes faster.
Index     Main table
TS        TS   Hyderabad    KCR
AP        AP   Amaravathi   Jagan
TN        TN   Madras       Palani Swamy
MH        MH   Bombay       Thackray
Sparse index
If only few records in the table have index entries in the index table, then it is called
sparse index.
In this, the number of records (rows) in the index table is less than the number of records
(rows) in the main table.
As not all the records have index entries, searching becomes slower for the records that do not have index entries.
Index     Main table
TS        TS   Hyderabad    KCR
TN        AP   Amaravathi   Jagan
MH        TN   Madras       Palani Swamy
          MH   Bombay       Thackray
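The difference between a dense and a sparse index can be sketched as follows; the state/capital table and the choice of indexing every second record are assumptions made for the example.

import bisect

# Main table, kept sorted on the key column (as a primary index assumes).
table = [("AP", "Amaravathi"), ("MH", "Bombay"), ("TN", "Madras"), ("TS", "Hyderabad")]

# Dense index: one entry per record.
dense = {key: pos for pos, (key, _) in enumerate(table)}

# Sparse index: one entry for every second record (acting as a block anchor).
sparse_keys = [table[i][0] for i in range(0, len(table), 2)]    # ['AP', 'TN']
sparse_pos = list(range(0, len(table), 2))                      # [0, 2]

def lookup_sparse(key):
    i = max(bisect.bisect_right(sparse_keys, key) - 1, 0)       # last index entry <= key
    for pos in range(sparse_pos[i], len(table)):                # short scan from there
        if table[pos][0] == key:
            return table[pos]
    return None

print(table[dense["TS"]])     # ('TS', 'Hyderabad') found directly via the dense index
print(lookup_sparse("MH"))    # ('MH', 'Bombay') found by scanning from the 'AP' entry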
ii. Secondary Index
In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced.
It contains two levels. In the first level, each record in the main table has one entry in the first-level index table.
The index entries in the first-level index table are divided into different groups. For each group, one index entry is created and added to the second-level index table.
Multi-level Index: When the main table becomes too large, creating a second level of index improves the search process. If the search process is still slow, we can add one more level of indexing, and so on. This type of indexing is called a multi-level index.
iii. Clustering Index
Sometimes the index is created on non-primary key columns which may not be unique
for each record.
In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
The records which have similar characteristics are grouped together, and indexes are created for these groups.
Example: Consider a college that contains many students in each department. All the students belonging to the same Dept_ID are grouped together and treated as a single cluster. One index pointer points to each cluster as a whole; the index pointer points to the first record in each cluster. Here Dept_ID is a non-unique key.
In the above diagram we can see that indexes are created for each department in the index file. In the data block, the students of each department are grouped together to form a cluster. The address in the index file points to the beginning of each cluster.
Hashing is a technique to directly search the location of desired data on the disk without
using index structure. Hash function is a function which takes a piece of data ( key) as input and
produces a hash value as output which maps the data to a particular location in the hash table.
The concept of hashing and hash table is shown in the below figure
Static Hashing:
In static hashing, the hash function produces only a fixed number of hash values. For example, consider the hash function
f(x) = x mod 7
For any value of x, the above function produces one of the hash values from {0, 1, 2, 3, 4, 5, 6}. It means static hashing maps search-key values to a fixed set of bucket addresses.
i. Open Addressing:
Open addressing is a collision resolving technique which stores all the keys inside the
hash table. No key is stored outside the hash table. Techniques used for open addressing are:
o Linear Probing
o Quadratic Probing
o Double Hashing
Linear Probing:
In linear probing, when there is a collision, we scan forwards for the next empty slot and place the key's record there. If we reach the last slot, we continue from the beginning of the table.
Example: Consider figure 1. When we try to insert 23, its hash value is 2, but the slot for hash value 2 is not empty. We then move to the next slot (hash value 3); it is also full, so we move once again to the next slot, with hash value 4. As it is empty, 23 is stored there. This is shown in the below diagram.
Hash value   Data record
0            21*
3            10*
4            23*
5            12*
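A small sketch of linear probing with f(x) = x mod 7. The fill order, and the assumption that key 16 occupies slot 2 (which the figure above does not show), are chosen only so that the probe sequence for 23 matches the example.

def insert_linear_probing(table, key):
    n = len(table)
    home = key % n                        # initial hash value, f(x) = x mod n
    for step in range(n):
        probe = (home + step) % n         # scan forward, wrapping around to the start
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("hash table is full")

table = [None] * 7
for key in (21, 10, 12, 16):              # 16 is assumed to occupy slot 2
    insert_linear_probing(table, key)
print(insert_linear_probing(table, 23))   # slots 2 and 3 are taken, so 23 lands in slot 4
print(table)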
Quadratic Probing:
In quadratic probing, when a collision occurs, a new hash value is computed by adding successive squares to the original hash value until an open slot is found. If there is a collision, the following hash function is used: h(x) = (f(x) + i^2) mod n, where i = 1, 2, 3, 4, ... and f(x) is the initial hash value.
Example: Consider figure 1. When we try to insert 23, its hash value is 2, but the slot with hash value 2 is not empty. We then compute a new hash value as (2 + 1^2) mod 7 = 3; that slot is also full, so we compute another new hash value as (2 + 2^2) mod 7 = 6. As it is empty, 23 is stored there. This is shown in the below diagram.
Hash value   Data record
0            21*
3            10*
5            12*
6            23*
Double Hashing
In double hashing, there are two hash functions. The second hash function is used to provide an offset value in case the first function causes a collision. The following function is an example of double hashing: (firstHash(key) + i * secondHash(key)) % tableSize, with i = 1, 2, 3, ...
Separate Chaining (Closed Addressing):
To handle collisions, this technique creates a linked list at the slot for which the collision occurs. The new key is then inserted into that linked list. These linked lists hanging off the slots look like chains, so the technique is called separate chaining. It is also called closed addressing.
Example: Inserting 10, 21, 16, 12, 23, 19, 28, 30 into a hash table with f(x) = x mod 7:
f(10) = 3, f(21) = 0, f(16) = 2, f(12) = 5, f(23) = 2, f(19) = 5, f(28) = 0, f(30) = 2.
So slot 0 holds the chain 21 → 28, slot 2 holds 16 → 23 → 30, slot 3 holds 10, and slot 5 holds 12 → 19.
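A short sketch of separate chaining for the same key sequence, using Python lists as the chains:

from collections import defaultdict

def chaining_insert(chains, key, n=7):
    chains[key % n].append(key)          # a collision simply extends the slot's chain

chains = defaultdict(list)
for key in (10, 21, 16, 12, 23, 19, 28, 30):
    chaining_insert(chains, key)
print(dict(chains))
# {3: [10], 0: [21, 28], 2: [16, 23, 30], 5: [12, 19]}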
Dynamic Hashing:
The following are dynamic hashing techniques:
o Extendible hashing
o Linear hashing
i. Extendible hashing
In extendible hashing, a separate directory of pointers to buckets is used. The number of bits used in the directory is called the global depth (gd), and the number of entries in the directory is 2^gd. The number of bits actually used to place the entries of a particular bucket is called the local depth (ld) of that bucket. The hash function uses the last few binary bits of the key to find the bucket. If a bucket overflows, it is split; and if the local depth becomes greater than the global depth, then the directory doubles in size. It is one form of dynamic hashing.
Example: Let global depth (gd) = 2, which means the directory contains four entries. Let the local depth (ld) of each bucket be 2, which means the last two bits of the key decide the bucket in which an entry is placed. Let the capacity of each bucket be four. Let us insert 21, 15, 28, 17, 16, 13, 19, 12, 10, 24, 25 and 11.
21 = 10101    19 = 10011
15 = 01111    12 = 01100
28 = 11100    10 = 01010
17 = 10001    24 = 11000
16 = 10000    25 = 11001
13 = 01101    11 = 01011
Directory (global depth = 2):
00 → Bucket A (local depth 2): 28* 16* 12* 24*
01 → Bucket B (local depth 2): 21* 17* 25* 13*
10 → Bucket C (local depth 2): 10*
11 → Bucket D (local depth 2): 15* 19* 11*
Figure 6.1: Extendible hashing example
Now, let us consider insertion of the data entry 20* (binary 10100). Looking at directory element 00, we are led to bucket A, which is already full. So, we have to split the bucket by allocating a new bucket and redistributing the contents (including the new entry to be inserted) across the old bucket and its 'split image' (the new bucket). To redistribute entries across the old bucket and its split image, we consider the last three bits: the last two bits are 00, indicating a data entry that belongs to one of these two buckets, and the third bit discriminates between the buckets. That is, if a key's last three bits are 000, then it belongs to bucket A, and if the last three bits are 100, then the key belongs to bucket A2. As we are now using three bits for A and A2, the local depth of these buckets becomes 3. This is illustrated in Figure 6.2 below.
Bucket A  (local depth 3): 16* 24*
Bucket A2 (local depth 3): 28* 12* 20*
Bucket B  (local depth 2): 21* 17* 25* 13*
Bucket C  (local depth 2): 10*
Bucket D  (local depth 2): 15* 19* 11*
Directory (global depth = 2): 00, 01, 10, 11
Figure 6.2: After inserting 20 and splitting Bucket A
After the split, buckets A and A2 have local depth greater than the global depth (3 > 2), so the directory is doubled and three bits are used as the global depth. As buckets A and A2 have local depth 3, they have separate pointers from the directory. Buckets B, C and D still use only local depth 2, so they have two directory pointers each. This is shown in the below diagram.
Directory (global depth = 3):
000 → Bucket A  (local depth 3): 16* 24*
100 → Bucket A2 (local depth 3): 28* 12* 20*
001, 101 → Bucket B (local depth 2): 21* 17* 25* 13*
010, 110 → Bucket C (local depth 2): 10*
011, 111 → Bucket D (local depth 2): 15* 19* 11*
Key Observations:
A bucket will have more than one directory pointer pointing to it if its local depth is less than the global depth.
When an overflow condition occurs in a bucket, all the entries in the bucket (plus the new entry) are rehashed with a new, larger local depth.
If the local depth of the overflowing bucket is equal to the global depth, then the directory is doubled and the global depth is incremented by 1; otherwise only the bucket is split.
The size (capacity) of a bucket cannot be changed after the data insertion process begins.
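A compact sketch of extendible hashing along the lines of the example above. It assumes integer keys, uses the last gd bits of the key as the hash, and does not handle the corner case where all keys of a split still land in one bucket.

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.gd = 2                                          # global depth
        self.capacity = capacity                             # entries per bucket
        # start with one bucket per directory entry, each with local depth 2
        self.directory = [{"ld": 2, "keys": []} for _ in range(2 ** self.gd)]

    def _index(self, key):
        return key & ((1 << self.gd) - 1)                    # last gd bits of the key

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket["keys"]) < self.capacity:
            bucket["keys"].append(key)
            return
        if bucket["ld"] == self.gd:                          # must double the directory
            self.gd += 1
            self.directory = self.directory + self.directory
        bucket["ld"] += 1                                    # split the bucket
        image = {"ld": bucket["ld"], "keys": []}
        old_keys, bucket["keys"] = bucket["keys"] + [key], []
        bit = 1 << (bucket["ld"] - 1)                        # the discriminating bit
        for k in old_keys:
            (image if k & bit else bucket)["keys"].append(k)
        for i in range(len(self.directory)):                 # repoint half the entries
            if self.directory[i] is bucket and (i & bit):
                self.directory[i] = image

eh = ExtendibleHash()
for k in (21, 15, 28, 17, 16, 13, 19, 12, 10, 24, 25, 11):   # Figure 6.1
    eh.insert(k)
eh.insert(20)      # overflows bucket 00: bucket A is split and the directory doubles
print(eh.gd)       # 3, as in the figures above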
ii. Linear Hashing
Example: Let N = 4, so we use 4 buckets, and the hash function h0(key) = key % 4 is used to map any key into one of the four buckets. Let us initially insert 4, 13, 19, 25, 14, 24, 15, 18, 23, 11, 16, 12 and 10. This gives:
Bucket# h0 Primary pages
0 00 4* 24* 16* 12*
1 01 13* 25*
2 10 14* 18* 10*
3 11 19* 15* 23* 11*
Next, when 27 is inserted, an overflow occurs in bucket 3. So, bucket 0 (the first bucket) is split and its content is distributed between bucket 0 and a new bucket (bucket 4), using h1(key) = key % 8. This is shown in the below figure.
Bucket# h1 h0 Primary pages Overflow pages
0 000 00 24* 16*
1 001 01 13* 25*
2 010 10 14* 18* 10*
3 011 11 19* 15* 23* 11* 27*
4 100 00 4* 12*
Next, when 30, 31 and 34 are inserted, an overflow occurs in bucket 2. So, bucket 1 is split and its content is distributed between bucket 1 and a new bucket (bucket 5). This is shown in the below figure.
Bucket# h1 h0 Primary pages Overflow pages
0 000 00 24* 16*
1 001 01 13*
2 010 10 14* 18* 10* 30* 34*
3 011 11 19* 15* 23* 11* 27* 31*
4 100 00 4* 12*
5 101 01 25*
When 32, 35, 40 and 48 are inserted, an overflow occurs in bucket 0. So, bucket 2 is split and its content is distributed between bucket 2 and a new bucket (bucket 6). This is shown in the below figure.
Bucket# h1 h0 Primary pages Overflow pages
0 000 00 24* 16* 32* 40* 48*
1 001 01 13*
2 010 10 18* 10* 34*
3 011 11 19* 15* 23* 11* 27* 31* 35*
4 100 00 4* 12*
5 101 01 25*
6 110 10 14* 30*
When 26, 20 and 42 are inserted, an overflow occurs in bucket 2. So, bucket 3 is split and its content is distributed between bucket 3 and a new bucket (bucket 7). This is shown in the below figure.
Bucket# h1 h0 Primary pages Overflow pages
0 000 00 24* 16* 32* 40* 48*
1 001 01 13*
2 010 10 18* 10* 34* 26* 42*
3 011 11 19* 11* 27* 35*
4 100 00 4* 12* 20*
5 101 01 25*
6 110 10 14* 30*
7 111 11 15* 23* 31*
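A compact sketch of linear hashing matching the example above. Overflow pages are not modelled separately (a bucket is just a growing list), and a split of the "next" bucket is triggered whenever any bucket exceeds its capacity.

class LinearHash:
    def __init__(self, n=4, capacity=4):
        self.n = n                          # number of buckets at level 0
        self.level = 0                      # completed rounds of splitting
        self.next = 0                       # next bucket due to be split
        self.capacity = capacity
        self.buckets = [[] for _ in range(n)]

    def _index(self, key):
        i = key % (self.n * 2 ** self.level)            # h_level
        if i < self.next:                               # already split in this round
            i = key % (self.n * 2 ** (self.level + 1))  # use h_(level+1) instead
        return i

    def insert(self, key):
        idx = self._index(key)
        self.buckets[idx].append(key)
        if len(self.buckets[idx]) > self.capacity:      # overflow triggers a split
            self._split()                               # of bucket 'next' (not idx)

    def _split(self):
        new_mod = self.n * 2 ** (self.level + 1)
        self.buckets.append([])                         # the new split-image bucket
        old, self.buckets[self.next] = self.buckets[self.next], []
        for k in old:
            self.buckets[k % new_mod].append(k)         # redistribute with h_(level+1)
        self.next += 1
        if self.next == self.n * 2 ** self.level:       # the whole round is finished
            self.level, self.next = self.level + 1, 0

lh = LinearHash()
for k in (4, 13, 19, 25, 14, 24, 15, 18, 23, 11, 16, 12, 10, 27):
    lh.insert(k)
print(len(lh.buckets), lh.next)   # 5 buckets and next = 1 after the first split, as above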
We can use tree-like structures as indexes as well. For example, a binary search tree can be used as an index. If we want to find a particular record using a binary search tree, we have the added advantage of the binary search procedure, which makes searching faster. A binary search tree can be considered a 2-way search tree, because it has two pointers in each of its nodes and can therefore guide the search in two distinct directions. Remember that for a node storing 2 pointers, the number of values stored in that node is one less than the number of pointers, i.e. each node contains 1 value.
The above concept can be further expanded with the notion of the m-way search tree, where m represents the number of pointers in a particular node. If m = 3, then each node of the search tree contains up to 3 pointers and up to 2 values.
We mainly use two tree-structured indexes in a DBMS. They are ISAM (Indexed Sequential Access Method) trees and B+ trees. The following is an ISAM example.
(Figure: an ISAM tree whose leaf pages hold the entries 10 20 | 23 27 | 31 35 | 42 46 | 59 61 | 68 71, with index entries 23, 42, 59 and 68 in the non-leaf pages.)
After inserting 24, 33, 36, and 39 in the above tree, it looks like
(Figure: the ISAM tree after the insertions; the index pages are unchanged and the new entries 24, 33, 36 and 39 are placed in overflow pages attached to the appropriate leaf pages.)
Deletion: From the above figure, after deleting 42, 71, 24 and 36
(Figure: the same tree after deleting 42, 71, 24 and 36; these entries are removed from the leaf and overflow pages, while the index pages remain unchanged.)
B+ Tree:
A B+ tree is a balanced multi-way search tree (a variant of the B tree) which allows efficient insertion, deletion and search operations. It is used to implement indexing in a DBMS. In a B+ tree, data entries are stored only in the leaf nodes, while the internal nodes store only search key values used to direct the search.
Example: Searching for 35 in the below given B+ tree. The search path is shown in red color.
18
11 31 64
8 15 23 42 59 68
2 5 8 9 11 12 15 16 18 20 23 27 31 35 42 46 59 61 64 66 68 71
B+ tree Insertion:
1. Apply the search operation on the B+ tree and find the leaf node where the new value has to be inserted.
2. If the leaf node is not full, then insert the value into the leaf node.
3. If the leaf node is full, then
a. Split that leaf node, including the newly inserted value, into two nodes such that each contains half of the values (in case of an odd count, the 2nd node contains the extra value).
b. Insert the smallest value from the new right leaf node (2nd node) into the parent node, and add pointers from these new leaf nodes to their parent node.
c. If the parent is full, split it too. Add the middle key of this parent node (in case of an even count, the 1st value from the 2nd part) to its parent node.
d. Repeat until a parent is found that does not need to split.
4. If the root splits, create a new root which has one key and two pointers.
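Steps 3(a) and 3(b) of the insertion procedure can be sketched as a small helper; the function name and the list representation of a leaf are assumptions made for this illustration.

import bisect

def insert_into_leaf(leaf, value, capacity):
    # leaf is a sorted list of keys. Returns (leaf, None, None) if no split is
    # needed, otherwise (left, right, separator), where separator is the key
    # that is copied up into the parent node (rule 3b above).
    bisect.insort(leaf, value)
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2             # on an odd count the right node keeps the extra key
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]

print(insert_into_leaf([1, 3, 5, 7], 9, capacity=4))
# ([1, 3], [5, 7, 9], 5)  -> 5 is copied up into the parent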
(Figure sequence: the B+ tree after inserting 1, 5, 3, 7, 9, 2, 4, 6, 8 and 10, one value at a time. Each time a leaf overflows it is split into two leaves and the smallest key of the new right leaf is copied up into the parent, following the steps above; when the root itself overflows, it is split and a new root with a single key is created, increasing the height of the tree.)
B+ tree Deletion
Identify the leaf node L from which the deletion should take place.
Remove the data value to be deleted from the leaf node L.
If L still meets the "half full" criterion, then we are done.
If L does not meet the "half full" criterion, then
o If L's right sibling can give a data value, then move the smallest value in the right sibling to L (after giving away a data value, the right sibling must still satisfy the half-full criterion; otherwise it should not give).
o Else, if L's left sibling can give a data value, then move the largest value in the left sibling to L (after giving away a data value, the left sibling must still satisfy the half-full criterion; otherwise it should not give).
o Else, merge L with a sibling.
o If any internal node (including the root) contains a key value equal to the deleted value, delete that value and replace it with its successor. This change may propagate up to the root (if the changes propagate up to the root, the tree height may decrease).
5 14 24 33
2 3 5 7 14 16 19 20 22 24 27 29 33 34 38 39
Delete 19 : Half full criteria is satisfied even after deleting 19, so just delete 19 from leaf node
19
5 14 24 33
2 3 5 7 14 16 20 22 24 27 29 33 34 38 39
Delete 20: Half full criteria is not satisfied after deleting 20, so bring 24 from its right sibling
and change key values in the internal nodes.
19
5 14 27 33
2 3 5 7 14 16 22 24 27 29 33 34 38 39
Delete 24: The half-full criterion is not satisfied after deleting 24, and borrowing a value from its siblings is also not possible. Therefore merge it with its right sibling and change the key values in the internal nodes.
19
5 14 33
2 3 5 7 14 16 22 27 29 33 34 38 39
Delete 5: The half-full criterion is not satisfied after deleting 5, and borrowing a value from its siblings is also not possible. Therefore merge it with its left sibling (we could also merge with the right) and change the key values in the internal nodes.
19
14 33
2 3 7 14 16 22 27 29 33 34 38 39
Delete 7: Half full criteria is satisfied even after deleting 7, so just delete 7 from leaf node.
17
14 33
2 3 14 16 22 27 29 33 34 38 39
Delete 2: The half-full criterion is not satisfied after deleting 2, and borrowing a value from its siblings is also not possible. Therefore merge it with its right sibling and change the key values in the internal nodes.
22 33
3 14 16 22 27 29 33 34 38 39
While indexes can improve query execution speed, the price we pay is index maintenance. Update and insert operations need to update the index with new data, which means that writes slow down slightly with each index we add to a table. We also need to monitor index usage and identify when an existing index is no longer needed, so that we can keep our indexing relevant and trim and avoid wasting disk space and I/O on write operations to unnecessary indexes. To improve the performance of the system, we need to do the following: