DBMS NOTES

Collections of data

Data may be collected, manipulated and retrieved in various ways:

 plain text editor - simple editing and retrieval


 word processor - adds tables and simple calculations
 spreadsheet programme - adds more sophisticated calculations
 database management system (DBMS) - adds formats, structures, rules, ...

Reasons for a DBMS

A DBMS is a software package for defining and managing a database.

A ‘real’ database includes

 definitions of
o field names
o data formats (text? binary? integer? etc.)
o record structures (fixed-length? pointers? field order, etc.)
o file structures (sequential? indexed? etc.)
 rules for validating and manipulating data
 data

A DBMS provides Data independence.

One must be able to change storage mechanisms and formats without having to modify all
application programmes. For example:

 method of representation of alphanumeric data (e.g., changing date format to avoid Y2000
problem)
 method of representation of numeric data (e.g., integer vs. long integer or floating-point)
 units (e.g., metric vs. furlongs)
 record structure
 file structure (sequential, sorted, indexed, etc.)

Problems in conventional file system and need for Database Systems

1. To see why database management systems are necessary, let's look at a typical "file-processing
system" supported by a conventional operating system.

The application is a savings bank:

o Savings account and customer records are kept in permanent system files.
o Application programs are written to manipulate files to perform the following tasks:
 Debit or credit an account.

 Add a new account.
 Find an account balance.
 Generate monthly statements.
2. Development of the system proceeds as follows:
o New application programs must be written as the need arises.
o New permanent files are created as required.
o But over a long period of time files may be in different formats, and
o Application programs may be in different languages.
3. So we can see there are problems with the straight file-processing approach:
o Data redundancy and inconsistency
 Same information may be duplicated in several places.
 All copies may not be updated properly.
o Difficulty in accessing data
 May have to write a new application program to satisfy an unusual request.
 E.g. find all customers with the same postal code.
 Could generate this data manually, but a long job...
o Data isolation
 Data in different files.
 Data in different formats.
 Difficult to write new application programs.
o Multiple users
 Want concurrency for faster response time.
 Need protection for concurrent updates.
 E.g. two customers withdrawing funds from the same account at the same time -
account has $500 in it, and they withdraw $100 and $50. The result could be
$350, $400 or $450 if no protection.
o Security problems
 Every user of the system should be able to access only the data they are
permitted to see.
 E.g. payroll people only handle employee records, and cannot see customer
accounts; tellers only access account data and cannot see payroll data.
 Difficult to enforce this with application programs.
o Integrity problems
 Data may be required to satisfy constraints.
 E.g. no account balance below $25.00.
 Again, difficult to enforce or to change constraints with the file-processing
approach.

These problems and others led to the development of database management systems.

What is the difference between DBMS and RDBMS?

DBMS stands for Database Management System which is a general term for a set of software
dedicated to controlling the storage of data.

RDBMS stands for Relational Database Management System. This is the most common form of DBMS.
In the relational model, invented by E.F. Codd, the only way to view the data is as a set of tables. Because there can be
relationships between the tables, people often assume that is what the word "relational" means. Not
so. Codd was a mathematician and the word "relational" is a mathematical term from the science of
set theory. It means, roughly, "based on tables".

Data Abstraction

1. The major purpose of a database system is to provide users with an abstract view of the
data.

The system hides certain details of how data is stored, created and maintained.

Complexity should be hidden from database users.

2. There are several levels of abstraction:


1. Physical Level:
 How the data are stored.
 E.g. index, B-tree, hashing.
 Lowest level of abstraction.
 Complex low-level structures described in detail.
2. Conceptual Level:
 Next highest level of abstraction.
 Describes what data are stored.
 Describes the relationships among data.
 Database administrator level.
3. View Level:
 Highest level.
 Describes part of the database for a particular group of users.
 Can be many different views of a database.
 E.g. tellers in a bank get a view of customer accounts, but not of payroll data.

Figure 1.1 illustrates the three levels.

Figure 1.1: The three levels of data abstraction

Data Manipulation Language (DML)


1. Data Manipulation is:
o retrieval of information from the database
o insertion of new information into the database
o deletion of information in the database
o modification of information in the database
2. A DML is a language which enables users to access and manipulate data.

The goal is to provide efficient human interaction with the system.

3. There are two types of DML:


o procedural: the user specifies what data is needed and how to get it
o nonprocedural: the user only specifies what data is needed
 Easier for user
 May not generate code as efficient as that produced by procedural languages
4. A query language is a portion of a DML involving information retrieval only. The terms DML
and query language are often used synonymously.
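As an illustration only, the four kinds of manipulation above map directly onto the familiar SQL DML statements. A minimal sketch, assuming a hypothetical account table with columns acc_no and balance (these names are not taken from the notes):

    -- retrieval
    SELECT acc_no, balance FROM account WHERE balance > 1000;

    -- insertion
    INSERT INTO account (acc_no, balance) VALUES (1001, 500.00);

    -- deletion
    DELETE FROM account WHERE acc_no = 1001;

    -- modification
    UPDATE account SET balance = balance + 100 WHERE acc_no = 1002;

SQL's SELECT is the nonprocedural query-language part of the DML: it states what data is needed and leaves the access strategy to the system.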

Data Definition Language (DDL)

1. Used to specify a database scheme as a set of definitions expressed in a DDL


2. DDL statements are compiled, resulting in a set of tables stored in a special file called a data
dictionary or data directory.
3. The data directory contains metadata (data about data)
4. The storage structure and access methods used by the database system are specified by a set
of definitions in a special type of DDL called a data storage and definition language
5. basic idea: hide implementation details of the database schemes from the users
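As an illustration, a minimal SQL DDL sketch of the savings-bank example used earlier; the table names, column names and types are assumptions, not definitions taken from these notes:

    CREATE TABLE branch (
        branch_name  VARCHAR(30) PRIMARY KEY,
        branch_city  VARCHAR(30)
    );

    CREATE TABLE account (
        acc_no       INTEGER PRIMARY KEY,
        branch_name  VARCHAR(30) REFERENCES branch(branch_name),
        -- integrity constraint from the earlier example: no balance below $25.00
        balance      NUMERIC(12,2) CHECK (balance >= 25.00)
    );

When such statements are compiled, the resulting table descriptions (the metadata) are what ends up in the data dictionary.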

Data Independence

1. The ability to modify a scheme definition in one level without affecting a scheme definition in a
higher level is called data independence.
2. There are two kinds:
o Physical data independence
 The ability to modify the physical scheme without causing application programs
to be rewritten
 Modifications at this level are usually to improve performance
o Logical data independence
 The ability to modify the conceptual scheme without causing application
programs to be rewritten
 Usually done when logical structure of database is altered
3. Logical data independence is harder to achieve as the application programs are usually heavily
dependent on the logical structure of the data. An analogy is made to abstract data types in
programming languages.

Three-level Database Model

Often referred to as the three-level model, this is where the design moves from a written
specification taken from the real-world requirements to a physically-implementable design for a
specific DBMS. The three levels commonly referred to are `Conceptual Design', `Data Model
Mapping', and `Physical Design'.

Figure : Logic behind the three level architecture

The specification is usually in the form of a written document containing customer requirements,
mock reports, screen drawings and the like, written by the client to indicate the requirements which
the final system is to have. Often such data has to be collected together from a variety of internal
sources to the company and then analyzed to see if the requirements are necessary, correct, and
efficient.

Once the Database requirements have been collated, the Conceptual Design phase takes the
requirements and produces a high-level data model of the database structure. In this module, we use
ER modelling to represent high-level data models, but there are other techniques. This model is
independent of the final DBMS which the database will be installed in.

Next, the Data Model Mapping phase takes the high-level data model and converts it into a
conceptual schema, which is specific to a particular DBMS class (e.g. relational). For a relational
system, such as Oracle, an appropriate conceptual schema would be a set of relations.

Finally, in the Physical Design phase the conceptual schema is converted into database internal
structures. This is specific to a particular DBMS product.

DBMS Architecture

Overall System Structure

1. Database systems are partitioned into modules for different functions. Some functions (e.g.
file systems) may be provided by the operating system.
2. Components include:
o File manager manages allocation of disk space and data structures used to represent
information on disk.
o Database manager: The interface between low-level data and application programs
and queries.
o Query processor translates statements in a query language into low-level instructions
the database manager understands. (May also attempt to find an equivalent but more
efficient form.)
o DML precompiler converts DML statements embedded in an application program to
normal procedure calls in a host language. The precompiler interacts with the query
processor.
o DDL compiler converts DDL statements to a set of tables containing metadata stored
in a data dictionary.

In addition, several data structures are required for physical system implementation:

o Data files: store the database itself.


o Data dictionary: stores information about the structure of the database. It is used
heavily. Great emphasis should be placed on developing a good design and efficient
implementation of the dictionary.
o Indices: provide fast access to data items holding particular values.
3. Figure 1.6 shows these components.

Role of Database Administrator


1. The database administrator is a person having central control over data and programs
accessing that data. Duties of the database administrator include:
o Scheme definition: the creation of the original database scheme. This involves writing a
set of definitions in a DDL (data storage and definition language), compiled by the DDL
compiler into a set of tables stored in the data dictionary.
o Storage structure and access method definition: writing a set of definitions translated
by the data storage and definition language compiler
o Scheme and physical organization modification: writing a set of definitions used by the
DDL compiler to generate modifications to appropriate internal system tables (e.g. data
dictionary). This is done rarely, but sometimes the database scheme or physical
organization must be modified.
o Granting of authorization for data access: granting different types of authorization for
data access to various users
o Integrity constraint specification: generating integrity constraints. These are consulted
by the database manager module whenever updates occur

Kinds of Database Users

1. The database users fall into several categories:


o Application programmers are computer professionals interacting with the system
through DML calls embedded in a program written in a host language (e.g. C, PL/1,
Pascal).
 These programs are called application programs.
 The DML precompiler converts DML calls (prefaced by a special character like $,
#, etc.) to normal procedure calls in a host language.
 The host language compiler then generates the object code.
 Some special types of programming languages combine Pascal-like control
structures with control structures for the manipulation of a database.
 These are sometimes called fourth-generation languages.
 They often include features to help generate forms and display data.
o Sophisticated users interact with the system without writing programs.
 They form requests by writing queries in a database query language.
 These are submitted to a query processor that breaks a DML statement down
into instructions for the database manager module.
o Specialized users are sophisticated users writing special database application
programs. These may be CADD systems, knowledge-based and expert systems,
complex data systems (audio/video), etc.
o Naive users are unsophisticated users who interact with the system by using
permanent application programs (e.g. automated teller machine).

Entity Relationship Diagram(Data modeling)

What are ERDs?


Entity Relationship Diagrams (ERDs) illustrate the logical structure of databases. A database
derived from an ERD is typically normalized up to 3NF.
Data models are tools used in analysis to describe the data requirements and
assumptions in the system from a top-down perspective. They also set the stage for the
design of databases later on in the SDLC.

Basics

Entity Relationship (ER) modelling

 is a design tool
 is a graphical representation of the database system
 provides a high-level conceptual data model
 supports the user's perception of the data
 is DBMS and hardware independent
 has many variants
 is composed of entities, attributes, and relationships

Entity Relationship Diagram Notations


Peter Chen developed ERDs in 1976. Since then Charles Bachman and James
Martin have added some slight refinements to the basic ERD principles.

Entity
An entity is an object or concept about
which you want to store information.

Weak Entity
A weak entity is dependent on another
entity to exist and does not have its
own primary key.

Attributes
Attributes are the properties or
characteristics of an entity.

Key attribute
A key attribute is the unique,
distinguishing characteristic of the
entity. For example, an employee's
social security number might be the
employee's key attribute.
Multivalued attribute
A multivalued attribute can have more
than one value. For example, an
employee entity can have multiple skill
values.
Derived attribute
A derived attribute is based on another
attribute. For example, an employee's
monthly salary is based on the
employee's annual salary.

Relationships
Relationships illustrate how two entities
share information in the database
structure.

Weak relationship
To connect a weak entity with others,
you should use a weak relationship
notation.

Cardinality
Cardinality specifies how many instances
of an entity relate to one instance of
another entity.

Cardinality is also closely linked to modality. While cardinality specifies the occurrences of a
relationship, modality describes the relationship as either mandatory or optional. In other words,
cardinality specifies the maximum number of relationships and modality specifies the absolute
minimum number of relationships.


Recursive relationship
In some cases, entities can be self-
linked. For example, employees can
supervise other employees.

Entities

 An entity is any object in the system that we want to model and store information about
 Individual objects are called entities
 Groups of the same type of objects are called entity types or entity sets
 Entities are represented by rectangles (either with round or square corners)

Figure: Entities

 There are two types of entities; weak and strong entity types.

Attribute

 All the data relating to an entity is held in its attributes.


 An attribute is a property of an entity.
 Each attribute can have any value from its domain.
 Each entity within an entity type:
o May have any number of attributes.
o Can have different attribute values than those in any other entity.
o Has the same number of attributes as every other entity of the same type.
 Attributes can be
 simple or composite
 single-valued or multi-valued
 Attributes can be shown on ER models
 They appear inside ovals and are attached to their entity.
 Note that entity types can have a large number of attributes... If all are shown then the
diagrams would be confusing. Only show an attribute if it adds information to the ER diagram,
or clarifies a point.


Figure : Attributes

Keys

 A key is a data item that allows us to uniquely identify individual occurrences of an entity type.
 A candidate key is an attribute or set of attributes that uniquely identifies individual
occurrences of an entity type.
 An entity type may have one or more possible candidate keys, the one which is selected is
known as the primary key.
 A composite key is a candidate key that consists of two or more attributes
 The name of each primary key attribute is underlined.

Relationships

 A relationship type is a meaningful association between entity types


 A relationship is an association of entities where the association includes one entity from each
participating entity type.
 Relationship types are represented on the ER diagram by a series of lines.
 As always, there are many notations in use today...
 In the original Chen notation, the relationship is placed inside a diamond, e.g. managers
manage employees:

Figure : Chen's notation for relationships

 For this module, we will use an alternative notation, where the relationship is a label on the
line. The meaning is identical

Figure : Relationships used in this document

Degree of a Relationship

 The number of participating entities in a relationship is known as the degree of the


relationship.
 If there are two entity types involved it is a binary relationship type

Figure : Binary Relationships


 If there are three entity types involved it is a ternary relationship type

Figure : Ternary relationship

 It is possible to have an n-ary relationship (e.g. quaternary or unary).


 Unary relationships are also known as a recursive relationship.

Figure : Recursive relationship

 It is a relationship where the same entity participates more than once in different roles.
 In the example above we are saying that employees are managed by employees.
 If we wanted more information about who manages whom, we could introduce a second
entity type called manager.

Degree of a Relationship

 It is also possible to have entities associated through two or more distinct relationships.

Figure : Multiple relationships

 In the representation we use it is not possible to have attributes as part of a relationship. To


support this other entity types need to be developed.

Replacing ternary relationships

When a ternary relationship occurs in an ER model it should always be removed before finishing
the model. Sometimes it can be replaced by a series of binary relationships that link pairs of the
entities from the original ternary relationship.

Figure : A ternary relationship example


 This can result in the loss of some information - It is no longer clear which sales assistant sold
a customer a particular product.
 Try replacing the ternary relationship with an entity type and a set of binary relationships.

Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as a
noun.

 The relationship sells can become the entity type sale.

Figure : Replacing a ternary relationship

 So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product.
 This process also works for higher order relationships.

Cardinality

 Relationships are rarely one-to-one


 For example, a manager usually manages more than one employee
 This is described by the cardinality of the relationship, for which there are four possible
categories.
 One to one (1:1) relationship
 One to many (1:m) relationship
 Many to one (m:1) relationship
 Many to many (m:n) relationship
 On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's foot"
end represents many.
 A one to one relationship - a man can only marry one woman, and a woman can only marry
one man, so it is a one to one (1:1) relationship

Figure : One to One relationship example

 A one to many relationship - one manager manages many employees, but each employee only
has one manager, so it is a one to many (1:m) relationship

Figure : One to Many relationship example

 A many to one relationship - many students study one course. They do not study more than
one course, so it is a many to one (m:1) relationship

Figure : Many to One relationship example

 A many to many relationship - One lecturer teaches many students and a student is taught by
many lecturers, so it is a many to many (m:n) relationship

Figure : Many to Many relationship example

Optionality

A relationship can be optional or mandatory.

 If the relationship is mandatory


 an entity at one end of the relationship must be related to an entity at the other end.
 The optionality can be different at each end of the relationship
 For example, a student must be on a course. This is mandatory. Thus the relationship `student
studies course' is mandatory.
 But a course can exist before any students have enrolled. Thus the relationship `course
is_studied_by student' is optional.
 To show optionality, put a circle or `O' at the `optional end' of the relationship.
 As the optional relationship is `course is_studied_by student', and the optional part of this is
the student, then the `O' goes at the student end of the relationship connection.

Figure : Optionality example

 It is important to know the optionality because you must ensure that whenever you create a
new entity it has the required mandatory links.

Foreign keys

A foreign key is an attribute (or group of attributes) that is the primary key to another relation.

 Roughly, each foreign key represents a relationship between two entity types.
 They are added to relations as we go through the mapping process.
 They allow the relations to be linked together.
 A relation can have several foreign keys.
 It will generally have a foreign key from each table that it is related to.
 Foreign keys are usually shown in italics or with a wiggly underline.

Preparing to map the ER model

Before we start the actual mapping process we need to be certain that we have simplified the ER
model as much as possible.

This is the ideal time to check the model, as it is really the last chance to make changes to the ER
model without causing major complications.

Mapping 1:1 relationships

Before tackling a 1:1 relationship, we need to know its optionality.

There are three possibilities the relationship can be:

1. mandatory at both ends


2. mandatory at one end and optional at the other
3. optional at both ends

Mandatory at both ends

If the relationship is mandatory at both ends it is often possible to subsume one entity type into the
other.

 The choice of which entity type subsumes the other depends on which is the most important
entity type (more attributes, better key, semantic nature of them).
 The result of this amalgamation is that all the attributes of the `swallowed up' entity become
attributes of the more important entity.
 The key of the subsumed entity type becomes a normal attribute.
 If there are any attributes in common, the duplicates are removed.
 The primary key of the new combined entity is usually the same as that of the original more
important entity type.

When not to combine

There are a few reasons why you might not combine a 1:1 mandatory relationship.

 the two entity types represent different entities in the `real world'.
 the entities participate in very different relationships with other entities.
 efficiency considerations when fast responses are required or different patterns of updating
occur to the two different entity types.

If not combined...

If the two entity types are kept separate then the association between them must be represented by
a foreign key.

 The primary key of one entity type becomes the foreign key in the other.
 It does not matter which way around it is done but you should not have a foreign key in each
entity.

1:1 Optional

So for 1:1 optional relationships, take the primary key from the `mandatory end' and add it to the
`optional end' as a foreign key.
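A hedged SQL sketch of one possible reading of this rule, using hypothetical tables employee (taken here as the `mandatory end') and company_car (the `optional end'); the names and columns are illustrative only:

    CREATE TABLE employee (
        employee_no  INTEGER PRIMARY KEY,
        name         VARCHAR(50)
    );

    -- the optional end carries the primary key of the mandatory end as a foreign key;
    -- UNIQUE keeps the relationship 1:1, and allowing NULL leaves it optional
    CREATE TABLE company_car (
        reg_no       VARCHAR(10) PRIMARY KEY,
        employee_no  INTEGER UNIQUE REFERENCES employee(employee_no)
    );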

Mapping 1:m relationships

To map 1:m relationships, the primary key on the `one side' of the relationship is added to the
`many side' as a foreign key.
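A minimal sketch of the rule in SQL, assuming a hypothetical department (`one side') and employee (`many side'); the names are illustrative:

    CREATE TABLE department (
        dept_no    INTEGER PRIMARY KEY,
        dept_name  VARCHAR(30)
    );

    -- the `many side' carries the primary key of the `one side' as a foreign key
    CREATE TABLE employee (
        employee_no  INTEGER PRIMARY KEY,
        name         VARCHAR(50),
        dept_no      INTEGER NOT NULL REFERENCES department(dept_no)
    );

Whether dept_no is declared NOT NULL depends on the optionality of the relationship, as discussed above.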

Mapping n:m relationships

If you have some m:n relationships in your ER model then these are mapped in the following
manner.

 A new relation is produced which contains the primary keys from both sides of the relationship
 These primary keys form a composite primary key.
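A sketch of this mapping in SQL, borrowing the lecturer/student m:n example from the cardinality section; the table and column names are assumptions:

    CREATE TABLE student (
        student_no  INTEGER PRIMARY KEY,
        name        VARCHAR(50)
    );

    CREATE TABLE lecturer (
        lecturer_no  INTEGER PRIMARY KEY,
        name         VARCHAR(50)
    );

    -- the new relation holds the primary keys from both sides as a composite primary key
    CREATE TABLE teaches (
        lecturer_no  INTEGER REFERENCES lecturer(lecturer_no),
        student_no   INTEGER REFERENCES student(student_no),
        PRIMARY KEY (lecturer_no, student_no)
    );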

Splitting n:m Relationships

A many to many relationship in an ER model is not necessarily incorrect. It can be replaced using
an intermediate entity. This should only be done where:

 the m:n relationship hides an entity


 the resulting ER diagram is easier to understand.

Splitting n:m Relationships - Example

Consider the case of a car hire company. Customers hire cars; one customer hires many cars and a
car is hired by many customers.

Figure : Many to Many example

The many to many relationship can be broken down to reveal a `hire' entity, which contains an
attribute `date of hire'.

Figure : Splitting the Many to Many example
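A possible relational form of the split shown above, with assumed column names:

    CREATE TABLE customer (
        customer_no  INTEGER PRIMARY KEY,
        name         VARCHAR(50)
    );

    CREATE TABLE car (
        reg_no  VARCHAR(10) PRIMARY KEY,
        model   VARCHAR(30)
    );

    -- the `hire' entity revealed by splitting the m:n relationship
    CREATE TABLE hire (
        customer_no   INTEGER REFERENCES customer(customer_no),
        reg_no        VARCHAR(10) REFERENCES car(reg_no),
        date_of_hire  DATE,
        PRIMARY KEY (customer_no, reg_no, date_of_hire)
    );

Including date_of_hire in the primary key is one possible choice; it allows the same customer to hire the same car on different dates.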

Constructing an ER model

Before beginning to draw the ER model, read the requirements specification carefully. Document any
assumptions you need to make.

1. Identify entities - list all potential entity types. These are the object of interest in the system.
It is better to put too many entities in at this stage and then discard them later if necessary.
2. Remove duplicate entities - Ensure that they really are separate entity types and not just two
names for the same thing.
o Also do not include the system as an entity type
o e.g. if modelling a library, the entity types might be books, borrowers, etc.
o The library is the system, thus should not be an entity type.
3. List the attributes of each entity (all properties to describe the entity which are relevant to the
application).
o Ensure that the entity types are really needed.
o are any of them just attributes of another entity type?
o if so keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!
4. Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
5. Define the relationships
o Examine each entity type to see its relationship to the others.
6. Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
7. Remove redundant relationships
o Examine the ER model for redundant relationships.

ER modelling is an iterative process, so draw several versions, refining each one until you are happy
with it. Note that there is no one right answer to the problem, but some solutions are better than
others!


Mapping parallel relationships

Parallel relationships occur when there are two or more relationships between two entity types (e.g.
employees own and service cars).

Figure : Parallel Relationships

 In order to distinguish between the two roles we can give the foreign keys different names.
 Each relationship is mapped according to the rules, and we end up with two foreign keys
added to the Vehicle table.
 So we add the employee_no as the owner_no in order to represent the `owns' relationship.

Mapping 1:m in unary relationships

Figure : Mapping recursive relationships

 Employees manage employees


 Each employee has an employee_no which is the primary key
 We represent the manages relationship by adding a manager_no as a foreign key.
 This is in fact the employee_no of the manager.
 It is given a different name to clearly convey what it represents, and to ensure that all the
entity type's attributes have unique names, as to do otherwise would be invalid.
 After mapping

 Employee(employee_no,manager_no, name,...)

 So in general, for unary 1:n relationships, the foreign key is the primary key of the same table,
but is given a different name.
 Note that the relationship is optional in both directions because not all staff can be managers,
and the top manager is not managed by anybody else.
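A minimal SQL sketch of the mapping Employee(employee_no, manager_no, name, ...); manager_no is left nullable because, as noted above, the relationship is optional in both directions:

    CREATE TABLE employee (
        employee_no  INTEGER PRIMARY KEY,
        name         VARCHAR(50),
        -- the manager's own employee_no; NULL for the top manager
        manager_no   INTEGER REFERENCES employee(employee_no)
    );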

Mapping superclasses and subclasses

There are three ways of implementing superclasses and subclasses and it depends on the application
which will be the most suitable.

Only the first method is a true reflection of the superclasses and subclasses; if either of the other
methods is preferable then the model should probably not have used subclasses.

1. One relation for the superclass and one relation for each subclass.
2. One relation for each subclass.
3. One relation for the superclass.
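As an illustration of the first method only (one relation for the superclass and one for each subclass), using the account example from the Generalization section below; the key and column names are assumed:

    CREATE TABLE account (
        acc_no   INTEGER PRIMARY KEY,
        balance  NUMERIC(12,2)
    );

    CREATE TABLE savings_account (
        acc_no         INTEGER PRIMARY KEY REFERENCES account(acc_no),
        interest_rate  NUMERIC(5,2)
    );

    CREATE TABLE chequing_account (
        acc_no            INTEGER PRIMARY KEY REFERENCES account(acc_no),
        overdraft_amount  NUMERIC(12,2)
    );

The second method would keep only the two subclass tables (repeating the common attributes in each), and the third would keep only the account table with all attributes plus a type indicator.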

Enhanced Entity Relationship Diagram Concepts

Generalization

Consider extending the entity set account by classifying accounts as being either savings-account or
chequing-account.

Each of these is described by the attributes of account plus additional attributes. (savings has
interest-rate and chequing has overdraft-amount.)

We can express the similarities between the entity sets by generalization. This is the process of
forming containment relationships between a higher-level entity set and one or more lower-level
entity sets.

In E-R diagrams, generalization is shown by a triangle, as shown in Figure 2.19.


  

Figure 2.19: Generalization

 Generalization hides differences and emphasizes similarities.


 Distinction made through attribute inheritance.
 Attributes of higher-level entity are inherited by lower-level entities.
 Two methods for conversion to a table form:
o Create a table for the high-level entity, plus tables for the lower-level entities containing
also their specific attributes.
o Create only tables for the lower-level entities.

Aggregation

The E-R model cannot express relationships among relationships.

When would we need such a thing?

Consider a DB with information about employees who work on a particular project and use a number
of machines doing that work. We get the E-R diagram shown in Figure 2.20.


  
Figure 2.20: E-R diagram with redundant relationships

Relationship sets work and uses could be combined into a single set. However, they shouldn't be, as
this would obscure the logical structure of this scheme.

The solution is to use aggregation.

 An abstraction through which relationships are treated as higher-level entities.


 For our example, we treat the relationship set work and the entity sets employee and project
as a higher-level entity set called work.
 Figure 2.21 shows the E-R diagram with aggregation.

  
Figure 2.21: E-R diagram with aggregation

Transforming an E-R diagram with aggregation into tabular form is easy. We create a table for each
entity and relationship set as before.

STEP FOUR - MAP RELATIONSHIPS

For M:1 relationships: Take the PK at the one end and put it in the table at the many end.
If the table's PK includes a foreign key: The FK columns to support the relationship may have been
added when you mapped the UIDs.
For a mandatory 1:1 relationship: Place the unique FK in the table at the mandatory end and use the
NOT NULL constraint to enforce the mandatory condition.
If a 1:1 relationship is optional in both directions: Place the FK in the table at either end of the
relationship.
For a 1:M recursive relationship: Add a FK column to the single table. This FK column will refer to
values of the PK column.
For a 1:1 recursive relationship: Add a unique FK to the table. This FK column will refer to values of
the PK column.

Role of normalization

This E-R diagram says that a customer may have several accounts, each located in a specific bank
branch, and that an account may belong to several different customers.

In an earlier chapter, we have discussed data modelling using the entity-relationship model. In the E-
R model, a conceptual schema using an entity- relationship diagram is built and then mapped to a set
of relations. This technique ensures that each entity has information about only one thing and once
the conceptual schema is mapped into a set of relations, each relation would have information about
only one thing. The relations thus obtained would normally not suffer from any of the anomalies that
have been discussed in the last section. It can be shown that if the entity-relationship model is built
using the guidelines presented earlier, the resulting relations are in BCNF.

Of course, mistakes can often be made in database modelling, especially when the database is large
and complex, or one may, for some reason, carry out database schema design using techniques
other than a modelling technique like the entity- relationship model. For example, one could collect
all the information that an enterprise possesses and build one giant relation (often called the
universal relation) to hold it. This bottom-up approach is likely to lead to a relation that is likely to
suffer from all the problems that we have discussed in the last section. For example, the relation is
highly likely to have redundant information and update, deletion and insertion anomalies.
Normalization of such a large relation will then be essential to avoid (or at least minimise) these
problems.

To summarise, normalization plays only a limited role in database schema design if a top-down
modelling approach like the entity- relationship approach is used. Normalization however plays a
major role when the bottom-up approach is being used. Normalization is then essential to build
appropriate relations to hold the information of the enterprise.

Now to define the normal forms more formally, we first need to define the concept of functional
dependence.

Flat File Database

 In its simplest form, a flat-file database is nothing more than a single, large
table (e.g., a spreadsheet)
 A flat file contains only one record structure; there are no links between separate
records
 Access to data is done in a sequential manner; random access is not supported
 Access times are slow because the entire file must be scanned to locate  the
desired data 
 Access times can be improved if the data is sorted but this introduces the
potential for error (e.g., one or more records may be misfiled)  
 Other  problems with a flat-file database include 1) data redundancy; 2) data
maintenance; and 3) data integrity
 EXAMPLE.  An ‘orders’ file might require fields for the order number, date,
customer name, customer address, quantity, part description, price, etc.  In this
example, each record must repeat the name and address of the customer ( data
redundancy). If the customer’s address changed, it would have to be changed in
multiple locations (data maintenance). If the customer name were spelled
differently in different orders (e.g., Acme, Acme Inc, Acme Co.) then the data
would be inconsistent (data integrity). (See the Normalization Exercise for a
practical example of these deficiencies.)
 Flat file data structures are only viable for small data processing requirements 

Hierarchical Database Model

 The hierarchical and network database models preceded the relational model;
today very few commercial databases use either of these models
 A hierarchical database is a series of flat-files linked in structured 'tree'
relationships 
 IBM's IMS (Information Management System) database, often referred to by the
name of its proprietary language, DL/I (Data Language I), is the only viable
commercial hierarchical database still in use today, primarily on older mainframe
computers
 The concept for the hierarchical database model was originally based on a bill of
materials (BOM)
 Data is represented as a series of parent/child relationships

 This concept is fundamentally different from that used in the relational model where
data resides in a collection of tables without any hierarchy and that are physically
independent of each other
 In the hierarchical model, a database 'record' is a tree that consists of one or
more groupings of fields called 'segments'
 Segments make up the individual 'nodes' of the tree (e.g., the 'customers' record
may consist of 'customer' and 'order' segments)
 The model requires that each child segment can be linked to only one parent and
a child can only be reached through its parent
 The requirement for a one-to-many relationship between parent and child can
result in redundant data (e.g., 'orders' might be a child of 'customers' as well as
a child of 'parts') 
 To get around the data redundancy problem, data is stored in one place and
referenced by links or physical pointers in other places (e.g., the 'customers'
record contains actual data in the 'orders' segment while the 'parts' record
contains a pointer to the 'orders' data in 'customers')
 EXAMPLE: To create a sales report, you have to access  ‘customers’  to get to
‘orders’ 
 This is fundamentally different from the way a relational database operates; in a
relational database there is no hierarchy among tables and any table can be
accessed directly or potentially linked with any other table; THERE ARE NO
HARD-CODED, PREDEFINED PATHS AMONG THE DATA (Note that while a 
primary key-foreign key combination in a relational database represents a logical
relationship among data, it does not necessarily limit the possible physical access
paths through the data)
 In the hierarchical model, the link established by the pointers is permanent and
cannot be modified; in other words, the links are hard-coded into the data structure
 The hard-coding makes the hierarchical model very inflexible; a design originally
optimized to work with the data in one way may be totally inefficient in working
with the data in other ways
 In addition, the physical links make it very difficult to expand or modify the
database; changes typically require substantial rewriting efforts 

Network Database Model

 The network database model expands on the hierarchical model by providing


multiple paths among segments (i.e., more than one parent/child relationship)
 The network model was standardized as the  CODASYL DBTG (Conference on
Data System Languages, Data Base Task Group) model
 Although supporting multiple paths in the data structure eliminates some of the
inflexibility of the hierarchical model, the network model is not very practical
 The network model only supports simple network relationships that are
implemented as 'chains' connecting individual records
 With no restrictions on the number of relations, the database design can become
overwhelmingly complex 
 The relational model provides the same flexibility offered by the network model
but is much easier to work with
 The network model is for all practical purposes obsolete

Indexed Sequential Access Method (ISAM)

 The Indexed Sequential Access Method (ISAM), is a disk storage and access
method, not a database model
 There is much confusion surrounding the term ISAM and in practice it is used to
refer to desktop and/or file-based databases like Microsoft's FoxPro and Jet-
based Access, Clipper, dBase, Paradox, and  Btrieve
 It also is used to refer to navigational database applications that rely on a
procedural approach to data access and retrieval
 Possibly the only 'true' ISAM-based products are IBM’s Information Management
System (IMS) and  Btrieve
 Under ISAM, records are located using a key value. A smaller index file stores the
keys along with pointers to the records in the larger data file. The index file is
first searched for the key and then the associated pointer is used to locate the
desired record
 ISAM is more concerned with the access and storage of data; it does not
represent a full database management system (DBMS)
 Data access and storage methods are discussed below

Relational Database Model

 The theory behind the relational database model is discussed in the section entitled
'Characteristics of a Relational Database'; this section focuses on the distinctions
between the relational model and other database models
 In a relational database, the logical design is independent of the physical design
 Queries against a relational database management system (RDBMS) are based
on logical relationships and processing those queries does not require pre-
defined access paths among the data (i.e., pointers)
 The relational database provides flexibility that allows changes to the database
structure to be easily accommodated
 Because the data reside in tables, the structure of the database can be changed
without having to change any applications that were based on that structure
 EXAMPLE: You add a new field for e-mail address in the customers table.  If you
are using a non-relational database,  you probably have to modify the application
that will access this information by including 'pointers' to the new data. With a
relational database, the information is immediately accessible because it is
automatically related to the other data by virtue of its position in the table.  All
that is required to access the new e-mail field is to add it to a SELECT list
 The structural flexibility of a relational database allows combinations of data to
be retrieved that were never anticipated at the time the database was initially
designed
 In contrast, the database structure in older database models is "hard-coded" into
the application; if you add new fields to a non-relational database, any
application that accesses the database will have to be updated

 In practice, there is significant confusion as to what constitutes a relational
database management system (RDBMS) 
 Dr. E.F. Codd provided 12 rules that define the basic  characteristics of a
relational database but implementation of these rules varies from vendor to
vendor
 No RDBMS product is  fully compliant with all 12 rules but some are more so
than others
 When distinguishing DBMS products, there are typically three key areas on which
to focus: 1) query formulation and data storage/access; 2) application
integration; and 3) processing architecture

Object-Oriented Database

 The object-oriented database, also referred to as the ‘post-relational’ database


model,  addresses some of the limitations of the relational model
 The most significant limitation of the relational model is its limited ability to deal
with BLOBs
 Binary Large OBjects or BLOBs are complex data types such as images,
spreadsheets, documents, CAD, e-mail messages, and directory structures
 At its most basic level, 'data' is a sequence of bits ('1s' and '0s') residing in some
sort of storage structure
 'Traditional' databases are designed to support small bit streams representing
values expressed as numeric or small character strings
 Bit stream data is atomic; it cannot be broken down into smaller pieces
 BLOBs are large and non-atomic data; they have parts and subparts and are not
easily represented in a relational database
 There is no specific mechanism in the relational model to allow for the retrieval
of parts of a BLOB
 A relational database can store BLOBs but they are stored outside the database
and referenced by pointers
 The pointers allow the relational database to be searched for BLOBs, but the
BLOB itself must be manipulated by conventional file I/O methods
 Object-oriented databases provide native support for BLOBs
 Unfortunately, there is no clear model or framework for the object-oriented
database like the one Codd provided for the relational database
 Under the general concept of an object-oriented database, everything is treated
as an object that can be manipulated
 Objects inherit characteristics of their class and have a set of behaviors
(methods) and properties that can be manipulated
 The hierarchical notion of classes and subclasses in the object-oriented database
model replaces the relational concept of atomic data types
 The object-oriented approach provides a natural way to represent the hierarchies
that occur in complex data
 EXAMPLE: a Word document object consists of paragraph objects and has a
method to 'draw' itself
 There are a limited number of commercial object-oriented database systems
available; mostly for engineering or CAD applications
 In a way, object-oriented concept represents a ‘Back to the Future’ approach in
that it is very similar to the old hierarchical database  design
 Relational databases are not obsolete and may simply evolve by adding
additional support for BLOBs

Codd’s Rules

E. F. Codd presented these rules as a basis for determining whether a DBMS could be classified as
Relational.

Codd's Rules

 Codd's Rules can be divided into 5 functional areas –


 Foundation Rules
 Structural Rules
 Integrity Rules
 Data Manipulation Rules
 Data Independence Rules

Codd's Rules

 Foundation Rules (Rules 0 & 12)


 Structural Rules (Rules 1 & 6)
 The fundamental structural construct is the table.
 Codd states that an RDBMS must support tables, domains, primary & foreign keys.
 Each table should have a primary key.

 Integrity Rules (Rules 3 & 10)


 Integrity should be maintained by the DBMS not the application.
 Data Manipulation Rules (Rule 2, 4, 5 & 7)
 User should be able to manipulate the 'Logical View' of the data with no need for knowledge of
how it is Physically stored or accessed.
 Data Independence Rules (Rules 8, 9 & 11)

 These rules protect users & application developers from having to change the applications
following any low-level reorganisation of the DB.
 Rule 11 - Distribution Independence -

Codd's Rules

 Rule 0 –
 Any system claimed to be a RDBMS must be able to manage databases entirely through its
relational capabilities.
 All data definition & manipulation must be able to be done through relational ops.

 Rule 1 -
 All info in a RDB is represented explicitly at the logical level in exactly one way - by values in a
table.
 ALL info even the Metadata held in the system catalogue MUST be stored as relations(tables) &
manipulated in the same way as data.
 Rule 2 - Guaranteed Access -
 Each & every datum in an RDB is guaranteed to be logically accessible by a combination of table
name, primary key value & column name.
 Rule 3 - Systematic treatment of null values -
 Null values are supported for representation of 'missing' & inapplicable information in a systematic
way & independent of data type.

 Rule 4 - Dynamic on-line Catalog based on relational model


 The DB description (metadata) is represented at logical level in the same way as ordinary
data, so that same relational language can be used to interrogate the metadata as regular data.
 System & other data stored & manipulated in the same way.
 Rule 5 - Comprehensive Data Sublanguage -
 RDBMS may support many languages & modes of use, but there must be at least ONE language
whose statements can express ALL of the following -
 Data Definition
 View Definition
 Data manipulation (interactive & via program)
 Integrity constraints
 Authorization
 Transaction boundaries (begin, commit & rollback)
1992 - ISO standard for SQL provides all these functions

 Rule 6 - View Updating –


 All views that are theoretically updatable are updatable by the system.

 Not really implemented yet by any available system.


 Rule 7 - High-level insert, update & delete -
 Capability of handling a base table or view as a single operand applies not only to data retrieval
but also to insert, update & delete operations.
 Rule 8 - Physical Data Independence -
 Application Programs & Terminal Activities remain logically unimpaired whenever any changes are
made either to the storage organisation or access methods.

 This gives the advantage of centralised control & enforcement


 Rule 9 - Logical Data Independence -
 Application Programs & Terminal Activities remain logically unimpaired when information-preserving
changes of any kind that theoretically permit unimpairment are made to the base tables.
 Rule 10 - Integrity independence -
 Integrity constraints specific to a particular RDB MUST be definable in the relational data
sublanguage & storable in the DB, NOT the application program.
 Rule 11 - Distribution Independence -
 The data manipulation sublanguage of an RDBMS must enable application programs & queries
to remain logically unchanged whether & whenever data is physically centralised or distributed.
This means that an Application Program that accesses the DBMS on a single computer should
also work ,without modification, even if the data is moved from one computer to another in a
network environment.
The user should 'see' one centralised DB whether data is located on one or more computers.

 This rule does not say that to be fully Relational the DBMS must support distributed DB's but
that if it does the query must remain the same.
 Rule 12 - Nonsubversion Rule -
 If a RDBMS has a low level (record at a time) language, that low level language cannot be used to
subvert or bypass the integrity rules & constraints expressed in the higher-level relational language.
 All database access must be controlled through the DBMS so that the integrity of the database
cannot be compromised without the knowledge of the user or the DBA.
This does not prohibit use of record at a time languages e.g. PL/SQL

Relational Algebra

 Recall, the Relational Model consists of the elements: relations, which are made up of
attributes.
 A relation is a set of attributes with values for each attribute such that:
1. Each attribute value must be a single value only (atomic).
2. All values for a given attribute must be of the same type (or domain).
3. Each attribute name must be unique.
4. The order of attributes is insignificant
5. No two rows (tuples) in a relation can be identical.
6. The order of the rows (tuples) is insignificant.
 Relational Algebra is a collection of operations on Relations.
 Relations are operands and the result of an operation is another relation.
 Two main collections of relational operators:

1. Set theory operations:
Union, Intersection, Difference and Cartesian product.
2. Specific Relational Operations:
Selection, Projection, Join, Division

Set Theoretic Operations


Consider the following relations R and S
R
First Last Age
Bill Smith 22
Sally Green 28
Mary Keen 23
Tony Jones 32
S
First Last Age
Forrest Gump 36
Sally Green 28
DonJuan DeMarco 27

 Union: R ∪ S
Result: Relation with tuples from R and S with duplicates removed.
 Difference: R - S
Result: Relation with tuples from R but not from S.
 Intersection: R ∩ S
Result: Relation with tuples that appear in both R and S.

R ∪ S

First Last Age


Bill Smith 22
Sally Green 28
Mary Keen 23
Tony Jones 32
Forrest Gump 36
DonJuan DeMarco 27

R-S

First Last Age

Bill Smith 22
Mary Keen 23
Tony Jones 32

R ∩ S

First Last Age


Sally Green 28
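The same three operations can be written in SQL against union compatible tables. A sketch, assuming R and S exist as tables with columns (first_name, last_name, age); note that standard SQL spells difference as EXCEPT (Oracle uses MINUS):

    SELECT first_name, last_name, age FROM R
    UNION                                          -- R ∪ S, duplicates removed
    SELECT first_name, last_name, age FROM S;

    SELECT first_name, last_name, age FROM R
    EXCEPT                                         -- R - S
    SELECT first_name, last_name, age FROM S;

    SELECT first_name, last_name, age FROM R
    INTERSECT                                      -- R ∩ S
    SELECT first_name, last_name, age FROM S;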

Union Compatible Relations

 Attributes of relations need not be identical to perform union, intersection and difference
operations.
 However, they must have the same number of attributes or arity and the domains for
corresponding attributes must be identical.
 Domain is the datatype and size of an attribute.
 The degree of relation R is the number of attributes it contains.
 Definition: Two relations R and S are union compatible if and only if they have the same
degree and the domains of the corresponding attributes are the same.
 Some additional properties:
o Union, Intersection and difference operators may only be applied to Union Compatible
relations.
o Union and Intersection are commutative operations:
R ∪ S = S ∪ R
R ∩ S = S ∩ R
o Difference operation is NOT commutative:
R - S ≠ S - R
o The resulting relations may not have meaningful names for the attributes. Convention is
to use the attribute names from the first relation.

Cartesian Product

 Produce all combinations of tuples from two relations.

First Last Age


Bill Smith 22
Mary Keen 23
Tony Jones 32

Dinner Dessert
Steak Ice Cream
Lobster Cheesecake

R X S

First Last Age Dinner Dessert


Bill Smith 22 Steak Ice Cream
Bill Smith 22 Lobster Cheesecake
Mary Keen 23 Steak Ice Cream
Mary Keen 23 Lobster Cheesecake
Tony Jones 32 Steak Ice Cream
Tony Jones 32 Lobster Cheesecake
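In SQL the Cartesian product is written as CROSS JOIN (or simply by listing both tables in the FROM clause). A sketch, assuming the two tables above exist as R(first_name, last_name, age) and S(dinner, dessert):

    -- every combination of a tuple from R with a tuple from S
    SELECT R.first_name, R.last_name, R.age, S.dinner, S.dessert
    FROM R CROSS JOIN S;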

Selection Operator

 Selection and Projection are unary operators.


 The selection operator is sigma: σ
 The selection operation acts like a filter on a relation by returning only a certain number of
tuples.
 The resulting relation will have the same degree as the original relation.
 The resulting relation may have fewer tuples than the original relation.
 The tuples to be returned are dependent on a condition that is part of the selection operator.
 σ C (R) returns only those tuples in R that satisfy condition C
 A condition C can be made up of any combination of comparison or logical operators that
operate on the attributes of R.
o Comparison operators: =, <>, <, <=, >, >=
o Logical operators: AND (∧), OR (∨), NOT (¬)

Selection Examples
Assume the following relation EMP has the following tuples:
Name Office Dept Rank
Smith 400 CS Assistant
Jones 220 Econ Adjunct
Green 160 Econ Assistant
Brown 420 CS Associate
Smith 500 Fin Associate

 Select only those Employees in the CS department:
σ Dept = 'CS' (EMP)
Result:

Name Office Dept Rank


Smith 400 CS Assistant
Brown 420 CS Associate

 Select only those Employees with last name Smith who are assistant professors:
σ Name = 'Smith' ∧ Rank = 'Assistant' (EMP)
Result:

Name Office Dept Rank


Smith 400 CS Assistant

 Select only those Employees who are either Assistant Professors or in the Economics
department:
σ Rank = 'Assistant' ∨ Dept = 'Econ' (EMP)
Result:

Name Office Dept Rank


Smith 400 CS Assistant
Jones 220 Econ Adjunct
Green 160 Econ Assistant

 Select only those Employees who are not in the CS department or Adjuncts:
σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP)
Result:

Name Office Dept Rank


Green 160 Econ Assistant
Smith 500 Fin Associate
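Selection corresponds to the WHERE clause in SQL. A sketch of the four examples above, assuming the EMP relation exists as a table emp(name, office, dept, rank); note that rank may need quoting in dialects where it is a reserved word:

    SELECT * FROM emp WHERE dept = 'CS';

    SELECT * FROM emp WHERE name = 'Smith' AND rank = 'Assistant';

    SELECT * FROM emp WHERE rank = 'Assistant' OR dept = 'Econ';

    SELECT * FROM emp WHERE NOT (rank = 'Adjunct' OR dept = 'CS');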

Projection Operator

 Projection is also a Unary operator.


 The Projection operator is pi: π
 Projection limits the attributes that will be returned from the original relation.

 The general syntax is: π attributes (R)
where attributes is the list of attributes to be displayed and R is the relation.
 The resulting relation will have the same number of tuples as the original relation (unless
there are duplicate tuples produced).
 The degree of the resulting relation may be equal to or less than that of the original relation.

Projection Examples
Assume the same EMP relation above is used.

 Project only the names and departments of the employees:


π name, dept (EMP)
Results:

Name Dept
Smith CS
Jones Econ
Green Econ
Brown CS
Smith Fin
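Projection corresponds to the SELECT list in SQL, with one caveat: relational projection removes duplicate tuples, whereas SQL keeps them unless DISTINCT is used. A sketch against the same assumed emp table:

    SELECT name, dept FROM emp;             -- keeps any duplicate rows
    SELECT DISTINCT name, dept FROM emp;    -- true relational projection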

Combining Selection and Projection

 The selection and projection operators can be combined to perform both operations.
 Show the names of all employees working in the CS department:
π name ( σ Dept = 'CS' (EMP) )
Results:

Name
Smith
Brown

 Show the name and rank of those Employees who are not in the CS department or Adjuncts:
π name, rank ( σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP) )
Result:

Name Rank
Green Assistant
Smith Associate
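Combining the two operators in SQL is simply a SELECT list plus a WHERE clause. A sketch of the two examples above, against the same assumed emp table:

    SELECT name FROM emp WHERE dept = 'CS';

    SELECT name, rank FROM emp WHERE NOT (rank = 'Adjunct' OR dept = 'CS');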

Exercises

 Evaluate the following expressions:


1. π name, rank ( σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP) )
2. π fname, age ( σ Age > 22 (R ∪ S) )

For this expression, use R and S from the Set Theoretic Operations section above.

3. σ office > 300 ( π name, rank (EMP))

Join Operation

 Join operations bring together two relations and combine their attributes and tuples in a
specific fashion.

 The generic join operator (called the Theta Join) is: ⋈


 It takes as arguments the attributes from the two relations that are to be joined.
 For example assume we have the EMP relation as above and a separate DEPART relation with
(Dept, MainOffice, Phone) :
EMP ⋈_(EMP.Dept = DEPART.Dept) DEPART
 The join condition can be any comparison of attributes from the two relations (<, ≤, >, ≥, =, ≠).
 When the join condition operator is =, we call this an Equijoin.
 Note that the attributes in common are repeated.

Join Examples
Assume we have the EMP relation from above (shown here with an additional Salary attribute) and the following DEPART relation:
Dept MainOffice Phone
CS 404 555-1212
Econ 200 555-1234
Fin 501 555-4321
Hist 100 555-9876

 Find all information on every employee including their department info:


EMP ⋈_(emp.Dept = depart.Dept) DEPART
Results:

Name Office EMP.Dept Salary DEPART.Dept MainOffice Phone


Smith 400 CS 45000 CS 404 555-1212

Jones 220 Econ 35000 Econ 200 555-1234
Green 160 Econ 50000 Econ 200 555-1234
Brown 420 CS 65000 CS 404 555-1212
Smith 500 Fin 60000 Fin 501 555-4321

 Find all information on every employee including their department info where the employee
works in an office numbered less than the department main office:
EMP ⋈_((emp.office < depart.mainoffice) ∧ (emp.dept = depart.dept)) DEPART
Results:

Name Office EMP.Dept Salary DEPART.Dept MainOffice Phone


Smith 400 CS 45000 CS 404 555-1212
Green 160 Econ 50000 Econ 200 555-1234
Smith 500 Fin 60000 Fin 501 555-4321

Natural Join

 Notice in the generic (Theta) join operation, any attributes in common (such as dept above)
are repeated.
 The Natural Join operation removes these duplicate attributes.
 The natural join operator is: *
 When using *, the join condition is assumed to be equality on the attributes the two relations have in common.
 Example: EMP * DEPART
Results:

Name Office Dept Salary MainOffice Phone


Smith 400 CS 45000 404 555-1212
Jones 220 Econ 35000 200 555-1234
Green 160 Econ 50000 200 555-1234
Brown 420 CS 65000 404 555-1212
Smith 500 Fin 60000 501 555-4321
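As an illustration only, the theta join and natural join described above can be sketched in Python as nested loops; the function names and the trimmed EMP/DEPART data below are assumptions, not part of the notes.

```python
# Sketch of a theta join (nested loop over an arbitrary condition) and a
# natural join (equality on the attributes the two relations share).

EMP = [
    {"Name": "Smith", "Office": 400, "Dept": "CS",   "Salary": 45000},
    {"Name": "Jones", "Office": 220, "Dept": "Econ", "Salary": 35000},
]
DEPART = [
    {"Dept": "CS",   "MainOffice": 404, "Phone": "555-1212"},
    {"Dept": "Econ", "MainOffice": 200, "Phone": "555-1234"},
]

def theta_join(r, s, condition):
    """Combine every pair of tuples from r and s that satisfies the condition.
    Attributes in common are kept from both sides (prefixed), as in the notes."""
    result = []
    for t in r:
        for u in s:
            if condition(t, u):
                row = {"r." + k: v for k, v in t.items()}
                row.update({"s." + k: v for k, v in u.items()})
                result.append(row)
    return result

def natural_join(r, s):
    """Equijoin on all attributes with the same name; common attributes appear once."""
    common = set(r[0]) & set(s[0])
    result = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in common):
                row = dict(t)
                row.update({k: v for k, v in u.items() if k not in common})
                result.append(row)
    return result

# Equijoin on Dept, then the same pairing without the duplicated Dept column:
print(theta_join(EMP, DEPART, lambda t, u: t["Dept"] == u["Dept"]))
print(natural_join(EMP, DEPART))
```
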

Outer Join

 In the Join operations so far, only those tuples from both relations that satisfy the join
condition are included in the output relation.
 The Outer join includes other tuples as well according to a few rules.
 Three types of outer joins:
1. Left Outer Join includes all tuples in the left hand relation and includes only those
matching tuples from the right hand relation.
2. Right Outer Join includes all tuples in the right hand relation and includes only those
matching tuples from the left hand relation.
3. Full Outer Join includes all tuples from both the left hand and the right hand
relations.
 Examples:

Assume we have two relations: PEOPLE and MENU:

PEOPLE: MENU:
Name Age Food Food Day
Alice 21 Hamburger Pizza Monday
Bill 24 Pizza Hamburger Tuesday
Carl 23 Beer Chicken Wednesday
Dina 19 Shrimp Pasta Thursday
Tacos Friday

 PEOPLE ⟕_(people.food = menu.food) MENU   (Left Outer Join)

Name Age people.Food menu.Food Day


Alice 21 Hamburger Hamburger Tuesday
Bill 24 Pizza Pizza Monday
Carl 23 Beer NULL NULL
Dina 19 Shrimp NULL NULL

 PEOPLE ⟖_(people.food = menu.food) MENU   (Right Outer Join)

Name Age people.Food menu.Food Day


Bill 24 Pizza Pizza Monday
Alice 21 Hamburger Hamburger Tuesday
NULL NULL NULL Chicken Wednesday
NULL NULL NULL Pasta Thursday
NULL NULL NULL Tacos Friday
 PEOPLE ⟗_(people.food = menu.food) MENU   (Full Outer Join)

Name Age people.Food menu.Food Day


Alice 21 Hamburger Hamburger Tuesday
Bill 24 Pizza Pizza Monday
Carl 23 Beer NULL NULL
Dina 19 Shrimp NULL NULL
NULL NULL NULL Chicken Wednesday
NULL NULL NULL Pasta Thursday
NULL NULL NULL Tacos Friday
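The outer joins can be sketched the same way (illustrative Python, not from the notes); None stands in for NULL, and a right outer join is simply a left outer join with the operands swapped.

```python
# Sketch of left and full outer joins on PEOPLE.Food = MENU.Food.
# None plays the role of NULL.

PEOPLE = [
    {"Name": "Alice", "Age": 21, "Food": "Hamburger"},
    {"Name": "Bill",  "Age": 24, "Food": "Pizza"},
    {"Name": "Carl",  "Age": 23, "Food": "Beer"},
    {"Name": "Dina",  "Age": 19, "Food": "Shrimp"},
]
MENU = [
    {"Food": "Pizza",     "Day": "Monday"},
    {"Food": "Hamburger", "Day": "Tuesday"},
    {"Food": "Chicken",   "Day": "Wednesday"},
    {"Food": "Pasta",     "Day": "Thursday"},
    {"Food": "Tacos",     "Day": "Friday"},
]

def left_outer_join(r, s, on):
    """Every tuple of r appears; non-matching tuples are padded with None for s's attributes."""
    result = []
    for t in r:
        matches = [u for u in s if t[on] == u[on]]
        for u in matches or [{k: None for k in s[0]}]:
            result.append({**t, **{"s." + k: u[k] for k in s[0]}})
    return result

def full_outer_join(r, s, on):
    """Left outer join plus the unmatched tuples of s, padded with None for r's attributes."""
    result = left_outer_join(r, s, on)
    matched = {t[on] for t in r}
    for u in s:
        if u[on] not in matched:
            result.append({**{k: None for k in r[0]}, **{"s." + k: u[k] for k in s[0]}})
    return result

for row in full_outer_join(PEOPLE, MENU, "Food"):
    print(row)
```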

Outer Union

 The Outer Union operation is applied to partially union compatible relations.


 Operator is: *
 Example: PEOPLE * MENU

Name Age Food Day


Alice 21 Hamburger NULL
Bill 24 Pizza NULL
Carl 23 Beer NULL
Dina 19 Shrimp NULL
NULL NULL Hamburger Tuesday
NULL NULL Pizza Monday
NULL NULL Chicken Wednesday
NULL NULL Pasta Thursday
NULL NULL Tacos Friday

Formal Definitions of the Normal Forms

1st Normal Form (1NF)

Def: A table (relation) is in 1NF if

1. There are no duplicated rows in the table.

2. Each cell is single-valued (i.e., there are no repeating groups or arrays).

3. Entries in a column (attribute, field) are of the same kind.

Note: The order of the rows is immaterial; the order of the columns is immaterial.

Note: The requirement that there be no duplicated rows in the table means that the table has a key
(although the key might be made up of more than one column—even, possibly, of all the columns).

2nd Normal Form (2NF)

Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of
the key.

Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the
(composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if
it has no partial dependencies."

3rd Normal Form (3NF)

Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies.

Boyce-Codd Normal Form (BCNF)

Def: A table is in BCNF if it is in 3NF and if every determinant is a candidate key.

4th Normal Form (4NF)

Def: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.

5th Normal Form (5NF)

Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and
if every join dependency in the table is a consequence of the candidate keys of the table.

Domain-Key Normal Form (DKNF)

Def: A table is in DKNF if every constraint on the table is a logical consequence of the
definition of keys and domains

5.3 Database Normal Forms

A normal form refers to a class of relational schemas that obey some set of rules. Schemas that obey
the rules are said to be in the normal form. It is usually the case that a schema designer wants all
relations created to be in as many normal forms as possible.

5.3.1 Un-normalized Data
Accountant | Skill Numbers | Skill Categories    | Prof    | Composite       | Accountant | Office | Office | Office
Number     | (list)        | (list)              | (list)  | Accountant Name | Age        | Number | City   | Supervisor
21         | 113           | Systems             | 3       | Alice Adams     | 55         | 52     | NYC    | Brown
35         | 113, 179, 204 | Systems, Tax, Audit | 5, 1, 6 | Doug Baker      | 32         | 44     | LA     | Green
50         | 179           | Tax                 | 2       | Charlie Cody    | 40         | 44     | LA     | Green
77         | 148, 179      | Consulting, Tax     | 6, 6    | Doug Doe        | 52         | 52     | NYC    | Brown

5.3.2 First Normal Form

A relation is in First Normal Form (1NF) if every field value is non-decomposable (atomic) and every
tuple is unique. Attributes that partition into sub-attributes, and "repeating groups," which are very
popular in the file-based systems, are not permitted. Tuples in which the value for every tuple is
exactly the same as another tuple are not allowed.

There are two common types of non-atomic fields. The first non-atomic type of field is a list or
"repeating group." For example, one person may have several skills. If these are simply listed in the
same field then the relation is not in first normal form. The second type is a structured field, e.g. an
address. If the entire address is put in one field the relation is not in first normal form because a US
address consists of a street address, a city, a state, and a zip code. Whether a field is considered
structured or not is somewhat application dependent. If you will never want to access the data by
city, by state or by zip code then it may be appropriate to have the entire address in one field, but if
you want to sort by zip code, for example, then it is not appropriate to put the entire address in one
field.

The usefulness of first normal form is fairly obvious. Repeating groups destroy the natural
rectangular structure of a relation. It is extremely difficult to refer to a particular element of the
repeating group, because some sort of "position" within the attribute value must be specified.
Furthermore, the various parts of a partitioned attribute may behave very differently from the
standpoint of dependencies. The rule of first normal form embodies the common sense requirement
that each attribute have its own name. The relation in Figure C10 is not in 1NF because it contains
lists and/or structured fields.

 First Normal Form Data

Key: (Accountant Number, Skill Number)
Accountant Number | Skill Number | Skill Category | Prof | First Name | Last Name | Accountant Age | Group Number | Group City | Group Supervisor
21 113 Systems 3 Alice  Adams 55 52 NYC Brown
35 113 Systems 5 Doug  Baker 32 44 LA Green
35 179 Tax 1 Doug  Baker 32 44 LA Green
35 204 Audit 6 Doug  Baker 32 44 LA Green
50 179 Tax 2 Charlie  Cody 40 44 LA Green
77 148 Consulting 6 Doug  Doe 52 52 NYC Brown
77 179 Tax 6 Doug  Doe 52 52 NYC Brown

Information that is only in 1NF is highly redundant. To reduce redundancy we will convert the data to
second normal form. Fixing relations that are not in first normal form is a fairly simple task. Each
attribute is assigned its own name, and repeating group elements become individual tuples.

Normal forms higher than first normal form have been motivated by the discovery of anomalies,
operations on a relation that result in an inconsistency or undesirable loss of data. The removal of
anomalies involves progression through the various levels of normal form. This progression is aimed
at an ideal:

Every information-bearing association between data values appears once and only once, and does
not depend on the presence of other associations.

5.3.3 Second Normal Form

A relation is in Second Normal Form (2NF) if it is in 1NF and if all of its attributes are dependent on
the whole key (i.e. none of the non-key attributes are related only to a part of the key). The
decomposition technique for producing a second-normal-form relation is also fairly simple: construct
a separate relation to embody the partial dependency, and remove the dependent attribute from the
original relation.

The three new relations created when the partial dependencies are eliminated are shown below.

Second Normal Form Relations

Accountant Table
Key: Accountant Number
Accountant Number | First Name | Last Name | Accountant Age | Group Number | Group City | Group Supervisor
21 Alice Adams 55 52 NYC Brown
35 Doug Baker 32 44 LA Green
50 Charlie Cody 40 44 LA Green
77 Doug Doe 52 52 NYC Brown
(The accountant name, age, and group information are determined by the Accountant Number and are
not dependent on the skill.)

Skill Table
Key: Skill Number
Skill Number | Skill Category
113 Systems
179 Tax
204 Audit
148 Consulting
(The Skill Category is determined by the Skill Number and is not dependent on the Accountant Number.)

Proficiency Table
Key: (Accountant Number, Skill Number)
Accountant Number | Skill Number | Prof
21 113 3
35 113 5
35 179 1
35 204 6
50 179 2
77 148 6
77 179 6
(Only Prof (proficiency) is determined by the combined key.)
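As a rough illustration of the decomposition just shown (project each dependency into its own relation and drop the duplicate rows), here is a short Python sketch; the column positions and the helper name project are assumptions, and the data is the 1NF accountant data from above.

```python
# Sketch of decomposing the 1NF relation into 2NF relations by projection.
# Each projection keeps only the attributes of one dependency and removes
# the duplicate rows that the redundancy had introduced.

FIRST_NF = [
    # (AcctNo, SkillNo, SkillCat, Prof, First, Last, Age, GroupNo, City, Supervisor)
    (21, 113, "Systems", 3, "Alice", "Adams", 55, 52, "NYC", "Brown"),
    (35, 113, "Systems", 5, "Doug", "Baker", 32, 44, "LA", "Green"),
    (35, 179, "Tax", 1, "Doug", "Baker", 32, 44, "LA", "Green"),
    (35, 204, "Audit", 6, "Doug", "Baker", 32, 44, "LA", "Green"),
    (50, 179, "Tax", 2, "Charlie", "Cody", 40, 44, "LA", "Green"),
    (77, 148, "Consulting", 6, "Doug", "Doe", 52, 52, "NYC", "Brown"),
    (77, 179, "Tax", 6, "Doug", "Doe", 52, 52, "NYC", "Brown"),
]

def project(rows, indexes):
    """Keep only the columns at the given positions, removing duplicate rows."""
    return sorted({tuple(r[i] for i in indexes) for r in rows})

# Accountant Table: attributes determined by Accountant Number alone.
accountant = project(FIRST_NF, [0, 4, 5, 6, 7, 8, 9])
# Skill Table: Skill Category is determined by Skill Number alone.
skill = project(FIRST_NF, [1, 2])
# Proficiency Table: Prof depends on the whole (AcctNo, SkillNo) key.
proficiency = project(FIRST_NF, [0, 1, 3])

print(accountant, skill, proficiency, sep="\n")
```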

5.3.4 Third Normal Form

A relation is in Third Normal Form (3NF) if it is in 2NF and there are no transitive dependencies (i.e.
none of the non-key attributes are dependent upon another attribute which in turn is dependent on
the relation key).

A relation is said to be in third normal form if it is in second normal form and there are no attributes
(or sets of attributes) X and Y for which:

KEY → X and X → Y

when X is not a candidate key.

Another way of looking at 3NF is that it results in relations that represent entities and relations that
represent relationships among entities. An appropriate decomposition for converting a schema to
third normal form is to place those attributes which are not directly (i.e., are only transitively)
dependent on the key into a separate relation. The decompositions shown so far have two important
features:

1.Each dependency is embodied in some relation (the decomposition preserves dependencies).

2.If an extension of the original relation is decomposed, it can be reconstructed via a JOIN of its
components (the decomposition has lossless join).

In general, any relation schema can be decomposed into a third normal form schema that has
lossless join and preserves dependencies.

Third Normal Form Data

Accountant Table
Key: Accountant Number
Accountant Number | First Name | Last Name | Accountant Age | Group Number
21 Alice  Adams 55 52
35 Doug  Baker 32 44
50 Charlie  Cody 40 44
77 Doug  Doe 52 52

Group Table
Key: Group Number
Group Number | Group City | Group Supervisor
44 LA Green
52 NYC Brown
Skill Table
Skill Number Skill Category
113 Systems
179 Tax
204 Audit
148 Consulting

   

Proficiency Table
Key: (Accountant Number, Skill Number)
Accountant Number | Skill Number | Prof
21 113 3
35 113 5
35 179 1
35 204 6

50 179 2
77 148 6
77 179 6

5.3.5 Boyce/Codd Normal Form 

A relation is in Boyce/Codd Normal Form (BCNF) if it is in 2NF and all of its determinants (attributes
upon which other attributes depend) are candidate keys (i.e. the attributes could have been chosen
as keys). For most practical purposes 3NF and BCNF have the same effect. 3NF and BCNF are the
same if the word "candidate" is added to "key" in the definitions of 2NF and 3NF.

5.3.6 Other Normal Forms

The road to total normalization does not end here. Other anomalous situations may occur, and many
other normal forms have been defined. Although six levels of normalization are possible, normalizing
to 3NF is usually sufficient, and in some cases 2NF is sufficient.

Normalization by example

Rule 1. Eliminate Repeating Groups. Make a separate table for each set of related
attributes, and give each table a primary key.

Unnormalized Data Items for Puppies:
Puppy Number
Puppy Name
Kennel Code
Kennel Name
Kennel Location
Trick ID 1...n
Trick Name 1...n
Trick Where Learned 1...n
Skill Level 1...n

In the original list of data, each puppy's description is followed by a list of tricks the puppy has
learned. Some might know ten tricks, some might not know any. To answer the question, "Can Fifi roll
over?", we need first to find Fifi's puppy record, then scan the list of tricks associated with the
record. This is awkward, inefficient, and extremely untidy.

Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks
from the puppy information results in first normal form. The puppy number in the trick table matches
the primary key in the puppy table, providing a foreign key for relating the two tables with a join
operation. Now we can answer our question with a direct retrieval: look to see if Fifi's puppy number
and the trick ID for "roll over" appear together in the trick table.

FIRST NORMAL FORM

Puppy Table
Puppy Number      (Primary Key; every puppy gets a unique number)
Puppy Name
Kennel Code
Kennel Name
Kennel Location

Trick Table
Puppy Number      (Primary Key [composite])
Trick ID
Trick Name
Trick Where Learned
Skill Level
(We'll add a row for every trick learned by every puppy.)

[Note that Rettig's Rule 1, "Eliminate Repeating Groups," has the effect of ensuring that the tables
satisfy the conditions on what we have called a well-formed table in a relational database, and that
well-formed tables satisfy the conditions of the 1st Normal Form. Note also that in the Trick Table,
the primary key is a multi-valued, or "composite," key, which consists of both Puppy Number and
Trick ID.]

Rule 2. Eliminate Redundant Data. If an attribute depends on only part of a multi-valued key,
remove it to a separate table.

TRICK TABLE
Puppy Number | Trick ID | Trick Name | Trick Where Learned | Skill Level
52             27         Roll Over    16                    9
53             16         Nose Stand   9                     9
54             27         Roll Over    9                     5

The trick name (e.g., "Roll Over") appears redundantly for every puppy that knows it. Just Trick ID
would do. [Note that Trick Name depends on only a part (the Trick ID) of the multi-valued, i.e.,
composite, key.]

In the Trick Table, the primary key is made up of the puppy number and the trick ID. This makes
sense for the "Where Learned" and "Skill Level" attributes, since they will be different for every
puppy-trick combination. But the trick name depends only on the Trick ID. The same name will appear
redundantly every time its associated ID appears in the Trick Table.

Suppose you want to reclassify a trick, i.e., to give it a different trick ID. The change has to be made
for every puppy that knows the trick! If you miss some of the changes, you will have several puppies
with the same trick under different IDs. This is an update anomaly.

Or suppose the last puppy knowing a particular trick gets run over by a car. His records will be
removed from the database, and the trick will not be stored anywhere! This is a delete anomaly. To
avoid these problems, we need the second normal form.

SECOND NORMAL FORM

Puppy Table
Puppy Number
Puppy Name
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level
To achieve this, separate the attributes depending on both parts of the key from those depending
only on the Trick ID. This results in two tables: "Tricks," which gives the name for each Trick ID; and
"Puppy Tricks," which lists the tricks learned by each puppy.

[Note that Rettig's Rule 2, "Eliminate Redundant Data," is aimed at tables that fail to satisfy the
conditions imposed on tables by the definition of 2nd Normal Form. In this example it takes the
"Trick" table, which fails to be in 2NF because the Trick Name attribute depends only on the Trick ID
attribute (i.e., on just part of the composite key), and turns "Trick" into two tables. Note also that the
result is a set of tables that come close to being single-theme tables. "Tricks" is clearly a single-
theme table. "Puppy Tricks" is a single-theme table because it pairs puppies and their tricks, i.e., it
deals with the relationships between individual puppies and the tricks they know. In "Puppy Tricks,"
the "Trick Where Learned" attribute is clearly dependent on both parts of the key, since the attribute
is based not only on which particular trick is being referred to but also on where the particular puppy
learned that trick. The same is true of the "Skill Level" attribute, since this attribute is based not only
on which particular trick is being referred to but also on what the particular puppy's level of skill on
that trick is.]

Now we can reclassify a trick in a single operation: Look up the Trick ID in the "Tricks" table and
change its name. The result will instantly be available throughout the application.

Rule 3. Eliminate Columns Not Dependent on Key. If attributes do not contribute to a
description of the key, remove them to a separate table.

Puppy Table (before Third Normal Form)
Puppy Number
Puppy Name
Kennel Code
Kennel Name
Kennel Location

The Puppy Table satisfies the first normal form, since it contains no repeating groups. It satisfies the
second normal form, since it does not have a multivalued key. But the key is Puppy Number, and the
kennel name and kennel location describe only a kennel, not a puppy. To achieve the third normal
form, they must be moved into a separate table. Since they describe a kennel, Kennel Code becomes
the key of the new "Kennels" table.

THIRD NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

The motivation for this is the same as for the second normal form: We want to avoid update and
delete anomalies. For example, suppose no puppies from the Daisy Hill Puppy Farm were currently
stored in the database. With the previous design, there would be no record of its existence!

[Note that Rettig's Rule 3, "Eliminate Columns Not Dependent on Key," amounts to removing the
transitive dependencies that keep a table from achieving 3rd Normal Form. In the "Puppy" table,
attributes Kennel Name and Kennel Location are dependent only on Kennel Code. This situation is
superficially similar to that of the dependence on part of a composite key that is the focus of the 2nd
Normal Form. However, this situation differs in that here the attribute (viz., Kennel Code) on which
other attributes are dependent is not part of a composite key; it is simply one of the non-key
attributes in the "Puppy" table. Note also that "Kennels" is clearly a single-theme table. "Puppies" is
close to a single-theme table, in that it concentrates on puppies and their numbers. If the real-life
situation is (which is likely) that any given puppy can be identified with one and only one kennel, then
we can consider "Puppies" to be a single-theme table. But if (which is not likely) puppies sometimes
start life in one kennel and are then transferred, as puppies, to another kennel, we would not be able
to consider "Puppies" to be single-theme because we would have to make further modifications to
handle the mixed origins.]

Third Normal Form is sufficient for most situations. But if that isn't normal enough for
you ...

Rule 4. Isolate Independent Multiple Relationships. No table may contain two or more 1:n
or n:m relationships that are not directly related.

This applies only to designs that include one-to-many and many-to-many relationships. An example
of a one-to-many relationship is that one kennel can hold many puppies. An example of a many-to-
many relationship is that a puppy can know many tricks and several puppies can know the same
trick.

Puppy Tricks and Costumes
Puppy Number
Trick ID
Trick Where Learned
Skill Level
Costume

Suppose we want to add a new attribute to the Puppy-Trick table, "Costume." This way we can look
for puppies that can both "sit-up-and-beg" and wear a Groucho Marx mask, for example. The fourth
normal form dictates against this (i.e., against using the Puppy-Tricks table, not against begging while
wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy may be
able to wear a wet suit. This does not mean it can simultaneously sit up and beg. How will you
represent this if you store both attributes in the same table?

FOURTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number

[Note that Rettig's Rule 4, "Isolate Independent Multiple Relationships," amounts to the elimination of
multi-valued dependencies that is the focus of the 4th Normal Form. In table "Puppy Tricks and
Costumes," attribute Costume may or may not be dependent on attribute Trick ID as well as on
attribute Puppy Number. We remove the ambiguity by constructing two replacement tables,
"Costumes" and "Puppy Costumes," each of which is clearly a single-theme table. Above is the 4th
Normal Form of the database. It consists of the Third Normal Form tables plus the new tables. Note
also that table "Puppy Costumes" has a composite primary key that is made up of two foreign keys,
from the "Puppies" and "Costumes" tables, respectively.]
  Rule 5. Isolate Semantically Related Multiple Relationships. There may be practical
constraints on information that justify separating logically related many-to-many
relationships.

Usually, related attributes belong together. For example, if we really wanted to record which tricks
each puppy could do in which costume, we would want to keep the Costume attribute in the Puppy-
Tricks table. But there are times when special characteristics of the data make it more efficient to
separate even logically related attributes.

Imagine that our database will record when breeds are available in each kennel, and which breeder
supplies dogs to those kennels. This suggests a Kennel-Breeder-Breeds table, which satisfies fourth
normal form. As long as any kennel can supply any breed from any breeder, this works fine.

Kennel-Breeder-Breeds
Kennel Number
Breeder
Breed

Now suppose a law is passed to prevent exclusive arrangements: a kennel selling any breed must
offer that breed from all breeders it deals with. In other words, if Khabul Khennels sells Afghans and
wants to sell any Daisy Hill puppies, it must sell Daisy Hill Afghans.

The need for fifth normal form becomes clear when we consider inserts and deletes. Suppose that a
kennel (whose number in the database happens to be 5) decides to offer three new breeds: Spaniels,
Dachshunds, and West Indian Banana-Biters. Suppose further that this kennel already deals with three
breeders that can supply those breeds. This will require nine new rows in the database, one for each
breeder-and-breed combination.

Kennel-Breeder-Breeds
Kennel Number | Breeder       | Breed
5               Acme            Spaniel
5               Acme            Dachshund
5               Acme            Banana-Biter
5               Puppy Factory   Spaniel
5               Puppy Factory   Dachshund
5               Puppy Factory   Banana-Biter
5               Whatapuppy      Spaniel
5               Whatapuppy      Dachshund
5               Whatapuppy      Banana-Biter

Breaking up the table reduces the number of inserts to six. Here are the tables necessary for fifth
normal form, shown with the six newly inserted rows.

FIFTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number

Kennel-Breed
Kennel Number | Breed
5               Spaniel
5               Dachshund
5               Banana-Biter

Kennel-Breeder
Kennel Number | Breeder
5               Acme
5               Puppy Factory
5               Whatapuppy

If an application involves significant update activity, fifth normal form can mean important savings.
Note that these combination tables develop naturally out of entity-relationship analysis.

[Note also that each of the tables above is clearly a single-theme table.]

Transaction Models

A transaction is a unit of program execution that accesses and possibly updates various data
items.

A transaction must see a consistent database.

During transaction execution the database may be inconsistent.

When the transaction is committed, the database must be consistent.

Two main issues to deal with:

Failures of various kinds, such as hardware failures and system crashes

Concurrent execution of multiple transactions

ACID Properties

To preserve integrity of data, the database system must ensure:

Atomicity. Either all operations of the transaction are properly reflected in the database or none
are.

Consistency. Execution of a transaction in isolation preserves the consistency of the database.

Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions Ti
and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started
execution after Ti finished.

Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.

Transaction State

1. Active, the initial state; the transaction stays in this state while it is executing
2. Partially committed, after the final statement has been executed.
3. Failed, after the discovery that normal execution can no longer proceed.
4. Aborted, after the transaction has been rolled back and the database restored to its state prior
to the start of the transaction. Two options after it has been aborted :
a. restart the transaction -- only if no internal logical error
b. kill the transaction
5. Committed, after successful completion.

Concurrent Executions

Multiple transactions are allowed to run concurrently in the system. Advantages are :

increased processor and disk utilization, leading to better transaction throughput: one
transaction can be using the CPU while another is reading from or writing to the disk

reduced average response time for transactions: short transactions need not wait behind
long ones

Concurrency control schemes -- mechanisms to control the interaction among the concurrent
transactions in order to prevent them from destroying the consistency of the database

 Most DBMS are multi-user systems


 There is a need to ensure that concurrent transactions do not interfere with each other's
operations
 Transaction scheduling algorithms

Problems due to the Concurrent Execution of Transactions

 The Lost Update Problem


 The Temporary Update (uncommitted dependency) Problem
 The Incorrect Summary (inconsistent analysis) Problem
The Lost Update Problem

 Two transactions accessing the same database item have their operations interleaved in a way
that makes the database item incorrect.

T1:                       T2:
read_item(X);
X := X - N;
                          read_item(X);
                          X := X + M;
write_item(X);
read_item(Y);
                          write_item(X);
Y := Y + N;
write_item(Y);

If transactions T1 and T2 are submitted at approximately the same time and their operations are
interleaved then the value of database item X will be incorrect because T2 reads the value of X
before T1 changes it in the database and hence the updated database value resulting from T1 is
lost.
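
A tiny Python simulation (illustrative only; the values X = 100, N = 10, M = 20 are assumed) makes the lost update visible: the interleaving above ends with X = 120, while either serial order gives 110.

```python
# Simulation of the lost update interleaving shown above.
# T1 computes X - N, T2 computes X + M; T2's write overwrites T1's.

db = {"X": 100}          # shared database item
N, M = 10, 20

# T1: read_item(X); X := X - N
t1_x = db["X"]
t1_x = t1_x - N
# T2: read_item(X); X := X + M      (T2 reads before T1 has written)
t2_x = db["X"]
t2_x = t2_x + M
# T1: write_item(X)
db["X"] = t1_x
# T2: write_item(X)                 (overwrites T1's update)
db["X"] = t2_x

print(db["X"])   # 120 -- T1's subtraction of N is lost; a serial order gives 110
```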

The Temporary Update Problem

 One transaction updates a database item and then the transaction, for some reason, fails. The
updated item is accessed by another transaction before it is changed back to its original value.

T1:                       T2:
read_item(X);
X := X - N;
write_item(X);
                          read_item(X);
                          X := X - N;
                          write_item(X);
read_item(Y);

transaction T1 fails and must change the value of X back to its old value; meanwhile T2 has read the
"temporary" incorrect value of X

The Incorrect Summary Problem

One transaction is calculating an aggregate summary function on a number of records while other
transactions are updating some of these records. The aggregate function may calculate some values
before they are updated and others after

T1:                       T3:
                          sum := 0;
                          read_item(A);
                          sum := sum + A;
read_item(X);
X := X - N;
write_item(X);
                          read_item(X);
                          sum := sum + X;
                          read_item(Y);
                          sum := sum + Y;
read_item(Y);
Y := Y + N;
write_item(Y);

T3 reads X after N is subtracted and reads Y before N is added, so a wrong summary is the result.

Schedules

Schedules -- sequences that indicate the chronological order in which instructions of concurrent
transactions are executed

a schedule for a set of transactions must consist of all instructions of those transactions
must preserve the order in which the instructions appear in each individual transaction

 A schedule S of n transactions is a sequential ordering of the operations of the n transactions.


 A schedule maintains the order of operations within the individual transaction. It is subject to
the constraint that for each transaction T participating in S, if operation i is performed in T
before operation j, then operation i will be performed before operation j in S.
 The serializability theory attempts to determine the 'correctness' of the schedules.
 A schedule S is serial if, for every transaction T participating in S, all of T's operations are
executed consecutively in the schedule; otherwise it is called nonserial.
 A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the
same n transactions.

Example Schedules

Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following
is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.


Let T1 and T2 be the transactions defined previously. The following schedule (Schedule 3 in the
text) is not a serial schedule, but it is equivalent to Schedule 1.

The following concurrent schedule (Schedule 4 in the text) does not preserve the value of the
sum A + B.


Serializability

Basic Assumption -- Each transaction preserves database consistency.

Thus serial execution of a set of transactions preserves database consistency

A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different
forms of schedule equivalence give rise to the notions of:

1. conflict serializability

2. view serializability

We ignore operations other than read and write instructions, and we assume that transactions
may perform arbitrary computations on data in local buffers in between reads and writes. Our
simplified schedules consist of only read and write instructions.

Conflict Serializability

Conflict serializability

 Two operations conflict if:


o they are issued by different transactions,
o they operate on the same data item, and
o at least one of them is a write operation

 Two executions are conflict-equivalent, if in both executions all conflicting operations have the
same order

 An execution is conflict-serializable if it is conflict-equivalent to a serial history

Instructions Ii and Ij of transactions Ti and Tj respectively, conflict if and only if there exists
some item Q accessed by both Ii and Ij, and at least one of these instructions wrote Q.

1. Ii = read(Q), Ij = read(Q). Ii and Ij don't conflict.

2. Ii = read(Q), Ij = write(Q). They conflict.

3. Ii = write(Q), Ij = read(Q). They conflict.

4. Ii = write(Q), Ij = write(Q). They conflict.

Intuitively, a conflict between Ii and Ij forces a (logical) temporal order between them. If Ii and Ij
are consecutive in a schedule and they do not conflict, their results would remain the same even
if they had been interchanged in the schedule.

If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting
instructions, we say that S and S' are conflict equivalent.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

Example of a schedule that is not conflict serializable

Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1,
by a series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.


View serializability

 Two executions (schedules, histories) are view-equivalent, if in both executions:


o each transaction reads the same value (i.e., the value produced by the same
transaction), and
o the final write on each data item is the same (i.e., was performed by the same
transaction)

 An execution is view-serializable if it is view-equivalent to a serial history

 An execution is conflict-serializable if it is conflict-equivalent to a serial history

Let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if
the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was
produced by transaction Tj (if any), then transaction Ti must in schedule S' also read the
value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S'.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.

Schedule 9 (from text) --- a schedule which is view-serializable but not conflict serializable.

Every view serializable schedule which is not conflict serializable has blind writes

Other Notions of Serializability


Schedule 8 (from text) given below produces the same outcome as the serial schedule <T1, T5>,
yet is not conflict equivalent or view equivalent to it.

Determining such equivalence requires analysis of operations other than read and write

Testing for Serializability


Consider some schedule of a set of transactions T1, T2, ..., Tn.
Precedence graph --- a directed graph where the vertices are the transactions (names).
We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data item on
which the conflict arose earlier.
We may label the arc by the item that was accessed.
Example 1


Example Schedule (Schedule A)

Precedence Graph for Schedule A

Test for Conflict Serializability


A schedule is conflict serializable if and only if its precedence graph is acyclic.

Cycle-detection algorithms exist which take order n^2 time, where n is the number of vertices in
the graph. (Better algorithms take order n + e, where e is the number of edges.)
If the precedence graph is acyclic, the serializability order can be obtained by a topological sorting of
the graph. This is a linear order consistent with the partial order of the graph. For example, a
serializability order for Schedule A would be T5 -> T1 -> T3 -> T2 -> T4.
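
The test just described translates into a short sketch: build the precedence graph from the read/write steps of a schedule and topologically sort it. The Python below is illustrative; the step format (transaction, operation, item) and the function names are assumptions.

```python
# Sketch of the conflict-serializability test: build a precedence graph and
# topologically sort it. A schedule is a list of (transaction, op, item) steps.

from collections import defaultdict

def precedence_graph(schedule):
    edges = defaultdict(set)
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges[ti].add(tj)        # Ti accessed the item first, so Ti -> Tj
    return edges

def serializability_order(schedule):
    """Return a serial order if the graph is acyclic, else None (not conflict serializable)."""
    edges = precedence_graph(schedule)
    txns = {t for t, _, _ in schedule}
    indegree = {t: 0 for t in txns}
    for t in edges:
        for u in edges[t]:
            indegree[u] += 1
    order, ready = [], [t for t in txns if indegree[t] == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for u in edges[t]:
            indegree[u] -= 1
            if indegree[u] == 0:
                ready.append(u)
    return order if len(order) == len(txns) else None

# T1 then T2 on item A: conflict serializable, equivalent to <T1, T2>.
print(serializability_order([("T1", "r", "A"), ("T1", "w", "A"),
                             ("T2", "r", "A"), ("T2", "w", "A")]))
# Classic non-serializable interleaving on A: returns None.
print(serializability_order([("T1", "r", "A"), ("T2", "r", "A"),
                             ("T1", "w", "A"), ("T2", "w", "A")]))
```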

Test for View Serializability


The precedence graph test for conflict serializability must be modified to apply to a test for view
serializability.
Construct a labeled precedence graph. Look for an acyclic graph which is derived from the labeled
precedence graph by choosing one edge from every pair of edges with the same non-zero label.
Schedule is view serializable if and only if such an acyclic graph can be found.
The problem of looking for such an acyclic graph falls in the class of NP-complete problems. Thus
existence of an efficient algorithm is unlikely. However practical algorithms that just check some
sufficient conditions for view serializability can still be used.

Concurrency Control vs. Serializability Tests


Testing a schedule for serializability after it has executed is a little too late!
Goal -- to develop concurrency control protocols that will assure serializability. They will generally
not examine the precedence graph as it is being created; instead a protocol will impose a
discipline that avoids nonserializable schedules. Will study such protocols in Chapter 14.
Tests for serializability help understand why a concurrency control protocol is correct.

Recoverability
Need to address the effect of transaction failures on concurrently
running transactions.
Recoverable schedule --- if a transaction Tj reads a data item previously written by a transaction
Ti, the commit operation of Ti appears before the commit operation of Tj.
The following schedule (Schedule 11) is not recoverable if T9 commits immediately after the read.

If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent
database state. Hence the database must ensure that schedules are recoverable.

Cascading rollback -- a single transaction failure leads to a series of transaction rollbacks.


Consider the following schedule where none of the transactions has yet committed (so the
schedule is recoverable)


If T10 fails, T11 and T12 must also be rolled back.


Can lead to the undoing of a significant amount of work

Cascadeless schedules --- cascading rollbacks cannot occur; for each pair of transactions Ti and
Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears
before the read operation of Tj.
Every cascadeless schedule is also recoverable
It is desirable to restrict the schedules to those that are cascadeless.

Relational System Architecture


Databases are BIG pieces of software. Typically very hard to modularize. Lots of system design
decisions at the macro and micro scale. We will focus mostly on micro decisions -- and hence ideas
reusable outside DBMSs -- in subsequent lectures. Here we focus on macro design.

Disk management choices:

o file per relation


o big file in file system
o raw device

Process Model:

o process per user


o server
o multi-server

Basic modules:

o parser
o query rewrite
o optimizer
o query executor
o access methods
o buffer manager
o lock manager
o log/recovery manager

Query Rewriter

o Flattens views (why?)


o may change query semantics (constraints, protection, etc.)

Optimizer

o large space of equivalent relational plans


o pick one that's going to be "optimal" (?)
o produces either an interpretable plan tree, or compiled code

Executor

o modules to perform relation operations like joins, sorts, aggregations, etc.


o calls Access Methods for operations on base and temporary relations

Access Methods

o uniform relational interface (open, get next), a la INGRES AMI, System R's RSS
o multiple implementations: heap, B-tree, extensible hashing

Buffer Manager

o Intelligent user-level disk cache


o must interact with transaction manager & lock manager
o Virtual memory does not cut it!  (we'll discuss this at length)

Lock Manager

o must efficiently support lock table


o System R architecture influential:
 physical and logical locks treated uniformly
 multiple granularity of locks
 set intent locks at high levels
 we will study this in more detail later (Gray)
o deadlock handling: detection

Log/Recovery Manager

o "before/after" log on values


o checkpoint/restore facility for quick recovery
o Redo/Undo on restore
o Support soft crashes off disk, hard crashes off tape.
o System R's shadowing is too slow. Use Write-Ahead Logging! (WAL) More on this to
come.
o Hard to get right!

Concurrency Control Techniques

 Locking Techniques for Concurrency Control


 Concurrency Control Based on Timestamp Ordering
 Multiversion Concurrency Control Techniques
 Validation (Optimistic) Concurrency Control Techniques

Locking Techniques for Concurrency Control

A lock is a variable associated with a data item that describes the status of the item with respect to
possible operations that can be applied to it.

TYPES OF LOCKS: 

Binary locks: only two states of a lock; too simple and too restrictive; not used in practice. 

Shared/exclusive locks: which provide more general locking capabilities and are used in practical
database locking schemes. (Read Lock as a shared lock, Write Lock as an exclusive lock).

Binary Locks:

A binary lock can have two states or values: locked and unlocked (or 1 and 0, for simplicity). A
distinct lock is associated with each database item X. 

1. If the value of the lock on X is 1, item X cannot be accessed by a database operation that
requests the item.
2.  If the value of the lock on X is 0, the item can be accessed when requested. We refer to the
current value (or state) of the lock associated with item X as LOCK(X).

Two operations, lock_item and unlock_item, are used with binary locking. 

Lock_item(X): A transaction requests access to an item X by first issuing a lock_item(X)


operation. If LOCK(X) = 1, the transaction is forced to wait. If LOCK(X) = 0, it is set to 1 (the
transaction locks the item) and the transaction is allowed to access item X. 

Unlock_item (X): When the transaction is through using the item, it issues an unlock_item(X)
operation, which sets LOCK(X) to 0 (unlocks the item) so that X may be accessed by other
transactions. Hence, a binary lock enforces mutual exclusion on the data item; i.e., only one
transaction at a time can hold the lock on a given item.

If the simple binary locking scheme described here is used, every transaction must obey the following
rules:

1. A transaction T must issue the operation lock_item(X) before any read_item(X) or


write_item(X) operations are performed in T.
2. A transaction T must issue the operation unlock_item(X) after all read_item(X) and
write_item(X) operations are completed in T.
3. A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X .
4. A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on
item X.

These rules can be enforced by the lock manager module of the DBMS. Between the lock_item(X)
and unlock_item(X) operations in transaction T, T is said to hold the lock on item X. At most one
transaction can hold the lock on a particular item. Thus no two transactions can access the same
item concurrently.

Shared/Exclusive (or Read/Write) Locks:

A read-locked item is also called share-locked, because other transactions are allowed to read
the item, whereas a write-locked item is called exclusive-locked, because a single transaction
exclusively holds the lock on the item. There are three locking operations: read_lock(X), write_lock(X),
and unlock(X).

When we use the shared/exclusive locking scheme, the system must enforce the following rules:

1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any


read_item(X) operation is performed in T.
2. A transaction T must issue the operation write_lock(X) before any write_item(X) operation is
performed in T.
3. A transaction T must issue the operation unlock(X) after all read_item(X) and write_item(X)
operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock
or a write (exclusive) lock on item X. This rule may be relaxed. 
5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock
or write (exclusive) lock on item X. This rule may be relaxed.

6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared)
lock or a write (exclusive) lock on item X.
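
A minimal sketch of a shared/exclusive lock table that follows these rules, assuming a single-threaded setting where a refused request simply means the transaction must wait (the class and method names are illustrative):

```python
# Sketch of a shared/exclusive lock table. Instead of blocking the caller, the
# request methods return True (granted) or False (the transaction must wait).

class LockTable:
    def __init__(self):
        self.readers = {}   # item -> set of transactions holding a read (shared) lock
        self.writer = {}    # item -> the transaction holding the write (exclusive) lock

    def read_lock(self, txn, item):
        holder = self.writer.get(item)
        if holder is not None and holder != txn:
            return False                       # another transaction holds an exclusive lock
        self.readers.setdefault(item, set()).add(txn)
        return True

    def write_lock(self, txn, item):
        holder = self.writer.get(item)
        if holder not in (None, txn):
            return False                       # another writer holds the lock
        if self.readers.get(item, set()) - {txn}:
            return False                       # other readers exist; an upgrade must wait
        self.writer[item] = txn                # grant (or upgrade to) the exclusive lock
        return True

    def unlock(self, txn, item):
        self.readers.get(item, set()).discard(txn)
        if self.writer.get(item) == txn:
            del self.writer[item]

locks = LockTable()
print(locks.read_lock("T1", "X"))    # True  -- shared locks can coexist
print(locks.read_lock("T2", "X"))    # True
print(locks.write_lock("T2", "X"))   # False -- T1 still holds a read lock, so T2 must wait
```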

Conversion of Locks

Sometimes it is desirable to relax conditions 4 and 5 in the preceding list in order to allow lock
conversion; that is, a transaction that already holds a lock on item X is allowed under certain
conditions to convert the lock from one locked state to another. For example, it is possible for a
transaction T to issue a read_lock(X) and then later on to upgrade the lock by issuing a
write_lock(X) operation. 

UPGRADING

If T is the only transaction holding a read lock on X at the time it issues the write_lock(X) operation,
the lock can be upgraded; otherwise, the transaction must wait.

DOWNGRADING

It is also possible for a transaction T to issue a write_lock(X) and then later on to downgrade the
lock by issuing a read_lock(X) operation. When upgrading and downgrading of locks is used, the lock
table must include transaction identifiers in the record structure for each lock (in the
locking_transaction(s) field) to store the information on which transactions hold locks on the item. 

Note: using binary locks and read/write locks in transactions does not guarantee serializability of
the schedule on its own.

Guaranteeing Serializability by Two-Phase Locking

A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock,
write_lock) precede the first unlock operation in the transaction. Such a transaction can be divided
into two phases: 

 an expanding or growing (first) phase, during which new locks on items can be acquired
but none can be released; and
 a shrinking (second) phase, during which existing locks can be released but no new locks
can be acquired. If lock conversion is allowed, then upgrading of locks (from read-locked to
write-locked) must be done during the expanding phase, and downgrading of locks (from
write-locked to read-locked) must be done in the shrinking phase. Hence, a read_lock(X)
operation that downgrades an already held write lock on X can appear only in the shrinking
phase.

It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the
schedule is guaranteed to be serializable, obviating the need to test for serializability of schedules
any more. The locking mechanism, by enforcing two-phase locking rules, also enforces serializability.

T1 T2 T'1 T'2
-------------- --------------- --------------- ---------------
read_lock(Y); read_lock(X); read_lock(Y); read_lock(X);
read_item(Y); read_item(X); read_item(Y); read_item(X);
unlock(Y); unlock(X); write_lock(X); write_lock(Y);
write_lock(X): write_lock(Y); unlock(Y); unlock(X);
read_item(X); read_item(Y); read_item(X); read_item(Y);
X := X + Y; Y := X + Y; X := X + Y; Y := X + Y;
write_item(X); write_item(Y); write_item(X); write_item(Y);
unlock(X); unlock(Y); unlock(X); unlock(Y);

Two-phase locking may limit the amount of concurrency that can occur in a schedule. This is because
a transaction T may not be able to release an item X after it is through using it if T must lock an
additional item Y later on; or conversely, T must lock the additional item Y before it needs it so that it
can release X. Hence, X must remain locked by T until all items that the transaction needs to read or
write have been locked; only then can X be released by T. Meanwhile, another transaction seeking to
access X may be forced to wait, even though T is done with X; conversely, if Y is locked earlier than
it is needed, another transaction seeking to access Y is forced to wait even though T is not using Y
yet. This is the price for guaranteeing serializability of all schedules without having to check the
schedules themselves.
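
The two-phase rule itself is easy to check mechanically. The Python sketch below (illustrative; the operation format is assumed) verifies that no lock is acquired after the first unlock, using T1 and T'1 from the table above as test cases.

```python
# Sketch of a checker for the two-phase locking rule: once a transaction has
# released any lock, it must not acquire another one.

def follows_two_phase_locking(operations):
    """operations is a list of ("lock", item) / ("unlock", item) pairs for one transaction."""
    shrinking = False
    for action, _item in operations:
        if action == "unlock":
            shrinking = True
        elif action == "lock" and shrinking:
            return False          # a lock acquired after an unlock violates 2PL
    return True

# T1 from the table above (unlocks Y before locking X): not two-phase.
print(follows_two_phase_locking([("lock", "Y"), ("unlock", "Y"),
                                 ("lock", "X"), ("unlock", "X")]))   # False
# T'1 (acquires both locks before releasing any): two-phase.
print(follows_two_phase_locking([("lock", "Y"), ("lock", "X"),
                                 ("unlock", "Y"), ("unlock", "X")])) # True
```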

Dealing with Deadlock and Starvation

Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some
item X, but X is locked by another transaction T ' in the set. Hence, each transaction in the set is on a
waiting queue, waiting for one of the other transactions in the set to release the lock on an item.

 Deadlock Prevention Protocols (e.g., lock all the items in advance, ordering all items,
transaction timestamp, and no waiting algorithm).

o Wait-die: If TS(Ti) < TS(Tj), then (Ti older than Tj) Ti is allowed to wait; otherwise (Ti
younger than Tj) abort Ti (Ti dies) and restart it later with the same timestamp.
o Wound-wait: If TS(Ti) < TS(Tj), then (Ti older than Tj) abort Tj (Ti wounds Tj) and
restart it later with the same timestamp; otherwise (Ti younger than Tj) Ti is allowed to
wait. (A small sketch of these two rules appears after this list.)


 Deadlock detection, where the system checks if a state of deadlock actually exists.
 Another problem that may occur when we use locking is starvation, which occurs when a
transaction cannot proceed for an indefinite period of time while other transactions in the
system continue normally.
 Solution for starvation (e.g., first-come-first-serve and priority).
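
Here is the sketch of the wait-die and wound-wait rules referred to above (illustrative Python; the timestamp values are assumed). requester is the transaction asking for a lock currently held by holder, and a smaller timestamp means an older transaction.

```python
# Sketch of the wait-die and wound-wait deadlock-prevention rules.
# ts maps each transaction to its timestamp; smaller timestamp = older transaction.

def wait_die(ts, requester, holder):
    """Older requesters wait; younger requesters are aborted (they 'die')."""
    return "wait" if ts[requester] < ts[holder] else "abort requester"

def wound_wait(ts, requester, holder):
    """Older requesters abort the holder (they 'wound' it); younger requesters wait."""
    return "abort holder" if ts[requester] < ts[holder] else "wait"

ts = {"T1": 5, "T2": 8}           # T1 is older than T2
print(wait_die(ts, "T1", "T2"))   # wait            (old waits for young)
print(wait_die(ts, "T2", "T1"))   # abort requester (young dies)
print(wound_wait(ts, "T1", "T2")) # abort holder    (old wounds young)
print(wound_wait(ts, "T2", "T1")) # wait            (young waits for old)
```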

Concurrency Control Based on Timestamp Ordering

A timestamp is a unique identifier created by the DBMS to identify a transaction (it can be thought of as
the transaction start time). The timestamp of transaction T is written TS(T). Concurrency control techniques
based on timestamp ordering do not use locks; hence, deadlocks cannot occur.

The idea for this scheme is to order the transactions based on their timestamps. A schedule in which
the transactions participate is then serializable, and the equivalent serial schedule has the
transactions in order of their timestamp values. This is called timestamp ordering (TO).
In timestamp ordering, however, the schedule is equivalent to the particular serial order
corresponding to the order of the transaction timestamps. The algorithm must ensure that, for each
item accessed by conflicting operations in the schedule, the order in which the item is accessed does
not violate the serializability order. To do this, the algorithm associates with each database item X
two timestamp (TS) values:

1. Read_TS(X): The read timestamp of item X; this is the largest timestamp among all the
timestamps of transactions that have successfully read item X—that is, read_TS(X) = TS(T),
where T is the youngest transaction that has read X successfully.
2. Write_TS(X): The write timestamp of item X; this is the largest of all the timestamps of
transactions that have successfully written item X—that is, write_TS(X) = TS(T), where T is
the youngest transaction that has written X successfully.

Basic Timestamp Ordering

Whenever some transaction T tries to issue a read_item(X) or a write_item(X) operation, the basic
TO algorithm compares the timestamp of T with read_TS(X) and write_TS(X) to ensure that the
timestamp order of transaction execution is not violated. If this order is violated, then transaction T is
aborted and resubmitted to the system as a new transaction with a new timestamp. If T is aborted
and rolled back, any transaction T1 that may have used a value written by T must also be rolled back.
Similarly, any transaction T2 that may have used a value written by T1 must also be rolled back, and
so on. This effect is known as cascading rollback and is one of the problems associated with basic
TO, since the schedules produced are not recoverable. An additional protocol must be enforced to
ensure that the schedules are recoverable, cascadeless, or strict. We first describe the basic TO
algorithm here. The concurrency control algorithm must check whether conflicting operations violate
the timestamp ordering in the following two cases:

1. Transaction T issues a write_item(X) operation:


a. If read_TS(X) > TS(T) or if write_TS(X) > TS(T), then abort and roll back T and reject
the operation. This should be done because some younger transaction with a

timestamp greater than TS(T)—and hence after T in the timestamp ordering—has
already read or written the value of item X before T had a chance to write X, thus
violating the timestamp ordering.
b. If the condition in part (a) does not occur, then execute the write_item(X) operation of
T and set write_TS(X) to TS(T).

2. Transaction T issues a read_item(X) operation:


a. If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should
be done because some younger transaction with timestamp greater than TS(T)—and
hence after T in the timestamp ordering—has already written the value of item X before
T had a chance to read X.
b. If write_TS(X) <= TS(T), then execute the read_item( X) operation of T and set
read_TS(X) to the larger of TS(T) and the current read_TS(X).

Hence, whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect
order, it rejects the later of the two operations by aborting the transaction that issued it. The
schedules produced by basic TO are hence guaranteed to be conflict serializable, like the 2PL
protocol. However, some schedules are possible under each protocol that are not allowed under the
other. Hence, neither protocol allows all possible serializable schedules. As mentioned earlier,
deadlock does not occur with timestamp ordering. However, cyclic restart (and hence starvation) may
occur if a transaction is continually aborted and restarted
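
The two rules of the basic TO algorithm translate almost directly into code. The Python sketch below is illustrative only: it tracks read_TS(X) and write_TS(X) for a single item and reports "abort" when a rule is violated, ignoring restart and recoverability handling.

```python
# Sketch of basic timestamp ordering. Each item X carries read_TS(X) and
# write_TS(X); a violation aborts (and, in a real system, restarts) the transaction.

class Item:
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0

def write_item(item, ts):
    """Rule 1: reject the write if a younger transaction already read or wrote the item."""
    if item.read_ts > ts or item.write_ts > ts:
        return "abort"
    item.write_ts = ts
    return "ok"

def read_item(item, ts):
    """Rule 2: reject the read if a younger transaction already wrote the item."""
    if item.write_ts > ts:
        return "abort"
    item.read_ts = max(item.read_ts, ts)
    return "ok"

X = Item()
print(read_item(X, ts=2))    # ok     (read_TS(X) becomes 2)
print(write_item(X, ts=1))   # abort  (an older write arrives after a younger read)
print(write_item(X, ts=3))   # ok     (write_TS(X) becomes 3)
```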

Multiversion Concurrency Control Techniques

Multiversion Technique Based on Timestamp Ordering

In this method, several versions X1, X2, ..., Xn of each data item X are maintained. For each version,
the value of version Xi and the following two timestamps are kept:

1. read_TS(Xi): The read timestamp of Xi is the largest of all the timestamps of transactions
that have successfully read version Xi.
2. write_TS(Xi): The write timestamp of Xi  is the timestamp of the transaction that wrote
the value of version Xi.

Whenever a transaction T is allowed to execute a write_item(X) operation, a new version Xk+1 of


item X is created, with both the write_TS(Xk+1) and the read_TS(Xk+1) set to TS(T).
Correspondingly, when a transaction T is allowed to read the value of version Xi, the value of
read_TS(Xi) is set to the larger of the current read_TS(Xi) and TS(T).

To ensure serializability, the following two rules are used:

1. If transaction T issues a write_item(X) operation, and version i of X has the highest


write_TS(X) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) >
TS(T), then abort and roll back transaction T; otherwise, create a new version Xj of X with
read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues a read_item(X) operation, find the version i of X that has the highest
write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value

of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the current
read_TS(Xi).
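
A condensed, illustrative Python sketch of these two multiversion rules (version bookkeeping for a single item X; cleanup after aborted transactions is ignored):

```python
# Sketch of multiversion timestamp ordering. Each version of X is a dict
# holding its value, read_TS and write_TS.

versions = [{"value": 0, "read_ts": 0, "write_ts": 0}]   # initial version of X

def relevant_version(ts):
    """The version with the highest write_TS that is <= TS(T)."""
    return max((v for v in versions if v["write_ts"] <= ts),
               key=lambda v: v["write_ts"])

def read_item(ts):
    v = relevant_version(ts)
    v["read_ts"] = max(v["read_ts"], ts)     # rule 2
    return v["value"]

def write_item(ts, value):
    v = relevant_version(ts)
    if v["read_ts"] > ts:                    # rule 1: a younger reader already saw this version
        return "abort"
    versions.append({"value": value, "read_ts": ts, "write_ts": ts})
    return "ok"

print(read_item(5))        # reads the initial version; its read_TS becomes 5
print(write_item(3, 42))   # abort: a transaction with TS 5 already read that version
print(write_item(7, 99))   # ok: creates a new version with read_TS = write_TS = 7
```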

Validation (Optimistic) Concurrency Control Techniques

In optimistic concurrency control techniques, also known as validation or certification


techniques, no checking is done while the transaction is executing. Several proposed concurrency
control methods use the validation technique. In one such scheme, updates made by the transaction are not
applied directly to the database items until the transaction reaches its end. During transaction
execution, all updates are applied to local copies of the data items that are kept for the transaction.
At the end of transaction execution, a validation phase checks whether any of the transaction’s
updates violate serializability. Certain information needed by the validation phase must be kept by the
system. If serializability is not violated, the transaction is committed and the database is updated
from the local copies; otherwise, the transaction is aborted and then restarted later.

There are three phases for this concurrency control protocol:

1. Read phase: A transaction can read values of committed data items from the database.
However, updates are applied only to local copies (versions) of the data items kept in the
transaction workspace.
2. Validation phase: Checking is performed to ensure that serializability will not be violated if
the transaction updates are applied to the database.
3. Write phase: If the validation phase is successful, the transaction updates are applied to the
database; otherwise, the updates are discarded and the transaction is restarted.

The idea behind optimistic concurrency control is to do all the checks at once; hence, transaction
execution proceeds with a minimum of overhead until the validation phase is reached. If there is little
interference among transactions, most will be validated successfully. However, if there is much
interference, many transactions that execute to completion will have their results discarded and must
be restarted later. Under these circumstances, optimistic techniques do not work well. The
techniques are called "optimistic" because they assume that little interference will occur and hence
that there is no need to do checking during transaction execution.

The optimistic protocol we describe uses transaction timestamps and also requires that the write_sets
and read_sets of the transactions be kept by the system. In addition, start and end times for some of
the three phases need to be kept for each transaction. Recall that the write_set of a transaction is
the set of items it writes, and the read_set is the set of items it reads. In the validation phase for
transaction Ti, the protocol checks that Ti does not interfere with any committed transactions or with
any other transactions currently in their validation phase. The validation phase for Ti checks that, for
each such transaction Tj that is either committed or is in its validation phase, one of the following
conditions holds:

1. Transaction Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items
in common with the write_set of Tj.
3. Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj
completes its read phase before Ti completes its read phase.

When validating transaction Ti, the first condition is checked first for each transaction Tj, since (1) is
the simplest condition to check. Only if condition (1) is false is condition (2) checked, and only if (2)
is false is condition (3)—the most complex to evaluate—checked. If any one of these three conditions
holds, there is no interference and Ti is validated successfully. If none of these three conditions holds,
the validation of transaction Ti fails and it is aborted and restarted later because interference may
have occurred.
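A hedged Python sketch of this validation test is shown below. The field names read_start, read_end, write_start and write_end (phase boundary times) and the read_set/write_set attributes (Python sets) are assumptions made for illustration, not a prescribed interface.

    def validate(Ti, others):
        """Return True if Ti passes the validation test against every transaction Tj
        that is committed or currently in its validation phase."""
        for Tj in others:
            if Tj.write_end <= Ti.read_start:
                continue                                        # condition 1
            if Tj.write_end <= Ti.write_start and not (Ti.read_set & Tj.write_set):
                continue                                        # condition 2
            if (not (Ti.read_set & Tj.write_set)
                    and not (Ti.write_set & Tj.write_set)
                    and Tj.read_end <= Ti.read_end):
                continue                                        # condition 3
            return False                                        # possible interference: abort Ti
        return True                                             # Ti validated successfully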

Recovery

Failure Classification

 Transaction failure :
 Logical errors: transaction cannot complete due to some internal error condition
 System errors: the database system must terminate an active transaction due to an
error condition (e.g., deadlock)
 System crash
 a power failure or other hardware or software failure causes the system to crash.
 Fail-stop assumption: non-volatile storage contents are assumed to not have been
corrupted by system crash
 Disk failure
 a head crash or similar disk failure destroys all or part of disk storage

Recovery Algorithms

 techniques to ensure database consistency and transaction atomicity and durability


despite failures
 Focus of this chapter
 Recovery algorithms have two parts
 Actions taken during normal transaction processing

=> to ensure enough information exists to recover from failures


 Actions taken after a failure

=> to recover the database contents to a state that ensures atomicity, consistency
and durability

Storage Structure

 Volatile storage:
 does not survive system crashes
 examples: main memory, cache memory
 Nonvolatile storage:
 survives system crashes
 examples: disk, tape, flash memory,
non-volatile (battery backed up) RAM
 Stable storage:
 a theoretical form of storage that survives all failures
 approximated by maintaining multiple copies on distinct nonvolatile media

Data Access

 Physical blocks are those blocks residing on the disk.


 Buffer blocks are the blocks residing temporarily in main memory (disk buffer).
 Block movements between disk and main memory are initiated through:
 input(B): transfers the physical block B to main memory.
 output(B): transfers the buffer block B to the disk, and replaces the appropriate
physical block there.
 Each transaction Ti has its private work-area
 Ti manipulates its local copies of data items.
 Ti's local copy of a data item X is called xi.

Data Access (Cont.)

 Transaction transfers data items between system buffer blocks and its private work-
area using the following operations :
 read(X)
 If BX in which X resides is not in memory, issue input(BX)
 assign to the local variable xi the value of X from the buffer block
 write(X)
 If BX in which X resides is not in memory, issue input(BX)
 assign the value xi to X in the buffer block.
 We assume that no data item spans two or more blocks.
 Transactions
 Perform read(X) while accessing X for the first time;
 All subsequent accesses are to the local copy xi.
 After last access, transaction executes write(X) if updated.
 input(BX) and output(BX) are not issued by individual transactions. The system (buffer
manager) is responsible for these operations.


Example of Data Access


(Figure: buffer blocks A and B reside in main memory and on disk; input(A) brings block A into the buffer and output(B) writes block B back to disk. Each transaction has a private work area: T1 holds local copies x1 and y1, T2 holds x2; read(X) fills a local copy from the buffer block and write(Y) copies it back.)


Recovery and Atomicity

 Modifying the database without ensuring that the transaction will commit may leave
the database in an inconsistent state.
 A failure may occur after some but not all modifications have been made.
 To ensure atomicity despite failures
 keep (maintain) certain information describing the modifications on stable storage
without modifying the database itself.
 Log-based recovery
 Shadow-paging

Log-Based Recovery

 A log is a sequence of log records that describe update activities on the database.
 Log is kept on stable storage.
 Assume that transactions execute serially
 Assume that log records are written directly to stable storage (they are not
buffered)
 Logging
 When transaction Ti starts
 Write a log record <Ti start>
 Before Ti executes write(X)
 Write a log record <Ti, X, V1, V2>

V1: the value of X before the write

V2: the value to be written to X.


 When Ti finishes its last statement
 Write the log record <Ti commit>

Deferred Database Modification

 Records all modifications in the log, but defers all writes to the database until after partial commit.
 Transaction start: write <Ti start> record to log.
 A write(X) operation results in
 a log record <Ti, X, V> being written, where V is the new value for X (old value is
not needed for this scheme)
 The write is not performed on X at this time, but is deferred.
 When Ti partially commits, <Ti commit> is written to the log (Ti enters commit state)
 Finally, the log records are read and used to actually execute the previously deferred
writes.
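A minimal sketch of deferred database modification follows, assuming the log is an append-only list standing in for stable storage, each transaction carries an id and a deferred dictionary of pending writes, and the database is a dictionary; all of these names are illustrative only.

    def start(T, log):
        log.append(("start", T.id))                 # <Ti start>

    def write_deferred(T, X, new_value, log):
        log.append(("write", T.id, X, new_value))   # <Ti, X, V>: only the new value is logged
        T.deferred[X] = new_value                   # the database itself is not touched yet

    def commit(T, log, db):
        log.append(("commit", T.id))                # <Ti commit>: Ti partially commits
        for X, value in T.deferred.items():
            db[X] = value                           # now execute the previously deferred writes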

Deferred Database Modification Recovery after a crash

 a transaction needs to be redone if and only if both <Ti start> and <Ti commit> are in
the log.
 Redoing a transaction Ti (redo Ti) sets the value of all data items updated by the
transaction to the new values.
 Example (T0 executes before T1):

T0: read(A)                 T1: read(C)
    A := A - 50                 C := C - 100
    write(A)                    write(C)
    read(B)
    B := B + 50
    write(B)

 If the log (on stable storage) at the time of the crash is as in case:

(a) only T0's <start> and write records are present: no redo actions need to be taken

(b) <T0 commit> is present but T1 has not committed: redo(T0) must be performed

(c) <T0 commit> and <T1 commit> are both present: redo(T0) must be performed followed by redo(T1)


 Crashes can also occur while recovery actions are being taken

Immediate Database Modification

 Allows database updates of uncommitted transactions


 Transaction start: <Ti start> written to log.
 A write(X) operation results in
 a log record <Ti, X, V1, V2> being written to log (undoing may be needed)
 followed by the write operation
 When Ti partially commits, <Ti commit> is written to the log
 Output of updated blocks can take place at any time before or after transaction
commit

 Example

Log                        Write        Output

<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                           A = 950
                           B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                           C = 600
                                        BB, BC
<T1 commit>
                                        BA

(BX denotes block containing X)

Recovery procedure has two operations:

 undo(Ti) restores the value of all data items updated by Ti to their old values, going
backwards from the last log record for Ti
 redo(Ti) sets the value of all data items updated by Ti to the new values, going
forward from the first log record for Ti

When recovering after failure:

 Undo Ti if the log contains the record <Ti start>, but does not contain the record <Ti
commit>.
 Redo Ti if the log contains both the record <Ti start> and the record <Ti commit>.
 (Undo operations are performed before redo operations)

Both operations must be idempotent

 That is, even if the operation is executed multiple times the effect is the same as if it is
executed once
 Needed since operations may get re-executed during recovery
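The recovery procedure for immediate modification can be sketched as below, assuming log records are tuples of the form ("start", T), ("write", T, X, old_value, new_value) and ("commit", T); these record shapes are an assumption of the sketch. Note that both passes are idempotent, so re-running recover after a crash during recovery is safe.

    def recover(log, db):
        started   = {rec[1] for rec in log if rec[0] == "start"}
        committed = {rec[1] for rec in log if rec[0] == "commit"}

        # undo(Ti): backwards pass restoring old values of transactions without a commit record
        for rec in reversed(log):
            if rec[0] == "write" and rec[1] in started - committed:
                _, T, X, old_value, new_value = rec
                db[X] = old_value

        # redo(Ti): forwards pass reapplying new values of committed transactions
        for rec in log:
            if rec[0] == "write" and rec[1] in committed:
                _, T, X, old_value, new_value = rec
                db[X] = new_value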

The log as it appears at three instants of time.

If the log on stable storage at the time of the crash is as in case:

(a) T0 has not committed: undo(T0) is performed, so B is restored to 2000 and A to 1000.

(b) T0 has committed but T1 has not: undo(T1) and redo(T0) are performed, so C is restored to 700,
and then A and B are set to 950 and 2050 respectively.

(c) both T0 and T1 have committed: redo(T0) and redo(T1) are performed, so A and B are set to 950 and
2050 respectively; then C is set to 600.

Checkpoints

 Problems in recovery procedure as discussed earlier :


 searching the entire log is time-consuming
 we might unnecessarily redo transactions which have already output their updates
to the database.
 Reduce recovery overhead by periodically performing checkpoints
 Output all log records currently residing in main memory onto stable storage.
 Output all modified buffer blocks to the disk.
 Write a log record < checkpoint> onto stable storage.

Checkpoints (Cont.)

 During recovery we need to consider only


 the most recent transaction Ti that started before the checkpoint
 and transactions that started after Ti.
 Recovery procedure
 Scan backwards from end of log to find the most recent <checkpoint> record
 Continue scanning backwards till a record <Ti start> is found.

 Need only consider the part of log following above start record. Earlier part of log
can be ignored during recovery, and can be erased whenever desired.
 For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti).
(in case of immediate modification)
 Scanning forward in the log, for all transactions (starting from Ti or later) with a <Ti
commit>, execute redo(Ti).
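A simplified sketch of checkpoint-based recovery follows, using the same illustrative log-record tuples as the earlier recovery sketch and assuming, for brevity, that no transaction is active across the checkpoint, so only the part of the log after the most recent <checkpoint> record has to be scanned; the full procedure above additionally scans back to the start record of the most recent transaction that began before the checkpoint.

    def recover_with_checkpoint(log, db):
        # find the most recent <checkpoint> record; -1 means no checkpoint was ever taken
        last_cp = max((i for i, rec in enumerate(log) if rec[0] == "checkpoint"), default=-1)
        tail = log[last_cp + 1:]                      # only this part of the log is considered

        committed = {rec[1] for rec in tail if rec[0] == "commit"}
        active    = {rec[1] for rec in tail if rec[0] == "start"} - committed

        for rec in reversed(tail):                    # undo uncommitted transactions
            if rec[0] == "write" and rec[1] in active:
                _, T, X, old_value, new_value = rec
                db[X] = old_value
        for rec in tail:                              # redo committed transactions
            if rec[0] == "write" and rec[1] in committed:
                _, T, X, old_value, new_value = rec
                db[X] = new_value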

Example of Checkpoints

(Figure: a timeline with checkpoint time Tc and system-failure time Tf. T1 completes before the checkpoint; T2 and T3 commit after the checkpoint but before the failure; T4 is still active when the failure occurs.)


 T1 can be ignored (updates already output to disk due to checkpoint)
 T2 and T3 redone.
 T4 undone


Introduction to RDBMS, OODBMS and ORDBMS - Object-Relational Database



Object-Relational database (ORDBMS) is the third type of database common today. ORDBMS are
systems that “attempt to extend relational database systems with the functionality necessary to
support a broader class of applications and, in many ways, provide a bridge between the relational
and object-oriented paradigms.”

ORDBMS was created to handle new types of data such as audio, video, and image files that
relational databases were not equipped to handle. In addition, its development was the result of
increased usage of object-oriented programming languages, and a large mismatch between these
and the DBMS software.

One advantage of ORDBMS is that it allows organizations to continue using their existing systems,
without having to make major changes. A second advantage is that it allows users and programmers
to start using object-oriented systems in parallel.

There are challenges in implementing an ORDBMS. The first is storage and access methods. The
second is query processing, and the third is query optimization.

Since the development of RDBMS, OODBMS, and ORDBMS, many vendors have extended their
systems with the ability to store new data types such as images and texts, and with the ability to ask
more complex queries.

One rising trend is enterprise resource planning and management resource planning software, which adds
another layer of application-oriented features on top of a DBMS. Included applications come
from Baan, Oracle, SAP, and Siebel. These programs each identify a set of common tasks
encountered by a large number of organizations and provide a general application layer to carry out
these tasks.

More importantly, DBMS have advanced into the Internet and Web Age. Stored data is widely being
accessed through a Web browser. Today, queries are being generated through Web-accessible forms
and answers are being formatted using a mark-up language such as HTML. In addition, many
vendors and distributors are adding features to their DBMS aimed at making it better equipped for
Internet usage.

In summary, relational and object-oriented database systems each have certain strengths as well as
certain weaknesses. In general, the weakness of one type of system tends to be strength of the
other.

Data Mining: What is Data Mining?


Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from
different perspectives and summarizing it into useful information - information that can be used to increase
revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens
of fields in large relational databases.

Continuous Innovation
Although data mining is a relatively new term, the technology is not. Companies have used powerful computers
to sift through volumes of supermarket scanner data and analyze market research reports for years. However,
continuous innovations in computer processing power, disk storage, and statistical software are dramatically
increasing the accuracy of analysis while driving down the cost.

Example
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local
buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended
to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on
Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased
the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered
information in various ways to increase revenue. For example, they could move the beer display closer to the
diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.

Data, Information, and Knowledge

Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are
accumulating vast and growing amounts of data in different formats and different databases. This includes:

 operational or transactional data such as, sales, cost, inventory, payroll, and accounting

 nonoperational data, such as industry sales, forecast data, and macro economic data

 meta data - data about the data itself, such as logical database design or data dictionary definitions

Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis
of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example, summary
information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of
consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible
to promotional efforts.

Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling
organizations to integrate their various databases into data warehouses. Data warehousing is defined as a
process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new
term although the concept itself has been around for years. Data warehousing represents an ideal vision of
maintaining a central repository of all organizational data. Centralization of data is needed to maximize user
access and analysis. Dramatic technological advances are making this vision a reality for many companies. And,
equally dramatic advances in data analysis software are allowing users to access this data freely. The data
analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus - retail, financial,
communication, and marketing organizations. It enables these companies to determine relationships among
"internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic
indicators, competition, and customer demographics. And, it enables them to determine the impact on sales,
customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information
to view detail transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions
based on an individual's purchase history. By mining demographic data from comment or warranty cards, the
retailer could develop products and promotions to appeal to specific customer segments.

For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to
individual customers. American Express can suggest products to its cardholders based on analysis of their
monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-
sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5
terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products
and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display
level. They use this information to manage local store inventory and identify new merchandising opportunities.
In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in
conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements
of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of
the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that
when Mark Price played the Guard position, John Williams attempted four jump shots and made each one!
Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably
from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump
shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage.
Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds
Williams for an open jump shot.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data
mining provides the link between the two. Data mining software analyzes relationships and patterns in stored
transaction data based on open-ended user queries. Several types of analytical software are available: statistical,
machine learning, and neural networks. Generally, any of four types of relationships are sought:

 Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain
could mine customer purchase data to determine when customers visit and what they typically order.
This information could be used to increase traffic by having daily specials.

 Clusters: Data items are grouped according to logical relationships or consumer preferences. For
example, data can be mined to identify market segments or consumer affinities.

 Associations: Data can be mined to identify associations. The beer-diaper example is an example of
associative mining.

 Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor
equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

 Extract, transform, and load transaction data onto the data warehouse system.

 Store and manage the data in a multidimensional database system.

 Provide data access to business analysts and information technology professionals.

 Analyze the data by application software.

 Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

 Artificial neural networks: Non-linear predictive models that learn through training and resemble
biological neural networks in structure.

 Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation,
and natural selection in a design based on the concepts of natural evolution.

 Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules
for the classification of a dataset. Specific decision tree methods include Classification and Regression
Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) . CART and CHAID are
decision tree techniques used for classification of a dataset. They provide a set of rules that you can
apply to a new (unclassified) dataset to predict which records will have a given outcome. CART
segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create
multi-way splits. CART typically requires less data preparation than CHAID.

 Nearest neighbor method: A technique that classifies each record in a dataset based on a combination
of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called
the k-nearest neighbor technique; a minimal sketch appears after this list.

 Rule induction: The extraction of useful if-then rules from data based on statistical significance.

 Data visualization: The visual interpretation of complex relationships in multidimensional data.


Graphics tools are used to illustrate data relationships.
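As promised in the nearest neighbor item above, here is a minimal Python sketch of k-nearest-neighbor classification. Records are assumed to be tuples of numeric features and history a list of (features, label) pairs; this is an illustrative simplification, not a production implementation.

    from collections import Counter
    from math import dist

    def knn_classify(record, history, k=3):
        """Classify a record by majority vote among its k most similar historical records."""
        nearest = sorted(history, key=lambda item: dist(record, item[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # e.g. knn_classify((5.1, 3.5), [((5.0, 3.4), "A"), ((6.2, 2.8), "B"), ((5.2, 3.6), "A")]) returns "A"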

What technological infrastructure is required?

Today, data mining applications are available on all size systems for mainframe, client/server, and PC
platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a
terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11
terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical
technological drivers:

 Size of the database: the more data being processed and maintained, the more powerful the system
required.

 Query complexity: the more complex the queries and the greater the number of queries being
processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications less
than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger
applications. Some vendors have added extensive indexing capabilities to improve query performance. Others
use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude
improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium
processors to achieve performance levels exceeding those of the largest supercomputers.

Data Mining is the process of extracting knowledge hidden in large volumes of raw data.

The importance of collecting data that reflect your business or scientific activities to achieve competitive
advantage is widely recognized now. Powerful systems for collecting data and managing it in large databases
are in place in all large and mid-range companies. However, the bottleneck of turning this data into your success
is the difficulty of extracting knowledge about the system you study from the collected data.

Human analysts with no special tools can no longer make sense of enormous volumes of data that require
processing in order to make informed business decisions. Data mining automates the process of finding
relationships and patterns in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.

 What goods should be promoted to this customer?


 What is the probability that a certain customer will respond to a planned promotion?
 Can one predict the most profitable securities to buy/sell during the next trading session?
 Will this customer default on a loan or pay back on schedule?
 What medical diagnosis should be assigned to this patient?
 How large are the peak loads of a telephone or energy network going to be?
 Why does the facility suddenly start to produce defective goods?

These are all the questions that can probably be answered if information hidden among megabytes of data in
your database can be found explicitly and utilized. Modeling the investigated system and discovering the
relations that connect variables in a database are the subject of data mining.

Modern computer data mining systems self learn from the previous history of the investigated system,
formulating and testing hypotheses about the rules which this system obeys. When concise and valuable
knowledge about the system of interest has been discovered, it can and should be incorporated into some
decision support system which helps the manager to make wise and informed business decisions.

Why use data mining?

Data might be one of the most valuable assets of your corporation - but only if you know how to reveal valuable
knowledge hidden in raw data. Data mining allows you to extract diamonds of knowledge from your historical
data and predict outcomes of future situations. It will help you optimize your business decisions, increase the
value of each customer and communication, and improve customer satisfaction with your services.

Data that require analysis differ for companies in different industries. Examples include:

 Sales and contacts histories


 Call support data

 Demographic data on your customers and prospects
 Patient diagnoses and prescribed drugs data
 Clickstream and transactional data from your website

In all these cases data mining can help you reveal knowledge hidden in data and turn this knowledge into a
crucial competitive advantage. Today increasingly more companies acknowledge the value of this new
opportunity and turn to Megaputer for leading-edge data mining tools and solutions that help optimize their
operations and increase their bottom line.

What can Data Mining do for you?

 Identify your best prospects and then retain them as customers.


By concentrating your marketing efforts only on your best prospects you will save time and money, thus
increasing effectiveness of your marketing operation.
 Predict cross-sell opportunities and make recommendations.
Whether you have a traditional or web-based operation, you can help your customers quickly locate
products of interest to them - and simultaneously increase the value of each communication with your
customers.
 Learn parameters influencing trends in sales and margins.
You think you could do this with your OLAP tools? True, OLAP can help you prove a hypothesis - but
only if you know what questions to ask in the first place. In the majority of cases you have no clue on
what combination of parameters influences your operation. In these situations data mining is your only
real option.
 Segment markets and personalize communications.
There might be distinct groups of customers, patients, or natural phenomena that require different
approaches in their handling. If you have a broad customer range, you would need to address teenagers
in California and married homeowners in Minnesota with different products and messages in order to
optimize your marketing campaign.

Reasons for the growing popularity of Data Mining

Growing Data Volume

The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous
volume of existing and newly appearing data that require processing. The amount of data accumulated each day
by various business, scientific, and governmental organizations around the world is daunting. According to
information from GTE research center, only scientific organizations store each day about 1 TB (terabyte!) of
new information. And it is well known that the academic world is by far not the leading supplier of new data. It
becomes impossible for human analysts to cope with such overwhelming amounts of data.

Limitations of Human Analysis

Two other problems that surface when human analysts process data are the inadequacy of the human brain
when searching for complex multifactor dependencies in data, and the lack of objectivity in such an analysis.
A human expert is always a hostage of the previous experience of investigating other systems. Sometimes this
helps, sometimes this hurts, but it is almost impossible to get rid of this fact.
Low Cost of Machine Learning

One additional benefit of using automated data mining systems is that this process has a much lower cost than
hiring an army of highly trained (and paid) professional statisticians. While data mining does not eliminate
human participation in solving the task completely, it significantly simplifies the job and allows an analyst who
is not a professional in statistics and programming to manage the process of extracting knowledge from data.

Tasks Solved by Data Mining

Predicting

A task of learning a pattern from examples and using the developed model to predict future values of the target
variable.

The following PolyAnalyst engines help solve this task:

PolyNet Predictor, Find Laws, Memory Based Reasoning, and Linear Regression

Classification

A task of finding a function that maps records into one of several discrete classes.

The following PolyAnalyst engines help solve this task:

Memory Based Reasoning, Classify and Discriminate

Detection of relations

A task of searching for the most influential independent variables for a selected target variable.

The Find Dependencies engine helps solve this task.

Explicit modeling

A task of finding explicit formulae describing dependencies between various variables.

The PolyAnalyst Find Laws engine helps solve this task.

Clustering

A task of identifying groups of records that are similar between themselves but different from the rest of the
data. Often, the variables providing the best clustering should be identified as well.

The PolyAnalyst Cluster engine helps solve this task.

Market Basket Analysis

Processing transactional data in order to find those groups of products that sell well together. One also
searches for directed association rules identifying the best product to be offered with a current selection of
purchased products.

The PolyAnalyst Market Basket Analysis engine helps solve this task.

Deviation Detection

A task of determining the most significant changes in some key measures of data from previous or expected
values.

Different DM Technologies and Systems

It would be very instructive to discuss various existing approaches to data mining while stressing the
following three vital criteria:

 Control of the significance of obtained results


 Transparency of the developed empirical models and their interpretability
 Degree of automation of the search process and ease of use

To build a bridge from more traditional methods for data analysis to data mining methods we start by discussing
some more traditional approaches:

 Subject-oriented analytical systems


 Statistical packages

And then proceed to consider Data Mining methods:

 Neural Networks
 Evolutionary Programming
 Memory Based Reasoning
 Decision Trees
 Genetic Algorithms
 Nonlinear Regression Methods

Data Warehousing Concepts


This chapter provides an overview of the Oracle data warehousing implementation. It includes:

 What is a Data Warehouse?


 Data Warehouse Architectures

Note that this book is meant as a supplement to standard texts about data warehousing. This book focuses on
Oracle-specific material and does not reproduce in detail material of a general nature. Two standard texts are:

 The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996)
 Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction
processing. It usually contains historical data derived from transaction data, but it can include data from other
sources. It separates analysis workload from transaction workload and enables an organization to consolidate
data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation,


transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis
tools, and other applications that manage the process of gathering data and delivering it to business users.


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set
forth by William Inmon:

 Subject Oriented
 Integrated
 Nonvolatile
 Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales
data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like
"Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter,
sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a
consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of
measure. When they achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the
purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to
online transaction processing (OLTP) systems, where performance requirements demand that historical data

be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time
variant.

Contrasting OLTP and Data Warehousing Environments

Figure 1-1 illustrates key differences between an OLTP system and a data warehouse.

Figure 1-1 Contrasting OLTP and Data Warehousing Environments


One major difference between the types of system is that data warehouses are not usually in third normal form
(3NF), a type of data normalization common in OLTP environments.

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences
between typical data warehouses and OLTP systems:

 Workload

Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of
your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide
variety of possible query operations.

OLTP systems support only predefined operations. Your applications might be specifically tuned or
designed to support only these operations.

 Data modifications

A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk
data modification techniques. The end users of a data warehouse do not directly update the data
warehouse.

In OLTP systems, end users routinely issue individual data modification statements to the database. The
OLTP database is always up to date, and reflects the current state of each business transaction.

Schema design
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to
optimize query performance.

OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to
guarantee data consistency.

 Typical operations

A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales
for all customers last month."

A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order
for this customer."

 Historical data

Data warehouses usually store many months or years of data. This is to support historical analysis.

OLTP systems usually store data from only a few weeks or months. The OLTP system stores only
historical data as needed to successfully meet the requirements of the current transaction.

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three
common architectures are:

 Data Warehouse Architecture (Basic)


 Data Warehouse Architecture (with a Staging Area)
 Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from
several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse


In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of
data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations
in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary
in Oracle is called a materialized view.

Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can
do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies
building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area



Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's
architecture for different groups within your organization. You can do this by adding data marts, which are
systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales,
and inventories are separated. In this example, a financial analyst might want to analyze historical data for
purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

This illustrates five things:

 Data Sources (operational systems and flat files)


 Staging Area (where data sources go before the warehouse)
 Warehouse (metadata, summary data, and raw data)
 Data Marts (purchasing, sales, and inventory)
 Users (analysis, reporting, and mining)

Operational vs. Informational Systems

Perhaps the most important concept that has come out of the Data Warehouse movement is the recognition that there are two
fundamentally different types of information systems in all organizations: operational systems and informational systems.

"Operational systems" are just what their name implies; they are the systems that help us run the enterprise operation day-to-day.
These are the backbone systems of any enterprise, our "order entry", "inventory", "manufacturing", "payroll" and "accounting"
systems. Because of their importance to the organization, operational systems were almost always the first parts of the enterprise to be
computerized. Over the years, these operational systems have been extended and rewritten, enhanced and maintained to the point that

they are completely integrated into the organization. Indeed, most large organizations around the world today couldn't operate without
their operational systems and the data that these systems maintain.

On the other hand, there are other functions that go on within the enterprise that have to do with planning, forecasting and managing
the organization. These functions are also critical to the survival of the organization, especially in our current fast-paced world.
Functions like "marketing planning", "engineering planning" and "financial analysis" also require information systems to support
them. But these functions are different from operational ones, and the types of systems and information required are also different.
The knowledge-based functions are informational systems.

"Informational systems" have to do with analyzing data and making decisions, often major decisions, about how the enterprise will
operate, now and in the future. And not only do informational systems have a different focus from operational ones, they often have a
different scope. Where operational data needs are normally focused upon a single area, informational data needs often span a number
of different areas and need large amounts of related operational data.

In the last few years, Data Warehousing has grown rapidly from a set of related ideas into an architecture for data delivery for
enterprise end-user computing.

2) Understanding the Framework of the Data Warehouse

One of the reasons that data warehousing has taken such a long time to develop is that it is actually a very comprehensive technology.
In fact, data warehousing can be best represented as an enterprise-wide framework for managing informational data within the
organization. In order to understand how all the components involved in a data warehousing strategy are related, it is essential to have
a Data Warehouse Architecture.

2.1) A Data Warehouse Architecture

A Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and
presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts:

 Operational Database / External Database Layer


 Information Access Layer
 Data Access Layer
 Data Directory (Metadata) Layer
 Process Management Layer
 Application Messaging Layer
 Data Warehouse Layer
 Data Staging Layer


Figure 1 - Data Warehouse Architecture

2.2) Operational Database / External Database Layer

Operational systems process data to support critical operational needs. In order to do that, operational databases have been historically
created to provide an efficient processing structure for a relatively small number of well-defined business transactions. However,
because of the limited focus of operational systems, the databases designed to support operational systems have difficulty accessing
the data for other management or informational purposes. This difficulty in accessing operational data is amplified by the fact that
many operational systems are often 10 to 15 years old. The age of some of these systems means that the data access technology
available to obtain operational data is itself dated.

Clearly, the goal of data warehousing is to free the information that is locked up in the operational databases and to mix it with
information from other, often external, sources of data. Increasingly, large organizations are acquiring additional data from outside
databases. This information includes demographic, econometric, competitive and purchasing trends. The so-called "information
superhighway" is providing access to more data resources every day.

2.3) Information Access Layer

The Information Access layer of the Data Warehouse Architecture is the layer that the end-user deals with directly. In particular, it
represents the tools that the end-user normally uses day to day, e.g., Excel, Lotus 1-2-3, Focus, Access, SAS, etc. This layer also
includes the hardware and software involved in displaying and printing reports, spreadsheets, graphs and charts for analysis and
presentation. Over the past two decades, the Information Access layer has expanded enormously, especially as end-users have moved
to PCs and PC/LANs.

Today, more and more sophisticated tools exist on the desktop for manipulating, analyzing and presenting data; however, there are
significant problems in making the raw data contained in operational systems available easily and seamlessly to end-user tools. One of
the keys to this is to find a common data language that can be used throughout the enterprise.

2.4) Data Access Layer

The Data Access Layer of the Data Warehouse Architecture is involved with allowing the Information Access Layer to talk to the
Operational Layer. In the network world today, the common data language that has emerged is SQL. Originally, SQL was developed
by IBM as a query language, but over the last twenty years has become the de facto standard for data interchange.

One of the key breakthroughs of the last few years has been the development of a series of data access "filters" such as EDA/SQL that
make it possible for  SQL to access nearly all DBMSs and data file systems, relational or nonrelational. These filters make it possible
for state-of-the-art Information Access tools to access data stored on database management systems that are twenty years old.

The Data Access Layer not only spans different DBMSs and file systems on the same hardware, it spans manufacturers and network
protocols as well. One of the keys to a Data Warehousing strategy is to provide end-users with "universal data access". Universal data
access means that, theoretically at least, end-users, regardless of location or Information Access tool, should be able to access any or
all of the data in the enterprise that is necessary for them to do their job.

The Data Access Layer then is responsible for interfacing between Information Access tools and Operational Databases. In some
cases, this is all that certain end-users need. However, in general, organizations are developing a much more sophisticated scheme to
support Data Warehousing.

2.5) Data Directory (Metadata) Layer

In order to provide for universal data access, it is absolutely necessary to maintain some form of data directory or repository of meta-
data information. Meta-data is the data about data within the enterprise. Record descriptions in a COBOL program are meta-data. So
are DIMENSION statements in a FORTRAN program, or SQL Create statements. The information in an ERA diagram is also meta-
data.

In order to have a fully functional warehouse, it is necessary to have a variety of meta-data available, data about the end-user views of
data and data about the operational databases. Ideally, end-users should be able to access data from the data warehouse (or from the
operational databases) without having to know where that data resides or the form in which it is stored.

2.6) Process Management Layer

The Process Management Layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data
warehouse and data directory information. The Process Management Layer can be thought of as the scheduler or the high-level job
control for the many processes (procedures) that must occur to keep the Data Warehouse up-to-date.

2.7) Application Messaging Layer

The Application Message Layer has to do with transporting information around the enterprise computing network. Application
Messaging is also referred to as "middleware", but it can involve more than just networking protocols. Application Messaging for
example can be used to isolate applications, operational or informational, from the exact data format on either end. Application
Messaging can also be used to collect transactions or messages and deliver them to a certain location at a certain time. Application
Messaging is the transport system underlying the Data Warehouse.

2.8) Data Warehouse (Physical) Layer

The (core) Data Warehouse is where the actual data used primarily for informational uses occurs. In some cases, one can think of the
Data Warehouse simply as a logical or virtual view of data. In many instances, the data warehouse may not actually involve storing
data.

In a Physical Data Warehouse, copies, in some cases  many copies, of operational and or external data are actually stored in a form
that is easy to access and is highly flexible. Increasingly, Data Warehouses are stored on client/server platforms, but they are often
stored on main frames as well.

2.9) Data Staging Layer

The final component of the Data Warehouse Architecture is Data Staging. Data Staging is also called copy management or replication
management, but in fact, it includes all of the processes necessary to select, edit, summarize, combine and load data warehouse and
information access data from operational and/or external databases.

Data Staging often involves complex programming, but increasingly data warehousing tools are being created that help in this process.
Data Staging may also involve data quality analysis programs and filters that identify patterns and data structures within existing
operational data.

3) Data Warehouse Options

There are perhaps as many ways to develop data warehouses as there are organizations. Moreover, there are a number of different
dimensions that need to be considered:

 Scope of the data warehouse


 Data redundancy
 Type of end-user

Figure 2 shows a two-dimensional grid for analyzing the basic options, with the horizontal dimension indicating the scope of the
warehouse, and the vertical dimension showing the amount of redundant data that must be stored and maintained.

Figure 2 - Data Warehouse Options

3.1) Data Warehouse Scope

The scope of a data warehouse may be as broad as all the informational data for the entire enterprise from the beginning of time, or it
may be as narrow as a personal data warehouse for a single manager for a single year. There is nothing that makes one of these more
of a data warehouse than another.

In practice, the broader the scope, the more value the data warehouse is to the enterprise and the more expensive and time consuming
it is to create and maintain. As a consequence, most organizations seem to start out with functional, departmental or divisional data
warehouses and then expand them as users provide feedback.

3.2) Data Redundancy

There are essentially three levels of data redundancy that enterprises should think about when considering their data warehouse
options:

 "Virtual" or "Point-to-Point" Data Warehouses


 Central Data Warehouses
 Distributed Data Warehouses

There is no one best approach. Each option fits a specific set of requirements, and a data warehousing strategy may ultimately include
all three options

3.2.1) "Virtual" or "Point-to-Point" Data Warehouses

A virtual or point-to-point data warehousing strategy means that end-users are allowed to get at operational databases directly using
whatever tools are enabled to the "data access network" . This approach provides the ultimate in flexibility as well as the minimum
amount of redundant data that must be loaded and maintained. This approach can also put the largest unplanned query load on
operational systems.

As we will see, virtual warehousing is often an initial strategy in organizations where there is a broad but largely undefined need to get
at operational data from a relatively large class of end-users and where the likely frequency of requests is low. Virtual data warehouses
often provide a starting point for organizations to learn what end-users are really looking for.

3.2.2) Central Data Warehouses

Central Data Warehouses are what most people think of when they first are introduced to the concept of data warehouse. The central
data warehouse is a single physical database that contains all of the data for a specific functional area, department, division, or
enterprise. Central Data Warehouses are often selected where there is a common need for informational data and there are large
numbers of end-users already connected to a central computer or network. A Central Data Warehouse may contain data for any
specific period of time. Usually,  Central Data Warehouses contain data from multiple operational systems.

Central Data Warehouses are real. The data stored in the data warehouse is accessible from one place and must be loaded and
maintained on a regular basis. Normally, data warehouses are built around advanced RDBMs or some form of multi-dimensional
informational database server.

3.2.3) Distributed Data Warehouses

Distributed Data Warehouses are just what their name implies. They are data warehouses in which certain components of the data
warehouse are distributed across a number of different physical databases. Increasingly, large organizations are pushing decision-
making down to lower and lower levels of the organization and in turn pushing the data needed for decision making down (or out) to
the LAN or local computer serving the local decision-maker.

Distributed Data Warehouses usually involve the most redundant data and, as a consequence, most complex loading and updating
processes.

3.3) Type of End-user

In the same sense that there are lots of different ways to organize a data warehouse, it is important to note that there are an
increasingly wide range of end-users as well. In general we tend to think in terms of three broad categories of end-users:

• Executives and managers
• "Power" users (business and financial analysts, engineers, etc.)
• Support users (clerical, administrative, etc.)

Each of these different categories of user has its own set of requirements for data, access, flexibility and ease of use.

4) Developing Data Warehouses

Developing a good data warehouse is no different from any other IT project; it requires careful planning, requirements definition,
design, prototyping and implementation. The first and most important element is a planning process that determines what kind of data
warehouse strategy the organization is going to start with.

4.1) Developing a Data Warehouse Strategy

Before developing a data warehouse, it is critical to develop a balanced data warehousing strategy that is appropriate for the
organization's needs and its user population. Who is the audience? What is the scope? What type of data warehouse should we build?

There are a number of strategies by which organizations can get into data warehousing. One way is to establish a "Virtual Data
Warehouse" environment. A Virtual Data Warehouse is created by: (1) installing a set of data access, data directory and process
management facilities, (2) training the end-users, (3) monitoring how the data warehouse facilities are actually used, and then (4) based
on actual usage, creating a physical data warehouse to support the high-frequency requests.

A second strategy is simply to build a copy of the operational data from a single operational system and enable the data warehouse
with a series of information access tools. This strategy has the advantage of being both simple and fast. Unfortunately, if the existing
data is of poor quality and/or access to the data has not been thought through, this approach can create a number of significant
problems.

Ultimately, the optimal data warehousing strategy is to select a user population based on value to the enterprise and do an analysis of
their issues, questions and data access needs. Based on these needs, prototype data warehouses are built and populated so the end-
users can experiment and modify their requirements. Once there is general agreement on the needs, then the data can be acquired from
existing operational systems across the enterprise and/or from external data sources and loaded into the data warehouse. If it is
required, information access tools can also be enabled to allow end-users to have access to required data using their own favorite tools
or to allow for the creation of high-performance multi-dimensional information access systems using the core data warehouse as the
basis.

In the final analysis, there is no one approach to building a data warehouse that will fit the needs of every enterprise. Each enterprise's
needs are different, as is each enterprise's context. In addition, since data warehouse technology is evolving as we learn more about
developing data warehouses, the only practical approach to data warehousing is an evolutionary one.

4.2) Evolving a Data Warehouse Architecture

The Data Warehouse Architecture in Figure 1 is simply a framework for understanding data warehousing and how the components of
data warehousing fit together. Only the most sophisticated organizations will be able to put together such an architecture the first time
out. What the Data Warehouse Architecture provides, then, is a kind of roadmap to design toward. Coupled with an understanding of
the options at hand, the Data Warehouse Architecture provides a useful way of determining whether the organization is moving toward
a reasonable data warehousing framework.

One of the keys to data warehousing is flexibility. It is critical to keep in mind that the more successful a data warehouse strategy is,
the more end-users are going to want to add to it.

4.3) Designing Data Warehouses

Designing data warehouses is very different from designing traditional operational systems. For one thing, data warehouse users
typically don't know nearly as much about their wants and needs as operational users do. Second, designing a data warehouse often
involves thinking in terms of much broader, and more difficult to define, business concepts than does designing an operational system.
In this respect, data warehousing is quite close to Business Process Reengineering (BPR). Finally, the ideal design strategy for a data
warehouse is often outside-in as opposed to top-down.

But while data warehouse design is different from what we have been used to, it is no less important. The fact that end-users have
difficulty defining what they need does not make defining at least a bare minimum of requirements any less necessary. In practice, data
warehouse designers find that they have to use every trick in the book to help their users "visualize" their requirements. In this respect,
robust working prototypes are essential.

5) Managing Data Warehouses

Data warehouses are not magic; they take a great deal of very hard work. In many cases data warehouse projects are viewed as a
stopgap measure to get users off our backs or to provide something for nothing. But data warehouses require careful management and
marketing. A data warehouse is a good investment only if end-users can actually get at vital information faster and more cheaply than
they can using current technology. As a consequence, management has to think seriously about how they want their warehouses to
perform and how they are going to get the word out to the end-user community. And management has to recognize that the
maintenance of the data warehouse structure is as critical as the maintenance of any other mission-critical application. In fact,
experience has shown that data warehouses quickly become one of the most used systems in any organization.

Management, especially IT management, must also understand that if they embark on a data warehousing program, they are going to
create new demands upon their operational systems: demands for better data, demands for consistent data, demands for different kinds
of data.

6) Future Developments

Data warehousing is such a new field that it is difficult to estimate which new developments are likely to affect it most. Clearly, the
development of parallel database servers with improved query engines is likely to be one of the most important. Parallel servers will
make it possible to access huge databases in much less time.

Another development is data warehouses that allow traditional numbers, text and multimedia to be mixed. The availability of
improved tools for data visualization (business intelligence) will allow users to see things that could never be seen before.

7) Conclusion

Data Warehousing is not a new phenomenon. All large organizations already have data warehouses, but they are just not managing
them. Over the next few years, the growth of data warehousing is going to be enormous with new products and technologies coming
out frequently. In order to get the most out of this period, it is going to be important that data warehouse planners and developers have
a clear idea of what they are looking for and then choose strategies and methods that will provide them with performance today and
flexibility for tomorrow.

What is a data warehouse?


A data warehouse involves several processes that require several technology components. Batch and transaction
processing data first has to be extracted from operational databases and then cleaned up to remove redundant
data, fill in blank and missing fields, and organize it into consistent formats. The data is then loaded into a
relational database. Business analysts can then dig into the data using data access and reporting software,
including On-Line Analytical Processing (OLAP) tools, statistical modeling tools, geographic information
systems (GIS) and data mining tools.
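
As a minimal illustration of the clean-up step, the sketch below normalizes dates to a single format and fills in blank fields before
loading; the field names, accepted date formats and default values are invented for illustration.

    # Minimal sketch of cleaning raw operational rows before loading.
    # The fields (sale_date, region, amount), the accepted date formats
    # and the defaults are hypothetical.
    from datetime import datetime

    def clean_row(raw):
        sale_date, region, amount = raw
        # Normalize dates to one ISO format so all sources agree.
        for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"):
            try:
                sale_date = datetime.strptime(sale_date, fmt).date().isoformat()
                break
            except ValueError:
                continue
        # Fill in blank or missing fields with explicit defaults.
        region = (region or "UNKNOWN").strip().upper()
        amount = float(amount) if amount not in (None, "") else 0.0
        return sale_date, region, amount

    # Two inconsistent raw rows become consistent records:
    print(clean_row(("15/01/2024", " west ", "120.50")))  # ('2024-01-15', 'WEST', 120.5)
    print(clean_row(("2024-01-16", None, None)))          # ('2024-01-16', 'UNKNOWN', 0.0)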

What are these different kinds of analyses?


They range from the most basic (query and reporting) to the more complex (OLAP and statistical analysis) to
the most complex (data mining). Basic queries and reports are usually performed by functional managers who
use pre-defined queries to look up such things as average monthly sales, total regional expenses and daily totals.
OLAP and multi-dimensional analysis tools are designed more for business analysts who need to look at data
across multiple dimensions. These tools let them drill down from summary data sets into the specific data
underlying the summaries. Statistical analysis tools provide summary information too and help determine the
degree of relationship between two factors, such as zip code and sales. Data mining tools analyze very large
data sets to highlight hidden patterns, such as what items grocery shoppers buy as a pair.
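
The sketch below contrasts a pre-defined summary query with a simple drill-down along an extra dimension, assuming the hypothetical
fact_sales table from the earlier load sketch; full OLAP tools offer far richer navigation, but the drill-down idea is the same.

    # Minimal sketch of a pre-defined report versus an OLAP-style drill-down.
    # Assumes the hypothetical warehouse.db / fact_sales table used above.
    import sqlite3

    wh = sqlite3.connect("warehouse.db")

    # Basic report: average monthly sales, the kind of pre-defined query
    # a functional manager might run.
    monthly = wh.execute(
        "SELECT substr(sale_date, 1, 7) AS month, AVG(amount) "
        "FROM fact_sales GROUP BY month").fetchall()

    # Drill-down: break the same measure out by region as well, moving
    # from the summary into the detail underlying it.
    by_region = wh.execute(
        "SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) "
        "FROM fact_sales GROUP BY month, region").fetchall()

    wh.close()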

What is a data warehouse used for?


Many things. Data warehouses are the basis for customer relationship management systems because they can be
used for consolidating customer data and identifying areas of customer satisfaction and frustration. Warehouses
are also used for fraud detection, product repositioning analysis, profit center discovery and corporate asset
management. For retailers, a data warehouse can help identify customer demographic characteristics, identify
shopping patterns, and improve direct mailing responses. For banks, it can assist in spotting credit card fraud,
help identify the most profitable customers, and highlight the most loyal customers. Telecommunications firms
use data warehousing to predict which customers are likeliest to switch and then target them with special
incentives to stay. Insurance companies use data warehousing for claims analysis to see which procedures are
claimed together and to identify patterns of risky customers. Manufacturers can use data warehousing to
compare costs of each of their product lines over the last several years, determine which factors produced
increases and see what effect these increases had on overall margins.

Is it hard to set up a data warehouse?
Setting up a data warehouse isn't easy. Just identifying where all a business's data comes from, how it gets
entered into a system and where it is all stored can be difficult, and setting up a data cleansing process is quite
complicated. It all depends on how large and complex the data collecting and storing operation is. Large data
warehousing projects take years and millions of dollars to implement.

Is there such a thing as a small data warehouse?


Yes. Some companies begin with a data mart, a scaled-down warehouse that focuses on just one functional or
departmental area, such as finance. Data marts can often be implemented in a couple of months and later be
linked together into a confederated warehouse.
