DBMS NOTES
A database is a collection of data together with:
definitions of
o field names
o data formats (text? binary? integer? etc.)
o record structures (fixed-length? pointers? field order, etc.)
o file structures (sequential? indexed? etc.)
rules for validating and manipulating the data
One must be able to change storage mechanisms and formats without having to modify all
application programs. For example:
method of representation of alphanumeric data (e.g., changing date format to avoid Y2000
problem)
method of representation of numeric data (e.g., integer vs. long integer or floating-point)
units (e.g., metric vs. furlongs)
record structure (example)
file structure (sequential, sorted, indexed, etc.)
1. To see why database management systems are necessary, let's look at a typical "file-
processing system" supported by a conventional operating system.
o Savings account and customer records are kept in permanent system files.
o Application programs are written to manipulate files to perform the following tasks:
Debit or credit an account.
Such file-processing systems suffer from data redundancy, inconsistency, and difficulty in
accessing and maintaining data; these problems and others led to the development of database
management systems.
DBMS stands for Database Management System which is a general term for a set of software
dedicated to controlling the storage of data.
RDBMS stands for Relational Database Management System. This is the most common form of DBMS.
Invented by E.F. Codd, the only way to view the data is as a set of tables. Because there can be
relationships between the tables, people often assume that is what the word "relational" means. Not
so. Codd was a mathematician and the word "relational" is a mathematical term from the science of
set theory. It means, roughly, "based on tables".
Data Abstraction
1. The major purpose of a database system is to provide users with an abstract view of the data.
The system hides certain details of how the data are stored and maintained.
There are three levels of abstraction: the physical level (how the data are actually stored), the
logical level (what data are stored and what relationships exist among them), and the view level
(the part of the database that is relevant to a particular user).
Fig. 1.1 illustrates the three levels.
Figure 1.1: The three levels of data abstraction
Data Independence
1. The ability to modify a scheme definition in one level without affecting a scheme definition in a
higher level is called data independence.
2. There are two kinds:
o Physical data independence
The ability to modify the physical scheme without causing application programs
to be rewritten
Modifications at this level are usually to improve performance
o Logical data independence
The ability to modify the conceptual scheme without causing application
programs to be rewritten
Usually done when logical structure of database is altered
3. Logical data independence is harder to achieve as the application programs are usually heavily
dependent on the logical structure of the data. An analogy is made to abstract data types in
programming languages.
Often referred to as the three-level model, this is where the design moves from a written
specification taken from the real-world requirements to a physically-implementable design for a
specific DBMS. The three levels commonly referred to are `Conceptual Design', `Data Model
Mapping', and `Physical Design'.
The specification is usually in the form of a written document containing customer requirements,
mock reports, screen drawings and the like, written by the client to indicate the requirements which
the final system is to have. Often such data has to be collected together from a variety of internal
sources to the company and then analyzed to see if the requirements are necessary, correct, and
efficient.
Once the Database requirements have been collated, the Conceptual Design phase takes the
requirements and produces a high-level data model of the database structure. In this module, we use
ER modelling to represent high-level data models, but there are other techniques. This model is
independent of the final DBMS which the database will be installed in.
Next, in the Data Model Mapping (logical design) phase, the high-level data model is converted into a
conceptual schema, which is specific to a particular DBMS class (e.g. relational). For a relational
system, such as Oracle, an appropriate conceptual schema would be a set of relations.
Finally, in the Physical Design phase the conceptual schema is converted into database internal
structures. This is specific to a particular DBMS product.
DBMS Architecture
1. Database systems are partitioned into modules for different functions. Some functions (e.g.
file systems) may be provided by the operating system.
2. Components include:
o File manager: manages allocation of disk space and the data structures used to represent
information on disk.
o Database manager: The interface between low-level data and application programs
and queries.
o Query processor translates statements in a query language into low-level instructions
the database manager understands. (May also attempt to find an equivalent but more
efficient form.)
o DML precompiler converts DML statements embedded in an application program to
normal procedure calls in a host language. The precompiler interacts with the query
processor.
o DDL compiler converts DDL statements to a set of tables containing metadata stored
in a data dictionary.
In addition, several data structures are required as part of the physical system implementation:
data files (which store the database itself), the data dictionary (which stores metadata about the
structure of the database), and indices (which provide fast access to data items).
Basics
is a design tool
is a graphical representation of the database system
provides a high-level conceptual data model
supports the user's perception of the data
is DBMS and hardware independent
has many variants
is composed of entities, attributes, and relationships
Entity
An entity is an object or concept about
which you want to store information.
Weak Entity
A weak entity is dependent on another
entity to exist and does not have its
own primary key.
Attributes
Attributes are the properties or
characteristics of an entity.
Key attribute
A key attribute is the unique,
distinguishing characteristic of the
entity. For example, an employee's
social security number might be the
employee's key attribute.
Multivalued attribute
A multivalued attribute can have more
than one value. For example, an
employee entity can have multiple skill
values.
Derived attribute
A derived attribute is based on another
attribute. For example, an employee's
monthly salary is based on the
employee's annual salary.
Relationships
Relationships illustrate how two entities
share information in the database
structure.
Weak relationship
To connect a weak entity with others,
you should use a weak relationship
notation.
Cardinality
Cardinality specifies how many instances
of an entity relate to one instance of
another entity.
Recursive relationship
In some cases, entities can be self-
linked. For example, employees can
supervise other employees.
Entities
An entity is any object in the system that we want to model and store information about
Individual objects are called entities
Groups of the same type of objects are called entity types or entity sets
Entities are represented by rectangles (either with round or square corners)
Figure: Entities
There are two types of entities: weak and strong entity types.
Attribute
Figure : Attributes
Keys
A key is a data item that allows us to uniquely identify individual occurrences of an entity type.
A candidate key is an attribute or set of attributes that uniquely identifies individual
occurrences of an entity type.
An entity type may have one or more possible candidate keys, the one which is selected is
known as the primary key.
A composite key is a candidate key that consists of two or more attributes
The name of each primary key attribute is underlined.
Relationships
For this module, we will use an alternative notation, where the relationship is a label on the
line. The meaning is identical
Degree of a Relationship
A recursive (unary) relationship is a relationship where the same entity participates more than once in different roles.
In the example above we are saying that employees are managed by employees.
If we wanted more information about who manages whom, we could introduce a second
entity type called manager.
Degree of a Relationship
It is also possible to have entities associated through two or more distinct relationships.
When ternary relationships occur in an ER model they should always be removed before finishing
the model. Sometimes the relationship can be replaced by a series of binary relationships that link
pairs of the original ternary relationship.
Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as a
noun.
So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product.
This process also works for higher order relationships.
Cardinality
A one to many relationship - one manager manages many employees, but each employee only
has one manager, so it is a one to many (1:n) relationship
A many to many relationship - One lecturer teaches many students and a student is taught by
many lecturers, so it is a many to many (m:n) relationship
Optionality
It is important to know the optionality because you must ensure that whenever you create a
new entity it has the required mandatory links.
Foreign keys
A foreign key is an attribute (or group of attributes) that is the primary key to another relation.
Before we start the actual mapping process we need to be certain that we have simplified the ER
model as much as possible.
This is the ideal time to check the model, as it is really the last chance to make changes to the ER
model without causing major complications.
If the relationship is mandatory at both ends it is often possible to subsume one entity type into the
other.
The choice of which entity type subsumes the other depends on which is the most important
entity type (more attributes, better key, semantic nature of them).
The result of this amalgamation is that all the attributes of the `swallowed up' entity become
attributes of the more important entity.
The key of the subsumed entity type becomes a normal attribute.
If there are any attributes in common, the duplicates are removed.
The primary key of the new combined entity is usually the same as that of the original more
important entity type.
There are a few reasons why you might not combine a 1:1 mandatory relationship:
the two entity types represent different entities in the `real world'.
the entities participate in very different relationships with other entities.
efficiency considerations when fast responses are required or different patterns of updating
occur to the two different entity types.
If the two entity types are kept separate then the association between them must be represented by
a foreign key.
The primary key of one entity type becomes the foreign key in the other.
It does not matter which way around it is done but you should not have a foreign key in each
entity.
1:1 Optional
So for 1:1 optional relationships, take the primary key from the `mandatory end' and add it to the
`optional end' as a foreign key.
To map 1:m relationships, the primary key on the `one side' of the relationship is added to the
`many side' as a foreign key.
If you have some m:n relationships in your ER model then these are mapped in the following
manner.
A new relation is produced which contains the primary keys from both sides of the relationship
These primary keys form a composite primary key.
A many to many relationship in an ER model is not necessarily incorrect. They can be replaced using
an intermediate entity. This should only be done where:
Consider the case of a car hire company. Customers hire cars; one customer hires many cars and a
car is hired by many customers.
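As a hedged Python sketch of this mapping (the relation and attribute names below are illustrative,
not taken from a specific schema in these notes), the m:n Customer-Car relationship is resolved into
a new linking relation whose composite primary key is made of the primary keys of the two sides:

customer = [
    {"customer_no": 1, "name": "Smith"},
    {"customer_no": 2, "name": "Jones"},
]
car = [
    {"registration_no": "ABC1", "model": "Fiesta"},
    {"registration_no": "XYZ9", "model": "Corsa"},
]
# New relation produced from the m:n relationship: one row per hire,
# carrying the primary keys of both sides as foreign keys.
hire = [
    {"customer_no": 1, "registration_no": "ABC1"},
    {"customer_no": 1, "registration_no": "XYZ9"},
    {"customer_no": 2, "registration_no": "ABC1"},
]
# The composite primary key (customer_no, registration_no) must be unique.
assert len({(h["customer_no"], h["registration_no"]) for h in hire}) == len(hire)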
Constructing an ER model
Before beginning to draw the ER model, read the requirements specification carefully. Document any
assumptions you need to make.
1. Identify entities - list all potential entity types. These are the object of interest in the system.
It is better to put too many entities in at this stage and then discard them later if necessary.
2. Remove duplicate entities - Ensure that they really are separate entity types and not just two
names for the same thing.
o Also do not include the system as an entity type
o e.g. if modelling a library, the entity types might be books, borrowers, etc.
o The library is the system, thus should not be an entity type.
3. List the attributes of each entity (all properties to describe the entity which are relevant to the
application).
o Ensure that the entity types are really needed.
o are any of them just attributes of another entity type?
o if so keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!
4. Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
5. Define the relationships
o Examine each entity type to see its relationship to the others.
6. Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
7. Remove redundant relationships
o Examine the ER model for redundant relationships.
ER modelling is an iterative process, so draw several versions, refining each one until you are happy
with it. Note that there is no one right answer to the problem, but some solutions are better than
others!
Parallel relationships occur when there are two or more relationships between two entity types (e.g.
employees own and service cars).
In order to distinguish between the two roles we can give the foreign keys different names.
Each relationship is mapped according to the rules, and we end up with two foreign keys
added to the Vehicle table.
So we add the employee_no as the owner_no in order to represent the `owns' relationship.
Employee(employee_no,manager_no, name,...)
So in general, for unary 1:n relationships, the foreign key is the primary key of the same table,
but is given a different name.
Note that the relationship is optional in both directions because not all staff can be managers,
and the top manager is not managed by anybody else.
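A minimal Python sketch of this unary 1:n mapping (data values are illustrative): manager_no is a
foreign key that refers back to employee_no in the same Employee relation, and is left empty for the
top manager because the relationship is optional.

employee = [
    {"employee_no": 1, "name": "Ann",  "manager_no": None},   # top manager, managed by nobody
    {"employee_no": 2, "name": "Bob",  "manager_no": 1},
    {"employee_no": 3, "name": "Cara", "manager_no": 1},
]
by_no = {e["employee_no"]: e for e in employee}
for e in employee:
    boss = by_no.get(e["manager_no"])
    print(e["name"], "is managed by", boss["name"] if boss else "nobody")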
There are three ways of implementing superclasses and subclasses and it depends on the application
which will be the most suitable.
Only the first method is a true reflection of the superclasses and subclasses and if either of the other
methods is preferential then the model should not have subclasses.
1. One relation for the superclass and one relation for each subclass.
2. One relation for each subclass.
3. One relation for the superclass.
Generalization
Consider extending the entity set account by classifying accounts as being either savings-account or
chequing-account.
Each of these is described by the attributes of account plus additional attributes. (savings has
interest-rate and chequing has overdraft-amount.)
We can express the similarities between the entity sets by generalization. This is the process of
forming containment relationships between a higher-level entity set and one or more lower-level
entity sets.
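A hedged Python sketch of method 1 above (one relation for the superclass and one for each
subclass), using the account generalization just described; attribute names are illustrative.

account = [
    {"account_no": "A-101", "balance": 500.0},
    {"account_no": "A-202", "balance": 900.0},
]
savings_account = [      # subclass relation: superclass key plus its own attributes
    {"account_no": "A-101", "interest_rate": 0.03},
]
chequing_account = [
    {"account_no": "A-202", "overdraft_amount": 250.0},
]
# A full savings-account record is recovered by joining subclass to superclass on the key.
acct = {a["account_no"]: a for a in account}
full_savings = [{**acct[s["account_no"]], **s} for s in savings_account]
print(full_savings)   # [{'account_no': 'A-101', 'balance': 500.0, 'interest_rate': 0.03}]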
Aggregation
Consider a DB with information about employees who work on a particular project and use a number
of machines doing that work. We get the E-R diagram shown in Figure 2.20.
Figure 2.20: E-R diagram with redundant relationships
Relationship sets work and uses could be combined into a single set. However, they shouldn't be, as
this would obscure the logical structure of this scheme.
Figure 2.21: E-R diagram with aggregation
Role of normalization
This E-R diagram says that a customer may have several accounts, each located in a specific bank
branch, and that an account may belong to several different customers.
In an earlier chapter, we have discussed data modelling using the entity-relationship model. In the E-
R model, a conceptual schema using an entity- relationship diagram is built and then mapped to a set
of relations. This technique ensures that each entity has information about only one thing and once
the conceptual schema is mapped into a set of relations, each relation would have information about
only one thing. The relations thus obtained would normally not suffer from any of the anomalies that
have been discussed in the last section. It can be shown that if the entity-relationship model is built
using the guidelines that were presented in Chapter 2, the resulting relations are in BCNF.
Of course, mistakes can often be made in database modelling, especially when the database is large
and complex, or one may, for some reason, carry out database schema design using techniques
other than a modelling technique like the entity-relationship model. For example, one could collect
all the information that an enterprise possesses and build one giant relation (often called the
universal relation) to hold it. This bottom-up approach is likely to lead to a relation that suffers
from all the problems that we have discussed in the last section. For example, the relation is
highly likely to have redundant information and update, deletion and insertion anomalies.
Normalization of such a large relation will then be essential to avoid (or at least minimise) these
problems.
Now to define the normal forms more formally, we first need the concept of functional dependence:
a functional dependency X → Y holds in a relation if, whenever two tuples agree on the values of the
attributes in X, they also agree on the values of the attributes in Y.
In its simplest form, a flat-file database is nothing more than a single, large
table (e.g., a spreadsheet)
A flat file contains only one record structure; there are no links between separate
records
Access to data is done in a sequential manner; random access is not supported
Access times are slow because the entire file must be scanned to locate the
desired data
Access times can be improved if the data is sorted but this introduces the
potential for error (e.g., one or more records may be misfiled)
Other problems with a flat-file database include 1) data redundancy; 2) data
maintenance; and 3) data integrity
EXAMPLE. An ‘orders’ file might require fields for the order number, date,
customer name, customer address, quantity, part description, price, etc. In this
example, each record must repeat the name and address of the customer ( data
redundancy). If the customer’s address changed, it would have to be changed in
multiple locations (data maintenance). If the customer name were spelled
differently in different orders (e.g., Acme, Acme Inc, Acme Co.) then the data
would be inconsistent (data integrity). (See the Normalization Exercise for a
practical example of these deficiencies; a short sketch of the redundancy problem appears below.)
Flat file data structures are only viable for small data processing requirements
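As a hedged Python illustration of the redundancy and maintenance problems in the example above
(field names are illustrative):

# Flat 'orders' file: every order repeats the customer's name and address.
orders_flat = [
    {"order_no": 1, "customer": "Acme",     "address": "1 High St", "part": "bolt", "qty": 10},
    {"order_no": 2, "customer": "Acme Inc", "address": "1 High St", "part": "nut",  "qty": 20},
]
# If the customer moves, every matching row must be updated, and the inconsistent
# spellings ("Acme" vs "Acme Inc") already make that update unreliable.

# Splitting customers from orders stores each customer fact exactly once.
customers = [{"customer_no": 1, "name": "Acme", "address": "1 High St"}]
orders = [
    {"order_no": 1, "customer_no": 1, "part": "bolt", "qty": 10},
    {"order_no": 2, "customer_no": 1, "part": "nut",  "qty": 20},
]
customers[0]["address"] = "2 Low Rd"   # one change updates the address for every order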
The hierarchical and network database models preceded the relational model;
today very few commercial databases use either of these models
A hierarchical database is a series of flat-files linked in structured 'tree'
relationships
IBM's IMS (Information Management System) database, often referred to by the
name of its proprietary language, DL/I (Data Language I), is the only viable
commercial hierarchical database still in use today, primarily on older mainframe
computers
The concept for the hierarchical database model was originally based on a bill of
materials (BOM)
Data is represented as a series of parent/child relationships
The Indexed Sequential Access Method (ISAM), is a disk storage and access
method, not a database model
There is much confusion surrounding the term ISAM and in practice it is used to
refer to desktop and/or file-based databases like Microsoft's FoxPro and Jet-
based Access, Clipper, dBase, Paradox, and Btrieve
It also is used to refer to navigational database applications that rely on a
procedural approach to data access and retrieval
Possibly the only 'true' ISAM-based products are IBM’s Information Management
System (IMS) and Btrieve
Under ISAM, records are located using a key value. A smaller index file stores the
keys along with pointers to the records in the larger data file. The index file is
first searched for the key and then the associated pointer is used to locate the
desired record
ISAM is more concerned with the access and storage of data; it does not
represent a full database management system (DBMS)
Data access and storage methods are discussed below
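A minimal Python sketch of the ISAM idea described above: a smaller index maps each key to a
pointer (here just a list position) into the larger data file. Record layouts and values are
illustrative; a real ISAM index is an on-disk structure, not an in-memory dictionary.

# Data "file": records stored by key; the index "file" maps key -> position (pointer).
data_file = [
    {"key": 101, "value": "Adams"},
    {"key": 205, "value": "Baker"},
    {"key": 309, "value": "Cody"},
]
index = {rec["key"]: pos for pos, rec in enumerate(data_file)}

def isam_lookup(key):
    # Search the index for the key, then follow the pointer into the data file.
    pos = index.get(key)
    return data_file[pos] if pos is not None else None

print(isam_lookup(205))   # {'key': 205, 'value': 'Baker'}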
The theory behind the relational database model is discussed in the section entitled
'Characteristics of a Relational Database'; this section focuses on the distinctions
between the relational model and other database models
In a relational database, the logical design is independent of the physical design
Queries against a relational database management system (RDBMS) are based
on logical relationships and processing those queries does not require pre-
defined access paths among the data (i.e., pointers)
The relational database provides flexibility that allows changes to the database
structure to be easily accommodated
Because the data reside in tables, the structure of the database can be changed
without having to change any applications that were based on that structure
EXAMPLE: You add a new field for e-mail address in the customers table. If you
are using a non-relational database, you probably have to modify the application
that will access this information by including 'pointers' to the new data. With a
relational database, the information is immediately accessible because it is
automatically related to the other data by virtue of its position in the table. All
that is required to access the new e-mail field is to add it to a SELECT list
The structural flexibility of a relational database allows combinations of data to
be retrieved that were never anticipated at the time the database was initially
designed
In contrast, the database structure in older database models is "hard-coded" into
the application; if you add new fields to a non-relational database, any
application that accesses the database will have to be updated
Object-Oriented Database
Codd’s Rules
E. F. Codd presented these rules as a basis of determining whether a DBMS could be classified as
Relational
Codd's Rules
These rules protect users & application developers from having to change the applications
following any low-level reorganisation of the DB.
Rule 1 - The Information Rule -
All info in a RDB is represented explicitly at the logical level in exactly one way - by values in a
table.
ALL info, even the Metadata held in the system catalogue, MUST be stored as relations (tables) &
manipulated in the same way as data.
Rule 2 - Guaranteed Access -
Each & every datum in an RDB is guaranteed to be logically accessible by a combination of table
name, primary key value & column name.
Rule 3 - Systematic treatment of null values -
Null values are supported for representation of 'missing' & inapplicable information in a systematic
way & independent of data type.
Rule 11 - Distribution Independence -
This rule does not say that to be fully Relational the DBMS must support distributed DB's but
that if it does the query must remain the same.
Rule 12 - Nonsubversion Rule -
If a RDBMS has a low level (record at a time) language, that low level language cannot be used to
subvert or bypass the integrity rules &constraints expressed in the higher-level relational language.
All database access must be controlled through the DBMS so that the integrity of the database
cannot be compromised without the knowledge of the user or the DBA.
This does not prohibit use of record at a time languages e.g. PL/SQL
Relational Algebra
Recall, the Relational Model consists of the elements: relations, which are made up of
attributes.
A relation is a set of attributes with values for each attribute such that:
1. Each attribute value must be a single value only (atomic).
2. All values for a given attribute must be of the same type (or domain).
3. Each attribute name must be unique.
4. The order of attributes is insignificant
5. No two rows (tuples) in a relation can be identical.
6. The order of the rows (tuples) is insignificant.
Relational Algebra is a collection of operations on Relations.
Relations are operands and the result of an operation is another relation.
Two main collections of relational operators:
Union: R ∪ S
Result: Relation with tuples from R and S with duplicates removed.
Difference: R - S
Result: Relation with tuples from R but not from S.
Intersection: R ∩ S
Result: Relation with tuples that appear in both R and S.
Attributes of relations need not be identical to perform union, intersection and difference
operations.
However, they must have the same number of attributes or arity and the domains for
corresponding attributes must be identical.
Domain is the datatype and size of an attribute.
The degree of relation R is the number of attributes it contains.
Definition: Two relations R and S are union compatible if and only if they have the same
degree and the domains of the corresponding attributes are the same.
Some additional properties:
o Union, Intersection and difference operators may only be applied to Union Compatible
relations.
o Union and Intersection are commutative operations
R ∪ S = S ∪ R
R ∩ S = S ∩ R
o Difference operation is NOT commutative.
R - S ≠ S - R
o The resulting relations may not have meaningful names for the attributes. Convention is
to use the attribute names from the first relation.
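A small Python sketch of these set-theoretic operators, treating two union-compatible relations as
sets of tuples (the data values are illustrative):

# Two union-compatible relations: same arity, corresponding domains identical.
R = {("Smith", "CS"), ("Jones", "Econ")}
S = {("Jones", "Econ"), ("Green", "Fin")}

print(R | S)   # Union: tuples from R and S, duplicates removed
print(R & S)   # Intersection: tuples appearing in both relations
print(R - S)   # Difference: tuples in R but not in S
print(S - R)   # Not commutative: in general differs from R - S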
Cartesian Product
R × S
Selection Operator
Selection Examples
Assume the following relation EMP has the following tuples:
Name Office Dept Rank
Smith 400 CS Assistant
Jones 220 Econ Adjunct
Green 160 Econ Assistant
Brown 420 CS Associate
Smith 500 Fin Associate
Select only those Employees with last name Smith who are assistant professors:
σ Name = 'Smith' ∧ Rank = 'Assistant' (EMP)
Result:
Name Office Dept Rank
Smith 400 CS Assistant
Select only those Employees who are either Assistant Professors or in the Economics
department:
σ Rank = 'Assistant' ∨ Dept = 'Econ' (EMP)
Result:
Name Office Dept Rank
Smith 400 CS Assistant
Jones 220 Econ Adjunct
Green 160 Econ Assistant
Select only those Employees who are not in the CS department or Adjuncts:
σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP)
Result:
Name Office Dept Rank
Green 160 Econ Assistant
Smith 500 Fin Associate
Projection Operator
Projection Examples
Assume the same EMP relation above is used. Project the Name and Dept attributes:
π Name, Dept (EMP)
Result:
Name Dept
Smith CS
Jones Econ
Green Econ
Brown CS
Smith Fin
The selection and projection operators can be combined to perform both operations.
Show the names of all employees working in the CS department:
π name (σ Dept = 'CS' (EMP))
Results:
Name
Smith
Brown
Show the name and rank of those Employees who are not in the CS department or Adjuncts:
π name, rank (σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP))
Result:
Name Rank
Green Assistant
Smith Associate
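A hedged Python sketch of selection and projection over the EMP relation used above, with rows as
dictionaries (this mirrors the algebra; it is not how a real DBMS evaluates queries):

EMP = [
    {"Name": "Smith", "Office": 400, "Dept": "CS",   "Rank": "Assistant"},
    {"Name": "Jones", "Office": 220, "Dept": "Econ", "Rank": "Adjunct"},
    {"Name": "Green", "Office": 160, "Dept": "Econ", "Rank": "Assistant"},
    {"Name": "Brown", "Office": 420, "Dept": "CS",   "Rank": "Associate"},
    {"Name": "Smith", "Office": 500, "Dept": "Fin",  "Rank": "Associate"},
]

def select(relation, predicate):       # sigma: keep tuples satisfying the condition
    return [t for t in relation if predicate(t)]

def project(relation, attributes):     # pi: keep only the named attributes, drop duplicates
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

# pi name (sigma Dept = 'CS' (EMP))  ->  Smith, Brown
print(project(select(EMP, lambda t: t["Dept"] == "CS"), ["Name"]))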
For this expression, use R and S from the Set Theoretic Operations section above.
Join Operation
Join operations bring together two relations and combine their attributes and tuples in a
specific fashion.
Join Examples
Assume we have the EMP relation from above and the following DEPART relation:
Dept MainOffice Phone
CS 404 555-1212
Econ 200 555-1234
Fin 501 555-4321
Hist 100 555-9876
Find all information on every employee including their department info where the employee
works in an office numbered less than the department main office:
EMP ⋈ (emp.office < depart.mainoffice) ∧ (emp.dept = depart.dept) DEPART
Results:
Name Office Dept Rank Dept MainOffice Phone
Smith 400 CS Assistant CS 404 555-1212
Green 160 Econ Assistant Econ 200 555-1234
Smith 500 Fin Associate Fin 501 555-4321
Natural Join
Notice in the generic (Theta) join operation, any attributes in common (such as dept above)
are repeated.
The Natural Join operation removes these duplicate attributes.
The natural join operator is: *
We can also assume using * that the join condition will be = on the two attributes in common.
Example: EMP * DEPART
Results:
Name Office Dept Rank MainOffice Phone
Smith 400 CS Assistant 404 555-1212
Jones 220 Econ Adjunct 200 555-1234
Green 160 Econ Assistant 200 555-1234
Brown 420 CS Associate 404 555-1212
Smith 500 Fin Associate 501 555-4321
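A hedged Python sketch of the theta join and natural join above, reusing the EMP rows from the
earlier sketch and representing DEPART the same way (a simple nested-loop join; real systems use
far more efficient join algorithms):

DEPART = [
    {"Dept": "CS",   "MainOffice": 404, "Phone": "555-1212"},
    {"Dept": "Econ", "MainOffice": 200, "Phone": "555-1234"},
    {"Dept": "Fin",  "MainOffice": 501, "Phone": "555-4321"},
    {"Dept": "Hist", "MainOffice": 100, "Phone": "555-9876"},
]

def theta_join(r, s, condition):
    # Keep every combined pair of tuples that satisfies the join condition;
    # attributes in common are repeated (prefixed with "S_" here).
    return [{**a, **{"S_" + k: v for k, v in b.items()}}
            for a in r for b in s if condition(a, b)]

def natural_join(r, s):
    # Equality on the attributes the two relations have in common ("Dept"),
    # with the duplicated attribute kept only once.
    common = set(r[0]) & set(s[0])
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

print(theta_join(EMP, DEPART,
                 lambda e, d: e["Office"] < d["MainOffice"] and e["Dept"] == d["Dept"]))
print(natural_join(EMP, DEPART))   # EMP * DEPART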
Outer Join
In the Join operations so far, only those tuples from both relations that satisfy the join
condition are included in the output relation.
The Outer join includes other tuples as well according to a few rules.
Three types of outer joins:
1. Left Outer Join includes all tuples in the left hand relation and includes only those
matching tuples from the right hand relation.
2. Right Outer Join includes all tuples in the right hand relation and includes only those
matching tuples from the left hand relation.
3. Full Outer Join includes all tuples in the left hand relation and from the right hand
relation.
Examples:
PEOPLE:
Name Age Food
Alice 21 Hamburger
Bill 24 Pizza
Carl 23 Beer
Dina 19 Shrimp

MENU:
Food Day
Pizza Monday
Hamburger Tuesday
Chicken Wednesday
Pasta Thursday
Tacos Friday
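A hedged Python sketch of the three outer joins using the PEOPLE and MENU tables above, joining
on the Food column; None stands in for the NULL values a DBMS would produce for unmatched tuples.

PEOPLE = [("Alice", 21, "Hamburger"), ("Bill", 24, "Pizza"),
          ("Carl", 23, "Beer"), ("Dina", 19, "Shrimp")]
MENU = [("Pizza", "Monday"), ("Hamburger", "Tuesday"), ("Chicken", "Wednesday"),
        ("Pasta", "Thursday"), ("Tacos", "Friday")]

def left_outer(people, menu):
    out = []
    for name, age, food in people:
        days = [day for f, day in menu if f == food]
        if days:
            out.extend((name, age, food, day) for day in days)
        else:
            out.append((name, age, food, None))      # unmatched left tuple, padded with None
    return out

def right_outer(people, menu):
    out = []
    for food, day in menu:
        matches = [(n, a) for n, a, f in people if f == food]
        if matches:
            out.extend((n, a, food, day) for n, a in matches)
        else:
            out.append((None, None, food, day))      # unmatched right tuple
    return out

def full_outer(people, menu):
    # All left-hand tuples plus the right-hand tuples that found no match.
    return left_outer(people, menu) + [r for r in right_outer(people, menu) if r[0] is None]

print(left_outer(PEOPLE, MENU))    # Carl and Dina appear padded with None
print(full_outer(PEOPLE, MENU))    # also the Chicken, Pasta and Tacos rows padded with None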
Outer Union
Note: The order of the rows is immaterial; the order of the columns is immaterial.
Note: The requirement that there be no duplicated rows in the table means that the table has a key
(although the key might be made up of more than one column—even, possibly, of all the columns).
Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of
the key.
Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the
(composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if
it has no partial dependencies."
Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and
if every join dependency in the table is a consequence of the candidate keys of the table.
Def: A table is in DKNF if every constraint on the table is a logical consequence of the
definition of keys and domains
A normal form refers to a class of relational schemas that obey some set of rules. Schemas that obey
the rules are said to be in the normal form. It is usually the case that a schema designer wants all
relations created to be in as many normal forms as possible.
A relation is in First Normal Form (1NF) if every field value is non-decomposable (atomic) and every
tuple is unique. Attributes that partition into sub-attributes, and "repeating groups," which are very
popular in the file-based systems, are not permitted. Tuples in which the value of every attribute is
exactly the same as in another tuple are not allowed.
There are two common types of non-atomic fields. The first non-atomic type of field is a list or
"repeating group." For example, one person may have several skills. If these are simply listed in the
same field then the relation is not in first normal form. The second type is a structured field, e.g. an
address. If the entire address is put in one field the relation is not in first normal form because a US
address consists of a street address, a city, a state, and a zip code. Whether a field is considered
structured or not is somewhat application dependent. If you will never want to access the data by
city, by state or by zip code then it may be appropriate to have the entire address in one field, but if
you want to sort by zip code, for example, then it is not appropriate to put the entire address in one
field.
The usefulness of first normal form is fairly obvious. Repeating groups destroy the natural
rectangular structure of a relation. It is extremely difficult to refer to a particular element of the
repeating group, because some sort of "position" within the attribute value must be specified.
Furthermore, the various parts of a partitioned attribute may behave very differently from the
standpoint of dependencies. The rule of first normal form embodies the common sense requirement
that each attribute have its own name. The relation in Figure C10 is not in 1NF because it contains
lists and/or structured fields.
Key: Accountant Number + Skill Number
Accountant Number | Skill Number | Skill Category | Prof | First Name | Last Name | Accountant Age | Group Number | Group City | Group Supervisor
21 113 Systems 3 Alice Adams 55 52 NYC Brown
35 113 Systems 5 Doug Baker 32 44 LA Green
35 179 Tax 1 Doug Baker 32 44 LA Green
35 204 Audit 6 Doug Baker 32 44 LA Green
50 179 Tax 2 Charlie Cody 40 44 LA Green
77 148 Consulting 6 Doug Doe 52 52 NYC Brown
77 179 Tax 6 Doug Doe 52 52 NYC Brown
Information that is only in 1NF is highly redundant. To reduce redundancy we will convert the data to
second normal form. Fixing relations that are not in first normal form is a fairly simple task. Each
attribute is assigned its own name, and repeating group elements become individual tuples.
Normal forms higher than first normal form have been motivated by the discovery of anomalies,
operations on a relation that result in an inconsistency or undesirable loss of data. The removal of
anomalies involves progression through the various levels of normal form. This progression is aimed
at an ideal:
Every information-bearing association between data values appears once and only once, and does
not depend on the presence of other associations.
A relation is in Second Normal Form (2NF) if it is in 1NF and if all of its attributes are dependent on
the whole key (i.e. none of the non-key attributes are related only to a part of the key). The
decomposition technique for producing a second-normal-form relation is also fairly simple: construct
a separate relation to embody the partial dependency, and remove the dependent attribute from the
original relation.
The three new relations created when the partial dependencies are eliminated are shown below.
A relation is in Third Normal Form (3NF) if it is in 2NF and there are no transitive dependencies (i.e.
none of the non-key attributes are dependent upon another attribute which in turn is dependent on
the relation key).
A relation is said to be in third normal form if it is in second normal form and there are no attributes
(or sets of attributes) X and Y for which:
KEY → X and X → Y
Another way of looking at 3NF is that it results in relations that represent entities and relations that
represent relationships among entities. An appropriate decomposition for converting a schema to
third normal form is to place those attributes which are not directly (i.e., are only transitively)
dependent on the key into a separate relation. The decompositions shown so far have two important
features:
1. Every functional dependency of the original relation is preserved in (and can be checked within)
the component relations (the decomposition preserves dependencies).
2. If an extension of the original relation is decomposed, it can be reconstructed via a JOIN of its
components (the decomposition has lossless join).
Accountant Table
Key: Accountant Number
Accountant Number | First Name | Last Name | Accountant Age | Group Number
21 Alice Adams 55 52
35 Doug Baker 32 44
50 Charlie Cody 40 44
77 Doug Doe 52 52
Group Table
Key: Group Number
Group Number | Group City | Group Supervisor
44 LA Green
52 NYC Brown
Skill Table
Skill Number Skill Category
113 Systems
179 Tax
204 Audit
148 Consulting
Proficiency Table
Key: Accountant Number + Skill Number
Accountant Number | Skill Number | Prof
21 113 3
35 113 5
35 179 1
35 204 6
50 179 2
77 148 6
77 179 6
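A hedged Python sketch of the lossless-join property for these tables: joining Proficiency back to
Accountant, Group and Skill on their keys reproduces every row of the original 1NF relation (the
dictionary lookups below stand in for natural joins on the key attributes).

proficiency = [(21, 113, 3), (35, 113, 5), (35, 179, 1), (35, 204, 6),
               (50, 179, 2), (77, 148, 6), (77, 179, 6)]
accountant = {21: ("Alice", "Adams", 55, 52), 35: ("Doug", "Baker", 32, 44),
              50: ("Charlie", "Cody", 40, 44), 77: ("Doug", "Doe", 52, 52)}
group = {52: ("NYC", "Brown"), 44: ("LA", "Green")}
skill = {113: "Systems", 179: "Tax", 204: "Audit", 148: "Consulting"}

for acct_no, skill_no, prof in proficiency:
    first, last, age, group_no = accountant[acct_no]
    city, supervisor = group[group_no]
    # One reconstructed row of the original 1NF relation, with nothing lost or added.
    print(acct_no, skill_no, skill[skill_no], prof, first, last, age, group_no, city, supervisor)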
A relation is in Boyce/Codd Normal Form (BCNF) if it is in 2NF and all of its determinants (attributes
upon which other attributes depend) are candidate keys (i.e. the attributes could have been chosen
as keys). For most practical purposes 3NF and BCNF have the same effect. 3NF and BCNF are the
same if the word "candidate" is added to "key" in the definitions of 2NF and 3NF.
The road to total normalization does not end here. Other anomalous situations may occur, and many
other normal forms have been defined. Although six levels of normalization are possible, normalizing
to 3NF is usually sufficient, and in some cases 2NF is sufficient.
Normalization by example
Rule 1. Eliminate Repeating Groups. Make a separate table for each set of related
attributes, and give each table a primary key.
Unnormalized Data Items for Puppies:
Puppy number
Puppy name
Kennel Code
Kennel Name
Kennel location
Trick ID 1...n
Trick Name 1...n
Trick Where Learned 1...n
Skill Level 1...n

In the original list of data, each puppy's description is followed by a list of tricks the puppy has
learned. Some might know ten tricks, some might not know any. To answer the question, "Can Fifi
roll over?", we need first to find Fifi's puppy record, then scan the list of tricks associated with the
record. This is awkward, inefficient, and extremely untidy.

Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks
from the puppy information results in first normal form. The puppy number in the trick table matches
the primary key in the puppy table, providing a foreign key for relating the two tables with a join
operation. Now we can answer our question with a direct retrieval: look to see if Fifi's puppy number
and the trick ID for "roll over" appear together in the trick table.
TRICK TABLE
Puppy Number | Trick ID | Trick Name | Trick Where Learned | Skill Level
52 | 27 | Roll Over | 16 | 9
53 | 16 | Nose Stand | 9 | 9
54 | 27 | Roll Over | 9 | 5

The trick name (e.g., "Roll Over") appears redundantly for every puppy that knows it. Just Trick ID
would do. [Note that Trick Name depends on only a part (the Trick ID) of the multi-valued, i.e.,
composite, key.]
In the Trick Table, the primary key is made up of the puppy number and the trick ID. This makes
sense for the "Where Learned" and "Skill Level" attributes, since they will be different for every
puppy-trick combination. But the trick name depends only on the Trick ID. The same name will
appear redundantly every time its associated ID appears in the Trick Table.

Suppose you want to reclassify a trick, i.e., to give it a different trick ID. The change has to be made
for every puppy that knows the trick! If you miss some of the changes, you will have several puppies
with the same trick under different IDs. This is an update anomaly.

Or suppose the last puppy knowing a particular trick gets run over by a car. His records will be
removed from the database, and the trick will not be stored anywhere! This is a delete anomaly. To
avoid these problems, we need the second normal form.

SECOND NORMAL FORM

Puppy Table
Puppy Number
Puppy Name
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level
To achieve this, separate the attributes depending on both parts of the key from those depending
only on the Trick ID. This results in two tables: "Tricks," which gives the name for each Trick ID; and
"Puppy Tricks," which lists the tricks learned by each puppy.
[Note that Rettig's Rule 2, "Eliminate Redundant Data," is aimed at tables that fail to satisfy the
conditions imposed on tables by the definition of 2nd Normal Form. In this example it takes the
"Trick" table, which fails to be in 2NF because the Trick Name attribute depends only on the Trick ID
attribute (i.e., on just part of the composite key), and turns "Trick" into two tables. Note also that the
result is a set of tables that come close to being single-theme tables. "Tricks" is clearly a single-
theme table. "Puppy Tricks" is a single-theme table because it pairs puppies and their tricks, i.e., it
deals with the relationships between individual puppies and the tricks they know. In "Puppy Tricks,"
the "Trick Where Learned" attribute is clearly dependent on both parts of the key, since the attribute
is based not only on which particular trick is being referred to but also on where the particular puppy
learned that trick. The same is true of the "Skill Level" attribute, since this attribute is based not only
on which particular trick is being referred to but also on what the particular puppy's level of skill on
that trick is.]
Now we can reclassify a trick in a single operation: Look up the Trick ID in the "Tricks" table and
change its name. The result will instantly be available throughout the application.
The Puppy Table satisfies the first normal form, since it contains no repeating groups. It satisfies the
second normal form, since it does not have a multivalued key. But the key is Puppy Number, and the
kennel name and kennel location describe only a kennel, not a puppy. To achieve the third normal
form, they must be moved into a separate table. Since they describe a kennel, Kennel Code becomes
the key of the new "Kennels" table.

THIRD NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

The motivation for this is the same as for the second normal form: we want to avoid update and
delete anomalies. For example, suppose no puppies from the Daisy Hill Puppy Farm were currently
stored in the database. With the previous design, there would be no record of its existence!

[Note that Rettig's Rule 3, "Eliminate Columns Not Dependent on Key," amounts to removing the
transitive dependencies that keep a table from achieving 3rd Normal Form. In the "Puppy" table,
attributes Kennel Name and Kennel Location are dependent only on Kennel Code. This situation is
superficially similar to that of the dependence on part of a composite key that is the focus of the 2nd
Normal Form. However, this situation differs in that here the attribute (viz., Kennel Code) on which
other attributes are dependent is not part of a composite key; it is simply one of the non-key
attributes in the "Puppy" table. Note also that "Kennels" is clearly a single-theme table. "Puppies" is
close to a single-theme table, in that it concentrates on puppies and their numbers. If the real-life
situation is (which is likely) that any given puppy can be identified with one and only one kennel,
then we can consider "Puppies" to be a single-theme table. But if (which is not likely) puppies
sometimes start life in one kennel and are then transferred, as puppies, to another kennel, we would
not be able to consider "Puppies" to be single-theme because we would have to make further
modifications to handle the mixed origins.]
Third Normal Form is sufficient for most situations. But if that isn't normal enough for
you ...
Rule 4. Isolate Independent Multiple Relationships. No table may contain two or more 1:n
or n:m relationships that are not directly related.
This applies only to designs that include one-to-many and many-to-many relationships. An example
of a one-to-many relationship is that one kennel can hold many puppies. An example of a many-to-
many relationship is that a puppy can know many tricks and several puppies can know the same
trick.
Puppy Tricks and Costumes
Puppy Number
Trick ID
Trick Where Learned
Skill Level
Costume

Suppose we want to add a new attribute to the Puppy-Trick table, "Costume." This way we can look
for puppies that can both "sit-up-and-beg" and wear a Groucho Marx mask, for example. The fourth
normal form dictates against this (i.e., against using the Puppy-Tricks table, not against begging
while wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy
may be able to wear a wet suit. This does not mean it can simultaneously sit up and beg. How will
you represent this if you store both attributes in the same table?

FOURTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number
Rule 5. Isolate Semantically Related Multiple Relationships. There may be practical
constraints on information that justify separating logically related many-to-many
relationships.
Usually, related attributes belong together. For example, if we really wanted to record which tricks
each puppy could do in which costume, we would want to keep the Costume attribute in the Puppy-
Tricks table. But there are times when special characteristics of the data make it more efficient to
separate even logically related attributes.
Imagine that our database will record when breeds are available in each kennel, and which breeder
supplies dogs to those kennels. This suggests a Kennel-Breeder-Breeds table, which satisfies fourth
normal form. As long as any kennel can supply any breed from any breeder, this works fine.
Kennel-Breeder-Breeds
Kennel Number
Breeder
Breed

Now suppose a law is passed to prevent exclusive arrangements: a kennel selling any breed must
offer that breed from all breeders it deals with. In other words, if Khabul Khennels sells Afghans and
wants to sell any Daisy Hill puppies, it must sell Daisy Hill Afghans.

The need for fifth normal form becomes clear when we consider inserts and deletes. Suppose that a
kennel (whose number in the database happens to be 5) decides to offer three new breeds: Spaniels,
Dachshunds, and West Indian Banana-Biters. Suppose further that this kennel already deals with
three breeders that can supply those breeds. This will require nine new rows in the database, one for
each breeder-and-breed combination.

Kennel-Breeder-Breeds
Kennel Number | Breeder | Breed
5 | Acme | Spaniel
5 | Acme | Dachshund
5 | Acme | Banana-Biter
5 | Puppy Factory | Spaniel
5 | Puppy Factory | Dachshund
5 | Puppy Factory | Banana-Biter
5 | Whatapuppy | Spaniel
5 | Whatapuppy | Dachshund
5 | Whatapuppy | Banana-Biter

Breaking up the table reduces the number of inserts to six. Here are the tables necessary for fifth
normal form, shown with the six newly inserted rows.

FIFTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number

Kennel-Breed
Kennel Number | Breed
5 | Spaniel
5 | Dachshund
5 | Banana-Biter

Kennel-Breeder
Kennel Number | Breeder
5 | Acme
5 | Puppy Factory
5 | Whatapuppy

If an application involves significant update activity, fifth normal form can mean important savings.
Note that these combination tables develop naturally out of entity-relationship analysis.

[Note also that each of these tables is clearly a single-theme table.]
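A hedged Python sketch of the join dependency that makes this decomposition safe: under the "no
exclusive arrangements" rule, joining Kennel-Breed with Kennel-Breeder on Kennel Number
regenerates exactly the nine Kennel-Breeder-Breeds rows.

kennel_breed = {(5, "Spaniel"), (5, "Dachshund"), (5, "Banana-Biter")}
kennel_breeder = {(5, "Acme"), (5, "Puppy Factory"), (5, "Whatapuppy")}

# Join the two smaller tables on Kennel Number.
rejoined = {(k1, breeder, breed)
            for k1, breed in kennel_breed
            for k2, breeder in kennel_breeder
            if k1 == k2}

print(len(rejoined))   # 9 rows, one for each breeder-and-breed combination for kennel 5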
Transaction Models
A transaction is a unit of program execution that accesses and possibly updates various data
items.
ACID Properties
Atomicity. Either all operations of the transaction are properly reflected in the database or none
are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions T i
and T j , it appears to T i that either T j finished execution before T i started, or T j started
execution after T i finished.
Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.
Transaction State
1. Active, the initial state; the transaction stays in this state while it is executing
2. Partially committed, after the final statement has been executed.
3. Failed, after the discovery that normal execution can no longer proceed.
4. Aborted, after the transaction has been rolled back and the database restored to its state prior
to the start of the transaction. Two options after it has been aborted :
a. restart the transaction -- only if no internal logical error
b. kill the transaction
5. Committed, after successful completion.
Concurrent Executions
Multiple transactions are allowed to run concurrently in the system. Advantages are :
increased processor and disk utilization, leading to better transaction throughput: one
transaction can be using the CPU while another is reading from or writing to the disk
reduced average response time for transactions: short transactions need not wait behind
long ones
Concurrency control schemes -- mechanisms to control the interaction among the concurrent
transactions in order to prevent them from destroying the consistency of the database
Two transactions accessing the same database item have their operations interleaved in a way
that makes the database item incorrect.
T1:                      T2:
read_item(X);
X := X - N;
                         read_item(X);
                         X := X + M;
write_item(X);
read_item(Y);
                         write_item(X);
Y := Y + N;
write_item(Y);
If transactions T1 and T2 are submitted at approximately the same time and their operations are
interleaved then the value of database item X will be incorrect because T2 reads the value of X
before T1 changes it in the database and hence the updated database value resulting from T1 is
lost.
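A hedged Python sketch that replays this interleaving step by step to show the lost update (X is the
shared database item; N and M are the amounts applied by T1 and T2):

X = 100            # shared database item
N, M = 10, 5

t1_x = X           # T1: read_item(X)
t1_x = t1_x - N    # T1: X := X - N
t2_x = X           # T2: read_item(X)  -- reads X before T1 writes it back
t2_x = t2_x + M    # T2: X := X + M
X = t1_x           # T1: write_item(X) -> 90
X = t2_x           # T2: write_item(X) -> 105, overwriting T1's update

print(X)           # 105, but the correct serial result is 95: T1's update is lost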
One transaction updates a database item and then the transaction, for some reason, fails. The
updated item is accessed by another transaction before it is changed back to its original value.
transaction T1 fails and must change the value of X back to its old value; meanwhile T2 has read the
"temporary" incorrect value of X
T1:                      T3:
                         sum := 0;
                         read_item(A);
                         sum := sum + A;
read_item(X);
X := X - N;
write_item(X);
                         read_item(X);
                         sum := sum + X;
                         read_item(Y);
                         sum := sum + Y;
read_item(Y);
Y := Y + N;
write_item(Y);
T3 reads X after N is subtracted and reads Y before N is added, so a wrong summary is the result
Schedules
Schedules -- sequences that indicate the chronological order in which instructions of concurrent
transactions are executed
a schedule for a set of transactions must consist of all instructions of those transactions
must preserve the order in which the instructions appear in each individual transaction
Example Schedules
Let T 1 transfer $50 from A to B, and T 2 transfer 10% of the balance from A to B. The following
is a serial schedule (Schedule 1 in the text), in which T 1 is followed by T 2 .
Let T 1 and T 2 be the transactions defined previously. The following schedule (Schedule 3 in the
text) is not a serial schedule, but it is equivalent to Schedule 1.
The following concurrent schedule (Schedule 4 in the text) does not preserve the value of the
sum A + B.
Serializability
1. conflict serializability
2. view serializability
We ignore operations other than read and write instructions, and we assume that transactions
may perform arbitrary computations on data in local buffers in between reads and writes. Our
simplified schedules consist of only read and write instructions.
Conflict Serializability
Instructions I i and I j , belonging to transactions T i and T j respectively, conflict if and only if there
exists some item Q accessed by both I i and I j , and at least one of these instructions wrote Q.
Schedule 3 below can be transformed into Schedule 1, a serial schedule where T 2 follows T 1 ,
by a series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.
View serializability
Let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if
the following three conditions are met:
1. For each data item Q, if transaction T i reads the initial value
of Q in schedule S, then transaction T i must, in schedule
S', also read the initial value of Q.
2. For each data item Q, if transaction T i executes read(Q) in
schedule S, and that value was produced by transaction T j
(if any), then transaction T i must in schedule S' also read
the value of Q that was produced by transaction T j .
3. For each data item Q, the transaction (if any) that performs
the final write(Q) operation in schedule S must perform the
final write(Q) operation in schedule S'.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Every view serializable schedule which is not conflict serializable has blind writes
Determining such equivalence requires analysis of operations other than read and write
Recoverability
Need to address the effect of transaction failures on concurrently
running transactions.
Recoverable schedule --- if a transaction T j reads a data item previously written by a transaction
T i , the commit operation of T i appears before the commit operation of T j .
The following schedule (Schedule 11) is not recoverable if T 9 commits immediately after the read
If T 8 should abort, T 9 would have read (and possibly shown to the user) an inconsistent
database state. Hence database must ensure that schedules are recoverable.
Cascadeless schedules --- cascading rollbacks cannot occur; for each pair of transactions T i and T
j such that T j reads a data item previously written by T i , the commit operation of T i appears
before the read operation of T j .
Every cascadeless schedule is also recoverable
It is desirable to restrict the schedules to those that are cascadeless.
Process Model:
Basic modules:
o parser
o query rewrite
o optimizer
o query executor
o access methods
o buffer manager
o lock manager
o log/recovery manager
Query Rewriter
Optimizer
Executor
Access Methods
o uniform relational interface (open, get next), a la INGRES AMI, System R's RSS
o multiple implementations: heap, B-tree, extensible hashing
Buffer Manager
Lock Manager
Log/Recovery Manager
A lock is a variable associated with a data item that describes the status of the item with respect to
possible operations that can be applied to it.
TYPES OF LOCKS:
Binary locks: only two states of a lock; too simple and too restrictive; not used in practice.
Shared/exclusive locks: which provide more general locking capabilities and are used in practical
database locking schemes. (Read Lock as a shared lock, Write Lock as an exclusive lock).
Binary Locks:
A binary lock can have two states or values: locked and unlocked (or 1 and 0, for simplicity). A
distinct lock is associated with each database item X.
1. If the value of the lock on X is 1, item X cannot be accessed by a database operation that
requests the item.
2. If the value of the lock on X is 0, the item can be accessed when requested. We refer to the
current value (or state) of the lock associated with item X as LOCK(X).
Two operations, lock_item and unlock_item, are used with binary locking.
If the simple binary locking scheme described here is used, every transaction must obey the following
rules:
1. A transaction T must issue the operation lock_item(X) before any read_item(X) or write_item(X)
operations are performed in T.
2. A transaction T must issue the operation unlock_item(X) after all read_item(X) and write_item(X)
operations are completed in T.
3. A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X.
4. A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on item X.
These rules can be enforced by the lock manager module of the DBMS. Between the lock_item(X)
and unlock_item(X) operations in transaction T, T is said to hold the lock on item X. At most one
transaction can hold the lock on a particular item. Thus no two transactions can access the same
item concurrently.
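A hedged, single-process Python sketch of the lock manager's binary lock table (a real lock manager
blocks the requesting transaction instead of returning False, and must itself be safe under
concurrency):

LOCK = {}   # item -> transaction currently holding the binary lock on it

def lock_item(x, txn):
    # At most one transaction can hold the lock on a particular item.
    if LOCK.get(x) is None:
        LOCK[x] = txn
        return True
    return False            # in a real DBMS the requesting transaction would wait here

def unlock_item(x, txn):
    # Only the holder of the lock may release it.
    if LOCK.get(x) == txn:
        del LOCK[x]

print(lock_item("X", "T1"))   # True  - T1 now holds the lock on X
print(lock_item("X", "T2"))   # False - T2 must wait until T1 unlocks X
unlock_item("X", "T1")
print(lock_item("X", "T2"))   # True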
When we use the shared/exclusive locking scheme, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read_item(X)
operation is performed in T.
2. A transaction T must issue the operation write_lock(X) before any write_item(X) operation is
performed in T.
3. A transaction T must issue the operation unlock(X) after all read_item(X) and write_item(X)
operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or a
write (exclusive) lock on item X.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock or a
write (exclusive) lock on item X.
Conversion of Locks
Sometimes it is desirable to relax conditions 4 and 5 in the preceding list in order to allow lock
conversion; that is, a transaction that already holds a lock on item X is allowed under certain
conditions to convert the lock from one locked state to another. For example, it is possible for a
transaction T to issue a read_lock(X) and then later on to upgrade the lock by issuing a
write_lock(X) operation.
UPGRADING
If T is the only transaction holding a read lock on X at the time it issues the write_lock(X) operation,
the lock can be upgraded; otherwise, the transaction must wait.
DOWNGRADING
It is also possible for a transaction T to issue a write_lock(X) and then later on to downgrade the
lock by issuing a read_lock(X) operation. When upgrading and downgrading of locks is used, the lock
table must include transaction identifiers in the record structure for each lock (in the
locking_transaction(s) field) to store the information on which transactions hold locks on the item.
Note: using binary locks and read/write locks in transactions does not guarantee serializability of
schedule on its own.
A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock,
write_lock) precede the first unlock operation in the transaction. Such a transaction can be divided
into two phases:
an expanding or growing (first) phase, during which new locks on items can be acquired
but none can be released; and
a shrinking (second) phase, during which existing locks can be released but no new locks
can be acquired. If lock conversion is allowed, then upgrading of locks (from read-locked to
write-locked) must be done during the expanding phase, and downgrading of locks (from
write-locked to read-locked) must be done in the shrinking phase. Hence, a read_lock(X)
operation that downgrades an already held write lock on X can appear only in the shrinking
phase.
It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the
schedule is guaranteed to be serializable, obviating the need to test for serializability of schedules
any more. The locking mechanism, by enforcing two-phase locking rules, also enforces serializability.
T1 T2 T'1 T'2
-------------- --------------- --------------- ---------------
read_lock(Y); read_lock(X); read_lock(Y); read_lock(X);
read_item(Y); read_item(X); read_item(Y); read_item(X);
unlock(Y); unlock(X); write_lock(X); write_lock(Y);
write_lock(X); write_lock(Y); unlock(Y); unlock(X);
read_item(X); read_item(Y); read_item(X); read_item(Y);
X := X + Y; Y := X + Y; X := X + Y; Y := X + Y;
write_item(X); write_item(Y); write_item(X); write_item(Y);
unlock(X); unlock(Y); unlock(X); unlock(Y);
Two-phase locking may limit the amount of concurrency that can occur in a schedule. This is because
a transaction T may not be able to release an item X after it is through using it if T must lock an
additional item Y later on; or conversely, T must lock the additional item Y before it needs it so that it
can release X. Hence, X must remain locked by T until all items that the transaction needs to read or
write have been locked; only then can X be released by T. Meanwhile, another transaction seeking to
access X may be forced to wait, even though T is done with X; conversely, if Y is locked earlier than
it is needed, another transaction seeking to access Y is forced to wait even though T is not using Y
yet. This is the price for guaranteeing serializability of all schedules without having to check the
schedules themselves.
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some
item X, but X is locked by another transaction T ' in the set. Hence, each transaction in the set is on a
waiting queue, waiting for one of the other transactions in the set to release the lock on an item.
Deadlock prevention protocols (e.g., locking all the items in advance, ordering all the items,
using transaction timestamps, and the no-waiting algorithm). The two timestamp-based schemes,
sketched in code after this list, are:
o Wait-die: If TS(Ti) < TS(Tj), then (Ti older than Tj) Ti is allowed to wait; otherwise (Ti
younger than Tj) abort Ti (Ti dies) and restart it later with the same timestamp.
o Wound-wait: If TS(Ti) < TS(Tj), then (Ti older than Tj) abort Tj (Ti wounds Tj) and
restart it later with the same timestamp; otherwise (Ti younger than Tj) Ti is allowed to
wait.
Deadlock detection, where the system checks if a state of deadlock actually exists.
Another problem that may occur when we use locking is starvation, which occurs when a
transaction cannot proceed for an indefinite period of time while other transactions in the
system continue normally.
Solutions for starvation include first-come-first-served waiting queues and priority schemes.
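As promised above, here is a minimal sketch of the two timestamp-based prevention rules, assuming that smaller timestamps mean older transactions; the function names and return strings are illustrative only.

# Sketch of the wait-die and wound-wait decisions.

def wait_die(ts_requester, ts_holder):
    """Ti (requester) asks for an item locked by Tj (holder)."""
    if ts_requester < ts_holder:
        return "wait"                  # the older transaction is allowed to wait
    return "abort requester (die)"     # the younger transaction dies and restarts later

def wound_wait(ts_requester, ts_holder):
    if ts_requester < ts_holder:
        return "abort holder (wound)"  # the older transaction wounds the younger holder
    return "wait"                      # the younger transaction waits for the older one

# Ti has timestamp 5 (older), Tj has timestamp 9 (younger).
print(wait_die(5, 9))     # wait
print(wait_die(9, 5))     # abort requester (die)
print(wound_wait(5, 9))   # abort holder (wound)
print(wound_wait(9, 5))   # wait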
A timestamp is a unique identifier created by the DBMS to identify a transaction (it can be thought of
as the transaction start time). The timestamp of transaction T is denoted TS(T). Concurrency control
techniques based on timestamp ordering do not use locks; hence, deadlocks cannot occur.
The idea for this scheme is to order the transactions based on their timestamps. A schedule in which
the transactions participate is then serializable, and the equivalent serial schedule has the
transactions in order of their timestamp values. This is called timestamp ordering (TO).
In timestamp ordering, however, the schedule is equivalent to the particular serial order
corresponding to the order of the transaction timestamps. The algorithm must ensure that, for each
item accessed by conflicting operations in the schedule, the order in which the item is accessed does
not violate the serializability order. To do this, the algorithm associates with each database item X
two timestamp (TS) values:
1. Read_TS(X): The read timestamp of item X; this is the largest timestamp among all the
timestamps of transactions that have successfully read item X—that is, read_TS(X) = TS(T),
where T is the youngest transaction that has read X successfully.
2. Write_TS(X): The write timestamp of item X; this is the largest of all the timestamps of
transactions that have successfully written item X—that is, write_TS(X) = TS(T), where T is
the youngest transaction that has written X successfully.
Whenever some transaction T tries to issue a read_item(X) or a write_item(X) operation, the basic
TO algorithm compares the timestamp of T with read_TS(X) and write_TS(X) to ensure that the
timestamp order of transaction execution is not violated. If this order is violated, then transaction T is
aborted and resubmitted to the system as a new transaction with a new timestamp. If T is aborted
and rolled back, any transaction T1 that may have used a value written by T must also be rolled back.
Similarly, any transaction T2 that may have used a value written by T1 must also be rolled back, and
so on. This effect is known as cascading rollback and is one of the problems associated with basic
TO, since the schedules produced are not recoverable. An additional protocol must be enforced to
ensure that the schedules are recoverable, cascadeless, or strict. We first describe the basic TO
algorithm here. The concurrency control algorithm must check whether conflicting operations violate
the timestamp ordering in the following two cases:
1. When transaction T issues a write_item(X) operation: if read_TS(X) > TS(T) or write_TS(X) > TS(T), then abort and roll back T and reject the operation; otherwise, execute the write and set write_TS(X) to TS(T).
2. When transaction T issues a read_item(X) operation: if write_TS(X) > TS(T), then abort and roll back T and reject the operation; otherwise, execute the read and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
Hence, whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect
order, it rejects the later of the two operations by aborting the transaction that issued it. The
schedules produced by basic TO are hence guaranteed to be conflict serializable, like the 2PL
protocol. However, some schedules are possible under each protocol that are not allowed under the
other. Hence, neither protocol allows all possible serializable schedules. As mentioned earlier,
deadlock does not occur with timestamp ordering. However, cyclic restart (and hence starvation) may
occur if a transaction is continually aborted and restarted
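The two checks above can be written down in a few lines. The following Python sketch is an illustration only: the dictionaries standing in for read_TS and write_TS are assumptions, and the actual abort/restart of a transaction is left out.

# Minimal sketch of the basic timestamp ordering (TO) checks.

read_TS = {}    # item -> largest TS of a transaction that has read it
write_TS = {}   # item -> largest TS of a transaction that has written it

def basic_to_read(ts, item):
    if write_TS.get(item, 0) > ts:
        return "abort"                               # T would read a value written "in its future"
    read_TS[item] = max(read_TS.get(item, 0), ts)    # record the read
    return "ok"

def basic_to_write(ts, item):
    if read_TS.get(item, 0) > ts or write_TS.get(item, 0) > ts:
        return "abort"                               # a younger transaction already read or wrote X
    write_TS[item] = ts
    return "ok"

print(basic_to_write(5, "X"))   # ok    - first write of X by the transaction with TS 5
print(basic_to_read(8, "X"))    # ok    - read_TS(X) becomes 8
print(basic_to_write(6, "X"))   # abort - a younger reader (TS 8) has already read X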
In multiversion concurrency control based on timestamp ordering, several versions X1, X2, ..., Xn of
each data item X are maintained. For each version, the value of version Xi and the following two
timestamps are kept:
1. read_TS(Xi): The read timestamp of Xi is the largest of all the timestamps of transactions
that have successfully read version Xi.
2. write_TS(Xi): The write timestamp of Xi is the timestamp of the transaction that wrote
the value of version Xi.
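Under the usual multiversion timestamp-ordering read rule, a transaction reads the version of X whose write timestamp is the largest one not exceeding the transaction's own timestamp. A minimal sketch follows, assuming each version is represented as a small dictionary; the sample values are invented.

# Sketch of the multiversion read rule.

versions_of_X = [
    {"value": 10, "write_TS": 2, "read_TS": 4},
    {"value": 25, "write_TS": 7, "read_TS": 7},
    {"value": 40, "write_TS": 12, "read_TS": 12},
]

def mv_read(ts, versions):
    candidates = [v for v in versions if v["write_TS"] <= ts]
    chosen = max(candidates, key=lambda v: v["write_TS"])
    chosen["read_TS"] = max(chosen["read_TS"], ts)   # update the version's read timestamp
    return chosen["value"]

print(mv_read(9, versions_of_X))   # 25 - the version written at TS 7 is the latest one <= 9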
In optimistic (validation-based) concurrency control, a transaction goes through three phases:
1. Read phase: A transaction can read values of committed data items from the database.
However, updates are applied only to local copies (versions) of the data items kept in the
transaction workspace.
2. Validation phase: Checking is performed to ensure that serializability will not be violated if
the transaction updates are applied to the database.
3. Write phase: If the validation phase is successful, the transaction updates are applied to the
database; otherwise, the updates are discarded and the transaction is restarted.
The idea behind optimistic concurrency control is to do all the checks at once; hence, transaction
execution proceeds with a minimum of overhead until the validation phase is reached. If there is little
interference among transactions, most will be validated successfully. However, if there is much
interference, many transactions that execute to completion will have their results discarded and must
be restarted later. Under these circumstances, optimistic techniques do not work well. The
techniques are called "optimistic" because they assume that little interference will occur and hence
that there is no need to do checking during transaction execution.
The optimistic protocol we describe uses transaction timestamps and also requires that the write_sets
and read_sets of the transactions be kept by the system. In addition, start and end times for some of
the three phases need to be kept for each transaction. Recall that the write_set of a transaction is
the set of items it writes, and the read_set is the set of items it reads. In the validation phase for
transaction Ti, the protocol checks that Ti does not interfere with any committed transactions or with
any other transactions currently in their validation phase. The validation phase for Ti checks that, for
each such transaction Tj that is either committed or is in its validation phase, one of the following
conditions holds:
1. Transaction Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items
in common with the write_set of Tj.
3. Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj
completes its read phase before Ti completes its read phase.
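A sketch of this validation test in Python follows. The dictionary-based transaction records and the phase-boundary fields (read_start, read_end, write_start, write_end) are illustrative assumptions about how the timestamps might be recorded, not a concrete DBMS structure.

# Sketch of the validation test for optimistic concurrency control.

def validate(ti, committed_or_validating):
    """Return True if Ti passes validation against every such Tj."""
    for tj in committed_or_validating:
        # Condition 1: Tj finished its write phase before Ti started its read phase.
        if tj["write_end"] < ti["read_start"]:
            continue
        # Condition 2: Ti writes after Tj, and Ti read nothing that Tj wrote.
        if tj["write_end"] < ti["write_start"] and not (ti["read_set"] & tj["write_set"]):
            continue
        # Condition 3: neither Ti's read set nor its write set overlaps Tj's write set,
        # and Tj finished its read phase before Ti finishes its read phase.
        if (not ((ti["read_set"] | ti["write_set"]) & tj["write_set"])
                and tj["read_end"] < ti["read_end"]):
            continue
        return False       # none of the three conditions holds: Ti must be restarted
    return True

ti = {"read_set": {"X"}, "write_set": {"Y"},
      "read_start": 10, "read_end": 14, "write_start": 15}
tj = {"read_set": {"X"}, "write_set": {"X"},
      "read_end": 8, "write_end": 9}
print(validate(ti, [tj]))   # True - Tj completed its write phase before Ti started reading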
Recovery
Failure Classification
Transaction failure :
Logical errors: transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to an
error condition (e.g., deadlock)
System crash
a power failure or other hardware or software failure causes the system to crash.
Fail-stop assumption: non-volatile storage contents are assumed to not have been
corrupted by system crash
Disk failure
a head crash or similar disk failure destroys all or part of disk storage
Recovery Algorithms
Recovery algorithms are techniques to restore the database contents to a consistent state that
preserves atomicity, consistency and durability after a failure.
Storage Structure
Volatile storage:
does not survive system crashes
examples: main memory, cache memory
Nonvolatile storage:
survives system crashes
examples: disk, tape, flash memory,
non-volatile (battery backed up) RAM
Stable storage:
a theoretical form of storage that survives all failures
approximated by maintaining multiple copies on distinct nonvolatile media
Data Access
A transaction transfers data items between the system buffer blocks and its private work area
using the following operations:
read(X)
If the block BX in which X resides is not in memory, issue input(BX)
assign to the local variable xi the value of X from the buffer block
write(X)
If the block BX in which X resides is not in memory, issue input(BX)
assign the value of xi to X in the buffer block.
We assume that no data item spans two or more blocks.
Transactions
Perform read(X) while accessing X for the first time;
All subsequent accesses are to the local copy xi.
After last access, transaction executes write(X) if updated.
input(BX) and output(BX) are not issued by individual transactions. The system (buffer
manager) is responsible for these operations.
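As a rough illustration of these operations, the following Python sketch models the buffer, the disk and a transaction's work area with plain dictionaries. The input_block/output_block helpers and the one-item-per-block layout are simplifying assumptions, not how a real buffer manager is implemented.

# Sketch of read(X)/write(X) against a buffer manager.

disk = {"BX": {"X": 100}, "BY": {"Y": 200}}   # blocks on nonvolatile storage
buffer = {}                                    # blocks currently in main memory
workarea = {}                                  # transaction's private local copies

def input_block(b):
    buffer[b] = dict(disk[b])        # bring the block into the buffer

def output_block(b):
    disk[b] = dict(buffer[b])        # write the buffer block back to disk

def read(item, block):
    if block not in buffer:          # issue input(B) if the block is not in memory
        input_block(block)
    workarea[item] = buffer[block][item]

def write(item, block):
    if block not in buffer:
        input_block(block)
    buffer[block][item] = workarea[item]   # update the buffer copy; output happens later

read("X", "BX")
workarea["X"] += 50
write("X", "BX")
print(buffer["BX"]["X"], disk["BX"]["X"])   # 150 100 - the disk changes only after output(BX)
output_block("BX")
print(disk["BX"]["X"])                      # 150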
[Figure: data is transferred between disk blocks (A, B), buffer blocks in main memory, and the
transaction's private work area (local copies x1, x2, y1) via input(B)/output(B) and read(X)/write(Y)
operations.]
Modifying the database without ensuring that the transaction will commit may leave
the database in an inconsistent state.
A failure may occur after some but not all modifications have been made.
To ensure atomicity despite failures
keep (maintain) certain information describing the modifications on stable storage
without modifying the database itself.
Log-based recovery
Shadow-paging
Log-Based Recovery
A log is a sequence of log records that describe update activities on the database.
Log is kept on stable storage.
Assume that transactions execute serially
Assume that log records are written directly to stable storage (they are not
buffered)
Logging
When transaction Ti starts
Write a log record <Ti start>
Before Ti executes write(X)
Write a log record <Ti, X, V1, V2>, where V1 is the old value of X and V2 is the new value
Deferred Database Modification
Records all modifications in the log, but defers all the database writes until after partial commit.
Transaction start: write <Ti start> record to log.
A write(X) operation results in
a log record <Ti, X, V> being written, where V is the new value for X (old value is
not needed for this scheme)
The write is not performed on X at this time, but is deferred.
When Ti partially commits, <Ti commit> is written to the log (Ti enters commit state)
Finally, the log records are read and used to actually execute the previously deferred
writes.
a transaction needs to be redone if and only if both <Ti start> and<Ti commit> are in
the log.
Redoing a transaction Ti (redo Ti) sets the value of all data items updated by the
transaction to the new values.
Example (T0 executes before T1):
T0: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B)
T1: read(C); C := C - 100; write(C)
Log under deferred modification:
<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
<T1 start>
<T1, C, 600>
<T1 commit>
The deferred writes are then carried out as block outputs: output(BA), output(BB), output(BC).
undo(Ti) restores the value of all data items updated by Ti to their old values, going
backwards from the last log record for Ti
redo(Ti) sets the value of all data items updated by Ti to the new values, going
forward from the first log record for Ti
Undo Ti if the log contains the record <Ti start>, but does not contain the record <Ti
commit>.
Redo Ti if the log contains both the record <Ti start> and the record <Ti commit>.
(Undo operations are performed before redo operations)
Depending on where in the log the crash occurs, recovery performs:
(a) undo(T1): C is restored to 700.
(b) undo(T1) and redo(T0): C is restored to 700, and then A and B are set to 950 and 2050.
(c) redo(T0) and redo(T1): A and B are set to 950 and 2050, and C is set to 600.
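The following Python sketch illustrates the undo-then-redo idea on a toy log. The tuple-based log format (with both old and new values, as in the <Ti, X, V1, V2> records above) and the in-memory "database" dictionary are assumptions made for illustration, not actual on-disk structures.

# Sketch of a simple recovery pass over a log.

log = [
    ("start", "T0"), ("update", "T0", "A", 1000, 950),
    ("update", "T0", "B", 2000, 2050), ("commit", "T0"),
    ("start", "T1"), ("update", "T1", "C", 700, 600),
    # crash before <T1 commit>
]

database = {"A": 950, "B": 2050, "C": 600}   # state on disk at the time of the crash

committed = {rec[1] for rec in log if rec[0] == "commit"}

# Undo pass: scan the log backwards, restoring old values of uncommitted transactions.
for rec in reversed(log):
    if rec[0] == "update" and rec[1] not in committed:
        _, txn, item, old, new = rec
        database[item] = old

# Redo pass: scan the log forwards, reapplying new values of committed transactions.
for rec in log:
    if rec[0] == "update" and rec[1] in committed:
        _, txn, item, old, new = rec
        database[item] = new

print(database)   # {'A': 950, 'B': 2050, 'C': 700} - T1 is undone, T0 is redone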
Checkpoints
To avoid scanning the entire log during recovery, the system periodically performs a checkpoint: all
log records and modified buffer blocks currently in main memory are output to stable storage, and a
<checkpoint> record is written to the log.
Example of Checkpoints
[Figure: a checkpoint is taken at time Tc and the system fails at time Tf; transactions T1, T2, T3 and
T4 run around these times. A transaction such as T1 that committed before the checkpoint can be
ignored during recovery, transactions that committed between Tc and Tf are redone, and a
transaction still active at the time of failure (such as T4) is undone.]
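A rough sketch of how a checkpoint limits recovery work is shown below; the numeric times and the list of transactions are invented purely to mirror the figure described above.

# Sketch: classify transactions for recovery relative to a checkpoint.

checkpoint_time = 50      # Tc
failure_time = 90         # Tf

transactions = [
    {"name": "T1", "commit_time": 40},    # committed before the checkpoint
    {"name": "T2", "commit_time": 70},    # committed after Tc, before the crash
    {"name": "T3", "commit_time": 85},
    {"name": "T4", "commit_time": None},  # still active at the time of the crash
]

for t in transactions:
    if t["commit_time"] is not None and t["commit_time"] < checkpoint_time:
        action = "ignore (already reflected on disk at the checkpoint)"
    elif t["commit_time"] is not None:
        action = "redo"
    else:
        action = "undo"
    print(t["name"], "->", action)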
An object-relational database management system (ORDBMS) is the third type of database system in
common use today. ORDBMSs are systems that “attempt to extend relational database systems with the functionality necessary to
support a broader class of applications and, in many ways, provide a bridge between the relational
and object-oriented paradigms.”
ORDBMS was created to handle new types of data such as audio, video, and image files that
relational databases were not equipped to handle. In addition, its development was the result of
increased usage of object-oriented programming languages, and a large mismatch between these
and the DBMS software.
One advantage of ORDBMS is that it allows organizations to continue using their existing systems,
without having to make major changes. A second advantage is that it allows users and programmers
to start using object-oriented systems in parallel.
There are challenges in implementing an ORDBMS. The first is storage and access methods. The
second is query processing, and the third is query optimization.
One growing trend is enterprise resource planning (ERP) and management resource planning (MRP)
packages, which add another layer of application-oriented features on top of a DBMS. Widely used packages come
from Baan, Oracle, SAP, and Siebel. These programs each identify a set of common tasks
encountered by a large number of organizations and provide a general application layer to carry out
these tasks.
More importantly, DBMS have advanced into the Internet and Web Age. Stored data is widely being
accessed through a Web browser. Today, queries are being generated through Web-accessible forms
and answers are being formatted using a mark-up language such as HTML. In addition, many
vendors and distributors are adding features to their DBMS aimed at making it better equipped for
Internet usage.
In summary, relational and object-oriented database systems each have certain strengths as well as
certain weaknesses. In general, the weakness of one type of system tends to be strength of the
other.
Continuous Innovation
Although data mining is a relatively new term, the technology is not. Companies have used powerful computers
to sift through volumes of supermarket scanner data and analyze market research reports for years. However,
continuous innovations in computer processing power, disk storage, and statistical software are dramatically
increasing the accuracy of analysis while driving down the cost.
Example
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local
buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended
to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on
Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased
the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered
information in various ways to increase revenue. For example, they could move the beer display closer to the
diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.
Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are
accumulating vast and growing amounts of data in different formats and different databases. This includes:
operational or transactional data such as, sales, cost, inventory, payroll, and accounting
nonoperational data, such as industry sales, forecast data, and macro economic data
meta data - data about the data itself, such as logical database design or data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis
of retail point of sale transaction data can yield information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary
information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of
consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible
to promotional efforts.
Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling
organizations to integrate their various databases into data warehouses. Data warehousing is defined as a
process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new
term although the concept itself has been around for years. Data warehousing represents an ideal vision of
maintaining a central repository of all organizational data. Centralization of data is needed to maximize user
access and analysis. Dramatic technological advances are making this vision a reality for many companies. And,
equally dramatic advances in data analysis software are allowing users to access this data freely. The data
analysis software is what supports data mining.
Data mining is primarily used today by companies with a strong consumer focus - retail, financial,
communication, and marketing organizations. It enables these companies to determine relationships among
"internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic
indicators, competition, and customer demographics. And, it enables them to determine the impact on sales,
customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information
to view detail transactional data.
With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions
based on an individual's purchase history. By mining demographic data from comment or warranty cards, the
retailer could develop products and promotions to appeal to specific customer segments.
WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-
sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5
terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products
and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display
level. They use this information to manage local store inventory and identify new merchandising opportunities.
In 1995, WalMart computers processed over 1 million complex data queries.
The National Basketball Association (NBA) is exploring a data mining application that can be used in
conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements
of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of
the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that
when Mark Price played the Guard position, John Williams attempted four jump shots and made each one!
Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably
from the average shooting percentage of 49.30% for the Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump
shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage.
Those clips show a very successful pick-and-roll play in which Price draws the Knick's defense and then finds
Williams for an open jump shot.
While large-scale information technology has been evolving separate transaction and analytical systems, data
mining provides the link between the two. Data mining software analyzes relationships and patterns in stored
transaction data based on open-ended user queries. Several types of analytical software are available: statistical,
machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain
could mine customer purchase data to determine when customers visit and what they typically order.
This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For
example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of
associative mining (a short support/confidence sketch follows this list).
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor
equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.
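As promised above, here is a minimal sketch of support and confidence for an association rule such as "diapers => beer"; the tiny basket data are invented for illustration only.

# Support and confidence of an association rule over toy market baskets.

baskets = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "milk"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"diapers", "beer"}))       # 0.5  - half of the baskets contain both items
print(confidence({"diapers"}, {"beer"}))  # 0.666... - 2 of the 3 diaper baskets also contain beer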
The first step is typically to extract, transform, and load transaction data onto the data warehouse system; the data are then analyzed using techniques such as the following:
Artificial neural networks: Non-linear predictive models that learn through training and resemble
biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation,
and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules
for the classification of a dataset. Specific decision tree methods include Classification and Regression
Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are
decision tree techniques used for classification of a dataset. They provide a set of rules that you can
apply to a new (unclassified) dataset to predict which records will have a given outcome. CART
segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create
multi-way splits. CART typically requires less data preparation than CHAID.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination
of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called
the k-nearest neighbor technique; a small sketch follows this list.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
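A small sketch of the k-nearest neighbor idea follows; the toy records, the two numeric attributes, and the use of Euclidean distance are all assumptions made for illustration.

# Classify a new record by majority vote among its k nearest historical records.

from collections import Counter
from math import dist

historical = [
    ((25, 30000), "low spender"),
    ((27, 32000), "low spender"),
    ((45, 90000), "high spender"),
    ((50, 95000), "high spender"),
]

def knn_classify(record, k=3):
    neighbours = sorted(historical, key=lambda pair: dist(record, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((48, 88000)))   # high spender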
Today, data mining applications are available on all size systems for mainframe, client/server, and PC
platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a
terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11
terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical
technological drivers:
Size of the database: the more data being processed and maintained, the more powerful the system
required.
Query complexity: the more complex the queries and the greater the number of queries being
processed, the more powerful the system required.
Data Mining is the process of extracting knowledge hidden from large volumes of raw data.
The importance of collecting data that reflect your business or scientific activities to achieve competitive
advantage is widely recognized now. Powerful systems for collecting data and managing it in large databases
are in place in all large and mid-range companies. However, the bottleneck of turning this data into your success
is the difficulty of extracting knowledge about the system you study from the collected data.
Human analysts with no special tools can no longer make sense of enormous volumes of data that require
processing in order to make informed business decisions. Data mining automates the process of finding
relationships and patterns in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.
Questions like these can probably be answered if the information hidden among megabytes of data in
your database can be found explicitly and utilized. Modeling the investigated system and discovering the
relations that connect variables in a database are the subject of data mining.
Modern computer data mining systems learn from the previous history of the investigated system,
formulating and testing hypotheses about the rules that this system obeys. When concise and valuable
knowledge about the system of interest has been discovered, it can and should be incorporated into a
decision support system that helps the manager make wise and informed business decisions.
Data might be one of the most valuable assets of your corporation - but only if you know how to reveal valuable
knowledge hidden in raw data. Data mining allows you to extract diamonds of knowledge from your historical
data and predict outcomes of future situations. It will help you optimize your business decisions, increase the
value of each customer and communication, and improve customer satisfaction with your services.
The data that require analysis differ for companies in different industries.
In all of these cases data mining can help you reveal knowledge hidden in data and turn this knowledge into a
crucial competitive advantage. Today increasingly more companies acknowledge the value of this new
opportunity and turn to Megaputer for leading edge data mining tools and solutions that help optimize their
operations and increase their bottom line.
The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous
volume of existing and newly appearing data that require processing. The amount of data accumulated each day
by various business, scientific, and governmental organizations around the world is daunting. According to
information from the GTE research center, scientific organizations alone store about 1 TB (a terabyte!) of new
information each day, and it is well known that the academic world is far from the leading producer of new data.
It becomes impossible for human analysts to cope with such overwhelming amounts of data.
Two other problems that surface when human analysts process data are the inadequacy of the human brain
when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis.
A human expert is always a hostage of his or her previous experience of investigating other systems. Sometimes
this helps, sometimes it hurts, but it is almost impossible to escape this bias.
Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much lower cost than
hiring an army of highly trained (and paid) professional statisticians. While data mining does not eliminate
human participation in solving the task completely, it significantly simplifies the job and allows an analyst who
is not a professional in statistics and programming to manage the process of extracting knowledge from data.
Predicting
A task of learning a pattern from examples and using the developed model to predict future values of the target
variable.
PolyNet Predictor, Find Laws, Memory Based Reasoning, and Linear Regression
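As a tiny illustration of the predicting task, the sketch below fits a one-variable linear regression (one of the techniques named above) by ordinary least squares; the data points are invented.

# Fit y = slope*x + intercept by ordinary least squares and predict the next value.

xs = [1, 2, 3, 4, 5]          # e.g., month number
ys = [10, 12, 15, 19, 20]     # e.g., sales in that month

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
numer = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
denom = sum((x - mean_x) ** 2 for x in xs)
slope = numer / denom
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))   # fitted model parameters
print(round(slope * 6 + intercept, 2))        # predicted value for month 6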
Classification
A task of finding a function that maps records into one of several discrete classes.
Detection of relations
A task of searching for the most influential independent variables for a selected target variable.
Explicit modeling
Clustering
A task of identifying groups of records that are similar to one another but different from the rest of the
data. Often, the variables providing the best clustering should be identified as well.
Deviation Detection
A task of determining the most significant changes in some key measures of data from previous or expected
values.
It is instructive to discuss the various existing approaches to data mining while keeping a few vital criteria in mind.
To build a bridge from more traditional methods for data analysis to data mining methods we start by discussing
some more traditional approaches:
Neural Networks
Evolutionary Programming
Memory Based Reasoning
Decision Trees
Genetic Algorithms
Nonlinear Regression Methods
Note that this section is meant as a supplement to standard texts about data warehousing. It focuses on
Oracle-specific material and does not reproduce in detail material of a general nature. Two standard texts are:
The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996)
Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)
A data warehouse is a relational database that is designed for query and analysis rather than for transaction
processing. It usually contains historical data derived from transaction data, but it can include data from other
sources. It separates analysis workload from transaction workload and enables an organization to consolidate
data from several sources.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set
forth by William Inmon:
Subject Oriented
Integrated
Nonvolatile
Time Variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales
data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like
"Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter,
sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a
consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of
measure. When they achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the
purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to
online transaction processing (OLTP) systems, where performance requirements demand that historical data be
moved to an archive. Figure 1-1 illustrates key differences between an OLTP system and a data warehouse.
One major difference between the types of system is that data warehouses are not usually in third normal form
(3NF), a type of data normalization common in OLTP environments.
Data warehouses and OLTP systems have very different requirements. Here are some examples of differences
between typical data warehouses and OLTP systems:
Workload
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of
your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide
variety of possible query operations.
OLTP systems support only predefined operations. Your applications might be specifically tuned or
designed to support only these operations.
Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk
data modification techniques. The end users of a data warehouse do not directly update the data
warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The
OLTP database is always up to date, and reflects the current state of each business transaction.
Schema design
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to
optimize query performance.
OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to
guarantee data consistency.
Typical operations
A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales
for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order
for this customer."
Historical data
Data warehouses usually store many months or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP system stores only
historical data as needed to successfully meet the requirements of the current transaction.
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three
common architectures are:
Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from
several source systems through the data warehouse.
In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of
data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations
in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary
in Oracle is called a materialized view.
In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can
do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies
building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.
Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's
architecture for different groups within your organization. You can do this by adding data marts, which are
systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales,
and inventories are separated. In this example, a financial analyst might want to analyze historical data for
purchases and sales.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
Perhaps the most important concept that has come out of the Data Warehouse movement is the recognition that there are two
fundamentally different types of information systems in all organizations: operational systems and informational systems.
"Operational systems" are just what their name implies; they are the systems that help us run the enterprise operation day-to-day.
These are the backbone systems of any enterprise, our "order entry", "inventory", "manufacturing", "payroll" and "accounting"
systems. Because of their importance to the organization, operational systems were almost always the first parts of the enterprise to be
computerized. Over the years, these operational systems have been extended and rewritten, enhanced and maintained to the point that
they are completely integrated into the day-to-day operation of the organization.
On the other hand, there are other functions that go on within the enterprise that have to do with planning, forecasting and managing
the organization. These functions are also critical to the survival of the organization, especially in our current fast-paced world.
Functions like "marketing planning", "engineering planning" and "financial analysis" also require information systems to support
them. But these functions are different from operational ones, and the types of systems and information required are also different.
The knowledge-based functions are informational systems.
"Informational systems" have to do with analyzing data and making decisions, often major decisions, about how the enterprise will
operate, now and in the future. And not only do informational systems have a different focus from operational ones, they often have a
different scope. Where operational data needs are normally focused upon a single area, informational data needs often span a number
of different areas and need large amounts of related operational data.
In the last few years, Data Warehousing has grown rapidly from a set of related ideas into an architecture for data delivery for
enterprise end-user computing.
One of the reasons that data warehousing has taken such a long time to develop is that it is actually a very comprehensive technology.
In fact, data warehousing can be best represented as an enterprise-wide framework for managing informational data within the
organization. In order to understand how all the components involved in a data warehousing strategy are related, it is essential to have
a Data Warehouse Architecture.
A Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and
presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts:
Operational systems process data to support critical operational needs. In order to do that, operational databases have been historically
created to provide an efficient processing structure for a relatively small number of well-defined business transactions. However,
because of the limited focus of operational systems, the databases designed to support operational systems have difficulty accessing
the data for other management or informational purposes. This difficulty in accessing operational data is amplified by the fact that
many operational systems are often 10 to 15 years old. The age of some of these systems means that the data access technology
available to obtain operational data is itself dated.
Clearly, the goal of data warehousing is to free the information that is locked up in the operational databases and to mix it with
information from other, often external, sources of data. Increasingly, large organizations are acquiring additional data from outside
databases. This information includes demographic, econometric, competitive and purchasing trends. The so-called "information
superhighway" is providing access to more data resources every day.
The Information Access layer of the Data Warehouse Architecture is the layer that the end-user deals with directly. In particular, it
represents the tools that the end-user normally uses day to day, e.g., Excel, Lotus 1-2-3, Focus, Access, SAS, etc. This layer also
includes the hardware and software involved in displaying and printing reports, spreadsheets, graphs and charts for analysis and
presentation. Over the past two decades, the Information Access layer has expanded enormously, especially as end-users have moved
to PCs and PC/LANs.
Today, more and more sophisticated tools exist on the desktop for manipulating, analyzing and presenting data; however, there are
significant problems in making the raw data contained in operational systems available easily and seamlessly to end-user tools. One of
the keys to this is to find a common data language that can be used throughout the enterprise.
The Data Access Layer of the Data Warehouse Architecture is involved with allowing the Information Access Layer to talk to the
Operational Layer. In the network world today, the common data language that has emerged is SQL. Originally, SQL was developed
by IBM as a query language, but over the last twenty years has become the de facto standard for data interchange.
One of the key breakthroughs of the last few years has been the development of a series of data access "filters" such as EDA/SQL that
make it possible for SQL to access nearly all DBMSs and data file systems, relational or nonrelational. These filters make it possible
for state-of-the-art Information Access tools to access data stored on database management systems that are twenty years old.
The Data Access Layer then is responsible for interfacing between Information Access tools and Operational Databases. In some
cases, this is all that certain end-users need. However, in general, organizations are developing a much more sophisticated scheme to
support Data Warehousing.
In order to provide for universal data access, it is absolutely necessary to maintain some form of data directory or repository of meta-
data information. Meta-data is the data about data within the enterprise. Record descriptions in a COBOL program are meta-data. So
are DIMENSION statements in a FORTRAN program, or SQL Create statements. The information in an ERA diagram is also meta-
data.
In order to have a fully functional warehouse, it is necessary to have a variety of meta-data available, data about the end-user views of
data and data about the operational databases. Ideally, end-users should be able to access data from the data warehouse (or from the
operational databases) without having to know where that data resides or the form in which it is stored.
The Process Management Layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data
warehouse and data directory information. The Process Management Layer can be thought of as the scheduler or the high-level job
control for the many processes (procedures) that must occur to keep the Data Warehouse up-to-date.
The Application Message Layer has to do with transporting information around the enterprise computing network. Application
Messaging is also referred to as "middleware", but it can involve more than just networking protocols. Application Messaging for
example can be used to isolate applications, operational or informational, from the exact data format on either end. Application
Messaging can also be used to collect transactions or messages and deliver them to a certain location at a certain time. Application
Messaging is, in effect, the transport system underlying the Data Warehouse.
The (core) Data Warehouse is where the actual data used primarily for informational purposes resides. In some cases, one can think of
the Data Warehouse simply as a logical or virtual view of data; in many instances, the data warehouse may not actually involve storing
data at all.
In a Physical Data Warehouse, copies, in some cases many copies, of operational and or external data are actually stored in a form
that is easy to access and is highly flexible. Increasingly, Data Warehouses are stored on client/server platforms, but they are often
stored on main frames as well.
The final component of the Data Warehouse Architecture is Data Staging. Data Staging is also called copy management or replication
management, but in fact, it includes all of the processes necessary to select, edit, summarize, combine and load data warehouse and
information access data from operational and/or external databases.
Data Staging often involves complex programming, but increasingly data warehousing tools are being created that help in this process.
Data Staging may also involve data quality analysis programs and filters that identify patterns and data structures within existing
operational data.
Figure 2 shows a two-dimensional grid for analyzing the basic options, with the horizontal dimension indicating the scope of the
warehouse, and the vertical dimension showing the amount of redundant data that must be stored and maintained.
The scope of a data warehouse may be as broad as all the informational data for the entire enterprise from the beginning of time, or it
may be as narrow as a personal data warehouse for a single manager for a single year. There is nothing that makes one of these more
of a data warehouse than another.
In practice, the broader the scope, the more value the data warehouse is to the enterprise and the more expensive and time consuming
it is to create and maintain. As a consequence, most organizations seem to start out with functional, departmental or divisional data
warehouses and then expand them as users provide feedback.
There are essentially three levels of data redundancy that enterprises should think about when considering their data warehouse
options:
There is no one best approach. Each option fits a specific set of requirements, and a data warehousing strategy may ultimately include
all three options.
A virtual or point-to-point data warehousing strategy means that end-users are allowed to get at operational databases directly using
whatever tools are enabled to the "data access network". This approach provides the ultimate in flexibility as well as the minimum
amount of redundant data that must be loaded and maintained. This approach can also put the largest unplanned query load on
operational systems.
As we will see, virtual warehousing is often an initial strategy in organizations where there is a broad but largely undefined need to get
at operational data from a relatively large class of end-users and where the likely frequency of requests is low. Virtual data warehouses
often provide a starting point for organizations to learn what end-users are really looking for. Figure 3 below shows a Virtual Data
Warehouse within the Data Warehouse Architecture.
Central Data Warehouses are what most people think of when they first are introduced to the concept of data warehouse. The central
data warehouse is a single physical database that contains all of the data for a specific functional area, department, division, or
enterprise. Central Data Warehouses are often selected where there is a common need for informational data and there are large
numbers of end-users already connected to a central computer or network. A Central Data Warehouse may contain data for any
specific period of time. Usually, Central Data Warehouses contain data from multiple operational systems.
Central Data Warehouses are real. The data stored in the data warehouse is accessible from one place and must be loaded and
maintained on a regular basis. Normally, data warehouses are built around advanced RDBMs or some form of multi-dimensional
informational database server.
Distributed Data Warehouses are just what their name implies. They are data warehouses in which certain components of the data
warehouse are distributed across a number of different physical databases. Increasingly, large organizations are pushing decision-
making down to lower and lower levels of the organization and in turn pushing the data needed for decision making down (or out) to
the LAN or local computer serving the local decision-maker.
Distributed Data Warehouses usually involve the most redundant data and, as a consequence, the most complex loading and updating
processes.
In the same sense that there are many different ways to organize a data warehouse, it is important to note that there is an
increasingly wide range of end-users as well. We tend to think in terms of three broad categories of end-users, and each of
these categories of user has its own set of requirements for data, access, flexibility and ease of use.
Developing a good data warehouse is no different from any other IT project; it requires careful planning, requirements definition,
design, prototyping and implementation. The first and most important element is a planning process that determines what kind of data
warehouse strategy the organization is going to start with.
Before developing a data warehouse, it is critical to develop a balanced data warehousing strategy that is appropriate for its needs and
its user population. Who is the audience? What is the scope? What type of data warehouse should we build? One strategy is to begin
with a virtual data warehouse that simply gives end-users direct, tool-based access to operational data, and to learn from how they use it.
A second strategy is simply to build a copy of the operational data from a single operational system and enable the data warehouse
from a series of information access tools. This strategy has the advantage of being both simple and fast. Unfortunately, if the existing
data is of poor quality and/or the access to the data has not been thought through, then this approach can create a number of significant
problems.
Ultimately, the optimal data warehousing strategy is to select a user population based on value to the enterprise and do an analysis of
their issues, questions and data access needs. Based on these needs, prototype data warehouses are built and populated so the end-
users can experiment and modify their requirements. Once there is general agreement on the needs, then the data can be acquired from
existing operational systems across the enterprise and/or from external data sources and loaded into the data warehouse. If it is
required, information access tools can also be enabled to allow end-users to have access to required data using their own favorite tools
or to allow for the creation of high-performance multi-dimensional information access systems using the core data warehouse as the
basis.
In the final analysis, there is no one approach to building a data warehouse that will fit the needs of every enterprise. Each enterprise's
needs are different as is each enterprise's context. In addition, since data warehouse technology is evolving as we learn more about
developing data warehouses, it turns out that the only practical approach to data warehousing is an evolutionary one.
The Data Warehouse Architecture in Figure 1 is simply a framework for understanding data warehousing and how the components of
data warehousing fit together. Only the most sophisticated organizations will be able to put together such an architecture the first time
out. What the Data Warehouse Architecture provides then is a kind of roadmap that can be used to design toward. Coupled with an
understanding of the options at hand, the Data Warehouse Architecture provides a useful way of determining if the organization is
moving toward a reasonable data warehousing framework.
One of the keys to data warehousing is flexibility. It is critical to keep in mind that the more successful a data warehouse strategy is,
the more end-users are going to want to add to it.
Designing data warehouses is very different from designing traditional operational systems. For one thing, data warehouse users
typically don't know nearly as much about their wants and needs as operational users. Second, designing a data warehouse often
involves thinking in terms of much broader, and more difficult to define, business concepts than does designing an operational system.
In this respect, data warehousing is quite close to Business Process Reengineering (BPR). Finally, the ideal design strategy for a data
warehouse is often outside-in as opposed to top-down.
But while data warehouse design is different from what we have been used to, it is no less important. The fact that end-users have
difficulty defining what they need does not make defining it any less necessary. In practice, data warehouse designers find that they have to
use every trick in the book to help their users "visualize" their requirements. In this respect, robust working prototypes are essential.
Data Warehouses are not magic; they take a great deal of very hard work. In many cases data warehouse projects are viewed as a
stopgap measure to get users off our backs or to provide something for nothing. But data warehouses require careful management and
marketing. A data warehouse is a good investment only if end-users actually can get at vital information faster and cheaper than they
can using current technology. As a consequence, management has to think seriously about how they want their warehouses to perform
and how they are going to get the word out to the end-user community. And management has to recognize that the maintenance of the
data warehouse structure is as critical as the maintenance of any other mission-critical application. In fact, experience has shown that
data warehouses quickly become one of the most used systems in any organization.
6) Future Developments
Data Warehousing is such a new field that it is difficult to estimate what new developments are likely to most affect it. Clearly, the
development of parallel DB servers with improved query engines is likely to be one of the most important. Parallel servers will make
it possible to access huge databases in much less time.
Another new technology is data warehouses that allow for the mixing of traditional numbers, text and multi-media. The availability of
improved tools for data visualization (business intelligence) will allow users to see things that could never be seen before.
7) Conclusion
Data Warehousing is not a new phenomenon. All large organizations already have data warehouses, but they are just not managing
them. Over the next few years, the growth of data warehousing is going to be enormous with new products and technologies coming
out frequently. In order to get the most out of this period, it is going to be important that data warehouse planners and developers have
a clear idea of what they are looking for and then choose strategies and methods that will provide them with performance today and
flexibility for tomorrow.