DBMS 2020-21 II-CSE
LECTURE NOTES
R18
B. TECH
II YEAR – I SEM (Sec- A & B)
Academic Year 2020-21
ABHAY KUMAR
ASSOCIATE PROFESSOR, DEPARTMENT OF CSE
J.B.I.E.T
Bhaskar Nagar, Yenkapally(V), Moinabad(M),
Ranga Reddy(D), Hyderabad – 500 075, Telangana, India.
J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
UGC AUTONOMOUS
Bhaskar Nagar, Moinabad (M), RR Dist, Telangana-500075
UNIT - I:
Database Systems – Database System Applications, Database System vs. File System – View of Data – Data
Abstraction – Instances and Schemas – Data Models – the ER Model – Relational Model – Other Models – Database
Languages – DDL – DML – Database Access for Application Programs – Database Users and Administrator –
Transaction Management – Database System Structure – Storage Manager – the Query Processor
ER Diagrams – Beyond ER Design: Entities, Attributes and Entity Sets – Relationships and Relationship Sets –
Additional Features of the ER Model – Conceptual Design with the ER Model
UNIT - II:
Introduction to the Relational Model
-Integrity Constraints over Relations – Enforcing Integrity Constraints – Querying Relational Data – Logical
Database Design – Introduction to Views – Destroying/Altering Tables and Views.
Relational Algebra
-Selection and Projection, Set Operations – Renaming – Joins – Division – Examples of Algebra Queries –
Relational Calculus – Tuple Relational Calculus – Domain Relational Calculus – Expressive Power of Algebra and
Calculus.
UNIT - III:
Form of Basic SQL Query
-Examples of Basic SQL Queries – Introduction to Nested Queries – Correlated Nested Queries – Set-Comparison
Operators – Aggregate Operators – NULL Values – Comparison Using NULL Values – Logical Connectives –
AND, OR and NOT – Impact on SQL Constructs – Outer Joins – Disallowing NULL Values – Complex Integrity
Constraints in SQL – Triggers and Active Databases.
Schema refinement
-Problems Caused by Redundancy – Decompositions – Problems Related to Decomposition – Reasoning about FDs
– FIRST, SECOND, THIRD Normal Forms – BCNF – Lossless Join Decomposition – Dependency-Preserving
Decomposition – Schema Refinement in Database Design – Multi-valued Dependencies – FOURTH Normal Form.
UNIT - IV:
Transaction Concept
-Transaction State – Implementation of Atomicity and Durability – Concurrent Executions – Serializability –
Recoverability – Implementation of Isolation – Testing for Serializability – Lock-Based Protocols – Timestamp-
Based Protocols – Validation-Based Protocols – Multiple Granularity.
Recovery and Atomicity
-Log-Based Recovery – Recovery with Concurrent Transactions – Buffer Management – Failure with Loss of
Nonvolatile Storage – Advanced Recovery Systems – Remote Backup Systems.
UNIT - V:
Data on External Storage
-File Organization and Indexing – Cluster Indexes, Primary and Secondary Indexes – Index Data Structures – Hash-
Based Indexing – Tree-Based Indexing – Comparison of File Organizations – Indexes and Performance Tuning –
Intuition for Tree Indexes – Indexed Sequential Access Method (ISAM) – B+ Trees: A Dynamic Index Structure.
Advanced Database Management System
Introduction to Distributed Databases – Reference Architecture, Fragmentation, Allocation, Joins
Text Books:
1. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Tata McGraw-Hill,
3rd Edition
2. Database System Concepts, Silberschatz, Korth, McGraw-Hill, 5th Edition.
Reference Books:
1. Database Systems: Design, Implementation, and Management, Peter Rob & Carlos Coronel, 7th
Edition.
2. Fundamentals of Database Systems, Elmasri, Navathe, Pearson Education.
3. Introduction to Database Systems, C.J. Date, Pearson Education.
Course outcomes:
The student will be able to:
1. Identify the underlying concepts of database technologies
2. Design a database schema for a given problem domain
3. Formulate SQL queries and integrity constraints over relations
4. Apply normalization on database for eliminating redundancy
5. Summarize transaction properties, concurrency control and recovery techniques and learn various
data storage and security mechanisms
Database Management System
UNIT-1
What is a Database?
To find out what a database is, we have to start from data, which is the basic building block of any DBMS.
Data: Facts, figures, statistics etc. having no particular meaning on their own (e.g. 1, ABC, 19).
Record: Collection of related data items, e.g. in the above example the three data items had no meaning. But
if we organize them in the following way, then they collectively represent meaningful information.
Roll Name Age
1 ABC 19
The columns of this relation are called Fields, Attributes or Domains. The rows are called Tuples
or Records.
Database: Collection of related relations. Consider the following collection of tables:
T1
Roll  Name  Age
1     ABC   19
2     DEF   22
3     XYZ   28

T2
Roll  Address
1     KOL
2     DEL
3     MUM

T3, T4 (two further tables, e.g. carrying Hostel information, not reproduced in these notes)
We now have a collection of 4 tables. They can be called a “related collection” because we can clearly find out
that there are some common attributes existing in a selected pair of tables. Because of these common attributes
we may combine the data of two or more tables together to find out the complete details of a student. Questions
like “Which hostel does the youngest student live in?” can be answered now, although
Age and Hostel attributes are in different tables.
A database in a DBMS could be viewed by lots of different people with different responsibilities.
For example, within a company there are different departments, as well as customers, who each need to see
different kinds of data. Each employee in the company will have different levels of access to the database with
their own customized front-end application.
In a database, data is organized strictly in row and column format. The rows are called Tuple or Record. The
data items within one row may belong to different data types. On the other hand, the columns are often called
Domain or Attribute. All the data items within a single attribute are of the same data type.
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access
those data. This is a collection of related data with an implicit meaning and hence is a database. The collection
of data, usually referred to as the database, contains information relevant to an enterprise. The primary goal of
a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient. By
data, we mean known facts that can be recorded and that have implicit meaning.
The management system is important because without some kind of rules and regulations it is not possible to
maintain the database: we have to select the particular attributes that should be included in a particular table,
choose the common attributes for creating relationships between tables, and decide which tables must be handled
when a new record is inserted or deleted. These issues must be resolved by having some kind of rules to follow
in order to maintain the integrity of the database.
Database systems are designed to manage large bodies of information. Management of data involves both
defining structures for storage of information and providing mechanisms for the manipulation of information. In
addition, the database system must ensure the safety of the information stored, despite system crashes or
attempts at unauthorized access. If data are to be shared among several users, the system must avoid possible
anomalous results.
Because information is so important in most organizations, computer scientists have developed a large body of
concepts and techniques for managing data.
2. Database Management System (DBMS) and Its Applications:
A database management system is a computerized record-keeping system. It is a repository or container for a
collection of computerized data files. The overall purpose of a DBMS is to allow the users to define, store, retrieve
and update the information contained in the database on demand. Information can be anything that is of
significance to an individual or organization.
Databases touch all aspects of our lives. Some of the major areas of application are as follows:
1. Banking
2. Airlines
3. Universities
4. Manufacturing and selling
5. Human resources
Enterprise Information
◦ Sales: For customer, product, and purchase information.
◦ Accounting: For payments, receipts, account balances, assets and other accounting information.
◦ Human resources: For information about employees, salaries, payroll taxes, and benefits, and for generation
of paychecks.
◦ Manufacturing: For management of the supply chain and for tracking production of items in factories,
inventories of items in warehouses and stores, and orders for items.
◦ Online retailers: For sales data noted above plus online order tracking, generation of recommendation lists, and
maintenance of online product evaluations.
Banking and Finance
◦ Banking: For customer information, accounts, loans, and banking transactions.
◦ Credit card transactions: For purchases on credit cards and generation of monthly statements.
◦ Finance: For storing information about holdings, sales, and purchases of financial instruments such as stocks
and bonds; also for storing real-time market data to enable online trading by customers and automated
trading by the firm.
• Universities: For student information, course registrations, and grades (in addition to standard enterprise
information such as human resources and accounting).
• Airlines: For reservations and schedule information. Airlines were among the first to use databases in a
geographically distributed manner.
• Telecommunication: For keeping records of calls made, generating monthly bills, maintaining balances on
prepaid calling cards, and storing information about the communication networks.
Purpose of Database Systems
Keeping organizational information in a file-processing system leads to a number of major disadvantages:
Data redundancy and inconsistency. Since different programmers create the files and application programs
over a long period, the various files are likely to have different structures and the programs may be written in
several programming languages. Moreover, the same information may be duplicated in several places (files). For
example, if a student has a double major (say, music and mathematics) the address and telephone number of
that student may appear in a file that consists of student records of students in the Music department and in a file
that consists of student records of students in the Mathematics department. This redundancy leads to higher
storage and access cost. In addition, it may lead to data inconsistency; that is, the various copies of the same
data may no longer agree. For example, a changed student address may be reflected in the Music department
records but not elsewhere in the system.
Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of all students
who live within a particular postal-code area. The clerk asks the data-processing department to generate such a
list. Because the designers of the original system did not anticipate this request, there is no application program
on hand to meet it. There is, however, an application program to generate the list of all students.
The university clerk has now two choices: either obtain the list of all students and extract the needed information
manually or ask a programmer to write the necessary application program. Both alternatives are obviously
unsatisfactory. Suppose that such a program is written, and that, several days later, the same clerk needs to trim
that list to include only those students who have taken at least 60 credit hours. As expected, a program to generate
such a list does not exist. Again, the clerk has the preceding two options, neither of which is satisfactory. The
point here is that conventional file-processing environments do not allow needed data to be retrieved in a
convenient and efficient manner. More responsive data-retrieval systems are required for general use.
Data isolation. Because data are scattered in various files, and files may be in different formats, writing new
application programs to retrieve the appropriate data is difficult.
Integrity problems. The data values stored in the database must satisfy certain types of consistency
constraints. Suppose the university maintains an account for each department, and records the balance amount
in each account. Suppose also that the university requires that the account balance of a department may never
fall below zero. Developers enforce these constraints in the system by adding appropriate code in the various
application programs. However, when new constraints are added, it is difficult to change the programs to enforce
them. The problem is compounded when constraints involve several data items from different files.
Atomicity problems. A computer system, like any other device, is subject to failure. In many applications, it is
crucial that, if a failure occurs, the data be restored to the consistent state that existed prior to the failure.
Consider a program to transfer $500 from the account balance of department A to the account balance of
department B. If a system failure occurs during the execution of the program, it is possible that the $500 was
removed from the balance of department A but was not credited to the balance of department B, resulting in an
inconsistent database state. Clearly, it is essential to database consistency that either both the credit and debit
occur, or that neither occur.
That is, the funds transfer must be atomic—it must happen in its entirety or not at all. It is difficult to ensure
atomicity in a conventional file-processing system.
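In a DBMS, by contrast, the transfer is wrapped in a transaction so that both updates happen atomically; a minimal SQL sketch, assuming a hypothetical account(dept_name, balance) table (transaction syntax varies slightly across systems):

begin transaction;
update account set balance = balance - 500 where dept_name = 'A';
update account set balance = balance + 500 where dept_name = 'B';
commit;
-- If the system fails before commit, the DBMS rolls back the partial work,
-- so the inconsistent intermediate state is never visible.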
Concurrent-access anomalies. For the sake of overall performance of the system and faster response, many
systems allow multiple users to update the data simultaneously. Indeed, today, the largest Internet retailers may
have millions of accesses per day to their data by shoppers. In such an environment, interaction of concurrent
updates is possible and may result in inconsistent data. Consider department A, with an account balance of
$10,000. If two department clerks debit the account balance (by say $500 and $100, respectively) of department
A at almost exactly the same time, the result of the concurrent executions may leave the budget in an incorrect
(or inconsistent) state. Suppose that the programs executing on behalf of each withdrawal read the old balance,
reduce that value by the amount being withdrawn, and write the result back. If the two programs run concurrently,
they may both read the value $10,000, and write back $9500 and $9900, respectively. Depending on which one
writes the value last, the account balance of department A may contain either $9500 or $9900, rather than the
correct value of $9400. To guard against this possibility, the system must maintain some form of supervision.
But supervision is difficult to provide because data may be accessed by many different application programs that
have not been coordinated previously.
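The lost update above can be written out in SQL; a sketch against the same hypothetical account table:

-- Both clerks' programs read the old balance:
select balance from account where dept_name = 'A';         -- each sees 10000
-- Each program computes the new value in application code and writes it back:
update account set balance = 9500 where dept_name = 'A';   -- clerk 1
update account set balance = 9900 where dept_name = 'A';   -- clerk 2 overwrites clerk 1
-- One remedy is a single atomic update that lets the DBMS do the arithmetic:
update account set balance = balance - 500 where dept_name = 'A';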
As another example, suppose a registration program maintains a count of students registered for a course, in
order to enforce limits on the number of students registered. When a student registers, the program reads the
current count for the courses, verifies that the count is not already at the limit, adds one to the count, and stores
the count back in the database. Suppose two students register concurrently, with the count at (say) 39. The two
program executions may both read the value 39, and both would then write back 40, leading to an incorrect
increase of only 1, even though two students successfully registered for the course and the count should be 41.
Furthermore, suppose the course registration limit was 40; in the above case both students would be able to
register, leading to a violation of the limit of 40 students.
Security problems. Not every user of the database system should be able to access all the data. For example,
in a university, payroll personnel need to see only that part of the database that has financial information. They
do not need access to information about academic records. But, since application programs are added to the file-
processing system in an ad hoc manner, enforcing such security constraints is difficult.
These difficulties, among others, prompted the development of database systems. In what follows, we shall see
the concepts and algorithms that enable database systems to solve the problems with file-processing systems.
Advantages of DBMS:
Controlling Redundancy: Data redundancy refers to the duplication of data (i.e. storing the same data multiple
times). In a database system, by having a centralized database and centralized control of data by the DBA,
unnecessary duplication of data is avoided. It also eliminates the extra time needed for processing large volumes
of data and saves storage space.
Improved Data Sharing : DBMS allows a user to share the data in any number of application
programs.
Data Integrity: Integrity means that the data in the database is accurate. Centralized control of
the data permits the administrator to define integrity constraints on the data in the
database. For example, in a customer database we can enforce a constraint that customers be
accepted only from the cities of Noida and Meerut.
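A minimal sketch of how such a constraint might be declared in SQL (the customer table and its columns are assumed for illustration):

create table customer (
    customer_id char(5),
    name        varchar(30),
    city        varchar(20) check (city in ('Noida', 'Meerut'))
);
-- An insert or update with any other city is rejected by the DBMS.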
Security: Having complete authority over the operational data enables the DBA to ensure that
the only means of access to the database is through proper channels. The DBA can define
authorization checks to be carried out whenever access to sensitive data is attempted.
Data Consistency: By eliminating data redundancy, we greatly reduce the opportunities for
inconsistency. For example, if a customer address is stored only once, we cannot have
disagreement among the stored values. Updating data values is also greatly simplified when each
value is stored in one place only. Finally, we avoid the wasted storage that results from redundant
data storage.
Efficient Data Access: In a database system, the data is managed by the DBMS and all access
to the data is through the DBMS, which provides the key to effective data processing.
Enforcement of Standards: With the centralization of data, the DBA can establish and enforce data
standards, which may include naming conventions, data quality standards, etc.
Data Independence: In a database system, the database management system provides the
interface between the application programs and the data. When changes are made to the data
representation, the metadata maintained by the DBMS is changed, but the DBMS continues to
provide the data to application programs in the previously used way. The DBMS handles the task of
transforming data wherever necessary.
Reduced Application Development and Maintenance Time: DBMS supports many important
functions that are common to many applications accessing data stored in the DBMS, which
facilitates the quick development of applications.
Disadvantages of DBMS
1) It is a bit complex. Since it supports multiple functionalities to give users the best service, the underlying
software has become complex. The designers and developers should have thorough
knowledge of the software to get the most out of it.
2) Because of its complexity and functionality, it uses a large amount of storage space. It also needs a large
amount of memory to run efficiently.
3) A DBMS works as a centralized system, i.e. all the users from many locations access one
database. Hence any failure of the DBMS will impact all the users.
4) A DBMS is generalized software, i.e. it is written to work on a wide range of systems rather than one
specific application. Hence some applications will run slowly.
4. View of Data
A database system is a collection of interrelated data and a set of programs that allow users to
access and modify these data. A major purpose of a database system is to provide users with an
abstract view of the data. That is, the system hides certain details of how the data are stored and
maintained.
Data Abstraction
For the system to be usable, it must retrieve data efficiently. The need for efficiency has led
designers to use complex data structures to represent data in the database. Since many database-
system users are not computer trained, developers hide the complexity from users through several
levels of abstraction, to simplify users’ interactions with the system:
Figure 1.2: The three levels of data abstraction, with the database stored on disk.
• Physical level (or Internal View / Schema): The lowest level of abstraction describes how the
data are actually stored. The physical level describes complex low-level data structures in detail.
• Logical level (or Conceptual View / Schema): The next-higher level of abstraction describes
what data are stored in the database, and what relationships exist among those data. The logical
level thus describes the entire database in terms of a small number of relatively simple structures.
Although implementation of the simple structures at the logical level may involve complex physical-
level structures, the user of the logical level does not need to be aware of this complexity. This is
referred to as physical data independence. Database administrators, who must decide what
information to keep in the database, use the logical level of abstraction.
• View level (or External View / Schema): The highest level of abstraction describes only part of
the entire database. Even though the logical level uses simpler structures, complexity remains
because of the variety of information stored in a large database. Many users of the database system
do not need all this information; instead, they need to access only a part of the database. The view
level of abstraction exists to simplify their interaction with the system. The system may provide
many views for the same database. Figure 1.2 shows the relationship among the three levels of
abstraction.
An analogy to the concept of data types in programming languages may clarify the distinction
among levels of abstraction. Many high-level programming languages support the notion of a
structured type. For example, we may describe a record as follows:
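(The declaration itself is not reproduced in these notes; a Pascal-style reconstruction, following the textbook's conventions, is shown below.)

type instructor = record
    ID : char(5);
    name : char(20);
    dept_name : char(20);
    salary : numeric(8,2);
end;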
This code defines a new record type called instructor with four fields. Each field has a
name and a type associated with it. A university organization may have several such
record types, including
• department, with fields dept_name, building, and budget
• course, with fields course_id, title, dept_name, and credits
• student, with fields ID, name, dept_name, and tot_cred
At the physical level, an instructor, department, or student record can be described as a block of
consecutive storage locations. The compiler hides this level of detail from programmers. Similarly,
the database system hides many of the lowest-level storage details from database programmers.
Database administrators, on the other hand, may be aware of certain details of the physical
organization of the data.
At the logical level, each such record is described by a type definition, as in the previous code
segment, and the interrelationship of these record types is defined as well. Programmers using a
programming language work at this level of abstraction. Similarly, database administrators usually
work at this level of abstraction.
Finally, at the view level, computer users see a set of application programs that hide details of the
data types. At the view level, several views of the database are defined, and a database user sees
some or all of these views. In addition
to hiding details of the logical level of the database, the views also provide a security mechanism
to prevent users from accessing certain parts of the database. For example, clerks in the university
registrar office can see only that part of the database that has information about students; they
cannot access information about salaries of instructors.
Instances and Schemas
Databases change over time as information is inserted and deleted. The collection of information
stored in the database at a particular moment is called an instance of the database, while the overall
design of the database is called the database schema. The concept of a schema can be understood by
analogy to a program written in a programming language: a database schema corresponds to the
variable declarations in a program. Each variable has a particular value at a given instant. The values
of the variables in a program at a point in time correspond to an instance of a database
schema. Database systems have several
schemas, partitioned according to the levels of abstraction. The physical schema describes the
database design at the physical level, while the logical schema describes the database design at
the logical level. A database may also have several schemas at the view level, sometimes called
subschemas, which describe different views of the database. Of these, the logical schema is by
far the most important, in terms of its effect on application programs, since programmers construct
applications by using the logical schema. The physical schema is hidden beneath the logical
schema, and can usually be changed easily without affecting application programs. Application
programs are said to exhibit physical data independence if they do not depend on the physical
schema, and thus need not be rewritten if the physical schema changes.
5. Data Models
Underlying the structure of a database is the data model: a collection of conceptual tools for
describing data, data relationships, data semantics, and consistency constraints. A data model
provides a way to describe the design of a database at the physical, logical, and view levels.
Relational Model. The relational model uses a collection of tables to represent both data and the
relationships among those data. Each table has multiple columns, and each column has a unique
name. (The relational model is treated in detail in Unit II.)
Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic
objects, called entities, and relationships among these objects.
An entity is a “thing” or “object” in the real world that is distinguishable from other objects. The
entity-relationship model is widely used in database design.
Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has
become the dominant software-development methodology. This led to the development of an
object-oriented data model that can be seen as extending the E-R model with notions of
encapsulation, methods (functions), and object identity. The object-relational data model combines
features of the object-oriented data model and relational data model.
Semi-structured Data Model. The semi-structured data model permits the specification of data
where individual data items of the same type may have different sets of attributes. This is in contrast
to the data models mentioned earlier, where every data item of a particular type must have the
same set of attributes. The Extensible Markup Language (XML) is widely used to represent semi-
structured data.
Historically, the network data model and the hierarchical data model preceded the relational
data model. These models were tied closely to the underlying implementation, and complicated the
task of modeling data. As a result they are used little now, except in old database code that is still
in service in some places.
6. Database Languages
A database system provides a data-definition language to specify the database schema and a data-
manipulation language to express database queries and updates. In practice, the data- definition and data-
manipulation languages are not two separate languages; instead they simply form parts of a single database
language, such as the widely used SQL language.
Data-Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or
manipulate data as organized by the appropriate data model. The types of access are:
• Retrieval of information stored in the database
• Insertion of new information into the database
• Deletion of information from the database
• Modification of information stored in the database
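For instance, in SQL (using the instructor relation that appears later in these notes; the exact table is assumed here for illustration), the four types of access look like this:

select name from instructor where dept_name = 'History';            -- retrieval
insert into instructor values ('10211', 'Smith', 'Biology', 66000); -- insertion
delete from instructor where name = 'Smith';                        -- deletion
update instructor set salary = salary * 1.05;                       -- modification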
Data-Definition Language
We specify a database schema by a set of definitions expressed by a special language called a
data-definition language (DDL). We specify the storage structure and access methods used by the
database system by a set of statements in a special type of DDL called a data storage and definition
language. These statements define the implementation details of the database schemas, which are
usually hidden from the users.
The data values stored in the database must satisfy certain consistency constraints.
For example, suppose the university requires that the account balance of a department must never
be negative. The DDL provides facilities to specify such constraints. The database system checks
these constraints every time the database is updated. In general, a constraint can be an arbitrary
predicate pertaining to the database. However, arbitrary predicates may be costly to test. Thus,
database systems implement integrity constraints that can be tested with minimal overhead.
• Domain Constraints. A domain of possible values must be associated with every attribute (for
example, integer types, character types, date/time types). Declaring an attribute to be of a particular
domain acts as a constraint on the values that it can take. Domain constraints are the most
elementary form of integrity constraint. They are tested easily by the system whenever a new data
item is entered into the database.
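In SQL, a domain constraint is expressed by the declared type of each attribute; a minimal sketch (the column types are illustrative):

create table instructor (
    ID        char(5),
    name      varchar(20) not null,
    dept_name varchar(20),
    salary    numeric(8,2)
);
-- Declaring salary as numeric(8,2) constrains every value stored in that column to its domain;
-- an attempt to store a non-numeric value there is rejected when the data item is entered.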
• Referential Integrity. There are cases where we wish to ensure that a value that appears in one
relation for a given set of attributes also appears in a certain set of attributes in another relation
(referential integrity). For example, the department listed for each course must be one that actually
exists. More precisely, the dept name value in a course record must appear in the dept name
attribute of some record of the department relation.
Database modifications can cause violations of referential integrity. When a referential-
integrity constraint is violated, the normal procedure is to reject the action that caused the
violation.
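In SQL, referential integrity is declared with a foreign key clause; a sketch using the course and department relations described in these notes:

create table course (
    course_id varchar(8),
    title     varchar(50),
    dept_name varchar(20),
    credits   numeric(2,0),
    primary key (course_id),
    foreign key (dept_name) references department (dept_name)
);
-- A modification that supplies a dept_name not present in department is rejected,
-- which is the normal "reject the action" behavior described above.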
• Assertions. An assertion is any condition that the database must always satisfy. Domain
constraints and referential-integrity constraints are special forms of assertions. However, there are
many constraints that we cannot express by using only these special forms. For example, “Every
department must have at least five courses offered every semester” must be expressed as an
assertion. When an assertion is created, the system tests it for validity. If the assertion is valid, then
any future modification to the database is allowed only if it does not cause that assertion to be
violated.
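SQL provides a create assertion construct for this purpose, although many systems implement it only partially; a simplified sketch of the example above (semesters are ignored for brevity):

create assertion dept_min_courses check (
    not exists (
        select dept_name
        from department
        where (select count(*)
               from course
               where course.dept_name = department.dept_name) < 5
    )
);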
• Authorization. We may want to differentiate among the users as far as the type of access they
are permitted on various data values in the database. These differentiations are expressed in terms
of authorization, the most common being: read authorization, which allows reading, but not
modification, of data; insert authorization, which allows insertion of new data, but not modification
of existing data; update authorization, which allows modification, but not deletion, of data; and
delete authorization, which allows deletion of data. We may assign the user all, none, or a
combination of these types of authorization.
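In SQL, these authorizations are expressed with grant and revoke statements; a sketch (the user names are hypothetical):

grant select on instructor to clerk1;             -- read authorization
grant insert, update on instructor to registrar1; -- insert and update authorization
revoke delete on instructor from clerk1;          -- withdraw delete authorization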
The DDL, just like any other programming language, gets as input some instructions (statements)
and generates some output. The output of the DDL is placed in the data dictionary, which contains
metadata—that is, data about data. The data dictionary is considered to be a special type of table
that can only be accessed and updated by the database system itself (not a regular user). The
database system consults the data dictionary before reading or modifying actual data.
Data Dictionary
We can define a data dictionary as a DBMS component that stores the definition of data
characteristics and relationships. You may recall that such “data about data” were labeled
metadata. The DBMS data dictionary provides the DBMS with its self-describing characteristic. In
effect, the data dictionary resembles an X-ray of the company’s entire data set, and is a crucial
element in the data administration function.
Two main types of data dictionary exist: integrated and stand-alone. An integrated data
dictionary is included with the DBMS. For example, all relational DBMSs include a built-in data
dictionary or system catalog that is frequently accessed and updated by the RDBMS. Other DBMSs,
especially older types, do not have a built-in data dictionary; instead the DBA may use third-party
stand-alone data dictionary systems.
Data dictionaries can also be classified as active or passive. An active data dictionary is
automatically updated by the DBMS with every database access, thereby keeping its access
information up-to-date. A passive data dictionary is not updated automatically and usually requires
a batch process to be run. Data dictionary access information is normally used by the DBMS for
query optimization purposes.
The data dictionary’s main function is to store the description of all objects that interact with the
database. Integrated data dictionaries tend to limit their metadata to the data managed by the
DBMS. Stand-alone data dictionary systems are usually more flexible and allow the DBA to
describe and manage all the organization’s data, whether or not they are computerized. Whatever
the data dictionary’s format, its existence provides database designers and end users with a much
improved ability to communicate. In addition, the data dictionary is the tool that helps the DBA
resolve data conflicts.
Although there is no standard format for the information stored in the data dictionary, several
features are common. For example, the data dictionary typically stores descriptions of all:
• Data elements that are defined in all tables of all databases. Specifically, the data dictionary
stores the names, data types, display formats, internal storage formats, and validation rules.
The data dictionary tells where an element is used, by whom it is used, and so on.
• Tables defined in all databases. For example, the data dictionary is likely to store the name of
the table creator, the date of creation, access authorizations, the number of columns, and so
on.
• Indexes defined for each database table. For each index the DBMS stores at least the index
name, the attributes used, the location, specific index characteristics, and the creation date.
• Defined databases: who created each database, the date of creation, where the database is
located, who the DBA is, and so on.
• End users and administrators of the database.
• Programs that access the database, including screen formats, report formats, application
formats, SQL queries, and so on.
• Access authorizations for all users of all databases.
• Relationships among data elements: which elements are involved, whether the relationships are
mandatory or optional, the connectivity and cardinality, and so on.
7. Database Users and Administrators
A primary goal of a database system is to retrieve information from and store new information in
the database. People who work with a database can be categorized as database users or database
administrators.
There are four different types of database-system users, differentiated by the way they expect to
interact with the system. Different types of user interfaces have been designed for the different
types of users.
Naive users are unsophisticated users who interact with the system by invoking one of the
application programs that have been written previously. For example, a bank teller who needs to
transfer $50 from account A to account B invokes a program called transfer. This program asks the
teller for the amount of money to be transferred, the account from which the money is to be
transferred, and the account to which the money is to be transferred.
As another example, consider a user who wishes to find her account balance over the World Wide
Web. Such a user may access a form, where she enters her account number. An application
program at the Web server then retrieves the account balance, using the given account number,
and passes this information back to the user. The typical user interface for naive users is a forms
interface, where the user can fill in appropriate fields of the form. Naive users may also simply read
reports generated from the database.
Application programmers are computer professionals who write application programs, using tools
such as rapid application development tools or programming languages with embedded DML calls.
Sophisticated users interact with the system without writing programs; instead, they form their
requests in a database query language or use tools such as data-analysis software.
Specialized users are sophisticated users who write specialized database applications, such as
computer-aided design systems, that do not fit into the traditional data-processing framework.
Transaction Management:
A transaction is a collection of operations that performs a single logical function in a database
application. Each transaction is a unit of both atomicity and consistency. Thus, we require that
transactions do not violate any database-consistency constraints. That is, if the database was
consistent when a transaction started, the database must be consistent when the transaction
successfully terminates. The transaction manager ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes) and
transaction failures.
8. Data base System Structure (Database Architecture):
We are now in a position to provide a single picture (Figure 1.3) of the various components of a
database system and the connections among them.
The architecture of a database system is greatly influenced by the underlying computer system on
which the database system runs. Database systems can be centralized, or client-server, where one
server machine executes work on behalf of multiple client machines. Database systems can also
be designed to exploit parallel computer architectures. Distributed databases span multiple
geographically separated machines.
Figure 1.3: Database System Architecture
A database system is partitioned into modules that deal with each of the responsibilities of the
overall system. The functional components of a database system can be broadly divided into the
storage manager and the query processor components. The storage manager is important
because databases typically require a large amount of storage space. The query processor is
important because it helps the database system simplify and facilitate access to data.
It is the job of the database system to translate updates and queries written in a nonprocedural
language, at the logical level, into an efficient sequence of operations at the physical level.
Database applications are usually partitioned into two or three parts, as in Figure 1.4. In a two-tier
architecture, the application resides at the client machine, where it invokes database system
functionality at the server machine through query language statements. Application program
interface standards like ODBC and JDBC are used for interaction between the client and the server.
In contrast, in a three-tier architecture, the client machine acts as merely a front end and does not
contain any direct database calls. Instead, the client end communicates with an application server,
usually through a forms interface.
The application server in turn communicates with a database system to access data. The business
logic of the application, which says what actions to carry out under what conditions, is embedded
in the application server, instead of being distributed across multiple clients. Three-tier applications
are more appropriate for large applications, and for applications that run on the World Wide Web.
Figure 1.4: Two-tier and three-tier architectures.
Query Processor:
The query processor components include
· DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
· DML compiler, which translates DML statements in a query language into an evaluation
plan consisting of low-level instructions that the query evaluation engine understands.
A query can usually be translated into any of a number of alternative evaluation plans that all give
the same result. The DML compiler also performs query optimization, that is, it picks the lowest
cost evaluation plan from among the alternatives.
Query evaluation engine, which executes low-level instructions generated by the DML compiler.
Storage Manager:
A storage manager is a program module that provides the interface between the lowlevel data
stored in the database and the application programs and queries submitted to the system. The
storage manager is responsible for the interaction with the file manager. The raw data are stored
on the disk using the file system, which is usually provided by a conventional operating system.
The storage manager translates the various DML statements into low-level file-system commands.
Thus, the storage manager is responsible for storing, retrieving, and updating data in the database.
What is ER Modeling?
A graphical technique for understanding and organizing data, independent of the actual
database implementation.
We need to be familiar with the following terms to go further.
Entity
Anything that has an independent existence and about which we collect data. It is also known
as entity type. In ER modeling, notation for entity is given below.
Entity instance
An entity instance is a particular member of the entity type. Example for entity instance: a
particular employee.
Regular Entity
An entity which has its own key attribute is a regular entity. Example for regular entity:
Employee.
Weak entity
An entity which depends on another entity for its existence and doesn't have any key attribute of its
own is a weak entity.
Example for a weak entity: in a parent/child relationship, a parent is considered a strong
entity and the child a weak entity.
In ER modeling, notation for weak entity is given below.
Attributes
Properties/characteristics which describe entities are called attributes. In ER modeling,
notation for attribute is given below.
Domain of Attributes
The set of possible values that an attribute can take is called the domain of the attribute. For
example, the attribute day may take any value from the set {Monday, Tuesday ... Friday}. Hence
this set can be termed as the domain of the attribute day.
Key attribute
The attribute (or combination of attributes) which is unique for every entity instance is called the key
attribute, e.g. the employee_id of an employee, the pan_card_number of a person, etc. If the key
attribute consists of two or more attributes in combination, it is called a composite key.
In ER modeling, notation for key attribute is given below.
Simple attribute
If an attribute cannot be divided into simpler components, it is a simple attribute.
Example for simple attribute: employee_id of an employee.
Composite attribute
If an attribute can be split into components, it is called a composite attribute.
Example for composite attribute: name of an employee, which can be split into First_name,
Middle_name, and Last_name.
Single-valued Attributes
If an attribute can take only a single value for each entity instance, it is a single-valued attribute.
Example for single-valued attribute: age of a student. It can take only one value for a particular
student.
Multi-valued Attributes
If an attribute can take more than one value for each entity instance, it is a multi-valued attribute.
Example for multi-valued attribute: telephone number of an employee; a particular employee may
have multiple telephone numbers.
In ER modeling, notation for multi-valued attribute is given below.
Stored Attribute
An attribute which needs to be stored permanently is a stored attribute.
Example for stored attribute: name of a student.
Derived Attribute
An attribute which can be calculated or derived from other attributes is a derived attribute.
Example for derived attribute: age of an employee, which can be calculated from the date of birth
and the current date. In ER modeling, notation for derived attribute is given below.
Relationships
Associations between entities are called relationships.
Example: an employee works for an organization. Here "works for" is a relationship between the
entities employee and organization.
In ER modeling, notation for relationship is given below.
However, in ER modeling, to connect a weak entity with others, you should use the weak
relationship notation given below.
Degree of a Relationship
The degree of a relationship is the number of entity types involved. The n-ary relationship is the
general form for degree n. Special cases are unary, binary, and ternary, where the degree is
1, 2, and 3, respectively.
Example for unary relationship: an employee is a manager of another employee.
Example for binary relationship: an employee works for a department.
Example for ternary relationship: a customer purchases an item from a shopkeeper.
Cardinality of a Relationship
Relationship cardinalities specify how many of each entity type is allowed. Relationships can
have four possible connectivities as given below.
1. One to one (1:1) relationship
2. One to many (1:N) relationship
3. Many to one (M:1) relationship
4. Many to many (M:N) relationship
The minimum and maximum values of this connectivity are called the cardinality of the relationship.
Example for Cardinality – One-to-One (1:1)
One employee is assigned only one parking space, and one parking space is assigned to only one
employee. Hence it is a 1:1 relationship and the cardinality is One-to-One (1:1).
In ER modeling, this can be shown using the notation given below.
Example for Cardinality – One-to-Many (1:N)
One organization can have many employees, but one employee works in only one organization.
Hence it is a 1:N relationship and the cardinality is One-to-Many (1:N).
In ER modeling, this can be shown using the notation given below.
Example for Cardinality – Many-to-One (M:1)
It is the reverse of the One-to-Many relationship: an employee works in an organization.
One employee works in only one organization, but one organization can have many employees.
Hence it is an M:1 relationship and the cardinality is Many-to-One (M:1).
In ER modeling, this can be shown using the notation given below.
Example for Cardinality – Many-to-Many (M:N)
One student can enroll for many courses, and one course can be enrolled in by many students.
Hence it is an M:N relationship and the cardinality is Many-to-Many (M:N).
In ER modeling, this can be shown using the notation given below.
Relationship Participation
1. Total
In total participation, every entity instance will be connected through the relationship to another
instance of the other participating entity types.
2. Partial
In partial participation, only some entity instances participate in the relationship.
Example for relationship participation
Consider the relationship: Employee is head of the department.
Here not all employees will be the head of the department; only one employee will be the head
of the department. In other words, only a few instances of the employee entity participate in the
above relationship, so the employee entity's participation is partial in the said relationship.
However, each department will be headed by some employee, so the department entity's
participation is total in the said relationship.
Advantages and Disadvantages of ER Modeling (Merits and Demerits of ER Modeling)
Advantages
1. ER modeling is simple and easily understandable. It is represented in the business user’s
language and can be understood by non-technical specialists.
2. It is intuitive and helps in physical database creation.
3. It can be generalized and specialized based on needs.
4. It can help in database design.
5. It gives a higher-level description of the system.
Disadvantages
1. The physical design derived from an E-R model may have some amount of ambiguity or
inconsistency.
2. Sometimes diagrams may lead to misinterpretations.
Unit - II
Relational Model
The relational model is today the primary data model for commercial data-processing applications. It attained its
primary position because of its simplicity, which eases the job of the programmer, compared to earlier data models
such as the network model or the hierarchical model. In this unit, we first study the fundamentals of the relational
model. A substantial theory exists for relational databases.
A relational database consists of a collection of tables, each of which is assigned a unique name. For example,
consider the instructor table of Figure: 1.5, which stores information about instructors. The table has four column
headers: ID, name, dept name, and salary. Each row of this table records information about an instructor,
consisting of the instructor’s ID, name, dept name, and salary. Similarly, the course table of Figure 1.6 stores
information about courses, consisting of a course id, title, dept name, and credits, for each course. Note that each
instructor is identified by the value of the column ID, while each course is identified by the value of the column
course id.
Figure 1.7 shows a third table, prereq, which stores the prerequisite courses for each course. The table has two
columns, course id and prereq id. Each row consists of a pair of course identifiers such that the second course is
a prerequisite for the first course.
Thus, a row in the prereq table indicates that two courses are related in the sense that one course is a prerequisite
for the other. As another example, we consider the table instructor, a row in the table can be thought of as
representing the relationship between a specified ID and the corresponding values for name,dept name, and salary
values.
In general, a row in a table represents a relationship among a set of values. Since a table is a collection of such
relationships, there is a close correspondence between the concept of table and the mathematical concept of
relation, from which the relational data model takes its name. In mathematical terminology, a tuple is simply a
sequence (or list) of values. A relationship between n values is represented mathematically by an n-tuple of values,
i.e., a tuple with n values, which corresponds to a row in a table.
Figure 1.6: The course relation.
Thus, in the relational model the term relation is used to refer to a table, while the term tuple is used to refer to
a row. Similarly, the term attribute refers to a column of a table.
Examining Figure 1.5, we can see that the relation instructor has four attributes: ID, name, dept name, and salary.
We use the term relation instance to refer to a specific instance of a relation, i.e., containing a specific set of
rows. The instance of instructor shown in Figure 1.5 has 12 tuples, corresponding to 12 instructors.
In this topic, we shall be using a number of different relations to illustrate the various concepts underlying the
relational data model. These relations represent part of a university. They do not include all the data an actual
university database would contain, in order to simplify our presentation.
The order in which tuples appear in a relation is irrelevant, since a relation is a set of tuples. Thus, whether the
tuples of a relation are listed in sorted order, as in Figure 1.5, or are unsorted, as in Figure 1.8, does not matter;
the relations in the two figures are the same, since both contain the same set of tuples. For ease of exposition, we
will mostly show the relations sorted by their first attribute. For each attribute of a relation, there is a set of permitted
values, called the domain of that attribute. Thus, the domain of the salary attribute of the instructor relation is the
set of all possible salary values, while the domain of the name attribute is the set of all possible instructor names.
We require that, for all relations r, the domains of all attributes of r be atomic. A domain is atomic if elements of
the domain are considered to be indivisible units.
For example, suppose the table instructor had an attribute phone number, which can store a set of phone numbers
corresponding to the instructor. Then the domain of phone number would not be atomic, since an element of the
domain is a set of phone numbers, and it has subparts, namely the individual phone numbers
in the set.
The important issue is not what the domain itself is, but rather how we use domain elements in our database.
Suppose now that the phone number attribute stores a single phone number. Even then, if we split the value from
the phone number attribute into a country code, an area code and a local number, we would be treating it as a
nonatomic value. If we treat each phone number as a single indivisible unit, then the attribute phone number would
have an atomic domain.
The null value is a special value that signifies that the value is unknown or does not exist. For example, suppose
as before that we include the attribute phone number in the instructor relation. It may be that an instructor does
not have a phone number at all, or that the telephone number is unlisted. We would then have to use the null
value to signify that the value is unknown or does not exist. We shall see later that null values cause a number of
difficulties when we access or update the database, and thus should be eliminated if at all possible. We shall
assume null values are absent initially.
Database Schema
When we talk about a database, we must differentiate between the database schema, which is the logical
design of the database, and the database instance, which is a snapshot of the data in the database at a given
instant in time. The concept of a relation corresponds to the programming-language notion of a variable, while the
concept of a relation schema corresponds to the programming-language notion of type definition.
Figure 1.9: The department relation.
In general, a relation schema consists of a list of attributes and their corresponding domains. The concept of a
relation instance corresponds to the programming-language notion of the value of a variable. The value of a given
variable may change with time; similarly the contents of a relation instance may change with time as the relation
is updated. In contrast, the schema
of a relation does not generally change. Although it is important to know the difference between a relation schema
and a relation instance, we often use the same name, such as instructor, to refer to both the schema and the
instance. Where required, we explicitly refer to the schema or to the instance, for example “the instructor schema,”
or “an instance of the instructor relation.” However, where it is clear whether we mean the schema or the instance,
we simply use the relation name.
Consider the department relation of Figure 1.9. The schema for that relation is:
department (dept_name, building, budget)
Note that the attribute dept name appears in both the instructor schema and the department schema. This
duplication is not a coincidence. Rather, using common attributes in relation schemas is one way of relating tuples
of distinct relations.
For example, suppose we wish to find the information about all the instructors who work in the Watson building.
We look first at the department relation to find the dept name of all the departments housed in Watson. Then, for
each such department, we look in the instructor relation to find the information about the instructor associated with
the corresponding dept name.
Let us continue with our university database example. Each course in a university may be offered multiple times,
across different semesters, or even within a semester. We need a relation to describe each individual
offering, or section, of the class. The schema is
section (course id, sec id, semester, year, building, room number, time slot id)
Figure 1.10 shows a sample instance of the section relation. We need a relation to describe the association
between instructors and the class sections that they teach. The relation schema to describe this association is
teaches (ID, course id, sec id, semester, year)
Figure 1.10: The section relation.
Figure 1.11 shows a sample instance of the teaches relation. As you can imagine, there are many more relations
maintained in a real university database, in addition to those we have listed already: instructor,
department, course, section, prereq, and teaches.
Keys
We must have a way to specify how tuples within a given relation are distinguished. This is expressed in terms of
their attributes. That is, the values of the attribute values of a tuple must be such that they can uniquely identify
the tuple. In other words, no two tuples in a relation are allowed to have exactly the same value for all attributes.
A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely a tuple in the
relation. For example, the ID attribute of the relation instructor is sufficient to distinguish one instructor tuple from
another. Thus, ID is a superkey. The name attribute of instructor, on the other hand, is not a superkey, because
several instructors might have the same name. Formally, let R denote the set of attributes in the schema of relation
r. If we say that a subset K of R is a superkey for r , we are restricting consideration to instances of relations r in
which no two distinct tuples have the same values on all attributes in K. That is, if t1 and t2 are in r and t1 ≠ t2, then
t1.K ≠ t2.K.
A superkey may contain extraneous attributes. For example, the combination of ID and name is a superkey for the
relation instructor. If K is a superkey, then so is any superset of K. We are often interested in superkeys for which
no proper subset is a superkey. Such minimal superkeys are called candidate keys.
It is possible that several distinct sets of attributes could serve as a candidate key. Suppose that a combination of
name and dept name is sufficient to distinguish among members of the instructor relation. Then, both {ID} and
{name, dept name} are candidate keys. Although the attributes ID and name together can distinguish instructor
tuples, their combination, {ID, name}, does not form a candidate key, since the attribute ID alone is a candidate
key.
We shall use the term primary key to denote a candidate key that is chosen by the database designer as the
principal means of identifying tuples within a relation. A key (whether primary, candidate, or super) is a property of
the entire relation, rather than of the individual tuples. Any two individual tuples in the relation are prohibited from
having the same value on the key attributes at the same time. The designation of a key represents a constraint in
the real-world enterprise being modeled.
Primary keys must be chosen with care. As we noted, the name of a person is obviously not sufficient, because
there may be many people with the same name. In the United States, the social-security number attribute of a
person would be a candidate key. Since non-U.S. residents usually do not have social-security numbers,
international enterprises must generate their own unique identifiers.
An alternative is to use some unique combination of other attributes as a key. The primary key should be chosen
such that its attribute values are never, or very rarely, changed. For instance, the address field of a person should
not be part of the primary key, since it is likely to change. Social-security numbers, on the other hand, are
guaranteed never to change. Unique identifiers generated by enterprises generally do not change, except if two
enterprises merge; in such a case the same identifier may have been issued by both enterprises, and a reallocation
of identifiers may be required to make sure they are unique.
It is customary to list the primary key attributes of a relation schema before the other attributes; for example, the
dept name attribute of department is listed first, since it is the primary key. Primary key attributes are also
underlined. A relation, say r1, may include among its attributes the primary key of another relation, say r2. This
attribute is called a foreign key from r1, referencing r2.
The relation r1 is also called the referencing relation of the foreign key dependency, and r2 is called the
referenced relation of the foreign key. For example, the attribute dept name in instructor is a foreign key from
instructor, referencing department, since dept name is the primary key of department. In any database instance,
given any tuple, say ta, from the instructor relation, there must be some tuple, say tb, in the department relation
such that the value of the dept name attribute of ta is the same as the value of the primary key, dept name, of tb.
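A minimal sketch of how such a foreign key could be declared in SQL follows (the column types, and the reduced set of attributes shown for each relation, are illustrative assumptions):
create table department
  (dept_name varchar(20) primary key,
   building varchar(15));
create table instructor
  (ID char(5) primary key,
   name varchar(20),
   dept_name varchar(20) references department);  -- foreign key from instructor, referencing department
With these declarations, the database system rejects any instructor tuple whose dept_name does not appear as a primary key value in department.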
Now consider the section and teaches relations. It would be reasonable to require that if a section exists for a
course, it must be taught by at least one instructor; however, it could possibly be taught by more than one instructor.
To enforce this constraint, we would require that if a particular (course id, sec id, semester, year) combination
appears in section, then the same combination must appear in teaches. However, this set of values does not form
a primary key for teaches, since more than one instructor may teach one such section. As a result, we cannot
declare a foreign key constraint from section to teaches (although we can define a foreign key constraint in the
other direction, from teaches to section).
The constraint from section to teaches is an example of a referential integrity constraint; a referential
integrity constraint requires that the values appearing in specified attributes of any tuple in the referencing
relation also appear in specified attributes of at least one tuple in the referenced relation.
Schema Diagrams
A database schema, along with primary key and foreign key dependencies, can be depicted by schema
diagrams. Figure 1.12 shows the schema diagram for our university organization. Each relation appears
as a box, with the relation name at the top in blue, and the attributes listed inside the box. Primary key
attributes are shown underlined. Foreign key dependencies appear as arrows from the foreign key
attributes of the referencing relation to the primary key of the referenced relation.
Referential integrity constraints other than foreign key constraints are not shown explicitly in schema
diagrams. We will study a different diagrammatic representation called the entity-relationship diagram.
Relational Algebra and Calculus
In defining relational algebra and calculus, the alternative of referring to fields by position
is more convenient than referring to fields by name: Queries often involve the computation of
intermediate results, which are themselves relation instances, and if we use field names to
refer to fields, the definition of query language constructs must specify the names of fields for
all intermediate relation instances. This can be tedious and is really a secondary issue because
we can refer to fields by position anyway. On the other hand, field names make queries more
readable.
Due to these considerations, we use the positional notation to formally define relational
algebra and calculus. We also introduce simple conventions that allow intermediate relations
to ‘inherit’ field names, for convenience.
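For reference, the example schemas used throughout these sections are the Sailors/Boats/Reserves schemas of the standard text:
Sailors(sid: integer, sname: string, rating: integer, age: real)
Boats(bid: integer, bname: string, color: string)
Reserves(sid: integer, bid: integer, day: dates)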
The key fields are underlined, and the domain of each field is listed after the field
name. Thus sid is the key for Sailors, bid is the key for Boats, and all three fields
together form the key for Reserves. Fields in an instance of one of these relations will be
referred to by name, or positionally, using the order in which they are listed above.
In several examples illustrating the relational algebra operators, we will use the instances
S1 and S2 (of Sailors) and R1 (of Reserves) shown in Figures 4.1, 4.2, and 4.3, respectively.
RELATIONAL ALGEBRA
Relational algebra is one of the two formal query languages associated with the relational model. Queries in
algebra are composed using a collection of operators. A fundamental property is that every operator in the
algebra accepts (one or two) relation instances as arguments and returns a relation instance as the result. This
property makes it easy to compose operators to form a complex query: a relational algebra expression is
recursively defined to be a relation, a unary algebra operator applied to a single expression, or a binary algebra
operator applied to two expressions. We describe the basic operators of the algebra (selection, projection,
union, cross-product, and difference), as well as some additional operators that can be defined in terms of the
basic operators but arise frequently enough to warrant special attention, in the following sections.
Each relational query describes a step-by-step procedure for computing the desired answer, based on the order
in which operators are applied in the query. The procedural nature of the algebra allows us to think of an
algebra expression as a recipe, or a plan, for evaluating a query, and relational systems in fact use algebra
expressions to represent query evaluation plans.
Relational algebra includes operators to select rows from a relation (σ) and to project columns (π). These
operations allow us to manipulate data in a single relation. Consider the instance of the Sailors relation shown
in Figure 4.2, denoted as S2. We can retrieve rows corresponding to expert sailors by using the σ operator. The
expression,
σrating>8(S2)
evaluates to the relation shown in Figure 4.4. The subscript rating>8 specifies the
selection criterion to be applied while retrieving tuples.
σrating>8(S2):
sid  sname  rating  age
28   yuppy  9       35.0
58   rusty  10      35.0
πsname,rating(S2):
sname   rating
yuppy   9
Lubber  8
guppy   5
rusty   10
The selection operator σ specifies the tuples to retain through a selection condition. In general,
the selection condition is a boolean combination (i.e., an expression using the logical
connectives ∧ and ∨) of terms that have the form attribute op constant or attribute1 op
attribute2, where op is one of the comparison operators <, <=, =, ≠, >=, or >. The
reference to an attribute can be by position (of the form .i or i) or by name (of the form
.name or name). The schema of the result of a selection is the schema of the input relation
instance.
The projection operator π allows us to extract columns from a relation; for example, we can
find all sailor names and ratings by using π. The expression
πsname,rating(S2)
evaluates to a relation containing the name and rating of each sailor in S2. Suppose that we wanted to find out
only the ages of sailors. The expression
πage(S2)
evaluates to a relation containing the distinct ages in S2; although three sailors in S2 are aged 35,
a single tuple with age=35.0 appears in the result of the projection. This follows from
the definition of a relation as a set of tuples. In practice,
real systems often omit the expensive step of eliminating duplicate tuples, leading to
relations that are multisets. However, our discussion of relational algebra and calculus
assumes that duplicate elimination is always done so that relations are always sets of
tuples.
We can compute the names and ratings of highly rated sailors by combining two of the
preceding queries. The expression
πsname,rating(σrating>8(S2))
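In SQL terms (a sketch, assuming Sailors is stored as a table with these columns), this composed query corresponds to:
SELECT DISTINCT S.sname, S.rating
FROM Sailors S
WHERE S.rating > 8
DISTINCT is needed to mirror the algebra's set semantics, since the algebra eliminates duplicate tuples from the projection.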
Set Operations
The following standard operations on sets are also available in relational algebra: union (∪),
intersection (∩), set-difference (−), and cross-product (×).
Union: R ∪ S returns a relation instance containing all tuples that occur in either
relation instance R or relation instance S (or both). R and S must be union-compatible,
and the schema of the result is defined to be identical to the schema of R.
Intersection: R ∩ S returns a relation instance containing all tuples that occur in both R and S. R and S must
be union-compatible, and the schema of the result is identical to the schema of R.
Set-difference: R − S returns a relation instance containing all tuples that occur in R but not in S. R and S
must be union-compatible, and the schema of the result is identical to the schema of R.
Cross-product: R × S returns a relation instance whose schema contains all the fields of R (in the same order
as they appear in R) followed by all the fields of S (in the same order as they appear in S), with one tuple
〈r, s〉 for each pair of tuples r ∈ R, s ∈ S.
(The accompanying figures show instances of S1 ∪ S2 and S1 × R1; the tabulated data is not reproduced here.)
Renaming
It is often useful to give names to the fields of a relation instance defined by a relational algebra expression;
we introduce a renaming operator ρ for this purpose. The expression ρ(R(F), E) takes
an arbitrary relational algebra expression E and returns an instance of a (new) relation
called R. R contains the same tuples as the result of E, and has the same schema as E,
but some fields are renamed. The field names in relation R are the same as in E, except
for fields renamed in the renaming list F.
For example, the expression ρ(C(1 → sid1, 5 → sid2), S1 × R1) returns a relation
that contains the tuples shown in Figure 4.11 and has the following schema: C(sid1:
integer, sname: string, rating: integer, age: real, sid2: integer, bid: integer,day:
dates).
It is customary to include some additional operators in the algebra, but they can all be
defined in terms of the operators that we have defined thus far. (In fact, the renaming
operator is only needed for syntactic convenience, and even the ∩ operator is redundant; R
∩ S can be defined as R − (R − S).) We will consider these additional operators, and their
definition in terms of the basic operators, in the next two subsections.
Joins
The join operation is one of the most useful operations in relational algebra and is the
most commonly used way to combine information from two or more relations. Although
a join can be defined as a cross-product followed by selections and projections, joins arise
much more frequently in practice than plain cross-products.
Joins have received a lot of attention, and there are several variants of the join operation.
Condition Joins
The most general version of the join operation accepts a join condition c and a pair of
relation instances as arguments, and returns a relation instance. The join condition is
identical to a selection condition in form. The operation is defined as follows:
R ⊲⊳c S = σc(R × S)
As in a selection condition, the reference to an attribute of a relation, say R, can be by position (of the form
R.i) or by name (of the form R.name). As an example, the result of S1 ⊲⊳S1.sid<R1.sid R1 is shown in Figure 4.12.
Because sid appears in both S1 and R1, the corresponding fields in the result of the cross-
product S1 × R1 (and therefore in the result of S1 ⊲⊳S1.sid<R1.sid R1) are unnamed.
Domains are inherited from the corresponding fields of S1 and R1.
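In SQL, a condition join like this one can be written with the join condition as an ordinary predicate (a sketch using the Sailors and Reserves tables):
SELECT *
FROM Sailors S1, Reserves R1
WHERE S1.sid < R1.sid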
Equijoin
A common special case of the join operation R ⊲⊳ S is when the join condition consists
solely of equalities (connected by ∧) of the form R.name1 = S.name2, that is, equalities
between two fields in R and S. In this case, obviously, there is some redundancy in
retaining both attributes in the result. For join conditions that contain only such equalities,
the join operation is refined by doing an additional projection in which S.name2 is dropped.
The join operation with this refinement is called equijoin.
The schema of the result of an equijoin contains the fields of R (with the same names and
domains as in R) followed by the fields of S that do not appear in the join conditions. If
this set of fields in the result relation includes two fields that inherit the same name from R
and S, they are unnamed in the result relation.
We illustrate S1 ⊲⊳R.sid=S.sid R1 in Figure 4.13. Notice that only one field called sid
appears in the result.
S1 ⊲⊳R.sid=S.sid R1:
sid  sname   rating  age   bid  day
22   Dustin  7       45.0  101  10/10/96
58   Rusty   10      35.0  103  11/12/96
Natural Join
A further special case of the join operation R ⊲⊳ S is an equijoin in which equalities are
specified on all fields having the same name in R and S. In this case, we can simply
omit the join condition; the default is that the join condition is a collection of equalities on
all common fields. We call this special case a natural join, and it has the nice property that
the result is guaranteed not to have two fields with the same name.
The equijoin expression S1 ⊲⊳R.sid=S.sid R1 is actually a natural join and can simply be
denoted as S1 ⊲⊳ R1, since the only common field is sid. If the two relations have no
attributes in common, S1 ⊲⊳ R1 is simply the cross-product.
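In SQL (a sketch), a natural join can be expressed either with the NATURAL JOIN keyword or as an explicit equijoin on the common field:
SELECT *
FROM Sailors NATURAL JOIN Reserves
or equivalently:
SELECT S.sid, S.sname, S.rating, S.age, R.bid, R.day
FROM Sailors S JOIN Reserves R ON S.sid = R.sid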
Division
The division operator is useful for expressing certain kinds of queries, for example:
“Find the names of sailors who have reserved all boats.” Understanding how to use the
basic operators of the algebra to define division is a useful exercise. However,
the division operator does not have the same importance as the other operators—it is
not needed as often, and database systems do not try to exploit the semantics of division
by implementing it as a distinct operator (as, for example, is done with the join
operator).
We discuss division through an example. Consider two relation instances A and B in which
A has (exactly) two fields x and y and B has just one field y, with the same domain as in
A. We define the division operation A/B as the set of all x values (in the form of unary
tuples) such that for every y value in (a tuple of) B, there is a tuple 〈x, y〉 in A.
Another way to understand division is as follows. For each x value in (the first column of)
A, consider the set of y values that appear in (the second field of) tuples of A with that x
value. If this set contains (all y values in) B, the x value is in the result of A/B.
An analogy with integer division may also help to understand division. For integers A and
B, A/B is the largest integer Q such that Q ∗ B ≤ A. For relation instances A and B,
A/B is the largest relation instance Q such that Q × B ⊆ A.
Division is illustrated in Figure 4.14. It helps to think of A as a relation listing the parts
supplied by suppliers, and of the B relations as listing parts. A/Bi computes suppliers
who supply all parts listed in relation instance Bi.
Expressing A/B in terms of the basic algebra operators is an interesting exercise, and the
reader should try to do this before reading further. The basic idea is to compute all x
values in A that are not disqualified. An x value is disqualified if by attaching a
y value from B, we obtain a tuple 〈x, y〉 that is not in A. We can compute the disqualified x values as follows:
πx((πx(A) × B) − A)
Thus we can define A/B as
πx(A) − πx((πx (A) × B) − A)
To understand the division operation in full generality, we have to consider the case
when both x and y are replaced by a set of attributes.
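For example, if A contains the tuples 〈s1, p1〉, 〈s1, p2〉, and 〈s2, p1〉, and B contains 〈p1〉 and 〈p2〉, then A/B contains only 〈s1〉: s1 appears in A with every y value of B, while s2 is disqualified because 〈s2, p2〉 is not in A.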
More Examples of Relational Algebra Queries
We illustrate queries using the instances S3 of Sailors, R2 of Reserves, and B1 of Boats,
shown in Figures 4.15, 4.16, and 4.17, respectively.
(Q1) Find the names of sailors who have reserved boat 103.
The answer is a relation with just one field, called sname, containing three tuples: 〈Dustin〉, 〈Horatio〉, and
〈Lubber〉. (Observe that there are two sailors called Horatio in S3, and only one of them has
reserved boat 103.)
We can break this query into smaller pieces using the renaming operator ρ:
ρ(Temp1, σbid=103 Reserves)
ρ(Temp2, Temp1 ⊲⊳ Sailors)
πsname(Temp2)
Temp1 denotes an intermediate relation that identifies reservations of boat 103. Temp2
is another intermediate relation, and it denotes sailors who have made a reservation in the
set Temp1. The instances of these relations when evaluating this query on the instances
R2 and S3 are illustrated in Figures 4.18 and 4.19. Finally, we extract the sname column
from Temp2.
The query can also be written as a single expression, without intermediate relations:
πsname(σbid=103(Reserves ⊲⊳ Sailors))
The DBMS translates an SQL query into (an extended form of) relational algebra, and then
looks for other algebra expressions that will produce the same answers but are cheaper to
evaluate. If the user’s query is first translated into the expression
πsname(σbid=103(Reserves⊲⊳Sailors))
a good query optimizer will find the equivalent expression
πsname((σbid=103Reserves) ⊲⊳ Sailors)
(Q2) Find the names of sailors who have reserved a red boat.
πsname((σcolor=′red′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors)
(Q4) Find the names of sailors who have reserved at least one boat.
πsname(Sailors ⊲⊳ Reserves)
(Q5) Find the names of sailors who have reserved a red or a green boat.
ρ(Tempboats, (σcolor=′red′ Boats) ∪ (σcolor=′green′ Boats))
πsname(Tempboats ⊲⊳ Reserves ⊲⊳ Sailors)
We identify the set of all boats that are either red or green (Tempboats, which contains boats
with the bids 102, 103, and 104 on instances B1, R2, and S3). Then we join with Reserves to
identify sids of sailors who have reserved one of these boats; this gives us sids 22, 31, 64, and
74 over our example instances. Finally, we join (an intermediate relation containing this set
of sids) with Sailors to find the names of Sailors with these sids. This gives us the names
Dustin, Horatio, and Lubber on the instances B1, R2, and S3. Another equivalent definition
is the following:
ρ(Tempboats, σcolor=′red′∨color=′green′ Boats)
πsname(Tempboats ⊲⊳ Reserves ⊲⊳ Sailors)
(Q6) Find the names of sailors who have reserved a red and a green boat. It is tempting to
simply replace ∨ by ∧ in the definition of Tempboats:
ρ(Tempboats2, σcolor=′red′∧color=′green′ Boats)
πsname(Tempboats2 ⊲⊳ Reserves ⊲⊳ Sailors)
However, this solution is incorrect—it instead tries to compute sailors who have re- served
a boat that is both red and green. (Since bid is a key for Boats, a boat can be only one
color; this query will always return an empty answer set.) The correct approach is to find
sailors who have reserved a red boat, then sailors who have reserved a green boat, and then
take the intersection of these two sets:
ρ(Tempred, πsid((σcolor=′red′ Boats) ⊲⊳ Reserves))
ρ(Tempgreen, πsid((σcolor=′green′ Boats) ⊲⊳ Reserves))
πsname((Tempred ∩ Tempgreen) ⊲⊳ Sailors)
The two temporary relations compute the sids of sailors, and their intersection identifies
sailors who have reserved both red and green boats. On instances B1, R2, and S3, the
sids of sailors who have reserved a red boat are 22, 31, and 64. The sids of sailors who
have reserved a green boat are 22, 31, and 74. Thus, sailors 22 and 31 have reserved both
a red boat and a green boat; their names are Dustin and Lubber.
This formulation of Query Q6 can easily be adapted to find sailors who have reserved red
or green boats (Query Q5); just replace ∩ by ∪:
πsname((Tempred ∪ Tempgreen) ⊲⊳ Sailors)
A tempting variant of Q6 is to compute the names of the sailors in each group first and then
intersect the two sets of names:
ρ(Tempred, πsname((σcolor=′red′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors))
ρ(Tempgreen, πsname((σcolor=′green′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors))
Tempred ∩ Tempgreen
This attempt is incorrect for a rather subtle reason. Two distinct sailors with the same
name, such as Horatio in our example instances, may have reserved red and
green boats, respectively. In this case, the name Horatio will (incorrectly) be included in
the answer even though no one individual called Horatio has reserved a red boat and a
green boat. The cause of this error is that sname is being used to identify sailors (while doing
the intersection) in this version of the query, but sname is not a key.
(Q7) Find the names of sailors who have reserved at least two boats.
ρ(Reservations, πsid,sname,bid(Sailors ⊲⊳ Reserves))
ρ(Reservationpairs(1 → sid1, 2 → sname1, 3 → bid1, 4 → sid2, 5 → sname2, 6 → bid2), Reservations × Reservations)
πsname1(σ(sid1=sid2)∧(bid1≠bid2) Reservationpairs)
First we compute tuples of the form 〈 sid,sname,bid 〉 , where sailor sid has made a
reservation for boat bid; this set of tuples is the temporary relation Reservations. Next
we find all pairs of Reservations tuples where the same sailor has made both reservations
and the boats involved are distinct. Here is the central idea: In order to show that a
sailor has reserved two boats, we must find two Reservations tuples involving the same
sailor but distinct boats. Over instances B1, R2, and S3, the sailors with sids 22, 31, and
64 have each reserved at least two boats. Finally, we project the names of such sailors to
obtain the answer, containing the names Dustin, Horatio, and Lubber.
Notice that we included sid in Reservations because it is the key field identifying sailors, and
we need it to check that two Reservations tuples involve the same sailor. As noted in the
previous example, we can’t use sname for this purpose.
(Q8) Find the sids of sailors with age over 20 who have not reserved a red boat.
πsid(σage>20 Sailors) − πsid((σcolor=′red′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors)
(Q9) Find the names of sailors who have reserved all boats.
The use of the word all (or every) is a good indication that the division operation might
be applicable:
ρ(Tempsids, (πsid,bid Reserves) / (πbid Boats))
πsname(Tempsids ⊲⊳ Sailors)
The intermediate relation Tempsids is defined using division, and computes the set of sids of
sailors who have reserved every boat (over instances B1, R2, and S3, this is just sid 22).
Notice how we define the two relations that the division operator (/) is applied to—the first
relation has the schema (sid,bid) and the second has the schema (bid). Division then returns
all sids such that there is a tuple 〈 sid,bid 〉 in the first relation for each bid in the second.
Joining Tempsids with Sailors is necessary to associate names with the selected sids; for sailor
22, the name is Dustin.
(Q10) Find the names of sailors who have reserved all boats called Interlake.
ρ(Tempsids, (πsid,bid Reserves) / (πbid(σbname=′Interlake′ Boats)))
πsname(Tempsids ⊲⊳ Sailors)
The only difference with respect to the previous query is that now we apply a selection to
Boats, to ensure that we compute only bids of boats named Interlake in defining the second
argument to the division operator. Over instances B1, R2, and S3, Tempsids evaluates to
sids 22 and 64, and the answer contains their names, Dustin and Horatio.
RELATIONAL CALCULUS
The variant of the calculus that we present in detail is called the tuple relational
calculus (TRC). Variables in TRC take on tuples as values. In another variant, called
the domain relational calculus (DRC), the variables range over field values. TRC has had
more of an influence on SQL, while DRC has strongly influenced QBE.
A tuple variable is a variable that takes on tuples of a particular relation schema as values.
That is, every value assigned to a given tuple variable has the same number and type of
fields. A tuple relational calculus query has the form { T | p(T) }, where T is a tuple variable
and p(T ) denotes a formula that describes T ; we will shortly define formulas and queries
rigorously. The result of this query is the set of all tuples t for which the formula p(T )
evaluates to true with T = t. The language for writing formulas p(T ) is thus at the heart of
TRC and is essentially a simple subset of first-order logic. As a simple example, consider the
following query.
(Q11) Find all sailors with a rating above 7.
{S | S ∈ Sailors ∧ S.rating > 7}
When this query is evaluated on an instance of the Sailors relation, the tuple variable S is
instantiated successively with each tuple, and the test S.rating>7 is applied. The answer
contains those instances of S that pass this test. On instance S3 of Sailors, the answer contains
Sailors tuples with sid 31, 32, 58, 71, and 74.
We now define these concepts formally, beginning with the notion of a formula. Let Rel
be a relation name, R and S be tuple variables, a an attribute of R, and b an attribute of
S. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of
the following:
a. R ∈ Rel
b. R.a op S.b
c. R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(R) denotes a formula in which the variable R appears:
any atomic formula
¬p, p ∨ q, p ∧ q, or p ⇒ q
∃R(p(R)), where R is a tuple variable
∀R(p(R)), where R is a tuple variable
We observe that every variable in a TRC formula appears in a subformula that is atomic,
and every relation schema specifies a domain for each field; this observation ensures that
each variable in a TRC formula has a well-defined domain from which values for the
variable are drawn. That is, each variable has a well-defined type, in the programming language
sense. Informally, an atomic formula R ∈ Rel gives R the type of tuples in Rel, and
comparisons such as R.a op S.b and R.a op constant induce type restrictions on the field
R.a. If a variable R does not appear in an atomic formula of the form R ∈ Rel (i.e., it
appears only in atomic formulas that are comparisons), we will follow the convention that
the type of R is a tuple whose fields include all (and only) fields of R that appear in the
formula.
We will not define types of variables formally, but the type of a variable should be clear in
most cases, and the important point to note is that comparisons of values having different
types should always fail. (In discussions of relational calculus, the simplifying assumption is
often made that there is a single domain of constants and that this is the domain associated
with each field of each relation.)
A TRC query is defined to be an expression of the form {T | p(T)}, where T is the only free
variable in the formula p.
What does a TRC query mean? More precisely, what is the set of answer tuples for a given
TRC query? The answer to a TRC query {T | p(T)}, as we noted earlier, is the set of all
tuples t for which the formula p(T ) evaluates to true with variable T assigned the tuple
value t. To complete this definition, we must state which assignments of tuple values to the
free variables in a formula make the formula evaluate to true.
A query is evaluated on a given instance of the database. Let each free variable in a
formula F be bound to a tuple value. For the given assignment of tuples to variables,
with respect to the given database instance, F evaluates to (or simply ‘is’) true if one of
the following holds:
F is an atomic formula R ∈ Rel, and R is assigned a tuple in the instance of
relation Rel.
F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples
assigned to R and S have field values R.a and S.b that make the comparison true.
F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true;
or of the form p ∨ q, and one of them is true; or of the form p ⇒ q, and q is true
whenever p is true.
F is of the form ∃R(p(R)), and there is some assignment of tuples to the free
variables in p(R), including the variable R, that makes the formula p(R) true.
F is of the form ∀R(p(R)), and the formula p(R) is true no matter what tuple is
assigned to R (for the given assignment of tuples to the other free variables in p(R)).
We now illustrate the calculus through several examples, using the instances B1 of Boats,
R2 of Reserves, and S3 of Sailors shown in Figures 4.15, 4.16, and 4.17. We will use
parentheses as needed to make our formulas unambiguous. Often, a formula p(R) includes a
condition R ∈ Rel, and the meaning of the phrases some tuple R and for all tuples R is
intuitive. We will use the notation ∃R ∈ Rel(p(R)) for ∃R(R ∈ Rel ∧ p(R)). Similarly, we
use the notation ∀R ∈ Rel(p(R)) for ∀R(R ∈ Rel ⇒ p(R)).
(Q12) Find the names and ages of sailors with a rating above 7.
{P | ∃S ∈ Sailors(S.rating > 7 ∧ P.name = S.sname ∧ P.age = S.age)}
This query illustrates a useful convention: P is considered to be a tuple variable with exactly two fields, name
and age, because these are the only fields of P mentioned and P does not range over any of the relations in the
query. On instances B1, R2, and S3, the answer is the set of tuples 〈Lubber, 55.5〉, 〈Andy, 25.5〉, 〈Rusty, 35.0〉,
〈Zorba, 16.0〉, and 〈Horatio, 35.0〉.
(Q13) Find the sailor name, boat id, and reservation date for each reservation
{P | ∃R ∈ Reserves ∃S ∈ Sailors
(R.sid = S.sid ∧ P.bid = R.bid ∧ P.day = R.day ∧ P.sname = S.sname)}
For each Reserves tuple, we look for a tuple in Sailors with the same sid. Given a pair of
such tuples, we construct an answer tuple P with fields sname, bid, and day by copying the
corresponding fields from these two tuples. This query illustrates how we can combine values
from different relations in each answer tuple. The answer to this query on instances B1,
R2, and S3 is shown in Figure 4.20.
(Q1) Find the names of sailors who have reserved boat 103.
{P | ∃S ∈ Sailors ∃R ∈ Reserves(R.sid = S.sid ∧ R.bid = 103 ∧ P.sname = S.sname)}
This query can be read as follows: “Retrieve all sailor tuples for which
there exists a tuple in Reserves, having the same value in the sid field, and
with bid = 103.” That is, for each sailor tuple, we look for a tuple in
Reserves that shows that this sailor has reserved boat 103. The answer tuple
P contains just one field, sname.
(Q2) Find the names of sailors who have reserved a red boat.
{P | ∃S ∈ Sailors ∃R ∈ Reserves(R.sid = S.sid ∧ P.sname =
S.sname ∧ ∃B ∈ Boats(B.bid = R.bid ∧ B.color = ′red′))}
This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples
R in Reserves and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color = ′red′.”
Another way to write this query, which corresponds more closely to this reading, is as
follows:
{P | ∃S ∈ Sailors ∃R ∈ Reserves ∃B ∈ Boats(R.sid = S.sid ∧ B.bid = R.bid ∧ B.color = ′red′ ∧ P.sname = S.sname)}
(Q7) Find the names of sailors who have reserved at least two boats.
{P | ∃S ∈ Sailors ∃R1 ∈ Reserves ∃R2 ∈ Reserves(S.sid = R1.sid
∧ R1.sid = R2.sid ∧ R1.bid ≠ R2.bid ∧ P.sname = S.sname)}
Contrast this query with the algebra version and see how much simpler the calculus version
is. In part, this difference is due to the cumbersome renaming of fields in the algebra version,
but the calculus version really is simpler.
(Q9) Find the names of sailors who have reserved all boats.
{P | ∃S ∈ Sailors ∀B ∈ Boats
(∃R ∈ Reserves(S.sid = R.sid ∧ R.bid = B.bid ∧ P.sname = S.sname))}
This query was expressed using the division operator in relational algebra. Notice how easily
it is expressed in the calculus. The calculus query directly reflects how we might express the
query in English: “Find sailors S such that for all boats B there is a Reserves tuple showing
that sailor S has reserved boat B.”
(Q14) Find sailors who have reserved all red boats.
{S | S ∈ Sailors ∧ ∀B ∈ Boats
(B.color = ′red′ ⇒ (∃R ∈ Reserves(S.sid = R.sid ∧ R.bid = B.bid)))}
This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor
must have reserved it. That is, for a candidate sailor, a boat being red must imply the
sailor having reserved it. Observe that since we can return an entire sailor tuple as the
answer instead of just the sailor’s name, we have avoided introducing a new free
variable (e.g., the variable P in the previous example) to hold the answer values. On
instances B1, R2, and S3, the answer contains the Sailors tuples with sids 22
and 31. We can write this query without using implication, by observing that an
expression of the form p ⇒ q is logically equivalent to ¬p ∨ q:
{S | S ∈ Sailors ∧ ∀B ∈ Boats
(B.color ≠ ′red′ ∨ (∃R ∈ Reserves(S.sid = R.sid ∧ R.bid = B.bid)))}
Domain Relational Calculus
A domain variable is a variable that ranges over the values in the domain of some attribute
(e.g., the variable can be assigned an integer if it appears in an attribute whose domain
is the set of integers). A DRC query has the form {〈x1, x2, ..., xn〉 | p(〈x1, x2, ..., xn〉)}, where
p(〈x1, x2, ..., xn〉) denotes a DRC formula whose only free variables are the variables among the xi, 1 ≤
i ≤ n. The result of this query is the set of all tuples 〈x1, x2, ..., xn〉 for which the formula evaluates to true.
A DRC formula is defined in a manner that is very similar to the definition of a TRC formula.
The main difference is that the variables are now domain variables. Let op denote an
operator in the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables. An atomic formula
in DRC is one of the following:
in DRC is one of the following:
〈 x1,x2,...,xn 〉 ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n is either a variable
or a constant.
X op Y
X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(X) denotes a formula in which the variable X appears:
any atomic formula
¬p, p ∨ q, p ∧ q, or p ⇒ q
∃X(p(X)), where X is a domain variable
∀X(p(X)), where X is a domain variable
Examples of DRC Queries
We now illustrate DRC through several examples. The reader is invited to compare these
with the TRC versions.
(Q11) Find all sailors with a rating above 7.
{〈I, N, T, A〉 | 〈I, N, T, A〉 ∈ Sailors ∧ T > 7}
This differs from the TRC version in giving each attribute a (variable) name. The condition 〈I, N, T, A〉 ∈ Sailors
ensures that the domain variables I, N, T, and A are restricted to be fields of the same tuple. In comparison with
the TRC query, we can say T > 7 instead of S.rating > 7, but we must specify the tuple 〈I, N, T, A〉 in the result, rather than just S.
(Q1) Find the names of sailors who have reserved boat 103 .
{〈N〉 | ∃I, T, A(〈I, N, T, A〉 ∈ Sailors
∧ ∃Ir, Br, D(〈Ir, Br, D〉 ∈ Reserves ∧ Ir = I ∧ Br = 103))}
(Q2) Find the names of sailors who have reserved a red boat.
{〈N〉 | ∃I, T, A(〈I, N, T, A〉 ∈ Sailors
∧ ∃Ir, Br, D(〈Ir, Br, D〉 ∈ Reserves ∧ Ir = I
∧ ∃B, BN, C(〈B, BN, C〉 ∈ Boats ∧ B = Br ∧ C = ′red′)))}
(Q9) Find the names of sailors who have reserved all boats.
{〈N〉 | ∃I, T, A(〈I, N, T, A〉 ∈ Sailors
∧ ∀B, BN, C(¬(〈B, BN, C〉 ∈ Boats)
∨ ∃Ir, Br, D(〈Ir, Br, D〉 ∈ Reserves ∧ I = Ir ∧ Br = B)))}
This query can be read as follows: “Find all values of N such that there is some tuple
〈 I, N,T,A 〉 in Sailors satisfying the following condition: for every 〈 B, BN,C 〉 , either this
is not a tuple in Boats or there is some tuple 〈 Ir, Br, D 〉 in Reserves that proves that
Sailor I has reserved boat B.” The ∀ quantifier allows the domain variables B, BN, and
C to range over all values in their respective attribute domains, and the pattern ‘¬(〈B,
BN, C〉 ∈ Boats) ∨’ is necessary to restrict attention to those values that appear in tuples
of Boats.
UNIT-III
Basic SQL Query
SELECT (DISTINCT) target-list
FROM relation-list
WHERE qualification
• relation-list A list of relation names (possibly with a range-variable after each name).
• target-list A list of attributes of relations in relation-list
• qualification Comparisons (Attr op const or Attr1 op Attr2, where op is one of <, >, =, ≤, ≥, ≠) combined using AND, OR
and NOT.
• DISTINCT is an optional keyword indicating that the answer should not contain duplicates. Default is that
duplicates are not eliminated!
An SQL query is defined in terms of the following conceptual evaluation strategy:
• Compute the cross-product of relation-list.
• Discard resulting tuples if they fail qualification.
• Delete attributes that are not in target-list.
• If DISTINCT is specified, eliminate duplicate rows.
Example:
Create table branch
(branch_name char(15),branch_city char(30),assets integer)
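Rows can then be added with insert; a sketch with made-up values:
insert into branch
values ('Perryridge', 'Horseneck', 1700000)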
Domain Types in SQL
• char(n). Fixed length character string, with user-specified length n.
• varchar(n). Variable length character strings, with user-specified maximum length n.
• int. Integer (a finite subset of the integers that is machine-dependent).
• The alter table command is used to add attributes to an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of A.
– All tuples in the relation are assigned null as the value for the new attribute.
• The alter table command can also be used to drop attributes of a relation:
alter table r drop A
• Comparison results can be combined using the logical connectives and, or, and not.
The from Clause
• The from clause lists the relations involved in the query
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND R.bid=103
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  22     101  10/10/96
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  22     101  10/10/96
31     lubber  8       55.5  58     103  11/12/96
58     rusty   10      35.0  22     101  10/10/96
58     rusty   10      35.0  58     103  11/12/96
OR
SELECT sname
FROM Sailors, Reserves
WHERE Sailors.sid=Reserves.sid AND bid=103
• What is the effect of replacing S.sid by S.sname in the SELECT clause? Would adding DISTINCT to this
variant of the query make a difference?
Expressions and Strings
SELECT S.age, age1=S.age-5, 2*S.age AS age2
FROM Sailors S
WHERE S.sname LIKE ‘B_%B’
• Illustrates use of arithmetic expressions and string pattern matching: Find triples (of ages of sailors and
two fields defined by expressions) for sailors whose names begin and end with B and contain at least three
characters.
• In relations with duplicates, SQL can define how many copies of tuples appear in the result.
• Multiset versions of some of the relational algebra operators – given multiset relations r1 and r2:
1. σθ(r1): If there are c1 copies of tuple t1 in r1, and t1 satisfies selection σθ, then there are c1 copies of t1 in
σθ(r1).
2. πA(r1): For each copy of tuple t1 in r1, there is a copy of tuple πA(t1) in πA(r1), where πA(t1) denotes the
projection of the single tuple t1.
3. r1 × r2: If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 × c2 copies of the
tuple t1.t2 in r1 × r2.
• Example: Suppose multiset relations r1(A, B) and r2(C) are as follows:
r1 = {(1, a), (2, a)}    r2 = {(2), (3), (3)}
Then πB(r1) would be {(a), (a)}, while πB(r1) × r2 would be
{(a, 2), (a, 2), (a, 3), (a, 3), (a, 3), (a, 3)}
Set Operations
• The set operations union, intersect, and except operate on relations and correspond to the relational
algebra operations
• Each of the above operations automatically eliminates duplicates; to retain all duplicates use the
corresponding multiset versions union all, intersect all and except all.
Find sid’s of sailors who’ve reserved a red or a green boat:
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid
AND (B.color=‘red’ OR B.color=‘green’)
• UNION: Can be used to compute the union of any two union-compatible sets of tuples (which are
themselves the result of SQL queries). The query above can also be written with UNION:
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
UNION
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘green’
• If we replace OR by AND in the first version, what do we get?
Also available: EXCEPT (What do we get if we replace UNION by EXCEPT?)
Find sid’s of sailors who’ve reserved both a red and a green boat:
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
INTERSECT
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘green’
• Included in the SQL/92 standard, but some systems don’t support it.
• Contrast symmetry of the UNION and INTERSECT queries with how much the other versions differ.
Nested Queries
Find names of sailors who’ve reserved boat #103:
SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
FROM Reserves R
WHERE R.bid=103)
To understand semantics of nested queries, think of a nested loops evaluation: For each Sailors tuple, check the
qualification by computing the subquery.
Nested Queries with Correlation
Find names of sailors who’ve reserved boat #103:
SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
FROM Reserves R
WHERE R.bid=103 AND S.sid=R.sid)
• EXISTS is another set comparison operator, like IN.
• If UNIQUE is used, and * is replaced by R.bid, finds sailors with at most one reservation for boat #103.
(UNIQUE checks for duplicate tuples; * denotes all attributes. Why do we have to replace * by R.bid?)
• Illustrates why, in general, subquery must be re-computed for each Sailors tuple.
Aggregate Functions
• These functions operate on the multiset of values of a column of a relation, and return a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
• Find the average account balance at the Perryridge branch.
select avg (balance)
from account
where branch_name = 'Perryridge'
Find the number of tuples in the customer relation.
select count (*)
from customer
Find the number of depositors in the bank.
select count (distinct customer_name)
from depositor
• Find the names of all branches where the average account balance is more than $1,200.
• select branch_name, avg (balance)
from account
group by branch_name
having avg (balance) > 1200
• Note: predicates in the having clause are applied after the
formation of groups whereas predicates in the where
clause are applied before forming groups
Nested Subqueries
• SQL provides a mechanism for the nesting of subqueries.
• A subquery is a select-from-where expression that is nested within another query.
• A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.
“In” Construct
• Find all customers who have both an account and a loan at the bank.
select distinct customer_name
from borrower
where customer_name in (select customer_name
from depositor)
“All” Construct
• Find the names of all branches that have greater assets than all branches located in Brooklyn.
• select branch_name
from branch
where assets > all
(select assets
from branch
where branch_city = 'Brooklyn')
“Exists” Construct
• Find all customers who have an account at all branches located in Brooklyn.
select distinct S.customer_name
from depositor as S
where not exists (
(select branch_name
from branch
where branch_city = 'Brooklyn')
except
(select R.branch_name
from depositor as T, account as R
where T.account_number = R.account_number and
S.customer_name = T.customer_name))
• Note that X – Y = Ø ⇔ X ⊆ Y
Example Query
• Find all customers who have at least two accounts at the Perryridge branch.
select distinct T.customer_name
from depositor as T
where not unique (
select R.customer_name
from account, depositor as R
where T.customer_name = R.customer_name and
R.account_number = account.account_number and
account.branch_name = 'Perryridge')
• Variable from outer level is known as a correlation variable
Modification of the Database – Updates
• Increase all accounts with balances over $10,000 by 6%, all other accounts receive 5%.
– Write two update statements:
update account
set balance = balance * 1.06
where balance > 10000
update account
set balance = balance * 1.05
where balance <= 10000
– The order is important
– Can be done better using the case statement (below)
Case Statement for Conditional Updates
• Same query as before: Increase all accounts with balances over $10,000 by 6%, all other accounts receive
5%.
update account
set balance = case
when balance <= 10000 then balance * 1.05
else balance * 1.06
end
More on Set-Comparison Operators
• We’ve already seen IN, EXISTS and UNIQUE. Can also use NOT IN, NOT EXISTS and NOT UNIQUE.
• Also available: op ANY, op ALL, op IN
• Find sailors whose rating is greater than that of some sailor called Horatio:
SELECT *
FROM Sailors S
WHERE S.rating > ANY (SELECT S2.rating
FROM Sailors S2
WHERE S2.sname=‘Horatio’)
Find sid’s of sailors who’ve reserved both a red and a green boat:
SELECT S.sid
FROM Sailors S, Boats B, Reserves R
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
AND S.sid IN (SELECT S2.sid
FROM Sailors S2, Boats B2, Reserves R2
WHERE S2.sid=R2.sid AND R2.bid=B2.bid
AND B2.color=‘green’)
• Similarly, EXCEPT queries re-written using NOT IN.
• To find names (not sid’s) of Sailors who’ve reserved both red and green boats, just replace S.sid by S.sname
in SELECT clause. (What about INTERSECT query?)
Division in SQL
Find sailors who’ve reserved all boats:
SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS ((SELECT B.bid
FROM Boats B)
EXCEPT
(SELECT R.bid
FROM Reserves R
WHERE R.sid=S.sid))
Aggregate Operators
COUNT (*)
COUNT ( [DISTINCT] A)
SUM ( [DISTINCT] A)
AVG ( [DISTINCT] A)
MAX (A)
MIN (A)
SELECT COUNT (*)
FROM Sailors S
SELECT AVG (S.age)
FROM Sailors S
WHERE S.rating=10
SELECT S.sname
FROM Sailors S
WHERE S.rating= (SELECT MAX(S2.rating)
FROM Sailors S2)
• Motivation for grouping: suppose we knew that rating values go from 1 to 10; then we could find the age of
the youngest sailor for each rating level by writing 10 queries of the following form, for i = 1, 2, ..., 10:
SELECT MIN (S.age)
FROM Sailors S
WHERE S.rating = i
Queries With GROUP BY and HAVING
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
GROUP BY grouping-list
HAVING group-qualification
• The target-list contains (i) attribute names (ii) terms with aggregate operations (e.g., MIN (S.age)).
– The attribute list (i) must be a subset of grouping-list. Intuitively, each answer tuple corresponds to
a group, and these attributes must have a single value per group. (A group is a set of tuples that
have the same value for all attributes in grouping-list.)
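For example, using the Sailors table from the earlier sections, the age of the youngest sailor for each rating level can be found in a single query:
SELECT S.rating, MIN (S.age)
FROM Sailors S
GROUP BY S.rating
Here grouping-list is S.rating: each distinct rating value forms one group, and MIN is applied within each group.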
Conceptual Evaluation
• The cross-product of relation-list is computed, tuples that fail qualification are discarded, `unnecessary’
fields are deleted, and the remaining tuples are partitioned into groups by the value of attributes in
grouping-list.
• The group-qualification is then applied to eliminate some groups. Expressions in group-qualification must
have a single value per group!
– In effect, an attribute in group-qualification that is not an argument of an aggregate op also
appears in grouping-list. (SQL does not exploit primary key semantics here!)
Null Values
• Field values in a tuple are sometimes unknown or inapplicable; SQL provides a special value null for such
situations. The presence of null complicates many issues:
– Special operators are needed to check if a value is/is not null.
– Meaning of constructs must be defined carefully. (e.g., WHERE clause eliminates rows that don’t
evaluate to true.)
– New operators (in particular, outer joins) possible/needed.
• It is possible for tuples to have a null value, denoted by null, for some of their attributes
• null signifies an unknown value or that a value does not exist.
• The predicate is null can be used to check for null values.
– Example: Find all loan number which appear in the loan relation with null values for amount.
select loan_number
from loan
where amount is null
• The result of any arithmetic expression involving null is null
– Example: 5 + null returns null
• However, aggregate functions simply ignore nulls
Null Values and Three Valued Logic
• Any comparison with null returns unknown; SQL therefore uses a three-valued logic with truth values true,
false, and unknown.
– (true or unknown) = true, (false or unknown) = unknown, (unknown or unknown) = unknown
– (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown
• The result of a where clause predicate is treated as false if it evaluates to unknown.
Joined Relations
• These additional operations are typically used as subquery expressions in the from clause
• Join condition – defines which tuples in the two relations match, and what attributes are present in the
result of the join.
• Join type – defines how tuples in each relation that do not match any tuple in the other relation (based on
the join condition) are treated.
• Derived Relations
SQL allows a subquery expression to be used in the from clause
• Find the average account balance of those branches where the average account balance is greater than
$1200.
select branch_name, avg_balance
from (select branch_name, avg (balance)
from account
group by branch_name )
as branch_avg ( branch_name, avg_balance )
where avg_balance > 1200
Note that we do not need to use the having clause, since we compute the temporary (view) relation
branch_avg in the from clause, and the attributes of branch_avg can be used directly in the where clause.
Integrity Constraints (Review)
• An IC describes conditions that every legal instance of a relation must satisfy.
• Example: an instance of Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked):
ssn          name       lot  rating  hourly_wages  hours_worked
123-22-3666  Attishoo   48   8       10            40
231-31-5368  Smiley     22   8       10            30
131-24-3650  Smethurst  35   5       7             30
434-26-3751  Guldu      35   5       7             32
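Key constraints of this kind can be declared when the table is created; a minimal sketch (the column types are assumptions):
CREATE TABLE Hourly_Emps (
    ssn           CHAR(11) PRIMARY KEY,   -- key constraint: ssn uniquely identifies a tuple
    name          VARCHAR(30),
    lot           INTEGER,
    rating        INTEGER,
    hourly_wages  REAL,
    hours_worked  INTEGER
)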
• Suppose that we have entity sets Parts, Suppliers, and Departments, as well as a relationship set Contracts
that involves all of them. We refer to the schema for Contracts as CQPSD. A contract with contract id C
specifies that a supplier S will supply some quantity Q of a part P to a department D.
• We might have a policy that a department purchases at most one part from any given supplier.
• Thus, if there are several contracts between the same supplier and department, we know that the same
part must be involved in all of them. This constraint is an FD, DS → P.
Reasoning About FDs
• Given some FDs, we can usually infer additional FDs:
– ssn → did, did → lot implies ssn → lot
• An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.
– F+ = closure of F is the set of all FDs that are implied by F.
• Armstrong’s Axioms (X, Y, Z are sets of attributes):
– Reflexivity: If X ⊇ Y, then X → Y
– Augmentation: If X → Y, then XZ → YZ for any Z
– Transitivity: If X → Y and Y → Z, then X → Z
• These are sound and complete inference rules for FDs!
• Couple of additional rules (that follow from AA):
– Union: If X → Y and X → Z, then X → YZ
– Decomposition: If X → YZ, then X → Y and X → Z
• Example: Contracts(cid, sid, jid, did, pid, qty, value), abbreviated CSJDPQV, with C the key:
– Project purchases each part using a single contract: JP → C
– Dept purchases at most one part from a supplier: SD → P
• JP → C, C → CSJDPQV imply JP → CSJDPQV
• SD → P implies SDJ → JP
• SDJ → JP, JP → CSJDPQV imply SDJ → CSJDPQV
• Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)
• Typically, we just want to check if a given FD X → Y is in the closure of a set of FDs F. An efficient check:
– Compute attribute closure of X (denoted X+) wrt F
– Check if Y is in X+
• Does F = {A → B, B → C, C D → E} imply A → E?
– i.e., is A → E in the closure F+? Equivalently, is E in A+?
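Working this example through with the attribute closure: start with A+ = {A}; A → B adds B, and B → C then adds C, but C D → E never applies because D is not in A+. So A+ = {A, B, C}, E is not in A+, and F does not imply A → E.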
Closure of a Set of FDs
• The set of all FDs implied by a given set F of FDs is called the closure of F and is denoted as F+.
• An important question is how we can infer, or compute, the closure of a given set F of FDs.
• The following three rules, called Armstrong's Axioms, can be applied repeatedly to infer all FDs implied by a
set F of FDs.
• We use X, Y, and Z to denote sets of attributes over a relation schema R:
• The attribute closure X+ of a set X of attributes is the set of attributes A such that X → A can be inferred
using the Armstrong Axioms.
• The algorithm for computing the attribute closure of X with respect to a set of FDs F is:
closure = X;
repeat until there is no change: {
if there is an FD U → V in F such that U ⊆ closure,
then set closure = closure ∪ V
}
• Every relation in BCNF is also in 3NF, every relation in 3NF is also in 2NF, and every relation in 2NF is in 1NF.
• A relation is in first normal form if every field contains only atomic values, that is, not lists or sets.
• Given A → B: several tuples could have the same A value, and if so, they’ll all have the same B value!
First Normal Form
1NF (First Normal Form)
• a relation R is in 1NF if and only if it has only single-valued attributes (atomic values)
• EMP_PROJ (SSN, PNO, HOURS, ENAME, PNAME, PLOCATION)
is not in 1NF, because PLOCATION is a multi-valued attribute
• solution: decompose the relation
EMP_PROJ2 (SSN, PNO, HOURS, ENAME, PNAME)
LOC (PNO, PLOCATION)
Second Normal Form
2NF (Second Normal Form)
• a relation R is in 2NF if and only if it is in 1NF and every nonkey column depends on a whole key, not on a
proper subset of a key
• all nonprime attributes of R must be fully functionally dependent on a whole key of the relation,
not on a part of the key
• no violation possible when every key is a single attribute or there is no nonprime attribute
• violation: part of a key → nonkey
EMP_PROJ2 (SSN, PNO, HOURS, ENAME, PNAME)
SSN → ENAME
PNO → PNAME
• solution: decompose the relation
EMP_PROJ3 (SSN, PNO, HOURS)
EMP (SSN, ENAME)
PROJ (PNO, PNAME)
Third Normal Form
3NF (Third Normal Form)
• a relation R is in 3NF if and only if it is in 2NF and no nonkey column depends on another nonkey column
• all nonprime attributes of R must be non-transitively functionally dependent on a key of the relation
• violation: nonkey → nonkey
• Intuitively, decomposing R means we will store instances of the relation schemes produced by the
decomposition, instead of instances of R.
• E.g., Can decompose SNLRWH into SNLRH and RW.
Lossless Join Decomposition
• Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if and only if F+ contains:
– X ∩ Y → X, or
– X ∩ Y → Y
• In particular, the decomposition of R into UV and R − V is lossless-join if U → V holds over R.
• Example of a lossy decomposition: consider the instance r(A, B, C):
A B C
1 2 3
4 5 6
7 2 8
Decomposing r into AB and BC and joining the projections back yields
A B C
1 2 3
4 5 6
7 2 8
1 2 8
7 2 3
Tuples not in the original instance reappear, so this decomposition is not lossless-join.
Dependency Preserving Decomposition
• Consider CSJDPQV, C is key, JP → C and SD → P.
– BCNF decomposition: CSJDQV and SDP
– Problem: Checking JP → C requires a join!
• Dependency preserving decomposition (Intuitive):
– If R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z, then all
FDs that were given to hold on R must also hold. (Avoids Problem (3).)
• Projection of set of FDs F: If R is decomposed into X, ..., the projection of F onto X (denoted FX) is the set of
FDs U → V in F+ (closure of F) such that all the attributes in U and V are in X.
• Decomposition of R into X and Y is dependency preserving if (FX ∪ FY)+ = F+
– i.e., if we consider only dependencies in the closure F+ that can be checked in X without
considering Y, and in Y without considering X, these imply all dependencies in F+.
• In general, several dependencies may cause violation of BCNF. The order in which we ``deal with’’ them
could lead to very different sets of relations!
• Similarly, decomposition of CSJDQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs
JP → C, SD → P and J → S).
– However, it is a lossless join decomposition.
– In this case, adding JPC to the collection of relations gives us a dependency preserving
decomposition.
• Refinement: Instead of the given set of FDs F, use a minimal cover for F.
Constraints on an Entity Set
• Consider the Hourly Emps relation again. The constraint that attribute ssn is a key can be expressed as an
FD:
• { ssn }-> { ssn, name, lot, rating, hourly wages, hours worked}
• For brevity, we will write this FD as S -> SNLRWH, using a single letter to denote each attribute
• The constraint that the hourly wages attribute is determined by the rating attribute is an FD: R -> W.
Constraints on a Relationship Set
• The previous example illustrated how FDs can help to refine the subjective decisions made during ER design,
• but one could argue that the best possible ER diagram would have led to the same final set of relations.
• Our next example shows how FD information can lead to a set of relations that eliminates some
redundancy problems and is unlikely to be arrived at solely through ER design.
Identifying Attributes of Entities
• in particular, it shows that attributes can easily be associated with the `wrong' entity set during ER design.
• The ER diagram shows a relationship set called Works In that is similar to the Works In relationship set
• Using the key constraint, we can translate this ER diagram into two relations:
• Workers(ssn, name, lot, did, since)
• In addition, let there be an attribute C denoting the credit card to which the reservation is charged.
• Suppose that every sailor uses a unique credit card for reservations. This constraint is expressed by the FD
S -> C. This constraint indicates that in relation Reserves, we store the credit card number for a sailor as
often as we have reservations for that sailor.
• The meaning of a tuple is that teacher T can teach course C, and book B is a recommended text for the
course.
• There are no FDs; the key is CTB. However, the recommended texts for a course are independent of the
instructor.
There are three points to note here:
• The relation schema CTB is in BCNF; thus we would not consider decomposing it further if we looked only
at the FDs that hold over CTB.
• There is redundancy. The fact that Green can teach Physics101 is recorded once per recommended text for
the course. Similarly, the fact that Optics is a text for Physics101 is recorded once per potential teacher.
• The redundancy can be eliminated by decomposing CTB into CT and CB.
• Let R be a relation schema and let X and Y be subsets of the attributes of R. Intuitively, the multivalued
dependency X →→ Y is said to hold over R if, in every legal instance of R, the set of Y values associated
with a given X value is independent of the values of the remaining attributes.
• The redundancy in this example is due to the constraint that the texts for a course are independent of the
instructors, which cannot be expressed in terms of FDs.
• This constraint is an example of a multivalued dependency, or MVD. Ideally, we should model this situation
using two binary relationship sets, Instructors with attributes CT and Text with attributes CB.
• Because these are two essentially independent relationships, modeling them with a single ternary
relationship set with attributes CTB is inappropriate.
• Three of the additional rules involve only MVDs:
• MVD Complementation: If X →→ Y, then X →→ R − XY
• MVD Augmentation: If X →→ Y and W ⊇ Z, then WX →→ YZ.
• MVD Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z − Y).
• A join dependency is a further generalization of MVDs. A join dependency (JD) ⊲⊳{R1, ..., Rn} is said to hold
over a relation R if R1, ..., Rn is a lossless-join decomposition of R.
• An MVD X →→ Y over a relation R can be expressed as the join dependency ⊲⊳{XY, X(R−Y)}
• As an example, in the CTB relation, the MVD C →→ T can be expressed as the join dependency ⊲⊳{CT, CB}
Unlike FDs and MVDs, there is no set of sound and complete inference rules for JDs.
Fifth Normal Form
• A relation schema R is said to be in fifth normal form (5NF) if for every JD ⊲⊳{R1, ..., Rn} that holds over R,
one of the following statements is true:
• Ri = R for some i, or
• The JD is implied by the set of those FDs over R in which the left side is a key for R.
• The following result, also due to Date and Fagin, identifies conditions, again detected using only FD
information, under which we can safely ignore JD information:
If a relation schema is in 3NF and each of its keys consists of a single attribute, it is also in 5NF.
Inclusion Dependencies
• MVDs and JDs can be used to guide database design, as we have seen, although they are less common than
FDs and harder to recognize and reason about.
• In contrast, inclusion dependencies are very intuitive and quite common. However, they typically have little
influence on database design
• The main point to bear in mind is that we should not split groups of attributes that participate in an
inclusion dependency.
Most inclusion dependencies in practice are key-based, that is, involve only keys.
UNIT-IV
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various data items.
E.g. transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
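In SQL, this transfer could be bracketed as one transaction; a minimal sketch, assuming an account table with columns account_number and balance:
BEGIN;
UPDATE account SET balance = balance - 50 WHERE account_number = 'A';
UPDATE account SET balance = balance + 50 WHERE account_number = 'B';
COMMIT;
If a failure occurs before COMMIT, the DBMS rolls the whole transaction back, which is exactly the atomicity requirement discussed below.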
Two main issues to deal with:
– Failures of various kinds, such as hardware failures and system crashes
– Concurrent execution of multiple transactions
Example of Fund Transfer
• Atomicity requirement
• if the transaction fails after step 3 and before step 6, money will be “lost”, leading to
an inconsistent database state. Failure could be due to software or hardware errors.
• the system should ensure that updates of a partially executed transaction are not
reflected in the database
Durability requirement — once the user has been notified that the transaction has completed (i.e.,
the transfer of the $50 has taken place), the updates to the database by the transaction must
persist even if there are software or hardware failures.
• Consistency requirement in above example:
– the sum of A and B is unchanged by the execution of the transaction
In general, consistency requirements include
Explicitly specified integrity constraints such as primary keys and foreign keys
Implicit integrity constraints
– e.g. sum of balances of all accounts, minus sum of loan amounts must equal
value of cash-in-hand
– A transaction must see a consistent database.
– During transaction execution the database may be temporarily inconsistent.
– When the transaction completes successfully the database must be consistent
Erroneous transaction logic can lead to inconsistency.
• Isolation requirement — if between steps 3 and 6, another transaction T2 is allowed to access the
partially updated database, it will see an inconsistent database (the sum A + B will be less than it
should be).
T1                                      T2
1. read(A)
2. A := A – 50
3. write(A)
                                        read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
Isolation can be ensured trivially by running transactions serially
– that is, one after the other.
However, executing multiple transactions concurrently has significant benefits, as we will see
later.
ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items.
To preserve the integrity of data the database system must ensure:
• Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
• Consistency. Execution of a transaction in isolation preserves the consistency of the database.
• Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of
other concurrently executing transactions. Intermediate transaction results must be hidden from other
concurrently executed transactions.
– That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished execution
before Ti started, or Tj started execution after Ti finished.
• Durability. After a transaction completes successfully, the changes it has made to the database persist, even
if there are system failures.
Transaction State
• Active – the initial state; the transaction stays in this state while it is executing
• Partially committed – after the final statement has been executed.
• Failed – after the discovery that normal execution can no longer proceed.
• Aborted – after the transaction has been rolled back and the database restored to its state prior to the start
of the transaction. Two options after it has been aborted:
– restart the transaction
• can be done only if no internal logical error
– kill the transaction
• Committed – after successful completion.
The Shadow-Database Scheme
• A simple way to implement atomicity and durability: all updates are made on a shadow copy of the
database, and db_pointer (which always points to the current consistent copy) is moved to the shadow
copy only after the transaction reaches partial commit and all updated pages have been flushed to disk.
– In case the transaction fails, the old consistent copy pointed to by db_pointer can be used, and the
shadow copy can be deleted.
– Assumes that only one transaction is active at a time.
– Assumes disks do not fail.
– Useful for text editors, but
• extremely inefficient for large databases (why?)
– A variant called shadow paging reduces copying of data, but is still not practical for large databases.
– Does not handle concurrent transactions.
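The essence of the scheme fits in a few lines. Below is a sketch in Python; the file names and the update_fn callback are illustrative assumptions. All updates go to a shadow copy of the database file, and “moving db_pointer” corresponds to one atomic rename.
    # Sketch of the shadow-database idea; file names are illustrative.
    import os, shutil

    def run_transaction(update_fn):
        shutil.copyfile("db.current", "db.shadow")  # take a shadow copy
        update_fn("db.shadow")                      # all updates hit the shadow copy only
        with open("db.shadow", "rb") as f:
            os.fsync(f.fileno())                    # flush the shadow copy to stable storage
        os.replace("db.shadow", "db.current")       # commit: atomically "move db_pointer"
        # If we crash before os.replace, db.current is still the old consistent
        # copy, and the shadow copy can simply be deleted.
The inefficiency for large databases is visible here: every transaction copies the entire database file.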
Concurrent Executions
• Multiple transactions are allowed to run concurrently in the system. Advantages are:
– increased processor and disk utilization, leading to better transaction throughput
• E.g. one transaction can be using the CPU while another is reading from or writing to the
disk
– reduced average response time for transactions: short transactions need not wait behind long ones.
• Concurrency control schemes – mechanisms to achieve isolation
– that is, to control the interaction among the concurrent transactions in order to prevent them from
destroying the consistency of the database
These schemes are studied later, after the notion of correctness of concurrent executions has been made precise.
Schedules
• Schedule – a sequence of instructions that specifies the chronological order in which the instructions of
concurrent transactions are executed
– a schedule for a set of transactions must consist of all instructions of those transactions
– must preserve the order in which the instructions appear in each individual transaction.
• A transaction that successfully completes its execution will have a commit instruction as its last statement
• Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
• Schedule 1 – a serial schedule in which T1 is followed by T2.
• Schedule 2 – a serial schedule in which T2 is followed by T1.
• Schedule 3 – not a serial schedule: the instructions of T1 and T2 are interleaved, but it is equivalent to
Schedule 1 (the sum A + B is preserved).
• Schedule 4 – a concurrent schedule that does not preserve the sum A + B, and so is not equivalent to
any serial schedule.
Serializability
• Basic Assumption – Each transaction preserves database consistency.
• Thus serial execution of a set of transactions preserves database consistency.
• A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different
forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
• Simplified view of transactions
– We ignore operations other than read and write instructions
– We assume that transactions may perform arbitrary computations on data in local buffers in
between reads and writes.
– Our simplified schedules consist of only read and write instructions.
Conflicting Instructions
• Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some
item Q accessed by both li and lj, and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
• Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
– If li and lj are consecutive in a schedule and they do not conflict, their results would remain
the same even if they had been interchanged in the schedule.
Conflict Serializability
• If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting
instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
• Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1, by series of
swaps of non-conflicting instructions.
– Therefore Schedule 3 is conflict serializable.
• Example of a schedule that is not conflict serializable: T3 reads Q, T4 then writes Q, and T3 writes Q
afterwards. We are unable to swap instructions in this schedule to obtain either the serial schedule
< T3, T4 > or the serial schedule < T4, T3 >.
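The standard test for conflict serializability builds a precedence graph: add an edge Ti → Tj whenever an instruction of Ti conflicts with a later instruction of Tj; the schedule is conflict serializable if and only if the graph is acyclic. A small self-contained sketch in Python (the schedule encoding as triples is an assumption):
    # Conflict-serializability test via a precedence graph.
    # A schedule is a list of (txn, op, item) triples, e.g. ("T1", "R", "A").

    def conflicts(op1, op2):
        # two operations on the same item conflict unless both are reads
        return op1 == "W" or op2 == "W"

    def is_conflict_serializable(schedule):
        edges = set()
        for i, (ti, op_i, x) in enumerate(schedule):
            for tj, op_j, y in schedule[i + 1:]:
                if ti != tj and x == y and conflicts(op_i, op_j):
                    edges.add((ti, tj))          # Ti must precede Tj
        nodes = {t for t, _, _ in schedule}
        # Kahn's algorithm: repeatedly remove nodes with no incoming edges
        while nodes:
            free = [n for n in nodes if not any(v == n for _, v in edges)]
            if not free:
                return False                     # a cycle remains: not serializable
            nodes -= set(free)
            edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
        return True

    # A Schedule-3-style interleaving of T1 and T2 on items A and B:
    s = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
         ("T1","R","B"), ("T1","W","B"), ("T2","R","B"), ("T2","W","B")]
    print(is_conflict_serializable(s))  # True: equivalent to the serial schedule <T1, T2>
Any topological order of an acyclic precedence graph gives an equivalent serial schedule.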
View Serializability
• Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the
following three conditions are met, for each data item Q,
1. If in schedule S, transaction Ti reads the initial value of Q, then in schedule S’ also
transaction Ti must read the initial value of Q.
2. If in schedule S transaction Ti executes read(Q), and that value was produced by transaction
Tj (if any), then in schedule S’ also transaction Ti must read the value of Q that was
produced by the same write(Q) operation of transaction Tj .
3. The transaction (if any) that performs the final write(Q) operation in schedule S must also
perform the final write(Q) operation in schedule S’.
As can be seen, view equivalence is based purely on reads and writes.
• A schedule S is view serializable if it is view equivalent to a serial schedule.
• Every conflict serializable schedule is also view serializable.
• There are schedules that are view serializable but not conflict serializable; every such schedule contains
blind writes, i.e. transactions that write a data item without having read it.
• Some schedules produce the same final outcome as a serial schedule without being view serializable;
determining such equivalence requires analysis of operations other than read and write.
Recoverable Schedules
Need to address the effect of transaction failures on concurrently running transactions.
• Recoverable schedule — if a transaction Tj reads a data item previously written by a transaction Ti ,
then the commit operation of Ti appears before the commit operation of Tj.
• The following schedule (Schedule 11) is not recoverable: T9 reads the value of A written by T8, and T9
commits immediately after the read.
• If T8 should then abort, T9 would have read (and possibly shown to the user) an inconsistent database
state. Hence, the database must ensure that schedules are recoverable.
Cascading Rollbacks
Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider the
following schedule where none of the transactions has yet committed (so the schedule is recoverable)
• If T10 fails, T11 and T12 must also be rolled back.
• Different concurrency control protocols provide different tradeoffs between the amount of concurrency
they allow and the amount of overhead that they incur.
Tests for serializability help us understand why a concurrency control protocol is correct.
Weak Levels of Consistency
• Some applications are willing to live with weak levels of consistency, allowing schedules that are not
serializable
– E.g. a read-only transaction that wants to get an approximate total balance of all accounts
– E.g. database statistics computed for query optimization can be approximate (why?)
Lock-Based Protocols
• A lock is a mechanism to control concurrent access to a data item. Data items can be locked in two modes:
– exclusive (X) mode: the item can be both read and written; obtained with lock-X
– shared (S) mode: the item can only be read; obtained with lock-S
• S-locks on an item are compatible with each other; an X-lock is incompatible with every other lock, so a
requesting transaction may have to wait.
Deadlock Example
• Suppose T3 holds an X-lock on B and T4 holds an S-lock on A. Neither T3 nor T4 can make progress –
executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes
T3 to wait for T4 to release its lock on A.
• Such a situation is called a deadlock.
To handle a deadlock one of T3 or T4 must be rolled back
and its locks released.
• The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
• Starvation is also possible if concurrency control manager is badly designed. For example:
– A transaction may be waiting for an X-lock on an item, while a sequence of other
transactions request and are granted an S-lock on the same item.
– The same transaction is repeatedly rolled back due to deadlocks.
• Concurrency control manager can be designed to prevent starvation.
The Two-Phase Locking Protocol
• This is a protocol which ensures conflict-serializable schedules.
• Phase 1: Growing Phase
– transaction may obtain locks
– transaction may not release locks
• Phase 2: Shrinking Phase
– transaction may release locks
– transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be serialized in the order
of their lock points (i.e. the point where a transaction acquired its final lock).
• Two-phase locking does not ensure freedom from deadlocks
• Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol
called strict two-phase locking. Here a transaction must hold all its exclusive locks till it
commits/aborts.
• Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this
protocol transactions can be serialized in the order in which they commit.
• There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.
• However, in the absence of extra information (e.g., ordering of access to data), two-phase locking
is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses
two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
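The two-phase rule can be checked mechanically on a trace of lock operations: once a transaction has released any lock, it may not acquire another. A small checker sketch in Python (the trace encoding is an assumption):
    # Check that every transaction in a trace obeys two-phase locking.
    # trace: list of (txn, action) pairs, with action in {"lock", "unlock"}.

    def obeys_2pl(trace):
        shrinking = set()              # transactions already past their lock point
        for txn, action in trace:
            if action == "unlock":
                shrinking.add(txn)
            elif action == "lock" and txn in shrinking:
                return False           # a lock requested in the shrinking phase
        return True

    print(obeys_2pl([("T1","lock"), ("T1","lock"), ("T1","unlock"), ("T1","unlock")]))  # True
    print(obeys_2pl([("T1","lock"), ("T1","unlock"), ("T1","lock")]))                   # False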
Lock Conversions
• Two-phase locking with lock conversions:
– First Phase:
– can acquire a lock-S on item
– can acquire a lock-X on item
– can convert a lock-S to a lock-X (upgrade)
– Second Phase:
– can release a lock-S
– can release a lock-X
– can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability, but it still relies on the programmer to insert the various locking
instructions.
Automatic Acquisition of Locks
• A transaction Ti issues the standard read/write instruction, without explicit locking calls.
• The operation read(D) is processed as:
if Ti has a lock on D
then
read(D)
else begin
if necessary wait until no other
transaction has a lock-X on D
grant Ti a lock-S on D;
read(D)
end
• write(D) is processed as:
if Ti has a lock-X on D
then
write(D)
else begin
if necessary wait until no other transaction has any lock on D;
if Ti has a lock-S on D
then
upgrade lock on D to lock-X
else
grant Ti a lock-X on D
write(D)
end;
• All locks are released after commit or abort
Implementation of Locking
• A lock manager can be implemented as a separate process to which transactions send lock and
unlock requests
• The lock manager replies to a lock request by sending a lock grant message (or a message asking
the transaction to roll back, in case of a deadlock)
• The requesting transaction waits until its request is answered
• The lock manager maintains a data-structure called a lock table to record granted locks and pending
requests
• The lock table is usually implemented as an in-memory hash table indexed on the name of the data
item being locked.
Lock Table
• Black rectangles indicate granted locks, white ones indicate waiting requests
• Lock table also records the type of lock granted or requested
• New request is added to the end of the queue of requests for the data item, and granted if it is
compatible with all earlier locks
• Unlock requests result in the request being deleted, and later requests are checked to see if they
can now be granted
• If a transaction aborts, all waiting or granted requests of the transaction are deleted
– to implement this efficiently, the lock manager may keep a list of locks held by each transaction
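A toy, single-threaded sketch of such a lock table in Python (the function names and queue layout are assumptions, and actual waiting/deadlock handling is omitted): each data item maps to a FIFO queue of requests, and a request is granted only if it is compatible with all earlier requests in the queue.
    # Toy lock table: data item -> FIFO queue of [txn, mode, granted] requests.
    from collections import defaultdict

    lock_table = defaultdict(list)

    def compatible(m1, m2):
        return m1 == "S" and m2 == "S"          # only S is compatible with S

    def request(txn, mode, item):
        queue = lock_table[item]
        granted = all(compatible(mode, m) for _, m, _ in queue)
        queue.append([txn, mode, granted])
        return granted                          # False: the transaction must wait

    def release(txn, item):
        queue = lock_table[item]
        queue[:] = [r for r in queue if r[0] != txn]
        for i, r in enumerate(queue):           # re-check waiters in FIFO order
            r[2] = all(compatible(r[1], m) for _, m, _ in queue[:i])

    print(request("T1", "S", "A"))              # True  - granted
    print(request("T2", "X", "A"))              # False - waits behind T1's S-lock
    release("T1", "A")
    print(lock_table["A"])                      # [['T2', 'X', True]] - now granted
Granting a request only when it is compatible with every earlier request in the queue (not just the granted ones) gives the FIFO fairness described above and prevents starvation of X-lock requests.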
Graph-Based Protocols
• Graph-based protocols are an alternative to two-phase locking
• Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.
– If di → dj, then any transaction accessing both di and dj must access di before accessing dj.
– This implies that the set D may now be viewed as a directed acyclic graph, called a database
graph.
• The tree-protocol is a simple kind of graph protocol.
Tree Protocol
• Only exclusive locks are allowed.
• The first lock by Ti may be on any data item; subsequently, Ti can lock a data item Q only if Ti
currently holds a lock on the parent of Q.
• Data items may be unlocked at any time, but a data item that has been locked and unlocked by Ti
cannot subsequently be relocked by Ti.
Recovery with Fuzzy Checkpoints
• When recovering using a fuzzy checkpoint, start the scan from the checkpoint record pointed to by
last_checkpoint
– Log records before last_checkpoint have their updates reflected in database on disk, and
need not be redone.
– Incomplete checkpoints, where system had crashed while performing checkpoint, are
handled safely
ARIES
• ARIES is a state-of-the-art recovery method
– Incorporates numerous optimizations to reduce overheads during normal processing and to
speed up recovery
– The “advanced recovery algorithm” we studied earlier is modeled after ARIES, but greatly
simplified by removing optimizations
• Unlike the advanced recovery algorithm, ARIES
– Uses log sequence number (LSN) to identify log records
• Stores LSNs in pages to identify what updates have already been applied to a
database page
– Physiological redo
– Dirty page table to avoid unnecessary redos during recovery
– Fuzzy checkpointing that only records information about dirty pages, and does not require
dirty pages to be written out at checkpoint time
• More coming up on each of the above …
ARIES Optimizations
• Physiological redo
– Affected page is physically identified, action within page can be logical
• Used to reduce logging overheads
– e.g. when a record is deleted and all other records have to be moved to fill the
hole
» Physiological redo can log just the record deletion
» Physical redo would require logging of old and new values for much of
the page
• Requires page to be output to disk atomically
– Easy to achieve with hardware RAID, also supported by some disk systems
– Incomplete page output can be detected by checksum techniques,
» But extra actions are required for recovery
» Treated as a media failure
ARIES Data Structures
• ARIES uses several data structures
– Log sequence number (LSN) identifies each log record
• Must be sequentially increasing
• Typically an offset from beginning of log file to allow fast access
– Easily extended to handle multiple log files
– Page LSN
– Log records of several different types
– Dirty page table
ARIES Data Structures: Page LSN
• Each page contains a PageLSN which is the LSN of the last log record whose effects are reflected on
the page
– To update a page:
• X-latch the page, and write the log record
• Update the page
• Record the LSN of the log record in PageLSN
• Unlock page
– To flush page to disk, must first S-latch page
• Thus page state on disk is operation consistent
– Required to support physiological redo
– PageLSN is used during recovery to prevent repeated redo
• Thus ensuring idempotence
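The idempotence argument can be made concrete. A sketch in Python (the page and log-record layouts are illustrative assumptions): during redo, a logged update is reapplied only when the page's PageLSN shows it has not yet been applied.
    # Sketch: PageLSN-based redo; data structures are illustrative.

    class Page:
        def __init__(self):
            self.page_lsn = 0          # LSN of the last log record applied to this page
            self.data = {}

    def redo(log, pages):
        # log: list of (lsn, page_id, key, new_value), in LSN order
        for lsn, page_id, key, value in log:
            page = pages[page_id]
            if page.page_lsn < lsn:    # update not yet reflected on the page
                page.data[key] = value
                page.page_lsn = lsn    # remember that this update has been applied
            # else: skip - this is what makes repeated recovery passes idempotent

    pages = {1: Page()}
    redo([(10, 1, "x", "v1"), (20, 1, "x", "v2")], pages)
    redo([(10, 1, "x", "v1"), (20, 1, "x", "v2")], pages)   # second pass is a no-op
    print(pages[1].data, pages[1].page_lsn)                 # {'x': 'v2'} 20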
ARIES Data Structures: Log Record
• Each log record contains the LSN of the previous log record of the same transaction (PrevLSN), so that
all log records of a transaction can be traversed backwards, e.g. during rollback.
UNIT-V
Data on External Storage – Alternative File Organizations
• Many alternatives exist, each good in some situation:
– Heap (random order) files: suitable when typical access is a file scan retrieving all records.
– Sorted files: best if records must be retrieved in some order, or if only a `range’ of records is
needed.
– Indexes: data structures to organize records via trees or hashing.
• Like sorted files, they speed up searches for a subset of records, based on values in
certain (“search key”) fields.
• Updates are much faster than in sorted files.
Index Classification
• Primary vs. secondary: If search key contains primary key, then called primary index.
• Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries,
then called clustered index.
– Alternative 1 implies clustered; in practice, clustered also implies Alternative 1 (since sorted
files are rare).
– Cost of retrieving data records through index varies greatly based on whether index is
clustered or not!
Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the data records are
stored in a Heap file.
– To build clustered index, first sort the Heap file (with some free space on each
page for future inserts).
– Overflow pages may be needed for inserts. (Thus, order of data recs is `close
to’, but not identical to, the sort order.)
Indexes
• An index on a file speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the search key for an index on the
relation.
– Search key is not the same as key (minimal set of fields that uniquely identify a
record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all
data entries k* with a given key value k.
– Given data entry k*, we can find record with key k in at most one disk I/O.
(Details soon …)
B+ Tree Indexes
Leaf pages contain data entries, and are chained (prev & next)
Non-leaf pages have index entries; only used to direct searches
Example B+ Tree
• Choice of alternative for data entries is orthogonal to the indexing technique used to locate data
entries with a given key value k.
– Typically, index contains auxiliary information that directs searches to the desired data
entries
• Alternative 1:
– If this is used, index structure is a file organization for data records (instead of a Heap file or
sorted file).
– At most one index on a given collection of data records can use Alternative 1. (Otherwise,
data records are duplicated, leading to redundant storage and potential inconsistency.)
– If data records are very large, # of pages containing data entries is high. Implies size of
auxiliary information in the index is also large, typically.
• Alternatives 2 and 3:
– Data entries typically much smaller than data records. So, better than Alternative 1 with
large data records, especially if search keys are small. (Portion of index structure used to
direct search, which depends on size of data entries, is much smaller than with Alternative
1.)
– Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even
if search keys are of fixed length.
– Measuring number of page I/O’s ignores gains of pre-fetching a sequence of pages; thus,
even I/O cost is only approximated.
• The file organizations compared:
– heap file
– sorted file
– clustered B+ tree file (Alternative 1) on search key <age, sal>
– heap file with unclustered B+ tree index on search key <age, sal>
– heap file with unclustered hash index on search key <age, sal>
Operations to Compare
• Equality search
• Range selection
• Insert a record
• Delete a record
Assumptions in the Analysis
• Heap files: equality selection on key; exactly one match.
• Sorted files: files compacted after deletions.
• Indexes (Alternatives 2 and 3): data entry size = 10% of record size.
– Hash: no overflow buckets; 80% page occupancy, so index file size = 1.25 × data size.
– Tree: 67% occupancy (typical), so index file size = 1.5 × data size.
• Scans: leaf levels of a tree index are chained; for unclustered indexes, the index data entries plus the
actual file are scanned.
• Range searches:
– We use tree indexes to restrict the set of data records fetched, but ignore hash indexes.
Cost of Operations
(B = number of data pages, R = number of records per page, D = average time to read or write a page;
costs as given in the Ramakrishnan & Gehrke text)
File type               (a) Scan        (b) Equality       (c) Range                           (d) Insert          (e) Delete
Heap file               BD              0.5 BD             BD                                  2D                  Search + D
Sorted file             BD              D log2 B           D(log2 B + # matching pages)        Search + BD         Search + BD
Clustered file          1.5 BD          D logF 1.5B        D(logF 1.5B + # matching pages)     Search + D          Search + D
Unclustered tree index  BD(R + 0.15)    D(1 + logF 0.15B)  D(logF 0.15B + # matching records)  D(3 + logF 0.15B)   Search + 2D
Unclustered hash index  BD(R + 0.125)   2D                 BD                                  4D                  Search + 2D
Understanding the Workload
• For each query in the workload:
– Which relations does it access? Which attributes are retrieved?
– Which attributes are involved in selection/join conditions? How selective are these
conditions likely to be?
• For each update in the workload:
– The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
Choice of Indexes
– Which relations should have indexes? What field(s) should be the search key? Should
we build several indexes?
– Clustered? Hash/tree?
• One approach: Consider the most important queries in turn. Consider the best plan using the
current indexes, and see if a better plan is possible with an additional index. If so, create it.
– Obviously, this implies that we must understand how a DBMS evaluates queries and
creates query evaluation plans!
• Before creating an index, must also consider the impact on updates in the workload!
– Trade-off: Indexes can make queries go faster, updates slower. Require disk space, too.
Index Selection Guidelines
• Attributes mentioned in a WHERE clause are candidates for index keys.
– An exact-match condition suggests a hash index; a range query suggests a tree index.
• Clustering is especially useful for range queries; it can also help on equality queries if
there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several
conditions.
– Such indexes can sometimes enable index-only strategies for important queries; for
index-only strategies, clustering is not important.
Examples of Clustered Indexes
• Consider the three queries on Emp below.
– A B+ tree index on E.age can be used to get qualifying tuples for the first query; how much it
helps depends on how selective the condition is and on whether the index is clustered.
– For the second query, if many tuples have E.age > 10, using the E.age index and sorting the
retrieved tuples may be costly; a clustered E.dno index may be better.
– For the third (equality) query, a clustered index on E.hobby helps if many employees share the
same hobby.
SELECT E.dno
FROM Emp E
WHERE E.age > 40
SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age > 10
GROUP BY E.dno
SELECT E.dno
FROM Emp E
WHERE E.hobby = 'Stamps'
Indexes with Composite Search Keys
• Composite search keys contain more than one field, e.g. <sal, age>.
– Equality query: every field value is equal to a constant, e.g. w.r.t. a <sal,age> index: sal=75 and age=20.
– Range query: some field value is not a constant, e.g. sal > 10 and age=20.
• Data entries in the index are sorted by the search key to support range queries:
– lexicographic order, or
– spatial order.
• To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better
than an index on age or an index on sal.
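Lexicographic ordering is the reason: in an <age,sal> index, all entries with age=30 are contiguous and sorted by sal, so such a query is one range scan. A sketch in Python with a sorted list standing in for the index (the values are illustrative):
    # Sketch: composite-key data entries in lexicographic <age, sal> order.
    import bisect

    entries = sorted([(25, 3000), (30, 2000), (30, 4000), (30, 5000), (31, 1000)])

    # age = 30 AND 3000 <= sal <= 5000 is one contiguous range in <age,sal> order
    lo = bisect.bisect_left(entries, (30, 3000))
    hi = bisect.bisect_right(entries, (30, 5000))
    print(entries[lo:hi])        # [(30, 4000), (30, 5000)]
    # In <sal,age> order the matches would be scattered, so <age,sal> is better here.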
Summary
• Many alternative file organizations exist, each appropriate in some situation.
• If selection queries are frequent, sorting the file or building an index is important.
• Index is a collection of data entries plus a way to quickly find entries with given key values.
• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
– Choice orthogonal to indexing technique used to locate data entries with a given key
value.
• Can have several indexes on a given file of data records, each with a different search key.
• Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs.
sparse. Differences have important consequences for utility/performance.
Introduction
• As for any index, there are three alternatives for data entries k*; the choice is orthogonal to the indexing technique used to locate data entries k*.
• Tree-structured indexing techniques support both range searches and equality searches.
• ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Range Searches
• ``Find all students with gpa > 3.0’’
– If the data is in a file sorted on gpa, do a binary search to find the first such student, then scan to
find the others.
– The cost of the binary search can still be quite high; a simple idea is to create an `index’ file with
one entry per page of the data file, and binary search the (much smaller) index file instead.
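As a sketch in Python (the gpa values are illustrative), the sorted-file strategy is a binary search for the first qualifying record followed by a sequential scan:
    # Sketch: range search on a file sorted by gpa.
    import bisect

    gpas = [2.1, 2.8, 3.0, 3.2, 3.5, 3.9]     # file sorted on gpa
    first = bisect.bisect_right(gpas, 3.0)     # binary search: first gpa > 3.0
    print(gpas[first:])                        # [3.2, 3.5, 3.9] - then just scan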
ISAM
• Index file may still be quite large. But we can apply the idea repeatedly!
Comments on ISAM
• File creation: Leaf (data) pages allocated sequentially, sorted by search key; then index pages
allocated, then space for overflow pages.
• Index entries: <search key value, page id>; they `direct’ search for data entries, which are in
leaf pages.
• Search: start at the root; use key comparisons to go to a leaf. Cost = logF N, where F = # entries per
index page and N = # leaf pages.
• Insert: Find leaf data entry belongs to, and put it there.
• Delete: Find and remove from leaf; if empty overflow page, de-allocate.
• Each node can hold 2 entries; no need for `next-leaf-page’ pointers. (Why?)
After Inserting 23*, 48*, 41*, 42* ...
• Minimum 50% occupancy (except for the root). Each node contains m entries, where d <= m <= 2d.
The parameter d is called the order of the tree.
• Typical order: 100; typical fill-factor: 67%, giving an average fanout of 133.
• Typical capacities: height 4: 133^4 ≈ 313 million records; height 3: 133^3 ≈ 2.4 million records.
• Insert: find the correct leaf and put the data entry there. If the leaf overflows, split it, redistributing
entries evenly, and copy up the middle key.
– To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with
leaf splits.)
• Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
• Note difference between copy-up and push-up; be sure you understand the reasons for this.
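The copy-up / push-up difference in miniature, as a Python sketch (representing a node as a sorted key list is an assumption): a leaf's middle key must be copied up because every data entry has to remain in some leaf, while an index node's middle key can be pushed up, since index entries only direct searches.
    # Sketch of the two kinds of B+ tree node split (node layout is an assumption).

    def split_leaf(keys):
        # leaf split: the middle key is COPIED up - it must stay in a leaf,
        # because leaves hold the actual data entries
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]
        return left, right, right[0]             # right[0] is copied into the parent

    def split_index(keys):
        # index-node split: the middle key is PUSHED up - it only guides
        # searches, so it need not remain in either half
        mid = len(keys) // 2
        return keys[:mid], keys[mid + 1:], keys[mid]

    print(split_leaf([2, 3, 5, 7, 8]))       # ([2, 3], [5, 7, 8], 5): 5 stays in the right leaf
    print(split_index([5, 13, 17, 24, 30]))  # ([5, 13], [24, 30], 17): 17 moves up only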
Example B+ Tree After Inserting 8*
In this example, we can avoid split by re-distributing entries; however, this is usually not
done in practice.
• Try to re-distribute, borrowing from sibling (adjacent node with same parent as
L).
Example Tree After (Inserting 8*, Then) Deleting 19* and 20*...
• Deleting 19* is easy: remove it from the leaf, which remains at least half full.
• Deleting 20* is done with re-distribution from a sibling. Notice how the middle key is copied up.
• If re-distribution fails (the sibling has no entry to spare), the leaves must be merged.
• Observe the `toss’ of the index entry (on the right) and the `pull down’ of the index entry (below).
UNIT WISE QUESTION BANK
Unit I