Information and DBM Module
Mattu University
College of Engineering and Technology
Department of Information Technology
Mattu – Ethiopia
March 2025
Module Name: Database System and Information Management
1. Introduction
1.1. File based system
File-based systems were an early attempt to computerize the manual filing system. A file-based
system is a collection of application programs that perform services for the end-users. Each
program defines and manages its own data. It is a method of handling and organizing data in which
information is stored in separate files, often organized in a hierarchical or flat structure. Each file
typically contains a specific type of data related to a particular application or aspect of the system.
In this approach, data is managed directly by the applications themselves, without the use of a
centralized system to oversee data storage and retrieval.
The file-based approach has a number of limitations:
Separation and isolation of data: When data is isolated in separate files, it is more difficult
to access data that should be readily available. The application programmer is required to
synchronize the processing of two or more files to ensure the correct data is extracted.
Duplication of data: When employing the decentralized file-based approach, uncontrolled
duplication of data occurs. Uncontrolled duplication of data is undesirable because:
o Duplication is wasteful
o Duplication can lead to loss of data integrity
Data dependence: Using file-based system, the physical structure and storage of the data files
and records are defined in the application program code. This characteristic is known as
program-data dependence. Making changes to an existing structure is rather difficult and will
lead to modification of the application programs. Such maintenance activities are time-consuming and subject
to error.
Incompatible file formats: The structures of the files are dependent on the application
programming language. A file structure provided by one programming language, such as the
direct or indexed-sequential files available in COBOL, may be different from the structure
generated by another programming language such as C. This incompatibility makes the files
difficult to process jointly.
Fixed queries / proliferation of application programs: File-based systems are very
dependent upon the application programmer. Any required queries or reports have to be written
by the application programmer. Normally, a fixed-format query or report can be provided, and no
facility for ad hoc queries is offered. File-based systems also place tremendous pressure on data
processing staff, with users' complaints about programs that are inadequate or inefficient in
meeting their demands. Documentation may be limited and maintenance of the system is difficult.
Provision for security, integrity and recovery capability is very limited.
In order to overcome the limitations of the file-based approach, the concept of the database and the
Database Management System (DBMS) emerged in the 1960s.
A database is a collection of interrelated data that helps in the efficient retrieval, insertion, and
deletion of data from the database and organizes the data in the form of tables, views, schemas,
reports, etc. For Example, a university database organizes the data about students, faculty, admin
staff, etc. which helps in the efficient retrieval, insertion, and deletion of data from it. It represents
some aspect of the real world, sometimes called the mini world or the universe of discourse (UoD).
In the context of websites and applications, when people refer to a “database” they are often talking
about a computer program that allows them to interact with their database. These programs, known
more formally as a database management system (DBMS), are often installed on a virtual private
server and accessed remotely. Redis, MariaDB, and PostgreSQL are a few examples of open-
source DBMSs.
Advantages of Database System
Control of data redundancy: The database approach attempts to eliminate the redundancy by
integrating the files. Although the database approach does not eliminate redundancy entirely, it
controls the amount of redundancy inherent in the database.
Data consistency: By eliminating or controlling redundancy, the database approach reduces
the risk of inconsistencies occurring. It ensures all copies of the data are kept consistent.
More information from the same amount of data: With the integration of the operational data
in the database approach, it may be possible to derive additional information from the same data.
Sharing of data: Database belongs to the entire organization and can be shared by all
authorized users.
Improved data integrity: Database integrity refers to the validity and consistency of stored
data. Integrity is usually expressed in terms of constraints, which are consistency rules that the
database is not permitted to violate.
Improved security: The database approach provides protection of the data from unauthorized
users. It may take the form of user names and passwords to identify user types and their access
rights in operations including retrieval, insertion, updating and deletion.
Enforcement of standards: The integration of the database enforces the necessary standards
including data formats, naming conventions, documentation standards, update procedures and
access rules.
Improved maintenance: The database approach provides data independence. Because a change
to the data structure in the database need not affect the application programs, it simplifies
database application maintenance.
i) Requirements Analysis
During this step, the database designers have to interview the customers (database users) to
understand the proposed system and obtain and document the data and functional requirements.
The result of this step is a document that includes the detailed requirements provided by the users.
Organization experts (managers, accountants, users) are necessarily involved in the creation of the
database.
Generally,
Objective: Understand the business needs and define what the database should achieve.
Key Activities:
ii) Conceptual Design
The conceptual data model provides a shared, formal representation of what is being
communicated between clients and developers during database development. It is focused on the
data in a database, irrespective of the eventual use of that data in user processes or implementation
of the data in specific computer environments. Therefore, a conceptual data model is concerned
with the meaning and structure of data, but not with the details affecting how they are implemented.
The conceptual data model then is a formal representation of what data a database should contain
and the constraints the data must satisfy. This should be expressed in terms that are independent
of how the model may be implemented.
iii) Logical Design
Database design starts with a conceptual data model and produces a specification of a logical
schema; this will determine the specific type of database system (network, relational, object-
oriented) that is required. The relational representation is still independent of any specific DBMS;
it is another conceptual data model. We can use a relational representation of the conceptual data
model as input to the logical design process. The output of this stage is a detailed relational
specification, the logical schema, of all the tables and constraints needed to satisfy the description
of the data in the conceptual data model. It is during this design activity that choices are made as
to which tables are most appropriate for representing the data in a database. These choices must
take into account various design criteria including, for example, flexibility for change, control of
duplication and how best to represent the constraints. It is the tables defined by the logical schema
that determine what data are stored and how they may be manipulated in the database.
Key Activities:
Outcome: Logical schema independent of the specific database management system (DBMS).
iv) Physical Design
Objective: Optimize the database for a specific DBMS and hardware environment.
Key Activities:
1.2.2. Roles in database design environment
Database design involves multiple roles, each contributing to the effective development,
implementation, and maintenance of a database system. Here are the primary roles in a database
design environment:
Database Administrator (DBA)
1. Installation and Configuration:
o Set up database management systems (DBMS) such as Oracle, SQL Server, MySQL, or
PostgreSQL.
2. Performance Monitoring and Tuning:
o Regularly monitor database performance and optimize queries, indexes, and configurations to
maintain efficiency.
3. Data Security:
o Implement access controls, encryption, and other measures to protect sensitive data from
unauthorized access or breaches.
4. Backup and Recovery:
o Plan for disaster recovery to minimize data loss in case of hardware failure or cyberattacks.
5. Database Maintenance:
o Perform routine maintenance tasks such as updating software, reorganizing indexes, and
ensuring the database is running smoothly.
6. User Support:
7. Capacity Planning:
o Forecast future database storage needs and plan for upgrades or expansions.
8. Compliance:
o Ensure databases comply with regulatory standards and industry best practices.
Technical Skills:
Tools:
Soft Skills:
o Communication skills to work with cross-functional teams and explain complex concepts.
Importance of a DBA
A DBA is essential for:
1. Data Availability: Ensures databases are accessible and responsive to meet business
needs.
5. Strategic Planning: Supports organizational growth by planning for future data needs.
Database Designer
A Database Designer is a professional responsible for designing the structure and architecture of
a database to meet the specific needs of an organization or application. They play a crucial role in
the development lifecycle of a database, ensuring that it is efficient, scalable, and capable of storing
and retrieving data effectively.
1. Requirement Analysis:
2. Schema Design:
o Create the conceptual, logical, and physical designs of the database. This includes defining
tables, relationships, constraints, and indexes.
3. Normalization:
o Organize data to reduce redundancy and improve data integrity while maintaining efficient
access.
4. Data Modeling:
5. Collaboration:
o Work closely with database administrators, application developers, and business analysts to
ensure the database supports application requirements and business objectives.
6. Optimization:
o Design databases with performance and scalability in mind, considering factors like indexing,
query optimization, and data access patterns.
7. Documentation:
o Provide clear documentation for the database design, including data dictionaries, schemas, and
entity-relationship diagrams
Tools:
o Data modeling tools like IBM InfoSphere, SAP PowerDesigner, or Toad Data Modeler.
Soft Skills:
o Data Integrity: Ensures that the database consistently represents accurate and reliable data.
o Maintainability: Simplifies troubleshooting, updates, and modifications.
1. External Level (User View)
Purpose: Represents how individual users or user groups view the data.
Features:
o Each user has a customized view tailored to their specific requirements, often called the
external schema.
o Hides irrelevant details and presents only the data of interest to the user.
o Ensures data security and privacy by restricting access to specific subsets of the database.
Examples: A financial analyst sees sales reports, while a customer support representative sees
customer profiles.
2. Conceptual Level (Logical View)
Purpose: Represents the entire database logically as a unified view, independent of physical
storage details.
Features:
o Provides a global view of the data, including relationships, constraints, and rules.
o Defines the conceptual schema, which abstracts the data structure and relationships in
terms of entities, attributes, and constraints.
o Ensures data independence: changes at the internal level do not affect this level.
3. Internal Level (Physical View)
Features:
o Defines the internal schema, specifying storage structures, file layouts, indexes, and access
paths.
2. Data Independence:
Logical independence: The ability to change the conceptual schema without affecting
external schemas.
Physical independence: The ability to change the internal schema without affecting the
conceptual schema.
3. Flexibility and Scalability: Supports multiple user views and accommodates changes at
different levels without disrupting the overall system.
4. Security and Privacy: External schemas restrict user access to specific parts of the
database.
Since the 1970s, most DBMSs have been designed around the relational model. The most
fundamental elements in the relational model are relations, which users and modern relational
DBMSs (RDBMSs or relational databases) recognize as tables. A relation is a set of tuples, or
rows in a table, with each tuple sharing a set of attributes, or columns:
You can think of each tuple as a unique instance of whatever type of people, objects, events, or
associations the table holds. These instances might be things like employees at a company, sales
from an online business, or test results in a medical lab. For example, in a table that holds employee
records of teachers at a school, the tuples might have attributes like name, subjects, start_date, and
so on.
In the relational model, each table contains at least one column that can be used to uniquely
identify each row, called a primary key. Building on the example of a table storing employee
records of teachers at a school, the database administrator could create a primary key column
named employee_ID whose values automatically increment. This would allow the DBMS to keep
track of each record and return them on an ad hoc basis. In turn, it would mean that the records
have no defined logical order, and users have the ability to return their data in whatever order or
through whatever filters they wish.
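A minimal sketch of such a table (MySQL syntax is assumed; the table name and the non-key columns follow the teacher example above, and the data types are illustrative):
CREATE TABLE teachers (
    employee_ID INT AUTO_INCREMENT PRIMARY KEY,   -- values increment automatically
    name        VARCHAR(50),
    subjects    VARCHAR(100),
    start_date  DATE
);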
If you have two tables that you’d like to associate with one another, one way you can do so is with
a foreign key. A foreign key is essentially a copy of one table’s (the “parent” table) primary key
inserted into a column in another table (the “child”). The following example highlights the
relationship between two tables, one used to record information about employees at a company
and another used to track the company’s sales. In this example, the primary key of the
EMPLOYEES table is used as the foreign key of the TRAINER table:
Table: EMPLOYEES
Table: TRAINER
In the above tables, the primary key of the EMPLOYEES table is Employee_ID, and the foreign
key of the TRAINER table is TrainerPerson_ID.
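The two tables are not reproduced here; a minimal SQL sketch of how they might be declared, using the key columns named above (the remaining columns and data types are assumptions):
CREATE TABLE EMPLOYEES (
    Employee_ID INT PRIMARY KEY,
    Name        VARCHAR(50)
);

CREATE TABLE TRAINER (
    Trainer_ID       INT PRIMARY KEY,
    TrainerPerson_ID INT,
    Course           VARCHAR(50),
    FOREIGN KEY (TrainerPerson_ID) REFERENCES EMPLOYEES (Employee_ID)
);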
The relational model’s structural elements help to keep data stored in an organized way, but storing
data is only useful if you can retrieve it. To retrieve information from an RDBMS, you can issue
a query, or a structured request for a set of information. Most relational databases use a language
called Structured Query Language (SQL), informally pronounced like "sequel", to
manage and query data. SQL allows you to filter and manipulate query results with a variety of
clauses, predicates, and expressions, giving you fine control over what data will appear in the result
set.
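For example, a query against the hypothetical teachers table sketched earlier might combine a WHERE clause with an ORDER BY expression:
SELECT name, start_date
FROM teachers
WHERE start_date >= '2020-01-01'
ORDER BY name;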
There are many open-source RDBMSs available today, including the following:
MySQL
MariaDB
PostgreSQL
SQLite
Nowadays, most applications still use the relational model to store and organize data. However, the
relational model cannot meet the needs of every application. For example, it can be difficult to scale
relational databases horizontally, and though they’re ideal for storing structured data, they are less
useful for storing unstructured data. These and other limitations of the relational model have led
to the development of alternatives. Collectively, these database models are often referred to as
non-relational databases. Because these alternative models typically don’t implement SQL for
defining or querying data, they are also sometimes referred to as NoSQL databases. This also
means that many NoSQL databases implement a unique syntax to insert and retrieve data.
It can be helpful to think of “NoSQL” and “non-relational” as broad umbrella terms, as there are
many database models that are labeled as NoSQL, with significant differences between them. The
remainder of this section highlights a few of the more commonly used non-relational database
models:
Key-Value Databases
Key-value databases, also known as key-value stores, work by storing and managing associative
arrays. An associative array, also known as a dictionary or hash table, consists of a collection of
key-value pairs in which a key serves as a unique identifier to retrieve an associated value. Values
can be anything from simple objects, like integers or strings, to more complex objects, like JSON
structures. Redis is an example of a popular, open-source key-value store.
Document-Oriented Databases
Document-oriented databases, or document stores, are NoSQL databases that store data in the form
of documents. Document stores are a type of key-value store: each document has a unique
identifier its key and the document itself serves as the value. The difference between these two
models is that, in a key-value database, the data is treated as opaque and the database doesn’t know
14
or care about the data held within it; it’s up to the application to understand what data is stored. In
a document store, however, each document contains some kind of metadata that provides a degree
of structure to the data. Document stores often come with an API or query language that allows
users to retrieve documents based on the metadata they contain. They also allow for complex data
structures, as you can nest documents within other documents. MongoDB is a widely used
document database. The documents you store in a MongoDB database are written in BSON, which
is a binary form of JSON.
Columnar Databases
Columnar databases, sometimes called column-oriented databases, are database systems that store
data in columns. This may seem similar to traditional relational databases, but rather than grouping
columns together into tables, each column is stored in a separate file or region in the system’s
storage. The data stored in a columnar database appears in record order, meaning that the first
entry in one column is related to the first entry in other columns. This design allows queries to
only read the columns they need, rather than having to read every row in a table and discard
unneeded data after it’s been stored in memory. Apache Cassandra is a widely used open-source
column store.
A DBMS has appropriate languages and interfaces to express database queries and updates. It can
be used to read, store and update the data in the database.
i) Data Definition Language(DDL)
Data definition language is used to define and manage all attributes and properties of a database,
including row layouts, column definitions, key columns, file locations, and storage strategy. DDL
statements are used to build and modify the structure of tables and other objects such as views,
triggers, stored procedures, and so on. For each object, there are usually CREATE, ALTER, and
DROP statements (such as, CREATE TABLE, ALTER TABLE, and DROP TABLE).
In DDL statements, object_name can be a table, view, trigger, stored procedure, and so on.
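As an illustration, a minimal sketch of the three DDL statement types (the Department table and its columns are hypothetical):
CREATE TABLE Department (
    DeptNo   INT PRIMARY KEY,
    DeptName VARCHAR(50)
);
ALTER TABLE Department ADD Location VARCHAR(50);   -- modify the structure of an existing table
DROP TABLE Department;                             -- remove the table definition and its data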
ii) Data Manipulation Language (DML)
It is used for accessing and manipulating data in a database and handles user requests.
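A brief sketch of the four common DML statements (assuming a Department table such as the one in the DDL sketch above exists):
INSERT INTO Department (DeptNo, DeptName) VALUES (10, 'Finance');   -- add a new row
SELECT DeptNo, DeptName FROM Department WHERE DeptNo = 10;          -- read rows
UPDATE Department SET DeptName = 'Accounting' WHERE DeptNo = 10;    -- change existing rows
DELETE FROM Department WHERE DeptNo = 10;                           -- remove rows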
iii) Data Control Language (DCL)
It is used to control access to the stored data; its execution is transactional and it also has
rollback parameters. (But in the Oracle database, the execution of data control language does not
have the feature of rolling back.)
The following operations can have their authorization granted or revoked: CONNECT, INSERT,
USAGE, EXECUTE, DELETE, UPDATE and SELECT.
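For example (the user name hr_user is hypothetical, and the exact GRANT/REVOKE syntax varies slightly between DBMSs):
GRANT SELECT, INSERT ON Department TO hr_user;   -- give hr_user read and insert rights
REVOKE INSERT ON Department FROM hr_user;        -- take the insert right back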
iv) Transaction Control Language (TCL)
TCL is used to manage the changes made by DML statements, and it allows statements to be
grouped into logical transactions.
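A minimal sketch of a transaction using TCL statements (START TRANSACTION is MySQL syntax; other systems use BEGIN or similar, and the table is the hypothetical Department table from earlier sketches):
START TRANSACTION;
UPDATE Department SET DeptName = 'Accounting' WHERE DeptNo = 10;
SAVEPOINT before_delete;
DELETE FROM Department WHERE DeptNo = 20;
ROLLBACK TO SAVEPOINT before_delete;   -- undo only the DELETE
COMMIT;                                -- make the remaining change permanent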
A database model refers to the structure or framework used to define, organize, and store data in
a database. It determines how data is stored, how relationships between data are managed, and
how data can be accessed or modified. There are several types of database models, each offering
different ways of organizing and managing data.
A. Hierarchical Model
This is one of the oldest models in a data model which was developed by IBM, in the 1950s. In a
hierarchical model, data are viewed as a collection of tables, or we can say segments that form a
hierarchical relation. In this, the data is organized into a tree-like structure where each record
consists of one parent record and many children. Even if the segments are connected as a chain-
like structure by logical associations, the instant structure can be a fan structure with multiple
branches. We call these logical associations directional associations.
In the hierarchical model, segments pointed to by the logical association are called the child
segment and the other segment is called the parent segment. A segment without a parent is called
the root, and segments which have no children are called the leaves.
The main disadvantage of the hierarchical model is that it can only represent one-to-one and
one-to-many relationships between the nodes; many-to-many relationships are not supported directly.
Hierarchical models are generally used as semantic models in practice as many real-world
occurrences of events are hierarchical in nature like biological structures, political, or social
structures.
Hierarchical models are also commonly used as physical models because of the inherent
hierarchical structure of the disk storage system like tracks, cylinders, etc. There are various
examples such as Information Management System (IMS) by IBM, NOMAD by NCSS, etc.
B. Network Model
Some data were more naturally modeled with more than one parent per child. So, the network
model permitted the modeling of many-to-many relationships in data. In 1971, the Conference on
Data Systems Languages (CODASYL) formally defined the network model. The basic data
modeling construct in the network model is the set construct. A set consists of an owner record
type, a set name, and a member record type. A member record type can have that role in more than
one set, hence the multiparent concept is supported. An owner record type can also be a member or
owner in another set.
The data model is a simple network, and link and intersection record types (called junction records
by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is
represented by several pairwise sets; in each set some (one) record type is owner (at the tail of the
network arrow) and one or more record types are members (at the head of the relationship arrow).
Usually, a set defines a 1: M relationship, although 1:1 is permitted. The CODASYL network
model is based on mathematical set theory.
C. Entity-Relationship (E-R) Model
Here is an example of how these two concepts might be combined in an ER data model: Mr. Ebisa
(entity) teaches (relationship) the Database Systems course (entity).
An entity can be a real-world object, either animate or inanimate, that can be easily identifiable.
For example, in a school database, students, teachers, classes, and courses offered can be
considered as entities. All these entities have some attributes or properties that give them their
identity. An entity set is a collection of similar types of entities. An entity set may contain entities
with attribute sharing similar values. For example, a Students set may contain all the students of a
school; likewise a Teachers set may contain all the teachers of a school from all faculties. Entity
sets need not be disjoint.
An entity is a real-world thing, either living or non-living, that is easily recognizable. It is
anything in the enterprise that is to be represented in our database. It may be a physical thing, or
simply a fact about the enterprise, or an event that happens in the real world.
An entity can be a place, person, object, event or concept which stores data in the database. The
characteristics of an entity are that it must have attributes and a unique key. Every entity is made
up of some 'attributes' which represent that entity.
Examples of entities:
Entities can be classified based on their strength. An entity is considered to be weak if its tables
are existence dependent.
An entity is considered strong if it can exist apart from all of its related entities.
o A table without a foreign key or a table that contains a foreign key that can contain nulls
is a strong entity
An entity set is a collection of entities of an entity type at a particular point of time. In an entity
relationship diagram (ERD), an entity type is represented by a name in a box.
Types of Entities
i. Independent entities
Independent entities, also referred to as kernels, are the backbone of the database. They are what
other tables are based on. Kernels have the following characteristics:
Dependent entities, also referred to as derived entities, depend on other tables for their meaning.
These entities have the following characteristics:
Create a new simple primary key
Characteristic entities
Characteristic entities provide more information about another table. These entities have the
following characteristics:
Each entity is described by a set of attributes (e.g., Employee = (Name, Address, Birthdate (Age),
Salary)). Each attribute has a name, and is associated with an entity and a domain of legal values.
However, the information about attribute domain is not presented on the ERD.
Types of Attributes
There are a few types of attributes you need to be familiar with. Some of these are to be left as is,
but some need to be adjusted to facilitate representation in the relational model. This first section
will discuss the types of attributes. Later on we will discuss fixing the attributes to fit correctly
into the relational model.
Simple attributes
Simple attributes are those drawn from the atomic value domains; they are also called single-
valued attributes. In the COMPANY database, an example of this would be: Name = {John}; Age =
{23}.
Composite attributes
Composite attributes are those that consist of a hierarchy of attributes. Address may consist of
Number, Street and Suburb. So this would be written as → Address = {59 + ‘Meek Street’ +
‘Kingsford’}.
Multivalued attributes
Multivalued attributes are attributes that have a set of values for each entity. An example of a
multivalued attribute from the COMPANY database, are the degrees of an employee: BSc, MIT,
PhD.
Figure: Example of a multivalued attribute
Derived attributes
Derived attributes are attributes that contain values calculated from other attributes. Age can be
derived from the attribute Birthdate. In this situation, Birthdate is called a stored attribute, which
is physically saved to the database.
Keys
An important constraint on an entity is the key. The key is an attribute or a group of attributes
whose values can be used to uniquely identify an individual entity in an entity set.
Types of Keys
Candidate key
A candidate key is a simple or composite key that is unique and minimal. It is unique because no
two rows in a table may have the same value at any time. It is minimal because every column is
necessary in order to attain uniqueness. From our COMPANY database example, if the entity is
Employee (EID, First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID),
possible candidate keys are:
o EID, SIN
o First Name and Last Name - assuming there is no one else in the company with the same
name
o Last Name and DepartmentID - assuming two people with the same last name don’t work
in the same department
Composite key
A composite key is composed of two or more attributes, but it must be minimal. Using the example
from the candidate key section, possible composite keys are:
o First Name and Last Name - assuming there is no one else in the company with the same name
o Last Name and Department ID - assuming two people with the same last name don’t work in
the same department.
Primary key
The primary key is a candidate key that is selected by the database designer to be used as an
identifying mechanism for the whole entity set. It must uniquely identify tuples in a table and not
be null. The primary key is indicated in the ER model by underlining the attribute.
o A candidate key is selected by the designer to uniquely identify tuples in a table. It must not
be null.
o A key is chosen by the database designer to be used as an identifying mechanism for the whole
entity set.
This is referred to as the primary key. This key is indicated by underlining the attribute in the ER
model.
Employee(EID, First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID).
Secondary key: is an attribute used strictly for retrieval purposes (can be composite), for
example: Phone and Last Name.
Alternate keys: are all candidate keys not chosen as the primary key.
Foreign key:- is an attribute in a table that references the primary key in another table OR it
can be null. Both foreign and primary keys must be of the same data type.
In the COMPANY database example below, DepartmentID is the foreign key: Employee(EID,
First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID).
Nulls
A null is a special symbol, independent of data type, which means either unknown or inapplicable. It does
not mean zero or blank. Features of null include:
o No data entry
o Not permitted in the primary key
o Should be avoided in other attributes
o Can represent
An unknown attribute value
A known, but missing, attribute value
A “not applicable” condition
o Can create problems when functions such as COUNT, AVERAGE and SUM are used
o Can create logical problems when relational tables are linked.
NOTE: The result of a comparison operation is null when either argument is null. The result of an
arithmetic operation is null when either argument is null (except functions that ignore nulls).
Use the Salary table (Salary_tbl) below to follow an example of how null can be used.
Table: Salary_tbl
To begin, find all employees (EmpNo) in Sales (under the jobName column) whose salary plus
commission are greater than 30,000.
SELECT EmpNo FROM Salary_tbl
WHERE jobName = 'Sales'
AND (commission > 30000 OR salary > 30000 OR (commission + salary) > 30000);
Result: E10, E12 and E13.
Relationships
Relationships are the glue that holds the tables together. They are used to connect related
information between tables. Relationship strength is based on how the primary key of a related
entity is defined. A weak, or non-identifying, relationship exists if the primary key of the related
entity does not contain a primary key component of the parent entity. Company database examples
include:
Customer(CustID, CustName)
Order(OrderID, CustID, Date)
A strong, or identifying, relationship exists when the primary key of the related entity contains the
primary key component of the parent entity. Examples include:
Class (CrsCode, Section, ClassTime…)
Types of relationship
i) One-to-one relationship
When only one instance of an entity is associated with the relationship, then it is known as a one-to-one
relationship. For example, a female can marry one male, and a male can marry one female.
ii) One-to-many relationship
When only one instance of the entity on the left, and more than one instance of an entity on the
right, is associated with the relationship, then this is known as a one-to-many relationship. For
example, a scientist can make many inventions, but each invention is made by only one specific
scientist.
iii) Many-to-one relationship
When more than one instance of the entity on the left, and only one instance of an entity on the
right, is associated with the relationship, then it is known as a many-to-one relationship. For example,
a student enrolls in only one course, but a course can have many students.
iv) Many-to-many relationship
When more than one instance of the entity on the left, and more than one instance of an entity on
the right, is associated with the relationship, then it is known as a many-to-many relationship. For
example, an employee can be assigned to many projects, and a project can have many employees.
Generalization, Specialization and Aggregation in the ER model are used for data abstraction, in
which an abstraction mechanism is used to hide the details of a set of objects.
Generalization
Generalization is the process of extracting common properties from a set of entities and creating a
generalized entity from it. It is a bottom-up approach in which two or more entities can be
generalized to a higher level entity if they have some attributes in common. For Example,
STUDENT and FACULTY can be generalized to a higher level entity called PERSON as shown
below. In this case, common attributes like P_NAME, P_ADD become part of higher entity
(PERSON) and specialized attributes like S_FEE become part of specialized entity (STUDENT).
Specialization
In specialization, an entity is divided into sub-entities based on their characteristics. It is a top-
down approach where higher level entity is specialized into two or more lower level entities. For
Example, EMPLOYEE entity in an Employee management system can be specialized into
DEVELOPER, TESTER etc. as shown below. In this case, common attributes like E_NAME,
E_SAL etc. become part of higher entity (EMPLOYEE) and specialized attributes like T_TYPE
become part of specialized entity (TESTER).
Aggregation
An ER diagram is not capable of representing relationship between an entity and a relationship
which may be required in some scenarios. In those cases, a relationship with its corresponding
entities is aggregated into a higher level entity. For Example, Employee working for a project may
require some machinery. So, REQUIRE relationship is needed between relationship
WORKS_FOR and entity MACHINERY. Using aggregation, WORKS_FOR relationship with its
entities EMPLOYEE and PROJECT is aggregated into single entity and relationship REQUIRE
is created between aggregated entity and MACHINERY.
D. Relational Model
Relational Database Model represents the database as a collection of relations. A relation is
nothing but a table of values. Every row in the table represents a collection of related data values.
These rows in the table denote a real-world entity or relationship. The table name and column
names are helpful to interpret the meaning of values in each row. The data are represented as a set
of relations. In the relational model, data are stored as tables. However, the physical storage of the
data is independent of the way the data are logically organized.
Relational Model Concepts in DBMS
Attribute: - Each column in a Table. Attributes are the properties which define a relation.
e.g., Student_Rollno, Name, etc.
Tables: - In the relational model, relations are saved in table format, stored along with their
entities. A table has two properties: rows and columns. Rows represent records and
columns represent attributes.
Tuple: It is nothing but a single row of a table, which contains a single record.
Relation Schema: A relation schema represents the name of the relation with its attributes.
Degree: The total number of attributes in the relation is called the degree of the relation.
Cardinality: Total number of rows present in the Table.
Column: - The column represents the set of values for a specific attribute.
Relation instance: - Relation instance is a finite set of tuples in the RDBMS system.
Relation instances never have duplicate tuples.
Relation key: Each row has one or more attributes, known as the relation key, which can uniquely identify the row in the relation (table).
Attribute domain – Every attribute has some pre-defined value and scope which is known
as attribute domain.
Properties of a Table
A table has a name that is distinct from all other tables in the database.
There are no duplicate rows; each row is distinct.
Entries in columns are atomic. The table does not contain repeating groups or multivalued
attributes.
Entries from columns are from the same domain based on their data type including:
o Number (numeric, integer, float, small int…)
o Character (string)
o Date
o Logical (true or false)
Operations combining different data types are disallowed.
Each attribute has a distinct name.
The sequence of columns is insignificant.
The sequence of rows is insignificant.
Relational Integrity Constraints
Relational Integrity constraints in DBMS are referred to conditions which must be present for a
valid relation. These Relational constraints in DBMS are derived from the rules in the mini-world
that the database represents.
There are many types of Integrity Constraints in DBMS. Constraints on the Relational database
management system is mostly divided into three main categories are:
Domain Constraints
Key Constraints
Referential Integrity Constraints
Domain Constraints
Domain constraints can be violated if an attribute value does not appear in the corresponding
domain or is not of the appropriate data type. Domain constraints specify that, within each tuple,
the value of each attribute must be an atomic value from its domain. The domain is specified as a
data type, including standard data types such as integers, real numbers, characters, Booleans,
variable-length strings, etc.
Key Constraints
An attribute that can uniquely identify a tuple in a relation is called the key of the table. The value
of the attribute for different tuples in the relation has to be unique.
Example
In the given table, CustomerID is a key attribute of the Customer table. Each customer has a
single key value; for example, CustomerID = 101 belongs only to CustomerName = 'Google'.
Referential integrity constraints in DBMS are based on the concept of foreign keys. A foreign
key is an attribute of a relation that refers to a key attribute of another (or the same) relation. The
referential integrity constraint states that the referenced key value must exist in the referenced
table.
Example:
Tuple for CustomerID =1 is referenced twice in the relation Billing. So we know
CustomerName=Google has billing amount $300.
Insert Operation
The insert operation gives the values of the attributes for a new tuple which should be inserted into a
relation.
Update Operation
You can see that in the relation table given below, CustomerName = 'Apple' is updated from Inactive to
Active.
Delete Operation
To specify deletion, a condition on the attributes of the relation selects the tuple to be deleted.
In the above-given example, CustomerName= “Apple” is deleted from the table. The Delete
operation could violate referential integrity if the tuple which is deleted is referenced by foreign
keys from other tuples in the same database.
Advantages of the Relational Model
Data independence: The structure of a relational database can be changed without having
to change any application.
Scalable: A relational database can be enlarged in terms of the number of records (rows) and
the number of fields to enhance its usability.
Disadvantages of Relational Model
Few relational databases have limits on field lengths which can’t be exceeded.
Relational databases can sometimes become complex as the amount of data grows, and the
relations between pieces of data become more complicated.
Complex relational database systems may lead to isolated databases where the information
cannot be shared from one system to another.
Database Schema
A database schema is the skeleton structure that represents the logical view of the entire database.
It defines how the data is organized and how the relations among them are associated. It formulates
all the constraints that are to be applied on the data. A database schema defines its entities and the
relationship among them. It contains a descriptive detail of the database, which can be depicted by
means of schema diagrams. It’s the database designers who design the schema to help
programmers understand the database and make it useful.
A database schema can be divided broadly into two categories:
Physical Database Schema: This schema pertains to the actual storage of data and its form
of storage like files, indices, etc. It defines how the data will be stored in a secondary storage.
Logical Database Schema: This schema defines all the logical constraints that need to be
applied on the data stored. It defines tables, views, and integrity constraints.
Database Instance
It is important that we distinguish these two terms individually. Database schema is the skeleton
of database. It is designed when the database doesn't exist at all. Once the database is operational,
it is very difficult to make any changes to it. A database schema does not contain any data or
information.
A database instance is a state of operational database with data at any given time. It contains a
snapshot of the database. Database instances tend to change with time. A DBMS ensures that its
every instance (state) is in a valid state, by diligently following all the validations, constraints, and
conditions that the database designers have imposed.
In addition to distributed processing and inexpensive hardware, NoSQL databases differ
significantly on their approach to maintaining data integrity and consistency. A more relaxed
approach to data consistency helps NoSQL databases improve the performance of data storage.
Because RDBMSs highly value data integrity, they use the ACID model for data consistency,
which was presented in the early 1980s by Jim Gray. ACID is an acronym for Atomicity,
Consistency, Isolation and Durability and supports the concept of a transaction.
Atomicity: ensures that all tasks with a transaction are performed completely (all or
nothing).
Consistency: ensures that a transaction must leave the database in a consistent state at the
end of the transaction.
Isolation: ensures that transactions are isolated and do not interfere with other transactions.
Durability: ensures that once the transaction is complete, the data will persist permanently,
even in the event of a system failure.
In contrast, NoSQL databases use the CAP theorem (Consistency, Availability and Partition
Tolerance) for data consistency. The CAP Theorem states that of the three possible combinations
of CAP, only two are available at a given point in time. In another words, NoSQL databases can
have partition tolerance (in a distributed environment) and consistency or partition tolerance and
availability, but not all three factors at the same time. CAP theorem has evolved into what is now
called BASE (Basically Available, Soft state and Eventual consistency).
Basically Available: means that data will be available, however the response from the
database can be a failure to retrieve the data or the data retrieved may be in an inconsistent
state.
Soft state: means that the data can change over time as the database seeks consistency.
Eventual consistency: means that sooner or later, the database will eventually become
consistent.
Database design is the process of creating a detailed data model of a database. It involves defining
the structure, format, and organization of the data stored in the database system. The goal is to
optimize data management, ensuring data is stored efficiently, and can be retrieved and updated
quickly and accurately. It involves defining data structures, relationships, and rules for storing,
accessing, and manipulating data. Effective database design reduces redundancy, ensures data
integrity, and optimizes performance.
Database design encompasses several stages that collectively ensure the database meets the needs
of its users and applications. These stages are:
i) Requirements Analysis
The first step in database design is to gather and analyze requirements. This involves
understanding what data needs to be stored, how it will be used, and the expected performance
requirements. For example, consider a simple requirement: a library database system needs to
store information about books, authors, and borrowers.
Key activities in this stage include:
Interviews and Surveys: Engaging with users, managers, and stakeholders to collect
detailed information about their data needs.
Requirement Specification: Writing detailed requirements documents that outline all data
needs, including data types, relationships, constraints, and security requirements.
ii) Conceptual Design
Once requirements are gathered, the next step is to create a conceptual design. This involves
creating an abstract model of the database. Using the library example, the conceptual design
might include entities like 'Book', 'Author', and 'Borrower', along with their relationships (e.g.,
each book is written by an author).
Using the library example, the conceptual design might include:
Attributes
Relationships
Tables:
Defining a set of the table structures, data types for fields, and constraints on these tables such
as primary key, foreign key, unique key, not null and domain definitions to check if data are
out of the range.
Identifying the specific storage structures and access methods to retrieve data efficiently. For
example, adding a secondary index to a relation.
Designing security features for the database system including account creation, privilege
granting/revocation, access protection, and security level assignment.
Logical Design Compared with Physical Design
Generally, Physical design is where you translate the expected schemas into actual database
structures. At this time, you have to map:
Entities to Tables
Attributes to Columns
Example: The physical database design for staff in a department is as follows:
Table: Staff
Table: Department
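The Staff and Department tables are not reproduced here; a minimal SQL sketch of one possible physical design (the column names and data types are assumptions):
CREATE TABLE Department (
    DeptNo   INT PRIMARY KEY,
    DeptName VARCHAR(50)
);

CREATE TABLE Staff (
    StaffNo  INT PRIMARY KEY,
    Name     VARCHAR(50),
    Position VARCHAR(30),
    DeptNo   INT,
    FOREIGN KEY (DeptNo) REFERENCES Department (DeptNo)
);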
Indexes
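No index listing survives here; as an illustration of the secondary index mentioned in the physical-design activities above, an index might be created like this (the index name and column are assumptions based on the Staff sketch):
CREATE INDEX idx_staff_deptno ON Staff (DeptNo);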
Functional Dependency (FD) determines the relation of one attribute to another attribute in a
database management system (DBMS) system. Functional dependency helps you to maintain the
quality of data in the database. A functional dependency is denoted by an arrow →. The functional
dependency of X on Y is represented by X → Y. Functional Dependency plays a vital role to find
the difference between good and bad database design.
Example
In this example, if we know the value of Employee number, we can obtain Employee Name, city,
salary, etc. By this, we can say that the city, Employee Name, and salary are functionally depended
on Employee number.
Key terms
o Axiom: a set of inference rules used to infer all the functional dependencies on a relational
database.
o Decomposition: a rule that suggests that if you have a table that appears to contain two entities
which are determined by the same primary key, you should consider breaking them up into two
different tables.
o Union: suggests that if two tables are separate and the primary key is the same, you should
consider putting them together.
Below are the three most important inference rules for functional dependency (Armstrong's axioms):
o Reflexivity: if Y is a subset of X, then X → Y.
o Augmentation: if X → Y, then XZ → YZ for any set of attributes Z.
o Transitivity: if X → Y and Y → Z, then X → Z.
i. Multivalued dependency in DBMS
Multivalued dependency occurs in the situation where there are multiple independent multivalued
attributes in a single table. A multivalued dependency is a complete constraint between two sets
of attributes in a relation. It requires that certain tuples be present in a relation.
Example
In this example, maf_year and color are independent of each other but dependent on car_model.
In this example, these two columns are said to be multivalued dependent on car_model.
ii. Trivial functional dependency in DBMS
A functional dependency A → B is trivial if B is a subset of A, i.e., the attributes on the right-hand
side are included in the attributes on the left-hand side.
For example: {Emp_Id, Emp_Name} → Emp_Id is a trivial functional dependency, because Emp_Id
is a subset of {Emp_Id, Emp_Name}.
iii. Non trivial functional dependency in DBMS
A functional dependency, also known as a non-trivial dependency, occurs when A → B holds
true and B is not a subset of A. In a relationship, if attribute B is not a subset of attribute A, then
it is considered a non-trivial dependency.
Example:
{Company} → {CEO} (if we know the Company, we know the CEO's name). But CEO is not a
subset of Company, and hence it is a non-trivial functional dependency.
iv. Transitive dependency in DBMS
A transitive dependency is a type of functional dependency which is indirectly formed by two
other functional dependencies.
Example:
o {Company} -> {CEO} (if we know the company, we know its CEO's name)
o {CEO} -> {Age} If we know the CEO, we know the Age
o {Company} → {Age} should hold; that makes sense, because if we know the company
name, we can know the CEO's age.
Note: You need to remember that transitive dependency can only occur in a relation of three or
more attributes.
The functional dependency is a relationship that exists between two attributes. It typically exists
between the primary key and non-key attribute within a table.
X→Y
The left side of FD is known as a determinant, the right side of the production is known as a
dependent.
For example:
Assume we have an employee table with attributes: Emp_Id, Emp_Name, and Emp_Address. Here
Emp_Id attribute can uniquely identify the Emp_Name attribute of employee table because if we
know the Emp_Id, we can tell the employee name associated with it.
Emp_Id → Emp_Name
Example:
ii. Non-trivial functional dependency
A → B has a non-trivial functional dependency if B is not a subset of A.
o When A ∩ B is empty (NULL), A → B is called a completely non-trivial functional dependency.
Example:
o ID → Name,
o Name → DOB
1.7.1. Normalization
First Normal Form (1NF): a relation is in first normal form if every attribute holds only atomic
(single) values; there are no repeating groups or multivalued attributes. Consider the following
EMPLOYEE table, in which an employee can have more than one phone number:
EMPLOYEE table:
The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID  EMP_NAME  EMP_PHONE    EMP_STATE
14      John      7272826385   UP
14      John      9064738238   UP
Second Normal Form (2NF): a relation is in second normal form if it is in 1NF and no non-prime
attribute is dependent on a proper subset of any candidate key (i.e., there are no partial
dependencies).
Example: Let's assume a school stores the data of teachers and the subjects they teach. In a
school, a teacher can teach more than one subject.
TEACHER table
In the given table, non-prime attribute TEACHER_AGE is dependent on TEACHER_ID which is
a proper subset of a candidate key. That's why it violates the rule for 2NF.
To convert the given table into 2NF, we decompose it into two tables:
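The decomposed tables are not reproduced in the text; a minimal SQL sketch of one common way to split them (the table names and data types are assumptions based on the description above):
CREATE TABLE TEACHER_DETAIL (
    TEACHER_ID  INT PRIMARY KEY,
    TEACHER_AGE INT
);

CREATE TABLE TEACHER_SUBJECT (
    TEACHER_ID INT,
    SUBJECT    VARCHAR(50),
    PRIMARY KEY (TEACHER_ID, SUBJECT),
    FOREIGN KEY (TEACHER_ID) REFERENCES TEACHER_DETAIL (TEACHER_ID)
);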
Third Normal Form (3NF): a relation is in third normal form if, for every non-trivial functional
dependency X → Y, it holds at least one of the following conditions:
X is a super key.
Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example: EMPLOYEE_DETAIL table:
Super key in the table above: {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME,
EMP_ZIP}....so on
Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime. Here,
EMP_STATE and EMP_CITY are dependent on EMP_ZIP, and EMP_ZIP is dependent on EMP_ID.
The non-prime attributes (EMP_STATE, EMP_CITY) are transitively dependent on the super key
(EMP_ID). This violates the rule of third normal form.
That's why we need to move the EMP_CITY and EMP_STATE to the new <EMPLOYEE_ZIP>
table, with EMP_ZIP as a Primary key.
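A minimal SQL sketch of the resulting 3NF tables (the column names follow the description above; data types are assumptions):
CREATE TABLE EMPLOYEE_ZIP (
    EMP_ZIP   VARCHAR(10) PRIMARY KEY,
    EMP_STATE VARCHAR(30),
    EMP_CITY  VARCHAR(30)
);

CREATE TABLE EMPLOYEE (
    EMP_ID   INT PRIMARY KEY,
    EMP_NAME VARCHAR(50),
    EMP_ZIP  VARCHAR(10),
    FOREIGN KEY (EMP_ZIP) REFERENCES EMPLOYEE_ZIP (EMP_ZIP)
);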
1.8. Structured Query Language(SQL)
SQL is a standard database language used to access and manipulate data in databases. SQL stands
for Structured Query Language. It was developed by IBM Computer Scientists in the 1970s.
By executing queries SQL can create, update, delete, and retrieve data in databases like
MySQL, Oracle, PostgreSQL, etc. Overall, SQL is a query language that communicates with
databases.
SQL Example
We'll create a table to store employee information and populate it with sample data like
Employee_Id, Name, Age, Department, and Salary.
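The original CREATE TABLE and INSERT statements are not reproduced here; a minimal sketch using the columns listed above (the data types are assumptions, and only the two rows that appear in the output further below are inserted):
CREATE TABLE employees (
    Employee_Id INT PRIMARY KEY,
    Name        VARCHAR(50),
    Age         INT,
    Department  VARCHAR(30),
    Salary      DECIMAL(10, 2)
);

INSERT INTO employees (Employee_Id, Name, Age, Department, Salary)
VALUES (6, 'Anjali', 32, 'HR', 58000.00),
       (7, 'Vikram', 29, 'IT', 56000.00);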
If you want to retrieve data from the employees table where the salary is greater than 55000.00,
we use the SELECT statement.
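The query itself is not shown in the text; based on the explanation below, it would be:
SELECT * FROM employees
WHERE Salary > 55000.00;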
Employee_Id  Name    Age  Department  Salary
6            Anjali  32   HR          58000
7            Vikram  29   IT          56000
Explanation:
This SELECT statement retrieves records from the employees table where the Salary column is
greater than 55000.00. The * symbol indicates that we want to retrieve all columns for the matching
records. The WHERE clause filters the results based on the specified condition.
SQL Basics
SQL is a fundamental skill for anyone who wants to interact with databases. This standard query
language allows users to create, manage, and retrieve data from relational databases. In this
section, we cover the basics of SQL.
A database is a repository that organizes data in structured formats through tables, views, stored
procedures, and other components.
Syntax: CREATE DATABASE database_name;
Example: CREATE DATABASE IT; creates a database named IT, which can then be selected with USE IT;
DROP DATABASE Statement
Syntax: DROP DATABASE database_name;
This command is typically used in scenarios where a database is no longer required or needs
to be removed for reorganization. Before running the command, ensure that we have
appropriate permissions to drop the database, as only database administrators typically have
the necessary rights.
To demonstrate the functionality of the SQL DROP DATABASE command, let's look at some
examples of the DROP DATABASE statement in SQL.
First, let's create a database on which we will run the query. This will allow us to perform
operations and test the SQL DROP DATABASE command.
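For example (the statement is not reproduced in the text; Staff is the database name used in the rest of this example):
CREATE DATABASE Staff;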
To confirm that the Staff database was successfully created, use the following command to list all
databases:
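In MySQL or MariaDB, for instance, this can be done with:
SHOW DATABASES;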
Now that the database is confirmed to exist, let's use the DROP DATABASE command to delete
the database 'Staff'.
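The statement would be:
DROP DATABASE Staff;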
To avoid any error while running the DROP DATABASE command, use the IF EXISTS clause,
which will delete the database only if it exists in the system.
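For example:
DROP DATABASE IF EXISTS Staff;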
To rename a database in SQL, you’ll need to use the ALTER DATABASE command. The syntax
varies slightly between different SQL platforms, but the core functionality remains the same. The
ALTER DATABASE statement allows you to modify the properties of a database, including its
name. However, it’s important to note that while SQL Server uses the ALTER DATABASE
statement with a MODIFY NAME clause, MySQL employs a different approach using
the RENAME DATABASE statement.
RENAME DATABASE [current_database_name] TO [new_database_name];
In this example, we will use the ALTER command with MODIFY NAME clause to rename
the database.
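A SQL Server style sketch of this (the database names Technology and TechnologyDB are illustrative):
ALTER DATABASE Technology MODIFY NAME = TechnologyDB;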
Database Availability: Please be aware that renaming a database may temporarily render
it inaccessible while the process is underway. It is advisable to schedule this task during
off-peak hours, particularly in a production environment.
Dependencies: Ensure that any applications, scripts, or users relying on the database name
are updated to reflect the new designation.
Permissions: Confirm that you possess the requisite permissions to rename the database,
which typically necessitates administrative or root-level access.
Additionally, the SELECT statement in SQL is used to query and retrieve data from the tables
within the selected database. In this section, we will learn about selecting a database in SQL
through various examples.
USE DATABASE Statement
The USE DATABASE statement is not a standard SQL command, but rather a variation of
the USE command used in some SQL database management systems (DBMS) to select a
specific database for the current session.
This command sets the default database for subsequent queries in that session.
Let’s take a look at how to select a database in SQL, using MySQL as an example. Suppose you
have a database called company_db that contains employee information.
o Create a Database: To begin, you can create a new database if it doesn’t already exist
o Select the Database: To set Technology as the active database, use the USE command
USE Technology;
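Putting the two steps together (MySQL syntax; the database name Technology follows the USE statement above):
CREATE DATABASE IF NOT EXISTS Technology;
USE Technology;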
How to Query Data from the Selected Database
The SELECT statement in SQL is used to query and retrieve data from the tables within the
selected database. Here are some key ways to use the SELECT statement effectively.
Table: employees
ID  Name    Age  Department  Salary
4   David   28   HR          45000
6   Frank   50   HR          70000
7   Grace   29   IT          48000
8   Hannah  38   IT          53000
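The query for this first example is not reproduced in the text; one query consistent with the output shown below is:
SELECT ID, Name FROM employees;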
Output
ID Name
1 Alice
2 Bob
3 Charlie
4 David
5 Eve
6 Frank
7 Grace
8 Hannah
Explanation: Retrieves the ID and name columns for all rows.
2. Filtering Results with WHERE
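The query is not reproduced in the text; one query consistent with the output shown below is:
SELECT Name, Age FROM employees
WHERE Age > 30;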
Output
Name Age
Bob 40
Charlie 35
Eve 45
Frank 50
Hannah 38
Name Age
Frank 50
Eve 45
Bob 40
Hannah 38
Charlie 35
Alice 30
Grace 29
David 28
Explanation: Retrieves the top 3 highest-paid employees, ordered by salary in descending order.
4. Aggregating Data with GROUP BY and Aggregation Functions
SELECT department, AVG(salary) AS average_salary FROM employees GROUP BY
department;
Output
Department Average_Salary
Sales 52500
Marketing 62500
HR 57500
IT 50500
Tables in SQL
The cornerstone of any SQL database is the table. Basically, these structures function much like
spreadsheets, storing data in an organized grid format. In this section, you will learn how to
create, drop, and delete tables, and more.
The SQL CREATE TABLE statement is a foundational command used to define and structure
a new table in a database. Specifying the columns, data types, and constraints such
as PRIMARY KEY, NOT NULL, and CHECK helps you design the database schema.
In this section, we'll learn the syntax, best practices, and practical examples of using the CREATE
TABLE statement in SQL. We'll also cover how to create tables from existing data and
troubleshoot common errors.
To create a new table in the database, use the SQL CREATE TABLE statement. A table’s
structure, including column names, data types, and constraints like NOT NULL, PRIMARY KEY,
and CHECK, are defined when it is created in SQL.
The CREATE TABLE command is a crucial tool for database administration because of these
limitations, which aid in ensuring data integrity.
61
Syntax: To create a table in SQL, use the following CREATE TABLE syntax:
CREATE TABLE table_name (
    column1 datatype(size) [constraint],
    column2 datatype(size) [constraint],
    ...
);
Datatype (size): Defines the data type and size of each column.
Let us create a table to store customer data. The table is named Customer, with columns such as CustomerID, CustomerName, LastName, Country, Age, and Phone.
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(50),
    LastName VARCHAR(50),
    Country VARCHAR(50),
    Age INT CHECK (Age >= 0 AND Age <= 99),
    Phone VARCHAR(15)
);
Output
Customer
Empty
To add data to the table, we use the INSERT INTO command; the syntax is shown below:
Syntax: INSERT INTO table_name (column1, column2, …) VALUES (value1, value2, …);
Example Query
INSERT INTO Customer (CustomerID, CustomerName, LastName, Country, Age, Phone)
VALUES
    (1, 'Shubham', 'Thakur', 'India', 23, '555-0100'),
    (2, 'Aman', 'Chopra', 'Australia', 21, '555-0101'),
    (3, 'Naveen', 'Tulasi', 'Sri Lanka', 24, '555-0102'),
    (4, 'Aditya', 'Arpan', 'Austria', 21, '555-0103'),
    (5, 'Nishant', 'Jain', 'Spain', 22, '555-0104');
-- The CustomerID and CustomerName values match the SubTable output later in this section;
-- the LastName, Country, Age, and Phone values are illustrative.
We can also use CREATE TABLE to create a copy of an existing table. The new table receives the same column definitions, and either all columns or specific columns can be selected. If an existing table is used to create the new table, by default the new table is populated with the existing values from the old table.
Syntax:
CREATE TABLE new_table_name AS SELECT column1, column2,…
FROM existing_table_name
WHERE ….;
Query:
CREATE TABLE SubTable AS
SELECT CustomerID, CustomerName
FROM customer;
Output
Sub Table
CustomerID CustomerName
1 Shubham
2 Aman
3 Naveen
4 Aditya
5 Nishant
The DROP TABLE command in SQL is used to permanently delete a table from a database, along with all of its data, structure, and associated constraints such as indexes, triggers, and permissions. Once executed, this operation removes the table definition and all of its rows, so the table can no longer be accessed or used. The action is irreversible: once a table is dropped, it cannot be recovered unless there is a backup.
Syntax:
DROP TABLE table_name;
Example:
Step 1: First, we will create a database and table on which the SQL queries will be run.
CREATE DATABASE NewCafe;
USE NewCafe;
CREATE TABLE categories (
CategoryID INT NOT NULL PRIMARY KEY,
CategoryName NVARCHAR(50) NOT NULL,
ItemDescription NVARCHAR(50) NOT NULL
);
INSERT INTO categories (CategoryID, CategoryName, ItemDescription)
VALUES
(1, 'Beverages', 'Soft Drinks'),
(2, 'Condiments', 'Sweet and Savory Sauces'),
(3, 'Confections', 'Sweet Breads');
SELECT * FROM categories;
Output
CategoryID CategoryName ItemDescription
1 Beverages Soft Drinks
2 Condiments Sweet and Savory Sauces
3 Confections Sweet Breads
Step 2: Now, let’s use the DROP TABLE statement to delete the categories table permanently.
Query:
DROP TABLE categories;
SQL DELETE Statement
The SQL DELETE statement is one of the most commonly used commands in SQL
(Structured Query Language). It allows you to remove one or more rows from the table
depending on the situation. Unlike the DROP statement, which removes the entire table,
the DELETE statement removes data (rows) from the table retaining only the table structure,
constraints, and schema.
Syntax:
DELETE FROM table_name WHERE condition;
In SQL, making structural changes to a database is often necessary. Whether it’s renaming a table or a column, adding new columns, or modifying data types, the SQL ALTER TABLE command plays a critical role. It lets users change the structure of an existing table and make schema changes without affecting the data that is already there. This is essential for tasks such as the following (the syntax shown is the common MySQL form):
Renaming a table.
Syntax:
ALTER TABLE table_name RENAME TO new_table_name;
Adding a new column.
Syntax:
ALTER TABLE table_name ADD column_name datatype;
Renaming an existing column.
Syntax:
ALTER TABLE table_name RENAME COLUMN old_column_name TO new_column_name;
Modifying a column’s data type.
Syntax:
ALTER TABLE table_name MODIFY COLUMN column_name new_datatype;
Commands and Descriptions
INSERT INTO SELECT: Inserts data from one table into another based on a query.
TRUNCATE TABLE: Removes all records from a table, but keeps the table structure intact.
ALTER TABLE ADD CONSTRAINT: Adds a constraint to a table after it has been created.
GRANT: Gives specific privileges to database users.
2. Introduction to Advanced Database Management System
Advanced Database Management involves the study of complex database systems, focusing on the
more intricate and sophisticated features of databases that are typically not covered in basic
database courses. It builds on the fundamental concepts of relational databases, extending them to
handle larger and more intricate data management challenges. This field is essential for managing
vast amounts of data in various industries, including finance, healthcare, e-commerce, and more.
2. Transaction Management
ACID Properties: Ensuring that database transactions are reliable, following the properties
of Atomicity, Consistency, Isolation, and Durability.
3. Indexing and Query Optimization
Indexes: Advanced indexing strategies, such as B-trees, hash indexing, and bitmap indexes,
are used to speed up search operations in large databases.
Query Optimization: Developing algorithms to improve the performance of SQL queries by
reducing their computational cost. This includes techniques like query rewriting, join
reordering, and parallel execution.
4. Data Warehousing and Big Data
Data Warehouses: Building large-scale systems for storing and querying vast amounts of
data, often used in business intelligence and analytics.
Big Data Management: Technologies such as Hadoop, NoSQL databases (e.g., MongoDB,
Cassandra), and distributed computing frameworks that handle large-scale, unstructured, or
semi-structured data.
5. NoSQL Databases
Use Cases: How NoSQL databases support applications requiring high scalability, real-time
data, and horizontal scaling.
6. Cloud Databases
Cloud-Based Database Systems: Managing databases hosted on cloud platforms like AWS,
Google Cloud, or Microsoft Azure. This includes scaling databases dynamically based on
demand.
Database as a Service (DBaaS): Offering managed database services with automatic scaling,
backup, and maintenance.
7. Database Security
Encryption: Ensuring data is secure by applying encryption techniques both in transit and
at rest.
Access Control: Defining roles and permissions to restrict access to sensitive data.
8. Distributed and Parallel Databases
Distributed Systems: Dealing with data that is distributed across multiple machines or
locations, ensuring consistency and fault tolerance (CAP theorem).
9. AI and Machine Learning in Databases
Predictive Analytics: Using machine learning algorithms to predict trends based on data
stored in databases.
Automated Database Tuning: AI techniques that can optimize queries, index creation, and
database performance automatically.
2.1. Query Processing and Optimization
Query processing in a database system refers to the series of steps that are performed to translate
and execute a query. The goal of query processing is to efficiently retrieve the required data while
ensuring minimal resource consumption.
i. Parsing:
The SQL query is parsed to check its syntax and translate it into an internal representation.
The query is converted into a parse tree, which is then analyzed for correctness and
transformed into a query tree or relational algebra expression.
Syntax errors are detected in this step, such as missing clauses or incorrect keywords.
ii. Translation:
The query is transformed into an intermediate representation, often in the form of relational
algebra, which can be used to generate the actual execution plan.
For example, SQL queries are converted to relational algebra expressions like select, project,
and join operations.
iii. Optimization:
The main goal here is to improve the efficiency of the query execution. The database engine
explores different ways of executing the query to determine the most efficient plan.
The query optimizer evaluates multiple execution plans and chooses the one that minimizes
time or resource consumption (e.g., CPU, memory, I/O).
iv. Execution:
Once the optimal query execution plan is chosen, the database engine executes the query using
the selected strategy.
The result is retrieved from the database and returned to the user.
Example Query:
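As an illustrative example (assuming the customers and orders tables join on a customer_id column), consider the following query, which matches the relational algebra expression discussed below:
SELECT customer_name, order_date
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_date > '2024-01-01';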
1. Parsing:
The database parses the SQL query to ensure its syntax is correct.
During this step, the SQL engine checks for errors like invalid column names or missing
FROM clauses.
2. Translation:
The parsed query is converted into relational algebra. For example, the JOIN operation
between the customers and orders table is translated into a relational algebra operation like
σ_order_date > '2024-01-01' (π_customer_name, order_date (customers ⨝ orders)).
The selection (σ) operation filters the rows based on the order_date, and projection (π) limits
the output columns to customer_name and order_date.
3. Optimization:
During optimization, the database may apply various techniques to improve query
performance, such as reordering operations or using indexes.
4. Execution:
After the query is optimized, the database engine executes it by performing the join operation,
filtering the data using the condition order_date > '2024-01-01', and returning the results.
Types of Query Optimization
I. Heuristic Optimization:
This optimization method involves applying a set of rules (heuristics) to the query in order to
rearrange its operations to minimize execution cost.
o Selectivity Pushdown: Moving selection operations as early as possible in the query plan
to reduce the number of rows involved.
o Join Commutativity: Changing the order of joins, as certain join orders can be more
efficient.
II. Cost-Based Optimization:
This involves evaluating different execution plans based on their estimated cost, which is usually a combination of factors such as CPU time, memory usage, disk I/O, and the estimated number of rows processed. A cost model is used to calculate the cost for each potential plan, and the query optimizer then chooses the plan with the lowest cost.
III. Transformation Rules:
These are used in both heuristic and cost-based optimization. Transformational rules allow a query to be restructured into an equivalent query that may be more efficient to execute. Examples include:
Subquery flattening: Rewriting subqueries into joins or other more efficient
expressions.
Join Algorithm Selection
Joins are typically the most expensive operations in a query, so much of optimization is aimed at choosing the most efficient join algorithm:
i. Nested-loop join: Compares each row of one input with each row of the other; simple, but costly for large inputs.
ii. Merge join: Efficient when the input tables are already sorted.
iii. Hash join: Builds a hash table on one input and probes it with the other; effective for large, unsorted inputs.
Join Ordering
The order in which joins are performed can significantly impact the query performance.
Optimizers use a search space to find the most efficient join order, but this can be
computationally expensive.
Dynamic programming and greedy algorithms are often used to explore different join
orders and choose the most efficient one.
Physical query optimization focuses on selecting the best physical implementation for each
relational operation in the query plan. It involves:
Index Selection: Choosing whether to use indexes for searching or sorting data, based on
their availability and cost.
Access Path Selection: The physical method used to access data (e.g., table scan, index
scan).
Join Strategy Selection: Choosing the best algorithm for joins (nested-loop, hash join, etc.).
Sorting: Deciding whether to use external sorting, in-memory sorting, or other methods.
Advanced Optimization Techniques
I. Materialized Views:
These are precomputed results of a query stored in the database. The optimizer can choose to
use a materialized view instead of recalculating a complex query repeatedly.
II. Query Rewriting:
Sometimes queries can be rewritten to improve performance, for example by converting a query that requires a full scan into one that can use an index, or by eliminating redundant operations.
III. Distributed Query Optimization:
In distributed databases, optimization must also account for the cost of network communication. The optimizer must decide which part of the query to execute on which node and how to handle data transfer.
Query Optimization Examples
Optimization Example 1: Using Indexes
Without Index: If there is no index on product_id, the database will have to scan all the rows
in the products table to find the matching product. This can be inefficient for large tables.
With Index: If there is an index on product_id, the database can quickly locate the product
using the index, reducing the need to scan the entire table. This drastically improves
performance.
The optimizer may choose to use an Index Scan instead of a Full Table Scan, which is a
more efficient way to retrieve the data.
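A minimal MySQL sketch of this idea (the products table comes from the example above; the index name and the value 101 are illustrative, and if product_id is already the primary key an index exists automatically):
CREATE INDEX idx_products_product_id ON products (product_id);
EXPLAIN SELECT * FROM products WHERE product_id = 101;
The EXPLAIN output shows whether the optimizer performs a full table scan or uses idx_products_product_id to locate the matching rows.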
2.2. Transaction Management Procedure
Transaction management is a systematic process used to ensure the proper handling of database
transactions, maintaining data consistency, integrity, and security throughout their lifecycle.
Below is a standard procedure for managing transactions effectively.
Transaction Initiation
Start the Transaction: A transaction begins when an application or user requests to perform
an operation that will modify the database. This is typically done by sending a BEGIN
TRANSACTION command (in SQL).
Establish Transaction Context: Ensure the necessary context (like user identification and
session details) is established for auditing and tracking.
Transaction Execution
Data Manipulation: During the execution phase, the transaction performs operations such
as inserts, updates, or deletes on the database. These operations are not yet permanent.
o Atomicity: Ensure that operations are grouped in a way that they all succeed or fail
together. If any operation fails, the entire transaction must be rolled back.
o Consistency: Ensure that the database maintains its consistency rules, such as
constraints, triggers, and referential integrity.
o Isolation: Ensure that the changes made by the transaction are not visible to other
transactions until the transaction is committed. Techniques like locking, multiversion
concurrency control (MVCC), or isolation levels (e.g., Serializable, Read Committed) may
be used.
o Durability: Once committed, the transaction’s effects must be permanent, and even if the
system crashes afterward, the data must remain intact.
Concurrency Control
o Locking Mechanisms: Use locks (e.g., shared or exclusive locks) to prevent conflicts
between concurrent transactions that may lead to inconsistencies or conflicts.
o Deadlock Detection and Resolution: Implement techniques for detecting deadlocks
(where two or more transactions are waiting for each other to release resources) and take
corrective actions, such as rolling back one of the transactions.
o Optimistic Concurrency: In some cases, allow transactions to execute without locks and
only validate at commit time, rolling back if conflicts are detected.
Commit or Rollback
Commit Transaction: Once the transaction has completed successfully and all operations are
ready to be finalized, issue a COMMIT statement to make all changes permanent.
Rollback Transaction: If an error occurs or the transaction cannot be completed (e.g., due to
a violation of constraints, business logic failure, or system failure), issue a ROLLBACK
statement to undo all changes made during the transaction.
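A minimal sketch of this commit-or-rollback flow in MySQL-style SQL (the accounts table and account IDs are illustrative; SQL Server uses BEGIN TRANSACTION instead of START TRANSACTION):
START TRANSACTION;
UPDATE accounts SET balance = balance - 1000 WHERE account_id = 1; -- debit the source account
UPDATE accounts SET balance = balance + 1000 WHERE account_id = 2; -- credit the target account
COMMIT; -- make both changes permanent
-- If either UPDATE fails, issue ROLLBACK instead of COMMIT to undo all changes.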
Error Handling
Transaction Failure: If any part of the transaction encounters an error (e.g., a constraint
violation, a system crash, or an integrity violation), the transaction should be rolled back to
maintain consistency.
Automatic Recovery: Implement recovery mechanisms to restore the system to a consistent
state after a failure. This may involve transaction logs, backups, or journaling.
Logging: Maintain detailed logs of all transaction activities. These logs can be used for
recovery and auditing purposes.
Logging and Auditing
Transaction Log: Keep a detailed record of all changes made during the transaction,
including the start and end time, data modified, and the user initiating the transaction.
Audit Trails: Ensure that a secure audit trail is kept, recording any changes made to critical
data, especially for sensitive operations like financial transactions.
Transaction States
Active: When the transaction is in progress and has not yet been committed or rolled back.
Partially Committed: When the transaction has executed, but has not yet been fully
committed.
Committed: When the transaction has been successfully completed and the changes are
permanent.
Aborted: When the transaction has failed and any changes have been undone.
Transaction Isolation Levels
Read Uncommitted: Transactions can see uncommitted changes from other transactions,
which can lead to dirty reads.
Read Committed: A transaction can only read committed data, avoiding dirty reads but still
susceptible to non-repeatable reads.
Repeatable Read: Prevents dirty and non-repeatable reads by locking the data being read,
ensuring the same result if read multiple times.
Serializable: The highest isolation level, ensuring that transactions are executed as if they were
processed serially (one after the other), eliminating all concurrency issues.
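A minimal sketch of setting one of these levels in MySQL (the accounts table is illustrative):
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1; -- repeated reads within this transaction return the same value
COMMIT;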
Performance Optimization
Minimize Transaction Duration: Keep transactions short and avoid holding locks for
extended periods, as long transactions can reduce system throughput and cause contention.
Batch Processing: For large data modifications, consider using batch processing to reduce
the load on the system.
Use Indexing: Ensure proper indexing of frequently accessed columns to speed up data
retrieval during transactions.
Avoid Blocking: Use techniques like deadlock detection and transaction queuing to prevent
transactions from blocking each other unnecessarily.
Data Integrity: Ensure that all transactions respect the integrity constraints of the database,
such as foreign keys, uniqueness, and check constraints.
Confidentiality: Sensitive data should be encrypted, and access should be restricted to
prevent unauthorized access during the transaction.
Transaction Logs: Keep a detailed log of all transactions that include both committed and
aborted transactions to allow for recovery in case of failure.
Concurrent Executions
Multiple transactions are allowed to run concurrently in the system. The main advantages of concurrent execution are:
o Increased processor and disk utilization, leading to better transaction throughput: e.g., one transaction can be using the CPU while another is reading from or writing to the disk.
o Reduced average response time for transactions: short transactions need not wait behind long ones.
o Concurrency control schemes: mechanisms to achieve isolation
That is, to control the interaction among the concurrent transactions in order to prevent them from
destroying the consistency of the database.
Schedule 1
o Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
o An example of a serial schedule in which T1 is followed by T2 :
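Written out in the usual textbook notation (a sketch; the layout of the original schedule figure is assumed), the two transactions are:
T1: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B)
T2: read(A); temp := A * 0.1; A := A - temp; write(A); read(B); B := B + temp; write(B)
In Schedule 1, every operation of T1 executes before any operation of T2.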
Schedule 2
o A serial schedule in which T2 is followed by T1.
Schedule 3
o Let T1 and T2 be the transactions defined previously. The following schedule is not a serial
schedule, but it is equivalent to Schedule 1.
Note: In Schedules 1, 2, and 3, the sum A + B is preserved.
Serializability
Basic Assumption: Each transaction preserves database consistency. Thus, serial execution of a
set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable
if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the
notions of:
o Conflict serializability
o View serializability
Conflicting Instructions
Let li and lj be operations of transactions Ti and Tj on the same data item Q. The two instructions conflict if and only if at least one of them writes Q:
I. li = read(Q), lj = read(Q). They do not conflict.
II. li = read(Q), lj = write(Q). They conflict.
III. li = write(Q), lj = read(Q). They conflict.
IV. li = write(Q), lj = write(Q). They conflict.
o Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.
Conflict Serializability
If a schedule S can be transformed into a schedule S′ by a series of swaps of non-conflicting instructions, we say that S and S′ are conflict equivalent. A schedule S is conflict serializable if it is conflict equivalent to a serial schedule. For example, in a schedule where transactions T3 and T4 perform conflicting operations on the same data item, we are unable to swap instructions to obtain either the serial schedule < T3, T4 > or the serial schedule < T4, T3 >, so that schedule is not conflict serializable.
Database Security
Database security refers to the measures, policies, and practices employed to protect database systems against unauthorized access, misuse, and malicious attacks. It ensures the confidentiality, integrity, and availability of the data stored within the database, making it a critical aspect of any organization's overall cybersecurity strategy.
Confidentiality: Ensuring that only authorized users can access sensitive data.
Integrity: Ensuring that data remains accurate and is not modified in unauthorized ways.
Availability: Guaranteeing that authorized users have reliable access to data when needed.
A. Authentication and Authorization
Authentication is a critical aspect of database security that ensures that only authorized users or systems can access a database and its associated resources. It is the process of verifying the identity
of users or applications attempting to interact with the database. Proper authentication mechanisms
prevent unauthorized access and help protect sensitive data.
Authorization in database security refers to the process of granting or denying access to specific
resources within a database system based on user roles and permissions. It determines what actions
a user can perform on the database, such as reading, writing, modifying, or deleting data.
Role-Based Access Control (RBAC): Users are assigned roles with specific privileges. For
example, an admin has full access, while a regular user may only have read access.
Discretionary Access Control (DAC): The owner of the data determines who has access.
Mandatory Access Control (MAC): Access control policies are enforced by the system, not
the user.
Attribute-Based Access Control (ABAC): Uses attributes (e.g., user roles, location) to define
access permissions.
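A minimal role-based sketch in MySQL 8 (the role, user, and object names are illustrative):
CREATE ROLE report_reader;
GRANT SELECT ON company_db.employees TO report_reader;
CREATE USER 'analyst'@'localhost' IDENTIFIED BY 'StrongPass!123';
GRANT report_reader TO 'analyst'@'localhost';
SET DEFAULT ROLE report_reader TO 'analyst'@'localhost';
The analyst account can now read the employees table but has no other privileges.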
B. Data Encryption
Encryption at Rest: Ensures that data stored in the database (e.g., files, backups) is encrypted,
making it unreadable without the proper decryption key. Techniques include:
o Transparent Data Encryption (TDE): Encrypts the entire database file system.
Encryption in Transit: Protects data as it moves between the database and applications, using
protocols like SSL/TLS to prevent interception and tampering.
C. Auditing and Monitoring
Database Activity Monitoring (DAM): Tracks and analyzes all database activity, such as
who is accessing the data, what actions they are performing, and when.
Audit Logs: Maintain a detailed record of database interactions for security reviews and
regulatory compliance. Logs should be immutable and regularly reviewed.
D. Backup and Recovery
Regular Backups: Periodically back up the database to prevent data loss. Backups can be full
(entire database), incremental (only changes since last backup), or differential (changes since
last full backup).
Offsite Backups: Store backups in a separate physical location (or the cloud) to protect against
local disasters like fires, floods, or theft.
Restore Testing: Periodically test backup restoration processes to ensure data recovery is
possible in case of a disaster.
E. Patch Management
Regular Patch Management: Ensure that the database management system (DBMS) is
regularly updated to fix vulnerabilities. This includes applying patches for security fixes, bug
fixes, and performance improvements.
F. Network Security
Firewall Protection: Use network firewalls to block unauthorized access to the database from
external sources.
Virtual Private Network (VPN): Encrypts the connection between remote users and the
database, ensuring secure access over untrusted networks.
IP Whitelisting: Only allow database access from predefined IP addresses to reduce the attack
surface.
G. Database Hardening
Strong Password Policies: Enforce strong, complex passwords for database users, and use
password vaults for secure storage.
Minimum Privilege Principle: Ensure users have only the minimum privileges necessary to
perform their job functions.
Account Lockout: Lock accounts after a certain number of failed login attempts to protect
against brute-force attacks.
H. Data Integrity
Checksums: Use checksums to verify that data hasn’t been tampered with or corrupted.
Data Validation: Ensure that data entered into the database is consistent, complete, and
adheres to defined rules (e.g., field lengths, data types).
Referential Integrity: Use foreign keys and constraints to ensure data consistency across
tables, preventing orphaned records and other integrity violations.
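A minimal sketch of referential integrity (the Orders table is illustrative; it references the Customer table created earlier in this module):
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    CONSTRAINT fk_orders_customer FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
);
With this constraint in place, the database rejects any order that refers to a non-existent customer and prevents deleting a customer who still has orders.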
I. Data Masking
Static Data Masking: Creates a copy of the database with sensitive data obscured for testing
or development purposes.
Dynamic Data Masking: Masks sensitive data on the fly when it is queried, ensuring that
sensitive data is not exposed to unauthorized users.
Tokenization: Replaces sensitive data with a token that can be used for processing but cannot
be reverse-engineered to reveal the original data.
Database security testing is the process of evaluating the security of a database to identify
vulnerabilities, weaknesses, or misconfigurations that could lead to unauthorized access, data
breaches, or other security incidents. This type of testing is essential for protecting sensitive
information stored within databases and ensuring compliance with regulatory requirements.
Penetration Testing: Regularly test the database for vulnerabilities and attempt to exploit
weaknesses to identify and address potential security gaps.
Introduction
Database failures can disrupt operations, result in data loss, and impact business continuity. To
address these challenges, understanding the causes of database failures and implementing effective
recovery methods is essential.
i) Transaction Failures
Example: A customer initiates a bank transaction to transfer $1,000 from their Savings Account to their Checking Account.
1. The transaction debits $1,000 from the Savings Account.
2. Before it credits the amount to the Checking Account, the database server crashes due to a sudden power outage.
As a result, the $1,000 has been deducted from the Savings Account but never credited to the Checking Account.
This failure leaves the database in an inconsistent state. To address this, the database recovery
system would use a transaction log to roll back the incomplete transaction, ensuring no partial
updates are retained and maintaining consistency.
ii) System Failures
System failures in database systems refer to unexpected disruptions or malfunctions that affect the normal operations of a database. These failures can result in data loss, inconsistencies, or downtime, potentially impacting the integrity and availability of the stored information. Understanding system failures is critical for designing robust database systems that ensure data reliability and recoverability. They typically result from hardware or software issues that cause the database server to crash.
Scenario:
In 2000, Microsoft SQL Server experienced a major system failure in a large-scale, high-traffic e-
commerce application during a critical shopping period. The failure was caused by a bug in the
SQL Server’s transaction log system, which led to a deadlock situation. The bug prevented the
system from writing transaction logs properly, thus leading to database corruption.
Transaction Log Corruption: SQL Server uses transaction logs to ensure ACID
properties (Atomicity, Consistency, Isolation, Durability). In this case, the transaction log
got corrupted during the execution of a transaction. Because the logs couldn't be written
properly, the system was unable to recover from the failure.
Crash and Data Inconsistency: As a result of the corrupted logs, several transactions
couldn't be rolled back or completed, leading to inconsistent data within the database.
Users trying to perform operations on the affected system encountered errors.
Downtime: The failure led to several hours of downtime for the affected systems, which
was detrimental to the businesses relying on SQL Server for handling customer
transactions.
Consequences:
Data Loss and Inconsistency: Some data was lost, and other parts of the database became
inconsistent. Even though data could be partially recovered from backups, the system had
issues with incomplete transactions, leading to partial orders and inconsistent customer
records.
Business Impact: Many online transactions could not be processed during the peak
shopping period, resulting in revenue loss, customer dissatisfaction, and reputational
damage.
Recovery Efforts: The failure required a lengthy disaster recovery process, with
Microsoft providing patches and hotfixes to prevent the issue from reoccurring.
iii) Media Failures
Example
Imagine a company that uses a relational database management system (RDBMS) like MySQL,
PostgreSQL, or Oracle. The database stores critical data for the company, such as customer
information, orders, and financial transactions. This data is stored on physical disk drives in the
form of files that make up the database.
Media Failure
If the disk on which the database files are stored fails (e.g., due to hardware malfunction, bad
sectors, or power surges), the database system will be unable to access the files that contain the
critical data. This constitutes a media failure.
Impact
o Data Loss: The data stored on the failed disk may be lost, especially if there are no backups
or redundant storage mechanisms (such as RAID) in place.
o System Downtime: The database will be unavailable until the media failure is resolved,
causing downtime for applications and services relying on the database.
o Corruption: In some cases, if the failure happens during a write operation, the database could
become corrupted, making it impossible to restore or read the data.
iv) Human Errors
Human errors in database systems can significantly impact data integrity, security, and
performance. These errors may stem from user mistakes, misconfigurations, or lack of training.
Here are some common types of human errors in database systems:
Example
A common human error in a database management system (DBMS) is incorrect data entry. For
example, a user accidentally enters "1234" as a phone number instead of "234-567-8901." This
mistake could lead to data inconsistency and cause issues in data retrieval or reporting.
v) Natural Disasters
In the context of a database management system (DBMS), natural disasters can refer to events
like earthquakes, floods, fires, or other calamities that can physically damage the hardware storing
the database, causing data loss or corruption. For instance:
Floods can damage servers and storage devices, leading to potential loss of critical data.
Earthquakes can cause server racks or data centers to collapse, resulting in system downtime
and the inability to access databases.
Fires can destroy physical infrastructure, including databases, if not properly backed up or
stored off-site.
vi) Malicious Attacks
Malicious attacks can lead to database failures in several ways, causing both direct and indirect
impacts. Here are some common malicious attack vectors that can compromise database integrity
and result in failure:
SQL Injection
How it Leads to Failure: Attackers inject malicious SQL queries into input fields or
application parameters. If these queries are executed by the database, the attacker could
retrieve, modify, or delete sensitive data. In some cases, attackers might even drop entire
tables or databases, causing the system to fail.
Example: An attacker enters '; DROP TABLE users;-- into a form field, which could delete
the users table if not properly sanitized.
Denial-of-Service (DoS/DDoS) Attacks
How it Leads to Failure: Attackers overwhelm the database server with a flood of requests, exhausting CPU, memory, or connection capacity so that legitimate queries cannot be served.
Example: A DDoS attack may flood a database server with thousands of requests per second, eventually causing it to crash due to high CPU usage or memory depletion.
Brute Force Attacks
How it leads to Failure: Attackers may use brute force techniques to guess database login
credentials. Once successful, they gain unauthorized access, which can be used to cause
damage, including data corruption or deletion.
Example: An attacker continuously tries different password combinations until they gain
access to the database, where they delete or modify key records.
Key Impacts of Malicious Attacks on Databases:
Loss of Data: Attackers can delete, corrupt, or exfiltrate sensitive information, leading to
significant data loss.
Data Integrity Issues: Altered or corrupted data may lead to inconsistent or unreliable
information, causing system failures.
Business Disruption: Extended downtime due to an attack can halt business operations,
resulting in financial losses.
Full Database Backup and Restore
A full database backup involves creating a complete copy of all data, schema, and transaction logs
in the database. This ensures that if data loss or corruption occurs, the database can be fully restored
to its state at the time of the backup.
Backup Process:
i. Full Backup: A snapshot of the entire database is taken, including data, indexes, and system
files.
ii. Store Backup: The backup file is stored securely for future recovery.
Restore Process:
i. Full Restore: During recovery, the full backup is applied to restore the entire database to its
last backup point.
ii. Point-in-Time Restore (Optional): If combined with transaction log backups, the database can
be restored to a specific point in time.
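A minimal sketch in SQL Server T-SQL (the NewCafe database comes from the earlier example; the file path is illustrative):
BACKUP DATABASE NewCafe TO DISK = 'C:\Backups\NewCafe_full.bak';
RESTORE DATABASE NewCafe FROM DISK = 'C:\Backups\NewCafe_full.bak' WITH REPLACE;
In MySQL, the same workflow is usually performed with the mysqldump utility rather than SQL statements.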
Transaction Log Recovery
Transaction log recovery is a process in which the database uses transaction logs to restore data
after a failure, ensuring data consistency and integrity.
Process:
i. Regular Backups: Full backups of the database are taken periodically, with transaction log backups taken in between.
ii. Transaction Logs: These logs track all changes (inserts, updates, deletes) made to the
database, allowing for precise recovery.
iii. Restore Full Backup: The most recent full backup is restored as the base for recovery.
iv. Apply Transaction Logs: After restoring the full backup, the transaction logs are applied
in sequence to roll forward all committed transactions and undo any incomplete or
uncommitted transactions.
Transaction log recovery allows databases to be restored to a consistent state, either to the last
backup or to a specific point in time, minimizing data loss and ensuring system reliability.
RAID and Mirroring
Both are techniques used in database failure recovery to ensure data availability, redundancy, and fault tolerance in the event of hardware failures.
RAID:
o RAID 1 (Mirroring): Data is copied to two or more disks. If one fails, the other keeps the
data available.
o RAID 5/6 (Parity): Distributes data with parity across multiple disks. Can recover from
one or two disk failures.
o RAID 10 (1+0): Combines RAID 1 and RAID 0 for redundancy and performance.
Mirroring:
o Involves duplicating database data to a secondary location. If the primary fails, the system
switches to the mirrored copy, ensuring minimal downtime.
Choosing the Right Recovery Method
The choice of recovery method depends on:
Criticality of Data: Mission-critical systems benefit from transaction log recovery or high-
availability solutions.
Database Size: Differential or incremental backups are suitable for large databases.
Cost Considerations: RAID and clustering incur higher costs but offer near-instant
recovery.
Distributed Database Systems
A distributed database system (DDBS) manages data that is spread over multiple interconnected sites while presenting it to users as a single logical database. Key characteristics include:
Data Distribution:
Data is stored across multiple sites or nodes, which may reside in different physical locations.
Transparency:
Location Transparency: Users can access data without needing to know where it is located.
Fragmentation Transparency: Users don’t need to know how data is partitioned across
nodes.
Autonomy:
Each node can function autonomously and may have its database management system
(DBMS).
Nodes can manage their operations while still being part of the larger distributed system.
Scalability:
DDBS can handle increasing amounts of data and users by adding more nodes.
Fault Tolerance:
o Distributed systems are designed to continue functioning even if some nodes fail, ensuring
high availability and reliability.
i) Homogeneous Databases
Homogeneous databases are characterized by all participating nodes sharing the same database
management system (DBMS) and schema structure. These databases are designed to offer a
unified and consistent view of data across all nodes.
Key Characteristics:
i. Uniform Software: All nodes use the same DBMS software. This ensures compatibility and
avoids issues arising from differences in database functionalities.
ii. Consistent Schema: The database schema is identical across all nodes. This means that data
representation, table structures, and relationships are uniform.
iii. Ease of Management: Due to the homogeneity, managing and maintaining the system is
easier compared to heterogeneous systems, as the tools and configurations are standardized.
v. Simplified Communication: Since all nodes use the same software, they can communicate
directly without the need for additional middleware or translation mechanisms.
Example: Consider a distributed e-commerce system with multiple nodes handling customer
orders, inventory management, and shipping. Employing a homogeneous database approach, all
nodes share the same DBMS (e.g., MySQL) and adhere to a consistent schema. When an order
is placed on one node, the system automatically synchronizes the order details and inventory
updates across all nodes, enabling real-time visibility and consistency.
ii) Heterogeneous Databases
It is a distributed database system where the databases are stored across multiple locations and
managed by different database management systems (DBMS), which may have diverse structures,
data models, query languages, and access protocols. The main purpose of such a system is to
provide a unified interface for accessing and managing the distributed, heterogeneous data.
Key Characteristics of Heterogeneous Distributed Database Systems
Diverse Data Models: Combines relational, NoSQL, hierarchical, or object-oriented
databases with varying schemas.
Different DBMS: Integrates databases managed by different systems (e.g., Oracle,
MySQL, MongoDB).
Autonomy: Each database operates independently with its own control.
Geographic Distribution: Data is physically distributed across multiple locations.
Query Translation: Converts queries to match the specific syntax of each database.
Middleware Coordination: Uses middleware to enable communication and data
integration between diverse systems.
Homogeneous vs Heterogeneous Distributed Database Systems
Consistency Models: Homogeneous systems have more straightforward consistency models due to the same DBMS; heterogeneous systems face complex consistency models, as different systems may follow different consistency paradigms.
Data Distribution: In homogeneous systems, data distribution is easier to manage as all nodes follow the same principles; in heterogeneous systems it is more challenging due to varying data models and systems.
Replication: Homogeneous systems have a simplified replication process, as the same DBMS is used across all sites; in heterogeneous systems replication is more complicated due to different DBMSs with varying replication strategies.
Performance: Homogeneous systems generally perform better due to uniformity in architecture; in heterogeneous systems performance may vary depending on how the different systems are optimized and interact.
Example: Homogeneous: distributed MySQL or distributed PostgreSQL. Heterogeneous: a system combining MySQL, MongoDB, and Oracle databases at different sites.
Query processing in distributed databases refers to the process of executing a query across multiple
interconnected databases, which can be spread across different locations, machines, or even
organizations. The goal is to ensure that the query is executed efficiently, data is fetched from the
appropriate sites, and results are returned in a timely manner.
Here are the key steps and concepts involved in query processing for distributed databases:
1. Query Decomposition
Parsing: The query is parsed to understand the structure, such as selecting tables, columns,
and conditions.
Translation: The high-level query (e.g., SQL) is translated into a more abstract internal
query representation, such as a relational algebra expression.
Decomposition: The query is decomposed into subqueries based on data distribution.
These subqueries can be run on different databases in the system. The challenge is to
determine how to break down the query efficiently based on data location and availability.
2. Query Optimization
Centralized vs. Distributed Optimization: In distributed databases, query optimization
can be challenging because the optimizer must consider the distribution of data across
nodes and the cost of data transfer between them.
Join Ordering: The order in which joins are executed can significantly affect performance.
The optimizer must choose the most efficient join strategy (e.g., nested loops, hash join)
considering the location of data.
Minimizing Data Transfer: Transferring large amounts of data over the network can be
expensive. The optimizer may choose to filter or aggregate data early in the execution to
reduce the amount of data that needs to be sent.
3. Query Execution
Data Localization: The system must decide where to execute each part of the query. Some
operations may be performed locally on each node, while others might require
communication between nodes (e.g., if the data is distributed).
Parallel Execution: Distributed databases often execute queries in parallel across different
sites. This requires managing concurrency and ensuring that results from multiple sites are
combined correctly.
Distributed Joins: If the data for a join resides on multiple nodes, the database must decide
how to perform the join. Techniques such as broadcast joins or partitioned joins can be
used to reduce data movement.
4. Data Fragmentation and Replication
Fragmentation: Data can be fragmented horizontally (dividing rows across different sites)
or vertically (dividing columns). Query execution must take into account where the
fragments are stored.
Replication: Data may be replicated across multiple nodes for fault tolerance and
performance reasons. The query processor must decide whether to access a replica or the
original copy based on factors such as load and availability.
5. Transaction Management and Consistency
Distributed Transactions: Queries in distributed databases are often part of larger
transactions. Ensuring ACID (Atomicity, Consistency, Isolation, and Durability) properties
in a distributed setting requires careful management of resources and coordination between
nodes.
Two-Phase Commit (2PC): This is a protocol used to ensure that all participating sites
either commit or abort a transaction in a consistent manner.
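A minimal sketch of the statements one MySQL participant runs under an external 2PC coordinator (the transaction id 'txn1' and the accounts table are illustrative):
XA START 'txn1';
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
XA END 'txn1';
XA PREPARE 'txn1'; -- phase 1: the node votes that it is ready to commit
XA COMMIT 'txn1'; -- phase 2: issued only after every participant has prepared
-- XA ROLLBACK 'txn1' would be issued instead if any participant failed to prepare.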
6. Cost Estimation and Selection
Cost Models: A cost model is used to estimate the resources required to execute a query,
including computation, data transfer, and disk I/O. The optimizer uses these estimates to
choose the best execution plan.
Factors in Cost Estimation: This includes network latency, data transfer costs, disk I/O
costs, CPU usage, and memory usage.
7. Result Aggregation and Presentation
Once the query is executed on all relevant nodes, the results are aggregated and presented
to the user. In distributed systems, this step might involve combining data from different
sites and resolving conflicts if replication is involved.
Challenges in Distributed Query Processing:
Data Distribution: Deciding where data should reside and how it can be accessed
efficiently is a key challenge.
Network Latency: Minimizing the cost of data transfer between nodes in the system.
Concurrency Control: Ensuring correct and consistent results when multiple users or
processes are accessing distributed data simultaneously.
Fault Tolerance: Ensuring the query processing is resilient to network or node failures,
which requires mechanisms for replication and recovery.
Tools and Techniques:
Distributed Query Optimizers: These specialized optimizers take into account the
distributed nature of the database to improve query performance.
Materialized Views: Storing precomputed results of queries to avoid repeated execution.
Data Sharding: Dividing the database into smaller, more manageable parts, each of which
can be stored on different nodes.
Appendix A
Tools used for the Information and Database Management Module
- PostgreSQL
- SQLite
- MySQL Workbench
- DBeaver
- MongoDB
- HBase
- Redis
Appendix B
Here are some common terms in the fundamentals of database systems:
1. Database: A collection of organized data that can be easily accessed, managed, and
updated.
2. Database Management System (DBMS): Software that allows users to create, manage,
and interact with databases. It provides an interface for users and programs to access data.
3. Table: A collection of related data organized into rows and columns. Each table represents
a specific type of entity in the database.
4. Row (Record/Tuple): A single data item or entry in a table, representing a unique instance
of an entity.
5. Column (Field/Attribute): A single characteristic or property of an entity, represented as
a vertical division in a table.
6. Primary Key: A column or a combination of columns in a table that uniquely identifies
each record in that table.
7. Foreign Key: A column or a set of columns in one table that refers to the primary key in
another table, establishing a relationship between the two tables.
8. Entity: A real-world object or concept that can have data stored about it in the database
(e.g., an employee, a product).
9. Schema: The structure that defines the organization of data in a database, including tables,
views, and relationships.
10. Normalization: The process of organizing data in a database to reduce redundancy and
improve data integrity by dividing large tables into smaller, more manageable ones.
11. Denormalization: The process of combining tables to reduce the complexity of a database,
often for performance reasons.
12. SQL (Structured Query Language): A standardized language used to query and manage
databases, including commands for selecting, inserting, updating, and deleting data.
13. Query: A request for data or information from the database, typically written in SQL.
14. Index: A data structure that improves the speed of data retrieval operations on a database
table by providing quick access to rows based on the values of one or more columns.
15. Transaction: A unit of work that is executed as a single, indivisible operation, ensuring
database consistency. It typically includes operations like INSERT, UPDATE, and
DELETE.
16. ACID Properties: The set of properties that guarantee database transactions are processed
reliably:
o Atomicity: The transaction is all-or-nothing.
o Consistency: The database moves from one valid state to another.
o Isolation: Transactions do not interfere with each other.
o Durability: Once a transaction is committed, it is permanent.
17. Relational Model: A type of database model that organizes data into tables (relations),
using rows and columns.
18. Normalization Forms (1NF, 2NF, 3NF): Specific guidelines for organizing database
tables to minimize redundancy and dependency.
19. Data Integrity: The accuracy and consistency of data within a database.
20. Backup: The process of making a copy of the database to ensure data can be restored in
case of failure or loss.
21. View: A virtual table created by querying one or more tables. It doesn't store data itself but
displays data from underlying tables.
22. Stored Procedure: A precompiled collection of one or more SQL statements that can be
executed as a single unit to perform a task.
23. Trigger: A set of actions automatically performed in the database in response to specific
events (e.g., data insertion, deletion, or update).
24. Data Warehouse: A large collection of data used for analysis and reporting, often
structured differently from operational databases to support decision-making processes.
25. NoSQL: A class of databases designed to handle large volumes of unstructured data,
providing more flexibility than traditional relational databases.
26. Distributed Database: A database that is distributed across multiple physical locations,
either on different machines or in different geographical locations. Data may be replicated
or partitioned.
27. Sharding: A method of horizontal partitioning where a database is divided into smaller,
more manageable pieces (shards) that are distributed across multiple servers.
28. Replication: The process of copying data from one database server to another to improve
data availability, fault tolerance, and load balancing.
29. Concurrency Control: A mechanism to ensure that database transactions are executed in
such a way that the database remains in a consistent state, even when multiple transactions
occur simultaneously. Common methods include locking and optimistic concurrency
control.
30. Locking: A mechanism to control access to data in a multi-user environment. Locks can be
placed on data to prevent other transactions from modifying it until the lock is released.
o Exclusive Lock: Prevents other transactions from reading or writing to the locked data.
o Shared Lock: Allows other transactions to read the data but prevents writing.
31. Deadlock: A situation where two or more transactions are blocked because each is waiting
for the other to release a lock. Deadlock detection and resolution techniques are used to
handle this problem.
32. Eventual Consistency: A model of consistency in distributed databases where updates to
data are propagated through the system over time, but immediate consistency is not
guaranteed. It is often used in NoSQL databases.
33. Database Partitioning: The practice of dividing a database into smaller, more manageable
pieces to improve performance and scalability. This can be done in different ways:
o Horizontal Partitioning: Dividing a table into rows, with each partition containing
a subset of the rows.
o Vertical Partitioning: Dividing a table into columns, where each partition contains
a subset of the columns.
34. CAP Theorem: A principle that states that a distributed database can provide only two out
of three guarantees at a time:
o Consistency: Every read receives the most recent write (or an error).
o Availability: Every request receives a response, even if some nodes are down.
o Partition Tolerance: The system continues to function even if network partitions occur.
35. OLTP (Online Transaction Processing): A type of database management that supports
transaction-oriented applications, such as banking or e-commerce systems, which require
fast query processing, high concurrency, and consistency.
36. OLAP (Online Analytical Processing): A database design optimized for query processing
and analytics, focusing on data retrieval for decision support rather than transactional
operations. OLAP systems typically use multidimensional data models (e.g., star or
snowflake schema).
37. Data Mining: The process of discovering patterns, relationships, or knowledge from large
datasets using statistical, machine learning, or computational techniques.
38. Big Data: Large, complex datasets that require special handling and storage techniques
due to their volume, variety, and velocity. Big data tools and databases (e.g., Hadoop,
Apache Spark) handle high-volume, high-velocity data.
39. NoSQL Database Types: Common categories of NoSQL systems include:
o Document Stores: Store semi-structured documents, typically JSON (e.g., MongoDB, CouchDB).
o Key-Value Stores: Store data as simple key-value pairs (e.g., Redis, DynamoDB).
o Column-Family Stores: Store data in columns rather than rows (e.g., Cassandra, HBase).
o Graph Databases: Use graph structures with nodes and edges to represent and query data (e.g., Neo4j).
40. Cloud Databases: Databases that are hosted on cloud platforms (e.g., AWS, Azure, and
Google Cloud), offering scalability, high availability, and flexibility in terms of
infrastructure management.
41. Data Lake: A storage repository that can hold vast amounts of raw, unstructured, or semi-
structured data in its native format. Unlike traditional databases, data lakes are designed to
store large-scale data without the need for preprocessing.
42. Data Federation: A data integration technique that allows access to multiple databases
and data sources as if they were a single database, without physically consolidating them.
43. Data Warehousing: The practice of collecting, storing, and managing large amounts of
data from different sources for reporting and analysis. Data warehouses typically store
historical data and support complex queries.
44. ETL (Extract, Transform, Load): A process used in data warehousing to extract data
from different sources, transform it into a usable format, and load it into a data warehouse.
45. Materialized Views: A database object that stores the result of a query for faster access.
Unlike regular views, which are computed on demand, materialized views are precomputed
and stored on disk.
46. Query Optimization: The process of improving the efficiency of SQL queries to reduce
execution time and resource consumption. This can involve indexing, rewriting queries,
and adjusting the database configuration.
47. Hybrid Databases: A database that supports both transactional and analytical workloads
(e.g., HTAP - Hybrid Transactional/Analytical Processing), enabling real-time analytics
on transactional data.
48. In-Memory Databases: A database that stores data primarily in the system’s main
memory (RAM) rather than on traditional disk storage, which provides much faster data
access and processing speeds (e.g., Redis, SAP HANA).
49. ACID vs. BASE: Two contrasting consistency philosophies:
o ACID: (Atomicity, Consistency, Isolation, Durability) emphasizes strict consistency and reliability, typical of relational databases.
o BASE: (Basically Available, Soft state, Eventually consistent) focuses on availability and flexibility in distributed systems, often used in NoSQL databases.
50. Blockchain Databases: A decentralized, distributed ledger system used to securely store data in a tamper-evident manner. Blockchain databases are often used for cryptocurrencies and other applications requiring transparent and immutable records.
Appendix C
Model Questions
A. MongoDB C. Redis
B. MySQL D. Cassandra
A. SQL C. Python
B. HTML D. JavaScript
4. Which of the following is the key used to uniquely identify a record in a table?
A. DROP C. REMOVE
B. DELETE D. CLEAR
A. Reducing data redundancy C. Combining data from multiple tables
A. Tables C. Documents
B. Objects D. Nodes
A. Descending C. Alphabetical
B. Ascending D. Random
15. What is a schema in database terms?
A. MODIFY C. CHANGE
B. UPDATE D. ALTER
A. WHERE C. SELECT
B. HAVING D. ORDER BY
19. Which of the following ensures that no two rows in a table have identical values?
A. Transactions are executed in a complete manner
25. Which database technology is primarily designed for handling real-time data?
A. It reduces redundancy
A. A rule that defines how changes to data are propagated across nodes
B. A type of query optimization algorithm
C. A security mechanism for encryption
D. A method for archiving data
A. Single-point failure
D. Ensuring data consistency across nodes
B. Centralized data storage
C) There is no locking mechanism
D) The database is in read-only mode
36. What is the function of a cost-based query optimizer?
A) It measures the runtime of the query.
B) It uses predefined rules to optimize queries.
C) It estimates the cost of different query plans and chooses the lowest-cost plan.
D) It rewrites queries into equivalent forms for execution.
37. A schedule is conflict-serializable if:
A) There are no conflicting operations between transactions.
B) The precedence graph has no cycles.
C) All transactions are executed in a predefined order.
D) It avoids cascading aborts.
38. Which security mechanism can prevent a brute force attack on a database?
A) Limiting the number of login attempts
B) Disabling indexing
C) Avoiding SQL joins
D) Increasing server capacity
39. Which of the following is NOT a feature of homogeneous distributed databases?
A) Easier integration
B) Uniform query language
C) Seamless communication between nodes
D) Supports multiple database platforms
40. Which of the following challenges is common in heterogeneous distributed databases?
A) Query translation between different query languages
B) Data integrity enforcement in a single format
C) Uniform access to all data
D) Centralized control of data
41. Which of the following is a key difference between homogeneous and heterogeneous
distributed databases?
A) Heterogeneous systems support multiple DBMSs, while homogeneous systems use only one.
B) Homogeneous systems require manual synchronization of data.
C) Heterogeneous systems cannot handle distributed queries.
D) Homogeneous systems always operate without failures.
42. An example of a heterogeneous distributed database is:
A) A collection of MySQL databases replicated across multiple sites
B) An integration of Oracle, SQL Server, and MongoDB databases
C) A system of multiple SQLite instances sharing the same schema
D) A single PostgreSQL database split into shards
43. A trigger in a database is:
A) A condition that enforces constraints on data integrity
B) A stored procedure that automatically executes in response to specific database events
C) A special type of index for query optimization
D) A tool for encrypting sensitive data
44. What is the primary limitation of triggers?
A) They can only be executed once.
B) They cannot interact with other tables.
C) They may lead to performance overhead if overused.
D) They cannot handle DELETE operations.
45. A database view is:
A) A physical copy of a table
B) A virtual table based on a SELECT query
C) An index on a table
D) A trigger for data validation
46. What is the difference between WHERE and HAVING in SQL?
A) WHERE filters rows before grouping, and HAVING filters rows after grouping
D) There is no difference
A) To group rows that have the same values into summary rows
B) To filter rows after grouping
C) To join multiple tables together
D) To sort the result set
49. Which of the following statements will return the second highest salary from an Employee
table?
Answer Key
1) B 18) A 35) B
2) B 19) B 36) C
3) A 20) B 37) B
4) B 21) A 38) A
5) B 22) A 39) D
6) A 23) D 40) A
7) A 24) B 41) A
8) A 25) A 42) B
9) A 26) B 43) B
10) A 27) A 44) C
11) A 28) A 45) B
12) B 29) D 46) A
13) B 30) C 47) A
14) A 31) D 48) B
15) A 32) C 49) A
16) B 33) B 50) A
17) A 34) B