DBMS-Unit 5
DBMS-Unit 5
DATA FRAGMENTATION
Data Fragmentation: Breaking up the database into logical units, called fragments, which
may be assigned for storage at the various sites.
A relation can be fragmented in two ways:
■ Horizontal Fragmentation
■ Vertical Fragmentation
Horizontal fragmentation:
It is a horizontal subset of a relation which contain those of tuples which satisfy
selection conditions. Consider the Employee relation with selection condition (DNO = 5).
All tuples satisfy this condition will create a subset which will be a horizontal fragment of
Employee relation. A selection condition may be composed of several conditions connected
by AND or OR.
Derived horizontal fragmentation: It is the partitioning of a primary relation to
other secondary relations which are related with Foreign keys.
Each horizontal fragment on a relation can be specified by a sCi (R) operation in the relational
algebra.
Complete horizontal fragmentation-> A set of horizontal fragments whose
conditions C1, C2, …, Cn include all the tuples in R- that is, every tuple in R satisfies (C1
OR C2 OR … OR Cn).
Disjoint complete horizontal fragmentation🡪 No tuple in R satisfies (Ci AND Cj)
where i ≠ j.
To reconstruct R from horizontal fragments a UNION is applied.
Vertical fragmentation:
It is a subset of a relation which is created by a subset of columns. Thus a vertical
fragment of a relation will contain values of selected columns. There is no selection
condition used in vertical fragmentation. Consider the Employee relation. A vertical
fragment of can be created by keeping the values of Name, Bdate, Sex, and Address.
Because there is no condition for creating a vertical fragment. A vertical fragment on a
relation can be specified by a ΠLi(R) operation in the relational algebra.
Complete vertical fragmentation🡪A set of vertical fragments whose projection lists
L1, L2, …, Ln include all the attributes in R but share only the primary key of R. In this case
the projection lists satisfy the following two conditions:
■ L1 ∪ L2 ∪ ... ∪ Ln = ATTRS (R)
■ Li ∩ Lj = PK(R) for any i j, where ATTRS (R) is the set of attributes of
R and PK(R) is the primary key of R.
To reconstruct R from complete vertical fragments a OUTER UNION is applied.
Mixed (Hybrid) fragmentation:
It is a combination of Vertical fragmentation and Horizontal fragmentation. This is
achieved by SELECT-PROJECT operations which is represented by ΠLi(sCi (R)).If C = True
(Select all tuples) and L ≠ ATTRS(R), we get a vertical fragment, and if C ≠ True and L ≠
ATTRS(R), we get a mixed fragment. If C = True and L = ATTRS(R), then R can be
considered a fragment.
Fragmentation schema
A definition of a set of fragments (horizontal or vertical or horizontal and vertical) that
includes all attributes and tuples in the database that satisfies the condition that the whole
database can be reconstructed from the fragments by applying some sequence of UNION (or
OUTER JOIN) and UNION operations.
Allocation schema
It describes the distribution of fragments to sites of distributed databases. It can be
fully or partially replicated or can be partitioned.
DATA REPLICATION AND ALLOCATION
Data Replication
Database is replicated to all sites. In full replication the entire database is replicated
and in partial replication some selected part is replicated to some of the sites. Data replication
is achieved through a replication schema.
Data Distribution (Data Allocation)
This is relevant only in the case of partial replication or partition. The selected portion
of the database is distributed to the database sites.
TYPES OF DISTRIBUTED DATABASE SYSTEMS
1. Homogeneous
All sites of the database system have identical setup, i.e., same database system
software(oracle).But the underlying operating system may be different. The underlying
operating systems can be a mixture of Linux, Window, Unix, etc.
2. Heterogeneous
2 Types of Heterogeneous distributed system
i)Federated:
Each site may run different database system but the data access is managed through a
single conceptual schema.This implies that the degree of local autonomy is minimum. Each
site must adhere to a centralized access policy. There may be a global schema.
ii)Multidatabase:
There is no one conceptual global schema. For data access a schema is constructed
dynamically as needed by the application software.
Fnam Minit Lname SSN Bdate Address Sex Salary Superssn Dno
e
The result of this query will have 10,000 tuples, assuming that every employee is related to a
department.
■ Suppose each result tuple is 40 bytes long. The query is submitted at site 3
and the result is sent to this site.
■ Problem: Employee and Department relations are not present at site 3.
Strategies:
1. Transfer Employee and Department to site 3.
-Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
-Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 +
1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site1, execute the join at site 1, and send the result to site 3
-Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
-Optimization criteria: minimizing data transfer.
Preferred approach: strategy 3.
1.Primary site technique: A single site is designated as a primary site which serves as a
coordinator for transaction management.
Transaction management: Concurrency control and commit are managed by this site.
■ In two phase locking, this site manages locking and releasing data items. If all
transactions follow two-phase policy at all sites, then serializability is
guaranteed.
Advantages:
● An extension to the centralized two phase locking so implementation and
management is simple.
● Data items are locked only at one site but they can be accessed at any site.
Disadvantages:
● All transaction management activities go to primary site which is likely to
overload the site.
● If the primary site fails, the entire system is inaccessible.
2.Primary Copy Technique:
In this approach, instead of a site, a data item partition is designated as primary copy. To
lock a data item just the primary copy of the data item is locked.
Advantages:
Since primary copies are distributed at various sites, a single site is not overloaded
with locking and unlocking requests.
Disadvantages:
Identification of a primary copy is complex. A distributed directory must be
maintained, possibly at all sites.
Recovery from a coordinator failure:
In both approaches a coordinator site or copy may become unavailable. This will require
the selection of a new coordinator.
3. Primary site approach with no backup site:
Aborts and restarts all active transactions at all sites. Elects a new coordinator and
initiates transaction processing.
4. Primary site approach with backup site:
Suspends all active transactions, designates the backup site as the primary site and
identifies a new back up site. Primary site receives all transaction management information
to resume processing.
5.Primary and backup sites fail or no backup site:
Use election process to select a new coordinator site.
Concurrency control based on voting:
■ There is no primary copy of coordinator.
■ Send lock request to sites that have data item.
■ If majority of sites grant lock then the requesting transaction gets the data
item.
■ Locking information (grant or denied) is sent to all these sites.
■ To avoid unacceptably long wait, a time-out period is defined. If the
requesting transaction does not get any vote information then the transaction is
aborted.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a
NoSQL database system encompasses a wide range of database technologies that can store
structured, semi-structured, unstructured and polymorphic data. Let’s understand about
NoSQL with a diagram in this NoSQL database tutorial:
In this NoSQL tutorial for beginners, you will learn NoSQL basics like:
● What is NoSQL?
● Why NoSQL?
● Brief History of NoSQL Databases
● Features of NoSQL
● Types of NoSQL Databases
● Query Mechanism tools for NoSQL
● What is the CAP Theorem?
● Eventual Consistency
● Advantages of NoSQL
● Disadvantages of NoSQL
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
The alternative for this issue is to distribute database load on multiple hosts whenever the
load increases. This method is known as “scaling out.”
NoSQL database is non-relational, so it scales out better than relational databases as they are
designed with web applications in mind.
● 1998- Carlo Strozzi use the term NoSQL for his lightweight, open-source relational
database
● 2000- Graph database Neo4j is launched
● 2004- Google BigTable is launched
● 2005- CouchDB is launched
● 2007- The research paper on Amazon Dynamo is released
● 2008- Facebooks open sources the Cassandra project
● 2009- The term NoSQL was reintroduced
Features of NoSQL
Non-relational
Schema-free
Simple API
● Offers easy to use interfaces for storage and querying data provided
● APIs allow low-level data manipulation & selection methods
● Text-based protocols mostly used with HTTP REST with JSON
● Mostly used no standard based NoSQL query language
● Web-enabled databases running as internet-facing services
Distributed
Key-value pair storage databases store data as a hash table where each key is unique, and the
value can be a JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like “Website” associated with a value like
“Guru99”.
It is one of the most basic NoSQL database example. This kind of NoSQL database is used as
a collection, dictionaries, associative arrays, etc. Key value stores help the developer to store
schema-less data. They work best for shopping cart contents.
Redis, Dynamo, Riak are some NoSQL examples of key-value store DataBases. They are all
based on Amazon’s Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on BigTable paper by Google.
Every column is treated separately. Values of single column databases are stored
contiguously.
Column based NoSQL database
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as
the data is readily available in a column.
HBase, Cassandra, HBase, Hypertable are NoSQL query examples of column based database.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value
part is stored as a document. The document is stored in JSON or XML formats. The value is
understood by the DB and can be queried.
In this diagram on your left you can see we have rows and columns, and in the right, we have
a document database which has a similar structure to JSON. Now for the relational database,
you have to know what columns you have and so on. However, for a document database, you
have data store like JSON object. You do not require to define which make it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics
& e-commerce applications. It should not use for complex transactions which require
multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, MongoDB, are popular
Document originated DBMS systems.
Graph-Based
A graph type database stores entities as well the relations amongst those entities. The entity is
stored as a node with the relationship as edges. An edge gives a relationship between nodes.
Every node and edge has a unique identifier.
Compared to a relational database where tables are loosely connected, a Graph database is a
multi-relational in nature. Traversing relationship is fast as they are already captured into the
DB, and there is no need to calculate them.
Graph base database mostly used for social networks, logistics, spatial data.
Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.
Document store Database offers more difficult queries as they understand the value in a
key-value pair. For example, CouchDB allows defining views with MapReduce
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
The data should remain consistent even after the execution of an operation. This means once
data is written, any future read request should contain that data. For example, after updating
the order status, all the clients should be able to see the same data.
Availability:
The database should always be available and responsive. It should not have any downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the
database is unavailable, other parts are always unaffected.
Eventual Consistency
The term “eventual consistency” means to have copies of data on multiple machines to get
high availability and scalability. Thus, changes made to any data item on one machine has to
be propagated to other replicas.
Data replication may not be instantaneous as some copies will be updated immediately while
others in due course of time. These copies may be mutually, but in due course of time, they
become consistent. Hence, the name eventual consistency.
● Basically, available means DB is available all the time as per CAP theorem
● Soft state means even without an input; the system state may change
● Eventual consistency means that the system will become consistent over time
Advantages of NoSQL
Disadvantages of NoSQL
● No standardization rules
● Limited query capabilities
● RDBMS databases and tools are comparatively mature
● It does not offer any traditional database capabilities, like consistency when multiple
transactions are performed simultaneously.
● When the volume of data increases it is difficult to maintain unique values as keys
become difficult
● Doesn’t work as well with relational data
● The learning curve is stiff for new developers
● Open source options so not so popular for enterprises.
■ Types of Security
■ Legal and ethical issues
■ Policy issues
■ System-related issues
■ The need to identify multiple security levels
■ Threats to databases
■ Loss of integrity
■ Loss of availability
■ Loss of confidentiality
■ To protect databases against these types of threats four kinds of countermeasures can
be implemented:
■ Access control
■ Inference control
■ Flow control
■ Encryption
■ The database administrator (DBA) is the central authority for managing a database
system.
■ The DBA’s responsibilities include
■ granting privileges to users who need to use the system
■ classifying users and data in accordance with the policy of the
organization
■ The DBA is responsible for the overall security of the database system.
■ The DBA has a DBA account in the DBMS
■ Sometimes these are called a system or superuser account
■ These accounts provide powerful capabilities such as:
■ 1. Account creation
■ 2. Privilege granting
■ 3. Privilege revocation
■ 4. Security level assignment
■ Action 1 is access control, whereas 2 and 3 are discretionarym and 4 is used to
control mandatory authorization
■ The database system must also keep track of all operations on the database that are
applied by a certain user throughout each login session.
■ To keep a record of all updates applied to the database and of the particular
user who applied each update, we can modify system log, which includes an
entry for each operation applied to the database that may be required for
recovery from a transaction failure or system crash.
■ If any tampering with the database is suspected, a database audit is performed
■ A database audit consists of reviewing the log to examine all accesses and
operations applied to the database during a certain time period.
■ A database log that is used mainly for security purposes is sometimes called an audit
trail.
The same applies to limiting B to retrieving only certain tuples of R; a view V’ can be created
by defining the view by means of a query that selects only those tuples from R that A wants
to allow B to access
Revoking Privileges
■ In some cases it is desirable to grant a privilege to a user temporarily. For example,
■ The owner of a relation may want to grant the SELECT privilege to a user for
a specific task and then revoke that privilege once the task is completed.
■ Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE
command is included for the purpose of canceling privileges.
Example 1:
■ Suppose that the DBA creates four accounts A1, A2, A3, A4
and wants only A1 to be able to create base relations. Then the DBA must
issue the following GRANT command in SQL
GRANT CREATETAB TO A1;
In SQL2 the same effect can be accomplished by having the DBA issue a
CREATE SCHEMA command as follows:
CREATE SCHAMA EXAMPLE AUTHORIZATION A1;
Example 2:
■ User account A1 can create tables under the schema called EXAMPLE.
■ Suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT
■ A1 is then owner of these two relations and hence all the relation privileges on
each of them.
■ Suppose that A1 wants to grant A2 the privilege to insert and delete tuples in both of
these relations, but A1 does not want A2 to be able to propagate these privileges to
additional accounts:
GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;