DBMS-Unit 5

UNIT V:

Distributed Databases: Architecture, Data Storage, Transaction Processing, Query processing and optimization – NOSQL Databases: Introduction – CAP Theorem – Document Based systems – Key value Stores – Column Based Systems – Graph Databases. Database Security: Security issues – Access control based on privileges – Role Based access control – SQL Injection – Statistical Database security – Flow control – Encryption and Public Key infrastructures – Challenges

INTRODUCTION TO DISTRIBUTED DATABASES

DISTRIBUTED DATABASE CONCEPTS
In a distributed database (DDB), a unit of execution (a transaction) is processed in a distributed manner.
Defn: A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network, and a distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user.
Parallel Versus Distributed Technology
i) Shared memory (tightly coupled) architecture: Multiple processors share secondary (disk) storage and also share primary memory.
ii) Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk) storage but each has its own primary memory.

Fig: distributed database architecture

Advantages of Distributed Databases

1. Management of distributed data with different levels of transparency:

• Distribution or network transparency: This may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.
• Replication transparency: Copies of data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.
• Fragmentation transparency: Two types of fragmentation are possible. Horizontal fragmentation distributes a relation into sets of tuples (rows). Vertical fragmentation distributes a relation into subrelations, where each subrelation is defined by a subset of the columns of the original relation. Fragmentation transparency makes the user unaware of the existence of fragments.
Example: The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented
horizontally and stored with possible replication as shown below.

2. Increased reliability and availability:

Reliability is broadly defined as the probability that a system is running (not down) at a given time point, whereas availability is the probability that the system is continuously available during a time interval. Replicating data and software at more than one site improves both. A distributed database system has multiple nodes (computers), and if one fails, others are available to do the job.
3. Improved performance:
A distributed DBMS fragments the database by keeping the data closer to where it is needed
most. Data localization reduces the contention for CPU and I/O services and simultaneously
reduces access delays involved in wide area networks.
4. Easier expansion: In a distributed environment, expansion of the system in terms of
adding more data, increasing database sizes, or adding more processors is much easier.

Additional Functions of Distributed Databases


• Keeping track of data: The ability to keep track of the data distribution, fragmentation,
and replication by expanding the DDBMS catalog.
• Distributed query processing: The ability to access remote sites and transmit queries and
data among the various sites via a communication network.
• Distributed transaction management: The ability to devise execution strategies for
queries and transactions that access data from more than one site and to synchronize the
access to distributed data and maintain integrity of the overall database.
• Replicated data management: The ability to decide which copy of a replicated data item
to access and to maintain the consistency of copies of a replicated data item.
• Distributed database recovery: The ability to recover from individual site crashes and from new types of failures, such as the failure of communication links.
• Security: Distributed transactions must be executed with the proper management of the
security of the data and the authorization/access privileges of users.
• Distributed directory (catalog) management: A directory contains information
(metadata) about data in the database. The directory may be global for the entire DDB, or
local for each site. The placement and distribution of the directory are design and policy
issues.

DATA FRAGMENTATION
Data Fragmentation: Breaking up the database into logical units, called fragments, which
may be assigned for storage at the various sites.
A relation can be fragmented in two ways:
■ Horizontal Fragmentation
■ Vertical Fragmentation
Horizontal fragmentation:
A horizontal fragment is a horizontal subset of a relation containing those tuples that satisfy a selection condition. Consider the EMPLOYEE relation with the selection condition (Dno = 5): all tuples satisfying this condition form a subset that is a horizontal fragment of the EMPLOYEE relation. A selection condition may be composed of several conditions connected by AND or OR.
Derived horizontal fragmentation: The partitioning of a secondary relation based on the fragmentation of a primary relation to which it is related by a foreign key.
Each horizontal fragment of a relation R can be specified by a σCi(R) operation in the relational algebra.
Complete horizontal fragmentation → A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R; that is, every tuple in R satisfies (C1 OR C2 OR … OR Cn).
Disjoint complete horizontal fragmentation → No tuple in R satisfies (Ci AND Cj) for any i ≠ j.
To reconstruct R from horizontal fragments, a UNION is applied.
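As a minimal SQL sketch (assuming the EMPLOYEE schema shown in the query-processing example later in this unit, and a DBMS that supports CREATE TABLE ... AS SELECT), the fragments σDno=5(EMPLOYEE) and σDno=4(EMPLOYEE) and their reconstruction look like:

-- Horizontal fragments: each holds the tuples satisfying its condition
CREATE TABLE EMP_D5 AS SELECT * FROM EMPLOYEE WHERE Dno = 5;
CREATE TABLE EMP_D4 AS SELECT * FROM EMPLOYEE WHERE Dno = 4;
-- Reconstruction of the fragmented portion by UNION
SELECT * FROM EMP_D5
UNION
SELECT * FROM EMP_D4;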

Vertical fragmentation:
A vertical fragment is a subset of a relation created from a subset of its columns; it contains the values of the selected columns only, and no selection condition is used. Consider the EMPLOYEE relation: a vertical fragment can be created by keeping the values of Name, Bdate, Sex, and Address. A vertical fragment of a relation R can be specified by a ΠLi(R) operation in the relational algebra.
Complete vertical fragmentation → A set of vertical fragments whose projection lists L1, L2, …, Ln include all the attributes in R but share only the primary key of R. In this case the projection lists satisfy the following two conditions:
■ L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
■ Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R.
To reconstruct R from complete vertical fragments, an OUTER UNION is applied.
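A minimal SQL sketch, assuming SSN is the primary key of EMPLOYEE (the fragment names are illustrative). Because complete vertical fragments share only the key, the relation can be reconstructed by joining them on that key:

-- Vertical fragments: each keeps the key plus a subset of the columns
CREATE TABLE EMP_PERSONAL AS SELECT SSN, Fname, Lname, Bdate, Sex, Address FROM EMPLOYEE;
CREATE TABLE EMP_WORK AS SELECT SSN, Salary, Superssn, Dno FROM EMPLOYEE;
-- Reconstruction: join the fragments on the shared primary key
SELECT * FROM EMP_PERSONAL NATURAL JOIN EMP_WORK;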
Mixed (Hybrid) fragmentation:
This is a combination of vertical fragmentation and horizontal fragmentation, achieved by a SELECT-PROJECT operation represented by ΠLi(σCi(R)). If C ≠ True and L = ATTRS(R), we get a horizontal fragment; if C = True (select all tuples) and L ≠ ATTRS(R), we get a vertical fragment; and if C ≠ True and L ≠ ATTRS(R), we get a mixed fragment. If C = True and L = ATTRS(R), then R itself can be considered a fragment.
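A one-line SQL sketch of a mixed fragment ΠL(σC(EMPLOYEE)) with L = (SSN, Salary, Dno) and C = (Dno = 5), keeping selected columns of department 5 only (the fragment name is illustrative):

CREATE TABLE EMP_D5_PAY AS
  SELECT SSN, Salary, Dno FROM EMPLOYEE WHERE Dno = 5;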
Fragmentation schema
A definition of a set of fragments (horizontal, vertical, or mixed) that includes all the attributes and tuples in the database and satisfies the condition that the whole database can be reconstructed from the fragments by applying some sequence of OUTER UNION (or OUTER JOIN) and UNION operations.

Allocation schema
It describes the distribution of fragments to sites of distributed databases. It can be
fully or partially replicated or can be partitioned.
DATA REPLICATION AND ALLOCATION
Data Replication
In full replication the entire database is replicated at all sites; in partial replication, some selected parts are replicated to some of the sites. Data replication is achieved through a replication schema.
Data Distribution (Data Allocation)
This is relevant only in the case of partial replication or partition. The selected portion
of the database is distributed to the database sites.
TYPES OF DISTRIBUTED DATABASE SYSTEMS
1. Homogeneous
All sites of the database system have an identical setup, i.e., the same database system software (e.g., Oracle), but the underlying operating systems may differ; they can be a mixture of Linux, Windows, Unix, etc.

2. Heterogeneous
There are two types of heterogeneous distributed systems:
i) Federated:
Each site may run a different database system, but data access is managed through a single conceptual schema. This implies that the degree of local autonomy is minimal: each site must adhere to a centralized access policy. There may be a global schema.
ii) Multidatabase:
There is no single conceptual global schema. For data access, a schema is constructed dynamically as needed by the application software.

Federated Database Management Systems Issues:

i) Differences in data models: relational, object-oriented, hierarchical, network, etc.
ii) Differences in constraints: each site may have its own data access and processing constraints.
iii) Differences in query language: sites may use different languages or different versions of SQL (SQL-89, SQL-92, and so on).

QUERY PROCESSING IN DISTRIBUTED DATABASES

Data Transfer Costs of Distributed Query Processing:

The cost of transferring data (files and results) over the network is usually high, so some optimization is necessary.
■ Example relations: EMPLOYEE at site 1 and DEPARTMENT at site 2
■ EMPLOYEE at site 1: 10,000 rows; row size = 100 bytes; table size = 10^6 = 1,000,000 bytes.
■ DEPARTMENT at site 2: 100 rows; row size = 35 bytes; table size = 3,500 bytes.
■ Q: For each employee, retrieve the employee name and the name of the department where the employee works:
Q: ΠFname,Lname,Dname(EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)

EMPLOYEE(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
DEPARTMENT(Dname, Dnumber, Mgrssn, Mgrstartdate)

The result of this query will have 10,000 tuples, assuming that every employee is related to a
department.
■ Suppose each result tuple is 40 bytes long. The query is submitted at site 3
and the result is sent to this site.
■ Problem: Employee and Department relations are not present at site 3.
Strategies:
1. Transfer Employee and Department to site 3.
-Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
-Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 +
1,000,000 = 1,400,000 bytes.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3.
-Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
-Optimization criterion: minimize data transfer.
Preferred approach: strategy 3.
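Restating the arithmetic above as a worked comparison, where the cost of a strategy is the total number of bytes shipped between sites:

\begin{align*}
\text{cost}_1 &= \text{size(EMPLOYEE)} + \text{size(DEPARTMENT)} = 1{,}000{,}000 + 3{,}500 = 1{,}003{,}500 \\
\text{cost}_2 &= \text{size(EMPLOYEE)} + \text{size(result)} = 1{,}000{,}000 + 400{,}000 = 1{,}400{,}000 \\
\text{cost}_3 &= \text{size(DEPARTMENT)} + \text{size(result)} = 3{,}500 + 400{,}000 = 403{,}500
\end{align*}

Strategy 3 wins because the smaller relation, not the larger one, crosses the network.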

OVERVIEW OF CONCURRENCY CONTROL AND RECOVERY IN DISTRIBUTED DATABASES:

Distributed databases encounter a number of concurrency control and recovery problems which are not present in centralized databases. Some of them are listed below.
i)Dealing with multiple copies of data items
The concurrency control must maintain global consistency. Likewise the recovery
mechanism must recover all copies and maintain consistency after recovery.
ii)Failure of individual sites
Database availability must not be affected due to the failure of one or two sites and
the recovery scheme must recover them before they are available for use.
iii)Communication link failure
This failure may create network partition which would affect database availability
even though all database sites may be running.
iv)Distributed commit
A transaction may be fragmented into subtransactions executed at a number of sites. This requires a two-phase or three-phase commit approach for transaction commit.
v)Distributed deadlock
Since transactions are processed at multiple sites, two or more sites may get involved
in deadlock. This must be resolved in a distributed manner.
Distributed Concurrency Control Based on a Distinguished Copy of a Data Item:

1.Primary site technique: A single site is designated as a primary site which serves as a
coordinator for transaction management.
Transaction management: Concurrency control and commit are managed by this site.
■ In two phase locking, this site manages locking and releasing data items. If all
transactions follow two-phase policy at all sites, then serializability is
guaranteed.
Advantages:
● An extension of centralized two-phase locking, so implementation and management are simple.
● Data items are locked at only one site but can be accessed at any site.
Disadvantages:
● All transaction management activities go to primary site which is likely to
overload the site.
● If the primary site fails, the entire system is inaccessible.
2. Primary Copy Technique:
In this approach, instead of designating a whole site, a distinguished (primary) copy of each data item is designated; the primary copies of different data items may reside at different sites. To lock a data item, just the primary copy of that data item is locked.
Advantages:
Since primary copies are distributed at various sites, a single site is not overloaded
with locking and unlocking requests.
Disadvantages:
Identification of a primary copy is complex. A distributed directory must be
maintained, possibly at all sites.
Recovery from a coordinator failure:
In both approaches a coordinator site or copy may become unavailable. This will require
the selection of a new coordinator.
3. Primary site approach with no backup site:
Aborts and restarts all active transactions at all sites. Elects a new coordinator and
initiates transaction processing.
4. Primary site approach with backup site:
Suspends all active transactions, designates the backup site as the new primary site, and identifies a new backup site. The new primary site receives all transaction management information to resume processing.
5.Primary and backup sites fail or no backup site:
Use election process to select a new coordinator site.
Concurrency control based on voting:
■ There is no primary copy or coordinator.
■ A lock request is sent to all sites that have a copy of the data item.
■ If a majority of the sites grant the lock, the requesting transaction gets the data item.
■ The locking information (granted or denied) is sent to all these sites.
■ To avoid an unacceptably long wait, a time-out period is defined. If the requesting transaction does not receive the vote information in time, the transaction is aborted.

NoSQL Tutorial: What is, Types of NoSQL Databases & Example

What is NoSQL?
A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for Big Data and real-time web apps. For example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.
NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL," NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. A NoSQL database system, in contrast, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.

Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could "scale up" our systems by upgrading the existing hardware, but this process is expensive.

The alternative is to distribute the database load over multiple hosts as the load increases. This method is known as "scaling out."
Because NoSQL databases are non-relational and designed with web applications in mind, they scale out better than relational databases.

Brief History of NoSQL Databases

● 1998 – Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database
● 2000 – Graph database Neo4j is launched
● 2004 – Google BigTable is launched
● 2005 – CouchDB is launched
● 2007 – The research paper on Amazon Dynamo is released
● 2008 – Facebook open-sources the Cassandra project
● 2009 – The term NoSQL is reintroduced

Features of NoSQL
Non-relational

● NoSQL databases never follow the relational model
● They never provide tables with flat fixed-column records
● They work with self-contained aggregates or BLOBs
● They don't require object-relational mapping or data normalization
● No complex features like query languages, query planners, referential integrity, joins, or ACID

Schema-free

● NoSQL databases are either schema-free or have relaxed schemas
● They do not require any sort of definition of the schema of the data
● They offer heterogeneous structures of data in the same domain

Simple API

● Offers easy-to-use interfaces for storing and querying the data provided
● APIs allow low-level data manipulation and selection methods
● Text-based protocols, mostly HTTP REST with JSON
● Mostly no standards-based NoSQL query language
● Web-enabled databases run as internet-facing services

Distributed

● Multiple NoSQL databases can be executed in a distributed fashion
● Offers auto-scaling and fail-over capabilities
● Often the ACID concept is sacrificed for scalability and throughput
● Mostly asynchronous rather than synchronous replication between distributed nodes: multi-master replication, peer-to-peer, HDFS replication
● Often provides only eventual consistency
● Shared-nothing architecture, which enables less coordination and higher distribution

Types of NoSQL Databases

NoSQL databases are mainly categorized into four types: key-value pair, column-oriented, graph-based, and document-oriented. Every category has its unique attributes and limitations. No single one of these database types is best for all problems; users should select a database based on their product needs.
Types of NoSQL Databases:

● Key-value pair based
● Column-oriented
● Graph-based
● Document-oriented

Key Value Pair Based

Data is stored in key/value pairs. This design is meant to handle lots of data and heavy load.

Key-value pair storage databases store data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.

For example, a key-value pair may contain a key like "Website" associated with a value like "Guru99".
It is one of the most basic NoSQL database examples. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer store schema-less data. They work best for shopping cart contents.

Redis, Dynamo, and Riak are some examples of key-value store databases; Dynamo and Riak are based on Amazon's Dynamo paper.
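Although key-value stores are not SQL systems, their access pattern can be sketched in SQL terms as a two-column table addressed only by a unique key (the table and column names here are illustrative):

-- A key-value store behaves roughly like this relation:
CREATE TABLE kv_store (k VARCHAR(255) PRIMARY KEY, v BLOB);
INSERT INTO kv_store VALUES ('Website', 'Guru99');  -- PUT
SELECT v FROM kv_store WHERE k = 'Website';         -- GET
DELETE FROM kv_store WHERE k = 'Website';           -- DELETE
-- The store never filters or joins on v; all access is by the key k.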

Column-based
Column-oriented databases work on columns and are based on Google's BigTable paper. Every column is treated separately, and the values of a single column are stored contiguously.

They deliver high performance on aggregation queries like SUM, COUNT, AVG, and MIN, as the data is readily available in a column.

Column-based NoSQL databases are widely used for data warehouses, business intelligence, CRM, library card catalogs, and so on.

HBase, Cassandra, and Hypertable are examples of column-based databases.

Document-Oriented:
Document-oriented NoSQL databases store and retrieve data as a key-value pair, but the value part is stored as a document in JSON or XML format. The value is understood by the DB and can be queried.

Relational vs. Document

In a relational database you have rows and columns, and you have to know in advance what columns you have. In a document database the data is stored as JSON-like objects; you do not need to define the structure up front, which makes it flexible.

The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and e-commerce applications. It should not be used for complex transactions that require multiple operations or queries against varying aggregate structures.

Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented DBMS systems.

Graph-Based
A graph-type database stores entities as well as the relations among those entities. Each entity is stored as a node and each relationship as an edge; an edge gives a relationship between nodes. Every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast because they are already captured in the DB, and there is no need to calculate them.

Graph databases are mostly used for social networks, logistics, and spatial data.

Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.

Query Mechanism tools for NoSQL

The most common data retrieval mechanism is REST-based retrieval of a value based on its key/ID with a GET request.

Document store databases offer more complex queries because they understand the value in a key-value pair. For example, CouchDB allows defining views with MapReduce.

What is the CAP Theorem?

The CAP theorem is also called Brewer's theorem. It states that it is impossible for a distributed data store to offer more than two of the following three guarantees:

1. Consistency
2. Availability
3. Partition Tolerance
Consistency:

The data should remain consistent even after the execution of an operation. This means once
data is written, any future read request should contain that data. For example, after updating
the order status, all the clients should be able to see the same data.

Availability:

The database should always be available and responsive. It should not have any downtime.

Partition Tolerance:

Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the
database is unavailable, other parts are always unaffected.

Eventual Consistency
The idea behind "eventual consistency" is to keep copies of data on multiple machines to get high availability and scalability; changes made to a data item on one machine then have to be propagated to the other replicas.

Data replication may not be instantaneous: some copies are updated immediately, others in due course of time. The copies may be mutually inconsistent for a while, but in due course of time they become consistent. Hence the name eventual consistency.

BASE: Basically Available, Soft state, Eventual consistency

● Basically available means the DB is available all the time, as per the CAP theorem
● Soft state means the system state may change even without input
● Eventual consistency means that the system will become consistent over time
Advantages of NoSQL

● Can be used as a primary or analytic data source
● Big Data capability
● No single point of failure
● Easy replication
● No need for a separate caching layer
● Provides fast performance and horizontal scalability
● Can handle structured, semi-structured, and unstructured data with equal effect
● Object-oriented programming that is easy to use and flexible
● Doesn't need a dedicated high-performance server
● Supports key developer languages and platforms
● Simpler to implement than an RDBMS
● Handles big data, managing data velocity, variety, volume, and complexity
● Excels at distributed database and multi-data-center operations
● Offers a flexible schema design that can easily be altered without downtime or service disruption

Disadvantages of NoSQL

● No standardization rules
● Limited query capabilities
● RDBMS databases and tools are comparatively mature
● Does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously
● As the volume of data increases, it becomes difficult to maintain unique keys
● Doesn't work as well with relational data
● The learning curve is steep for new developers
● Open-source options are not yet as widely adopted by enterprises

Introduction to Database Security Issues

■ Types of Security
■ Legal and ethical issues
■ Policy issues
■ System-related issues
■ The need to identify multiple security levels 

■ Threats to databases
■ Loss of integrity
■ Loss of availability
■ Loss of confidentiality
■ To protect databases against these types of threats four kinds of countermeasures can
be implemented:
■ Access control
■ Inference control
■ Flow control
■ Encryption

A DBMS typically includes a database security and authorization subsystem that is responsible for ensuring the security of portions of a database against unauthorized access.

■ Two types of database security mechanisms:


■ Discretionary security mechanisms
■ Mandatory security mechanisms
■ The security mechanism of a DBMS must include provisions for restricting access to the database as a whole. This function is called access control and is handled by creating user accounts and passwords to control the login process by the DBMS.
■ A security problem associated with databases is controlling access to a statistical database, which is used to provide statistical information or summaries of values based on various criteria. The countermeasures to the statistical database security problem are called inference control measures.
■ Another security issue is flow control, which prevents information from flowing in such a way that it reaches unauthorized users. Channels that are pathways for information to flow implicitly in ways that violate the security policy of an organization are called covert channels.
■ A final security issue is data encryption, which is used to protect sensitive data (such as credit card numbers) that is transmitted via some type of communication network. The data is encoded using an encoding algorithm. An unauthorized user who accesses the encoded data will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or keys) to decipher the data.

1.2 Database Security and the DBA

■ The database administrator (DBA) is the central authority for managing a database
system.
■ The DBA’s responsibilities include
■ granting privileges to users who need to use the system
■ classifying users and data in accordance with the policy of the
organization
■ The DBA is responsible for the overall security of the database system.
■ The DBA has a DBA account in the DBMS
■ Sometimes these are called a system or superuser account
■ These accounts provide powerful capabilities such as:
■ 1. Account creation
■ 2. Privilege granting
■ 3. Privilege revocation
■ 4. Security level assignment
■ Action 1 is access control, whereas 2 and 3 are discretionary, and 4 is used to control mandatory authorization

1.3 Access Protection, User Accounts, and Database Audits


■ Whenever a person or a group of persons needs to access a database system, the individual or group must first apply for a user account.
■ The DBA will then create a new account id and password for the user if
he/she deems there is a legitimate need to access the database
■ The user must log in to the DBMS by entering account id and password whenever
database access is needed.

■ The database system must also keep track of all operations on the database that are applied by a certain user throughout each login session.
■ To keep a record of all updates applied to the database and of the particular user who applied each update, we can modify the system log, which includes an entry for each operation applied to the database that may be required for recovery from a transaction failure or system crash.
■ If any tampering with the database is suspected, a database audit is performed
■ A database audit consists of reviewing the log to examine all accesses and
operations applied to the database during a certain time period.
■ A database log that is used mainly for security purposes is sometimes called an audit
trail.

2. Discretionary Access Control Based on Granting and Revoking Privileges

■ The typical method of enforcing discretionary access control in a database system is based on the granting and revoking of privileges.

Types of Discretionary Privileges

■ The account level:


■ At this level, the DBA specifies the particular privileges that each account
holds independently of the relations in the database.
■ The relation level (or table level):
■ At this level, the DBA can control the privilege to access each individual
relation or view in the database.
■ The privileges at the account level apply to the capabilities provided to the account
itself and can include
■ the CREATE SCHEMA or CREATE TABLE privilege, to create a schema
or base relation;
■ the CREATE VIEW privilege;
■ the ALTER privilege, to apply schema changes such as adding or removing attributes from relations;
■ the DROP privilege, to delete relations or views;
■ the MODIFY privilege, to insert, delete, or update tuples;
■ and the SELECT privilege, to retrieve information from the database by using
a SELECT query.
■ The second level of privileges applies to the relation level
■ This includes base relations and virtual (view) relations.
■ The granting and revoking of privileges generally follow an authorization model for discretionary privileges known as the access matrix model, where
■ The rows of a matrix M represent subjects (users, accounts, programs)
■ The columns represent objects (relations, records, columns, views, operations)
■ Each position M(i,j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.
■ To control the granting and revoking of relation privileges, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was first created.
■ The owner of a relation is given all privileges on that relation.
■ In SQL2, the DBA can assign an owner to a whole schema by creating the schema and associating the appropriate authorization identifier with that schema, using the CREATE SCHEMA command.
■ The owner account holder can pass privileges on any of the owned relations to other users by granting privileges to their accounts.
■ In SQL the following types of privileges can be granted on each individual relation R:
■ SELECT (retrieval or read) privilege on R:
■ Gives the account retrieval privilege.
■ In SQL this gives the account the privilege to use the SELECT
statement to retrieve tuples from R.
■ MODIFY privileges on R:
■ This gives the account the capability to modify tuples of R.
■ In SQL this privilege is further divided into UPDATE, DELETE, and
INSERT privileges to apply the corresponding SQL command to R.
■ In addition, both the INSERT and UPDATE privileges can specify
that only certain attributes can be updated by the account.
■ REFERENCES privilege on R:
■ This gives the account the capability to reference relation R when specifying integrity constraints.
■ The privilege can also be restricted to specific attributes of R (a sketch follows this list).
■ Notice that to create a view, the account must have SELECT privilege on all relations
involved in the view definition.
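As an illustration (the account name A2 follows the later examples; restricting the privilege to the Dnumber attribute is hypothetical), a restricted REFERENCES grant might look like:

GRANT REFERENCES (Dnumber) ON DEPARTMENT TO A2;
-- A2 may now declare foreign keys that reference DEPARTMENT(Dnumber),
-- e.g., a Dno column defined with REFERENCES DEPARTMENT(Dnumber).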

Specifying Privileges Using Views

■ The mechanism of views is an important discretionary authorization mechanism in its


own right. For example, if the owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a view V of R that includes only those attributes and then grant SELECT on V to B.

The same applies to limiting B to retrieving only certain tuples of R; a view V' can be created by defining the view by means of a query that selects only those tuples from R that A wants to allow B to access.
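A minimal SQL sketch of both cases, assuming the EMPLOYEE relation from this unit and an account named B (the view names V and V_PRIME stand in for the V and V' of the text):

CREATE VIEW V AS SELECT Fname, Lname FROM EMPLOYEE;          -- only some fields of R
GRANT SELECT ON V TO B;
CREATE VIEW V_PRIME AS SELECT * FROM EMPLOYEE WHERE Dno = 5; -- only certain tuples of R
GRANT SELECT ON V_PRIME TO B;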

Revoking Privileges
■ In some cases it is desirable to grant a privilege to a user temporarily. For example,
■ The owner of a relation may want to grant the SELECT privilege to a user for
a specific task and then revoke that privilege once the task is completed.
■ Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE
command is included for the purpose of canceling privileges.
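For instance, using the account names from the examples below, a temporary grant and its cancellation look like:

GRANT SELECT ON EMPLOYEE TO A3;    -- grant for a specific task
-- ... A3 completes the task ...
REVOKE SELECT ON EMPLOYEE FROM A3;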

Propagation of Privileges using the GRANT OPTION

■ Whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B with or without the GRANT OPTION.
■ If the GRANT OPTION is given, this means that B can also grant that privilege on R
to other accounts.
■ Suppose that B is given the GRANT OPTION by A and that B then grants the
privilege on R to a third account C, also with GRANT OPTION. In this way,
privileges on R can propagate to other accounts without the knowledge of the
owner of R.
■ If the owner account A now revokes the privilege granted to B, all the
privileges that B propagated based on that privilege should automatically be
revoked by the system.
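A sketch of this propagation and revocation, with the owner granting to A2 in the spirit of the examples below (many systems require an explicit CASCADE on the revoke):

GRANT SELECT ON EMPLOYEE TO A2 WITH GRANT OPTION;  -- A grants to B
-- A2 can now propagate the privilege without the owner's knowledge:
GRANT SELECT ON EMPLOYEE TO A3;                    -- issued by A2: B grants to C
-- When the owner revokes the privilege from A2, the privileges that
-- A2 propagated should automatically be revoked as well:
REVOKE SELECT ON EMPLOYEE FROM A2 CASCADE;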

Example 1:

■ Suppose that the DBA creates four accounts A1, A2, A3, A4
and wants only A1 to be able to create base relations. Then the DBA must
issue the following GRANT command in SQL
GRANT CREATETAB TO A1;
In SQL2 the same effect can be accomplished by having the DBA issue a CREATE SCHEMA command as follows:
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;
Example 2:
■ User account A1 can create tables under the schema called EXAMPLE.
■ Suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT
■ A1 is then owner of these two relations and hence all the relation privileges on
each of them.
■ Suppose that A1 wants to grant A2 the privilege to insert and delete tuples in both of
these relations, but A1 does not want A2 to be able to propagate these privileges to
additional accounts:
GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;

Mandatory Access Control and Role-Based Access Control for Multilevel Security
■ The discretionary access control techniques of granting and revoking privileges on relations have traditionally been the main security mechanism for relational database systems.
■ This is an all-or-nothing method:
■ A user either has or does not have a certain privilege.
■ In many applications, an additional security policy is needed that classifies data and users based on security classes.
■ This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms.
■ Typical security classes are top secret (TS), secret (S), confidential (C), and
unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U
■ The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U:
■ We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O).
■ Two restrictions are enforced on data access based on the subject/object classifications:
■ Simple security property: A subject S is not allowed read access to an object O unless class(S) ≥ class(O).
■ A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-property).
■ To incorporate multilevel security notions into the relational database model, it is
common to consider attribute values and tuples as data objects.
■ Hence, each attribute A is associated with a classification attribute C in the schema,
and each attribute value in a tuple is associated with a corresponding security
classification.
■ In addition, in some models, a tuple classification attribute TC is added to the
relation attributes to provide a classification for each tuple as a whole.
■ Hence, a multilevel relation schema R with n attributes would be represented as
■ R(A1,C1,A2,C2, …, An,Cn,TC)
where each Ci represents the classification attribute associated with attribute Ai.
■ The value of the TC attribute in each tuple t – which is the highest of all attribute
classification values within t – provides a general classification for the tuple itself,
whereas each Ci provides a finer security classification for each attribute value within
the tuple.
■ The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-level) relation.
■ A multilevel relation will appear to contain different data to subjects (users) with
different clearance levels.
■ In some cases, it is possible to store a single tuple in the relation at a higher
classification level and produce the corresponding tuples at a lower-level
classification through a process known as filtering.
■ In other cases, it is necessary to store two or more tuples at different
classification levels with the same value for the apparent key.
■ This leads to the concept of polyinstantiation where several tuples can have the same
apparent key value but have different attribute values for users at different
classification levels.
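An illustrative multilevel relation (the attribute values are hypothetical) showing polyinstantiation: both tuples share the apparent key Name = Smith, but an S-cleared user sees the first tuple while a C-cleared user sees the second.

Name      C1    Salary    C2    JobPerformance    C3    TC
Smith     U     40000     C     Fair              S     S
Smith     U     40000     C     Excellent         C     C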
■ In general, the entity integrity rule for multilevel relations states that all attributes
that are members of the apparent key must not be null and must have the same
security classification within each individual tuple.
■ In addition, all other attribute values in the tuple must have a security classification
greater than or equal to that of the apparent key.
■ This constraint ensures that a user can see the key if the user is permitted to
see any part of the tuple at all.
■ Other integrity rules, called null integrity and interinstance integrity, informally
ensure that if a tuple value at some security level can be filtered (derived) from a
higher-classified tuple, then it is sufficient to store the higher-classified tuple in the
multilevel relation.
Comparing Discretionary Access Control and Mandatory Access Control
■ Discretionary Access Control (DAC) policies are characterized by a high degree of
flexibility, which makes them suitable for a large variety of application domains.
■ The main drawback of DAC models is their vulnerability to malicious attacks,
such as Trojan horses embedded in application programs.
■ By contrast, mandatory policies ensure a high degree of protection in that they prevent any illegal flow of information.
■ Mandatory policies have the drawback of being too rigid and they are only applicable
in limited environments.
■ In many practical situations, discretionary policies are preferred because they offer a
better trade-off between security and applicability.

Role-Based Access Control


■ Role-based access control (RBAC) emerged rapidly in the 1990s as a proven
technology for managing and enforcing security in large-scale enterprisewide
systems.
■ Its basic notion is that permissions are associated with roles, and users are assigned to
appropriate roles.
■ Roles can be created using the CREATE ROLE and DESTROY ROLE commands.
■ The GRANT and REVOKE commands discussed under DAC can then be
used to assign and revoke privileges from roles.
■ RBAC appears to be a viable alternative to traditional discretionary and mandatory
access controls; it ensures that only authorized users are given access to certain data
or resources.
■ Many DBMSs have allowed the concept of roles, where privileges can be assigned to
roles.
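A hedged sketch of the role mechanism using the GRANT/REVOKE syntax above (the role name is illustrative; the text's DESTROY ROLE is spelled DROP ROLE in many SQL systems):

CREATE ROLE payroll_clerk;
GRANT SELECT, UPDATE ON EMPLOYEE TO payroll_clerk;  -- permissions attach to the role
GRANT payroll_clerk TO A4;                          -- users are assigned to roles
REVOKE payroll_clerk FROM A4;                       -- assignments can be revoked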
■ Role hierarchy in RBAC is a natural way of organizing roles to reflect the
organization’s lines of authority and responsibility.
■ Another important consideration in RBAC systems is the possible temporal
constraints that may exist on roles, such as time and duration of role activations, and
timed triggering of a role by an activation of another role.
■ Using an RBAC model is a highly desirable goal for addressing the key security requirements of Web-based applications.
■ In contrast, discretionary access control (DAC) and mandatory access control (MAC) models lack the capabilities needed to support the security requirements of emerging enterprises and Web-based applications.

Introduction to Statistical Database Security


■ Statistical databases are used mainly to produce statistics on various populations.
■ The database may contain confidential data on individuals, which should be protected
from user access.
■ Users are permitted to retrieve statistical information on the populations, such as
averages, sums, counts, maximums, minimums, and standard deviations.
■ A population is a set of tuples of a relation (table) that satisfy some selection
condition.
■ Statistical queries involve applying statistical functions to a population of tuples.
■ For example, we may want to retrieve the number of individuals in a population or
the average income in the population.
■ However, statistical users are not allowed to retrieve individual data, such as the
income of a specific person.
■ Statistical database security techniques must prohibit the retrieval of individual data.
■ This can be achieved by prohibiting queries that retrieve attribute values and by
allowing only queries that involve statistical aggregate functions such as COUNT,
SUM, MIN, MAX, AVERAGE, and STANDARD DEVIATION.
■ Such queries are sometimes called statistical queries.
■ It is the DBMS's responsibility to ensure the confidentiality of information about individuals while still providing useful statistical summaries of data about those individuals to users. Provision of privacy protection for users of a statistical database is paramount.
■ In some cases it is possible to infer the values of individual tuples from a sequence of statistical queries.
■ This is particularly true when the conditions result in a population consisting of a small number of tuples.
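A sketch of such an inference against a hypothetical PERSON(Name, City, Profession, Income) relation, using only the permitted aggregate functions:

-- A statistical query that narrows the population to one tuple:
SELECT COUNT(*) FROM PERSON
WHERE City = 'Chennai' AND Profession = 'Surgeon';
-- If the count returned is 1, this "statistical" query reveals
-- one individual's income:
SELECT AVG(Income) FROM PERSON
WHERE City = 'Chennai' AND Profession = 'Surgeon';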

Introduction to Flow Control


■ Flow control regulates the distribution or flow of information among accessible
objects.
■ A flow between object X and object Y occurs when a program reads values from X
and writes values into Y.
■ Flow controls check that information contained in some objects does not flow
explicitly or implicitly into less protected objects.
■ A flow policy specifies the channels along which information is allowed to move.
■ The simplest flow policy specifies just two classes of information:
■ confidential (C) and nonconfidential (N)
and allows all flows except those from class C to class N.
Covert Channels
■ A covert channel allows a transfer of information that violates the security policy.
■ A covert channel allows information to pass from a higher classification level to a lower classification level through improper means.
■ Covert channels can be classified into two broad categories:
■ Storage channels do not require any temporal synchronization; information is conveyed by accessing system information that is otherwise inaccessible to the user.
■ Timing channels allow information to be conveyed by the timing of events or processes.
■ Some security experts believe that one way to avoid covert channels is for
programmers to not actually gain access to sensitive data that a program is supposed
to process after the program has been put into operation.

Encryption and Public Key Infrastructures

■ Encryption is a means of maintaining secure data in an insecure environment.


■ Encryption consists of applying an encryption algorithm to data using some
prespecified encryption key.
■ The resulting data has to be decrypted using a decryption key to recover the original
data.
■ The Data Encryption Standard (DES) is a system developed by the U.S.
government for use by the general public.
■ It has been widely accepted as a cryptographic standard both in the United States and
abroad.
■ DES can provide end-to-end encryption on the channel between the sender A and
receiver B.
■ DES algorithm is a careful and complex combination of two of the fundamental
building blocks of encryption:
■ substitution and permutation (transposition).
■ The DES algorithm derives its strength from repeated application of these two
techniques for a total of 16 cycles.
■ Plaintext (the original form of the message) is encrypted as blocks of 64 bits.
■ After questions about the adequacy of DES, the National Institute of Standards and Technology (NIST) introduced the Advanced Encryption Standard (AES).
■ This algorithm has a block size of 128 bits and thus takes longer to crack.

Public Key Encryption


■ In 1976 Diffie and Hellman proposed a new kind of cryptosystem, which they called
public key encryption.
■ Public key algorithms are based on mathematical functions rather than operations on
bit patterns.
■ They also involve the use of two separate keys
■ in contrast to conventional encryption, which uses only one key.
■ The use of two keys can have profound consequences in the areas of
confidentiality, key distribution, and authentication.
■ The two keys used for public key encryption are referred to as the public key and the private key.
■ The private key is kept secret; it is called a private key rather than a secret key to avoid confusion with conventional encryption, where that term is used.
■ A public key encryption scheme, or infrastructure, has six ingredients:
■ Plaintext: This is the data or readable message that is fed into the algorithm as
input.
■ Encryption algorithm: The encryption algorithm performs various
transformations on the plaintext.
■ Public and private keys: These are a pair of keys that have been selected so that if one is used for encryption, the other is used for decryption.
■ The exact transformations performed by the encryption algorithm depend on the public or private key that is provided as input.
■ Ciphertext:
■ This is the scrambled message produced as output. It depends on the
plaintext and the key.
■ For a given message, two different keys will produce two different
ciphertexts.
■ Decryption algorithm:
■ This algorithm accepts the ciphertext and the matching key and
produces the original plaintext.
■ The public key is made public, and the private key is known only to its owner.
■ A general-purpose public key cryptographic algorithm relies on
■ one key for encryption and
■ a different but related key for decryption.
■ The essential steps are as follows:
■ Each user generates a pair of keys to be used for the encryption and
decryption of messages.
■ Each user places one of the two keys in a public register or other accessible
file. This is the public key. The companion key is kept private (private key).
■ If a sender wishes to send a private message to a receiver, the sender encrypts
the message using the receiver’s public key.
■ When the receiver receives the message, he or she decrypts it using the
receiver’s private key.
■ No other recipient can decrypt the message because only the receiver
knows his or her private key.
■ The RSA public key encryption algorithm, one of the first public key schemes, was introduced in 1978 by Ron Rivest (R), Adi Shamir (S), and Len Adleman (A) at MIT and is named after them.
■ The RSA encryption algorithm incorporates results from number theory,
such as the difficulty of determining the large prime factors of a large number.
■ The RSA algorithm also operates with modular arithmetic – mod n, where n is the
product of two large prime numbers.
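In textbook form, consistent with the modular-arithmetic description above (p and q are the two large primes whose product is n):

\begin{align*}
n &= p \cdot q, \qquad \phi(n) = (p-1)(q-1) \\
\text{choose } e &\text{ with } \gcd(e, \phi(n)) = 1, \qquad d \equiv e^{-1} \pmod{\phi(n)} \\
\text{encryption: } c &\equiv m^{e} \pmod{n}, \qquad \text{decryption: } m \equiv c^{d} \pmod{n}
\end{align*}

The pair (e, n) is the public key and (d, n) the private key; recovering d from (e, n) requires factoring n, which is the hard number-theoretic problem the text refers to.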
