
CMP 214: Data management I (3 Units)

Course Contents
Information storage and retrieval, information management applications, information capture
and representation, analysis and indexing, search, retrieval, information privacy, integrity,
security, scalability, efficiency and effectiveness. Introduction to database systems,
components of a database system, DBMS functions, database architecture and data independence,
use of database query languages.
Objectives:
1. Demonstrate good knowledge of basic database concepts, including the structure and
operation of the relational data model.
2. Understand and successfully apply logical database design principles, including E-R
diagrams and database normalization.
3. Understand Information storage and retrieval.
4. Assess the quality and ease of use of data modelling and diagramming tools.
5. Understand Information capture and representation
6. Design and implement a small database project.

Learning outcome:
At the end of this course, students should be able to:
1. Define Data Management and its importance
2. Know the Basic database Terminologies
3. Know Database management system
4. Know DBMS architecture
5. Understand Information storage and retrieval
6. Know DBMS Languages
7. Know DBMS data model
8. Know ERM model
9. Understand design theory for relational database
10. Understand Relational Algebra
11. Understand Information capture and representation
12. Understand file organization method
13. Understand file characteristics & file processing activities
14. Understand Concept of buffer & File Security Techniques
15. Design and implement a small database project using Microsoft Access/MySQL.

References/Further Reading
1. Silberschatz, A., Korth, H. F., & Sudarshan, S. (2011). Database System Concepts (6th ed.).
New York: McGraw-Hill.

Course Writer/Developer: Dr. F.E. AYO, Department of Mathematical Sciences, Computer
Science Unit, Olabisi Onabanjo University, Ago-Iwoye.
E-mail: ayo.femi@oouagoiwoye.edu.ng
INTRODUCTION
Data management plays a significant role in an organization’s ability to generate revenue,
control costs and mitigate risks. Successfully being able to share, store, protect and retrieve the
ever-increasing amount of data can be the competitive advantage needed to grow in today’s
business environment.
Management of data generally focuses on the defining of the data element, how it is structured,
stored and moved. Management of information is more concerned with the security, accuracy,
completeness and timeliness of multiple pieces of data. These are all concerns that accountants
are trained to assess and help manage for an organization.

What is Data Management and Why is It Important?


The definition provided by the Data Management Association (DAMA) is:
“Data management is the development, execution and supervision of plans, policies, programs
and practices that control, protect, deliver and enhance the value of data and information
assets”.
Managing customer data results in improved customer relationships, which ultimately drives
revenues. While expanded data storage requirements have increased equipment investments,
there are also many other hidden costs associated with data management. Some of these costs
include power consumption, cooling requirements, installation, cabling, backup management,
and data recovery. Inherent within all of these costs is the need for more time and space, leading
to increases in payroll and occupancy expenses.
Lastly, but just as important, data management plays a key role in helping an organization
mitigate risks. For example, establishing a formal data retention policy can help decrease
storage costs and reduce litigation risks.

Basic Terminologies
Data
Data is the raw representation of unprocessed facts, figures, concepts or instructions. It can exist
in any form, usable or not. Data are facts presented without relation to other things, e.g. “it is
raining.”
Information
Information is processed data that has been given meaning by way of relational connection.
This "meaning" can be useful, but does not have to be. In computer parlance, a relational
database makes information from the data stored within it. Information embodies the
understanding of some sort. E.g. the temperature dropped to 15 degrees and then it started
raining.
Database
A database is a collection of organized and related files. For instance, the collection of staff-
file, student-file & equipment-file of an institution is referred to as the Institution’s Database.
It is a collection of data organized for storage in computer memory and designed for easy
access by authorized users. The data may be in the form of text, numbers, or encoded graphics.
A Database is a shared, integrated computer structure that is a repository for:
 End-user data, that is, raw facts of interest to the end user
 Metadata, or data about data, which describes the data characteristics and the set of
relationships that link the data found within the database.
A database is a collection of data, typically describing the activities of one or more related
organizations. For example, a university database might contain information about the
following:
 Entities such as students, faculty, courses, and classrooms.
 Relationships between entities, such as students' enrolment in courses, faculty teaching
courses, and the use of rooms for courses.
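As a minimal sketch, such a university database could be declared in SQL as follows; the table
and column names here are illustrative assumptions, not part of any standard schema:

CREATE TABLE student (
stu_id INT PRIMARY KEY,
stu_name VARCHAR(50)
);
CREATE TABLE course (
course_id CHAR(8) PRIMARY KEY,
title VARCHAR(50)
);
-- The enrolment table records the relationship between students and courses.
CREATE TABLE enrolment (
stu_id INT,
course_id CHAR(8),
PRIMARY KEY (stu_id, course_id),
FOREIGN KEY (stu_id) REFERENCES student (stu_id),
FOREIGN KEY (course_id) REFERENCES course (course_id)
);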
Proper storage of data in a database will enhance efficient
- Data Management
- Data processing
- Data retrieval
Database System
Refers to an organization of components that define and regulate the collection, storage and
management of data. From a general management point of view, the DB system is composed of:
- Hardware
- Software
- People
 System administrators: oversee the database system’s operations
 DB administrators: manage the DBMS and ensure the DB is functioning
properly
 DB designers
 System analysts and programmers design and implement the application
programs
 End user
- Procedures
- Data

Database management system (DBMS)


A database management system (DBMS) is a collection of programs that manages the database
structure and controls access to the data stored in the database. In a sense, a database resembles
a very well-organized electronic filing cabinet in which powerful software (the DBMS) helps
manage the cabinet’s contents.
Advantages of a DBMS
Data independence: This is the technique that allows data to be changed without affecting the
applications that process it. We can change the way the database is physically stored and
accessed without having to make corresponding changes to the way the database is perceived
by the user. Changing the way the database is physically stored and accessed is almost always
to improve performance; and the fact that we can make such changes without having to change
the way the database looks to the user means that existing application programs, queries, and
the like can all still work after the change. Application programs should be as independent as
possible from details of data representation and storage. The DBMS can provide an abstract
view of the data to insulate application code from such details.
Efficient data access: A DBMS deploys sophisticated techniques to store and retrieve data
efficiently.
Data integrity control: The DBMS can enforce integrity constraints on the data. For example,
before inserting salary information for an employee, the DBMS can check that the department
budget is not exceeded. Also, updating the status for supplier S1 to 200 will be rejected if status
values are supposed never to exceed 100.
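As an illustration, the supplier rule above could be declared with a CHECK constraint. This is
only a sketch: the table and column names are assumed for the example, and some DBMSs
(older MySQL versions, for instance) parse but do not enforce CHECK constraints.

CREATE TABLE supplier (
supplier_no CHAR(2) PRIMARY KEY,
status INT CHECK (status <= 100) -- an UPDATE setting status to 200 is rejected
);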
Security Control: The DBMS can enforce access controls that govern what data is visible to
different classes of users. Users may perform only those operations that they are authorized
to carry out on the data.
Concurrent access and crash recovery: A DBMS schedules concurrent accesses to the data
in such a manner that users can think of the data as being accessed by only one user at a time.
Further, the DBMS protects users from the effects of system failures.
Reduced application development time: Clearly, the DBMS supports many important
functions that are common to many applications accessing data stored in the DBMS. This, in
conjunction with the high-level interface to the data, facilitates quick development of
applications. Such applications are also likely to be more robust than applications developed
from scratch because many important tasks are handled by the DBMS instead of being
implemented by the application.
Data Dictionary Management: Data Dictionary is where the DBMS stores definitions of the
data elements and their relationships (metadata). The DBMS uses this function to look up the
required data component structures and relationships.
Data Storage Management: This particular function is used for the storage of data and any
related data entry forms or screen definitions, report definitions, data validation rules,
procedural code, and structures that can handle video and picture formats. Users do not need
to know how data is stored or manipulated. Also involved with this structure is a term called
performance tuning that relates to a database’s efficiency in relation to storage and access
speed.
Data Transformation and Presentation: This function exists to transform any data entered
into required data structures. By using the data transformation and presentation function the
DBMS can determine the difference between logical and physical data formats.
Multiuser Access Control: Data integrity and data consistency are the basis of this function.
Multiuser access control is a very useful tool in a DBMS, it enables multiple users to access
the database simultaneously without affecting the integrity of the database.
Backup and Recovery Management: Backup and recovery come to mind whenever there
are potential outside threats to a database. For example, if there is a power outage, recovery
management covers how long it takes to recover the database after the outage. Backup
management refers to data safety and integrity; for example, backing up all your mp3 files on a disk.
Data Integrity Management: The DBMS enforces integrity rules to reduce data redundancy,
which is when data is stored in more than one place unnecessarily, and to maximize data
consistency, making sure the database returns the correct, consistent answer each time the
same question is asked.
Database Access Languages and Application Programming Interfaces: A query language
is a nonprocedural language. An example of this is SQL (structured query language). SQL is
the most common query language supported by the majority of DBMS vendors. The use of this
language makes it easy for users to specify what they want done without the headache of
explaining how to specifically do it.
Database Communication Interfaces: This refers to how a DBMS can accept different end
user requests through different network environments. An example of this can be easily related
to the internet. A DBMS can provide access to the database using the Internet through Web
Browsers (Mozilla Firefox, Internet Explorer, and Netscape).
Transaction Management: This refers to how a DBMS must supply a method that will
guarantee that all the updates in a given transaction are made, or none of them are. All
transactions must follow what are called the ACID properties (atomicity, consistency,
isolation, durability).

DBMS ARCHITECTURE
The DBMS provides users with an abstract view of the data in it i.e. the system hides certain
details of how the data is stored and maintained from users. A DBMS can be viewed as divided
into levels of abstraction. A common architecture generally used is the ANSI/SPARC
(American National Standards Institute - Standards Planning and Requirements Committee)
model. The ANSI/SPARC model abstracts the DBMS into a 3-tier architecture as follows:
- External level
- Conceptual level
- Internal level
DBMS architecture

External level: The external level is the user’s view of the database and the one closest to the users.
It presents only the relevant part of the database to the user. E.g. a bank database stores a lot more
information but an account holder is only interested in his/her account details such as the
current account balance, transaction history etc. An external schema describes each external
view. The external schema consists of the definition of the logical records and the relationships
in the external view. In the external level, the different views may have different
representations of the same data.
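For instance, an external view for account holders could be defined over the bank’s base tables
with an SQL view; the table and column names below are assumptions for the sketch:

-- Expose only the columns an account holder needs, hiding internal data.
CREATE VIEW account_summary AS
SELECT account_no, balance
FROM account;

Different user groups can each be given their own view definition over the same conceptual
schema.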
Conceptual level: At this level of database abstraction, all the database entities and
relationships among them are included. Conceptual level provides the community view of the
database and describes what data is stored in the database and the relationships among the data.
In other words, the conceptual view represents the entire database of an organization. It is a
complete view of the data requirements of the organization that is independent of any storage
consideration. The conceptual schema defines conceptual view. It is also called the logical
schema. There is only one conceptual schema per database. The figure shows the conceptual
view record of a database.
Internal level or physical level: The lowest level of abstraction is the internal level. It is the
one closest to the physical storage device. This level is also termed the physical level, because it
describes how data are actually stored on the storage medium, such as hard disk, magnetic tape
etc. This level indicates how the data will be stored in the database and describes the data
structures, file structures and access methods to be used by the database. The internal schema
defines the internal level. The internal schema contains the definition of the stored record, the
methods of representing the data fields and the access methods used. The figure shows the
internal view record of a database.

Database Architecture
Database applications are usually partitioned into two or three parts as in Figure (a) below.
Two-tier architecture
In a two-tier architecture, the application resides at the client machine, where it invokes
database system functionality at the server machine through query language statements.
Application program interface standards like ODBC and JDBC are used for interaction
between the client and the server.
Three-tier architecture
In contrast, in a three-tier architecture, the client machine acts as merely a front end and does
not contain any direct database calls. Instead, the client end communicates with an application
server, usually through a forms interface. The application server in turn communicates with a
database system to access data. The business logic of the application, which says what actions
to carry out under what conditions, is embedded in the application server, instead of being
distributed across multiple clients. Three-tier applications are more appropriate for large
applications, and for applications that run on the World Wide Web.
Data independence
A database system normally contains a lot of data in addition to users’ data. For example, it
stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult
to modify or update a set of metadata once it is stored in the database. But as a DBMS expands,
it needs to change over time to satisfy the requirements of the users. If all of the data were
interdependent, modifying it would become a tedious and highly complex job.

Metadata itself follows a layered architecture, so that when we change data at one layer, it does
not affect the data at another level. This data is independent but mapped to each other.
Logical Data Independence
Logical data is data about the database; that is, it stores information about how data is managed
inside. For example, a table (relation) stored in the database and all the constraints applied on
that relation.
Logical data independence is a mechanism that keeps the logical schema independent of the
actual data stored on the disk. If we make some changes to the table format, it should not change
the data residing on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data
independence is the power to change the physical data without impacting the schema or logical
data.
For example, if we want to change or upgrade the storage system itself — suppose we
want to replace hard-disks with SSD — it should not have any impact on the logical data or
schemas.
INFORMATION STORAGE AND RETRIEVAL
An information storage and retrieval system is a network with a built-in user interface that
facilitates the creation, searching, and modification of stored data. Files on a computer can be
created, moved, modified, grown, shrunk and deleted. In most cases, computer programs that
are executed on the computer handle these operations, but the user of a computer can also
manipulate files if necessary. For instance, Microsoft Word files are normally created and
modified by the Microsoft Word program in response to user commands, but the user can also
move, rename, or delete these files directly by using a file manager program such as Windows
Explorer.
DBMS Languages
The workings of a DBMS are controlled by four different languages:
Data Definition Language (DDL): Used by the DBA and database designers to specify the
conceptual schema of a database. In many DBMSs, the DDL is also used to define internal and
external schemas (views). In some DBMSs, separate storage definition language (SDL) and
view definition language (VDL) are used to define internal and external schemas. SDL is
typically realized via DBMS commands provided to the DBA and database designers. Some
examples include:
- CREATE - to create objects in the database
- ALTER - alters the structure of the database
- DROP - delete objects from the database
- TRUNCATE - remove all records from a table, including all space allocated for the
records
- COMMENT - add comments to the data dictionary
- RENAME - rename an object
CREATE
Creates new databases, tables, and views in an RDBMS.
For example:
CREATE DATABASE tutorialspoint;
CREATE TABLE article (title VARCHAR(100));
CREATE VIEW for_students AS SELECT title FROM article;
DROP
Removes views, tables, and databases from an RDBMS.
For example:
DROP object_type object_name;
DROP DATABASE tutorialspoint;
DROP TABLE article;
DROP VIEW for_students;

ALTER
Modifies the database schema.
ALTER object_type object_name parameters;

For example:
ALTER TABLE article ADD subject VARCHAR(50);

This command adds an attribute to the relation article with the name subject of string type.
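The remaining DDL commands listed above can be sketched in the same style. The exact syntax
varies by DBMS; the forms below follow Oracle conventions and the object names are assumed:

TRUNCATE TABLE article; -- remove all rows and release their space
RENAME article TO paper; -- rename an object
COMMENT ON TABLE paper IS 'Published papers'; -- record a comment in the data dictionary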
Data Manipulation Language (DML): These statements manage data within schema
objects. They specify database retrievals and updates. DML commands (a data sublanguage)
can be embedded in a general-purpose programming language (host language), such as COBOL,
C, C++, or Java.
- A library of functions can also be provided to access the DBMS from a programming
language
- Alternatively, stand-alone DML commands can be applied directly (called a query
language).
Some examples in SQL include:
- SELECT - Retrieve data from a database
- INSERT - Insert data into a table
- UPDATE - Updates existing data within a table
- DELETE - deletes records from a table; the space for the records remains
- MERGE - UPSERT operation (insert or update)
- CALL - Call a PL/SQL or Java subprogram
- EXPLAIN PLAN - explain access path to data
- LOCK TABLE - control concurrency
SELECT/FROM/WHERE
 SELECT
This is one of the fundamental query command of SQL. It is similar to the projection operation
of relational algebra. It selects the attributes based on the condition described by WHERE
clause.
 FROM
This clause takes a relation name as an argument from which attributes are to be
selected/projected. If more than one relation name is given, this clause corresponds to a
Cartesian product.
 WHERE
This clause defines predicate or conditions, which must match in order to qualify the attributes
to be projected.
For example:
SELECT author_name
FROM book_author
WHERE age > 50;

This command will yield the names of authors from the relation book_author whose
age is greater than 50.
INSERT INTO/VALUES
This command is used for inserting values into the rows of a table (relation).
Syntax:
INSERT INTO table (column1 [, column2, column3 ... ]) VALUES (value1 [,
value2, value3 ... ])

Or
INSERT INTO table VALUES (value1, [value2, ... ])

For example:
INSERT INTO tutorialspoint (Author, Subject) VALUES ('anonymous',
'computers');
UPDATE/SET/WHERE
This command is used for updating or modifying the values of columns in a table (relation).
Syntax:
UPDATE table_name SET column_name = value [, column_name = value ...]
[WHERE condition]

For example:
UPDATE tutorialspoint SET Author='webmaster' WHERE Author='anonymous';

DELETE/FROM/WHERE
This command is used for removing one or more rows from a table (relation).
Syntax:
DELETE FROM table_name [WHERE condition];

For example:
DELETE FROM tutorialspoint
WHERE Author='unknown';
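Of the remaining DML commands listed earlier, MERGE (the UPSERT operation) deserves a
short sketch. Assuming a hypothetical staging table new_articles with the same columns, an
upsert into tutorialspoint might look like this; syntax varies by DBMS (MySQL instead offers
INSERT ... ON DUPLICATE KEY UPDATE):

MERGE INTO tutorialspoint t
USING new_articles n
ON (t.Author = n.Author)
WHEN MATCHED THEN
UPDATE SET t.Subject = n.Subject -- row exists: update it
WHEN NOT MATCHED THEN
INSERT (Author, Subject) VALUES (n.Author, n.Subject); -- row absent: insert it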

Data Control Language (DCL): Used for granting and revoking user access on a database
- To grant access to user – GRANT
- To revoke access from user – REVOKE
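For example, reusing the tutorialspoint table from above (the user name clerk is assumed for
illustration):

GRANT SELECT, INSERT ON tutorialspoint TO clerk; -- clerk may now read and insert rows
REVOKE INSERT ON tutorialspoint FROM clerk; -- clerk may now only read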
Transaction Control (TCL): Statements used to manage the changes made by DML
statements. It allows statements to be grouped together into logical transactions.
Some examples include:
- COMMIT - save work done
- SAVEPOINT - identify a point in a transaction to which you can later roll back
- ROLLBACK - restore the database to its state as of the last COMMIT
- SET TRANSACTION - change transaction options like the isolation level and which
rollback segment to use
In practice, the data definition language, data manipulation language and data control
language are not separate languages; rather, they are parts of a single database language
such as SQL.
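A minimal sketch of how these statements combine in one transaction, again reusing the
tutorialspoint table (values are illustrative):

UPDATE tutorialspoint SET Subject='databases' WHERE Author='webmaster';
SAVEPOINT s1; -- mark a point we can roll back to
UPDATE tutorialspoint SET Subject='networks' WHERE Author='webmaster';
ROLLBACK TO s1; -- undo only the second update
COMMIT; -- make the first update permanent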
Example
Write the SQL code that will create the table structure for a table named EMP_1. This table is
a subset of the EMPLOYEE table. The basic EMP_1 table structure is summarized in the
following table. EMP_NUM is the primary key and JOB_CODE is the FK to JOB.

CREATE TABLE EMP_1 (
EMP_NUM CHAR(6) NOT NULL,
EMP_LNAME VARCHAR(15),
EMP_FNAME VARCHAR(15),
EMP_INITIAL CHAR(1),
EMP_HIREDATE DATE,
JOB_CODE CHAR(3),
PRIMARY KEY (EMP_NUM),
FOREIGN KEY(JOB_CODE) REFERENCES JOB (JOB_CODE)
);
Having created the table structure in (a), write the SQL code to enter the first two rows for the
table EMP_1 below:
INSERT INTO EMP_1
(EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_HIREDATE,
JOB_CODE)
VALUES
("101", "News", "John", "G", "08-Nov-00", "502"),
("102", "Senior", "David", "H", "12-Jul-89", "500");

Assuming the data shown in the EMP_1 table have been entered, write the SQL code that will
list all attributes for a job code of 502.
SELECT * FROM EMP_1
WHERE JOB_CODE = '502';
Write the SQL code that will save the changes made to the EMP_1 table.
COMMIT WORK;
NB: WORK is optional.

DBMS DATA MODEL


A data model is a notation for describing data or information. The description generally consists
of three parts:
i. Structure of the data: The data structures used to implement data in the computer are
sometimes referred to, in discussions of database systems, as a physical data model.
ii. Operations on the data: In database data models, there is usually a limited set of
operations that can be performed, but the DBA can describe database operations at a very
high level, yet have the database management system implement the operations
efficiently.
iii. Constraints on the data. Database data models usually have a way to describe
limitations on what the data can be. These constraints can range from the simple (e.g.,
“a day of the week is an integer between 1 and 7” or “a movie has at most one title”) to
some very complex limitations.

STRUCTURE OF THE DATA


Traditionally, there are four DBMS data models. These four data models also represent the
historical development of the DBMS:
Hierarchical Database Model
This is the oldest DBMS data model. In this model, information is organized as a collection of
inverted trees of records. The record at the root of a tree has zero or more child records; the
child records, in turn, serve as parent records for their immediate descendants. This parent-
child relationship recursively continues down the tree. The records consist of fields, where each
field may contain simple data values (e.g. integer, real, text), or a pointer to a record. The
pointer graph is not allowed to contain cycles. Some combinations of fields may form the key
for a record relative to its parent. Only a few hierarchical DBMSs support null values or
variable-length fields.

Example of Hierarchical data model

Applications can navigate a hierarchical database by starting at a root and successively
navigating downward from parent to children until the desired record is found.
Searching down a hierarchical tree is very fast, since the storage layer for hierarchical databases
uses contiguous storage for hierarchical structures. All other types of queries require sequential
search techniques. A DDL for the hierarchical data model must allow the definition of record
types, field types, pointers, and parent-child relationships. And the DML must support direct
navigation using the parent-child relationships and through pointers.
Limitations
- The hierarchical model only permits one-to-many relationships. The concept of logical
relationships is often used to circumvent this limitation. Logical relationships
superimpose another set of connections between data items separate from the physical
tree structure. This, of course, increases complexity.
- Often a natural hierarchy does not exist and it is awkward to impose a parent-child
relationship. Pointers partially compensate for this weakness, but it is still difficult to
specify suitable hierarchical schemas for large models. This means programs have
to navigate very close to the physical data structure level, implying that the hierarchical
data model offers only very limited data independence.
- Lack of ad hoc query capability placed a burden on programmers to generate code for
reports.
Network model:
It represents complex data relationships more effectively than the hierarchical model. The
major improvement is that the one-to-many limitation was removed; the model still views data
in a hierarchical one-to-many structure, but now a record may have more than one parent.
Network data models represent data in a symmetric manner, unlike the hierarchical data model
(with its distinction between a parent and a child). Information is organized as a collection of
graphs of records that are related with pointers. This is more flexible than a hierarchical data
model and still permits efficient navigation.

Example of network data model

The records consist of lists of fields (fixed or variable length with maximum length), where
each field contains a simple value (fixed or variable size).
The network data model also introduces the notion of indexes of fields and records, sets of
pointers, and physical placement of records.
A DDL for network data models must allow the definition of record types, field types, pointers
and indexes. And the DML must allow navigation through the graphs via the pointers and
indexes.
Programs also navigate close to the physical storage structures, implying that the network
data model only supports limited data independence; such programs are therefore difficult to
maintain as the data models evolve over time.
Concepts introduced under the network model include:
- Schema: conceptual design of the entire database, usually managed by the DBA
- Sub-schema: virtual view of portions of the database visible to application
programmers
- Data management language: enables definition of and access to the schema and sub-
schema. It consists of a DDL to construct the schema and a DML to develop programs.
Limitations
- Cumbersome
- Lack of ad hoc query capability placed burden on programmers to generate code for
reports
- Structural change in the database could produce havoc in all application programs

The relational database Model


Developed by E.F. Codd (IBM) in 1970, the relational data model has a mathematical
foundation in relational algebra. The model is based on first-order predicate logic and defines
a table as an n-ary relation. Data is organized in relations (two-dimensional tables). Each
relation contains a set of tuples (records). Each tuple contains a number of fields. A field may
contain a simple value (fixed or variable size) from some domain (e.g. integer, real, text, etc.).
Advantages of relational model
- Built-in multilevel integrity: Data integrity is built into the model at the field level to
ensure the accuracy of the data; at the table level to ensure that records are not
duplicated and to detect missing primary key values; at the relationship level to ensure
that the relationship between a pair of tables is valid; and at the business level to ensure
that the data is accurate in terms of the business itself. (Integrity is discussed in detail
as the design process unfolds).
- Logical and physical data independence from database applications: Neither
changes a user makes to the logical design of the database, nor changes a database
software vendor makes to the physical implementation of the database, will adversely
affect the applications built upon it.
- Guaranteed data consistency and accuracy: Data is consistent and accurate due to
the various levels of integrity you can impose within the database. (This will become
quite clear as you work through the design process.)
- Easy data retrieval: At the user’s command, data can be retrieved either from a
particular table or from any number of related tables within the database. This enables
a user to view information in an almost unlimited number of ways.
One commonly perceived disadvantage of the relational database was that software programs
based on it ran very slowly.

SOME DEFINITIONS
RELATION
A relation, as defined by E. F. Codd, is a set of tuples (d1, d2, ..., dn), where each element dj
is a member of Dj, a data domain, for each j=1, 2, ..., n. A data domain is simply a data type. It
specifies a data abstraction: the possible values for the data and the operations available on the
data. For example, a String can have zero or more characters in it, and has operations for
comparing strings, concatenating strings, and creating strings. A relation is a truth predicate. It
defines what attributes are involved in the predicate and what the meaning of the predicate is.
In the relational data model, relations are represented in table format. This format stores the
relation among entities. A table has rows and columns, where rows represent records and
columns represent the attributes. E.g.

TUPLE
A single row of a table, which contains a single record for that relation is called a tuple. A tuple
has attribute values which match the required attributes in the relation. The ordering of attribute
values is immaterial. Every tuple in the body of a given relation is required to conform to the
heading (attribute) of that relation, i.e. it contains exactly one value, of the applicable type, for
each attribute.
ATTRIBUTE
The columns of a relation are named by attributes. Attributes appear at the tops of the columns.
Usually, an attribute describes the meaning of entries in the column below. For instance, the
column with attribute length holds the length, in minutes, of each movie.
ATTRIBUTE DOMAIN
Every attribute has some predefined value scope, known as attribute domain.
ATTRIBUTE VALUE/INSTANCE
An attribute value is the value for an attribute in a particular tuple. An attribute value must
come from the domain that the attribute specifies. Most relational DBMS allows NULL
attribute values. Each attribute value in a relational model must be atomic i.e. it must be of
some elementary type such as integer or string. It is not permitted for a value to be a record
structure, set, list, array, or any other type that reasonably can have its values broken into
smaller components.
SCHEMAS
The name of a relation and the set of attributes for a relation is called the schema for that
relation. The schema is depicted by the relation name followed by a parenthesized list of its
attributes. Thus, the schema for relation Movies above is
Movies (title, year, length, genre)
In the relational model, a database consists of one or more relations. The set of schemas for the
relations of a database is called a relational database schema, or just a database schema.
Data Types
All attributes must have a data type. The following are the primitive data types that are
supported by SQL (Structured Query Language) systems.
i. Character strings of fixed or varying length. The type CHAR(n) denotes a fixed-
length string of up to n characters. VARCHAR(n) also denotes a string of up to n
characters. The difference is implementation-dependent; typically CHAR implies that
short strings are padded to make n characters, while VARCHAR implies that an
endmarker or string-length is used. Normally, a string is padded by trailing blanks if it
becomes the value of a component that is a fixed-length string of greater length. For
example, the string 'foo', if it became the value of a component for an attribute of type
CHAR(5), would assume the value 'foo  ' (with two blanks following the second o).
ii. Bit strings of fixed or varying length. These strings are analogous to fixed and
varying-length character strings, but their values are strings of bits rather than
characters. The type BIT(n) denotes bit strings of length n, while BIT VARYING(n)
denotes bit strings of length up to n.
iii. BOOLEAN. The type BOOLEAN denotes an attribute whose value is logical. The
possible values of such an attribute are TRUE, FALSE.
iv. INT or INTEGER. The type INT or INTEGER (these names are synonyms) denotes
typical integer values. The type SHORTINT also denotes integers, but the number of
bits permitted may be less, depending on the implementation (as with the types int and
short int in C).
v. FLOAT or REAL. Floating-point numbers can be represented in a variety of ways.
We may use the type FLOAT or REAL (these are synonyms) for typical floating point
numbers. A higher precision can be obtained with the type DOUBLE PRECISION. We
can also specify real numbers with a fixed decimal point. For example, DECIMAL(n,d)
allows values that consist of n decimal digits, with the decimal point assumed to be d
positions from the right. Thus, 0123.45 is a possible value of type DECIMAL(6,2).
NUMERIC is almost a synonym for DECIMAL, although there are possible
implementation-dependent differences.
vi. DATE and TIME. Dates and times can be represented by the data types DATE and
TIME, respectively. These values are essentially character strings of a special form. We
may, in fact, coerce dates and times to string types, and we may do the reverse if the
string “makes sense” as a date or time.

A relational database Schema is depicted by stating both the attributes and their datatype:
Movies (
title CHAR(100),
year INT,
length INT,
genre CHAR(10),
studioName CHAR(30),
producer INT
)
Relation instance: A finite set of tuples in the relational database system represents relation
instance. Relation instances do not have duplicate tuples.
{
<Person SSN# = "123-45-6789" Name = "Art Larsson" City = "San Francisco">,
<Person SSN# = "231-45-6789" Name = "Lino Buchanan" City = "Philadelphia">,
<Person SSN# = "321-45-6789" Name = "Diego Jablonski" City = "Chicago">
}
It is more common and concise to show a relation value as a table. All ordering within the table
is artificial and meaningless.
DESIGN THEORY FOR RELATIONAL DATABASE
A common problem with schema design involves trying to combine too much into one relation,
thus leading to redundancy. Thus, improvements to relational schemas pay close attention to
eliminating redundancy. The theory of “dependencies” is a well-developed theory for relational
databases, providing guidelines on how to develop good schemas and eliminate flaws if any. The
first concept we need to consider is Functional Dependency (FD).
FUNCTIONAL DEPENDENCY: the term functional dependence can be defined most easily
this way:
Definition:
Let A and B be subsets of the attribute of a relation R. Then the functional dependency (FD)
A→B
holds in R if and only if, whenever two tuples of R have the same value for A, they also have
the same value for B. A and B are the determinant and the dependent, respectively, and the FD
overall can be read as “A functionally determines B” or “B is functionally dependent on A,” or
more simply just as
A→B
If A and B are composite, then we have
A1, A2, …, An → B1, B2, …, Bm
This is also equivalent to
A1, A2, …, An → B1, A1, A2, …, An → B2, ..., A1, A2, …, An → Bm
The attribute(s) B is functionally dependent on attribute(s) A, if A determines B. E.g.
STU_PHONE is functionally dependent on STU_NUM.

STU_NUM is not functionally dependent on STU_PHONE because the STU_PHONE value
2267 is associated with two STU_NUM values: 324274 and 324291. (This could happen when
roommates share a single landline phone number.)
The functional dependence definition can be generalized to cover the case in which the
determining attribute values occur more than once in a table.
Functional dependence can then be defined this way:
Attribute B is functionally dependent on A if all of the rows in the table that agree in value for
attribute A also agree in value for attribute B.

RELATION KEYS: The key’s role is based on a concept known as determination. i.e. the
statement “A determines B” indicates that if you know the value of attribute A, you can look
up (determine) the value of attribute B. E.g.: an invoice number identifies all of the invoice
attributes such as invoice date and the customer name.
If we know STU_NUM in a STUDENT table we can look up (determine) student’s last name,
grade point average, phone number, etc.

Table name: Student

The shorthand notation for “A determines B” is
A → B.
If A determines B, C, and D, we write
A → B, C, D.
For the student example we can write:
STU_NUM → STU_LNAME, STU_FNAME, STU_INIT, STU_DOB, STU_TRANSFER

In contrast, STU_NUM is not determined by STU_LNAME because it is quite possible for
several students to have the last name Smith.
Proper understanding of the principle of determination is vital to the understanding of a central
relational database concept known as functional dependence (FD).
Definitions
Key Attribute(s): We say a set of one or more attributes {A1, A2, ..., An} is a key for a relation
R if:
i. Those attributes functionally determine all other attributes of the relation. That is, it is
impossible for two distinct tuples of R to agree on all of A1, A2, ..., An (uniqueness).
ii. No proper subset of {A1, A2, ..., An} functionally determines all other attributes of R;
i.e., a key must be minimal.
When a key consists of a single attribute A, we often say that A (rather than {A}) is a key. An
attribute that is part of a key is called a key attribute.
Consider the Relation Movies below:

Attributes {title, year, starName} form a key for the relation Movies because it meets the two
conditions:
Condition 1:
Do they functionally determine all the other attributes? Yes
Condition 2:
Does any proper subset of {title, year, starName} functionally determine all other attributes?
{title, year} does not determine starName, thus {title, year} is not a key.
{year, starName} is not a key because we could have a star in two movies in the same year;
therefore {year, starName} → title is not an FD.
{title, starName} is not a key, because two movies with the same title, made in different years,
can have a star in common.
Therefore, no proper subset of {title, year, starName} functionally determines all other
attributes.
Super Key (literally, a superset of a key): An attribute or a combination of attributes that is
used to identify the records uniquely is known as a Super Key. It is to be noted that some
superkeys are not (minimal) keys. Note that every superkey satisfies the first condition of a
key: it functionally determines all other attributes of the relation. However, a superkey need
not satisfy the second condition: minimality. A table can have many Super Keys. E.g. of Super
Key
 ID
 ID, Name
 ID, Address
 ID, Department_ID
 ID, Salary
 Name, Address
Candidate Key: It can be defined as a minimal Super Key or irreducible Super Key. In other
words, an attribute or a combination of attributes that identifies the record uniquely, but none
of whose proper subsets can identify the records uniquely. E.g. of Candidate Key
 Code
 Name, Address
Primary Key: A Candidate Key that is used by the database designer for unique identification
of each row in a table is known as Primary Key. A Primary Key can consist of one or more
attributes of a table. E.g. of Primary Key - Database designer can use one of the Candidate Key
as a Primary Key.
In this case we have “Code” and “Name, Address” as Candidate Keys. The designer may prefer
“Code” as the Primary Key, as the other key is the combination of more than one attribute.
Null values should never be part of a primary key; they should also be avoided to the greatest
extent possible in other attributes too. A null is no value at all. It does not mean a zero or a
space. There are rare cases in which nulls cannot be reasonably avoided when you are working
with non-key attributes. For example, one of an EMPLOYEE table’s attributes is likely to be
the EMP_INITIAL. However, some employees do not have a middle initial. Therefore, some
of the EMP_INITIAL values may be null. Null can also exist because of the nature of the
relationship between two entities. Conventionally, the existence of nulls in a table is often an
indication of poor database design. Nulls, if used improperly, can create problems because they
have many different meanings. For example, a null can represent:
 An unknown attribute value.
 A known, but missing, attribute value.
 A “not applicable” condition.
Foreign Key: A foreign key is an attribute or combination of attributes in one base table that
points to the candidate key (generally it is the primary key) of another table. The purpose of
the foreign key is to ensure referential integrity of the data i.e. only values that are supposed to
appear in the database are permitted. E.g. consider two tables:
Employee (EmployeeID, EmployeeName, DOB, DOJ, SSN, DeptID, MgrID) and
DeptTbl (Dept_ID, Dept_Name, Manager_ID, Location_ID)

Dept_ID is the primary key in table DeptTbl. The DeptID attribute of table Employee
(the dependent or child table) can be defined as a Foreign Key, as it references the Dept_ID
attribute of the table DeptTbl (the referenced or parent table). A Foreign Key value must match
an existing value in the parent table or be NULL.
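A minimal sketch of how this parent-child pair could be declared; the column types are assumed
and some columns from the schemas above are omitted for brevity:

CREATE TABLE DeptTbl (
Dept_ID INT PRIMARY KEY,
Dept_Name VARCHAR(30)
);
CREATE TABLE Employee (
EmployeeID INT PRIMARY KEY,
EmployeeName VARCHAR(50),
DeptID INT,
-- each DeptID must match an existing Dept_ID in DeptTbl, or be NULL
FOREIGN KEY (DeptID) REFERENCES DeptTbl (Dept_ID)
);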
Composite Key: If we use multiple attributes to create a Primary Key then that Primary Key
is called Composite Key (also called a Compound Key or Concatenated Key).
Full functional dependency (FFD): If the attribute (B) is functionally dependent on a
composite key (A) but not on any subset of that composite key, the attribute (B) is fully
functionally dependent on (A).
Alternate Key: Alternate Key can be any of the Candidate Keys except for the Primary Key.
Secondary Key: The attributes that are not even a Super Key but can still be used for
identification of records (not uniquely) are known as Secondary Keys. E.g. of Secondary Key
can be Name, Address, Salary, Department_ID etc., as they can identify the records but they
might not be unique.
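In practice, lookups by a secondary key are commonly supported by a non-unique index; a
sketch over the Employee table above (the index name is illustrative):

CREATE INDEX emp_name_idx ON Employee (EmployeeName); -- speeds up lookups by name without requiring uniqueness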

An Example of relational db with primary key and foreign key


Exercise
Suppose R is a relation with attributes A1, A2, ..., An. As a function of n, tell how many
superkeys R has, if:
i. The only key is A1.
ii. The only keys are A1 and A2
iii. The only keys are {A1, A2} and {A3, A4}
iv. The only keys are {A1, A2} and {A1, A3}

Rules About Functional Dependencies


These rules guide us on how we can infer a functional dependency from other given FD’s.
Transitive rule
E.g., given that a relation R (A, B, C) satisfies the FD’s
A —> B and B —> C,
then we can deduce that R also satisfies the FD
A —> C.
Proof:
Consider two tuples of R that agree on A.
Let the tuples agreeing on attribute A be (a, b1, c1) and (a, b2, c2).
Since R satisfies A → B, and these tuples agree on A, they must also agree on B. That is, b1 =
b2.
The tuples are now (a, b, c1) and (a, b, c2), where b is both b1 and b2.
Similarly, since R satisfies B → C, and the tuples agree on B, they agree also on C. Thus, c1=
c2; i.e., the tuples do agree on C.
We have proved that any two tuples of R that agree on A also agree on C, and that is the FD
A → C.
This rule is called the transitive rule.
The Splitting/Combining Rule
The splitting/ combining rule is stated as follows:
Suppose we have two tuples that agree in A1, A2, ..., An. As a single FD, we would assert
“then the tuples must agree in all of B1, B2, ..., Bm.” As individual FD’s, we assert “then the
tuples agree in B1, and they agree in B2, and, ..., and they agree in Bm.”
Recall that the FD:
A1, A2, …, An → B1, B2, …, Bm
is equivalent to the set of FD’s:
A1, A2, …, An → B1, A1, A2, …, An → B2, ..., A1, A2, …, An → Bm
In other words, we may split attributes on the right side so that only one attribute appears on
the right of each FD. Likewise, we can replace a collection of FD’s having a common left side
by a single FD with the same left side and all the right sides combined into one set of attributes.
In either event, the new set of FD’s is equivalent to the old. The equivalence noted above can
be used in two ways.
 We can replace an FD
A1, A2, …, An → B1, B2, …, Bm by a set of FD’s
A1, A2, …, An → Bi for i = 1, 2, ..., m
We call this transformation the splitting rule.
 We can replace a set of FD’s
A1, A2, …, An → Bi for i = 1, 2, ..., m by the single FD
A1, A2, …, An → B1, B2, …, Bm.
We call this transformation the combining rule.
E.g. the set of FD’s:
title year → length
title year → genre
title year → studioName
is equivalent to the single FD:
title year → length, genre, studioName
Trivial-dependency rule.
Trivial Functional Dependencies: If a functional dependency (FD) α → β holds in Relation
R, then the term trivial is attached to the dependency if it is satisfied by all possible instances
r(R), i.e. α → β is trivial if and only if β ⊆ α.
In other words, when β is a subset of α, the FD is called a trivial FD.
e.g.
title, year → title
title → title
are both trivial FD
There is an intermediate situation in which some, but not all, of the attributes on the right side
of an FD are also on the left. This FD is not trivial.
Non-trivial: If an FD X → Y holds, where Y is not a subset of X, then it is called a non-trivial
FD.
Completely non-trivial: If an FD X → Y holds, where X ∩ Y = ∅, it is said to be a
completely non-trivial FD.

Computing the Closure of Attributes


Given a set a = {A1, A2, ..., An} of attributes of R and a set F of functional dependencies,
we need a way to find all of the attributes of R that are functionally determined by a. This set
of attributes is called the closure of a under F and is denoted a+.
Finding a+ is useful because:
 if a+ = R, then a is a superkey for R
 With closure we can find all FD’s easily
- To check if X → A:
- Compute X+
- Check if A ∈ X+
 if we find a+ for all a ⊆ R, we've computed F+ (except that we'd need to use
decomposition to get all of it).
Formal definition of closure:
Suppose a ={A1, A2, ..., An} is a set of attributes and S is a set of FD’s. The closure of a under
the FD’s in S is the set of attributes B such that every relation that satisfies all the FD’s in set
S also satisfies A1, A2, …, An → B. That is, A1, A2, …, An → B follows from the FD’s of S.
We denote the closure of a set of attributes A1, A2, …, An by
{A1, A2, ..., An}+.
Note that A1, A2, ..., An are always in {A1, A2, …, An}+ because the FD A1, A2, …, An →
Ai is trivial when i is one of 1,2,... , n.

An algorithm for computing a+:


result := a
repeat
    temp := result
    for each functional dependency β → γ in F do
        if β ⊆ result then
            result := result ∪ γ
until temp = result
Example:
Consider a relation with attributes A, B, C, D, E, and F. Suppose that this relation has the FD’s
AB → C, BC → AD, D → E, and CF → B.
What is the closure of {A, B}?
Solution
First, split BC → AD into BC → A and BC → D.
Result = {A, B}.
For AB → C:
AB ⊆ Result, so we have
Result = Result ∪ {C}, i.e. Result = {A, B, C}.
For BC → A and BC → D:
BC ⊆ Result, so we have
Result = Result ∪ {A} ∪ {D}, i.e. Result = {A, B, C, D}.
For D → E:
D ⊆ Result, so we have
Result = Result ∪ {E}, i.e. Result = {A, B, C, D, E}.
For CF → B:
{C, F} ⊄ Result, since F is not in Result, so no more changes to Result are possible.
Thus, {A, B}+ = {A, B, C, D, E}.
By computing the closure of any set of attributes, we can test whether any given FD
A1, A2, …, An → B follows from a set of FD’s S.
First compute {A1, A2, …, An}+ using the set of FD’s S. If B is in {A1, A2, …, An}+, then
A1, A2, …, An → B does follow from S, and if B is not in {A1, A2, …, An}+, then this FD
does not follow from S.
More generally, A1, A2, …, An → B1, B2, …, Bm follows from set of FD’s S if and only if
all of B1, B2, ..., Bm are in {A1, A2, …, An}+.
Example:
Consider the relation and FD’s in the example above, suppose we wish to test whether AB →
D follows from these FD’s. We compute {A, B}+, which is {A, B, C, D, E}. Since D is a
member of the closure, we conclude that AB → D does follow.
On the other hand, consider the FD
D → A. To test whether this FD follows from the given FD’s, first compute {D}+.
{D}+ = {D, E}. Since A is not a member of {D, E}, we conclude that D → A does not follow.

Armstrong's Axioms
If F is a set of functional dependencies then the closure of F, denoted as F+, is the set of all
functional dependencies logically implied by F. Armstrong's Axioms are a set of rules that,
when applied repeatedly, generate the closure of a set of functional dependencies.

 Reflexivity / reflexive rule: If {B1, B2, ..., Bm} ⊆{A1, A2, ..., An}, then A1, A2, …,
An → B1, B2, …, Bm. These are what we have called trivial FD’s.
 Augmentation rule: If A1A2 … An → B1B2 … Bm, then A1A2 … AnC1C2 … Ck
→ B1B2, … BmC1C2 … Ck for any set of attributes C1, C2, ..., Ck
Since some of the C ’s may also be A’s or B’s or both, we should eliminate from the
left side duplicate attributes and do the same for the right side.
 Transitivity rule: If A1, A2, …, An → B1, B2, …, Bm and B1, B2, …, Bm → C1,
C2,…, Ck hold in relation R, then A1, A2, …, An → C1, C2, …, Ck also holds in R.
 Additional rules:
- Union: If X → Y and X → Z, then X → YZ
- Pseudotransitivity: If X → Y and WY → Z, then WX → Z
- Composition: If X → Y and Z → W, then XZ → YW
Transitive dependence: an attribute Y is said to be transitively dependent on attribute X if Y
is functionally dependent on another attribute Z which is functionally dependent on X.
Exercise
Consider a relation with schema R(A, B, C, D) and FD's AB → C, C → D and D → A.
i. What are all the nontrivial FD’s that follow from the given FD’s? You should restrict
yourself to FD’s with single attributes on the right side.
ii. What are all the keys of R?
iii. What are all the superkeys for R that are not keys?

THE RELATIONAL ALGEBRA


The relational algebra is a procedural query language. It consists of a set of operations that take
one or two relations as input and produce a new relation as their result. The fundamental
operations in the relational algebra are
i. Select
ii. Project
iii. Union
iv. set difference
v. Cartesian product, and
vi. Rename.
In addition to the fundamental operations, there are several other operations—namely,
 set intersection
 natural join, and
 assignment.
We shall define these operations in terms of the fundamental operations.
Fundamental Operations
The select, project, and rename operations are called unary operations, because they operate on
one relation. The other three operations operate on pairs of relations and are, therefore, called
binary operations.
The Select Operation
The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek
letter sigma (𝜎) to denote selection. The predicate appears as a subscript to 𝜎. The argument
relation is in parentheses after the 𝜎.

The instructor relation.


Thus, to select those tuples of the instructor relation where the instructor is in the “Physics”
department, we write:
𝜎𝑑𝑒𝑝𝑡_𝑛𝑎𝑚𝑒=“Physics” (instructor)

Result of 𝜎𝑑𝑒𝑝𝑡_𝑛𝑎𝑚𝑒=“Physics” (instructor)

We can find all instructors with salary greater than $90,000 by writing:
𝜎salary>90000 (instructor)

Furthermore, we can combine several predicates into a larger predicate by using the
connectives and (∧), or (∨), and not (¬).
Thus, to find the instructors in Physics with a salary greater than $90,000, we write
𝜎dept name = “Physics” ∧ salary>90000 (instructor)

To find all departments whose name is the same as their building name, we can write:
𝜎dept name = building (department)

The Project Operation


Suppose we want to list all instructors’ ID, name, and salary, but do not care about the dept
name. The project operation allows us to produce this relation. The project operation is a unary
operation that returns its argument relation, with certain attributes left out. Since a relation is a
set, any duplicate rows are eliminated. Projection is denoted by the uppercase Greek letter pi
(∏). We list those attributes that we wish to appear in the result as a subscript to ∏. The
argument relation follows in parentheses. We write the query to produce such a list as:
∏ID,name,salary (instructor)

Result of ∏ID,name,salary (instructor).


Composition of Relational Operations
The fact that the result of a relational operation is itself a relation is important. Consider the
more complicated query “Find the name of all instructors in the Physics department.” We write
∏name (𝜎dept_name = "Physics" (instructor))
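For comparison, the same composed query can be written in SQL over the instructor table of
the figures. DISTINCT is used to mirror the set semantics of ∏, which eliminates duplicates,
whereas a plain SELECT would not:

SELECT DISTINCT name
FROM instructor
WHERE dept_name = 'Physics';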

The Union Operation


Consider a query to find the set of all courses taught in the Fall 2009 semester, the Spring 2010
semester, or both. The information is contained in the section relation below.

The section relation.

To find the set of all courses taught in the Fall 2009 semester, we write:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section))

To find the set of all courses taught in the Spring 2010 semester, we write:
∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))

To answer the query, we need the union of these two sets; that is, we need all section IDs that
appear in either or both of the two relations. We find these data by the binary operation union,
denoted, as in set theory, by ∪. So the expression needed is:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section)) ⋃
∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))

The result relation for this query appears in Figure below


Courses offered in either Fall 2009, Spring 2010 or both semesters.
Observe that, in our example, we took the union of two sets, both of which consisted of course
id values. In general, we must ensure that unions are taken between compatible relations. For
example, it would not make sense to take the union of the instructor relation and the student
relation. Although both relations have four attributes, they differ on the salary and tot cred
domains. The union of these two attributes would not make sense in most situations. Therefore,
for a union operation r ∪ s to be valid, we require that two conditions hold:
1. The relations r and s must be of the same arity. That is, they must have the same number
of attributes.
2. The domains of the ith attribute of r and the ith attribute of s must be the same, for all
i.
Note that r and s can be either database relations or temporary relations that are the result of
relational-algebra expressions.
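The SQL counterpart of this relational-algebra union, assuming the section table of the figures,
is:

SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
UNION
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;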

The Set-Difference Operation


The set-difference operation, denoted by −, allows us to find tuples that are in one relation but
are not in another. The expression r − s produces a relation containing those tuples in r but not
in s.
We can find all the courses taught in the Fall 2009 semester but not in Spring 2010 semester
by writing:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section)) −
∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))

The result relation for this query appears in Figure below

Courses offered in the Fall 2009 semester but not in Spring 2010 semester.
As with the union operation, we must ensure that set differences are taken between compatible
relations. Therefore, for a set-difference operation r − s to be valid, we require that the relations
r and s be of the same arity, and that the domains of the ith attribute of r and the ith attribute of
s be the same, for all i.
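In SQL, the same set difference can be expressed with EXCEPT (Oracle spells it MINUS, and
support varies by DBMS), again assuming the section table of the figures:

SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
EXCEPT
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;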

The Cartesian-Product Operation


The Cartesian-product operation, denoted by a cross (×), allows us to combine information
from any two relations. We write the Cartesian product of relations r1 and r2 as r1 × r2.
Recall that a relation is by definition a subset of a Cartesian product of a set of domains. From
that definition, we should already have an intuition about the definition of the Cartesian-
product operation. However, since the same attribute name may appear in both r1 and r2, we
need to devise a naming schema to distinguish between these attributes. We do so here by
attaching to an attribute the name of the relation from which the attribute originally came. For
example, the relation schema for r = instructor × teaches is:
(instructor.ID, instructor.name, instructor.dept_name, instructor.salary, teaches.ID,
teaches.course_id, teaches.sec_id, teaches.semester, teaches.year)
With this schema, we can distinguish instructor.ID from teaches.ID. For those attributes that
appear in only one of the two schemas, we shall usually drop the relation-name prefix. This
simplification does not lead to any ambiguity. We can then write the relation schema for r as:
(instructor.ID, name, dept_name, salary, teaches.ID, course_id, sec_id, semester, year)

This naming convention requires that the relations that are the arguments of the Cartesian-
product operation have distinct names. This requirement causes problems in some cases, such
as when the Cartesian product of a relation with itself is desired. A similar problem arises if
we use the result of a relational-algebra expression in a Cartesian product, since we shall need
a name for the relation so that we can refer to the relation’s attributes.
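In SQL, the Cartesian product can be sketched with CROSS JOIN (or simply by listing both relations in the FROM clause); the relation-name prefixes discussed above are used to qualify ambiguous attribute names:
SELECT instructor.ID, teaches.ID, teaches.course_id
FROM instructor CROSS JOIN teaches;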

The teaches relation.


The Rename Operation
Unlike relations in the database, the results of relational-algebra expressions do not have a
name that we can use to refer to them. It is useful to be able to give them names; the rename
operator, denoted by the lowercase Greek letter rho (𝜌), lets us do this. Given a relational-algebra expression E, the expression
𝜌𝑥 (E)
returns the result of expression E under the name x.
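SQL offers the same facility through the AS keyword, which attaches a name (alias) to a table reference. The classic use, as noted above for the Cartesian product, is comparing a relation with itself. The sketch below, assuming the instructor table, finds instructors who earn more than at least one instructor in the Biology department:
SELECT DISTINCT T.name
FROM instructor AS T, instructor AS S
WHERE T.salary > S.salary AND S.dept_name = 'Biology';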
The Set-Intersection Operation
The first additional relational-algebra operation that we shall define is set
intersection (∩). Suppose that we wish to find the set of all courses taught in both the Fall 2009
and the Spring 2010 semesters. Using set intersection, we can write
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section))
⋂ ∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))

The result relation for this query appears in Figure below

Courses offered in both the Fall 2009 and Spring 2010 semesters.
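In SQL, this is the INTERSECT operator (widely supported, though some systems, such as older MySQL versions, lack it). A sketch:
SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
INTERSECT
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;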

The Natural-Join Operation


The natural join is a binary operation that allows us to combine certain selections and a
Cartesian product into one operation. It is denoted by the join symbol ⋈. The natural-join
operation forms a Cartesian product of its two arguments, performs a selection forcing equality
on those attributes that appear in both relation schemas, and finally removes duplicate
attributes. Returning to the example of the relations instructor and teaches, computing
instructor natural join teaches considers only those pairs of tuples where both the tuple from
instructor and the tuple from teaches have the same value on the common attribute ID.
For example: Find the names of all instructors together with the course id of all courses they
taught. We express this query by using the natural join as follows:
∏name,course_id (instructor ⋈ teaches)
Since the schemas for instructor and teaches have the attribute ID in common, the natural-join
operation considers only pairs of tuples that have the same value on ID. It combines each such
pair of tuples into a single tuple on the union of the two schemas; that is, (ID, name, dept_name,
salary, course_id). After performing the projection, we obtain the relation in Figure below
Result of ∏name,course_id (instructor ⋈ teaches).

We are now ready for a formal definition of the natural join. Consider two relations r(R) and
s(S). The natural join of r and s, denoted by 𝑟 ⋈ 𝑠, is a relation on schema R ∪ S formally
defined as follows:

𝑟 ⋈ 𝑠 = ∏𝑅 ⋃ 𝑆 (𝜎𝑟.𝐴1 =𝑠.𝐴1 ∧ 𝑟.𝐴2 =𝑠.𝐴2 ∧…∧ 𝑟.𝐴𝑛=𝑠.𝐴𝑛 (𝑟 × 𝑠))

where R ∩ S = {A1, A2,..., An}


Please note that if r(R) and s(S) are relations without any attributes in common, that is, R ∩ S
= ∅, then 𝑟 ⋈ 𝑠 = r × s.
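SQL supports this operation directly with the NATURAL JOIN keywords, so the instructor example above can be sketched as:
SELECT name, course_id
FROM instructor NATURAL JOIN teaches;
Only pairs of rows agreeing on the common attribute ID are combined, and ID appears once in the joined result.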
RELATIONSHIPS IN RELATIONAL DATABASE
Relationships are classified as: one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N
or M:M). In developing a good database design, we must focus on the following points:
 The 1:1 relationship should be rare in any relational database design.
 The 1:M relationship is the relational modeling ideal. Therefore, this relationship type
should be the norm in any relational database design.
 M:N relationships cannot be implemented as such in the relational model.
The 1:M Relationship

The 1:M relationship between PAINTER and PAINTING


The implemented 1:M relationship between PAINTER and PAINTING
The one-to-many (1:M) relationship is easily implemented in the relational model by putting
the primary key of the 1 side in the table of the many side as a foreign key.
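A minimal DDL sketch of this rule for the PAINTER and PAINTING example; the column names and types here are illustrative assumptions rather than the figures' exact attribute sets:
CREATE TABLE PAINTER (
    PAINTER_NUM   INT PRIMARY KEY,    -- PK of the "1" side
    PAINTER_LNAME VARCHAR(25)
);
CREATE TABLE PAINTING (
    PAINTING_NUM   INT PRIMARY KEY,
    PAINTING_TITLE VARCHAR(50),
    PAINTER_NUM    INT,               -- FK carries the "1" side's PK into the "many" side
    FOREIGN KEY (PAINTER_NUM) REFERENCES PAINTER (PAINTER_NUM)
);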

The 1:M relationship between COURSE and CLASS

The implemented 1:M relationship between COURSE and CLASS


The 1:1 relationship:
As the 1:1 label implies, in this relationship, one entity can be related to only one other entity,
and vice versa. For example, one department chair— a professor—can chair only one
department, and one department can have only one department chair. The entities
PROFESSOR and DEPARTMENT thus exhibit a 1:1 relationship.

The 1:1 relationship between PROFESSOR and DEPARTMENT

If we examine the PROFESSOR and DEPARTMENT tables, we note some important
features:
 Each professor is a College employee; thus, the professor identification is through the
EMP_NUM. (However, note that not all employees are professors—there’s another
optional relationship).
 The 1:1 PROFESSOR chairs DEPARTMENT relationship is implemented by having
the EMP_NUM as foreign key in the DEPARTMENT table. Note that the 1:1
relationship is treated as a special case of the 1:M relationship in which the “many” side
is restricted to a single occurrence. In this case, DEPARTMENT contains the
EMP_NUM as a foreign key to indicate that it is the department that has a chair (see the sketch after this list).
 Also, note that the PROFESSOR table contains the DEPT_CODE foreign key to
implement the 1:M DEPARTMENT employs PROFESSOR relationship. This is a
good example of how two entities can participate in two (or even more) relationships
simultaneously. The preceding “PROFESSOR chairs DEPARTMENT” example
illustrates a proper 1:1 relationship. In fact, the use of a 1:1 relationship ensures that
two entity sets are not placed in the same table when they should not be.
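A DDL sketch of how this special case can be enforced, assuming a PROFESSOR table keyed by EMP_NUM already exists; declaring the foreign key UNIQUE restricts the "many" side to a single occurrence, yielding 1:1:
CREATE TABLE DEPARTMENT (
    DEPT_CODE VARCHAR(10) PRIMARY KEY,
    DEPT_NAME VARCHAR(30),
    EMP_NUM   INT UNIQUE,             -- UNIQUE limits each professor to chairing one department
    FOREIGN KEY (EMP_NUM) REFERENCES PROFESSOR (EMP_NUM)
);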
The M:N Relationship:
A many-to-many (M:N) relationship is not supported directly in the relational environment.
However, M:N relationships can be implemented by creating a new entity in 1:M relationships
with the original entities. To explore the many-to-many (M:N) relationship, consider a rather
typical college environment in which each STUDENT can take many CLASSes, and each
CLASS can contain many STUDENTs. The ER model for this M:N relationship is below:

The ERM’s M:N relationship between STUDENT and CLASS


Note the features of the ERM above:
 Each CLASS can have many STUDENTs, and each STUDENT can take many
CLASSes.
 There can be many rows in the CLASS table for any given row in the STUDENT table,
and there can be many rows in the STUDENT table for any given row in the CLASS
table.
To examine the M:N relationship more closely, imagine a small college with two students, each
of whom takes three classes. The table below shows the enrolment data for the two students.

Given the data relationship and the sample data in the table above, it can be wrongly assumed
that an M:N relationship can be implemented by simply adding a foreign key in the many side of
the relationship that points to the primary key of the related table. This is not correct.
 The tables will create many redundancies. For example, note that the STU_NUM values
occur many times in the STUDENT table. In a real-world situation, additional student
attributes such as address, classification, major, and home phone would also be
contained in the STUDENT table, and each of those attribute values would be repeated
in each of the records shown here. Similarly, the CLASS table contains many
duplications: each student taking the class generates a CLASS record. The problem
would be even worse if the CLASS table included such attributes as credit hours and
course description.
 Given the structure and contents of the two tables, the relational operations become
very complex and are likely to lead to system efficiency errors and output errors.
The problems inherent in the many-to-many (M:N) relationship can easily be avoided by
creating a composite entity (also referred to as a bridge entity or an associative entity).
Because such a table is used to link the tables that were originally related in an M:N
relationship, the composite entity structure includes—as foreign keys—at least the primary
keys of the tables that are to be linked. The database designer can then define the composite
table’s primary key either by using the combination of those foreign keys or by creating a new
primary key. In the example above, we can create the composite ENROLL table to link CLASS and
STUDENT. In this example, the ENROLL table’s primary key is the combination of its foreign
keys CLASS_CODE and STU_NUM. But the designer could have decided to create a single-
attribute new primary key such as ENROLL_LINE, using a different line value to identify each
ENROLL table row uniquely. (Microsoft Access users might use the Autonumber data type to
generate such line values automatically).
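A DDL sketch of this composite entity, assuming STUDENT and CLASS tables keyed by STU_NUM and CLASS_CODE; ENROLL_GRADE is the extra attribute discussed below:
CREATE TABLE ENROLL (
    CLASS_CODE   VARCHAR(10),
    STU_NUM      INT,
    ENROLL_GRADE CHAR(1),                -- additional attribute carried by the bridge entity
    PRIMARY KEY (CLASS_CODE, STU_NUM),   -- composite PK built from the two FKs
    FOREIGN KEY (CLASS_CODE) REFERENCES CLASS (CLASS_CODE),
    FOREIGN KEY (STU_NUM) REFERENCES STUDENT (STU_NUM)
);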

Because the ENROLL table links two tables, STUDENT and CLASS, it is also called a linking
table. In other words, a linking table is the implementation of a composite entity.
The ENROLL table yields the required M:N to 1:M conversion. Observe that the composite
entity represented by the ENROLL table must contain at least the primary keys of the CLASS
and STUDENT tables (CLASS_CODE and STU_NUM, respectively) for which it serves as a
connector. Also note that the STUDENT and CLASS tables now contain only one row per
entity. The ENROLL table contains multiple occurrences of the foreign key values, but those
controlled redundancies are incapable of producing anomalies as long as referential integrity is
enforced.
Additional attributes may be assigned as needed. In this case, ENROLL_GRADE is selected
to satisfy a reporting requirement. Also note that the ENROLL table’s primary key consists of
the two attributes CLASS_CODE and STU_NUM because both the class code and the student
number are needed to define a particular student’s grade. Naturally, the conversion is reflected
in the ERM, too. The revised relationship is shown below:
Changing the M:N relationship to two 1:M relationships

Note that the composite entity named ENROLL represents the linking table between
STUDENT and CLASS. We can increase the amount of available information even as we
control the database’s redundancies. Below is the expanded ERM, including the 1:M
relationship between COURSE and CLASS. Note that the model is able to handle multiple
sections of a CLASS while controlling redundancies by making sure that all of the COURSE
data common to each CLASS are kept in the COURSE table.
Expanded entity relationship model

The relationship diagram that corresponds to the ERM shown above is as below:
CODD’S RELATIONAL DATABASE RULES
In 1985, Dr. E. F. Codd published a list of 12 rules to define a relational database system. The
reason Dr. Codd published the list was his concern that many vendors were marketing products
as “relational” even though those products did not meet minimum relational standards. Dr.
Codd’s list, serves as a frame of reference for what a truly relational database should be. Note
that even the dominant database vendors do not fully support all 12 rules.
Dr. Codd’s 12 Relational Database Rules
1. Information: All information in a relational database must be logically represented as column values in rows within tables.
2. Guaranteed Access: Every value in a table is guaranteed to be accessible through a combination of table name, primary key value, and column name.
3. Systematic Treatment of Nulls: Nulls must be represented and treated in a systematic way, independent of data type.
4. Dynamic Online Catalog Based on the Relational Model: The metadata must be stored and managed as ordinary data, that is, in tables within the database. Such data must be available to authorized users using the standard database relational language.
5. Comprehensive Data Sublanguage: The relational database may support many languages. However, it must support one well-defined, declarative language with support for data definition, view definition, and data manipulation.
6. View Updating: Any view that is theoretically updatable must be updatable through the system.
7. High-Level Insert, Update, and Delete: The database must support set-level inserts, updates, and deletes.
8. Physical Data Independence: Application programs and ad hoc facilities are logically unaffected when physical access methods or storage structures are changed.
9. Logical Data Independence: Application programs and ad hoc facilities are logically unaffected when changes are made to the table structures that preserve the original table values (such as changing the order of columns or inserting columns).
10. Integrity Independence: All relational integrity constraints must be definable in the relational language and stored in the system catalog, not at the application level.
11. Distribution Independence: The end users and application programs are unaware of and unaffected by the data location (distributed versus local databases).
12. Nonsubversion: If the system supports low-level access to the data, there must not be a way to bypass the integrity rules of the database.
Rule Zero: All preceding rules are based on the notion that in order for a database to be considered relational, it must use its relational facilities exclusively to manage the database.
THE ENTITY RELATIONSHIP MODEL (ERM)
Peter Chen first introduced the ER data model in 1976; it was the graphical representation of
entities and their relationships in a database structure that quickly became popular because it
complemented the relational data model concepts.
The relational data model and ERM combined to provide the foundation for tightly structured
database design. ER models are normally represented in an entity relationship diagram (ERD),
which uses graphical representations to model database components. The ERD represents the
conceptual database as viewed by the end user. ERDs depict the database’s main components:
entities, attributes, and relationships. Because an entity represents a real-world object, the
words entity and object are often used interchangeably. The notations used with ERDs are the
original Chen notation and the newer Crow’s Foot and UML notations. Some conceptual
database modeling concepts can be expressed only using the Chen notation. Because of its
implementation emphasis, the Crow’s Foot notation can represent only what could be
implemented. In summary:
 The Chen notation favors conceptual modeling.
 The Crow’s Foot notation favors a more implementation-oriented approach.
 The UML notation can be used for both conceptual and implementation modeling.
Components of the ER model:
Entity: An entity is anything about which data are to be collected and stored. An entity is
represented in the ERD by a rectangle, also known as an entity box. The name of the entity, a
noun, is written in the center of the rectangle. The entity name is generally written in capital
letters and is written in the singular form: PAINTER rather than PAINTERS, and EMPLOYEE
rather than EMPLOYEES. Usually, when applying the ERD to the relational model, an entity
is mapped to a relational table. Each row in the relational table is known as an entity instance
or entity occurrence in the ER model. Each entity is described by a set of attributes that
describes particular characteristics of the entity. For example, the entity EMPLOYEE will have
attributes such as a Social Security number, a last name, and a first name. A collection of like
entities is known as an entity set. The word entity in the ERM corresponds to a table—not to a
row—in the relational environment. The ERM refers to a table row as an entity instance or
entity occurrence.
Attributes: Attributes are characteristics of entities. For example, the STUDENT entity
includes, among many others, the attributes STU_LNAME, STU_FNAME, and
STU_INITIAL. In the original Chen notation, attributes are represented by ovals and are
connected to the entity rectangle with a line. Each oval contains the name of the attribute it
represents. In the Crow’s Foot notation, the attributes are written in the attribute box below the
entity rectangle. Because the Chen representation is rather space-consuming, software vendors
have adopted the Crow’s Foot attribute display.
Required and Optional Attributes: A required attribute is an attribute that must have a value;
in other words, it cannot be left empty. As shown below there are two boldfaced attributes in
the Crow’s Foot notation. This indicates that a data entry will be required. In this example,
STU_LNAME and STU_FNAME require data entries because of the assumption that all
students have a last name and a first name. But students might not have a middle name, and
perhaps they do not (yet) have a phone number and an e-mail address. Therefore, those
attributes are not presented in boldface in the entity box. An optional attribute is an attribute
that does not require a value; therefore, it can be left empty.
Attributes of the STUDENT entity: Chen and crow’s foot

Attribute domains: Attributes have a domain. A domain is the set of possible values for a
given attribute. For example, the domain for the grade point average (GPA) attribute is written
(0,4) because the lowest possible GPA value is 0 and the highest possible value is 4. The
domain for the gender attribute consists of only two possibilities: M or F (or some other
equivalent code). The domain for a company’s date of hire attribute consists of all dates that
fit in a range (for example, company startup date to current date). Attributes may share a
domain. For instance, a student address and a professor address share the same domain of all
possible addresses. In fact, the data dictionary may let a newly declared attribute inherit the
characteristics of an existing attribute if the same attribute name is used. For example, the
PROFESSOR and STUDENT entities may each have an attribute named ADDRESS and could
therefore share a domain.
Identifiers (Primary Keys): The ERM uses identifiers, that is, one or more attributes that
uniquely identify each entity instance. In the relational model, such identifiers are mapped to
primary keys (PKs) in tables. Identifiers are underlined in the ERD. Key attributes are also
underlined in a frequently used table structure shorthand notation using the format:
Table Name (Key_Attribute 1, Attribute 2, Attribute 3, . . . Attribute K)
Composite Identifiers: Ideally, an entity identifier is composed of only a single attribute.
However, it is possible to use a composite identifier, that is, a primary key composed of more
than one attribute. For example, the CLASS entity may use the composite identifier CRS_CODE + CLASS_SECTION instead of the single attribute CLASS_CODE. Either approach uniquely identifies each entity instance.
Composite and Simple Attributes: Attributes are classified as simple or composite. A
composite attribute, not to be confused with a composite key, is an attribute that can be further
subdivided to yield additional attributes. For example, the attribute ADDRESS can be
subdivided into street, city, state, and zip code. Similarly, the attribute PHONE_NUMBER can
be subdivided into area code and exchange number. A simple attribute is an attribute that cannot
be subdivided. For example, age, sex and marital status would be classified as simple attributes.
To facilitate detailed queries, it is wise to change composite attributes into a series of simple
attributes.
Single-Valued Attributes: A single-valued attribute is an attribute that can have only a single
value. For example, a person can have only one Social Security number, and a manufactured
part can have only one serial number. Keep in mind that a single-valued attribute is not
necessarily a simple attribute. For instance, a part’s serial number, such as SE-08-02-189935,
is single-valued, but it is a composite attribute because it can be subdivided into the region in
which the part was produced (SE), the plant within that region (08), the shift within the plant
(02), and the part number (189935).
Multivalued Attributes: Multivalued attributes are attributes that can have many values. For
instance, a person may have several college degrees, and a household may have several
different phones, each with its own number. Similarly, a car’s color may be subdivided into
many colors (that is, colors for the roof, body, and trim). In the Chen ERM, the multivalued
attributes are shown by a double line connecting the attribute to the entity. The Crow’s Foot
notation does not identify multivalued attributes.

A multivalued attribute in an entity

The ERD above contains all of the components introduced thus far. Note that CAR_VIN is the
primary key, and CAR_COLOR is a multivalued attribute of the CAR entity.

Implementing Multivalued Attributes


Although the conceptual model can handle M:N relationships and multivalued attributes, it is
poor practice to implement them in the RDBMS. In the relational table, each column/row
intersection represents a single data value. The designer must decide on one of two possible
courses of action to handle multivalued attributes.
1. Split the multivalued attribute to create several new attributes.
For example, the CAR entity’s attribute CAR_COLOR can be split to create the new attributes
CAR_TOPCOLOR, CAR_BODYCOLOR, and CAR_TRIMCOLOR, which are then assigned
to the CAR entity. Although this solution seems to work, its adoption can lead to major
structural problems in the table. For example, if additional color components—such as a logo
color—are added for some cars, the table structure must be modified to accommodate the new
color section. In that case, cars that do not have such color sections generate nulls for the
nonexisting components, or their color entries for those sections are entered as N/A to indicate
“not applicable.” Also consider the employee entity containing employee degrees and
certifications. If some employees have 10 degrees and certifications while most have fewer or
none, the number of degree/certification attributes would number 10, and most of those
attribute values would be null for most of the employees. In short, while solution 1 is
practicable, it is not an acceptable solution.
Splitting the multivalued attribute into new attributes
2. Create a new entity composed of the original multivalued attribute’s components.
This new entity allows the designer to define color for different sections of the car. (See Table
below).
Components of the Multivalued Attribute
SECTION COLOR
Top White
Body Blue
Trim Gold
Interior Blue

Another benefit we can derive from this approach is that we are now able to assign as many
colors as necessary without having to change the table structure.

A new entity set composed of a multivalued attribute’s components

Note that the ERM shown in Figure above reflects the components listed in previous table.
This is the preferred way to deal with multivalued attributes. Creating a new entity in a 1:M
relationship with the original entity yields several benefits: it’s a more flexible, expandable
solution, and it is compatible with the relational model.
Derived Attributes: A derived attribute is an attribute whose value is calculated (derived)
from other attributes. The derived attribute need not be physically stored within the database;
instead, it can be derived by using an algorithm. For example, an employee’s age, EMP_AGE,
may be found by computing the integer value of the difference between the current date and
the EMP_DOB. In Microsoft Access, we use: INT((DATE() - EMP_DOB)/365).
In Microsoft SQL Server, we use
SELECT DATEDIFF(YEAR, EMP_DOB, GETDATE()),
where DATEDIFF is a function that computes the difference between dates. The first parameter
indicates the measurement, in this case, years. In Oracle, we use SYSDATE instead of DATE().
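In MySQL, a comparable sketch uses TIMESTAMPDIFF; the EMPLOYEE table and its EMP_DOB column are assumed:
SELECT EMP_NUM,
       TIMESTAMPDIFF(YEAR, EMP_DOB, CURDATE()) AS EMP_AGE   -- derived, not stored
FROM EMPLOYEE;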
A derived attribute is indicated in the Chen notation by a dashed line connecting the attribute
and the entity. The Crow’s Foot notation does not have a method for distinguishing the derived
attribute from other attributes.

Depiction of a derived attribute

Derived attributes are sometimes referred to as computed attributes. A derived attribute computation can be as simple as adding two attribute values located on the same row, or it can
be the result of aggregating the sum of values located on many table rows (from the same table
or from a different table). The decision to store derived attributes in database tables depends
on the processing requirements and the constraints placed on a particular application. The
designer should be able to balance the design in accordance with such constraints. Table below
shows the advantages and disadvantages of storing (or not storing) derived attributes in the
database.
Advantages and disadvantages of storing (or not storing) derived attributes in the database.

Relationships: Relationships describe associations among data. Most relationships describe associations between two entities. The three types of relationships among data include:
 one-to-many (1:M)
 many-to-many (M:N)
 and one-to-one (1:1).
The ER model uses the term connectivity to label the relationship types. The name of the
relationship is usually an active or passive verb. For example, a PAINTER paints many
PAINTINGs; an EMPLOYEE learns many SKILLs; an EMPLOYEE manages a STORE.
Illustrated below are the different types of relationships using two ER notations: the original
Chen notation and the more current Crow’s Foot notation.

The left side of the ER diagram shows the Chen notation, based on Peter Chen’s landmark
paper. In this notation, the connectivities are written next to each entity box. Relationships are
represented by a diamond connected to the related entities through a relationship line. The
relationship name is written inside the diamond. The right side illustrates the Crow’s Foot
notation. The name “Crow’s Foot” is derived from the three-pronged symbol used to represent
the “many” side of the relationship. In the basic Crow’s Foot ERD represented above, the
connectivities are represented by symbols. For example, the “1” is represented by a short line
segment, and the “M” is represented by the three-pronged “crow’s foot.” The relationship name
is written above the relationship line. In Figure above, entities and relationships are shown in
a horizontal format, but they may also be oriented vertically. The entity location and the order
in which the entities are presented are immaterial; just remember to read a 1:M relationship
from the “1” side to the “M” side.
Connectivity and Cardinality
As stated above, the term connectivity is used to describe the relationship classification.
Cardinality expresses the minimum and maximum number of entity occurrences associated
with one occurrence of the related entity. In the ERD, cardinality is indicated by placing the
appropriate numbers beside the entities, using the format (x,y). The first value represents the
minimum number of associated entities, while the second value represents the maximum
number of associated entities. Many database designers who use Crow’s Foot modeling
notation do not depict the specific cardinalities on the ER diagram itself because the specific
limits described by the cardinalities cannot be implemented directly through the database
design. Correspondingly, some Crow’s Foot ER modeling tools do not print the numeric
cardinality range in the diagram; instead, you can add it as text if you want to have it shown.

Connectivity and Cardinality


Knowing the minimum and maximum number of entity occurrences is very useful at the
application software level. A college might want to ensure that a class is not taught unless it
has at least 10 students enrolled. Similarly, if the classroom can hold only 30 students, the
application software should use that cardinality to limit enrollment in the class. However, keep
in mind that the DBMS cannot handle the implementation of the cardinalities at the table
level—that capability is provided by the application software or by triggers.
Existence Dependence: An entity is said to be existence-dependent if it can exist in the
database only when it is associated with another related entity occurrence. In implementation
terms, an entity is existence-dependent if it has a mandatory foreign key—that is, a foreign key
attribute that cannot be null. For example, if an employee wants to claim one or more
dependents for tax-withholding purposes, the relationship “EMPLOYEE claims
DEPENDENT” would be appropriate. In that case, the DEPENDENT entity is clearly
existence-dependent on the EMPLOYEE entity because it is impossible for the dependent to
exist apart from the EMPLOYEE in the database. If an entity can exist apart from all of its
related entities (it is existence-independent), then that entity is referred to as a strong entity or
regular entity.
Relationship Strength: The concept of relationship strength is based on how the primary key
of a related entity is defined. To implement a relationship, the primary key of one entity appears as a foreign key in the related entity. For example, the 1:M relationship between
VENDOR and PRODUCT is implemented by using the VEND_CODE primary key in
VENDOR as a foreign key in PRODUCT. There are times when the foreign key also is a
primary key component in the related entity. Relationship strength decisions affect primary key
arrangement in database design.
Weak (Non-identifying) Relationships: A weak relationship, also known as a non-identifying
relationship, exists if the PK of the related entity does not contain a PK component of the parent
entity. By default, relationships are established by having the PK of the parent entity appear as
an FK on the related entity. For example, suppose that the COURSE and CLASS entities are
defined as:
COURSE(CRS_Code, Dept_Code, CRS_Description, CRS_Credit)
CLASS(Class_Code, CRS_Code, Class_Section, Class_Time, Room_Code, Prof_Num)

In this case, a weak relationship exists between COURSE and CLASS because the
CLASS_CODE is the CLASS entity’s PK, while the CRS_CODE in CLASS is only an FK. In
this example, the CLASS PK did not inherit the PK component from the COURSE entity.

Crow’s Foot notation depicts a weak relationship

Strong (Identifying) Relationships: A strong relationship, also known as an identifying relationship, exists when the PK of the related entity contains a PK component of the parent
entity. For example, the definitions of the COURSE and CLASS entities
COURSE(CRS_CODE,DEPT_CODE,CRS_DESCRIPTION,CRS_CREDIT)
CLASS(CRS_CODE,CLASS_SECTION,CLASS_TIME,ROOM_CODE,PROF_NUM)
indicate that a strong relationship exists between COURSE and CLASS, because the CLASS
entity’s composite PK is composed of CRS_CODE + CLASS_SECTION. (Note that the
CRS_CODE in CLASS is also the FK to the COURSE entity.) The Crow’s Foot notation
depicts the strong (identifying) relationship with a solid line between the entities. Whether the
relationship between COURSE and CLASS is strong or weak depends on how the CLASS
entity’s primary key is defined. Keep in mind that the order in which the tables are created and
loaded is very important. For example, in the “COURSE generates CLASS” relationship, the
COURSE table must be created before the CLASS table. After all, it would not be acceptable
to have the CLASS table’s foreign key reference a COURSE table that does not yet exist.

Crow’s Foot notation depicts a strong relationship
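A DDL sketch of the strong (identifying) version; the column types are illustrative assumptions. Note that COURSE is created first, as the text requires:
CREATE TABLE COURSE (
    CRS_CODE        VARCHAR(10) PRIMARY KEY,
    DEPT_CODE       VARCHAR(10),
    CRS_DESCRIPTION VARCHAR(50),
    CRS_CREDIT      INT
);
CREATE TABLE CLASS (
    CRS_CODE      VARCHAR(10),
    CLASS_SECTION VARCHAR(5),
    CLASS_TIME    VARCHAR(20),
    ROOM_CODE     VARCHAR(10),
    PROF_NUM      INT,
    PRIMARY KEY (CRS_CODE, CLASS_SECTION),   -- the PK contains the parent's PK
    FOREIGN KEY (CRS_CODE) REFERENCES COURSE (CRS_CODE)
);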


Weak Entities: A weak entity is one that meets two conditions:
 The entity is existence-dependent; that is, it cannot exist without the entity with which
it has a relationship.
 The entity has a primary key that is partially or totally derived from the parent entity in
the relationship.
For example, a company insurance policy insures an employee and his/her dependents. For
the purpose of describing an insurance policy, an EMPLOYEE might or might not have a
DEPENDENT, but the DEPENDENT must be associated with an EMPLOYEE. Moreover,
the DEPENDENT cannot exist without the EMPLOYEE; that is, a person cannot get
insurance coverage as a dependent unless he or she happens to be a dependent of an employee.
DEPENDENT is the weak entity in the relationship “EMPLOYEE has DEPENDENT.”

Note that the Chen notation above identifies the weak entity by using a double-walled entity
rectangle. The Crow’s Foot notation generated by Visio Professional uses the relationship
line and the PK/FK designation to indicate whether the related entity is weak. A strong
(identifying) relationship indicates that the related entity is weak. Such a relationship means
that both conditions for the weak entity definition have been met—the related entity is
existence-dependent, and the PK of the related entity contains a PK component of the parent
entity. Remember that the weak entity inherits part of its primary key from its strong
counterpart. For example, at least part of the DEPENDENT entity’s key shown in Figure
above was inherited from the EMPLOYEE entity:
EMPLOYEE (EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_DOB,
EMP_HIREDATE)
DEPENDENT (EMP_NUM, DEP_NUM, DEP_FNAME, DEP_DOB)

Crowfoot symbols

Relationship Degree: A relationship degree indicates the number of entities or participants associated with a relationship. A unary relationship exists when an association is
maintained within a single entity. A binary relationship exists when two entities are
associated. A ternary relationship exists when three entities are associated. Although higher
degrees exist, they are rare and are not specifically named. (For example, an association of
four entities is described simply as a four-degree relationship.)

FILE SYSTEM
In computing, a file system (often also written as filesystem) is a method for storing and
organizing computer files and the data they contain to make it easy to find and access them.
File systems may use a data storage device such as a hard disk or CD-ROM and involve
maintaining the physical location of the files. More formally, a file system is a special-purpose
database for the storage, hierarchical organization, manipulation, navigation, access, and
retrieval of data.
The most familiar file systems make use of an underlying data storage device that offers access
to an array of fixed-size blocks, sometimes called sectors, generally a power of 2 in size (512
bytes or 1, 2, or 4 KiB are most common). The file system software is responsible for
organizing these sectors into files and directories, and keeping track of which sectors belong to
which file and which are not being used. Most file systems address data in fixed-sized units
called "clusters" or "blocks" which contain a certain number of disk sectors (usually 1-
64). This is the smallest logical amount of disk space that can be allocated to hold a file.
File names
Whether the file system has an underlying storage device or not, file systems typically have
directories which associate file names with files, usually by connecting the file name to an
index in a file allocation table of some sort. Directory structures may be flat, or allow
hierarchies where directories may contain subdirectories. In some file systems, file names are
structured, with special syntax for filename extensions and version numbers. In others, file
names are simple strings, and per-file metadata is stored elsewhere.
File organization method
There are a large number of ways data can be organized on disk or tape. The main methods of
file organization used for files are:
• Serial
• Sequential
• Indexed Sequential
• Random (or Direct)

Serial Organization
Serial files are stored in chronological order, that is, as each record is received it is stored in
the next available storage position. In general it is only used on a serial medium such as
magnetic tape. This type of file organization means that the records are in no particular order
and therefore to retrieve a single record, the whole file needs to be read from the beginning to
end. Serial organization is usually the method used for creating Transaction files (unsorted),
Work and Dump files.
Sequential Organization
Sequential files are serial files whose records are sorted and stored in ascending or descending order on a particular key field. The physical order of the records on the disk is not
necessarily sequential, as most manufacturers support an organization where certain records
(inserted after the file has been set up) are held in a logical sequence but are physically placed
into an overflow area. They are no longer physically contiguous with the preceding and
following logical records, but they can be retrieved in sequence.
Indexed Sequential Organization
Indexed Sequential file organization is logically the same as sequential organization, but an
index is built indicating the block containing the record with a given value for the Key field.
This method combines the advantages of a sequential file with the possibility of direct access
using the Primary Key (the primary Key is the field that is used to control the sequence of the
records). These days, manufacturers providing indexed sequential software allow for the
building of indexes using fields other than the primary Key. These additional fields on which
indexes are built are called Secondary Keys.
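In SQL terms, a secondary key is simply an index built on a non-primary field; a one-line sketch, assuming a hypothetical CUSTOMER table with a CUST_LNAME column:
CREATE INDEX IDX_CUST_LNAME ON CUSTOMER (CUST_LNAME);
The DBMS can then locate records by last name without reading the file in primary-key sequence.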
There are three major types of indexes used:
 Basic Index: This provides a location for each record (key) that exists in the system.
 Implicit Index: This type of index gives a location of all possible records (keys)
whether they exist or not.
 Limit Index: This index groups the records (keys) and only provides the location of
the highest key in the group. Generally they form a hierarchical index. Data records are
blocked before being written to disk. An index may consist of the highest key in each
block (or on each track).
In the above example, data records are shown as being 3 to a block. The index, then, holds the
key of the highest record in each block. (An essential element of the index, which has been
omitted from the diagram for simplicity, is the physical address of the block of data records).
Should we wish to access record 5, whose key is A0038, we can quickly determine from the
index that the record is held in block 2, since this key is greater than the highest key in block
1, A0025, and less than the highest key in block 2, A0053. By way of the index we can go
directly to the record we wish to retrieve, hence the term "direct access".
Random (or Direct)
A randomly organized file contains records arranged physically without regard to the sequence
of the primary key. Records are loaded to disk by establishing a direct relationship between the
Key of the record and its address on the file, normally by use of a formula (or algorithm) that
converts the primary Key to a physical disk address. This relationship is also used for retrieval.
The use of a formula (or algorithm) is known as 'Key Transformation' and there are several
techniques that can be used:
 Division Taking Quotient
 Division Taking Remainder
 Truncation
 Folding
 Squaring
 Radix Conversion
These methods are often mixed to produce a unique address (or location) for each record (key).
The production of the same address for two different records is known as a synonym.
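The division-taking-remainder technique, for example, can be sketched in one line of SQL: the remainder of the key divided by the number of available addresses (an assumed 997-slot file here) gives the record's home address:
SELECT MOD(38652, 997) AS addr_1,   -- yields 766
       MOD(40646, 997) AS addr_2;   -- also 766: these two keys are synonyms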
Random files show a distinct advantage where:
 Hit Rate is low
 Data cannot be batched or sorted
 Fast response is required.
File access methods
The algorithm used for the storage and/or retrieval of records to/from a data file is referred to as the file access method. This term is used interchangeably (and loosely) for both the access method and the file organization. Access methods are of two kinds: those that determine the structural characteristics of the file on which they are used (file organization) and those that do not, as in secondary indexing. In the first case, the same algorithm is used for the retrieval of records as was used for their original physical placement/organization. File access methods include the following:
i. Serial access method: This is a method of accessing records in an unsorted file. The term refers to the ability to access each record in a file by a sequential searching process, i.e., from the first record to the last. One access must finish before the next begins, i.e., one after another. This access method is time-consuming.
ii. Sequential access method: A method of access to a data file where each record is
sequenced in some order. A file is said to be sequentially accessed if the sequence of
transaction presented to it matches a sequence in which records are organized.
iii. Indexed sequential access method: This is an access method for a sequentially
organized file whose records are indexed with their corresponding address. This access
method supports both sequential access and indexed access and the two types of access
that can be comfortably done in this method are sequential access and random access.
Note that this method is only possible with disk files.
a. Sequential access: The access of the index sequential file makes minimal use
of the index. The records can be accessed serially as they are ordered one after
another and from the first to the last record in the file.
b. Random access: The random access of the index sequential makes full use of
the index to access the records using the record address associated with the
index. The records are accessed without any defined sequence of sorting. The
method is used when the transaction is to be processed immediately without
waiting for sorting, and it is widely used in commercial data processing because it allows both sequential and random methods of access.
iv. Random access method: Random access is a type of memory access in which storage
location can be accessed in any order. This means that a file is said to be randomly
accessed if the sequence of transactions submit to it does not match any sequence in
which records may be organized.
Summary of file access methods
The information stored in a file can be accessed in a variety of methods:
 Sequential: in order, one record after another.
 Direct (random): in any order, skipping the previous records.
 Indexed or Keyed: in any order, but with particular value(s).
Other access methods, such as indexed, can be built on top of the above basic techniques.
Indexed sequential access method (ISAM) is built on random and sequential access.

File characteristics
Data file(s) should have at least one of the following behaviors or qualities:
 Hit rate: This is the frequency with which active records in the file are accessed. File
records that are accessed more often are said to have high hit rate, while those records
which are rarely accessed are said to have low hit rate.
 Volatility: The ease with which records are added to or deleted from a file is described as file volatility. This calls for careful handling of the data file to avoid sudden loss of its contents (records) due to its volatile nature.
 Growth: This refers to the increase in the size of the file. When a file is created, it grows as new records are added to the file; the file grows as its contents (records) increase.
 Size: The file size is expressed in terms of characters in the field of the record and the
number of records. When a file is created, the size is determined by the amount of data
stored in the file.
 Seek Time: Seek time is one of the three delays associated with reading or writing data on a computer's disk drive (the situation is similar for CD or DVD drives). The others are rotational delay and transfer time; their sum is the access time. In order to read or
write data in a particular place on the disk, the read/write head of the disk needs to be
physically moved to the correct place. This process is known as seeking and the time it
takes for the head to move to the right place is the seek time. Seek time for a given disk
varies depending on how far the head destination is from its origin at the time of each
read/write instruction.
 Rotational Delay: The rotational delay is the time required for the address area of the
disk to rotate into a position where it is accessible by the read/write head.
 Transfer Time: Transfer Time is the number of bits that are transferred per unit of time.
The unit therefore is in bit per second (bps).
CONCEPT OF BUFFER
A buffer is a temporary memory for data, normally used to accommodate the difference in the rate at which two devices can handle data during transfer. The buffer may be built into a peripheral device, such as a printer or disk drive, or may be part of the system's main memory. The buffering technique is used to compensate for the slow and possibly erratic rate at which a peripheral device produces or consumes data. If the device communicated directly with the program, the program would be constrained to run in synchronism with the device; buffering allows the program and the device to operate independently.
Consider a program sending output to a slow device: a memory area (the buffer) is set aside for communication. The program places data in the buffer at its own rate. Although the device may be slow, the program does not have to stop unless the buffer fills up; at the same time, the device runs at full speed unless the buffer is empty. The same buffering technique is equally applicable to input devices.
By providing a data buffer in the device or control unit, devices that would ordinarily require
the channel for a long period can be made to perform independently of the channel. For
instance, a card reader may physically require about 60 ms to read a card and transfer the data
to memory through the channel.
However, a buffered card reader always reads one card before it is needed and save the 80 bytes
in a buffer in the card reader or control unit. When the channel requests that a card be read, the
contents of the 80-byte buffer are transferred to memory at high speed (e.g., 100 µs) and
the channel is released. The device then proceeds to read the next card and buffer it
independently. Card readers, card punches and printers are frequently buffered.
Functions of Buffer
Buffer synchronizes the speed of data transfer among the systems/devices that are under the
control of the CPU. It enables individual devices (that is, input and output devices) to perform their functions independently. This is because the rate at which the CPU sends or receives data is very high compared to the rate at which the I/O devices receive or send data.
Buffer is equally used to accommodate the differences in the rate at which two devices can
handle data during data communication/transfer.
Information capture and representation
File Concepts
In this section, we shall deal with the concepts of file & their relationship.
Many operating systems consider a file to be a collection of bytes. At a higher level, where the
content of the file is being considered, these binary digits may represent integer values or text
characters, or anything else. It is up to the program using the file to understand the meaning
and internal layout of information in the file and present it to a user as more meaningful
information (like text, images, sounds).
At any instant in time, a file might have a size, normally expressed in bytes, that indicates how
much storage is associated with the file. In most modern operating systems the size can be any
whole number up to a system limit. However, the general definition of a file does not require
that its instant size has any real meaning, unless the data within the file happens to correspond
to data within a pool of persistent storage.
Information in a computer file can consist of smaller packets of information, often called
"records" or "lines" that are individually different but share some trait in common. For example,
a payroll file might contain information concerning all the employees in a company and their
payroll details; each record in the payroll file concerns just one employee, and all the records
have the common trait of being related to payroll. This is very similar to placing all payroll
information into a specific filing cabinet in an office that does not have a computer. A text file
may contain lines of text, corresponding to printed lines on a piece of paper. Alternatively, a
file may contain an arbitrary binary image (a BLOB) or it may contain an executable.
The way information is grouped into a file is entirely up to the person designing the file. This
has led to a plethora of more or less standardized file structures for all imaginable purposes,
from the simplest to the most complex. Most computer files are used by computer programs.
These programs create, modify and delete files for their own use on an as-needed basis. The
programmers who create the programs decide what files are needed, how they are to be used
and their names.
In some cases, computer programs manipulate files that are made visible to the computer user.
For example, in a word-processing program, the user manipulates document files that she
names herself. The content of the document file is arranged in a way that the word-processing
program understands, but the user chooses the name and location of the file, and she provides
the bulk of the information (such as words and text) that will be stored in the file.
Many applications pack all their data files into a single file, using internal markers to discern
the different types of information contained within.
Files on a computer can be created, moved, modified, grown, shrunk and deleted. In most cases,
computer programs that are executed on the computer handle these operations, but the user of
a computer can also manipulate files if necessary. For instance, Microsoft Word files are
normally created and modified by the Microsoft Word program in response to user commands,
but the user can also move, rename, or delete these files directly by using a file manager program
such as Windows Explorer.
The following are different data representation and their relationships:
i. Bit: A bit is simply defined as a binary digit, that is, 0 or 1. The bit is the smallest unit of storage.
A bit is a binary digit, taking a value of either 0 or 1. Binary digits are a basic unit of
information storage and communication in digital computing and digital information
theory. Information theory also often uses the natural digit, called either a nit or a nat.
Quantum computing also uses qubits, a single piece of information with a probability of
being true. The bit is also a unit of measurement, the information capacity of one binary
digit. It has the symbol bit, or b (see discussion below).
Binary digit
Claude E. Shannon first used the word bit in his 1948 paper A Mathematical Theory of
Communication. He attributed its origin to John W. Tukey, who had written a Bell Labs
memo on 9 January 1947 in which he contracted "binary digit" to simply "bit".
Interestingly, Vannevar Bush had written in 1936 of "bits of information" that could be
stored on the punch cards used in the mechanical computers of that time.
A bit of storage can be either on (1) or off (0). A single bit is a one or a zero, a true or a
false, a "flag" which is "on" or "off", or in general, the quantity of information required to
distinguish two mutually exclusive equally probable states from each other. Gregory
Bateson defined a bit as "a difference that makes a difference".
Representation
 Transmission
Bits can be implemented in many forms depending on context. For example, in digital
circuitry in most computing devices as well as flash memories, a bit is an electrical pulse
generated by the internal clock in the control unit or data register. For devices using positive
logic, a logical 1 (true value) is represented by up to 5 volts, while a logical 0 (false value)
is represented by 0 volt.
 Storage
Bits are manipulated in the volatile memory of a computer, and can further be kept in a
persistent manner on a magnetic storage device such as magnetic tape or disc, as well as on
optical discs.
 Unit
It is important to differentiate between the use of "bit" in referring to a discrete storage unit
and the use of "bit" in referring to a statistical unit of information. The bit, as a discrete
storage unit, can by definition store only 0 or 1. A statistical bit is the amount of information
that, on average, can be stored in a discrete bit. It is thus the amount of information carried
by a choice between two equally likely outcomes. One bit corresponds to about 0.693 nats
(ln(2)), or 0.301 hartleys (log10(2)).
ii. Nibble: A nibble is half a byte. Generally, it is 4 bits, e.g., 0110, 1110.
iii. Byte: This is the collection of 2 nibbles or 8 bits, e.g., 01010101. A byte is a fixed number of bits that can be treated as a unit by the computer. It represents a character. It can also be described as the smallest addressable unit of the computer memory. A byte originally differed in size depending on the context but is now almost always eight bits. Eight-bit bytes, also known as octets, can represent 256 values (2^8 values, 0–255). A four-bit quantity is known as a nibble, and can represent 16 values (2^4 values, 0–15). A rarely used term, crumb, can refer to a two-bit quantity, and can represent 4 values (2^2 values, 0–3).
iv. Word: A word is a slightly larger group of bits, but it has no standard size. It represents the size of one register in a computer's CPU.
v. Character: A character is the smallest unit of information. It includes letters, digits, and special characters such as +, -, =, *, %, &, etc.
vi. Field: A field is made up of characters, or simply defined as a combination of
characters. It’s an attribute that can be assigned values. For example, First name, Age,
Gender, Date of birth, etc.
vii. Record: A record is a collection of organized and related fields.
viii. File: A file is a collection of organized and related records.

File Processing
File is a body of stored data or information in an electronic format. Almost all information
stored on computers is in the form of files. Files reside on mass storage devices such as hard
drives, CD-ROMs, magnetic tape, and floppy disks. When the central processing unit (CPU)
of a computer needs data from a file, or needs to write data to a file, it temporarily stores the
file in its main memory, or Random Access Memory (RAM), while it works on the data.
Information in computers is usually classified into two different types of files: data files (those
containing data) and program files (those that manipulate data). Within each of these
categories, many different types of files exist that store various kinds of information.
Different computer operating systems have unique rules for the naming of files. Windows 95
(Win95) and disk operating systems (DOS), for instance, make use of an extension attached to
the end of each filename in order to indicate the type of file. Extensions begin with a period (.),
and then have one or more letters. An example of a file extension used in Win95 and DOS is
.bak, which indicates that the file is a backup file.
When saving a file, a user can give it any name within the rules of the operating system. In
addition, the name must be unique. Two files in the same directory may not have the same
name, but some operating systems allow the same name for one file to be placed in more than
one location. These additional names are called aliases.
Directory files contain information used to organize other files into a hierarchical structure. In
the Macintosh operating system, directory files are called folders. The topmost directory in any
file system is the root directory. A directory contained within another directory is called a
subdirectory. Directories containing one or more subdirectories are called parent directories.
Program files contain programs or commands that are executable by the computer.
Executable files have a .exe suffix at the end of their names and are often called EXE
(pronounced EX-ee) files. Text files contain characters represented by their ASCII (American
Standard Code for Information Interchange) codes. These files are often called ASCII
(pronounced ASK-ee) files. Files that contain words, sentences, and bodies of paragraphs are
frequently referred to as text files. The diagram below shows the root directory, sub director
and file.
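The Python sketch below builds a tiny hierarchy in a temporary directory (the names are hypothetical) and walks it, printing each directory with its subdirectories and files:

    import os, tempfile

    # Create a tiny hierarchy: root/parent/child, with one file inside.
    root = tempfile.mkdtemp()
    child = os.path.join(root, "parent", "child")
    os.makedirs(child)
    open(os.path.join(child, "readme.txt"), "w").close()

    # os.walk visits each directory and lists its subdirectories and files.
    for dirpath, dirnames, filenames in os.walk(root):
        print(dirpath, "subdirs:", dirnames, "files:", filenames)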
File Processing Activities
The file processing operations deal with the various activities that are performed on a file.
These operations are briefly described below; a combined sketch in code follows the list.
• File creation: The process of bringing a file into existence is called file creation.
• Searching: Searching is locating data in a file by reference to a special field of each
record, called the key. The key is a unique field used to identify a particular record in a
file; if a record is to be inserted into a file, it must be given a unique key value.
• Retrieving/reading: This involves reading existing data from a form of storage or input
medium.
• Writing: Writing is the act of recording data onto some form of storage.
• Deleting: This means removing a record or item of data from a storage medium such as a
disk or tape.
• File updating: This is the act of changing values in one or more records of a file without
changing the organization of the file, that is, keeping the file current by adding the most
recent data to it.
• Sorting: Sorting means rearranging data in either ascending or descending order. It
involves arranging grouped data elements into a predetermined sequence to facilitate file
processing.
• Calculating: The arithmetic or logical manipulation of data in a file is referred to as
calculation.
• File querying/interrogating: This is retrieving specific data from a file according to a
set of retrieval criteria.
• File merging: Combining multiple sets of data files or records to produce a single set,
usually in an ordered sequence, is referred to as file merging.
• Reporting: Reporting is a file processing operation that deals with the production
(printing) of a report from the file in a specified format.
• File display: The contents of a data file can be displayed either on the computer screen
as soft copy or printed on paper as hard copy.
• File storage: When a file is created, it is stored on an appropriate storage medium such
as a disk, flash disk, tape, or drum.
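The combined sketch below (plain Python; the file name and records are hypothetical) strings several of these operations together: creation, writing, reading, searching by key, updating, sorting and merging:

    # Create and write: each line is a record of the form "key,name".
    with open("students.csv", "w") as f:
        f.write("23,Bola\n11,Ade\n")

    # Retrieve/read all records.
    with open("students.csv") as f:
        records = [line.strip().split(",") for line in f]

    # Search by key (the unique field identifying a record).
    found = next((r for r in records if r[0] == "11"), None)
    print("search:", found)

    # Update: append the most recent data to the file.
    with open("students.csv", "a") as f:
        f.write("7,Chi\n")

    # Sort records in ascending order of key.
    with open("students.csv") as f:
        records = sorted((line.strip().split(",") for line in f),
                         key=lambda r: int(r[0]))
    print("sorted:", records)

    # Merge with another ordered set of records.
    other = [["5", "Eze"], ["30", "Ngo"]]
    merged = sorted(records + other, key=lambda r: int(r[0]))
    print("merged:", merged)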
Concept of File Security
Computer security refers to techniques developed to safeguard information and information
systems stored on computers. Potential threats include the destruction of computer hardware
and software and the loss, modification, theft, unauthorized use, observation, or disclosure
of computer data.
Computers and the information they contain are often considered confidential systems because
their use is typically restricted to a limited number of users. This confidentiality can be
compromised in a variety of ways. For example, computers and computer data can be harmed
by people who spread computer viruses and worms. A computer virus is a set of computer
program instructions that attaches itself to other programs. Viruses are often carried in
documents that are transmitted as attachments to e-mail messages. A worm is similar
to a virus but is a self-contained program that transports itself from one computer to another
through networks. Thousands of viruses and worms exist and can quickly contaminate millions
of computers.
People who intentionally create viruses are computer experts often known as hackers. Hackers
also violate confidentiality by observing computer monitor screens and by impersonating
authorized users of computers in order to gain access to the users’ computers. They invade
computer databases to steal the identities of other people by obtaining private, identifying
information about them. Hackers also engage in software piracy and deface Web sites on the
Internet. For example, they may insert malicious or unwanted messages on a Web site, or alter
graphics on the site. They gain access to Web sites by impersonating Web site managers.
Malicious hackers are increasingly developing powerful software crime tools such as automatic
computer virus generators, Internet eavesdropping sniffers, password guessers, vulnerability
testers, and computer service saturators. For example, an Internet eavesdropping sniffer
intercepts Internet messages sent to other computers. A password guesser tries millions of
combinations of characters in an effort to guess a computer’s password. Vulnerability testers
look for software weaknesses. These crime tools are also valuable security tools used for testing
the security of computers and networks.
An increasingly common hacker tool that has gained widespread public attention is the
computer service saturator, used in denial-of-service attacks, which can shut down a selected
or targeted computer on the Internet by bombarding the computer with more requests than it
can handle. This tool first searches for vulnerable computers on the Internet where it can install
its own software program. Once installed, the compromised computers act like “zombies”
sending usage requests to the target computer. If thousands of computers become infected with
the software, then all would be sending usage requests to the target computer, overwhelming
its ability to handle the requests for service.
A variety of simple techniques can help prevent computer crimes, such as protecting computer
screens from observation, keeping printed information and computers in locked facilities,
backing up copies of data files and software, and clearing desktops of sensitive information
and materials. Increasingly, however, more sophisticated methods are needed to prevent
computer crimes. These include using encryption techniques, establishing software usage
permissions, mandating passwords, and installing firewalls and intrusion detection systems. In
addition, controls within application systems and disaster recovery plans are also necessary.
The main goals of information security include the following:
• Information privacy or confidentiality: keeping information secret from all but those
who are authorized to see it.
• Data integrity: ensuring information has not been altered by unauthorized or unknown
means.
• Entity authentication or identification: corroboration of the identity of an entity (e.g.,
a person, a computer terminal, a credit card, etc.).
• Message authentication: corroborating the source of information; also known as data
origin authentication.
• Information security: encompasses confidentiality, data integrity, entity authentication,
and data origin authentication.
• Scalability: the ability of a computer application to continue to function well when there
is a change in the size or volume of information, in order to meet user needs.
• Efficiency: information must be accurate and correct at all times.
• Effectiveness: information must be useful for its purpose.
File Security Techniques
Backup
Storing backup copies of software and data files, and having backup computer and
communication capabilities, are important basic safeguards because the data can then be
restored if it is altered or destroyed by a computer crime or accident. Computer data should
be backed up frequently and stored in secure locations in case of damage at the primary site.
Transporting sensitive data to storage locations should also be done securely.
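A minimal sketch of a backup routine in Python, assuming a hypothetical file and backup directory; shutil.copy2 also preserves file timestamps:

    import os, shutil, time

    def back_up(path, backup_dir="backups"):
        """Copy a file into backup_dir, stamping the copy with the time."""
        os.makedirs(backup_dir, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")
        target = os.path.join(backup_dir, f"{stamp}-{os.path.basename(path)}")
        shutil.copy2(path, target)   # copy2 preserves modification times
        return target

    # print(back_up("students.csv"))  # e.g. backups/20240101-120000-students.csv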
Encryption
Another technique to protect confidential information is encryption. Computer users can
scramble information to prevent unauthorized users from accessing it. Authorized users can
unscramble the information when needed by using a secret code called a key; without the key,
the scrambled information would be impossible or very difficult to unscramble. A more complex
form of encryption uses two keys, called the public key and the private key, and a system of
double encryption. Each participant possesses a secret private key and a public key that is
known to potential recipients. A message encrypted with one key can be decrypted only with the
matching key: a sender encrypts a message with the recipient's public key, and only the
recipient's private key can decrypt it. The advantage over the single-key method lies with the
private keys, which are never shared and so cannot be intercepted. In addition, a message
signed with the sender's private key can be verified with the sender's public key, confirming
who transmitted it. The keys are modified periodically, further hampering unauthorized
unscrambling and making the encrypted information more difficult to decipher.
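As an illustration of single-key (symmetric) encryption, the sketch below uses the third-party cryptography package for Python (installed with pip install cryptography); it shows the idea of scrambling with a secret key, not a complete security design:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()      # the secret key; keep it safe
    cipher = Fernet(key)

    token = cipher.encrypt(b"account balance: 5000")  # scrambled bytes
    print(token)                      # unreadable without the key
    print(cipher.decrypt(token))      # b'account balance: 5000'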
Approved Users
Another technique to help prevent abuse and misuse of computer data is to limit the use of
computers and data files to approved persons. Security software can verify the identity of
computer users and limit their privileges to use, view, and alter files. The software also securely
records their actions to establish accountability. Military organizations give access rights to
classified, confidential, secret, or top-secret information according to the corresponding
security clearance level of the user. Other types of organizations also classify information and
specify different degrees of protection.
Passwords
Passwords are confidential sequences of characters that allow approved persons to make use
of specified computers, software, or information. To be effective, passwords must be difficult
to guess and should not be found in dictionaries. Effective passwords contain a variety of
characters and symbols that are not part of the alphabet. To thwart imposters, computer systems
usually limit the number of attempts and restrict the time it takes to enter the correct password.
A more secure method is to require possession and use of tamper-resistant plastic cards with
microprocessor chips, known as “smart cards,” which contain a stored password that
automatically changes after each use. When a user logs on, the computer reads the card's
password, as well as another password entered by the user, and matches these two respectively
to an identical card password generated by the computer and the user's password stored in the
computer in encrypted form. Use of passwords and 'smart cards' is beginning to be reinforced
by biometrics, identification methods that use unique personal characteristics, such as
fingerprints, retinal patterns, facial characteristics, or voice recordings.
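In practice, systems store passwords in hashed rather than plain form, so that even the stored value does not reveal the password. A minimal sketch using only Python's standard library; the salt size and iteration count shown are illustrative:

    import hashlib, hmac, os

    def hash_password(password, salt=None):
        """Derive a salted hash; store the (salt, digest) pair, never the password."""
        salt = salt or os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return salt, digest

    def verify(password, salt, digest):
        # compare_digest avoids leaking information through timing differences
        return hmac.compare_digest(hash_password(password, salt)[1], digest)

    salt, digest = hash_password("s3cret!")
    print(verify("s3cret!", salt, digest))  # True
    print(verify("guess", salt, digest))    # False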
Firewalls
Computers connected to communication networks, such as the Internet, are particularly
vulnerable to electronic attack because so many people have access to them. These computers
can be protected by using firewall computers or software placed between the networked
computers and the network. The firewall examines, filters, and reports on all information
passing through the network to ensure its appropriateness. These functions help prevent
saturation of input capabilities that otherwise might deny usage to legitimate users, and they
ensure that information received from an outside source is expected and does not contain
computer viruses.
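Real firewalls filter traffic at the network level; purely as a conceptual sketch, the toy Python filter below applies ordered allow/deny rules to hypothetical (address, port) pairs:

    # Each rule: (action, address prefix, port); "" and None match anything.
    RULES = [
        ("allow", "10.0.0.", 80),    # internal web traffic
        ("deny",  "", 23),           # block telnet from anywhere
        ("allow", "", 443),          # HTTPS from anywhere
        ("deny",  "", None),         # default: deny everything else
    ]

    def permitted(addr, port):
        # The first matching rule decides whether traffic passes.
        for action, prefix, rule_port in RULES:
            if addr.startswith(prefix) and rule_port in (None, port):
                return action == "allow"
        return False

    print(permitted("10.0.0.5", 80))     # True
    print(permitted("203.0.113.9", 23))  # False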
Intrusion Detection Systems
Security software called intrusion detection systems may be used in computers to detect
unusual and suspicious activity and, in some cases, stop a variety of harmful actions by
authorized or unauthorized persons. Abuse and misuse of sensitive system and application
programs and data such as password, inventory, financial, engineering, and personnel files can
be detected by these systems.
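Purely as a conceptual sketch, a very simple detection rule might flag repeated failed logins; the log records below are hypothetical:

    from collections import Counter

    log = [
        ("alice", "fail"), ("alice", "fail"), ("alice", "fail"),
        ("alice", "fail"), ("bob", "ok"),
    ]

    # Count failed logins per user and flag anyone over a threshold.
    failures = Counter(user for user, result in log if result == "fail")
    THRESHOLD = 3
    for user, count in failures.items():
        if count >= THRESHOLD:
            print(f"suspicious activity: {user} failed {count} logins")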
Application Safeguards
The most serious threats to the integrity and authenticity of computer information come from
those who have been entrusted with usage privileges and yet commit computer fraud. For
example, authorized persons may secretly transfer money in financial networks, alter credit
histories, sabotage information, or commit bill payment or payroll fraud. Modifying, removing,
or misrepresenting existing data threatens the integrity and authenticity of computer
information. For example, omitting sections of a bad credit history so that only the good credit
history remains violates the integrity of the document. Entering false data to complete a
fraudulent transfer or withdrawal of money violates the authenticity of banking information.
These crimes can be prevented by using a variety of techniques. One such technique is
checksumming. Checksumming sums the numerically coded word contents of a file before and
after it is used. If the sums are different, then the file has been altered. Other techniques include
authenticating the sources of messages, confirming transactions with those who initiate them,
segregating and limiting job assignments to make it necessary for more than one person to be
involved in committing a crime, and limiting the amount of money that can be transferred
through a computer.
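A minimal sketch of the checksumming idea in Python, using SHA-256 in place of the simple sum described above; the file name is hypothetical:

    import hashlib

    def checksum(path):
        """Return a digest of the file's contents."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # before = checksum("payroll.dat")
    # ... file is used ...
    # if checksum("payroll.dat") != before:
    #     print("file has been altered")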
Disaster Recovery Plans
Organizations and businesses that rely on computers need to institute disaster recovery plans
that are periodically tested and upgraded. This is because computers and storage components
such as diskettes or hard disks are easy to damage. A computer's memory can be erased, and
flooding, fire, or other forms of destruction can damage the computer’s hardware. Computers,
computer data, and components should be installed in safe and locked facilities.
File Dumping vs. File Archiving
File Dumping
File dumping is a file processing activity in which periodic records of the state of the disks,
covering large numbers of users’ files, are made on some form of offline storage device. This
protects against failure in either hardware or software that could lead to corruption of stored
files. In the event of a system error that causes files or information to be lost, the most
recently copied version of a file can be reinstated from the dump.
On a large multiuser system, the total volume of stored files means that it may not be
practicable to dump every file on every occasion. In these cases an incremental dump can be
taken, containing only those files that are marked as having been altered since the last dump.
This reduces the total amount of data to be copied during the dump, allowing dumps to be made
more frequently.
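A minimal sketch of an incremental dump in Python, using file modification times to decide what has been altered since the last dump; the directory names are hypothetical:

    import os, shutil

    def incremental_dump(source_dir, dump_dir, last_dump_time):
        """Copy only files modified since the previous dump."""
        os.makedirs(dump_dir, exist_ok=True)
        for name in os.listdir(source_dir):
            path = os.path.join(source_dir, name)
            if os.path.isfile(path) and os.path.getmtime(path) > last_dump_time:
                shutil.copy2(path, os.path.join(dump_dir, name))

    # incremental_dump("/data/users", "/dump/2024-01-01", last_dump_time=1704067200.0)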
File Archiving
File archiving is the long-term retention of historically valuable records, ranging from papers
and documents to photographs, films, videotapes, and sound recordings. An archive is a
repository for information that the user wishes to retain but does not require immediate access
to. Three activities must be distinguished in archiving, viz:
• The routine checking of back-up copies, initiated by the system manager, to protect users
and the system manager against corruption of stored information or files.
• The autonomous transferring of files from a higher-performance to a lower-performance
storage system, initiated by the operating system, to achieve economies in the total cost
to the system manager of information storage.
• The voluntary transferring of a file between normal file storage and archive storage,
initiated by the user, to achieve economies in the total cost to the user of information
storage.