CMP 214
Course Contents
Information storage and retrieval, information management applications, information capture
and representation, analysis and indexing, search, retrieval, information privacy, integrity,
security, scalability, efficiency and effectiveness. Introduction to database systems,
components of a database system, DBMS functions, database architecture and data independence,
and use of database query languages.
Objectives:
1. Demonstrate good knowledge of basic database concepts, including the structure and
operation of the relational data model.
2. Understand and successfully apply logical database design principles, including E-R
diagrams and database normalization.
3. Understand information storage and retrieval.
4. Assess the quality and ease of use of data modelling and diagramming tools.
5. Understand information capture and representation.
6. Design and implement a small database project.
Learning outcome:
At the end of this course, students should be able to:
1. Define data management and its importance
2. Know the basic database terminologies
3. Know database management systems
4. Know DBMS architecture
5. Understand information storage and retrieval
6. Know DBMS languages
7. Know DBMS data models
8. Know the ER model
9. Understand design theory for relational databases
10. Understand relational algebra
11. Understand information capture and representation
12. Understand file organization methods
13. Understand file characteristics and file processing activities
14. Understand the concept of buffers and file security techniques
15. Design and implement a small database project using Microsoft Access/MySQL.
References/Further Reading
1. Silberschatz, A., Korth, H. F., & Sudarshan, S. (2011). Database System Concepts
(6th ed.). McGraw-Hill, New York.
Basic Terminologies
Data
Data is the raw representation of unprocessed facts, figures, concepts or instructions. It can
exist in any form, usable or not. Data are facts presented without relation to other things,
e.g. "it is raining".
Information
Information is data that has been processed and given meaning by way of relational connection.
This "meaning" can be useful, but does not have to be. In computer parlance, a relational
database makes information from the data stored within it. Information embodies
understanding of some sort, e.g. "the temperature dropped to 15 degrees and then it started
raining."
Database
A database is a collection of organized and related files. For instance, the collection of the
staff file, student file and equipment file of an institution is referred to as the institution's database.
This is a collection of data organized for storage in computer memory and designed for easy
access by authorized users. The data may be in the form of text, numbers, or encoded graphics.
A database is a shared, integrated computer structure that is a repository for:
End-user data, that is, raw facts of interest to the end user
Metadata, or data about data, which describes the characteristics of the data and the set of
relationships that link the data found within the database.
A database is a collection of data, typically describing the activities of one or more related
organizations. For example, a university database might contain information about the
following:
Entities such as students, faculty, courses, and classrooms.
Relationships between entities, such as students' enrolment in courses, faculty teaching
courses, and the use of rooms for courses.
Proper storage of data in a database will enhance efficient
- Data Management
- Data processing
- Data retrieval
Database System
Refers to an organization of components that define and regulate the collection, storage,
management and use of data. From a general management point of view, the DB system is composed of:
- Hardware
- Software
- People
System administrators: oversee the database system's operations
DB administrators: manage the DBMS and ensure the DB is functioning
properly
DB designers
System analysts and programmers: design and implement the application
programs
End users
- Procedures
- Data
DBMS ARCHITECTURE
The DBMS provides users with an abstract view of the data in it i.e. the system hides certain
details of how the data is stored and maintained from users. A DBMS can be viewed as divided
into levels of abstraction. A common architecture generally used is the ANSI/SPARC
(American National Standards Institute - Standards Planning and Requirements Committee)
model. The ANSI/SPARC model abstracts the DBMS into a 3-tier architecture as follows:
- External level
- Conceptual level
- Internal level
DBMS architecture
External level: The external level is the user’s view of the database and closest to the users. It
presents only the relevant part of the DBMS to the user. E.g. A bank database stores a lot more
information but an account holder is only interested in his/her account details such as the
current account balance, transaction history etc. An external schema describes each external
view. The external schema consists of the definition of the logical records and the relationships
in the external view. In the external level, the different views may have different
representations of the same data.
Conceptual level: At this level of database abstraction, all the database entities and
relationships among them are included. Conceptual level provides the community view of the
database and describes what data is stored in the database and the relationships among the data.
In other words, the conceptual view represents the entire database of an organization. It is a
complete view of the data requirements of the organization that is independent of any storage
consideration. The conceptual schema defines the conceptual view. It is also called the logical
schema. There is only one conceptual schema per database. The figure shows the conceptual
view record of a database.
Internal level or physical level: The lowest level of abstraction is the internal level. It is the
one closest to physical storage device. This level is also termed as physical level, because it
describes how data are actually stored on the storage medium such as hard disk, magnetic tape
etc. This level indicates how the data will be stored in the database and describes the data
structures, file structures and access methods to be used by the database. The internal schema
defines the internal level. The internal schema contains the definition of the stored records, the
methods of representing the data fields and the access methods used. The figure shows the
internal view record of a database.
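The external/conceptual split above can be sketched with a database view. The following is a minimal illustration using Python's sqlite3 as a stand-in DBMS; the account table, its columns, and the view name are invented for the example, not taken from the text.

```python
import sqlite3

# Conceptual level: the full account relation, as the organization stores it.
# (Table, column, and view names here are illustrative assumptions.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acct_no TEXT, holder TEXT, balance REAL, branch TEXT)")
conn.execute("INSERT INTO account VALUES ('A-101', 'Ada', 500.0, 'Ikeja')")

# External level: a view exposing only the part an account holder cares about.
conn.execute("CREATE VIEW holder_view AS SELECT acct_no, balance FROM account")

rows = conn.execute("SELECT * FROM holder_view").fetchall()
print(rows)  # [('A-101', 500.0)]
```

The view hides the holder and branch columns, mirroring how an external schema presents only the relevant part of the conceptual schema.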
Database Architecture
Database applications are usually partitioned into two or three parts as in Figure (a) below.
Two-tier architecture
In a two-tier architecture, the application resides at the client machine, where it invokes
database system functionality at the server machine through query language statements.
Application program interface standards like ODBC and JDBC are used for interaction
between the client and the server.
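The call pattern of a two-tier client can be sketched in Python, whose DB-API plays a role analogous to ODBC/JDBC. This is only a sketch: sqlite3 is an embedded stand-in for a networked database server, and the table name is invented.

```python
import sqlite3  # the DB-API driver here stands in for an ODBC/JDBC driver

# The application (client tier) holds a connection and ships SQL strings
# to the database tier through the driver's call-level interface.
conn = sqlite3.connect(":memory:")  # in a real two-tier setup: a network address/DSN
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.execute("INSERT INTO t VALUES (?)", (42,))  # parameterized call, as in ODBC/JDBC
value = cur.execute("SELECT x FROM t").fetchone()[0]
print(value)  # 42
conn.close()
```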
Three-tier architecture
In contrast, in a three-tier architecture, the client machine acts as merely a front end and does
not contain any direct database calls. Instead, the client end communicates with an application
server, usually through a forms interface. The application server in turn communicates with a
database system to access data. The business logic of the application, which says what actions
to carry out under what conditions, is embedded in the application server, instead of being
distributed across multiple clients. Three-tier applications are more appropriate for large
applications, and for applications that run on the World Wide Web.
Data independence
A database system normally contains a lot of data in addition to users’ data. For example, it
stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult
to modify or update a set of metadata once it is stored in the database. But as a DBMS expands,
it needs to change over time to satisfy the requirements of the users. If all of the data were
dependent on each other, modification would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one layer, it does
not affect the data at another level. This data is independent but mapped to each other.
Logical Data Independence
Logical data is data about the database; that is, it stores information about how data is managed
inside, for example, a table (relation) stored in the database and all the constraints applied on
that relation.
Logical data independence is a mechanism that insulates the logical schema from the actual data
stored on disk. If we make changes to the table format, it should not change the data residing
on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data
independence is the power to change the physical data without impacting the schema or logical
data.
For example, in case we want to change or upgrade the storage system itself — suppose we
want to replace hard-disks with SSD — it should not have any impact on the logical data or
schemas.
INFORMATION STORAGE AND RETRIEVAL
An information storage and retrieval system is a network with a built-in user interface that
facilitates the creation, searching, and modification of stored data. Files on a computer can be
created, moved, modified, grown, shrunk and deleted. In most cases, computer programs that
are executed on the computer handle these operations, but the user of a computer can also
manipulate files if necessary. For instance, Microsoft Word files are normally created and
modified by the Microsoft Word program in response to user commands, but the user can also
move, rename, or delete these files directly by using a file manager program such as Windows
Explorer.
DBMS Languages
The workings of a DBMS are controlled by four different languages:
Data Definition Language (DDL): Used by the DBA and database designers to specify the
conceptual schema of a database. In many DBMSs, the DDL is also used to define internal and
external schemas (views). In some DBMSs, separate storage definition language (SDL) and
view definition language (VDL) are used to define internal and external schemas. SDL is
typically realized via DBMS commands provided to the DBA and database designers. Some
examples include:
- CREATE - to create objects in the database
- ALTER - alters the structure of the database
- DROP - delete objects from the database
- TRUNCATE - remove all records from a table; the space allocated for the
records is also removed
- COMMENT - add comments to the data dictionary
- RENAME - rename an object
CREATE
Creates new databases, tables, and views in an RDBMS.
For example:
CREATE DATABASE tutorialspoint;
CREATE TABLE article;
CREATE VIEW for_students;
DROP
Drops views, tables, and databases from an RDBMS.
For example:
DROP object_type object_name;
DROP DATABASE tutorialspoint;
DROP TABLE article;
DROP VIEW for_students;
ALTER
Modifies the database schema.
ALTER object_type object_name parameters;
For example:
ALTER TABLE article ADD subject VARCHAR;
This command adds an attribute named subject, of string type, to the relation article.
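The DDL commands above can be exercised end to end. The following sketch uses Python's sqlite3 (note that SQLite has no CREATE DATABASE statement, since the database is the file itself, so that step is replaced by opening a connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for CREATE DATABASE

# CREATE: define a new table in the schema.
conn.execute("CREATE TABLE article (title VARCHAR(100))")

# ALTER: add a 'subject' attribute, as in the example above.
conn.execute("ALTER TABLE article ADD subject VARCHAR(50)")

# Inspect the resulting structure (row[1] is the column name).
cols = [row[1] for row in conn.execute("PRAGMA table_info(article)")]
print(cols)  # ['title', 'subject']

# DROP: remove the object from the schema entirely.
conn.execute("DROP TABLE article")
```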
Data Manipulation Language (DML): these statements manage data within schema
objects. They specify database retrievals and updates. DML commands (a data sublanguage) can
be embedded in a general-purpose programming language (host language), such as COBOL,
C, C++, or Java.
- A library of functions can also be provided to access the DBMS from a programming
language
- Alternatively, stand-alone DML commands can be applied directly (called a query
language).
Some examples in SQL include:
- SELECT - Retrieve data from a database
- INSERT - Insert data into a table
- UPDATE - Update existing data within a table
- DELETE - Delete records from a table; the space for the records remains
- MERGE - UPSERT operation (insert or update)
- CALL - Call a PL/SQL or Java subprogram
- EXPLAIN PLAN - explain access path to data
- LOCK TABLE - control concurrency
SELECT/FROM/WHERE
SELECT
This is one of the fundamental query commands of SQL. It is similar to the projection operation
of relational algebra. It selects the attributes based on the condition described by the WHERE
clause.
FROM
This clause takes a relation name as an argument from which attributes are to be
selected/projected. If more than one relation name is given, this clause corresponds to a
Cartesian product.
WHERE
This clause defines predicate or conditions, which must match in order to qualify the attributes
to be projected.
For example:
SELECT author_name
FROM book_author
WHERE age > 50;
This command will yield the names of authors from the relation book_author whose
age is greater than 50.
INSERT INTO/VALUES
This command is used for inserting values into the rows of a table (relation).
Syntax:
INSERT INTO table (column1 [, column2, column3 ... ]) VALUES (value1 [,
value2, value3 ... ])
Or
INSERT INTO table VALUES (value1, [value2, ... ])
For example:
INSERT INTO tutorialspoint (Author, Subject) VALUES ('anonymous',
'computers');
UPDATE/SET/WHERE
This command is used for updating or modifying the values of columns in a table (relation).
Syntax:
UPDATE table_name SET column_name = value [, column_name = value ...]
[WHERE condition]
For example:
UPDATE tutorialspoint SET Author='webmaster' WHERE Author='anonymous';
DELETE/FROM/WHERE
This command is used for removing one or more rows from a table (relation).
Syntax:
DELETE FROM table_name [WHERE condition];
For example:
DELETE FROM tutorialspoint
WHERE Author='unknown';
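The four DML statements above can be run as one sequence. A minimal sketch, again using Python's sqlite3 and the tutorialspoint table from the examples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tutorialspoint (Author VARCHAR(50), Subject VARCHAR(50))")

# INSERT INTO ... VALUES: add a row.
conn.execute("INSERT INTO tutorialspoint (Author, Subject) VALUES ('anonymous', 'computers')")

# UPDATE ... SET ... WHERE: modify the matching row.
conn.execute("UPDATE tutorialspoint SET Author='webmaster' WHERE Author='anonymous'")

# SELECT ... FROM ... WHERE: retrieve the qualifying attributes.
rows = conn.execute("SELECT Author FROM tutorialspoint WHERE Subject='computers'").fetchall()
print(rows)  # [('webmaster',)]

# DELETE FROM ... WHERE: remove the row; the table itself remains.
conn.execute("DELETE FROM tutorialspoint WHERE Author='webmaster'")
remaining = conn.execute("SELECT COUNT(*) FROM tutorialspoint").fetchone()[0]
print(remaining)  # 0
```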
Data Control Language (DCL): used for granting and revoking user access on a database
- To grant access to user – GRANT
- To revoke access from user – REVOKE
Transaction Control (TCL): Statements are used to manage the changes made by DML
statements. It allows statements to be grouped together into logical transactions.
Some examples include:
- COMMIT - save work done
- SAVEPOINT - identify a point in a transaction to which you can later roll back
- ROLLBACK - restore the database to its state at the last COMMIT
- SET TRANSACTION - change transaction options such as the isolation level and what
rollback segment to use
In practice, the data definition language, data manipulation language and data control
language are not separate languages; rather, they are parts of a single database language such
as SQL.
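COMMIT and ROLLBACK can be demonstrated directly. A small sketch with Python's sqlite3 (the acct table is invented for the illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (name TEXT, bal REAL)")
conn.execute("INSERT INTO acct VALUES ('A', 100.0)")
conn.commit()    # COMMIT: save the work done so far

# A later change that we decide to abandon...
conn.execute("UPDATE acct SET bal = 0 WHERE name = 'A'")
conn.rollback()  # ROLLBACK: restore the state at the last COMMIT

bal = conn.execute("SELECT bal FROM acct").fetchone()[0]
print(bal)  # 100.0 -- the UPDATE was undone
```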
Example
Write the SQL code that will create the table structure for a table named EMP_1. This table is
a subset of the EMPLOYEE table. The basic EMP_1 table structure is summarized in the
following table. EMP_NUM is the primary key and JOB_CODE is the FK to JOB.
Assuming the data shown in the EMP_1 table have been entered, write the SQL code that will
list all attributes for a job code of 502.
SELECT * FROM EMP_1
WHERE JOB_CODE = ‘502’;
Write the SQL code that will save the changes made to the EMP_1 table.
COMMIT WORK;
NB: WORK is optional.
Hierarchical model:
Applications can navigate a hierarchical database by starting at a root and successively
navigating downward from parent to children until the desired record is found.
Searching down a hierarchical tree is very fast, since the storage layer for hierarchical databases
uses contiguous storage for hierarchical structures. All other types of queries require sequential
search techniques. A DDL for the hierarchical data model must allow the definition of record types,
field types, pointers, and parent-child relationships, and the DML must support direct
navigation using the parent-child relationships and through pointers.
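The root-to-child navigation described above can be sketched with a toy in-memory tree; this is an illustration of the access pattern only, not of any real hierarchical DBMS, and the record names are invented.

```python
# A toy hierarchical record: each node is (record, [children]).
# Navigation always starts at the root and walks parent -> child.
root = ("dept:Sales", [
    ("emp:Ada", [("skill:SQL", [])]),
    ("emp:Lin", [("skill:COBOL", [])]),
])

def find(node, key):
    """Depth-first parent-to-child search, as a hierarchical application would navigate."""
    record, children = node
    if record == key:
        return record
    for child in children:
        hit = find(child, key)
        if hit is not None:
            return hit
    return None

print(find(root, "skill:COBOL"))  # skill:COBOL
```

Note that reaching a record always means descending from the root; there is no way to jump straight to "skill:COBOL" without passing through its ancestors, which is exactly why ad hoc queries are awkward in this model.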
Limitations
- The hierarchical model only permits one-to-many relationships. The concept of logical
relationships is often used to circumvent this limitation. Logical relationships
superimpose another set of connections between data items separate from the physical
tree structure. This, of course, increases complexity.
- Often a natural hierarchy does not exist, and it is awkward to impose a parent-child
relationship. Pointers partially compensate for this weakness, but it is still difficult to
specify suitable hierarchical schemas for large models. Programs have
to navigate very close to the physical data structure level, implying that the hierarchical
data model offers only very limited data independence.
- Lack of ad hoc query capability placed a burden on programmers to generate code for
reports.
Network model:
It represents complex data relationships more effectively than the hierarchical model. The
major improvement is that the one-to-many limitation was removed; the model still views data
in a hierarchical one-to-many structure, but now a record may have more than one parent.
Network data models represent data in a symmetric manner, unlike the hierarchical data model
(with its distinction between a parent and a child). Information is organized as a collection of graphs
of records that are related with pointers. This is more flexible than a hierarchical data model and still
permits efficient navigation.
The records consist of lists of fields (fixed or variable length with maximum length), where
each field contains a simple value (fixed or variable size).
The network data model also introduces the notion of indexes of fields and records, sets of
pointers, and physical placement of records.
A DDL for network data models must allow the definition of record types, field types, pointers
and indexes, and the DML must allow navigation through the graphs via the pointers and
indexes.
Programs also navigate close to the physical storage structures, implying that the network
data model only supports limited data independence; such programs are therefore difficult
to maintain as the data models evolve over time.
Concepts introduced under the network model include:
- Schema: conceptual design of the entire database, usually managed by the DBA
- Sub-schema: virtual view of portions of the database visible to application
programmers
- Data management language: enables definition of and access to the schema and sub-
schema. It consists of a DDL to construct the schema and a DML to develop programs
Limitations
- Cumbersome
- Lack of ad hoc query capability placed burden on programmers to generate code for
reports
- Structural change in the database could produce havoc in all application programs
SOME DEFINITIONS
RELATION
A relation, as defined by E. F. Codd, is a set of tuples (d1, d2, ..., dn), where each element dj
is a member of Dj, a data domain, for each j=1, 2, ..., n. A data domain is simply a data type. It
specifies a data abstraction: the possible values for the data and the operations available on the
data. For example, a String can have zero or more characters in it, and has operations for
comparing strings, concatenating strings, and creating strings. A relation is a truth predicate. It
defines what attributes are involved in the predicate and what the meaning of the predicate is.
In the relational data model, relations are represented in table format. This format stores the
relation among entities. A table has rows and columns, where rows represent records and
columns represent the attributes.
TUPLE
A single row of a table, which contains a single record for that relation is called a tuple. A tuple
has attribute values which match the required attributes in the relation. The ordering of attribute
values is immaterial. Every tuple in the body of a given relation is required to conform to the
heading (attribute) of that relation, i.e. it contains exactly one value, of the applicable type, for
each attribute.
ATTRIBUTE
The columns of a relation are named by attributes. Attributes appear at the tops of the columns.
Usually, an attribute describes the meaning of entries in the column below. For instance, the
column with attribute length holds the length, in minutes, of each movie.
ATTRIBUTE DOMAIN
Every attribute has some predefined value scope, known as attribute domain.
ATTRIBUTE VALUE/INSTANCE
An attribute value is the value for an attribute in a particular tuple. An attribute value must
come from the domain that the attribute specifies. Most relational DBMSs allow NULL
attribute values. Each attribute value in the relational model must be atomic, i.e. it must be of
some elementary type such as integer or string. It is not permitted for a value to be a record
structure, set, list, array, or any other type that reasonably can have its values broken into
smaller components.
SCHEMAS
The name of a relation and the set of attributes for a relation is called the schema for that
relation. The schema is depicted by the relation name followed by a parenthesized list of its
attributes. Thus, the schema for relation Movies above is
Movies (title, year, length, genre)
In the relational model, a database consists of one or more relations. The set of schemas for the
relations of a database is called a relational database schema, or just a database schema.
Data Types
All attributes must have a data type. The following are the primitive data types that are
supported by SQL (Structured Query Language) systems.
i. Character strings of fixed or varying length. The type CHAR(n) denotes a fixed-
length string of up to n characters. VARCHAR(n) also denotes a string of up to n
characters. The difference is implementation-dependent; typically CHAR implies that
short strings are padded to make n characters, while VARCHAR implies that an
endmarker or string-length is used. Normally, a string is padded by trailing blanks if it
becomes the value of a component that is a fixed-length string of greater length. For
example, the string ’foo’ if it became the value of a component for an attribute of type
CHAR(5), would assume the value ’foo ’ (with two blanks following the second o).
ii. Bit strings of fixed or varying length. These strings are analogous to fixed and
varying-length character strings, but their values are strings of bits rather than
characters. The type BIT (n) denotes bit strings of length n, while BIT VARYING (n)
denotes bit strings of length up to n.
iii. BOOLEAN. The type BOOLEAN denotes an attribute whose value is logical. The
possible values of such an attribute are TRUE, FALSE.
iv. INT or INTEGER. The type INT or INTEGER (these names are synonyms) denotes
typical integer values. The type SHORTINT also denotes integers, but the number of
bits permitted may be less, depending on the implementation (as with the types int and
short int in C).
v. FLOAT or REAL. Floating-point numbers can be represented in a variety of ways.
We may use the type FLOAT or REAL (these are synonyms) for typical floating point
numbers. A higher precision can be obtained with the type DOUBLE PRECISION. We
can also specify real numbers with a fixed decimal point. For example, DECIMAL(n,d)
allows values that consist of n decimal digits, with the decimal point assumed to be d
positions from the right. Thus, 0123.45 is a possible value of type DECIMAL(6,2).
NUMERIC is almost a synonym for DECIMAL, although there are possible
implementation-dependent differences.
vi. DATE and TIME. Dates and times can be represented by the data types DATE and
TIME, respectively. These values are essentially character strings of a special form. We
may, in fact, coerce dates and times to string types, and we may do the reverse if the
string “makes sense” as a date or time.
A relational database Schema is depicted by stating both the attributes and their datatype:
Movies (
title CHAR(100),
year INT,
length INT,
genre CHAR(10),
studioName CHAR(30),
producer INT
)
Relation instance: A finite set of tuples in the relational database system represents a relation
instance. Relation instances do not have duplicate tuples.
{
<Person SSN# = "123-45-6789" Name = "Art Larsson" City = "San Francisco">,
<Person SSN# = "231-45-6789" Name = "Lino Buchanan" City = "Philadelphia">,
<Person SSN# = "321-45-6789" Name = "Diego Jablonski" City = "Chicago">
}
It is more common and concise to show a relation value as a table. All ordering within the table
is artificial and meaningless.
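Both properties of a relation instance, no duplicate tuples and no meaningful ordering, fall out naturally if we model it as a set of tuples. A short sketch using the Person tuples above:

```python
# A relation instance is a *set* of tuples: no duplicates, no row order.
person = {
    ("123-45-6789", "Art Larsson", "San Francisco"),
    ("231-45-6789", "Lino Buchanan", "Philadelphia"),
    ("321-45-6789", "Diego Jablonski", "Chicago"),
}

# Adding a duplicate tuple leaves the relation unchanged.
person.add(("123-45-6789", "Art Larsson", "San Francisco"))
print(len(person))  # 3
```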
DESIGN THEORY FOR RELATIONAL DATABASE
A common problem with schema design involves trying to combine too much into one relation,
thus leading to redundancy. Thus, improvements to relational schemas pay close attention to
eliminating redundancy. The theory of "dependencies" is a well-developed theory for relational
databases, providing guidelines on how to develop good schemas and eliminate flaws if any. The
first concept we need to consider is Functional Dependency (FD).
FUNCTIONAL DEPENDENCY: the term functional dependence can be defined most easily
this way:
Definition:
Let A and B be subsets of the attribute of a relation R. Then the functional dependency (FD)
A→B
holds in R if and only if, whenever two tuples of R have the same value for A, they also have
the same value for B. A and B are the determinant and the dependent, respectively, and the FD
overall can be read as “A functionally determines B” or “B is functionally dependent on A,” or
more simply just as
A→B
If A and B are composite, then we have
A1, A2, …, An → B1, B2, …, Bm
This is also equivalent to
A1, A2, …, An → B1, A1, A2, …, An → B2, ..., A1, A2, …, An → Bm
The attribute(s) B is functionally dependent on attribute(s) A if A determines B, e.g.
STU_PHONE is functionally dependent on STU_NUM.
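The definition above translates directly into a check over a relation instance: A → B holds iff no two tuples agree on A but disagree on B. A minimal sketch (the fd_holds helper and the sample data are ours, for illustration):

```python
def fd_holds(rows, A, B):
    """Check whether the FD A -> B holds in a relation instance.

    rows: list of dicts (tuples keyed by attribute name)
    A, B: lists of attribute names (determinant and dependent)
    """
    seen = {}
    for row in rows:
        a_val = tuple(row[attr] for attr in A)
        b_val = tuple(row[attr] for attr in B)
        if a_val in seen and seen[a_val] != b_val:
            return False  # same A-value, different B-value: FD violated
        seen[a_val] = b_val
    return True

students = [
    {"STU_NUM": 1, "STU_PHONE": "0801"},
    {"STU_NUM": 2, "STU_PHONE": "0802"},
]
print(fd_holds(students, ["STU_NUM"], ["STU_PHONE"]))  # True
```

Note that an instance can only refute an FD, not prove it in general; whether an FD holds is a statement about every legal instance of the relation.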
RELATION KEYS: The key’s role is based on a concept known as determination. i.e. the
statement “A determines B” indicates that if you know the value of attribute A, you can look
up (determine) the value of attribute B. E.g.: an invoice number identifies all of the invoice
attributes such as invoice date and the customer name.
If we know STU_NUM in a STUDENT table we can look up (determine) student’s last name,
grade point average, phone number, etc.
Attributes {title, year, starName} form a key for the relation Movies because it meets the two
conditions:
Condition 1:
Do they functionally determine all the other attributes? Yes
Condition 2:
Does any proper subset of {title, year, starName} functionally determine all other attributes?
{title, year} does not determine starName, thus {title, year} is not a key.
{year, starName} is not a key because we could have a star in two movies in the same year;
therefore {year, starName} → title is not an FD.
{title, starName} is not a key, because two movies with the same title, made in different years,
can have a star in common.
Therefore, no proper subset of {title, year, starName} functionally determines all other
attributes.
Super Key (a superset of a key): An attribute or a combination of attributes that is
used to identify records uniquely is known as a Super Key. It is to be noted that some
superkeys are not (minimal) keys. Note that every superkey satisfies the first condition of a
key: it functionally determines all other attributes of the relation. However, a superkey need
not satisfy the second condition: minimality. A table can have many Super Keys. Examples of
Super Keys:
ID
ID, Name
ID, Address
ID, Department_ID
ID, Salary
Name, Address
Candidate Key: It can be defined as a minimal or irreducible Super Key; in other
words, an attribute or a combination of attributes that identifies the record uniquely, but none
of whose proper subsets can identify the records uniquely. Examples of Candidate Keys:
Code
Name, Address
Primary Key: A Candidate Key that is used by the database designer for unique identification
of each row in a table is known as Primary Key. A Primary Key can consist of one or more
attributes of a table. E.g. of Primary Key - Database designer can use one of the Candidate Key
as a Primary Key.
In this case we have "Code" and "Name, Address" as Candidate Keys. The designer may prefer
"Code" as the Primary Key, as the other key is the combination of more than one attribute.
Null values should never be part of a primary key, and they should also be avoided to the greatest
extent possible in other attributes. A null is no value at all. It does not mean a zero or a
space. There are rare cases in which nulls cannot be reasonably avoided when you are working
with non-key attributes. For example, one of an EMPLOYEE table’s attributes is likely to be
the EMP_INITIAL. However, some employees do not have a middle initial. Therefore, some
of the EMP_INITIAL values may be null. Null can also exist because of the nature of the
relationship between two entities. Conventionally, the existence of nulls in a table is often an
indication of poor database design. Nulls, if used improperly, can create problems because they
have many different meanings. For example, a null can represent:
An unknown attribute value.
A known, but missing, attribute value.
A “not applicable” condition.
Foreign Key: A foreign key is an attribute or combination of attributes in one base table that
points to the candidate key (generally it is the primary key) of another table. The purpose of
the foreign key is to ensure referential integrity of the data i.e. only values that are supposed to
appear in the database are permitted. E.g. consider two tables:
Employee (EmployeeID, EmployeeName, DOB, DOJ, SSN, DeptID, MgrID) and
DeptTbl (Dept_ID, Dept_Name, Manager_ID, Location_ID)
Dept_ID is the primary key in table DeptTbl. The DeptID attribute of table Employee
(the dependent or child table) can be defined as a Foreign Key, as it can reference the Dept_ID
attribute of the table DeptTbl (the referenced or parent table). A Foreign Key value must match
an existing value in the parent table or be NULL.
Composite Key: If we use multiple attributes to create a Primary Key then that Primary Key
is called Composite Key (also called a Compound Key or Concatenated Key).
Full functional dependency (FFD): If the attribute (B) is functionally dependent on a
composite key (A) but not on any subset of that composite key, the attribute (B) is fully
functionally dependent on (A).
Alternate Key: Alternate Key can be any of the Candidate Keys except for the Primary Key.
Secondary Key: The attributes that are not even a Super Key but can still be used for
identification of records (not uniquely) are known as Secondary Keys. Examples of Secondary
Keys are Name, Address, Salary, Department_ID etc., as they can identify records but they
might not be unique.
Armstrong's Axioms
If F is a set of functional dependencies, then the closure of F, denoted F+, is the set of all
functional dependencies logically implied by F. Armstrong's Axioms are a set of rules that,
when applied repeatedly, generate the closure of a set of functional dependencies.
Reflexivity / reflexive rule: If {B1, B2, ..., Bm} ⊆{A1, A2, ..., An}, then A1, A2, …,
An → B1, B2, …, Bm. These are what we have called trivial FD’s.
Augmentation rule: If A1A2 … An → B1B2 … Bm, then A1A2 … AnC1C2 … Ck
→ B1B2, … BmC1C2 … Ck for any set of attributes C1, C2, ..., Ck
Since some of the C ’s may also be A’s or B’s or both, we should eliminate from the
left side duplicate attributes and do the same for the right side.
Transitivity rule: If A1, A2, …, An → B1, B2, …, Bm and B1, B2, …, Bm → C1,
C2,…, Ck hold in relation R, then A1, A2, …, An → C1, C2, …, Ck also holds in R.
Additional rules:
- Union: If X → Y and X → Z, then X → Y Z
- Pseudotransitivity: If X → Y and W Y → Z, then W X → Z
- Composition: If X → Y and Z → W, then XZ → Y W
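A practical use of these rules is computing the closure X+ of a set of attributes X: the set of all attributes functionally determined by X. A minimal sketch (the closure helper is ours; it applies the given FDs to a growing attribute set until nothing changes):

```python
def closure(attrs, fds):
    """Compute the closure X+ of a set of attributes under a set of FDs.

    fds: iterable of (lhs, rhs) pairs, each a set of attributes.
    Repeatedly applies any FD whose left side is already in the result.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Example FDs: AB -> C, C -> D, D -> A
fds = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]
print(sorted(closure({"A", "B"}, fds)))  # ['A', 'B', 'C', 'D']
```

X is a superkey exactly when X+ contains every attribute of the relation, which makes this routine the workhorse for key-finding.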
Transitive dependence: an attribute Y is said to be transitively dependent on attribute X if Y
is functionally dependent on another attribute Z which is functionally dependent on X.
Exercise
Consider a relation with schema R (A, B, C, D) and FD's AB → C, C → D and D → A.
i. What are all the nontrivial FD’s that follow from the given FD’s? You should restrict
yourself to FD’s with single attributes on the right side.
ii. What are all the keys of R?
iii. What are all the superkeys for R that are not keys?
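The exercise can be checked mechanically by enumerating attribute subsets and testing closures. A sketch under one assumption: since D → D is trivially true for any relation, the second FD is read here as C → D, which is what the exercise presumably intends.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure X+ of an attribute set under the FDs (fixpoint iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

R = {"A", "B", "C", "D"}
# FDs from the exercise, reading the trivial D -> D as C -> D (our assumption).
fds = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]

# A superkey is any subset whose closure is all of R.
superkeys = [set(c) for n in range(1, len(R) + 1)
             for c in combinations(sorted(R), n)
             if closure(set(c), fds) == R]
# A key is a superkey none of whose proper subsets is a superkey.
keys = [k for k in superkeys if not any(s < k for s in superkeys)]
print(sorted("".join(sorted(k)) for k in keys))  # ['AB', 'BC', 'BD']
```

Under that reading, every key must contain B (B never appears on the right side of any FD), and the keys come out as AB, BC and BD; the remaining superkeys are their proper supersets.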
RELATIONAL ALGEBRA
We can find all instructors with salary greater than $90,000 by writing:
𝜎salary>90000 (instructor)
Furthermore, we can combine several predicates into a larger predicate by using the
connectives and (∧), or (∨), and not (¬).
Thus, to find the instructors in Physics with a salary greater than $90,000, we write
𝜎dept name = “Physics” ∧ salary>90000 (instructor)
To find all departments whose name is the same as their building name, we can write:
𝜎dept name = building (department)
To find the set of all courses taught in the Fall 2009 semester, we write:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section))
To find the set of all courses taught in the Spring 2010 semester, we write:
∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))
To answer the query, we need the union of these two sets; that is, we need all course IDs that
appear in either or both of the two relations. We find these data by the binary operation union,
denoted, as in set theory, by ∪. So the expression needed is:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section)) ⋃
∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))
Courses offered in the Fall 2009 semester but not in the Spring 2010 semester are found with
the set-difference operation:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section)) − ∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))
As with the union operation, we must ensure that set differences are taken between compatible
relations. Therefore, for a set-difference operation r − s to be valid, we require that the relations
r and s be of the same arity, and that the domains of the ith attribute of r and the ith attribute of
s be the same, for all i.
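These set-based operators are easy to prototype. The sketch below models relations as lists of dictionaries and implements select, project, union, and difference; the sample section rows are invented illustrations, not the full schema from the reference text.

```python
def select(rel, pred):
    """sigma: keep only the rows satisfying the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, attrs):
    """pi: keep the named columns and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(zip(attrs, t)))
    return out

def union(r, s):
    """r U s: rows appearing in either relation (relations must be compatible)."""
    attrs = sorted(set().union(*(row.keys() for row in r + s)))
    return project(r + s, attrs)

def difference(r, s):
    """r - s: rows of r that do not appear in s."""
    s_rows = {tuple(sorted(row.items())) for row in s}
    return [row for row in r if tuple(sorted(row.items())) not in s_rows]

section = [
    {"course_id": "CS-101", "semester": "Fall",   "year": 2009},
    {"course_id": "CS-347", "semester": "Fall",   "year": 2009},
    {"course_id": "CS-101", "semester": "Spring", "year": 2010},
]

fall   = project(select(section, lambda r: r["semester"] == "Fall" and r["year"] == 2009),
                 ["course_id"])
spring = project(select(section, lambda r: r["semester"] == "Spring" and r["year"] == 2010),
                 ["course_id"])
print([r["course_id"] for r in union(fall, spring)])       # ['CS-101', 'CS-347']
print([r["course_id"] for r in difference(fall, spring)])  # ['CS-347']
```

Note how projection removes the duplicate CS-101 row, exactly as the set semantics of ∏ requires.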
This naming convention requires that the relations that are the arguments of the Cartesian-
product operation have distinct names. This requirement causes problems in some cases, such
as when the Cartesian product of a relation with itself is desired. A similar problem arises if
we use the result of a relational-algebra expression in a Cartesian product, since we shall need
a name for the relation so that we can refer to the relation’s attributes.
Courses offered in both the Fall 2009 and Spring 2010 semesters are found with the
intersection operation:
∏course_id (𝜎semester = “Fall” ∧ year=2009 (section)) ∩ ∏course_id (𝜎semester = “Spring” ∧ year=2010 (section))
We are now ready for a formal definition of the natural join. Consider two relations r(R) and
s(S). The natural join of r and s, denoted by 𝑟 ⋈ 𝑠, is a relation on schema R ∪ S obtained as
follows: take the Cartesian product r × s, select the tuples that agree on every attribute
common to R and S, and project the result onto R ∪ S. Formally, if R ∩ S = {A1, A2, …, An},
then
𝑟 ⋈ 𝑠 = ∏R ∪ S (𝜎r.A1 = s.A1 ∧ r.A2 = s.A2 ∧ … ∧ r.An = s.An (𝑟 × 𝑠))
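The natural join matches tuples that agree on all shared attribute names and outputs the combined schema. A minimal sketch over list-of-dict relations (the instructor and department rows are invented sample data):

```python
def natural_join(r, s):
    """r join s: pair tuples agreeing on all shared attributes; schema is R U S."""
    common = set(r[0]) & set(s[0])     # attribute names in both schemas
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

instructor = [{"name": "Einstein", "dept_name": "Physics"},
              {"name": "Brandt",   "dept_name": "History"}]
department = [{"dept_name": "Physics", "building": "Watson"}]

print(natural_join(instructor, department))
# [{'name': 'Einstein', 'dept_name': 'Physics', 'building': 'Watson'}]
```

Brandt is dropped because no department tuple agrees with it on dept_name; the join keeps only matching combinations.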
If we examine the PROFESSOR and DEPARTMENT tables, we note some important
features:
Each professor is a College employee; thus, the professor identification is through the
EMP_NUM. (However, note that not all employees are professors—there’s another
optional relationship).
The 1:1 PROFESSOR chairs DEPARTMENT relationship is implemented by having
the EMP_NUM as foreign key in the DEPARTMENT table. Note that the 1:1
relationship is treated as a special case of the 1:M relationship in which the “many” side
is restricted to a single occurrence. In this case, DEPARTMENT contains the
EMP_NUM as a foreign key to indicate that it is the department that has a chair.
Also, note that the PROFESSOR table contains the DEPT_CODE foreign key to
implement the 1:M DEPARTMENT employs PROFESSOR relationship. This is a
good example of how two entities can participate in two (or even more) relationships
simultaneously. The preceding “PROFESSOR chairs DEPARTMENT” example
illustrates a proper 1:1 relationship. In fact, the use of a 1:1 relationship ensures that
two entity sets are not placed in the same table when they should not be.
The M:N Relationship:
A many-to-many (M:N) relationship is not supported directly in the relational environment.
However, M:N relationships can be implemented by creating a new entity in 1:M relationships
with the original entities. To explore the many-to-many (M:N) relationship, consider a rather
typical college environment in which each STUDENT can take many CLASSes, and each
CLASS can contain many STUDENTs. The ER model for this M:N relationship is below:
Given the data relationship and the sample data in the table above, it can be wrongly assumed
that an M:N relationship can be implemented by simply adding a foreign key in the “many” side
of the relationship that points to the primary key of the related table. This is not correct.
The tables will create many redundancies. For example, note that the STU_NUM values
occur many times in the STUDENT table. In a real-world situation, additional student
attributes such as address, classification, major, and home phone would also be
contained in the STUDENT table, and each of those attribute values would be repeated
in each of the records shown here. Similarly, the CLASS table contains many
duplications: each student taking the class generates a CLASS record. The problem
would be even worse if the CLASS table included such attributes as credit hours and
course description.
Given the structure and contents of the two tables, the relational operations become
very complex and are likely to lead to system efficiency errors and output errors.
The problems inherent in the many-to-many (M:N) relationship can easily be avoided by
creating a composite entity (also referred to as a bridge entity or an associative entity).
Because such a table is used to link the tables that were originally related in an M:N
relationship, the composite entity structure includes—as foreign keys—at least the primary
keys of the tables that are to be linked. The database designer can then define the composite
table’s primary key either by using the combination of those foreign keys or by creating a new
primary key. In the example above, we can create the composite ENROLL table to link CLASS
and STUDENT. In this example, the ENROLL table’s primary key is the combination of its foreign
keys CLASS_CODE and STU_NUM. But the designer could have decided to create a single-
attribute new primary key such as ENROLL_LINE, using a different line value to identify each
ENROLL table row uniquely. (Microsoft Access users might use the Autonumber data type to
generate such line values automatically).
Because the ENROLL table links two tables, STUDENT and CLASS, it is also called a linking
table. In other words, a linking table is the implementation of a composite entity.
The ENROLL table yields the required M:N to 1:M conversion. Observe that the composite
entity represented by the ENROLL table must contain at least the primary keys of the CLASS
and STUDENT tables (CLASS_CODE and STU_NUM, respectively) for which it serves as a
connector. Also note that the STUDENT and CLASS tables now contain only one row per
entity. The ENROLL table contains multiple occurrences of the foreign key values, but those
controlled redundancies are incapable of producing anomalies as long as referential integrity is
enforced.
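The conversion can be demonstrated concretely. The sketch below (table and column names follow the example; the data values are invented) builds the two 1:M relationships in SQLite and shows referential integrity rejecting an ENROLL row that references a nonexistent student:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE STUDENT (STU_NUM    INTEGER PRIMARY KEY, STU_LNAME TEXT);
CREATE TABLE CLASS   (CLASS_CODE TEXT    PRIMARY KEY, CLASS_SECTION TEXT);
CREATE TABLE ENROLL  (
    CLASS_CODE   TEXT    REFERENCES CLASS  (CLASS_CODE),
    STU_NUM      INTEGER REFERENCES STUDENT(STU_NUM),
    ENROLL_GRADE TEXT,
    PRIMARY KEY (CLASS_CODE, STU_NUM)  -- composite PK built from the two FKs
);
INSERT INTO STUDENT VALUES (321452, 'Bowser');
INSERT INTO CLASS   VALUES ('10014', '1');
INSERT INTO ENROLL  VALUES ('10014', 321452, 'A');
""")

try:
    # Student 999999 does not exist, so referential integrity rejects the row
    conn.execute("INSERT INTO ENROLL VALUES ('10014', 999999, 'B')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With the constraint enforced, the controlled redundancy in ENROLL cannot produce orphaned enrollment rows.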
Additional attributes may be assigned as needed. In this case, ENROLL_GRADE is selected
to satisfy a reporting requirement. Also note that the ENROLL table’s primary key consists of
the two attributes CLASS_CODE and STU_NUM because both the class code and the student
number are needed to define a particular student’s grade. Naturally, the conversion is reflected
in the ERM, too. The revised relationship is shown below:
Changing the M:N relationship to two 1:M relationships
Note that the composite entity named ENROLL represents the linking table between
STUDENT and CLASS. We can increase the amount of available information even as we
control the database’s redundancies. Below is the expanded ERM, including the 1:M
relationship between COURSE and CLASS. Note that the model is able to handle multiple
sections of a CLASS while controlling redundancies by making sure that all of the COURSE
data common to each CLASS are kept in the COURSE table.
Expanded entity relationship model
The relationship diagram that corresponds to the ERM shown above is as below:
CODD’S RELATIONAL DATABASE RULES
In 1985, Dr. E. F. Codd published a list of 12 rules to define a relational database system. The
reason Dr. Codd published the list was his concern that many vendors were marketing products
as “relational” even though those products did not meet minimum relational standards. Dr.
Codd’s list serves as a frame of reference for what a truly relational database should be. Note
that even the dominant database vendors do not fully support all 12 rules.
Dr. Codd’s 12 Relational Database Rules
1. Information: All information in a relational database must be logically represented as
column values in rows within tables.
2. Guaranteed Access: Every value in a table is guaranteed to be accessible through a
combination of table name, primary key value, and column name.
3. Systematic Treatment of Nulls: Nulls must be represented and treated in a systematic way,
independent of data type.
4. Dynamic Online Catalog Based on the Relational Model: The metadata must be stored and
managed as ordinary data, that is, in tables within the database. Such data must be available
to authorized users through the standard database relational language.
5. Comprehensive Data Sublanguage: The relational database may support many languages.
However, it must support one well-defined, declarative language with support for data
definition, view definition, and data manipulation.
6. View Updating: Any view that is theoretically updatable must be updatable through the
system.
7. High-Level Insert, Update, and Delete: The database must support set-level inserts,
updates, and deletes.
8. Physical Data Independence: Application programs and ad hoc facilities are logically
unaffected when physical access methods or storage structures are changed.
9. Logical Data Independence: Application programs and ad hoc facilities are logically
unaffected when changes are made to the table structures that preserve the original table
values (such as changing the order of columns or inserting columns).
10. Integrity Independence: All relational integrity constraints must be definable in the
relational language and stored in the system catalog, not at the application level.
11. Distribution Independence: End users and application programs are unaware of and
unaffected by the data location (distributed versus local databases).
12. Nonsubversion: If the system supports low-level access to the data, there must not be a
way to bypass the integrity rules of the database.
Rule Zero: All preceding rules are based on the notion that in order for a database to be
considered relational, it must use its relational facilities exclusively to manage the database.
THE ENTITY RELATIONSHIP MODEL (ERM)
Peter Chen first introduced the ER data model in 1976; it was the graphical representation of
entities and their relationships in a database structure that quickly became popular because it
complemented the relational data model concepts.
The relational data model and ERM combined to provide the foundation for tightly structured
database design. ER models are normally represented in an entity relationship diagram (ERD),
which uses graphical representations to model database components. The ERD represents the
conceptual database as viewed by the end user. ERDs depict the database’s main components:
entities, attributes, and relationships. Because an entity represents a real-world object, the
words entity and object are often used interchangeably. The notations used with ERDs are the
original Chen notation and the newer Crow’s Foot and UML notations. Some conceptual
database modeling concepts can be expressed only using the Chen notation. Because of its
implementation emphasis, the Crow’s Foot notation can represent only what could be
implemented. In summary:
The Chen notation favors conceptual modeling.
The Crow’s Foot notation favors a more implementation-oriented approach.
The UML notation can be used for both conceptual and implementation modeling.
Components of the ER model:
Entity: An entity is anything about which data are to be collected and stored. An entity is
represented in the ERD by a rectangle, also known as an entity box. The name of the entity, a
noun, is written in the center of the rectangle. The entity name is generally written in capital
letters and is written in the singular form: PAINTER rather than PAINTERS, and EMPLOYEE
rather than EMPLOYEES. Usually, when applying the ERD to the relational model, an entity
is mapped to a relational table. Each row in the relational table is known as an entity instance
or entity occurrence in the ER model. Each entity is described by a set of attributes that
describes particular characteristics of the entity. For example, the entity EMPLOYEE will have
attributes such as a Social Security number, a last name, and a first name. A collection of like
entities is known as an entity set. The word entity in the ERM corresponds to a table—not to a
row—in the relational environment. The ERM refers to a table row as an entity instance or
entity occurrence.
Attributes: Attributes are characteristics of entities. For example, the STUDENT entity
includes, among many others, the attributes STU_LNAME, STU_FNAME, and
STU_INITIAL. In the original Chen notation, attributes are represented by ovals and are
connected to the entity rectangle with a line. Each oval contains the name of the attribute it
represents. In the Crow’s Foot notation, the attributes are written in the attribute box below the
entity rectangle. Because the Chen representation is rather space-consuming, software vendors
have adopted the Crow’s Foot attribute display.
Required and Optional Attributes: A required attribute is an attribute that must have a value;
in other words, it cannot be left empty. As shown below there are two boldfaced attributes in
the Crow’s Foot notation. This indicates that a data entry will be required. In this example,
STU_LNAME and STU_FNAME require data entries because of the assumption that all
students have a last name and a first name. But students might not have a middle name, and
perhaps they do not (yet) have a phone number and an e-mail address. Therefore, those
attributes are not presented in boldface in the entity box. An optional attribute is an attribute
that does not require a value; therefore, it can be left empty.
Attributes of the STUDENT entity: Chen and crow’s foot
Attribute domains: Attributes have a domain. A domain is the set of possible values for a
given attribute. For example, the domain for the grade point average (GPA) attribute is written
(0,4) because the lowest possible GPA value is 0 and the highest possible value is 4. The
domain for the gender attribute consists of only two possibilities: M or F (or some other
equivalent code). The domain for a company’s date of hire attribute consists of all dates that
fit in a range (for example, company startup date to current date). Attributes may share a
domain. For instance, a student address and a professor address share the same domain of all
possible addresses. In fact, the data dictionary may let a newly declared attribute inherit the
characteristics of an existing attribute if the same attribute name is used. For example, the
PROFESSOR and STUDENT entities may each have an attribute named ADDRESS and could
therefore share a domain.
Identifiers (Primary Keys): The ERM uses identifiers, that is, one or more attributes that
uniquely identify each entity instance. In the relational model, such identifiers are mapped to
primary keys (PKs) in tables. Identifiers are underlined in the ERD. Key attributes are also
underlined in a frequently used table structure shorthand notation using the format:
Table Name (Key_Attribute 1, Attribute 2, Attribute 3, . . . Attribute K)
Composite Identifiers: Ideally, an entity identifier is composed of only a single attribute.
However, it is possible to use a composite identifier, that is, a primary key composed of more
than one attribute. E.g. CLASS entity of CRS_CODE and CLASS_SECTION instead of using
CLASS_CODE. Either approach uniquely identifies each entity instance.
Composite and Simple Attributes: Attributes are classified as simple or composite. A
composite attribute, not to be confused with a composite key, is an attribute that can be further
subdivided to yield additional attributes. For example, the attribute ADDRESS can be
subdivided into street, city, state, and zip code. Similarly, the attribute PHONE_NUMBER can
be subdivided into area code and exchange number. A simple attribute is an attribute that cannot
be subdivided. For example, age, sex and marital status would be classified as simple attributes.
To facilitate detailed queries, it is wise to change composite attributes into a series of simple
attributes.
Single-Valued Attributes: A single-valued attribute is an attribute that can have only a single
value. For example, a person can have only one Social Security number, and a manufactured
part can have only one serial number. Keep in mind that a single-valued attribute is not
necessarily a simple attribute. For instance, a part’s serial number, such as SE-08-02-189935,
is single-valued, but it is a composite attribute because it can be subdivided into the region in
which the part was produced (SE), the plant within that region (08), the shift within the plant
(02), and the part number (189935).
Multivalued Attributes: Multivalued attributes are attributes that can have many values. For
instance, a person may have several college degrees, and a household may have several
different phones, each with its own number. Similarly, a car’s color may be subdivided into
many colors (that is, colors for the roof, body, and trim). In the Chen ERM, the multivalued
attributes are shown by a double line connecting the attribute to the entity. The Crow’s Foot
notation does not identify multivalued attributes.
The ERD above contains all of the components introduced thus far. Note that CAR_VIN is the
primary key, and CAR_COLOR is a multivalued attribute of the CAR entity.
Another benefit we can derive from this approach is that we are now able to assign as many
colors as necessary without having to change the table structure.
Note that the ERM shown in Figure above reflects the components listed in previous table.
This is the preferred way to deal with multivalued attributes. Creating a new entity in a 1:M
relationship with the original entity yields several benefits: it’s a more flexible, expandable
solution, and it is compatible with the relational model.
Derived Attributes: A derived attribute is an attribute whose value is calculated (derived)
from other attributes. The derived attribute need not be physically stored within the database;
instead, it can be derived by using an algorithm. For example, an employee’s age, EMP_AGE,
may be found by computing the integer value of the difference between the current date and
the EMP_DOB. In Microsoft Access, we use: INT((DATE() - EMP_DOB)/365).
In Microsoft SQL Server, we use
SELECT DATEDIFF(YEAR, EMP_DOB, GETDATE()),
where DATEDIFF is a function that computes the difference between two dates. The first
parameter indicates the unit of measurement, in this case, years. In Oracle, we use SYSDATE
instead of DATE().
A derived attribute is indicated in the Chen notation by a dashed line connecting the attribute
and the entity. The Crow’s Foot notation does not have a method for distinguishing the derived
attribute from other attributes.
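The same derivation can be done in application code. This sketch (the function name is ours) computes the age exactly by comparing month and day, avoiding the /365 approximation used in the Access formula:

```python
from datetime import date

def emp_age(emp_dob, today=None):
    """Derive EMP_AGE from EMP_DOB at query time instead of storing it."""
    today = today or date.today()
    # Subtract one year if this year's birthday has not happened yet
    had_birthday = (today.month, today.day) >= (emp_dob.month, emp_dob.day)
    return today.year - emp_dob.year - (0 if had_birthday else 1)

print(emp_age(date(1990, 6, 15), today=date(2024, 6, 14)))  # 33
print(emp_age(date(1990, 6, 15), today=date(2024, 6, 15)))  # 34
```

Because the value is always computed from EMP_DOB, it can never go stale, which is the main argument for not storing derived attributes.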
The left side of the ER diagram shows the Chen notation, based on Peter Chen’s landmark
paper. In this notation, the connectivities are written next to each entity box. Relationships are
represented by a diamond connected to the related entities through a relationship line. The
relationship name is written inside the diamond. The right side illustrates the Crow’s Foot
notation. The name “Crow’s Foot” is derived from the three-pronged symbol used to represent
the “many” side of the relationship. In the basic Crow’s Foot ERD represented above, the
connectivities are represented by symbols. For example, the “1” is represented by a short line
segment, and the “M” is represented by the three-pronged “crow’s foot.” The relationship name
is written above the relationship line. In Figure above, entities and relationships are shown in
a horizontal format, but they may also be oriented vertically. The entity location and the order
in which the entities are presented are immaterial; just remember to read a 1:M relationship
from the “1” side to the “M” side.
Connectivity and Cardinality
As stated above, the term connectivity is used to describe the relationship classification.
Cardinality expresses the minimum and maximum number of entity occurrences associated
with one occurrence of the related entity. In the ERD, cardinality is indicated by placing the
appropriate numbers beside the entities, using the format (x,y). The first value represents the
minimum number of associated entities, while the second value represents the maximum
number of associated entities. Many database designers who use Crow’s Foot modeling
notation do not depict the specific cardinalities on the ER diagram itself because the specific
limits described by the cardinalities cannot be implemented directly through the database
design. Correspondingly, some Crow’s Foot ER modeling tools do not print the numeric
cardinality range in the diagram; instead, you can add it as text if you want to have it shown.
In this case, a weak relationship exists between COURSE and CLASS because the
CLASS_CODE is the CLASS entity’s PK, while the CRS_CODE in CLASS is only an FK. In
this example, the CLASS PK did not inherit the PK component from the COURSE entity.
Note that the Chen notation above identifies the weak entity by using a double-walled entity
rectangle. The Crow’s Foot notation generated by Visio Professional uses the relationship
line and the PK/FK designation to indicate whether the related entity is weak. A strong
(identifying) relationship indicates that the related entity is weak. Such a relationship means
that both conditions for the weak entity definition have been met—the related entity is
existence-dependent, and the PK of the related entity contains a PK component of the parent
entity. Remember that the weak entity inherits part of its primary key from its strong
counterpart. For example, at least part of the DEPENDENT entity’s key shown in Figure
above was inherited from the EMPLOYEE entity:
EMPLOYEE (EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_DOB,
EMP_HIREDATE)
DEPENDENT (EMP_NUM, DEP_NUM, DEP_FNAME, DEP_DOB)
Crowfoot symbols
FILE SYSTEM
In computing, a file system (often also written as filesystem) is a method for storing and
organizing computer files and the data they contain to make it easy to find and access them.
File systems may use a data storage device such as a hard disk or CD-ROM and involve
maintaining the physical location of the files. More formally, a file system is a special-purpose
database for the storage, hierarchical organization, manipulation, navigation, access, and
retrieval of data.
The most familiar file systems make use of an underlying data storage device that offers access
to an array of fixed-size blocks, sometimes called sectors, generally a power of 2 in size (512
bytes or 1, 2, or 4 KiB are most common). The file system software is responsible for
organizing these sectors into files and directories, and keeping track of which sectors belong to
which file and which are not being used. Most file systems address data in fixed-sized units
called "clusters" or "blocks" which contain a certain number of disk sectors (usually 1-
64). This is the smallest logical amount of disk space that can be allocated to hold a file.
File names
Whether the file system has an underlying storage device or not, file systems typically have
directories which associate file names with files, usually by connecting the file name to an
index in a file allocation table of some sort. Directory structures may be flat, or allow
hierarchies where directories may contain subdirectories. In some file systems, file names are
structured, with special syntax for filename extensions and version numbers. In others, file
names are simple strings, and per-file metadata is stored elsewhere.
File organization method
There are a large number of ways data can be organized on disk or tape. The main methods of
file organization used for files are:
• Serial
• Sequential
• Indexed Sequential
• Random (or Direct)
Serial Organization
Serial files are stored in chronological order, that is, as each record is received it is stored in
the next available storage position. In general it is only used on a serial medium such as
magnetic tape. This type of file organization means that the records are in no particular order
and therefore to retrieve a single record, the whole file needs to be read from the beginning to
end. Serial organization is usually the method used for creating Transaction files (unsorted),
Work and Dump files.
Sequential Organization
Sequential files are serial files whose records are sorted and stored in ascending or
descending order on a particular key field. The physical order of the records on the disk is not
necessarily sequential, as most manufacturers support an organization where certain records
(inserted after the file has been set up) are held in a logical sequence but are physically placed
into an overflow area. They are no longer physically contiguous with the preceding and
following logical records, but they can be retrieved in sequence.
Indexed Sequential Organization
Indexed Sequential file organization is logically the same as sequential organization, but an
index is built indicating the block containing the record with a given value for the Key field.
This method combines the advantages of a sequential file with the possibility of direct access
using the Primary Key (the primary Key is the field that is used to control the sequence of the
records). These days, manufacturers providing indexed sequential software allow for the
building of indexes using fields other than the primary key. These additional fields on which
indexes are built are called Secondary Keys.
There are three major types of indexes used:
Basic Index: This provides a location for each record (key) that exists in the system.
Implicit Index: This type of index gives a location of all possible records (keys)
whether they exist or not.
Limit Index: This index groups the records (keys) and only provides the location of
the highest key in the group. Generally they form a hierarchical index. Data records are
blocked before being written to disk. An index may consist of the highest key in each
block, (or on each track).
In the above example, data records are shown as being 3 to a block. The index, then, holds the
key of the highest record in each block. (An essential element of the index, which has been
omitted from the diagram for simplicity, is the physical address of the block of data records).
Should we wish to access record 5, whose key is A0038, we can quickly determine from the
index that the record is held in block 2, since this key is greater than the highest key in block
1, A0025, and less than the highest key in block 2, A0053. By way of the index we can go
directly to the record we wish to retrieve, hence the term "direct access".
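A limit index of this kind is essentially a binary search over the highest key per block. In the sketch below, the keys A0025, A0038, and A0053 come from the example; the other keys are invented to fill out the three-record blocks.

```python
import bisect

blocks = [["A0010", "A0017", "A0025"],   # block 1: highest key A0025
          ["A0031", "A0038", "A0053"]]   # block 2: highest key A0053
index = [blk[-1] for blk in blocks]      # limit index: highest key in each block

def find_block(key):
    """Return the 1-based number of the block holding `key`, or None."""
    b = bisect.bisect_left(index, key)   # first block whose top key >= key
    if b < len(blocks) and key in blocks[b]:
        return b + 1
    return None

print(find_block("A0038"))   # 2: A0038 > A0025 and <= A0053, so block 2
```

The index comparison alone identifies the block, so only one block of data records needs to be read from disk.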
Random (or Direct)
A randomly organized file contains records arranged physically without regard to the sequence
of the primary key. Records are loaded to disk by establishing a direct relationship between the
Key of the record and its address on the file, normally by use of a formula (or algorithm) that
converts the primary Key to a physical disk address. This relationship is also used for retrieval.
The use of a formula (or algorithm) is known as 'Key Transformation' and there are several
techniques that can be used:
Division Taking Quotient
Division Taking Remainder
Truncation
Folding
Squaring
Radix Conversion
These methods are often mixed to produce a unique address (or location) for each record (key).
The production of the same address for two different records is known as a synonym.
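The "division taking remainder" technique, for instance, is a one-line transformation; the bucket count and keys below are invented for illustration:

```python
def bucket(key, n_buckets=97):
    """'Division taking remainder': map a numeric primary key to an address."""
    return key % n_buckets

print(bucket(189935))        # 9
print(bucket(189935 + 97))   # 9 again: two keys, same address -- a synonym
```

A prime bucket count is commonly chosen because it spreads keys with regular patterns more evenly across the addresses.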
Random files show a distinct advantage where:
Hit Rate is low
Data cannot be batched or sorted
Fast response is required.
File access methods
Algorithm used for the storage and/or retrieval of records to/from a data file is referred to as
file access method. This term is used interchangeably (and loosely) for both the access method
and file organization. Access methods are of two kinds; those that determine the structural
characteristics of the file on which it is used (file organization) and those that do not, as in
secondary indexing. In the case of the first one the same algorithm is used for the retrieval of
records as was used for its original physical placement/organization. File access method
include the following;
i. Serial access method: This is a method of accessing records in an unsorted file. The term
refers to the ability of accessing each record in a file by a sequential searching process,
i.e., from the first record to the last, one after another: one access must finish before the
next begins. This access method is therefore time-consuming.
ii. Sequential access method: A method of access to a data file where each record is
sequenced in some order. A file is said to be sequentially accessed if the sequence of
transaction presented to it matches a sequence in which records are organized.
iii. Indexed sequential access method: This is an access method for a sequentially
organized file whose records are indexed with their corresponding address. This access
method supports both sequential access and indexed access and the two types of access
that can be comfortably done in this method are sequential access and random access.
Note that this method is only possible with disk files.
a. Sequential access: The access of the index sequential file makes minimal use
of the index. The records can be accessed serially as they are ordered one after
another and from the first to the last record in the file.
b. Random access: The random access of the index sequential makes full use of
the index to access the records using the record address associated with the
index. The records are accessed without any defined sequence of sorting. The
method is used when the transaction is to be processed immediately without
waiting for sorting, and it is widely used in commercial data processing because
it allows both sequential and random methods of access.
iv. Random access method: Random access is a type of access in which storage
locations can be accessed in any order. This means that a file is said to be randomly
accessed if the sequence of transactions submitted to it does not match any sequence
in which its records may be organized.
Summary of file access methods
The information stored in a file can be accessed in a variety of methods:
Sequential: in order, one record after another.
Direct (random): in any order, skipping the previous records.
Indexed or Keyed: in any order, but with particular value(s).
Other access methods, such as indexed, can be built on top of the above basic techniques.
Indexed sequential access method (ISAM) is built on random and sequential access.
File characteristics
Data file(s) should have at least one of the following behaviors or qualities:
Hit rate: This is the frequency with which active records in the file are accessed. File
records that are accessed more often are said to have high hit rate, while those records
which are rarely accessed are said to have low hit rate.
Volatility: The ease with which records are added to or deleted from a file is described
as file volatility. This calls for careful handling of the data file to avoid sudden loss of its
contents (records) due to its volatile nature.
Growth: This refers to the increase in the size of the file. When a file is created, it
grows as new records are added to it.
Size: The file size is expressed in terms of characters in the field of the record and the
number of records. When a file is created, the size is determined by the amount of data
stored in the file.
Seek Time: Seek time is one of the three delays associated with reading or writing data
on a computer's disk drive (and is broadly similar for CD or DVD drives). The others are
rotational delay and transfer time, and their sum is the access time. In order to read or
write data in a particular place on the disk, the read/write head of the disk needs to be
physically moved to the correct place. This process is known as seeking, and the time it
takes for the head to move to the right place is the seek time. Seek time for a given disk
varies depending on how far the head's destination is from its origin at the time of each
read/write instruction.
Rotational Delay: The rotational delay is the time required for the address area of the
disk to rotate into a position where it is accessible by the read/write head.
Transfer Time: Transfer time is the time required to move the data to or from the disk
once the head is in position. It is determined by the transfer rate, the number of bits
transferred per unit of time, expressed in bits per second (bps).
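The three delays combine into the access time. A small Python calculation makes this concrete; the drive figures below are illustrative assumptions, not specifications of any real drive:

```python
# Access time = seek time + rotational delay + transfer time.
# All figures are illustrative assumptions for a hypothetical drive.
seek_ms = 9.0             # average seek time (ms)
rpm = 7200                # spindle speed (revolutions per minute)
transfer_rate_mb_s = 150  # sustained transfer rate (MB/s)
block_kb = 4              # size of the block being read (KB)

# On average the head waits half a revolution for the right sector.
rotational_delay_ms = (60_000 / rpm) / 2

# Time to actually move the block once the head is in position.
transfer_ms = (block_kb / 1024) / transfer_rate_mb_s * 1000

access_ms = seek_ms + rotational_delay_ms + transfer_ms
print(round(rotational_delay_ms, 3))  # 4.167
print(round(access_ms, 3))            # 13.193
```

Note that for a small block, seek time and rotational delay dominate; the transfer itself is a tiny fraction of the total.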
CONCEPT OF BUFFER
A buffer is a temporary memory for data, normally used to accommodate the difference in the
rate at which two devices can handle data during transfer. The buffer may be built into a
peripheral device, such as a printer or disk drive, or may be part of the system's main memory.
The buffering technique is used to compensate for the slow and possibly erratic rate at which a
peripheral device produces or consumes data. If the device communicated directly with the
program, the program would be constrained to run in synchronism with the device; buffering
allows the two to operate independently.
Consider a program sending output to a slow device: a memory area (the buffer) is set aside for
communication. The program places data in the buffer at its own rate. Although the device may
be slow, the program does not have to stop unless the buffer fills up; at the same time, the device
runs at full speed unless the buffer is empty. The same buffering technique is equally applicable
to input devices.
By providing a data buffer in the device or control unit, devices that would ordinarily require
the channel for a long period can be made to perform independently of the channel. For
instance, a card reader may physically require about 60 ms to read a card and transfer the data
to memory through the channel.
However, a buffered card reader always reads one card before it is needed and saves the 80 bytes
in a buffer in the card reader or control unit. When the channel requests that a card be read, the
contents of the 80-byte buffer are transferred to memory at high speed (e.g., in 100 µs) and
the channel is released. The device then proceeds to read the next card and buffer it
independently. Card readers, card punches and printers are frequently buffered.
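The producer/consumer behaviour described above can be sketched in Python, with a bounded queue standing in for the buffer and a thread standing in for the slow device; the sizes and delays are simulated, illustrative values:

```python
# Buffering between a fast producer (the program) and a slow consumer
# (a peripheral device), sketched with a bounded queue as the buffer.
import queue
import threading
import time

buffer = queue.Queue(maxsize=8)   # the buffer: the program blocks only when it is full
received = []

def slow_device():
    # Simulates a slow peripheral draining the buffer at its own rate.
    for _ in range(20):
        item = buffer.get()
        time.sleep(0.001)         # the device is slower than the program
        received.append(item)
        buffer.task_done()

t = threading.Thread(target=slow_device)
t.start()

# The program fills the buffer at full speed; it stalls only when the
# buffer is full, exactly as described above.
for i in range(20):
    buffer.put(i)

buffer.join()                     # wait until the device has drained the buffer
t.join()
print(received == list(range(20)))  # True: all data arrived, in order
```

The program finishes its `put` loop long before the device has consumed everything; the buffer absorbs the speed difference.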
Functions of Buffer
A buffer synchronizes the speed of data transfer among the systems/devices that are under the
control of the CPU. It enables individual devices (that is, input and output devices) to perform
their functions independently. This is because the rate at which the CPU sends or receives data
is very high compared to the rate at which the I/O devices receive or send data.
A buffer is equally used to accommodate the differences in the rate at which two devices can
handle data during data communication/transfer.
Information capture and representation
File Concepts
In this section, we shall deal with the concept of files and their relationships.
Many operating systems treat a file as a collection of bytes. At a higher level, where the
content of the file is being considered, these binary digits may represent integer values or text
characters, or anything else. It is up to the program using the file to understand the meaning
and internal layout of information in the file and present it to a user as more meaningful
information (like text, images, sounds).
At any instant in time, a file might have a size, normally expressed in bytes, that indicates how
much storage is associated with the file. In most modern operating systems the size can be any
whole number up to a system limit. However, the general definition of a file does not require
that its instant size has any real meaning, unless the data within the file happens to correspond
to data within a pool of persistent storage.
Information in a computer file can consist of smaller packets of information, often called
"records" or "lines" that are individually different but share some trait in common. For example,
a payroll file might contain information concerning all the employees in a company and their
payroll details; each record in the payroll file concerns just one employee, and all the records
have the common trait of being related to payroll. This is very similar to placing all payroll
information into a specific filing cabinet in an office that does not have a computer. A text file
may contain lines of text, corresponding to printed lines on a piece of paper. Alternatively, a
file may contain an arbitrary binary image (a BLOB) or it may contain an executable.
The way information is grouped into a file is entirely up to the person designing the file. This
has led to a plethora of more or less standardized file structures for all imaginable purposes,
from the simplest to the most complex. Most computer files are used by computer programs.
These programs create, modify and delete files for their own use on an as-needed basis. The
programmers who create the programs decide what files are needed, how they are to be used
and their names.
In some cases, computer programs manipulate files that are made visible to the computer user.
For example, in a word-processing program, the user manipulates document files that she
names herself. The content of the document file is arranged in a way that the word-processing
program understands, but the user chooses the name and location of the file, and she provides
the bulk of the information (such as words and text) that will be stored in the file.
Many applications pack all their data files into a single file, using internal markers to discern
the different types of information contained within.
Files on a computer can be created, moved, modified, grown, shrunk and deleted. In most cases,
computer programs that are executed on the computer handle these operations, but the user of
a computer can also manipulate files if necessary. For instance, Microsoft Word files are
normally created and modified by the Microsoft Word program in response to user commands,
but the user can also move, rename, or delete these files directly by using a file manager program
such as Windows Explorer.
The following are different data representation and their relationships:
i. Bit: A bit is simply defined as a binary digit, that is, a 0 or a 1. The bit is the smallest
unit of storage.
A bit is a binary digit, taking a value of either 0 or 1. Binary digits are a basic unit of
information storage and communication in digital computing and digital information
theory. Information theory also often uses the natural digit, called either a nit or a nat.
Quantum computing also uses qubits, a single piece of information with a probability of
being true. The bit is also a unit of measurement, the information capacity of one binary
digit. It has the symbol bit, or b (see discussion below).
Binary digit
Claude E. Shannon first used the word bit in his 1948 paper A Mathematical Theory of
Communication. He attributed its origin to John W. Tukey, who had written a Bell Labs
memo on 9 January 1947 in which he contracted "binary digit" to simply "bit".
Interestingly, Vannevar Bush had written in 1936 of "bits of information" that could be
stored on the punch cards used in the mechanical computers of that time.
A bit of storage can be either on (1) or off (0). A single bit is a one or a zero, a true or a
false, a "flag" which is "on" or "off", or in general, the quantity of information required to
distinguish two mutually exclusive equally probable states from each other. Gregory
Bateson defined a bit as "a difference that makes a difference".
Representation
Transmission
Bits can be implemented in many forms depending on context. For example, in the digital
circuitry of most computing devices, as well as in flash memories, a bit is carried as an
electrical pulse or voltage level, clocked through the control unit or a data register. For
devices using positive logic, a logical 1 (true value) is represented by a voltage of up to
5 volts, while a logical 0 (false value) is represented by 0 volts.
Storage
Bits are manipulated in the volatile memory of a computer, and can further be kept in a
persistent manner on a magnetic storage device such as magnetic tape or disc, as well as on
optical discs.
Unit
It is important to differentiate between the use of "bit" in referring to a discrete storage unit
and the use of "bit" in referring to a statistical unit of information. The bit, as a discrete
storage unit, can by definition store only 0 or 1. A statistical bit is the amount of information
that, on average, can be stored in a discrete bit. It is thus the amount of information carried
by a choice between two equally likely outcomes. One bit corresponds to about 0.693 nats
(ln(2)), or 0.301 hartleys (log10(2)).
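The conversion figures above can be checked directly with Python's standard math module:

```python
# One bit of information expressed in nats and hartleys.
import math

bit_in_nats = math.log(2)        # natural log of 2: ~0.693 nats
bit_in_hartleys = math.log10(2)  # base-10 log of 2: ~0.301 hartleys

print(round(bit_in_nats, 3))      # 0.693
print(round(bit_in_hartleys, 3))  # 0.301
```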
ii. Nibble: A nibble is half a byte, generally 4 bits, e.g. 0110, 1110.
iii. Byte: This is a collection of 2 nibbles, or 8 bits, e.g. 01010101. A byte is a fixed
number of bits that can be treated as a unit by the computer. It typically represents a
character, and can also be described as the smallest addressable unit of the computer
memory. A byte originally differed in size depending on the context, but is now almost
always eight bits. Eight-bit bytes, also known as octets, can represent 256 values (2⁸
values, 0–255). A four-bit quantity is known as a nibble, and can represent 16 values (2⁴
values, 0–15). A rarely used term, crumb, can refer to a two-bit quantity, and can
represent 4 values (2² values, 0–3).
iv. Word: A word is a term for a slightly larger group of bits, but it has no standard size.
It represents the size of one register in a computer's CPU.
v. Character: A character is the smallest unit of information meaningful to a user. It
includes letters, digits and special characters such as +, -, =, *, %, &, etc.
vi. Field: A field is made up of characters, or simply defined as a combination of
characters. It is an attribute that can be assigned values, for example First name, Age,
Gender, Date of birth, etc.
vii. Record: A record is a collection of organized and related fields.
viii. File: A file is a collection of organized and related records.
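The field/record/file hierarchy can be sketched with Python's standard struct module; the field names and fixed widths below are illustrative assumptions:

```python
# Characters form fields, fields form records, records form a file.
import io
import struct

# A record: three fixed-width fields (first name, age, gender).
record_fmt = "10s B 1s"      # 10-char name, 1-byte age, 1-char gender
pack = struct.Struct(record_fmt)

records = [("Ada", 36, "F"), ("Alan", 41, "M")]

# A "file" (kept in memory here) is a collection of organized, related records.
f = io.BytesIO()
for name, age, gender in records:
    f.write(pack.pack(name.ljust(10).encode("ascii"), age, gender.encode("ascii")))

# Read the second record back by direct access: skip one record-sized slot.
f.seek(pack.size)
name, age, gender = pack.unpack(f.read(pack.size))
print(name.decode("ascii").rstrip(), age, gender.decode("ascii"))  # Alan 41 M
```

Because every record has the same fixed layout, the position of record *n* is simply *n* × record size, which is what makes direct access possible.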
File Processing
File is a body of stored data or information in an electronic format. Almost all information
stored on computers is in the form of files. Files reside on mass storage devices such as hard
drives, CD-ROMs, magnetic tape, and floppy disks. When the central processing unit (CPU)
of a computer needs data from a file, or needs to write data to a file, it temporarily stores the
file in its main memory, or Random Access Memory (RAM), while it works on the data.
Information in computers is usually classified into two different types of files: data files (those
containing data) and program files (those that manipulate data). Within each of these
categories, many different types of files exist that store various kinds of information.
Different computer operating systems have unique rules for the naming of files. Windows 95
(Win95) and the disk operating system (DOS), for instance, make use of an extension attached to
the end of each filename in order to indicate the type of file. Extensions begin with a period (.),
and then have one or more letters. An example of a file extension used in Win95 and DOS is
.bak, which indicates that the file is a backup file.
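Splitting a filename into its base name and extension is a common operation; Python's standard os.path.splitext does exactly this (the filename is an illustrative example):

```python
# Separate a filename into its base name and its extension.
import os

base, ext = os.path.splitext("payroll.bak")
print(base)  # payroll
print(ext)   # .bak  (extensions begin with a period)
```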
When saving a file, a user can give it any name within the rules of the operating system. In
addition, the name must be unique. Two files in the same directory may not have the same
name, but some operating systems allow the same name for one file to be placed in more than
one location. These additional names are called aliases.
Directory files contain information used to organize other files into a hierarchical structure. In
the Macintosh operating system, directory files are called folders. The topmost directory in any
file system is the root directory. A directory contained within another directory is called a
subdirectory. Directories containing one or more subdirectories are called parent directories.
Executable files contain programs or commands that can be executed by the computer.
They typically have a .exe suffix at the end of their names and are often called EXE
(pronounced EX-ee) files. Text files contain characters represented by their ASCII (American
Standard Code for Information Interchange) codes. These files are often called ASCII
(pronounced ASK-ee) files. Files that contain words, sentences, and bodies of paragraphs are
frequently referred to as text files. The diagram below shows a root directory, a subdirectory
and a file.
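The same hierarchy can also be sketched in code. The minimal Python example below uses the standard pathlib module; the directory and file names are illustrative:

```python
# A root directory containing a subdirectory, which in turn contains a file.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())   # the root directory (topmost in this tree)
sub = root / "documents"          # a subdirectory; root is its parent directory
sub.mkdir()
doc = sub / "report.txt"          # a file inside the subdirectory
doc.write_text("hello")

print(doc.parent == sub)          # True: 'documents' contains the file
print(sub.parent == root)         # True: root is the parent of the subdirectory
```

Because 'documents' contains a file, its parent (root) is a parent directory, matching the terminology above.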