Sta 2107 Database Mangement 2017 - Copy-1-1
Sta 2107 Database Mangement 2017 - Copy-1-1
Sta 2107 Database Mangement 2017 - Copy-1-1
Database and File System are two methods used to store, retrieve, manage and manipulate
data. Both systems can be used to allow the user to work with data in a similar way. A File
System is a collection of raw data files stored in the hard-drive, whereas a database is
intended for easily organizing, storing and retrieving large amounts of data. In other words,
a database holds a bundle of organized data (typically in a digital form) for one or more
users. Databases, often abbreviated DB, are classified according to their content, such as
document-text, bibliographic and statistical. It should be noted that, even in a database, data
are eventually (physically) stored in some sort of files.
What is a File system?
As mentioned above, in a typical File System electronic data are directly stored in a set of
files. If only one table is stored in a file, it is called a flat file. They contain values in each row
separated with a special delimiter like commas. In order to query some random data, first it
is required to parse each row and load it to an array at run time, but for this file should be
read sequentially (because, there is no control mechanism in files); therefore it is quite
inefficient and time consuming. The burden of locating the necessary file, going through the
records (line by line), checking for the existence of a certain data and remembering what
files/records to edit are on the user. The user either has to perform each task manually or
has to write a script that does them automatically with the help of the file management
capabilities of the operating system. Because of these reasons, File Systems are easily
vulnerable to serious issues like inconsistency, inability to maintain concurrency, data
isolation, threats on integrity and lack of security.
What is a Database?
A database is a collection of information stored, that is organized so that it can easily be
accessed, managed, and updated.
A Database may contain different levels of abstraction in its architecture. Typically, the
three levels: external, conceptual and internal make up the database architecture. External
level defines how the users view the data. A single database can have multiple views. The
internal level defines how the data is physically stored. The conceptual level is
the communication medium between internal and external levels. It provides a unique view
of the database regardless of how it is stored or viewed. There are several types of
databases such as Analytical databases, Data warehouses and Distributed databases.
Databases (more correctly, relational databases) are made up of tables, and they contain
rows and columns, much like spreadsheets in Excel. Each column corresponds to an
attribute while each row represents a single record. For example, in a database, which
stores employee information of a company, the columns could contain employee name,
employee Id and salary, while a single row represents a single employee. Most databases
come with a Database Management System (DBMS) that makes it very easy to
create/manage/organize data.
What is the difference between File system and Database?
As a summery, in a File System, files are used to store data while, a database is a collection
of organized data. Although File System and databases are two ways of managing data,
databases clearly have many advantages over File Systems. Typically when using a File
System, most tasks such as storage, retrieval and search are done manually (even though
most operating systems provide graphical interfaces to make these tasks easier) and it is
quite tedious whereas when using a database, the inbuilt DBMS will provide automated
methods to complete these tasks.
Because of this reason, using a File System will lead to problems like data integrity, data
inconsistency and data security, but these problems could be avoided by using a database.
Unlike a File System, databases are efficient because reading line by line is not required, and
certain control mechanisms are in place.
Disadvantages of File Systems
Data redundancy is found- the presence of duplicate data in multiple data files e.g.
Customer Name, NI number, Address is present in several different files on several
different systems. Errors are generated, time entering data is wasted, computer
resources are needlessly taken up and updates can be an enormous problem.
Data inconsistency will become a problem e.g. information is duplicated in each
system and may be updated in most systems but not necessarily in all – the savings
account and loan account may have different addresses for a particular customer.
Lack of data independence. (Data dependence refers to the close relationship
between data stored in files and the specific software programs needed to update
and maintain those files). In the bank imagine cases where any change in data
format or structure requires a change in all the programs that access the data. E.g.
effort required to change from 3-digit STD to 4-digit STD may be very difficult.
Data lacks integrity, i.e. this is the quality by which information from the system
can be trusted. The problems already stated show that information can be out of
date, can have different values in different parts of the system, can be inaccurate etc.
Lack of flexibility i.e. creating reports, e.g. in this system a summary of account
balances in each separate account would require separate reports for each open
account or cutting and pasting into a word processor file – not a satisfactory state of
affairs.
Data not shareable. Similarly if different departments want to access data this
would be very awkward.
Main Functions of Databases System.
Facilitate remote data entry.
Facilitate information and resource sharing.
Facilitate long term and large amount of information storage
Facilitate centralized information storage.
There are many functions a Database Management System (DBMS) serves that are key
components to the operation of database management. When deciding to implement a
DBMS in your business, first you must decide what type of DBMS you want. Common types
of DBMS are the relational, network, hierarchy and object oriented models. Each kind of
database structure has its own pros and cons.
While each is unique in its own way, there are some standard functions of a DBMS, and
these are:
Updating data in a database includes adding new records, deleting existing records and
changing information within a record. The user does not need to be aware of how DBMS
structures this data, all the user needs to be aware of is the availability of updating and/or
pulling up information, the DBMS handles the processes and the structure of the data on a
disk.
Concurrent updates occur when multiple users make updates to the database
simultaneously. Supporting concurrent updates is also crucial to database management as
this component ensures that updates are made correctly and the end result is accurate.
Without DBMS intervention, important data could be lost and/or inaccurate data stored.
DBMS uses features to support concurrent updates such as batch processing, locking, two-
phase locking, and time stamping to help make certain that updates are done accurately.
Again, the user is not aware all this is happening as it is the database management system’s
responsibility to make sure all updates are stored properly.
In the event a catastrophe occurs, DBMS must provide ways to recover a database so that
data is not permanently lost. There are times computers may crash, a fire or other natural
disaster may occur, or a user may enter incorrect information invalidating or making
records inconsistent.
If the database is destroyed or damaged in any way, the DBMS must be able to recover the
correct state of the database, and this process is called Recovery. The easiest way to do this
is to make regular backups of information. This can be done at a set structured time so in
the event a disaster occurs, the database can be restored to the state that it was last at prior
to backup.
A disadvantage to this is any data or changes entered after the backup would be lost. A way
to counteract this is to set the DBMS to provide a feature called Journaling. This involves
keeping a log of all updates made to the database, it is maintained in a file separate from the
database and can be obtained to re-update the database after it is recovered from the
backup.
Security
Security is the prevention of unauthorized users accessing the database. DBMS uses
features such as encryption, authentication, authorization and views to provide security to
the database. Encryption is when DBMS converts the data in a database to an
indecipherable format. No unauthorized person trying to access this information will be
able to read it. Authorized users will be able to see it in normal form.
Authentication is a technique in which the database administrator can identify the person
accessing the database. Authorized users are given passwords and successful entry of a
valid password will allow the user entry into the database, if a password is not successfully
entered, the user will be denied access. Authorization is a set of rules that the database
administrator (DBA) sets up to specify levels of usage that individuals or groups are allowed
to have. Some users may only be allowed viewing options, while others may be allowed to
both view/make changes.
In some circumstances, users may only be allowed to access certain pieces of the database,
and be denied access to areas that does not relate to their specific needs. In these cases, the
DBA will assign workgroups, and these workgroups will be assigned levels of access and
permissions. Views are when DBA allows certain users the ability to view the tables or fields
that pertain to them, any other view does not exist for them in the database, DBMS does this
behind the scenes and to the user it appears that the information they see are the only
existing data.
Data Integrity
Data integrity is an important function in database management. This is a set of rules that
DBMS provides to see that data integrity is enforced, thus avoiding incorrect or inconsistent
data. Types of integrity that DBMS provides are data type, legal values and format. Key
integrity also falls into this function of DBMS. This enforces that the primary key of a record
remains unique.
Without any of the above listed functions, a database would not be able to work effectively.
Each of these functions plays an important role in database management. And many more
such as
Manages the database - has to ensure problems do not arise if two people simultaneously
access a record and try to update it.
Governs interactions between application programs, input data and the database itself i.e.
allows users to store, retrieve and update as easily as possible without having to be aware
of the internal structure of the database.
Provides an interface between users and the database.
environment
Application-data independence
Improved security
A transaction
A transaction is a logical unit of work that contains one or more SQL statements. A
transaction is an atomic unit. The effects of all the SQL statements in a transaction can be
either all committed (applied to the database) or all rolled back (undone from the
database).
A transaction begins with the first executable SQL statement. A transaction ends when it is
committed or rolled back, either explicitly with a COMMIT or ROLLBACK statement or
implicitly when a DDL statement is issued.
To illustrate the concept of a transaction, consider a banking database. When a bank
customer transfers money from a savings account to a checking account, the transaction can
consist of three separate operations:
Oracle must allow for two situations. If all three SQL statements can be performed to
maintain the accounts in proper balance, the effects of the transaction can be applied to the
database. However, if a problem such as insufficient funds, invalid account number, or a
hardware failure prevents one or two of the statements in the transaction from completing,
the entire transaction must be rolled back so that the balance of all accounts is correct.
A Banking
Transaction
ACID properties
(all good DBMS's should guarantee these)
Atomicity
o should not be able to execute half of an operation
o either all or none of the effects of a transaction are made permanent
Consistency
o there should be no surprises in the world, e.g., gpa > 4.0, balance < 0, cats
should never have more than 1 tail!
o the effect of concurrent transactions is equivalent to some serial execution
o use constraints, triggers, active DB elements (context-free)
Isolation
o concurrency control
o transactions should not be able to observe the partial effects of other
transactions
o use locks (whole relations or individual tuples?)
Durability
o if power goes out, nothing bad should happen
o once accepted, the effects of a transaction are permanent (until, of course,
changed by another transaction)
o use logs
1. Hierarchical model
The hierarchical data model organizes data in a tree structure. There is a hierarchy of
parent and child data segments. This structure implies that a record can have repeating
information, generally in the child data segments. Data in a series of records, which have
a set of field values attached to it. It collects all the instances of a specific record
together as a record type. These record types are the equivalent of tables in the
relational model, and with the individual records being the equivalent of rows. To create
links between these record types, the hierarchical model uses Parent Child
Relationships. These are a 1:N mapping between record types. This is done by using
trees, like set theory used in the relational model, "borrowed" from maths. For example,
an organization might store information about an employee, such as name, employee
number, department, salary. The organization might also store information about an
employee's children, such as name and date of birth. The employee and children data
forms a hierarchy, where the employee data represents the parent segment and the
children data represents the child segment. If an employee has three children, then
there would be three child segments associated with one employee segment. In a
hierarchical database the parent-child relationship is one to many. This restricts a child
segment to having only one parent segment. Hierarchical DBMSs were popular from the
late 1960s, with the introduction of IBM's Information Management System (IMS)
DBMS, through the 1970s.
For example, the following is the hierarchical schema of a company database:
The Hierarchical Data Model structures data in a tree of records, with each record having
one parent record and many children. It can be represented as follows:
Network model
The popularity of the network data model coincided with the popularity of the
hierarchical data model. Some data were more naturally modeled with more than one
parent per child. So, the network model permitted the modeling of many-to-many
relationships in data. In 1971, the Conference on Data Systems Languages (CODASYL)
formally defined the network model. The basic data modeling construct in the network
model is the set construct. A set consists of an owner record type, a set name, and a
member record type. A member record type can have that role in more than one set,
hence the multiparent concept is supported. An owner record type can also be a
member or owner in another set. The data model is a simple network, and link and
intersection record types (called junction records by IDMS) may exist, as well as sets
between them . Thus, the complete network of relationships is represented by several
pairwise sets; in each set some (one) record type is owner (at the tail of the network
arrow) and one or more record types are members (at the head of the relationship
arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted. The
CODASYL network model is based on mathematical set theory.
The Network Data Model uses a lattice structure in which a record can have many parents
as well as many children. It can be represented as follows:
Like the The Hierarchical Data Model the Network Data Model also consists of nodes and
branches, but a child may have multiple parents within the network structure instead of
being restricted to just one.
Both hierarchical and network databases, and they both suffered from the following
deficiencies (when compared with relational databases):
Access to the database was not via SQL query strings, but by a specific set of API's,
typically for FIND, CREATE, READ, UPDATE and DELETE.
Each API would only access a single table (dataset), so it was not possible to
implement a JOIN, which would return data from several tables.
It was not possible to provide a variable WHERE clause. The only selection
mechanism available was
Only a single sort sequence could be defined for each path (link to a parent), so all
records retrieved on that path would be provided in that sequence.
It could make inserts rather slow when attempting to insert into the middle of a
large collection, or where a table had multiple paths each with its own set of sort
criteria.
Relational model
(RDBMS - relational database management system) A database based on the relational
model developed by E.F. Codd. A relational database allows the definition of data structures,
storage and retrieval operations and integrity constraints. In such a database the data and
relations between them are organised in tables. A table is a collection of records and each
record in a table contains the same fields.
Certain fields may be designated as keys, which means that searches for specific
values of that field will use indexing to speed them up. Where fields in two different
tables take values from the same set, a join operation can be performed to select
related records in the two tables by matching values in those fields. Often, but not
always, the fields will have the same name in both tables. For example, an "orders"
table might contain (customer-ID, product-code) pairs and a "products" table might
contain (product-code, price) pairs so to calculate a given customer's bill you would
sum the prices of all products ordered by that customer by joining on the product-
code fields of the two tables. This can be extended to joining multiple tables on
multiple fields. Because these relationships are only specified at retreival time,
relational databases are classed as dynamic database management system. The
RELATIONAL database model is based on the Relational Algebra.
The Relational Data Model
The Relational Data Model has the relation at its heart, but with a whole series of rules
governing it for example,
4. A primary key is the candidate key which is selected as the principal unique
identifier. Every relation must contain a primary key. The primary key is usually the
key selected to identify a row when the database is physically implemented. For
example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs
from a candidate key in that it does not require the non redundancy property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non
key attribute in one relation and as a primary key attribute in another relation. I
say usuallybecause it is possible for a foreign key to also be the whole or part of a
primary key:
A many-to-many relationship can only be implemented by introducing an
intersection or link table which then becomes the child in two one-to-many
relationships. The intersection table therefore has a foreign key for each of its
parents, and its primary key is a composite of both foreign keys.
A one-to-one relationship requires that the child table has no more than one
occurrence for each parent, which can only be enforced by letting the foreign
key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an obvious
meaning to the user or the data. For example, a semantic primary key for a COUNTRY
entity might contain the value 'USA' for the occurrence describing the United States of
America. The value 'USA' has meaning to the user.
8. A technical or surrogate or artificial key is a key for which the possible values
have no obvious meaning to the user or the data. These are used instead of semantic
keys for any of the following reasons:
When the value in a semantic key is likely to be changed by the user,
or can have duplicates. For example, on a PERSON table it is unwise
to use PERSON_NAME as the key as it is possible to have more than
one person with the same name, or the name may change such as
through marriage.
9. A key functionally determines the other attributes in the row, thus it is always
a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index which
does not allow duplicate entries.
Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation
of relational joins at runtime.
6. Some database engines allow a parent table to be linked via a candidate key, but if
this were changed it could result in the link to the child table being broken.
Personal Guidelines
Even if you obey all the preceding rules it is still possible to produce a database design that
causes problems during development. I have come across many different implementation
tips and techniques over the years, and some that have worked in one database system have
been successfully carried forward into a new database system. Some tips, on the other hand,
may only be applicable to a particular database system.
For particular options and limitations you must refer to your database manual.
Database Names
2. Do not waste time using a prefix such as db to identify database names. The SQL
syntax analyser has the intelligence to work that out for itself - so should you.
3. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and lower
case).
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
Again, be consistent.
5. Rather than putting all the tables into a single database it may be better to create
separate databases for each logically related set of tables. This may help with
security, archiving, replication, etc.
Table Names
1. Table names should be short and meaningful, such as part, customer and invoice.
o Short, but not too short.
2. Do not waste time using a prefix such as tbl to identify table names. The SQL syntax
analyser has the intelligence to work that out for itself - so should you.
3. Table names should be in the singular (e.g. customer not customers). The fact that
a table may contain multiple entries is irrelevant - any multiplicity can be derived
from the existence of one-to-many relationships.
4. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, It is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
5. If a table name contains more than one word, such as in sales order and purchase
order, decide how to deal with it:
o Separate the words with a single space, as in sales order (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
Again, be consistent.
6. Be careful if the same table name is used in more than one database - it may lead to
confusion.
Field Names
2. Do not waste time using a prefix such as col or fld to identify column/field names.
The SQL syntax analyser has the intelligence to work that out for itself - so should
you.
3. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
4. If a field name contains more than one word, such as in part name and customer
name, decide how to deal with it:
o Separate the words with a single space, as in part name (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
Again, be consistent.
6. Although field names must be unique within a table, it is possible to use the same
name on multiple tables even if they are unrelated, or they do not share the same set
of possible values. It is recommended that this practice should be avoided, for
reasons described in Field names should identify their content and The naming of
Foreign Keys.
Primary Keys
1. It is recommended that the primary key of an entity should be constructed from the
table name with a suffix of _ID. This makes it easy to identify the primary key in a
long list of field names.
2. Do not waste time using a prefix such as pk to identify primary key fields. This has
absolutely no meaning to any database engine or any application.
3. Avoid using generic names for all primary keys. It may seem a clever idea to use the
name ID for every primary key field, but this causes problems:
o It causes the same name to appear on multiple tables with totally different
contexts. The string ID='ABC123' is extremely vague as it gives no idea of the
entity being referenced. Is it an invoice id, customer id, or what?
o It also causes a problem with foreign keys.
4. There is no rule that says a primary key must consist of a single attribute - both
simple and composite keys are allowed - so don't waste time creating artificial keys.
5. Avoid the unnecessary use of technical keys. If a table already contains a satisfactory
unique identifier, whether composite or simple, there is no need to create another
one. Although the use of a technical key can be justified in certain circumstances, it
takes intelligence to know when those circumstances are right. The indiscriminate
use of technical keys shows a distinct lack of intelligence. For further views on this
subject please refer to Technical Keys - Their Uses and Abuses.
Foreign Keys
1. It is recommended that where a foreign key is required that you use the same name
as that of the associated primary key on the foreign table. It is a requirement of
arelational join that two relations can only be joined when they share at least one
common attribute, and this should be taken to mean the attribute name(s) as well as
the value(s). Thus where the customer and invoice tables are joined in a parent-
child relationship the following will result:
o The primary key of customer will be customer_id.
2. For MySQL users this means that the shortened version of the join condition may be
used:
o Short: A LEFT JOIN B USING (a,b,c)
Do not waste time using a prefix such as fk to identify foreign key fields
Entity-Relationship Diagram (ERD)
The entity is a person, object, place or event for which data is collected. It is
equivalent to a database table. An entity can be defined by means of its properties,
called attributes. For example, the CUSTOMER entity may have attributes for such
things as name, address and telephone number.
The relationship is the interaction between the entities. It can be described using a
verb such as:
The relating line can be enhanced to indicate cardinality which defines the relationship
between the entities in terms of numbers. An entity may be optional (zero or more) or it
may be mandatory (one or more).
As well as using lines and circles the cardinality can be expressed using numbers, as in:
In order to determine if a particular design is correct here is a simple test that I use:
If the output from step (2) is not the same as the input to step (1) then something is wrong.
If the model allows a situation to exist which is not allowed in the real world then this could
lead to serious problems. The model must be an accurate representation of the real world in
order to be effective. If any ambiguities are allowed to creep in they could have disastrous
consequences.
We have now completed the logical data model, but before we can construct the physical
database there are several steps that must take place:
Assign attributes (properties or values) to all the entities. After all, a table without
any columns will be of little use to anyone.
Refine the model using a process known as 'normalisation'. This ensures that each
attribute is in the right place. During this process it may be necessary to create new
tables and new relationships.
Data Normalisation
Relational database theory, and the principles of normalisation, were first constructed by
people with a strong mathematical background. They wrote about databases using
terminology which was not easily understood outside those mathematical circles. Below is
an attempt to provide understandable explanations.
It follows a set of rules worked out by E F Codd in 1970. A normalised relational database
provides several benefits:
Because the principles of normalisation were first written using the same terminology as
was used to define the relational data model this led some people to think that
normalisation is difficult. Nothing could be more untrue. The principles of normalisation are
simple, common sense ideas that are easy to apply.
o Check relations for dependencies of one non key attribute with another non
key attribute.
o Check for dependencies within each primary key (i.e. dependencies of one
attribute in the key on other attributes within the key).
A table is in first normal form if all the key attributes have been defined and it
contains no repeating groups.
Taking the ORDER entity in figure 7 as an example we could end up with a set of attributes
like this:
ORDER
In order to create a table that is in first normal form we must extract the repeating groups
and place them in a separate table, which I shall call ORDER_LINE.
ORDER
order_id customer_id
123 456
456 789
I have removed 'product1', 'product2' and 'product3', so there are no repeating groups.
ORDER_LINE
order_id product
123 abc1
123 def1
123 ghi1
456 abc2
Each row contains one product for one order, so this allows an order to contain any number
of products.
A table is in second normal form (2NF) if and only if it is in 1NF and every non
key attribute is fully functionally dependent on the whole of the primary key (i.e.
there are no partial dependencies).
1. Anomalies can occur when attributes are dependent on only part of a multi-
attribute (composite) key.
2. A relation is in second normal form when all non-key attributes are dependent on
the whole key. That is, no attribute is dependent on only a part of the key.
3. Any relation having a key with a single attribute is in second normal form.
A table is in third normal form (3NF) if and only if it is in 2NF and every non key
attribute is non transitively dependent on the primary key (i.e. there are
no transitive dependencies).
1. Anomalies can occur when a relation contains one or more transitive dependencies.
2. A relation is in 3NF when it is in 2NF and has no transitive dependencies.
3. A relation is in 3NF when 'All non-key attributes are dependent on the key, the
whole key and nothing but the key'.
You must also note the use of calculated or derived fields. Take the example where a table
contains PRICE, QUANTITY and EXTENDED_PRICE where EXTENDED_PRICE is calculated as
QUANTITY multiplied by PRICE. As one of these values can be calculated from the other two
then it need not be held in the database table. Do not assume that it is safe to drop any one
of the three fields as a difference in the number of decimal places between the various fields
could lead to different results due to rounding errors. For example, take the following fields:
If you were to drop EXCH_RATE could it be calculated back to its original 9 decimal places?
Reaching 3NF is is adequate for most practical needs, but there may be circumstances which
would benefit from further normalisation.
1. Unnormalized table:
Tables should have only two dimensions. Since one student has several classes, these classes
should be listed in a separate table. Fields Class1, Class2, and Class3 in the above records are
indications of design trouble.
Spreadsheets often use the third dimension, but tables should not. Another way to look at
this problem is with a one-to-many relationship, do not put the one side and the many side
in the same table. Instead, create another table in first normal form by eliminating the
repeating group (Class#), as shown below:
Records should not depend on anything other than a table's primary key (a compound key, if
necessary). For example, consider a customer's address in an accounting system. The address is
needed by the Customers table, but also by the Orders, Shipping, Invoices, Accounts Receivable, and
Collections tables. Instead of storing the customer's address as a separate entry in each of these
tables, store it in one place, either in the Customers table or in a separate Addresses table.
Students:
3.
Registration:
Student# Class#
1022 101-07
1022 143-01
1022 159-02
4123 201-01
4123 211-02
4123 214-01
In the last example, Adv-Room (the advisor's office number) is functionally dependent on
the Advisor attribute. The solution is to move that attribute from the Students table to the
Faculty table, as shown below:
Students:
Student# Advisor
1022 John
4123 Simon
5.
Faculty:
Simon 216 42
A table is in Boyce-Codd normal form (BCNF) if and only if it is in 3NF and every
determinant is a candidate key.
1. Anomalies can occur in relations in 3NF if there is a composite key in which part of
that key has a determinant which is not itself a candidate key.
2. This can be expressed as R(A,B,C), C A where:
o The relation contains attributes A, B and C.
3. Anomalies can also occur where a relation contains several candidate keys where:
o The keys contain more than one attribute (they are composite keys).
o An attribute is common to more than one key.
Note that no two buildings on any of the university campuses have the same name, thus
ROOM/BLDG CAMPUS. As the determinant is not a candidate key this table is NOT in
Boyce-Codd normal form.
R2(room/bldg, campus)
(student#, course#)
(student#, c_name)
(s_name, course#) - this assumes that s_name is a unique identifier
(s_name, c_name) - this assumes that c_name is a unique identifier
The relation is in 3NF but not in BCNF because of the following dependencies:
student# s_name
course# c_name
4th Normal Form
A table is in fourth normal form (4NF) if and only if it is in BCNF and contains no
more than one multi-valued dependency.
1. Anomalies can occur in relations in BCNF if there is more than one multi-valued
dependency.
2. If A B and A C but B and C are unrelated, ie A (B,C) is false, then we have
more than one multi-valued dependency.
3. A relation is in 4NF when it is in BCNF and has no more than one multi-valued
dependency.
1 Programming Golf
1 Programming Bowling
1 Analysis Golf
1 Analysis Bowling
2 Analysis Golf
2 Analysis Gardening
2 Management Golf
2 Management Gardening
This table is difficult to maintain since adding a new hobby requires multiple new rows
corresponding to each skill. This problem is created by the pair of multi-valued
dependencies EMPLOYEE# SKILLS and EMPLOYEE# HOBBIES. A much better
alternative would be to decompose INFO into two relations:
skills(employee#, skill)
hobbies(employee#, hobby)
... and there are no pairwise cyclical dependencies in the primary key comprised
of three or more attributes.
Anomalies can occur in relations in 4NF if the primary key has three or more fields.
5NF is based on the concept of join dependence - if a relation cannot be decomposed
any further then it is in 5NF.
Pairwise cyclical dependency means that:
o You always need to know two values (pairwise).
o For any one you must know the other two (cyclical).
This is used to track buyers, what they buy, and from whom they buy.
The question is, what do you do if Claiborne starts to sell Jeans? How many records must
you create to record this fact?
The problem is there are pairwise cyclical dependencies in the primary key. That is, in order
to determine the item you must know the buyer and vendor, and to determine the vendor
you must know the buyer and the item, and finally to know the buyer you must know the
vendor and the item.
The solution is to break this one table into three tables; Buyer-Vendor, Buyer-Item, and
Vendor-Item.
... if every constraint on the table is a logical consequence of the definition of keys
and domains.
This standard was proposed by Ron Fagin in 1981, but interestingly enough he made no
note of multi-valued dependencies, join dependencies, or functional dependencies in his
paper and did not demonstrate how to achieve DKNF. However, he did manage to
demonstrate that DKNF is often impossible to achieve.
If relation R is in DKNF, then it is sufficient to enforce the domain and key constraints for R,
and all constraints on R will be enforced automatically. Enforcing those domain and key
constraints is, of course, very simple (most DBMS products do it already). To be specific,
enforcing domain constraints just means checking that attribute values are always values
from the applicable domain (i.e., values of the right type); enforcing key constraints just
means checking that key values are unique.
Unfortunately lots of relations are not in DKNF in the first place. For example, suppose
there's a constraint on R to the effect that R must contain at least ten tuples. Then that
constraint is certainly not a consequence of the domain and key constraints that apply to R,
and so R is not in DKNF. The sad fact is, not all relations can be reduced to DKNF; nor do we
know the answer to the question "Exactly when can a relation be so reduced?"
Personal Guidelines
Even if you obey all the preceding rules it is still possible to produce a database design that
causes problems during development. I have come across many different implementation
tips and techniques over the years, and some that have worked in one database system have
been successfully carried forward into a new database system. Some tips, on the other hand,
may only be applicable to a particular database system.
For particular options and limitations you must refer to your database manual.
Database Names
7. Do not waste time using a prefix such as db to identify database names. The SQL
syntax analyser has the intelligence to work that out for itself - so should you.
8. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and lower
case).
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
10. Rather than putting all the tables into a single database it may be better to create
separate databases for each logically related set of tables. This may help with
security, archiving, replication, etc.
Table Names
7. Table names should be short and meaningful, such as part, customer and invoice.
o Short, but not too short.
8. Do not waste time using a prefix such as tbl to identify table names. The SQL syntax
analyser has the intelligence to work that out for itself - so should you.
9. Table names should be in the singular (e.g. customer not customers). The fact that
a table may contain multiple entries is irrelevant - any multiplicity can be derived
from the existence of one-to-many relationships.
10. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, It is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
11. If a table name contains more than one word, such as in sales order and purchase
order, decide how to deal with it:
o Separate the words with a single space, as in sales order (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
Again, be consistent.
12. Be careful if the same table name is used in more than one database - it may lead to
confusion.
Field Names
8. Do not waste time using a prefix such as col or fld to identify column/field names.
The SQL syntax analyser has the intelligence to work that out for itself - so should
you.
9. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can
lead to disasters.
10. If a field name contains more than one word, such as in part name and customer
name, decide how to deal with it:
o Separate the words with a single space, as in part name (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
Again, be consistent.
6. It is recommended that the primary key of an entity should be constructed from the
table name with a suffix of _ID. This makes it easy to identify the primary key in a
long list of field names.
7. Do not waste time using a prefix such as pk to identify primary key fields. This has
absolutely no meaning to any database engine or any application.
8. Avoid using generic names for all primary keys. It may seem a clever idea to use the
name ID for every primary key field, but this causes problems:
o It causes the same name to appear on multiple tables with totally different
contexts. The string ID='ABC123' is extremely vague as it gives no idea of the
entity being referenced. Is it an invoice id, customer id, or what?
o It also causes a problem with foreign keys.
9. There is no rule that says a primary key must consist of a single attribute - both
simple and composite keys are allowed - so don't waste time creating artificial keys.
10. Avoid the unnecessary use of technical keys. If a table already contains a satisfactory
unique identifier, whether composite or simple, there is no need to create another
one. Although the use of a technical key can be justified in certain circumstances, it
takes intelligence to know when those circumstances are right. The indiscriminate
use of technical keys shows a distinct lack of intelligence. For further views on this
subject please refer to Technical Keys - Their Uses and Abuses.
Foreign Keys
4. It is recommended that where a foreign key is required that you use the same name
as that of the associated primary key on the foreign table. It is a requirement of
arelational join that two relations can only be joined when they share at least one
common attribute, and this should be taken to mean the attribute name(s) as well as
the value(s). Thus where the customer and invoice tables are joined in a parent-
child relationship the following will result:
o The primary key of customer will be customer_id.
5. For MySQL users this means that the shortened version of the join condition may be
used:
o Short: A LEFT JOIN B USING (a,b,c)
7. Do not waste time using a prefix such as fk to identify foreign key fields. This has
absolutely no meaning to any database engine or any application.
Generating Unique ids
Where a technical primary key is used a mechanism is required that will generate new and
unique values. Such keys are usually numeric, so there are several methods available:
1. Some database engines will maintain a set of sequence numbers for you which can
be referenced using code such as :
If the number just used needs to be retrieved so that it can be passed back to the
application it can be done so with the following:
SELECT <seq_name>.CURRVAL FROM DUAL
I have used this method, but a disadvantage that I have found is that the DBMS has no
knowledge of what primary key is linked to which sequence, so it is possible to insert
a record with a key not obtained from the sequence and thus cause the two to
become unsynchronised. The next time the sequence is used it could therefore
generate a value which already exists as a key and therefore cause an INSERT error.
3. Some database engines will allow you to specify a numeric field as 'auto-increment',
and on an INSERT they will automatically generate the next available number
(provided that no value is provided for that field in the first place). This is better than
the previous method because:
4. While the previous methods have their merits, they both have a common failing in
that they are not-standard extensions to the SQL standard, therefore they are not
available in all SQL-compliant database engines. This becomes an important factor if
it is ever decided to switch to another database engine. A truly portable method
which uses a standard technique and can therefore be used in any SQL-compliant
database is to use an SQL statement similar to the following to obtain a unique key for
a table:
6. table_id = table_id+1
Some people seem to think that this method is inefficient as it requires a full table
search, but they are missing the fact that table_id is a primary key, therefore the
values are held within an index. The SELECT max(...) statement will automatically be
optimised to go straight to the last value in the index, therefore the result is obtained
with almost no overhead. This would not be the case if I used SELECT count(...) as
this would have to physically count the number of entries. Another reason for not
using SELECT count(...) is that if records were to be deleted then record count would
be out of step with the highest current value.
7. The Radicore development framework has separate data access objects for each
DBMS to which it can connect. This means that the different code for dealing with
auto_increment keys can be contained within each object, so is totally transparent to
the application. All that is necessary is that the key be identified as 'auto_increment'
in the Data Dictionary and the database object will take care of all the necessary
processing.
What is SQL
Structured Query Language
Using sql one can store manipulate and retrieve data stored in databases
Division of SQL
important for representing the data. (Of course, the data manipulation language
must be aware of how data is represented, and reflects this in the constructs that
it supports ie
Data Definition
Creating a database:
Create a table:
column1 datatype,
columnname2 datatype,
columnname3 datatype,
...
Select data
SELECT column name(s)
WHERE
WHERE condition(S)
WHERE condition(S)
SQL Functions
table name
Products)
Creating a Schema
Creating a Table:
CREATE TABLE employee (
Date_of_Birth DATE,
Salary NUMBER
) ;
Date_of_Birth DATE,
);
Note: When naming tables, columns and other database objects, do not
include spaces in the names. For example, do not call the last name
column: Last Name
If you wish to separate words in a name, use the underscore character.
Specifying Primary and Foreign keys:
CREATE TABLE order_header (
order_date DATE,
sales_person VARCHAR(25),
bill_to VARCHAR(35),
bill_to_address VARCHAR(45),
bill_to_city VARCHAR(20),
bill_to_state VARCHAR(2),
bill_to_zip VARCHAR(10),
);
quantity NUMBER(4,0),
);
FirstName TEXT,
LastName TEXT,
ssn INTEGER
);
ON worker (ssn) ;
department_location VARCHAR(30)
Date_of_Birth DATE,
Salary NUMBER,
departmentid NUMBER
) ;
Constraints on attributes:
o NOT NULL - Attribute may not take a NULL value
o DEFAULT - Store a given default value wh no value is specified
o PRIMARY KEY - Indicate which attribute(s) form the primary key
o FOREIGN KEY - Indicate which attribute(s) form a foreign key.
This enforces referential integrity
o UNIQUE - Indicates which attribute(s) must have unique values.
Specify when constraint should be enforced:
o Immediate
o Deferrable until commit time
Referential Integrity Constraint: Specify the behavior for child tuples
when a parent tuple is modified.
Action to take if referential integrity is violated:
o SET NULL - Child tuples foreign key is set to NULL - Orphans.
o SET DEFAULT - Set the value of the foreign key to some default
value.
o - Child tuples are updated (or deleted) according to the
CASCADE
action take on the parent tuple.
Examples of ON DELETE and ON UPDATE
CREATE TABLE order_items (
quantity NUMBER(4,0),
ON UPDATE CASCADE,
);
Constraints can also be given names so that they can later be modified
or dropped easily.
CREATE TABLE order_header (
order_date DATE,
sales_person VARCHAR(25),
bill_to VARCHAR(35),
bill_to_address VARCHAR(45),
bill_to_city VARCHAR(20),
bill_to_state VARCHAR(2),
bill_to_zip VARCHAR(10),
CONSTRAINT pk_order_header
);
quantity NUMBER(4,0),
CONSTRAINT pk_order_items
CONSTRAINT fk1_order_items
ON UPDATE CASCADE,
CONSTRAINT fk2_order_items
ON UPDATE CASCADE
);
An even better approach is to create the tables without constraints and
then add them separately with ALTER TABLE statements
CREATE TABLE order_header (
order_date DATE,
sales_person VARCHAR(25),
bill_to VARCHAR(35),
bill_to_address VARCHAR(45),
bill_to_city VARCHAR(20),
bill_to_state VARCHAR(2),
bill_to_zip VARCHAR(10)
);
quantity NUMBER(4,0)
);
CONSTRAINT fk1_order_items
ON UPDATE CASCADE;
CONSTRAINT fk2_order_items
ON UPDATE CASCADE;
Changing Attributes:
ALTER TABLE student ALTER last_name VARCHAR(35);
ALTER TABLE student ALTER gpa DROP DEFAULT
ALTER TABLE student ALTER gpa SET DEFAULT 0.00;
Adding Attributes:
ALTER TABLE student ADD admission DATE;
Removing Attributes (not widely implemented):
ALTER TABLE student DROP home_phone;
DDL is used to create and specify the schema. DML is then used to
manipulate (select, insert, update, delete) data.
General syntax:
INSERT INTO tablename (column1, column2, ... columnX)
Examples:
INSERT INTO student_grades (student_id, test_name, score, grade)
Quotes are placed around the data depending on the Data type and on
the specific RDBMS being used:
Main way of getting data out of tables is with the SELECT statement.
SELECT syntax:
SELECT column1, column2, ... columnN
HAVING condition
FROM employee
FROM employee
SELECT *
FROM employee
ORDER BY 2;
FROM employee
WHERE departmentid = 3
FROM stocks
symbol = "IBM"
ORDER BY close_date
SELECT symbol, close_date, close_price
FROM stocks
FROM employee;
Results:
Expr1000
---------
43166.667
FROM employee;
Results:
AverageSalary
-------------
43166.667
Give the name of the employee with the highest salary in the company:
This is an example of a subquery
FROM employee
WHERE salary =
);
Results:
FROM employee e1
Results:
first_name last_name departmentid salary
Note the two aliases given to the employee table: e1 and e2. These
allow us to refer to different views of the same table.
FROM employee
GROUP BY departmentid
Results:
departmentid EmployeeCount
3 2
1 2
2 1
4 1
FROM employee
Result:
first_init last_name
J Smith
M Green
E Jones
S Smith
H Brown
J Foster
Show the First initial and last name of each employee but make them all
upper case
SELECT UCASE(MID(first_name, 1, 1)) AS first_init,
UCASE(last_name) AS last_name
FROM employee
Result:
first_init last_name
J SMITH
M GREEN
E JONES
S SMITH
H BROWN
J FOSTER
Show the current salary and a proposed new salary for each employee:
SELECT first_name, salary AS CurrentSalary,
FROM employee
Result:
Ed 39500 41080
Show the Date of Birth and the age of each employee in Department 3.
This uses the Now() function which returns the current date and time.
SELECT first_name, date_of_birth,
FROM employee
WHERE departmentid = 3
Result:
Ed 1/9/1971 14901.5790625
Note that the "Age" is given in number of days. A better output would
be to give this in years. So divide answer by 365.
SELECT first_name, date_of_birth,
FROM employee
WHERE departmentid = 3
Result:
Ed 1/9/1971 40.8262484462202
FROM employee
WHERE departmentid = 3
Result:
Ed 1/9/1971 40.8
Show all employees who have a birthday in August. this uses the
MONTH function. Given a date, MONTH(date) returns the month as a
number. Similar functions include DAY and YEAR.
SELECT first_name, last_name
FROM employee
WHERE MONTH(date_of_birth) = 8;
Result:
first_name last_name
Sally Smith
Howard Brown
In the FROM portion, list all tables separated by commas. Called a Join.
The WHERE part becomes the Join Condition
List all of the employees working in New York:
SELECT employee.first_name, employee.last_name
Results:
first_name last_name
Joe Smith
Howard Brown
Mary Green
List each employee name and what state (location) they work in. List
them in order of location and name:
SELECT employee.last_name, department.department_location
Results:
last_name department_location
Jones NJ
Smith NJ
Brown NY
Green NY
Smith NY
Foster PA
List each department and all employees that work there. Show the
department and location even if no employees work there.
SELECT department.departmentid,
department.department_location,
employee.last_name
ON employee.departmentid = department.departmentid
Results:
1 NY Smith
1 NY Brown
2 NY Green
3 NJ Jones
3 NJ Smith
4 PA Foster
5 CA NULL
Results:
MAX(SALARY)
------------
45000
FROM department;
GROUP BY customers.LastName
Results:
LASTNAME SUM(BALANCE)
--------- ------------
Axe $15,000.00
Builder $1,300.00
Jones $1,000.00
Smith $6,000.00
We can also use a Column Alias to change the title of the columns
SELECT customers.LastName, Sum(Balance) AS TotalBalance
GROUP BY customers.LastName
Results:
LASTNAME TotalBalance
--------- ------------
Axe $15,000.00
Builder $1,300.00
Jones $1,000.00
Smith $6,000.00
salary AS CurrentSalary,
FROM employee;
Results:
FROM students
WHERE grade between 91 and 100
Using BETWEEN with dates: Show the employees who were born in the
first 5 months of 1972.
For MS Access:
FROM employee
Result:
Note for Oracle or DB2, enclose the dates in single quotes like so:
FROM employee
Subqueries using = (equals): Find the student with the highest grade.
SELECT name, grade
FROM students
WHERE grade =
);
Result:
FROM employee
FROM employee
WHERE departmentid IN
(SELECT departmentid
FROM department
In the above case, the subquery returns a set of tuples. The IN clause
returns true when a tuple matches a member of the set.
FROM employee
WHERE EXISTS
(SELECT last_name
FROM EMPLOYEE e2
AND EXISTS
(SELECT last_name
FROM EMPLOYEE e3
Results:
Ed Jones 39500
The above query shows all employees names and salaries where there
is at least one person who makes more money (the first exists) and at
least one person who makes less money (second exists).
NOT EXISTS:
SELECT first_name, salary
FROM employee
(SELECT last_name
FROM EMPLOYEE e2
Results:
first_name salary
Sally 55000
Above query shows all employees for whom there does not exist an
employee who is paid less.
)
COUNT(tutors.student_tutorid) AS NumberTutored
GROUP BY s1.name;
We might be interested in the maximum number of people one person
is tutoring:
COUNT(tutors.student_tutorid) AS NumberTutored
GROUP BY s1.name
WHERE NumberTutored =
(SELECT MAX(NumberTutored)
COUNT(tutors.student_tutorid) AS NumberTutored
GROUP BY s1.name
LIKE operator:
Use the LIKE operator to perform a partial string match. Generally, the
% character is used as the wild card although in some DBMS, the *
character is used.
FROM employee
Show all employees whose name contains the letter 'e' and the letter 'n'
in that order:
FROM employee
Show all employees whose name contains the letter 'e' and the letter 'n'
in any order:
FROM employee
GROUP BY department.department_name
Result:
department_name TotalDeptSalary
Finance 88000
IT 94500
Or we can put it another way. Show total of salaries paid by each department
but only if the total is greater than the average paid by each department
GROUP BY department.department_name
Suppose we are interested in the name of the department with the largest
total salary. In this case we need to find the MAX of the SUM
GROUP BY department.department_name
HAVING SUM(salary) =
(SELECT MAX(TotalDeptSalary)
FROM
(SELECT department.department_name,
SUM(salary) AS TotalDeptSalary
GROUP BY department.department_name
WHERE departmentid IN
(SELECT departmentid
FROM department