Data Files and Database Options: Unit One

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 68

UNIT ONE

Data Files and Database Options


Introduction
The configuration choices that you make for a database affect its
performance, scalability, and management. You learn how to design
the file and filegroup storage structures underneath a database. You
learn how to configure database options and recovery models. You will
also learn how to check and manage the integrity of a database.

Data within a database is stored on disk in one or more data files. Prior to
being written to the data file(s), every transaction is written to a
transaction log file. In this lesson, you learn howto design the data files
underneath a database, group the files into filegroups to link
physicalstorage into a database, and manage the transaction log. You also
learn how to configure the tempdb database for optimal performance.

Files and Filegroups

Although storing all your data in memory would provide extremely fast
access, you wouldlose everything after the machine was shut down. To
protect your data, it has to be persistedto disk. Underneath each
database is one or more files for persisting your data.

SQL Server uses two different types of files:-


Data files and
Transaction log files.
Data files are responsible for the long-term storage of all the data within a
database. Transaction log files, are responsible for storing all the
transactions that are executed against a database.

Instead of defining the storage of objects directly to a data file, SQL


Server provides anabstraction layer for more flexibility called a filegroup.
Filegroups are a logical structure,defined within a database, that map a
database and the objects contained within adatabase, to the data files on
disk. Filegroups can contain more than one data file.

All objects that contain data, tables, indexes, and indexed views have an
ON clause thatyou can use to specify when you create an object that allows
you to specify the filegroupwhere SQL Server stores the object. As data
is written to the objects, SQL Server uses the filegroup definition to
determine on which file(s) it should store the data.

At the time that a file is added to a database, you specify the initial size of
the file. You canalso specify a maximum size for the file, as well as whether
SQL Server automatically increases the size of the file when it is full of 1
data. If you specify automatic growth, you can specify whether the file size
increases based on a percentage of the current size or whether the
Unless a file group has only a single file, you do not know in which file a specific row of data is
stored. When writing to files, SQL Server uses a proportional fill algorithm. The proportional
fill algorithm is designed to ensure that all files within a filegroup reach the maximum
definedcapacity at the same time. For example, if you had a data file that was 10 gigabytes (GB) and a
data file that was 1 GB, SQL Server writes ten rows to the 10 GB file for every one row that is written
to the 1 GB file.

The proportional fill algorithm is designed to allow a resize operation to occur at a filegroup
level. In other words, all files within a filegroup expand at the same time.

All data manipulation within SQL Server occurs in memory within a set of buffers. If
you areadding new data to a database, the new data is first written to a memory buffer, then written
to the transaction log, and finally persisted to a data file via a background process calledcheck
pointing. When you modify or delete an existing row, if the row does not already existin memory,
SQL Server first reads the data off disk before making the modification. Similarly ifyou are reading
data that has not yet been loaded into a memory buffer, SQL Server must read it out of the data files
on disk.

If you could always ensure that the machine hosting your databases had enough
memory tohold all the data within your databases, SQL Server could simply read all the data
off disk intomemory buffers upon startup to improve performance. However, databases
are almost alwaysmuch larger than the memory capacity on any machine, so SQL Server retrieves
data from disk only on needed basis. If SQL Server does not have enough room in memory for the
data beingread in, the least recently used buffer pools are erased to make room for newly requested
data.

Because accessing a disk drive is much slower than accessing memory, the data file design
underneath a database can have an impact on performance.

The first layer of design is within the disk subsystem. As the number of disk drives within
a volume increases, the read and write throughput for SQL Server increases. However, thereis
an upper limit on the disk input/output (I/O),which is based upon:-
the capacity of the redundant array of independent disks (RAID) controller,
host bus adapter (HBA),
and disk bus.

So you cannot fix a disk I/O bottleneck by continually adding more disk drives. The
process of designing the disk subsystem is reduced to ensuring that you have enoughdisks
along with appropriately sized controllers and disk caches to deliver the read/writethroughput
required by your database.

If it were simply a matter of the number of disks, there would be far fewer disk
I/Obottlenecks in systems. But there is a second layer of data file design:
determining how many data files you need and
the location of each data file.
SQL Server creates a thread for each file underneath a database. As you increase the number
of files underneath a database, SQL Server creates more threads that can be usedto read and write
data.
2
Designing the data layer of a database begins with the database creation. When youcreate a database,
it should have three files and two filegroups. You should have a file with an .mdf extension within a
filegroup named PRIMARY, a file with an .ndf extension in a filegroupwith any name that you choose,
and the transaction log with an .ldf extension.

Besides being the logical definition for one or more fi les that defines the storage boundary for an
object, filegroups have a property called DEFAULT. The purpose of the DEFAULT property is to
define the filegroup where SQL Server places objects if you do not specify the ON
clauseduring object creation.

When the database is created, the primary filegroup is marked as the default filegroup.After
you create the database, you should mark the second filegroup as the default filegroup. By changing
the default filegroup, you ensure that any objects you create arenot accidentally placed on the
primary filegroup and that only the system objects for thedatabase reside on the primary
filegroup. You change the default filegroup by using thefollowing command:

ALTER DATABASE <database name> MODIFY FILEGROUP <filegroup name> DEFAULT

The main reason not to place any of your objects on the primary filegroup is to provideas
much isolation in the I/O as possible.Following the initial creation of the database, you add
filegroups as needed to separatethe storage of objects within the database. You also add files
to filegroups to increase the diskI/O available to the objects stored on the filegroup,
thereby reducing disk bottlenecks.

Transaction Logs

When SQL Server acknowledges that a transaction has been committed, SQL Server mustensure that
the change is hardened to persistent storage. Although all writes occur throughmemory
buffers, persistence is guaranteed by requiring that all changes are written to thetransaction log
prior to a commit being issued. In addition, the writes to the transaction logmust occur directly to
disk.

Because every change made to a database must be written directly to disk, the disk
storagearchitecture underneath your transaction log is the most important decision affecting
themaximum transaction throughput that you can achieve.

SQL Server writes sequentially to the transaction log but does not read from the log exceptduring a
restart recovery. Because SQL Server randomly reads and writes to the data filesunderneath a
database, by isolating the transaction log to a dedicated set of disks youensurethat the disk heads do
not have to move all over the disk and move in a mostlylinear manner.
The maximum transaction throughput for any database is bound by the amount of dataper second
that SQL Server can write to the transaction log.

3
FILESTREAM data

Although the volume of data within organizations has been exploding, leading the way
inthisdata explosion is unstructured data. To tackle the problem of storing, managing,
andcombiningthe large volumes of unstructured databases with the structured data in
yourdatabases, SQLServer 2008 introduced FILESTREAM.

The FILESTREAM feature allows you to associate files with a database. The files are storedin a folder
on the operating system, but are linked directly into a database where the filescanbe backed
up, restored, full-text-indexed, and combined with other structured data.

As the variety of data withinan organization expands, organizations need to be able


toconsolidate data of all formatswithin a single storage architecture. SQL Server has theability
to store all the various datawithin an organization, the majority of which exist asdocuments,
spreadsheets, and other types of files.

Prior to SQL Server 2008, you had to extract the contents of a file to store it in aVARBINARY(MAX),
VARCHAR(MAX), or NVARCHAR(MAX) data type. However, you werelimited to storing only 2 GB of
data within a large data type. To work around thisrestriction,many organizations stored the filename
within SQL Server and maintained the file on theoperating system. The main issue with storing the
file outside the database isthat it was veryeasy to move, delete, or rename a file without
making a correspondingupdate to the database.

SQL Server 2008 introduces a new property for a column called FILESTREAM.FILESTREAMcombines
the best of both worlds. Binary large objects (BLOBs) stored in aFILESTREAM columnare controlled
and maintained by SQL Server; however, the data resides in a file on the operatingsystem. By storing
the data on the file system outside ofthe database, you are no longerrestricted to the 2-GB limit on
BLOBs. In addition, whenyou back up the database, all the filesare backed up at the same
time, ensuring that the state of each file remains synchronized withthe database.

You apply the FILESTREAM property to columns with a VARBINARY(MAX) data type.
Thecolumn within the table maintains a 16-byte identifier for the file. SQL Server manages theaccess
to the files stored on the operating system.

FILEGROUP designated for FILESTREAM storage is off-line and inaccessible withina Database
Snapshot. In addition, you cannot implement Database Mirroring against adatabase containing data
stored with FILESTREAM.

to store FILESTREAM data within a database, you needto specify where the data will be stored. You
define the location for FILESTREAM data in a database by designating a filegroup within the database
to be used for storage withthe CONTAINS FILESTREAM property. The FILENAME property defined
for a FILESTREAM file group specifies the pathto a folder. The initial part of the folder path
definition must exist;however, the last folder in the path defined cannot exist and is created
automatically. Afterthe FILESTREAM folder has been created, a filestream.hdr file is created in the
folder, which is a system file used tomanage the files subsequently written to the folder.

4
tempdbDatabase

Because the tempdb database is much more heavily used than in previous versions, specialcare
needs to be taken in how you design the storage underneath tempdb.In addition to temporary
objects, SQL Server uses tempdb for worktables used ingrouping/sortingoperations, worktables
to support cursors, the version store supportingsnapshot isolation level, and overflow for table
variables. You can also cause index buildoperations to use space in tempdb.

Due to the potential for heavy write activity, you should move tempdb to a set of disksseparated
from your databases and any backup files. To spread out the disk I/O, you might
consider adding additional files to tempdb.

Database Options

A database has numerous options that control a variety of behaviors. These options arebroken down
into several categories, including the following:
Recovery
Auto options
Change tracking
Access
Parameterization

Recovery Options

The recovery options determine the behavior of the transaction log and how
damagedpagesare handled.

Recovery Models
Every database within a SQL Server instance has a property setting called the
recoverymodel.The recovery model determines the types of backups you can perform against
adatabase.
The recovery models available in SQL Server 2008 are:
Full
Bulk-logged
Simple

THE FULL RECOVERY MODEL

When a database is in the Full recovery model, all changes made, using both
datamanipulation language (DML) and data definition language (DDL), are logged to
thetransaction log. Becauseall changes are recorded in the transaction log, it is possible torecover a
database in the Fullrecovery model to a given point in time so that data loss canbe minimized or
eliminated if youshould need to

5
recover from a disaster. Changes areretained in the transaction log indefinitelyand are removed only
by executing a transactionlog backup.

THE BULK-LOGGED RECOVERY MODEL

Certain operations are designed to manipulate large amounts of data. However, theoverhead
of logging to the transaction log can have a detrimental impact on performance.The Bulk-
logged recovery model allows certain operations to be executed with minimallogging. When a
minimally logged operation is performed, SQL Server does not logeveryrow changed but instead logs
only the extents, thereby reducing the overhead andimproving performance. The operations
that are performed in a minimally logged mannerwith the database set in the Bulk-logged recovery
model are:
BCP
BULK INSERT
SELECT. . .INTO
CREATE INDEX
ALTER INDEX. . .REBUILD
Because the Bulk-logged recovery model does not log every change to the transactionlog,you cannot
recover a database to a point in time, within the interval that a minimallylogged transaction
executed, when the Bulk-logged recovery model was enabled.

THE SIMPLE RECOVERY MODEL

The third recovery model is Simple. A database in the Simple recovery model logsoperations tothe
transaction log exactly as the Full recovery model does. However, eachtime the databasecheckpoint
process executes, the committed portion of the transactionlog is discarded. Adatabase in the Simple
recovery model cannot be recovered to a point intime because it is notpossible to issue a transaction
log backup for a database in the simplerecovery model.

Because the recovery model is a property of a database, you set therecovery model byusing
the ALTER DATABASE command as follows:

ALTER DATABASE database_nameSET RECOVERY { FULL | BULK_LOGGED | SIMPLE }

The backup types available for each recovery model are shown below Backup Types Available for
Each Recovery Model

6
Auto Options

There are five options for a database that enable certain actions to occur automatically:
AUTO_CLOSE
AUTO_SHRINK
AUTO_CREATE_STATISTICS
AUTO_UPDATE_STATISTICS
AUTO_UPDATE_STATISTICS_ASYNCH
Each database within an instance requires a variety of resources, the most significant ofwhich is a set
of memory buffers. Each open database requires several bytes of memoryandany queries against the
database populate the data and query caches. If theAUTO_CLOSEoption is enabled, when the
last connection to a database is closed, SQLServer shuts downthe database and releases all
resources related to the database. When anew connection ismade to the database, SQL Server
starts up the database and beginsallocating resources.

By default, AUTO_CLOSE is disabled. Unless you have severe memory pressure, you shouldnot enable
a database for AUTO_CLOSE. In addition, a database that is frequently accessedshould not be set to
AUTO_CLOSE because it would cause a severe degradation inperformance.This is because you would
never be able to use the data and query cachesadequately.

Data files can be set to grow automatically when additional space is needed. Although
mostoperations to increase space affect the database on a long-term basis, some space increasesare
needed only on a temporary basis. If the AUTO_SHRINK option is enabled, SQL
Serverperiodically checks the space utilization of data and transaction log files. If the spacechecking
algorithm finds a data file that has more than 25 percent free space, the fileautomaticallyshrinks to
reclaim disk space.

Expanding a database file is a very expensive operation. Shrinking a database file is alsoan expensive
operation. Because the growth of database files is so expensive, it isrecommended to leave
the AUTO_SHRINKoption disabled and manually shrink files onlywhen necessary.
Statistics allow the Query Optimizer to build more efficient query plans. If the
AUTO_CREATE_STATSTICS option is enabled, SQL Server automatically creates statistics
aremissing during the optimization that of query processing. Although the
phase
ofstatisticsincurs some overhead, the benefit tocreation
query performance is worth the overheadcost
for SQLServer to create statistics automatically when necessary.

Databases have two options that allow SQL Server to update out-of-date statisticsautomatically. The
AUTO_UPDATE_STATISTICS option updates out-of-date statisticsduringquery optimization. If you
choose to enable AUTO_UPDATE_STATISTICS, a secondoption,AUTO_UPDATE_STATISTICS_ASYNC,
controls whether statistics are updated duringqueryoptimization or if query optimization continues
while the statistics are updatedasynchronously.

7
Access
Access to a database can be controlled through several options.
The status of a database can be explicitly set to ONLINE, OFFLINE, or EMERGENCY. Whena database
is in an ONLINE state, you can perform all operations that would otherwise bepossible. A database
that is in an OFFLINE state is inaccessible. A database in anEMERGENCYstate can be accessed only by
a member of the db_owner role, and the onlycommandallowed to be executed is SELECT.

You can control the ability to modify data for an online database by setting the databaseto either
READ_ONLY or READ_WRITE. A database in READ_ONLY mode cannot be written to.In addition,
when a database is placed in READ_ONLY mode, SQL Server removes anytransaction log file that is
specified for the database. Changing a database fromREAD_ONLY to READ_WRITEcauses SQL
Server to re- create the transaction log file.

User access to a database can be controlled through the SINGLE_USER, RESTRICTED_USER,and


MULTI_USER options.
When a database is in SINGLE_USER mode, only a single user is allowed to access the database.
A database set to RESTRICTED_USER only allows access to members of the db_owner,
dbcreator, and sysadmin roles.

If multiple users are using the database when you change the mode to SINGLE_USER or users that
conflict with the allowed set for RESTRICTED_USER, the ALTER DATABASEcommandis blocked until
all the non-allowed users disconnect.

Instead of waiting for usersto completeoperations and disconnect from the database, you can specify
a ROLLBACKaction to terminateconnections forcibly. The ROLLBACK IMMEDIATE option
forcibly rollsback any open transactions,along with disconnecting any nonallowed users.
ALTER DATABASE <database name> SET RESTRICTED_USER ROLLBACK IMMEDIATE

You can allowusers to complete transactionsand exit the database by using the ROLLBACK AFTER
<number of seconds> option, which waits for the specified number of seconds beforerolling back
transactions and disconnecting users.The normal operational mode for mostdatabases is
ONLINE, READ_WRITE, and MULTI_USER.

Parameterization

One of the “hot button” topics in application development is whether to parameterize callsto
the database. When a database call is parameterized, the values are passed as variables.applications
gaina significant benefit when database calls are parameterized.

SQL Server caches the query plan for every query that is executed. Unless there is pressureon the
query cache that forces a query plan from the cache, every query executed sincetheinstance started is
in the query cache. When a query is executed, SQL Server parses andcompiles the query. The query is
then compared to the query cache using a string-matchingalgorithm. If a match is found, SQL Server
retrieves the plan that has already beengeneratedand executes the query.

8
A query that is parameterized has a much higher probability of being matched becausethe
query string does not change even when the values being used vary. Therefore,parameterized
queries can reuse cached query plans more frequently and avoid the timerequired to build a query
plan.

Because not all applications parameterize calls to the database, you can force SQL Serverto
parameterize every query for a given database by setting the PARAMETERIZATIONFORCED database
option.You could see aperformance boost by forcing parameterization.

Collation Sequences

SQL Server has the capability to store character data that spans every possible written
language.However, not every language follows the same rules for sorting or datacomparisons. SQL
Server allows you to define the rules for comparison, sorting, casesensitivity, and accent sensitivity
through the specification of a collation sequence.

Whenyou install SQL Server, you specify a default collation sequence that is used for alldatabases,
tables, and columns. You can override the default collation sequence at eachlevel.The collation
sequence for an instance can be overridden at a database level byspecifyingthe COLLATE clause in
either the CREATE DATABASE or ALTER DATABASEcommand.

Maintaining Database Integrity

In a perfect world, everything that you save to disk storage would always
write correctly,readcorrectly, and never have any problems. Unfortunately, your SQL Server
databaseslive in animperfect world where things do go wrong. Although this occurs very
rarely, datawithin yourdatabase can become corrupted if there is a failure in the disk storage
systemas SQL Server iswriting to a page. Data pages are 8 kilobytes (KB) in size, but SQL
Serverdivides a page into16 blocks of 512 bytes apiece when performing write operations.

If SQLServer begins writingblocks on a page and the disk system fails in the middle of the
writeprocess, only a portionof the page is written successfully, producing a problem called atorn
page.

Database Integrity Checks

Databases have an option called PAGE_VERIFY. The page verification can be set to
either TORN_PAGE_DETECTION or CHECKSUM. The PAGE_VERIFYTORN_PAGE_DETECTION option
exists
for backwards compatibility and should not be used.

When the PAGE_VERIFYCHECKSUM option is enabled, SQL Server calculates a checksum forthe page
prior to thewrite. Each time a page is read off disk, a checksum is recalculated andcompared
to thechecksum written to the page. If the checksums do not match, the page hasbeen
corrupted.When SQL Server encounters a corrupt page, an error is thrown, the command
attemptingto access the corrupt page is aborted, and an entry is written into the suspect_pages table
in the msdb database.
9
You should enable the PAGE_VERIFY CHECKSUM option on every production database.
Although page verification can detect and log corrupted pages, the page must be readoff disk
to trigger the verification check. Data is normally read off disk when users andapplications access
data, but instead of having a user receive an error message, it is muchbetter for you to proactively
find corruption and fix the problem by using a backup beforethe user has a process aborted.

You can force SQL Server to read every page from disk and check the integrity by executingthe DBCC
CHECKDB command. The Syntax is:-

DBCC CHECKDB ('databasename') WITH NO_INFOMSGS, ALL_ERRORMSGS

When DBCC CHECKDB is executed, SQL Server performs all the following actions:
Checks page allocation within the database
Checks the structural integrity of all tables and indexed views
Calculates a checksum for every data and index page to compare against the storedchecksum
Validates the contents of every indexed view
Checks the database catalog
Validates Service Broker data within the database
To accomplish these checks, DBCC CHECKDB executes the following commands:
DBCC CHECKALLOC, to check the page allocation of the database
DBCC CHECKCATALOG, to check the database catalog
DBCC CHECKTABLE, for each table and view in the database to check the structural integrity Any
errors encountered are output so that you can fix the problems. If an integrity error isfound in an
index, you should drop and re-create the index. If an integrity error is found inatable, you need to use
your most recent backups to repair the damaged pages.

If the database is participating in Database Mirroring, SQL Server attempts to retrieve acopy of the
page from the mirror. If the page can be retrieved from the mirror and hasthe correct page contents,
the page is replaced automatically on the principal withoutrequiringany intervention. When
SQL Server replaces a corrupt page from the mirror, an entry is written into
the sys.dm_db_mirroring_auto_page_repair view.

10
UNIT-TWO

Transactions Processing
What is a transaction?
A Transaction is a mechanism for applying the desired modifications/operations to a
database. It is evident in real life that the final database instance after a successful
manipulation of the content of the database is the most up-to-date copy of the database.
Action, or series of actions, carried out by a single user or application program, which
accesses or changes contents of database. (i.e. Logical unit of work on the database.)
A transaction could be a whole program, part/module of a program or a single command.
Changes made in real time to a database are called transactions. Examples include ATM
transactions, credit card approvals, flight reservations, hotel check-in, phone calls,
supermarket canning, academic registration and billing.
A transaction could be composed of one or more database and non-database operations.
Transforms database from one consistent state to another, although consistency may be
violated during transaction.
A database transaction is a unit of interaction with database management system or similar
system that is treated in a coherent and reliable way independent of other transactions.
Transaction processing system
A system that manages transactions and controls their access to a DBMS is called a TP
monitor. A transaction processing system (TPS) generally consists of a TP monitor, one or
more DBMSs, and a set of application programs containing transaction.
In database field, a transaction is a group of logical operations that must all succeed or fail as a
group. Systems dedicated to supporting such operations are known as transaction
processing systems.

In comparison with database transaction, application program is series of transactions with non-
database processing in between.

Distinction between business transactions and online transactions:


 A business transaction is an interaction in the real world, usually between an
enterprise and a person, where something is exchanged.
 An online transaction is the execution of a program that performs an administrative
or real-time function, often by accessing shared data sources, usually on behalf of an
online user (although some transactions are run offline in batch).

What we are interested about is the online transaction, which is the interaction between the users
of a database system and the shared data stored in the database. This transaction program
contains the steps involved in the business transaction.

11
Transactions can be started, attempted, then committed or aborted via data manipulation
commands of SQL.

Can have one of two outcomes for any transaction:


Success - transaction commits and database reaches a new consistent state
 Committed transaction cannot be aborted or rolled back.
 How do you discard a committed transaction?
Failure - transaction aborts, and database must be restored to consistent state before it
started.
 Such a transaction is rolled back or undone.
 Aborted transaction that is rolled back can be restarted later.
In many data manipulation languages, there are keywords used to indicate different states of a
transaction.

A single transaction might require several queries, each reading and/or writing information in the
database. When this happens it is usually important to be sure that the database is not left with
only some of the queries carried out. For example, when doing a money transfer, if the money was
debited from one account, it is important that it also be credited to the depositing account. Also,
transactions should not interfere with each other.

A transaction is expected to exhibit some basic features or properties to be considered as a valid


transaction. These features are:

A: Atomicity C: Consistency I: Isolation

D: Durability

It is referred as ACID property of transaction. Without the ACID property, the integrity of
the database cannot be guaranteed.

Atomicity
Is All or None property

Every transaction should be considered as an atomic process which cannot be sub divided into small tasks.
Due to this property, just like an atom which exists or does not exist, a transaction has only two states.
Done or Never Started.

12
Done - a transaction must complete successfully and its effect should be visible in
the database.

Never Started - If a transaction fails during execution then all its modifications must
be undone to bring back the database to the last consistent state, i.e., remove the
effect of failed transaction.

No state between Done and Never Started

Consistency
If the transaction code is correct then a transaction, at the end of its execution, must
leave the database consistent. A transaction should transform a database from
one previous consistent state to another consistent state.

Isolation
A transaction must execute without interference from other concurrent transactions and its
intermediate or partial modifications to data must not be visible to other
transactions.

Durability
The effect of a completed transaction must persist in the database, i.e., its updates
must be available to other transaction immediately after the end of its execution,
and it should not be affected due to failures after the completion of the transaction.

In practice, these properties are often relaxed somewhat to provide better


performance.

State of a Transaction
A transaction is an atomic operation from the users’ perspective. But it has a collection of
operations and it can have a number of states during its execution.

A transaction can end in three possible ways.

• Successful Termination: when a transaction completes the execution of all


operations in it and reaches the COMMIT command.
• Suicidal Termination: when the transaction detects an error during its processing
and decide to abrupt itself before the end of the transaction and perform a ROLL
BACK
• Murderous Termination: When the DBMS or the system force the execution to abort
for any reason.

13
Start Ok to Commit
Commit Commit
Database
Modified

No Error

System Detects Error

Modify Abort End of


Transaction

Error Detected System


by Transaction Initiated Consistent
Consistent State
State

Error Transaction RollBack


Database
Initiated unmodified

Most SQL statements seem to be very short and easy to execute. But the reverse is true if you consider it as
a one command transaction. Actually a database system interprets a transaction not as an
application program but as a logical sequence of low- level operations read and write (referred to as
primitives).

Ways of Transaction Execution


In a database system many transactions are executed. Basically there are two ways of executing a set of
transactions:

(a) Serially: Serial Execution: In a serial execution transactions are executed strictly serially. Thus,
Transaction Ti completes and writes its results to the database then only the next transaction Tj is
scheduled for execution. This means at one time there is only one transaction that is being executed in the
system. The data is not shared between transactions at one specific time.

14
In Serial transaction execution, one transaction being executed does not interferethe execution of any other
transaction.

Good things about serial execution

Correct execution, i.e., if the input is correct then output will be correct.
Fast execution, since all the resources are available to the active. The worst thing about serial execution is
very inefficient resource utilization.

Example of a serial execution, Suppose data items X = 10, Y = 6, and N =1 and T1 and T2 are transactions.

T1 T2

read (X) read (X)

X := X+N X := X+N

write (X) write (X)

read (Y)

Y := Y+N

write (Y)

We execute this transaction serially as follows:

Time T1 T2

read (X) {X = 10}

X := X+N {X = 11}

write (X) {X = 11}

read (Y) {Y = 6}

Y := Y+N {Y = 7}

write (Y) {Y = 7}

read (X) {X = 11}

X := X+N {X = 12}

write (X)

15
Final values of X, Y at the end of T1 and T2: X = 12 and Y = 7.

Thus we can witness that in serial execution of transaction, if we have two transactions Ti and Ti+1, then
Ti+1 will only be executed after the completion of Ti.

(b) Concurrently: is the reverse of serially executable transactions, in this scheme the individual
operations of transactions, i.e., reads and writes are interleaved in some order.

Time T1 T2

read (X) {X = 10}

read (X) {X = 10}

X := X+N {X = 11}

X := X+N {X = 11}

write (X) {X = 11}

write (X)

read (Y) {Y = 6}

Y := Y+N {Y = 7}

write (Y) {Y = 7}

Final values at the end of T1 and T2 : X = 11, and Y = 7. This improves resource
utilization, unfortunately gives incorrect result.

The correct value of X is 12 but in concurrent execution X =11, which is incorrect.


The reason for this error is incorrect sharing of X by T1 and T2.

In serial execution T2 read the value of X written by T1 (i.e., 11) but in concurrent
execution T2 read the same value of X (i.e., 10) as T1 did and the update made by T1 was
overwritten by T2’s update.

This is the reason the final value of X is one less than what is produced by serial execution.

16
Problems Associated with Concurrent Transaction Processing
Although two transactions may be correct in themselves, interleaving of operations
may produce an incorrect result which needs control over access.

Having a concurrent transaction processing, one can enhance the throughput of


the system. As reading and writing is performed from and on secondary storage, the
system will not be idle during these operations if there is a concurrent processing.

Every transaction should be correct by themselves, but this would not guarantee that
the interleaving of these transactions will produce a correct result.

The three potential problems caused by concurrency are:

Lost Update Problem


Uncommitted Dependency Problem
Inconsistent Analysis Problem.
Lost Update Problem

Successfully completed update on a data set by one transaction is overridden by


another transaction/user.

• Account with balance A=100.

 T1 reads the account A


 T1 withdraws 10 from A
 T1 makes the update in the Database
 T2 reads the account A
 T2 adds 100 on A
 T2 makes the update in the Database
In the above case, if done one after the other (serially) then we have no problem.
 If the execution is T1 followed by T2 then A=190
 If the execution is T2 followed by T1 then A=190
But if they start at the same time in the following sequence:
T1 reads the account A=100
T1 withdraws 10 making the balance A=90
T2 reads the account A=100
T2 adds 100 making A=200
T1 makes the update in the Database A=90
T2 makes the update in the Database A=200

17
 After the successful completion of the operation in this schedule, the final value of A
will be 200 which override the update made by the first transaction that changed the
value from 100 to 90.

Uncommitted Dependency Problem

Occurs when one transaction can see intermediate results of another transaction before
it is committed.

E.g.

 T2 increases 100 making it 200 but then aborts the transaction before it
is committed. T1 gets 200, subtracts 10 and make it 190. But the actual balance
should be 90

Inconsistent Analysis Problem

Occurs when transaction reads several values but second transaction updates
some of them during execution and before the completion of the first.

E.g.

 T2 would like to add the values of A=10, B=20 and C=30. after the values are
read by T2 and before its completion, T1 updates the value of B to be 50. at the
end of the execution of the two transactions T2 will come up with the sum of
60 while it should be 90 since B is updated to 50.

As discussed above, the objective of Concurrency Control Protocol is to


schedule transactions in such a way as to avoid any interference between
them. This demands a new principle in transaction processing, which is
serializability of the schedule of execution of multiple transactions.

18
UNIT – THREE

Concurrency Control Techniques


Concurrency Control is the process of managing simultaneous
operations on the database without having them interfere with one
another. Prevents interference when two or more users are accessing
database simultaneously and at least one is updating data. Although
two transactions may be correct in themselves, interleaving of
operations may produce an incorrect result.

Three basic concurrency control techniques:

•Locking methods
•Time stamping
•Optimistic
Locking and Time stamping are pessimistic approaches since they delay
transactions.

Both Locking and Time stamping are conservative approaches: delay


transactions in case they conflict with other transactions.

The optimistic approach allows us to proceed and check conflicts at the

end. Optimistic methods assume conflict is rare and only check for

conflicts at commit.
Locking Method

A LOCK is a mechanism for enforcing limits on access to a resource


in an environment where there are many threads of execution. Locks
are one way of enforcing concurrency control policies. Transaction uses
locks to deny access to other transactions and so prevent incorrect
updates.

Lock prevents another transaction from modifying item or even


reading it, in the case of a write lock.

Lock (X): If a transaction T1 applies Lock on data item X, then X is


locked and it is not available to any other transaction.
19
Unlock (X): T1 Unlocks X. X is available to other transactions.
Types of a Lock
Shared lock:A Read operation does not change the value of a data item. Hence a
data item can be read by two different transactions simultaneously under share lock
mode. So only to read a data item T1 will do: Share lock (X), then Read (X), and finally Unlock
(X).

Exclusive lock:A write operation changes the value of the data item. Hence two
write operations from two different transactions or a write from T1 and a read from
T2 are not allowed. A data item can be modified only under Exclusive lock. To modify a
data item T1 will do: Exclusive lock (X), then Write (X) and finally Unlock (X).

When these locks are applied, then a transaction must behave in a special way.
This special behavior of a transaction is referred to as well-formed.

Well-formed:A transaction is well- formed if it does not lock a locked data item and it
does not try to unlock an unlocked data item.

Locking - Basic Rules

If transaction has shared lock on item, can read but not update item.
If transaction has exclusive lock on item, can both read and update item.
Reads cannot conflict, so more than one transaction can hold shared locks
simultaneously on same item.
Exclusive lock gives transaction exclusive access to that item.
Some systems allow transaction to upgrade a shared lock to an exclusive lock, or
vice-versa.
Examples: T1 and T2 are two transactions. They are executed under locking as follows. T1
locks Ain exclusive mode. When T2 wants to lock A, it finds it locked by T1 so T2 waits for
Unlock on A by T1. When A is released then T2 locks A and begins execution.

Suppose a lock on a data item is applied, the data item is processed and it is unlocked
immediately after reading/writing is completed as follows.

Initial values of A = 10 and B = 20.

20
Serial Execution of T1 and then T2 Concurrent Execution of T1 and T2

T1 T2 T1 T2

Lock (A) Lock (A)

read (A) {A = 10} read (A) {A = 10}

A := A + 100 A := A + 100

write (A) (A = 110} write (A) (A = 110}

Unlock (A) Unlock (A)

Lock (B) Lock (B)

read (B) {B = 20} read (B) {B = 20}

B := B + 10 B := B * 5

write (B) {B =30} write (B) {B = 100}

Unlock (B) Unlock (B)

Lock (B) Lock (B)

read (B) {B = 30} read (B) {B = 100}

B := B * 5 B := B + 10

write (B) {B = 150} write (B) {B = 110}

Unlock (B) Unlock (B)

Lock (A) Lock (A)

Read (A) {A = 110} Read (A) {A = 110}

A := A + 20 A := A + 20

Write (A) {A = 130} Write (A) {A = 130}

Unlock (A) Unlock (A)

Final Result: A=130 B=150 Final Result: A=130 B=110


The final result of the two transactions using the two types of transaction
execution (serial and concurrent) is not the same. This indicates that the above
method of locking and unlocking is not correct. Thus, to preserve consistency we
have to use another approach to locking, two-phase lockingscheme.
Two-Phase Locking (2PL)

A transaction follows 2PL protocol if all locking operations precede the first
unlock operation in the transaction.

The 2PL protocol demands locking and unlocking of a transaction to have two
phases.

Growing phase - acquires all locks but cannot release any locks.
Shrinking phase - releases locks but cannot acquire any new locks.

Locking methods: problems

Deadlock: A deadlock that may result when two (or more) transactions are
each waiting for locks held by the other to be released.

Deadlock - possible solutions

Only one way to break deadlock: abort one or more of the transactions in the
deadlock.

Deadlock should be transparent to user, so DBMS should restart transaction(s).

Two general techniques for handling deadlock:


Deadlock prevention.
Deadlock detection and recovery.

Timeout

The deadlock detection could be done using the technique of TIMEOUT. Every
transaction will be given a time to wait in case of deadlock. If a transaction waits
for the predefined period of time in idle mode, the DBMS will assume that
deadlock occurred and it will abort and restart the transaction.
22
Time-stamping Method

Timestamp: a unique identifier created by DBMS that indicates relative starting


time of a transaction.

Can be generated by:

using system clock at time transaction started, or


Incrementing a logical counter every time a new transaction starts.
Time-stamping Aconcurrency control protocol that orders transactions in such
a way that older transactions, transactions with smaller time stamps, get priority
in the event of conflict.

Transactions ordered globally basedon their timestamp so that older


transactions, transactions with earlier timestamps, get priority in the event of
conflict.
Conflict is resolved by rolling back and restarting transaction.
Since there is no need to use lock there will be No Deadlock.
In timestamp ordering, the schedule is equivalent to the particular serial order
that corresponds to the order of the transaction timestamps.

To implement this scheme, every transaction will be given a timestamp


which is a unique identifier of a transaction.

If Ti came to processing prior to Tj then TS of Tj will be larger than TS of Ti.

Again each data item will have a timestamp for Read and Write.

WTS(A) which denotes the largest timestamp of any transaction that


successfully executed Write(A)
RTS(A) which denotes the largest timestamp of any transaction that
successfully executed Read(A)
These timestamps are updated whenever a new Read(A) or Write(A)
instruction is executed.

Read/write proceeds only if last update on that data item was carried out by an
older transaction. Otherwise, transaction requesting read/write is restarted and
given a new timestamp.

23
The timestamp ordering protocol ensures that any conflicting read and
write operations are executed in the timestamp order.

The protocol manages concurrent execution such that the time-stamps


determine the seria lizability order.

Rules for permitting execution of operations in Time-stamping Method

Suppose that Transaction Ti issues Read(A)

If TS(Ti) < WTS(A): this implies that Ti needs to read a value of A which
was already overwritten. Hence the read operation must be rejected and Ti
is rolled back.
If TS(Ti) >= WTS(A): then the read is executed and RTS(A) is set to the
maximum of RTS(A) and TS(Ti).
Suppose that Transaction Ti issues Write(A)

If TS(Ti) < RTS(A): then this implies that the value of A that Ti is
producing was previously needed and it was assumed that it would never
be produced. Hence, the Write operation must be rejcted and Ti is rolled
back.
If TS(Ti) < WTS(A): then this implies that Ti is attempting to Write an
object value of A. hence, this write operation can be ignored.
Otherwise the Write operation is executed and WTS(A) is set to the
maximum of WTS(A) or TS(Ti).
A transaction that is rolled back due to conflict will be restarted and be
given a new timestamp.

Problem with timestamp-ordering protocol:

 Suppose Ti aborts, but Tj has read a data item written by Ti


 Then Tjmust abort; if Tjhad been allowed to commit earlier, the
schedule is not recoverable.
 Further, any transaction that has read a data item written by Tj
must also abort
 This can lead to cascading rollback…

24
Cascading Rollback
Whenever some transaction T tries to issue a Read_Item(X) or a
Write_Item(X) operation, the basic timestamp ordering algorithm compares the
timestamp of T with the read timestamp and the write timestamp of X to
ensure that the timestamp order of execution of the transactions is not violated.
If the timestamp order is violated by the operation, then transaction T will violate
the equivalent serial schedule, so T is aborted. Then T is resubmitted to the
system as a new transaction with new timestamp. If T is aborted and rolled
back, any transaction Ti that may have used a value written by T must also be
rolled back. Similarly, any transaction Tj that may have used a value written by Ti
must also be rolled back, and so on. This effect is known as cascading rollback.

Optimistic Technique

Locking and assigning and checking timestamp values may be


unnecessary for some transactions
Assumes that conflict is rare.
When transaction reaches the level of executing commit, a check is
performed to determine whether conflict has occurred. If there is a
conflict, transaction is rolled back and restarted.
Based on assumption that conflict is rare and more efficient to let
transactions proceed without delays to ensure serializability.
At commit, check is made to determine
whether conflict has occurred.
If there is a conflict, transaction must be rolled back and restarted.
Potentially allows greater concurrency than traditional protocols.
Three phases:
• Read
• Validation
3. Write

•Optimistic Techniques - Read Phase


 Extends from start until immediately before commit.
 Transaction reads values from database and stores them
in local variables. Updates are applied to a local copy of the
data.
25
2. Optimistic Techniques - Validation Phase
 Follows the read phase.
2. For read-only transaction, checks that data read are still current values. If no
interference, transaction is committed, else aborted and restarted.
 For update transaction, checks transaction leaves database in a consistent
state, with serializability maintained.

• Optimistic Techniques - Write Phase


2. Follows successful validation phase for update transactions.
 Updates made to local copy are applied to the database.

Granularity of data items

Granularityis the size of the data items chosen as the unit of protection by a
concurrency control protocol.

It could be:

 The entire database


 A file
2. A page (a section of physical disk in which relations are stored)
3. A record
 A field value of a record

The granularity has effect on the performance of the system. As locking


will prevent access to the data, the size of the data required to be locked will
prevent other transactions from having access. If the entire database is
locked, then consistency will be highly maintained but less performance of
the system will be witnessed. If a single data item is locked;
consistency maybe at risk but concurrent processing and performance
will be enhanced. Thus, as one go from the entire database to a single value,
performance and concurrent processing will be enhanced but consistency
will be at risk and needs good concurrency controlmechanism and
strategy.

26
Transaction Subsystem

Transaction manager
Scheduler
Recovery manager
Buffer manager
Transaction Manager in DBMS coordinates transactions on behalf of
application programs. Communicates with scheduler (lock manager)
which is responsible for implementing a strategy for concurrency control.
The Scheduler in the DBMS ensures that the individual steps of different
transactions preserve consistency.
Recovery Manager: Ensures that database is restored to the original state
incase failure occurs.
Buffer Manager: responsible for transfer of data between disk storage and main
memory.

27
UNIT FOUR

Database Security and authorization


Introduction
Designing a solid security system requires implementation of a
layered approach.This process is called “defense in depth.” This
chapter explains how to configure eachlayer within the Microsoft SQL
Server security infrastructure to help prevent unauthorizedaccess to
data.

TCP Endpoints
Endpoints control the capability to connect to an instance of SQL
Server as well as to dictate the communications methods that are
acceptable. Acting very similar to firewalls onthenetwork, endpoints
are a layer of security at the border between applications and
yourSQLServer instance.

Endpoint Types and Payloads


An endpoint has two basic parts: a transport and a payload.
Endpoints can be of two differenttransports: TCP and HTTP.
Endpoints also have a payload that defines the basiccategory of
traffic that is allowed and have the values of SOAP, TSQL,
SERVICE_BROKER,and DATABASE_MIRRORING.

By combining an endpoint transport and payload, SQL Server can filter acceptable
trafficbefore a command even reaches the SQL Server instance. For example, suppose you
have an endpoint defined as TCP with a payload of TSQL. If any application attempted to
sendHTTP,SERVICE_BROKER, or DATABASE_MIRRORING traffic through the endpoint, the
connection would be denied without needing to authenticate the request.

This process is very similar to the way firewalls work on a network.


Network administrators configure firewalls to allow traffic on only a specific set of
TCP andUDP ports. Any request attempting to use a port that is blocked is
rejected at the
28
firewall.Endpoints act in the same manner by rejecting requests that are not
properlyformattedbased on the endpoint definition.

Endpoint Access
Even if traffic going to the endpoint matches the correct transport and
payload,
aconnectionis still not allowed unless access has been granted on the
endpoint. Endpointaccess hastwo layers. The first layer of access security is
determined by the endpoint state.An endpoint can haveone of three states:
STARTED, STOPPED, and DISABLED. The threestates of an endpoint reactas follows:
STARTED The endpoint is actively listening for connections and will reply to an
application.
STOPPED The endpoint is actively listening but returns a connection error to
anapplication.
DISABLED The endpoint does not listen and does not respond to any connection thatis
attempted.

The second layer of security is permission to connect to the endpoint. An applicationmust


have a login created in SQL Server that has the CONNECT permission granted on
theendpoint before the connection is allowed through the endpoint.

You might be wondering about all the effort involved just to create a connection to
aninstance of SQL Server before the user is even authenticated. In prior versions of
SQLServer,any application could connect to a server running SQL Server and transmit any
typeofrequest. No attempt was made to ensure that applications had to transmit validly
formedrequests, so hacking into a server running SQL Server was much easier to
accomplish.

SQL Server 2008 ensures that only valid requests can be submitted by a valid user before
arequest is scheduled within the engine. Administrators also have a master switch
to shutoffaccess immediately if they feel someone is attempting to compromise
their serverrunningSQL Server by setting the state of the endpoint being used to
DISABLED.

You can create Transmission Control Protocol (TCP) endpoints with three different
payloads: TSQL, DATABASE_MIRRORING, and SERVICE_BROKER.
TCP Protocol Arguments
You can configure TCP endpoints to listen on specific Internet Protocol (IP)
addresses
andport numbers. The two arguments you can specify that are universal for all TCP
endpointsarethe following:
29
LISTENER_PORT
LISTENER_IP
LISTENER_PORT is required. The TCP for TSQL endpoint that is created for
each instanceduring installation is already configured for TCP port 1433 or the alternate
port number for the instance.

The LISTENER_IP argument is an optional argument that can provide a very


powerful security layer for some types of applications. You can specify a specific IP
address for theendpoint to listen on. The default setting is ALL, which means that the
endpoint listens forconnections sent to any valid IP address configured on the machine.

However, if you want to limit connection requests to a specific network interface


card (NIC), you can specify a LISTENER_IP argument. When you specify an IP
address, the endpoint listens for requests sent only to the IP address specified.

Database Mirroring and Service Broker Common Arguments

Database Mirroring and Service Broker endpoints provide options to


specify theauthentication method and the encryption setting. You can use
either MicrosoftWindows–based authentication or certificates. You specify
Windows-based authentication by selecting the NTLM, KERBEROS, or NEGOTIATE option.

The NEGOTIATE option causes the instances to select the authentication


method dynamically. You can set up certificate-based authentication by using a certificate
from a trusted authority or by generating your own Windows certificate.

SQL Server can encrypt all communications between endpoints, and you can specify
whichencryption algorithm to use for the communications. The default algorithm is RC4,
but you can specify the much stronger Advanced Encryption Standard (AES) algorithm.

Database Mirroring–Specific Arguments


Database Mirroring endpoints include a third argument related to the role within
the Database Mirroring session.

You can specify that an endpoint is a PARTNER, WITNESS, or ALL.


An endpoint specified as PARTNER can participate only as the principal or as the mirror.
An endpoint specified as WITNESS can participate only as a witness.
An endpoint specified as ALL can function in any role.

The following T-SQL example shows how to create a Database Mirroring endpoint:

30
CREATE ENDPOINT [Mirroring] AS TCP (LISTENER_PORT = 5022)
FOR DATA_MIRRORING (ROLE = PARTNER, ENCRYPTION = REQUIRED); ALTER
ENDPOINT [Mirroring] STATE = STARTED;

Service Broker–Specific Arguments


In addition to authentication modes and encryption, Service Broker
endpoints implementarguments related to message forwarding.

The MESSAGE_FORWARDING option enables messages destined for a


different brokerinstance to be forwarded to a specified forwarding address.
The options are ENABLED and DISABLED. If the MESSAGE_FORWARDING option
is set to ENABLED, you can also specify the MESSAGE_FORWARDING_SIZE, which
specifies the maximum amount of storage to allocate for forwarded messages.

Service Broker instances process messages by executing stored procedures to


perform work in an asynchronous manner. Each Service Broker instance is
configured to process messages of a particular format.

However, it ispossible to have many Service Broker instances configured


in an environment, each of which processes different types of messages. By
employing message forwarding, administrators can balance the load on Service
Broker instances more easily, without requiring changes to applications.

you can run the following to gather information about end points configured
in your system
SELECT * FROM sys.endpoints
SELECT * FROM sys.tcp_endpoints SELECT * FROM sys.http_endpoints
SELECT * FROM sys.database_mirroring_endpoints SELECT * FROM
sys.service_broker_endpoints

Configuring the SQL Server Surface Area

Given enough time, anyone can eventually beat any security implementation. The
purpose of security is to provide enough barriers such that the effort
required to break into a system exceeds the benefit received. In this lesson,
you learn how to configure your instances to expose the minimum number of
attack points possible by minimizing thefeature set that isenabled within SQL
Server.

One of the most frequent ways that attackers gain access to a system is through
features that have been enabled but are rarely used. SQL Server now disables
every feature not required for the operation of the database engine.

31
At the time of installation, you can decide to force users to authenticate to an instance
using only Windows credentials. If the authentication mode for your instance is set to
Windows only, you have disabled users’ ability to use SQL Server native logins.

The biggest potential risk to an instance is through the use of features that expose
an external interface or ad hoc execution capability. The two features with the greatest
risk are OPENROWSET/OPENDATASOURCE and OLE Automation procedures.

You enable and disable SQL Server features by using sp_configure. The features that
youshould have disabled unless you need the specific functionality are the following:
Ad Hoc Distributed Queries
CLR Enabled
Cross Database Ownership Chaining (CDOC)
Database Mail
External Key Management
Filestream Access Level
OLE Automation Procedures
Remote Admin Connections
SQL Mail extended stored procedures (XPs)
xp_cmdshell
Execute the following code to turn on the ability to view all the configuration options for
an instance:
EXEC sp_configure 'show advanced options',1 GO
RECONFIGURE WITH OVERRIDE GO
EXEC sp_configure GO

Execute the following code to turn off ad hoc distributed queries, CDOC, CLR,
OLEautomation procedures, SQL Mail XPs, and xp_cmdshell:

EXEC sp_configure 'Ad Hoc Distributed Queries',0 EXEC sp_configure 'clr enabled',0
EXEC sp_configure 'cross db ownership chaining',0
EXEC sp_configure 'Ole Automation Procedures',0 EXEC sp_configure 'SQL Mail XPs',0
EXEC sp_configure 'xp_cmdshell',0 GO
RECONFIGURE WITH OVERRIDE
GO

Creating Principals

Principals are the means by which you authenticate and are identified within an instance
or database. Principals are broken down into two major categories: logins/users and
groups that exist at both an instance and database level.

32
Logins
To gain access to an instance, a user has to authenticate by supplying credentials
for SQLServer to validate. You create logins for an instance to allow a user to
authenticate. Logins with in SQL Server can be five different types:

Standard SQL Server login


Windows login
Windows group
Certificate
Asymmetric key

A standard SQL Server login is created by a database administrator (DBA) and


configured with a name and password which must be supplied by a user to authenticate
successfully. The login is stored inside the master database and assigned a local security
identifier (SID) with in SQL Server.

SQL Server login can also be mapped to either a Windows login or a Windows group.
When adding a Windows login or Windows group, SQL Server stores the name of the
login or group along with the corresponding Windows SID.

login mapped to a certificate or asymmetric key does not provide a means to authenticate
to the instance. Logins mapped to certificates and asymmetric keys are used internally as
asecurity container.

The generic syntax for creating a login is:

CREATE LOGIN loginName{ WITH<option_list1> | FROM <sources> }


<option_list1> ::=
PASSWORD = { 'password' | hashed_password HASHED } [ MUST_CHANGE ] [ ,<option_list2>
[ ,... ] ]
<option_list2> ::= SID = sid
| DEFAULT_DATABASE = database
| DEFAULT_LANGUAGE = language
| CHECK_EXPIRATION = { ON | OFF}
| CHECK_POLICY = { ON | OFF}
| CREDENTIAL = credential_name
<sources> ::=
WINDOWS [ WITH<windows_options> [ ,... ] ]
| CERTIFICATE certname
| ASYMMETRIC KEY asym_key_name
<windows_options> ::= DEFAULT_DATABASE = database
| DEFAULT_LANGUAGE = language

33
When the CHECK_POLICY option (the default and recommended setting) is enabled,
SQLServer 2008 enforces the Windows password policy settings when you create a
SQL Serverlogin.

CHECK_EXPIRATION is used to prevent brute force attacks against a login. When it


is enabled, each time the login is used to authenticate to an instance, SQL Server
checks whether the password has expired and prompts the user to change the
password if necessary.

Groups

Using Windows groups provides the greatest flexibility for managing security access. You
simply add or remove accounts from the group to control access to a SQL Server instance.

A DBA is also isolated from the details of people joining and leaving companies or
moving to different groups within an organization. The DBA can then focus on
defining groups based on permission profiles and leave the mechanics of adding and
removing user accounts to standard business processes within your company.

When you are performing maintenance on a database you need to ensure that
users are not accessing the database in the meantime.

You can preventaccess while keeping the permissions for a login intact by disabling the
login. You can disable the login by executing the following code:

ALTER LOGIN <loginname>DISABLE

Fixed Server Roles

Roles in SQL Server provide the same functionality as groups within Windows. Roles
provide a convenient way to group multiple users with the same permissions. Permissions
are assigned to the role, instead of individual users. Users then gain the required set of
permissions by having their account added to the appropriate role.

SQL Server ships with a set of instance-level roles. The instance-level roles are referred to
as fixed server roles, because you cannot modify the permissions on the role. You
also can not create additional roles at an instance level.

34
Database Users
SQL Server security works on the principle of “no access by default.” If you haven’t
explicitly been granted permission, you cannot perform an action. You grant access to a
database by adding a login to the database as a user by executing the CREATE
USER command.

CREATEUSER has the following general syntax:

CREATE USER user_name [ { { FOR | FROM }


{ LOGINlogin_name
| CERTIFICATE cert_name
| ASYMMETRIC KEY asym_key_name}
| WITHOUT LOGIN ]
[ WITH DEFAULT_SCHEMA = schema_name ]

Just because a user can access a database, that does not mean that any objects within the
database can be accessed since the user still needs permissions granted
to databaseobject(s).

Loginless Users

It is possible to create a user in the database that is not associated to a login, referred to as
a loginless user. Prior to SQL Server 2005, if you wanted to allow users to access a
database only when a specific application was being used, you used an application role.
You created the application role with a password, and assigned permissions to the
applicationrole.

Users would then specify the password for the application role to gain access to
the database under the application role’s security context. Unfortunately, when you
are connected with the application role, SQL Server no longer knew the user
issuing commands, which created a problem for auditing activity.

Loginless users were added to replace application roles. Users still authenticate to
the instance using their own credentials. The user’s login needs access to the database.

Fixed Database Roles


Just as you have fixed roles at an instance level, SQL Server provides a set of fixed roles at
adatabase level, as shown below.

35
User Database Roles
Instead of managing permissions for each account, all modern operating systems allow
you to define groups of users that all have the same permissions. All system
administrators need to do is to manage the members of a group instead of the
potentially hundreds or thousands of individual permissions.

A database role is a principal within the database that contains one or more database
users. Permissions are assigned to the database role. Although you can assign
permissions directly to a user, it is recommended that you create database roles, add users
to a role, and then grant permissions to the role.

Managing Permissions

SQL Server denies access by default. Therefore, to access any object or perform any action,
you must be granted permission.

Securables

Permissions would be rather uninteresting unless you had something to apply the
permissions to. In addition, there would be no need to have permissions if no one existed
to use the permission. Permissions work in concert with securables and principals.
You GRANT/REVOKE/DENY <permissions> ON <securables> TO <principals>.
36
Securables are the objects on which you grant permissions. Every object within SQL
Server, including the entire instance, is a securable. Securables also can be nested
inside other securables. For example, an instance contains databases; databases contain
schemas; and schemas contain tables, views, procedures, functions, and so on.

Schemas

Every object created within a database cannot exist without an owner. All objects must
have an owner because objects cannot spontaneously come into existence; rather,
they must be created by someone.

when you need to remove a user from a database it would not be possible to drop a user
unless you reassigned the objects to a different owner. Reassigning an object to a different
owner would change the name of the object.

Schemas provide the containers that own all objects within a database and in turn,
a schema is owned by a database user. By introducing a schema between users and
objects, you can drop a user from the database without affecting the name of an
object or applications that use the object.

Permissions

Permissions provide the authority for principals to perform actions within an instance or
database. Some permissions apply to a statement such as INSERT, UPDATE, and SELECT,
and other permissions apply to an action such as ALTER TRACE, and still others
encompass a broad scope of authority such as CONTROL.

You add permissions to an object with the GRANT statement. Access to an object is
prevented with the DENY statement. To access an object, permission must be
granted explicitly. A DENY overrides any other permission, a DENY overrides a GRANT.

The REVOKE statement removes permission.

For example, if you issue a GRANT SELECT ON Person. Address TO Test,


you can remove the access by executing REVOKE SELECT ON Person.Address FROM Test.

Similarly, if you issue DENYSELECTON Person.Address TO Test,


you can remove the DENY by executing REVOKESELECT ONPerson.Address FROM Test.

Permission Scope

Because you grant permissions on asecurable, you can assign permissions to a securable at
any scope. Granting permission on a database causes the permission to be granted
implicitly to all schemas within the database and thereby to all objects within all schemas.

37
Granting permission on a schema causes thepermission to be granted implicitly to
all objects within a schema.

Auditing SQL Server Instances DDL Triggers


DDL triggers allow you to trap andrespond to login events. You can scope DDL triggers at
either an instance or a database level.

The generic syntax for creating a DDL trigger is as follows: CREATE TRIGGER trigger_name
ON ALL SERVER
[ WITH<logon_trigger_option> [ ,...n ] ]
{ FOR | AFTER } LOGON
AS { sql_statement [ ; ] [ ,...n ] |
EXTERNAL NAME < method specifier>[ ; ] }

You use the ON clause to scope a trigger as either instance-level (ON ALL
SERVER), or database-level, ON DATABASE. You specify the DDL event or event group that
the trigger fires upon within the FOR clause.

DDL triggers fire within the context of the DDL statement being executed. In addition to
obtaining information about the command that was executed, DDL triggers allow you to
prevent many DDL actions.

Audit Specifications

Prior to SQL Server 2008, you had to use multiple features to perform the full array
of auditing for an instance. DDL triggers would audit DDL changes; data
manipulation language (DML) triggers would audit data changes at the cost
of increasing transactiontimes; SQLTrace would audit SELECT statements.

SQL Server 2008 combines all the auditing capabilities into an audit specification. Audit
specifications begin with a server-level audit object that defines the logging location for
the audit trail. You then create server and database audit specifications tied to the audit
object.

C2 Auditing

C2 auditing is a U.S. Department of Defense audit specification that can be enabled


byexecuting the following code:

sp_configure 'c2 audit mode', 1

38
C2 auditing has been superseded by the Common Criteria specification developed by the
European Union. Whether you are complying with C2 or Common Criteria, with respect
toSQL Server, the audit result is essentially the same. You need to audit every successful
and unsuccessful attempt to access a database object.

You must be very careful when implementing C2 auditing. Be sure to check that a lower
level of auditing does not meet your requirements. When you enable C2 auditing, you have
made the decision that the audit is more important than a transaction. If the
system becomes too busy, SQL Server aborts a user transaction to write audit information.

Encrypting Data

Data that must remain confidential, even from a user that has SELECT permission
on a table, should be encrypted.

Data Encryption

Data that needs to remain confidential within the database (such as credit
card numbers)should be encrypted. After it’s encrypted, the data cannot be read without
having the proper credentials. In addition, encrypted columns cannot be used as search
arguments or as columns within an index because each action would defeat the purpose of
encrypting the data.

Columns can be encrypted using a hash, symmetric key, asymmetric key, or a certificate.
Symmetric keys are commonly used since a symmetric key provides the best
balance between securing data and performance. Asymmetric keys and certificates
provide the strongest encryption and decryption method.

Hash Algorithms

Encryption algorithms are either one-way or two-way. Two-way algorithms allow


you toencrypt and decrypt data. One-way algorithms only encrypt data, without any ability
to decrypt. A hash algorithm is a one-way algorithm that allows you to encrypt data but
does not allow decryption.

SQLServer uses an MD5 hash to handle passwords. When a password is specified for an
object, it applies an MD5 hash and stores the hash value. When you specify a password to
access an object, the password is hashed using the same MD5 hash, the hashed password
is transmitted in a secure channel, and the hash value transmitted is compared to
the hashvalue stored. Even an administrator who is running a trace cannot access
the password.

SQL Server allows you to specify five different hash algorithms—SHA, SHA1, MD2, MD4,
and MD5. MD5 is the algorithm of choice because it provides stronger encryption than the
other algorithms. Hash algorithms are also platform-Independent. You could hash a value
39
with in PHP on a Linux system and receive the same value as if you hashed the same value
with in SQLServer, so long as you used the same algorithm.

Hash algorithms are vulnerable to brute force attacks. If the range of values that you are
seeking to encrypt is small, an attacker can easily generate all the possible hashes for the
range of possible values. After generating these hashes, the attacker needs to compare the
hash values to find a match and thus reverse-engineer your data. For example, birthdates,
salaries, and credit card numbers would not be good choices to encrypt using a
hash algorithm.

Symmetric Keys

Symmetric keys utilize a single key for both encryption and decryption. Because
only a single key is needed to encrypt and decrypt data, symmetric key encryption is not
as strong as asymmetric key or certificate-based encryption. However, symmetric keys
provide the best possible performance for routine use of encrypted data.

Certificates and Asymmetric Keys

Certificates and asymmetric keys are based on the X.509 standard and are
essentially equivalent in their application. Asymmetric keys are generated by a key server
within an organization and cannot be backed up or moved from one system to
another. Certificatescan be backed up to and restored from a file, allowing you to move
databases that are encrypted while being able to re-create the certificate to access your
data.

you can execute the following code and compare the results for each hash algorithm:

DECLARE @Hash varchar(100)


SELECT @Hash = 'Encrypted Text'
SELECT HashBytes('MD5', @Hash) SELECT @Hash = 'Encrypted Text' SELECT
HashBytes('SHA', @Hash)

40
UNIT FIVE

Designing SQL Server Indexes


Introduction
After you have an optimal table design, you need to
design efficient indexes to effectivelyquery any data that is
stored. In this chapter, youlearn about the internal architecture
of anindex, as well as how to construct clustered,
nonclustered, Extensible Markup Language(XML), and spatial
indexes. You will then learnhow to manage and maintain the
indexes to ensure peak performance.

Index Architecture

Indexes are designed so that you can find the information you
are looking for within avastvolume of data by needing to perform
only a very small number of read operations.In thislesson, you
learn about the internal structure of an index, as well as how
SQL Serverbuildsand manipulates indexes. Armed with this
structural information, you can make betterdecisions on the
number, type, and definition of the indexes that you choose to
create.

Index Structure

SQL Server does not need to have indexes on a table to retrieve


data. A table can simply be scanned to find the piece of data that is
requested. However, the amount of time to find apiece of data is
directly proportional to the amount of data in the table. Because
users wantto store increasing amounts of data in a table as
well as have consistent query performance,regardless of the
data volume, you need to employ indexes to satisfy the needs
of theapplications that all businesses are built upon.

An index is useful only if it can provide a means to find data very


quickly regardless ofthe volume of data that is stored.

Balanced Trees (B-Trees)

The structure that SQL Server uses to build and maintain


indexes is called a balanced tree(B-tree).

41
A B-tree is constructed of a root node that contains a single page of data, one or
moreoptional intermediate level pages, and one or more optional leaf level pages.
The coreconcept of a B-tree can be found in the first word of the name: balanced. A B-tree
is always symmetrical, with the same number of pages on both the left and right halves at
each level.

A query scans the root page until it finds a page that contains the value being searched on.
It then uses the page pointer to hop to the next level and scan the rows in that page until it
finds a page that contains the data being searched for. It then repeats the process
with subsequent levels until it reaches the leaf level of the index. At this point, the
query haslocated the required data.

Index Levels

The number of levels in an index and the number of pages within each level of an index
aredetermined by simple calculation.

Designing Indexes

Indexes enable you to effectively query large amounts of data within a database. In this
lesson, you learn how to create clustered and nonclustered indexes, as well as why each
type of index is useful. Finally, you learnhow tocreate XML and spatial indexes to improve
search capabilities for XML documents and spatial applications.

Clustered Indexes

You can define an index by using one or more columns in the table, called the
index key,withthe following restrictions:
You can define an index with a maximum of 16 columns.
The maximum size of the index key is 900 bytes.
The column(s) defined for the clustered index are referred to as the clustering key.
A clusteredindex is special because it causes SQL Server to arrange the data in the
table according to the clustering key. Because a table cannot be sorted more than one way,
you can define only one clustered index on a table.

Clustered indexes provide a sort order for the storage of data within a table. However, a
clustered index does not provide a physical sort order.
42
Although a primary key is a constraint, SQL Server physically implements a primary key as
an index. Because the default option for a primary key is clustered unless you
specify otherwise, SQL Server creates a clustered index for a primary key.

Likewise,a uniqueconstraint is physically implemented as a unique index. Because a


primary key is alsounique by default unless it is specified as nonclustered, SQL
Server physically implements each primary key as a unique, clustered index.

Nonclustered Indexes

The other type of relational index that you can create is a nonclustered index.
Nonclustered indexes do not impose a sort order on the table, so you can create multiple
nonclustered indexes on a table. Nonclustered indexes have the same restrictions as a
clustered index - they can have a maximum of 900 bytes in the index key and a maximum
of 16 columns. Inaddition, a table is limited to a maximum of 1,000 nonclustered indexes.

The leaf level of a nonclustered index contains a pointer to the data you require. If
a clustered index exists on the table, the leaf level of the nonclustered index points
at theclustering key. If aclustered index does not exist on the table, the leaf level
of the nonclustered index points at therow of data in the table.

Both clustered and nonclustered indexes can be designated as unique. After an index has
been specified as such, you cannot place duplicate entries within it. If you attempt to insert
duplicate values, you receive an error and the transaction is disallowed.

By default, multi-rowinserts where even one row produces a duplicate have the
entire transaction rolledback.

Ifyou want to allow any rows to be inserted that do not produce duplicate values and
reject only the rows that cause duplicates in a multi-row insert operation, you
can specifytheIGNORE_DUP_KEY option. With the IGNORE_DUP_KEY option enabled,
rows that produced uplicate values generate a warning message, and only those rows are
rejected.

XML Indexes

An XML data type can contain up to 2 gigabytes (GB) of data in a single column.
Althoughthe XML data has a structure that can be queried, SQL Server needs to scan the
data structure to locate data within an XML document. To improve the performance
of queriesagainst XMLdata, you can create a special type of index called an XML index.

There are two different types of XML indexes: primary and secondary.

A primary XML index is built against all the nodes within the XML column. The primary
XML index is also tied to the table by maintaining a link to the corresponding row in the
43
clustered index. Therefore, a clustered index is required before you can create a primary
XML index.

After a primary XML index has been created, you can create additional secondary
indexes.Secondary indexes can be created on PATH, VALUE, or PROPERTY. A primary XML
index is first required, because secondary XML indexes are built against the data contained
within the primary XML index.

Maintaining Indexes

Over time, data changes will cause indexes to become fragmented. In order to ensure the
most efficient query operations possible, you need to ensure that fragmentation is
minimized.

Because the data within an index is stored in sorted order, over time, values can
movearoundwithin the index due to either page splits or changes in the values. To manage
the fragmentation of an index over time, you need to perform periodic maintenance.

To control the rate of fragmentation of an index, you can use an index option called the
fillfactor. You can also use the ALTER INDEX statement to remove the fragmentation.

Disabling an index

An index can be disabled by using the ALTER INDEX statement ALTER INDEX

{ index_name | ALL }
ON <object> DISABLE [

When an index is disabled, the definition remains in the system catalog but is no
longerused. SQL Server does not maintain the index as data in the table changes, and the
index cannot be used to satisfy queries. If a clustered index is disabled, the entire table
becomesinaccessible.

To enable an index, it must be rebuilt to regenerate and populate the B-tree structure. You
can accomplish this by using the following command:

ALTER INDEX { index_name | ALL } ON <object>


REBUILD

44
UNIT SIX

Database Recovery, Backing up and Restoring a database


Along with security, the other fundamental task of a database administrator
(DBA) is toensure that data can be recovered in the event of a disaster. Unless
you can protect thedata, the thousands

Backing up Databases

Database backups form the backbone upon which every disaster recovery plan is
built. Backup strategies are designed in the opposite order of the material
presented in this chapter. You start with the recovery requirements and
procedures and then figure outwhich types ofbackups best meet your recovery
needs. Unless you are devising recovery-oriented backupstrategies, it is unlikely
that you will ever meet your disaster recovery requirements.

Backup Types

SQL Server 2008 allows you to create four different types of backups:
Full
Differential
Transaction log
Filegroup

Full Backups

A full backup captures all pages within a database that contain data. Pages that
do notcontain data are not included in the backup. Therefore, a backup is never larger,
and in most cases is smaller, than the database for which it is created. A full backup is
the basis for recovering a database and must exist before you can use a
differential or transaction logbackup.

A backup can be executed concurrently with other database operations.


Because changescanbe made to the database while a backup is running, SQL Server
needs to be able toaccommodatethe changes while also ensuring that backups
are consistent for restorepurposes. To ensureboth concurrent access and backup
consistency, SQL Server performsthe steps of the backupprocedure as follows:
•Locks the database, blocking all transactions
•Places a mark in the transaction log
•Releases the database lock
•Extracts all pages in the data files and writes them to the backup device
•Locks the database, blocking all transactions
•Places a mark in the transaction log
•Releases the database lock
45
8.Extracts the portion of the log between the marks and appends it to the backup

The only operations that are not allowed during a full backup are
8. Adding or removing a database file
9. Shrinking a database

Transaction Log Backups

Every change made to a database has an entry made to the transaction log.
Each row isassigned a unique number internally called the Log Sequence Number (LSN).
The LSN is aninteger value that starts at 1 when the database is created and increments to
infinity. An LSN is never reused for a database and always increments. Essentially, an LSN
provides asequence number for every change made to a database.

The contents of a transaction log are broken down into two basic parts:-active and
inactive.

The inactive portion of the transaction log contains all the changes that have
been committed to the database.

The active portion of the log contains all the changes that have not yet been
committed.

When a transaction log backup is executed, SQL Server starts with the lowest LSN
in the transaction log and starts writing each successive transaction log record into the
backup. As soon as SQL Server reaches the first LSN that has not yet been committed (that
is, the oldest open transaction), the transaction log backup completes. The portion of
the transaction log that has been backed up is then removed, allowing the space to be
reused.

Based on the sequence number, it is possible to restore one transaction log backup
after another to recover a database to any point in time.

The restrictions on transaction log backups depend on having the entire sequence
of LSNs intact. Any action that creates a gap in the LSN sequence prevents any
subsequent transaction log backup from being executed. If an LSN gap is introduced, you
must create a full back up before you can start backing up the transaction log.

Differential Backups
A differential backup captures all extents that have changed since the last full
backup. The primary purpose of a differential backup is to reduce the number of
transaction log backups that need to be restored. A differential backup has to be applied
to a full backup and can’t exist until a full backup has been created.

SQL Server tracks each extent that has been changed following a full backup using46a
special page in the header of a database called the Differential Change Map (DCM). A full
backup zeroes out the contents of the DCM.
As changes are made to extents within the database, SQL Server sets the bit corresponding
to the extent to 1. When a differential backup is executed, SQL Server reads the contents of
the DCM to find all the extents that have changed since the last full backup. A differential
backup is not the same as an incremental backup.

A transaction log backup is an incremental backup because it captures any changes that
have occurred since the last transaction log backup. A differential backup contains all
pages changed since the last full backup.

For example, if you were to take a full backup at midnight and a differential backup every
four hours, both the 4 A.M. backup and the 8 A.M. backup would contain all the changes
made to the database since midnight.

Filegroup Backups

Although full backups capture all the used pages across the entire database, a full backup
of a large database can consume a significant amount of space and time. If you need to
reduce the footprint of a backup, you can rely on file or filegroup backups instead. As the
name implies, a file/filegroup backup allows you to target a portion of a database to be
backed up.

Filegroup backups can be used in conjunction with differential and transaction log backups
to recover a portion of the database in the event of a failure. In addition, so long as you do
not need to restore the primary filegroup and you are running SQL Server 2008
Enterprise; the database can remain online and accessible to applications during the
restore operation. Only the portion of the database being restored is off-line.

Partial Backups

Filegroups can be marked as read-only. A read-only filegroup cannot have any changes to
the objects that are stored on the filegroup. Because the purpose of backups is to capture
changes so that you can reconstruct a database to the most current state during a recovery
operation, backing up filegroups that cannot change unnecessarily consumes space within
the backup.

To reduce the size of a backup to only the filegroups that can change, you can perform a
partial backup. Partial backups are performed by specifying
the READ_WRITE_FILEGROUPS option as follows:

BACKUP DATABASE <database_name> READ_WRITE_FILEGROUPS


[filegroup_list] To <backup device>

TO When a partial backup is executed, SQL Server backs up the primary filegroup,
all read/write filegroups, and any explicitly specified read-only filegroups.

47
Restoring Databases

Backups are the insurance policy for your data. You hope that you never need to use your
backups, but in the event of a disaster, backups allow you to recover your data
and continue business operations.

Transaction Log Internals

A transaction log is one or more files associated with a database that tracks every
modification made to either data or objects within the database.
The transaction log is also required to store enough information to allow SQL Server to
recover a database when the instance is restarted.

The key piece of information within a transaction log is the LSN. The LSN starts at 0 when
the database is created and increments to infinity. The LSN always moves forward, never
repeats, and cannot be reset to a previous value. Each operation that affects the state of
the database increments the LSN.

When a SQL Server is started, every database undergoes a process called restart recovery.
Restart recovery runs in two phases—UNDO and REDO. During the REDO phase, all
committed transactions in the transaction log are flushed to disk.

At the completion of the REDO phase, the UNDO phase starts. The UNDO phase
moves through the transaction log and invalidates any transaction in the log that is
still open, ensuring that an uncommitted transaction cannot be written out to disk.

Database Restores

All restore sequences begin with either a full backup or filegroup backup. When restoring
backups, you have the option to terminate the restore process at any point and make the
database available for transactions. After the database or filegroup being restored has
been brought online, you can’t apply any additional differential or transaction log backups
to the database.

Restoring a Full Backup

When a RESTORE command is issued, if the database does not already exist within
the instance, SQL Server creates the database along with all fi les underneath the
database. During this process, each fi le is created and sized to match the fi le sizes at the
time the backup was created. After it creates the fi les, SQL Server begins restoring each
database page from the backup.

The REPLACE option is used to force the restore over the top of an existing database.
If you want the database to be online and accessible for transactions after the RESTORE
operation has completed, you need to specify the RECOVERY option. When a RESTORE is
48
issued with the NORECOVERY option, the restore completes, but the database is left in a
RECOVERING state such that subsequent differential and/or transaction log backups can
be applied.

The STANDBY option can be used to allow you to issue SELECT statements against
the database while still issuing additional differential and/or transaction log restores.

If you wanted to restore the database without losing any data, you only need to restore
the full backup and then the differential backup because the differential also contains all
the changes captured by each of the transaction log backups.

Restoring a Differential Backup


When the full backup has been restored, you can then restore the most recent differential
backup. There are times that you need to restore a database but do not want to recover
every transaction that was issued.

When restoring a transaction log, you can have SQL Server replay only a portion of
a transaction log by issuing what is referred to as a point-in-time restore. The
STOPAT command allows you to specify a date and time to which SQL Server
restores. The STOPATMARK and STOPBEFOREMARK options allow you to specify
either an LSN or a transaction log MARK to use for the stopping point in the restore
operation.

Online Restores

A database has a state that governs whether it can be accessed and what operations can be
performed. For example, a database in an ONLINE state allows transactions and any other
operations to be executed, but a database in an EMERGENCY state allows only
SELECT operations to be executed by a member of the db_owner database role.

Each filegroup within a database can have a state. While one filegroup can be in a
RESTORING state and not be accessible, another filegroup can be in an ONLINE state and
accept transactions. The state of the database equals the state of the filegroup designated
as PRIMARY.

SQL Server 2008 Enterprise allows you to perform restore operations while the database
is still online and accessible. However, because a full backup affects the entire database,
the state of the database is the state of the primary filegroup, and a database that is
restoring is not accessible. To perform an online restore operation, you must perform a file
or filegroup restore. In addition, you cannot be restoring the primary filegroup or a
file within the primary filegroup. A filegroup restore that affects only a portion of the
database is referred to as a partial restore.

49
UNIT SEVEN

Object based databases and XML


Traditional database applications consist of data-processing tasks,
such as banking and payroll management, with relatively simple
data types that are well suited to the relational data model. As
database systems were applied to a wider range of applications,
such as computer-aided design and geographical information
systems, limitations imposed by the relational model emerged as an
obstacle. The solution was the introduction of object-based
databases, which allow one to deal with complex data types.

Overview

The first obstacle faced by programmers using the relational data


model was the limited type system supported by the relational
model. Complex application domains require correspondingly
complex data types, such as
nested record structures,
multivalued attributes, and
inheritance, which are supported by traditional programming
languages.
Such features are in fact supported in the E-R and extended E-R
notations, but had to be translated to simpler SQL data types. The
object-relational data model extends the relational data model by
providing a richer type system including complex data types and
object orientation.

Relational query languages, in particular SQL, need to be


correspondingly extended to deal with the richer type system.

Object-relational database systems, that is, database systems based


on the object-relation model, provide a convenient migration path for
users of relational databases who wish to use object-oriented
features.

The second obstacle was the difficulty in accessing database data


from programs written in programming languages such as C++ or
Java. Merely extending the type system supported by the database
was not enough to solve this problem completely. Differences
between the type system of the database and the type system of the
programming language make data storage and retrieval more
complicated, and need to be minimized. Having to express
database access using a language (SQL) that is different from the
programming language again makes the job of the programmer
harder. 50

It is desirable, for many applications, to have programming


language constructs or extensions that permit direct access to data
1. Build an object-oriented database system, that is, a database system that natively
supports an object-oriented type system, and allows direct access to data from an
object-oriented programming language using the native type system of
the language.
2. Automatically convert data from the native type system of the programming
language to a relational representation, and vice versa. Data conversion is specified
using an object-relational mapping.

Complex Data Types

In recent years, demand has grown for ways to deal with more complex data
types. Consider, for example, addresses. While an entire address could be viewed
as an atomic data item of type string, this view would hide details such as the street
address, city, state, and postal code, which could be of interest to queries.

On the other hand, if an address were represented by breaking it into the


components (street address, city, state, and postal code), writing queries would be
more complicated since they would have to mention each field. A better alternative
is to allow structured data types that allow a type address with subparts street
address, city, state, and postal code.

With complex type systems we can represent E-R model concepts, such as
composite attributes, multivalued attributes, generalization, and specialization
directly, without a complex translation to the relational model.

Structured Types and Inheritance in SQL Structured Types


Structured types allow composite attributes of E-R designs to be represented directly.
For instance, we can define the following structured type to represent a composite
attribute name with component attribute firstname and lastname:
create type Name as
(firstname varchar(20), lastname varchar(20)) final;

Similarly, the following structured type can be used to represent a composite


attribute address:
create type Address as
(street varchar(20), city varchar(20), zipcode varchar(9)) not final;
The final and not final specifications are related to subtyping.

51
We can now use these types to create composite attributes in a relation, by simply
declaring an attribute to be of one of these types. For example, we could create a
table person as follows:

create table person (


name Name, address Address, dateOfBirth date);

The components of a composite attribute can be accessed using a “dot” notation;


for instance name.firstname returns the firstname component of the name attribute. An
access to attribute name would return a value of the structured type Name.

We can also create a table whose rows are of a user-defined type. For example, we could
define a type PersonType and create the table person as follows:

create type PersonType as ( name Name,


address Address, dateOfBirth date) not final
create table person of PersonType;

The following query illustrates how to access component attributes of a composite


attribute. The query finds the last name and city of each person.

select name.lastname, address.city from person;

A structured type can have methods defined on it. We declare methods as part of the type
definition of a structured type:

create type PersonType as (


name Name, address Address, dateOfBirth date) not final
method ageOnDate(onDate date)
returns interval year;

We create the method body separately:


create instance method ageOnDate (onDate date) returns interval year
for PersonType
begin
return onDate − self.dateOfBirth;
end

52
Note that the for clause indicates which type this method is for, while the keyword
instance indicates that this method executes on an instance of the Person type. The
variable self refers to the Person instance on which the method is invoked.

Methods can be invoked on instances of a type. If we had created a table person of type
PersonType, we could invoke the method ageOnDate() as illustrated below, to find the age
of each person.

select name.lastname, ageOnDate(current date) from person;

Type Inheritance

Suppose that we have the following type definition for people:


create type Person
(name varchar(20),
address varchar(20));

We may want to store extra information in the database about people who are students,
and about people who are teachers. Since students and teachers are also people, we can
use inheritance to define the student and teacher types in SQL:

create type Student under Person (degree varchar(20),


department varchar(20));

create type Teacher under Person (salary integer,


department varchar(20));

Both Student and Teacher inherit the attributes of Person—namely, name and
address. Student and Teacher are said to be subtypes of Person, and Person is a supertype
of Student, as well as of Teacher.

Methods of a structured type are inherited by its subtypes, just as attributes are. However,
a subtype can redefine the effect of a method by declaring the method again, using
overriding method in place of method in the method declaration.

The SQL standard requires an extra field at the end of the type definition, whose value is
either final or not final. The keyword final says that subtypes may not be created from the
given type, while not final says that subtypes may be created.

Note that the SQL standard does not support multiple inheritance, although future
versions of the SQL standard may support it.

53
Table Inheritance

Subtables in SQL correspond to the E-R notion of specialization/generalization. For

instance, suppose we define the people table as follows:

create table people of Person;

We can then define tables students and teachers as subtables of people, as follows:

create table students of Student


under people;
create table teachers of Teacher
under people;

The typesof the subtables (Student and Teacher in the above example) are subtypes of the
type of the parent table Person. As a result, every attribute present in the table people is
also present in the subtables students and teachers.

Further, when we declare students and teachers as subtables of people, every tuple
present in students or teachers becomes implicitly present in people. Thus, if a query uses
the table people, it will find not only tuples directly inserted into that table, but also tuples
inserted into its subtables, namely students and teachers.

SQL permits us to find tuples that are in people but not in its subtables by using
“only people” in place of people in a query. The only keyword can also be used in delete
and update statements. Without the only keyword, a delete statement on a supertable,
such as people, also deletes tuples that were originally inserted in subtables (such as
students); for example, a statement:

delete from people where P;

would delete all tuples from the table people, as well as its subtables students and
teachers, that satisfy P.

Array and Multiset Types in SQL

SQL supports two collection types: arrays and multisets

a multiset is an unordered collection, where an element may occur multiple times.


Multisets are like sets, except that a set allows each element to occur at most once.

Unlike elements in a multiset, the elements of an array are ordered.

The following example illustrates how these array and multiset-valued attributes can be
defined in SQL:

54
create type Publisher as (name varchar(20), branch varchar(20));

create type Book as


(title varchar(20),
author array varchar(20) array [10],
pub date date,
publisher Publisher,
keyword set varchar(20) multiset);

create table books of Book;

Note that we used an array, instead of a multiset, to store the names of authors, since the
ordering of authors generally has some significance, whereas we believe that the
ordering of keywords associated with a book is not significant.

In general, multivalued attributes from an E-R schema can be mapped to multiset-valued


attributes in SQL; if ordering is important, SQL arrays can be used instead of
multisets.

If we want to insert the preceding tuple into the relation books, we could execute
the statement:

insert into books


values (’Compilers’, array[’Smith’, ’Jones’],
new Publisher(’McGraw-Hill’, ’New York’), multiset[’parsing’, ’analysis’]);

Object-Identity and Reference Types in SQL

Object-oriented languages provide the ability to refer to objects. An attribute of a type can
be a reference to an object of a specified type. For example, in SQL we can define a
type Department with a field name and a field head that is a reference to the type
Person, and a table departments of type Department, as follows:

create type Department (


name varchar(20),
head ref(Person) scope people);

create table departments of Department;

Here, the reference is restricted to tuples of the table people. The restriction of the scope
of a reference to tuples of a table is mandatory in SQL, and it makes references
behave like foreign keys.

55
We can omit the declaration scope people from the type declaration and instead make an
addition to the create table statement:

create table departments of Department (head with options scope people);

insert into departments


values (’CS’, null);
update departments
set head = (select p.person id
from people as p
where name = ’John’)
where name = ’CS’;

XML

The Extensible Markup Language (XML) was not designed for database applications. In
fact, like the Hyper-Text Markup Language (HTML) on which the World Wide Web is
based, XML has its roots in document management, and is derived from a language for
structuring large documents known as the Standard Generalized Markup Language
(SGML). However, unlike SGML and HTML, XML is designed to represent data.

It is particularly useful as a data format when an application must communicate


with another application, or integrate information from several other applications. When
XML is used in these contexts, many database issues arise, including how to organize,
manipulate, and query the XML data.

The term markup refers to anything in a document that is not intended to be part of the
printed output. In electronic document processing, a markup language is a formal
description of what part of the document is content, what part is markup, and what the
markup means.
the title of a document might be marked up as follows:
<title>Database System Concepts</title>

56
Unlike HTML, XML does not prescribe the set of tags allowed, and the set may be chosen
as needed by each application. This feature is the key to XML’s major role
in data representation and exchange, whereas HTML is used primarily for document
formatting.
<university>
<department>
<dept name> Comp. Sci. </dept name>
<building> Taylor </building>
<budget> 100000 </budget>
</department>
<department>
<dept name> Biology </dept name>
<building> Watson </building>
<budget> 90000 </budget>
</department>
<course>
<course id> CS-101 </course id>
<title> Intro. to Computer Science </title>
<dept name> Comp. Sci </dept name>
<credits> 4 </credits>
</course>
<course>
<course id> BIO-301 </course id>
<title> Genetics </title>
<dept name> Biology </dept name>
<credits> 4 </credits>
</course>
<instructor>
<IID> 10101 </IID>
<name> Srinivasan </name>
<dept name> Comp. Sci. </dept name>
<salary> 65000 </salary>
</instructor>
<instructor>
<IID> 83821 </IID>
<name> Brandt </name>
<dept name> Comp. Sci. </dept name>
<salary> 92000 </salary>
</instructor>
<instructor>
<IID> 76766 </IID>
<name> Crick </name>
<dept name> Biology </dept name>
<salary> 72000 </salary>
</instructor>
<teaches>
<IID> 10101 </IID>
<course id> CS-101 </course id> 57
</teaches>
<teaches>
<IID> 83821 </IID>
<course id> CS-101 </course id>
</teaches>
<teaches>
<IID> 76766 </IID>
<course id> BIO-301 </course id>
</teaches>
</university>

Compared to storage of data in a relational database, the XML


representation may be inefficient, since tag names are repeated throughout the
document.

However, in spite of this disadvantage, an XML representation has significant


advantages when it is used to exchange data between organizations, and
for storing complex structured information in files
Just as SQL is the dominant language for querying relational data, XML
has become the dominant format for data exchange.

Structure of XML Data


The fundamental construct in an XML document is the element. An element is
simply a pair of matching start- and end-tags and all the text that appears
between them. XML
documents must have a single root element that encompasses all other
elements in the document. In the example above, the<university>element
forms the root element. Further,
elements in an XML document must nest properly. For instance:
<course> . . . <title> . . . </title> .. . </course> is properly nested, whereas:
<course> . . . <title> . . . </course> ... </title> is not properly nested.
In addition to elements, XML specifies the notion of an attribute. For instance,
the course identifier of a course can be represented as an attribute. The
attributes of an element
appear as name=value pairs before the closing “>” of a tag. Attributes are strings
and do not contain markup. Furthermore, attributes can appear only once in a
given tag, unlike sub 58
elements, which may be repeated.
<course course id= “CS-101”>
<title> Intro. to Computer Science</title>
<dept name> Comp. Sci. </dept name>
<credits> 4 </credits>
</course>

XML Document Schema


Databases have schemas, which are used to constrain what information can be
stored in the database and to constrain the data types of the stored
information. In contrast, by default, XML documents can be created without any
associated schema: an element may then have any subelement or attribute. While
such freedom may occasionally be acceptable given the self-describing nature of
the data format, it is not generally useful when XML documents must be
processed automatically as part of an application, or even when large amounts of
related data are to be formatted in XML.

Here, we describe the first schema-definition language included as part of the


XML standard, the Document Type Definition, as well as its more recently defined
replacement, XML Schema.

Document type definition (DTD)

The document type definition (DTD) is an optional part of an XML document. The
main purpose of a DTD is much like that of a schema: to constrain and type
the information present in the document. However, the DTD does not in fact
constrain types in the sense of basic types like integer or string. Instead, it
constrains only the appearance of subelements and attributes within an element.
The DTD is primarily a list of rules for what pattern of subelements may appear
within an element.

<!DOCTYPE university [
<!ELEMENT university ( (department|course|instructor|teaches)+)>
<!ELEMENT department ( dept name, building, budget)>
<!ELEMENT course ( course id, title, dept name, credits)>
59
<!ELEMENT instructor (IID, name, dept name, salary)>
<!ELEMENT teaches (IID, course id)>
<!ELEMENT dept name( #PCDATA )>
<!ELEMENT building( #PCDATA )>
<!ELEMENT budget( #PCDATA )>
<!ELEMENT course id ( #PCDATA )>
<!ELEMENT title ( #PCDATA )>
<!ELEMENT credits( #PCDATA )>
<!ELEMENT IID( #PCDATA )>
<!ELEMENT name( #PCDATA )>
<!ELEMENT salary( #PCDATA )>
]>
The elements course id, title, dept name, credits, building, budget, IID, name, and
salary are all declared to be of type #PCDATA. The keyword #PCDATA indicates
text data; it derives its name, historically, from “parsed character data.”

Two other special type declarations are empty, which says that the element
has no contents, and any, which says that there is no constraint on the subelements
of the element; that is, any elements, even those not mentioned in the DTD, can
occur as subelements of the element. The absence of a declaration for an element is
equivalent to explicitly declaring the type as any.

XML Schema
An effort to redress the deficiencies of the DTD mechanism resulted in the
development of a more sophisticated schema language, XML Schema. We provide a
brief overview of XML Schema, and then we list some areas in which it improves
DTDs.

XMLSchema defines a number of built-in types such as string, integer, decimal date,
and boolean. In addition, it allows user-defined types; these may be simple types
with added restrictions, or complex types constructed using constructors such
as complexType and sequence. 60
The first thing to note is that schemadefinitions in XMLSchema are themselves specified in
XML syntax, using a variety of tags defined by XML Schema. To avoid conflicts with user-
defined tags,weprefix the XMLSchema tag with the namespace prefix “xs:”; this prefix is
associated with the XML Schema namespace by the xmlns:xs specification in the
root element:
<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”>
Note that any namespace prefix could be used in place of xs; thus we could
replace all occurrences of “xs:” in the schema definition with “xsd:” without changing the
meaning of the schema definition. All types defined by XML Schema must be
prefixed by this namespace prefix.

<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”>
<xs:element name=“university” type=“universityType” />
<xs:element name=“department”>
<xs:complexType>
<xs:sequence>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“building” type=“xs:string”/>
<xs:element name=“budget” type=“xs:decimal”/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name=“course”>
<xs:element name=“course id” type=“xs:string”/>
<xs:element name=“title” type=“xs:string”/>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“credits” type=“xs:decimal”/>
</xs:element>
<xs:element name=“instructor”>
<xs:complexType>
<xs:sequence>
<xs:element name=“IID” type=“xs:string”/>
<xs:element name=“name” type=“xs:string”/>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“salary” type=“xs:decimal”/>
</xs:sequence>
</xs:complexType>
</xs:element>
XML Schema version of DTD above

61
The type of department is defined to be a complex type, which is further specified
to consist of a sequence of elements dept name, building, and budget. Any type that has
either attributes or nested subelements must be specified to be a complex type.
Attributes are specified using the xs:attribute tag. For example,we could have defined dept
name as an attribute by adding:
<xs:attribute name = “dept name”/>
In the context of XML, we need to specify a scope within which values are unique and form
a key. The selector is a path expression that defines the scope for the constraint, and field
declarations specify the elements or attributes that form the key. To specify that dept
name
forms a key for department elements under the root university element, we add
the following constraint specification to the schema definition:
<xs:key name = “deptKey”>
<xs:selector xpath = “/university/department”/>
<xs:field xpath = “dept name”/>
</xs:key>
Correspondingly a foreign-key constraint from course to department may be defined
as follows:
<xs: name = “courseDeptFKey” refer=“deptKey”>
<xs:selector xpath = “/university/course”/>
<xs:field xpath = “dept name”/>
</xs:keyref>
Note that the refer attribute specifies the name of the key declaration that is being
referenced, while the field specification identifies the referring attributes. XML
Schema offers several benefits over DTDs, and is widely used today.
Among the benefits that we have seen in the examples above are these:
•It allows the text that appears in elements to be constrained to specific types, such
as numeric types in specific formats or complex types such as sequences of elements of
other types.
•It allows user-defined types to be created.
•It allows uniqueness and foreign-key constraints.
62
• It is integrated with namespaces to allow different parts of
a document to conform to different schemas.
In addition to the features we have seen, XML Schema supports several
other features that DTDs do not, such as these:
• It allows types to be restricted to create specialized types, for
instance
by specifying minimum and maximum values.
• It allows complex types to be extended by using a form of inheritance.

Querying and Transformation


Given the increasing number of applications that use XML to exchange,
mediate, and store data, tools for effective management of XML data
are becoming increasingly important. In particular, tools for
querying and transformation of XML data are essential to extract
information from large bodies of XML data, and to convert data
between different representations (schemas) in XML. Just as the
output of a relational query is a relation, the output of an XML query
can be an XML document.
As a result, querying and transformation can be combined into a single
tool. In this section, we describe the XPath and XQuery languages:
 XPath is a language for path expressions and is actually a building
block for XQuery.
 XQuery is the standard language for querying XML data. It is modeled
after SQL but is significantly different, since it has to deal with
nested XML data. XQuery also incorporates XPath expressions.

XPath
XPath addresses parts of an XML document by means of path
expressions. The language can be viewed as an extension of the
simple path expressions in object oriented and object- relational
databases. The current version of the XPath standard is XPath
2.0, and our description is based on this version.
A path expression in XPath is a sequence of location steps separated by
“/” (instead of the “.” operator that separates location steps in SQL).
63
The result of a path expression is a set of nodes. For instance, on the
XML document above, the XPath expression:
<name>Srinivasan</name>
<name>Brandt</name> The expression:
/university-3/instructor/name/text() returns the same names, but without
the enclosing tags.

Path expressions are evaluated from left to right. Like a directory hierarchy,
the initial ’/’ indicates the root of the document. Note that this is an abstract
root “above” <university-3> that is the document tag.

As a path expression is evaluated, the result of the path at any point consists
of an ordered set of nodes from the document.
Selection predicates may follow any step in a path, and are contained in
square brackets. For example,
/university-3/course[credits >= 4]
returns course elements with a credits value greater than or equal to 4,
while:
/university-3/course[credits >= 4]/@course id returns the course identifiers
of those courses.
the path expression:
/university-2/instructor[count(./teaches/course)> 2]
returns instructors who teach more than two courses. Boolean connectives
and, and or can be used in predicates, while the function not(. . .) can be
used for negation.

For example, if the university data in our university example is


contained in a file “university.xml”, the following path expression would
return all departments at the university:
doc(“university.xml”)/university/department

XQuery
The World Wide Web Consortium (W3C) has developed XQuery as the
standard query language for XML.

64
XQuery queries are modeled after SQL queries, but differ significantly from SQL. They are
organized into five sections: for, let, where, order by, and return. They are referred to as
“FLWOR” (pronounced “flower”) expressions, with the letters in FLWOR denoting the five
sections.

A simple FLWOR expression that returns course identifiers of courses with greater than 3
credits, shown below, is based on the XML document

for $x in /university-3/course let $courseId := $x/@course id where $x/credits > 3


return <course id> { $courseId } </course id>

The for clause is like the from clause of SQL, and specifies variables that range over the
results of XPath expressions. When more than one variable is specified, the results include
the Cartesian product of the possible values the variables can take, just as the SQL from
clause does.
The let clause simply allows the results of XPath expressions to be assigned to variable
names for simplicity of representation. The where clause, like the SQL where
clause, performs additional tests on the joined tuples from the for clause.

The order by clause, like the SQL order by clause, allows sorting of the output. Finally, the
return clause allows the construction of results in XML. A FLWOR query need not contain
all the clauses;

XML Applications

We now outline several applications of XML for storing and communicating (exchanging)
data and for accessing Web services (information resources).

Storing Data with Complex Structure

65
Many applications need to store data that are structured, but are not easily modeled as
relations. XML-based representations are now widely used for storing
documents, spreadsheet data and other data that are part of office application packages.

XML-based representations are now widely used for storing documents, spreadsheet data
and other data that are part of office application packages. The Open Document Format
(ODF), supported by the Open Office software suite as well as other office suites, and the
Office Open XML (OOXML) format, supported by the Microsoft Office suite, are document
representation standards based on XML.

XML is also used to represent data with complex structure that must be exchanged
between different parts of an application. For example, a database system may represent a
query execution plan (a relational-algebra expression with extra information on how
to execute operations) by using XML.

Standardized Data Exchange Formats

XML-based standards for representation of data have been developed for a variety
of specialized applications, ranging from business applications such as banking and
shipping to scientific applications such as chemistry and molecular biology.

Web Services
Applications often require data from outside of the organization, or from
another department in the same organization that uses a different database. In
many such situations, the outside organization or department is not willing to allow direct
access to its database using SQL, but is willing to provide limited forms of
information through predefined interfaces.

In effect, the provider of the information defines procedures whose input and output are
both in XML format. The HTTP protocol is used to communicate the input and
output information, since it is widely used and can go through firewalls that institutions
use to keep out unwanted traffic from the Internet.
66
The Simple Object Access Protocol (SOAP) defines a standard for invoking
procedures, using XML for representing the procedure input and output. SOAP defines a
standard XML schema for representing the name of the procedure, and result status
indicators such as failure/error indicators. The procedure parameters and results are
application-dependent XML data embedded within the SOAP XML headers.

SOAP standard is widely used today. For example, Amazon and Google provide SOAP-
based procedures to carry out search and other activities. These procedures can be
invoked by other applications that provide higher-level services to users.
The SOAP standard is independent of the underlying programming language, and it
is
possible for a site running one language, such as C#, to invoke a service that runs on a
different language, such as Java.
A site providing such a collection of SOAP procedures is called a Web service.
Several standards have been defined to support Web services. The Web Services
Description Language (WSDL) is a language used to describe a Web service’s
capabilities. WSDL provides facilities that interface definitions (or function definitions)
provide in a traditional programming language, specifying what functions are available and
their input and output types.

In addition WSDL allows specification of the URL and network port number to be used to
invoke the Web service. There is also a standard called Universal Description, Discovery,
and Integration (UDDI) that defines how a directory of available Web services may
be created and how a program may search in the directory to find a Web service satisfying
its requirements.

Data Mediation

Once the basic tools are available to extract information from each source, a
mediator application is used to combine the extracted information under a single schema.
This may require further transformation of the XML data from each site, since different
67
sites may
structure the same information differently. They may also use different names for the
same information (for instance, acct number and account id), or may even use the same
name for different information.

The mediator must decide on a single schema that represents all required information, and
must provide code to transform data between different representations. XML query
languages such as XSLT and XQuery play an important role in the task of transformation
between different XML representations.

68

You might also like