DBMS Answer Notes
Q. 1. What is a database?
Ans. A database is a collection of occurrences of multiple record types containing the relationships between records, data aggregates and data items. A database may also be defined as a collection of interrelated data stored together without harmful or unnecessary redundancy (duplicate data) to serve multiple applications. The data are stored so that they are independent of the programs which use them. A common and controlled approach is used in adding new data, modifying and retrieving existing data, or deleting data within the database. A running database has a function in a corporation, factory, government department or other organization: it is used for searching the data to answer queries. A database may be designed for batch processing, real-time processing or online processing.
DATABASE SYSTEM
A database system is an integrated collection of related files along with the details about their definition, interpretation, manipulation and maintenance. It is a system which satisfies the data needs of various applications in an organization without unnecessary redundancy. A database system is based on the data. A database system is run by using software called a DBMS (Database Management System), which also controls the data against unauthorized access.
Foundation Data Concepts
A hierarchy of several levels of data has been devised that differentiates between different groupings, or elements, of data. Data are logically organized into:
Character It is the most basic logical data element. It consists of a single alphabetic, numeric, or other symbol.
Field It consists of a grouping of characters. A data field represents an attribute (a characteristic or quality) of some entity (object, person, place, or event).
Record The related fields of data are grouped to form a record. Thus, a record represents a collection of attributes that describe an entity. Fixed-length records contain a fixed number of fixed-length data fields. Variable-length records contain a variable number of fields and field lengths.
File A group of related records is known as a data file, or table. Files are frequently classified by the application for which they are primarily used, such as a payroll file or an inventory file, or by the type of data they contain, such as a document file or a graphical image file. Files are also classified by their permanence, for example, a master file versus a transaction file. A transaction file contains records of all transactions occurring during a period, whereas a master file contains all the permanent records. A history file is an obsolete transaction or master file retained for backup purposes or for long-term historical storage called archival storage.
Database It is an integrated collection of logically related records or objects. A database consolidates records previously stored in separate files into a common pool of data records that provides data for many applications. The data stored in a database are independent of the application programs using them and of the type of secondary storage devices on which they are stored.
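To make the character/field/record/file hierarchy concrete, here is a minimal SQL sketch; the STUDENT table and its column names are illustrative assumptions, not part of the notes above. Each column is a field (an attribute), each row is a record, and the table as a whole plays the role of a data file.

-- each column below is a field (attribute) of the STUDENT entity
CREATE TABLE STUDENT (
    ROLLNO  INT,          -- numeric field (fixed length)
    SNAME   VARCHAR(25),  -- character field (variable length)
    ADDRESS VARCHAR(50),
    PHONE   CHAR(10)
);

-- one record (row), made up of one value per field
INSERT INTO STUDENT (ROLLNO, SNAME, ADDRESS, PHONE)
VALUES (15, 'Amrita', '101, Kashmir Avenue', '112746');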
Q. 2. What are the various characteristics of DBMS? Ans. The major characteristics of the database approach are:
Self-describing nature of a database system
Insulation between programs and data, and data abstraction
Support of multiple views of the data
Sharing of data and multi-user transaction processing
Q. 3. What are the various characteristics of the DBMS approach?
Ans. 1. Self-contained nature A DBMS contains the data plus a full description of the data (called metadata). Metadata is data about data: data formats, record structures, locations, how to access the data, indexes. Metadata is stored in a catalog and is used by the DBMS software to know how to access the data. Contrast this with the file processing approach, where application programs need to know the structure and format of records and data.
2. Program-data independence Data independence is the immunity of application programs to changes in storage structures and access techniques, e.g. adding a new field, changing an index structure, or changing a data format. In a DBMS environment these changes are reflected in the catalog, so applications aren't affected. Traditional file processing programs would all have to change, possibly substantially.
3. Data abstraction A DBMS provides users with a conceptual representation of data (for example, as objects with properties and inter-relationships). Storage details are hidden. The conceptual representation is provided in terms of a data model.
4. Support for multiple views A DBMS may allow different users to see different views of the database, according to the perspective each one requires. One view may be a subset of the data; for example, the people using the payroll system need not (and should not) see data about students and class schedules. Another view may present data in a different form from the way it is stored; for example, someone interested in student transcripts might get a view formed by combining information from separate files or tables.
5. Centralized control of the data resource The DBMS provides centralized control of data in an organization. This brings a number of advantages: (a) reduces redundancy (b) avoids inconsistencies (c) data can be shared (d) standards can be enforced (e) security restrictions can be applied (f) integrity can be maintained.
a, b. Redundancy and inconsistencies Redundancy is unnecessary duplication of data, for example if the accounts department and the registration department both keep the student name, number and address. Redundancy wastes space and duplicates the effort of maintaining the data. Redundancy also leads to inconsistency. Inconsistent data is data which contradicts itself, e.g. two different addresses for a given student number. Inconsistency cannot occur if data is represented by a single entry (i.e. if there is no redundancy). Controlled redundancy: some redundancy may be desirable (for efficiency). A DBMS should be aware of it and take care of propagating updates to all copies of a data item. This is an objective, not yet fully supported.
c. Sharing Sharing requires concurrency control and multiple user views.
d. Standards E.g. data formats, record structures, naming, documentation; international, organizational, departmental and other standards.
e. Security Restricting unauthorized access; the DBMS should perform security checks on all accesses.
f. Integrity Maintaining the validity of data, e.g. employee numbers must be in some range, every course must have an instructor, student numbers must be unique, hours worked cannot be more than 150. These rules are expressed as constraints, and the DBMS should perform integrity checks on all updates. Current DBMSs provide limited integrity checks; a sketch of how such constraints can be declared follows this list.
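As a hedged illustration of point (f), the constraints above could be declared so that the DBMS checks them automatically on every insert or update. The EMPLOYEE and COURSE table names, their columns and the 1000-9999 range are assumptions invented for this sketch.

CREATE TABLE EMPLOYEE (
    EMPNO        INT NOT NULL UNIQUE
                 CHECK (EMPNO BETWEEN 1000 AND 9999),  -- employee numbers must be in some range
    ENAME        VARCHAR(25) NOT NULL,
    HOURS_WORKED INT CHECK (HOURS_WORKED <= 150)       -- hours worked cannot be more than 150
);

CREATE TABLE COURSE (
    CNO        CHAR(5) PRIMARY KEY,      -- course numbers must be unique
    INSTRUCTOR VARCHAR(25) NOT NULL      -- every course must have an instructor
);

An INSERT that violates any of these rules (for example HOURS_WORKED = 200) is rejected by the DBMS rather than silently stored.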
Q. 3. What are the various types of databases?
Ans. Types of Databases Continuing developments in information technology and its business applications have resulted in the evolution of several major types of databases. Several major conceptual categories of databases that may be found in computer-using organizations include:
Operational Databases These databases store the detailed data needed to support the operations of the entire organization. They are also called subject area databases (SADB), transaction databases, and production databases. Examples are customer databases, personnel databases, inventory databases, and other databases containing data generated by business operations.
Distributed Databases Many organizations replicate and distribute copies or parts of databases to network servers at a variety of sites. These distributed databases can reside on network servers on the World Wide Web, on corporate intranets or extranets, or on other company networks. Distributed databases may be copies of operational or analytical databases, hypermedia or discussion databases, or any other type of database. Replication and distribution of databases is done to improve database performance and security.
External Databases Access to external, privately owned online databases or data banks is available for a fee to end users and organizations from commercial online services, and with or without charge from many sources on the Internet, especially the Web.
Hypermedia Databases A hypermedia database consists of hyperlinked pages of multimedia (text, graphics, photographic images, video clips, audio segments, etc.). From a database management point of view, the set of interconnected multimedia pages at a website is a database of interrelated hypermedia page elements, rather than interrelated data records.
Q. 4. What do you mean by DBMS?
Ans. A DBMS is best described as a collection of programs that manage the database structure and that control shared access to the data in the database. Current DBMSs also store the relationships between the database components; they also take care of defining the required access paths to those components. A database management system (DBMS) is the combination of data, hardware, software and users that helps an enterprise manage its operational data. The main function of a DBMS is to provide efficient and reliable methods of data retrieval to many users. Efficient data retrieval is an essential function of database systems, and the DBMS must be able to deal with several users who try to access several data items simultaneously and, most frequently, the same data item. A DBMS is a set of programs that is used to store and manipulate data, including the following: adding new data, for example adding details of a new student; deleting unwanted data, for example deleting the details of students who have completed the course; changing existing data, for example modifying the fee paid by a student. A database is the information to be stored, whereas the database management system is the system used to manage the database. The database has a structure, which may be regarded in terms of its hardware implementation, called the physical structure, or independently of its hardware implementation, called the logical structure. In either case, the data structure is regarded as static, because a database cannot process anything. The DBMS is regarded as dynamic, because it is through the DBMS
that all database processing takes place. How the DBMS presents data to the user is called the view structure. There are two general modes of data use: queries and transactions. Both forms use the DBMS for processing. A query is processed for presentation in views, and none of its processing is written to the database. A transaction is processed for updating values of the database variables, and these updates are written to the database. A DBMS provides various functions like data security, data integrity, data sharing, data concurrency, data independence, data recovery, etc. However, the database management systems now available in the market, like Sybase, Oracle and MS-Access, do not all provide the same set of functions, though all are meant for data management.
Q. 5. What are the various components of DBMS?
Ans. Basic Components: A database system has four components. These four components are important for understanding and designing the database system. These are: 1. Data 2. Hardware 3. Software 4. Users
1. Data As discussed above, data is the raw information collected by us. Data is made up of data items or data aggregates. A data item is the smallest unit of named data; it may consist of bits or bytes. A data item is often referred to as a field or data element. A data aggregate is a collection of data items within a record which is given a name and referred to as a whole. Data can be collected orally or in writing. A database can be integrated and shared. Data stored in a system may be partitioned into one or more databases, so if data is lost or damaged at one place, it can be accessed from a second place by using the sharing facility of the database system. Shared data can also be reused according to the users' requirements. Data must also be in integrated form. Integration means the data should be in a unique form, i.e. data collected in a well-defined manner with no redundancy; for example, roll numbers in a class are non-redundant and so have a unique existence, but names in a class may be redundant and can create a lot of problems later on in using and accessing the data.
2. Hardware Hardware is also a major and primary part of the database system. Without hardware nothing can be done. Hardware is what we can touch and see, i.e. it has physical existence. All physical items are in this category, for example the input/output and storage devices like keyboard, mouse, scanner, monitor and storage devices (hard disk, floppy disk, magnetic disk and magnetic drum) commonly used with a computer system.
3. Software Software is another major part of the database system. It is the other side of hardware; hardware and software are two sides of a coin and go side by side. Software is further subdivided into two categories: system software (the operating systems, languages and system packages, etc.) and application software (payroll, electricity billing, hospital management, hostel administration, etc.). We can define software as that which we cannot touch and see; software can only be executed. By using software, data can be manipulated, organized and stored.
4. Users
Without users, all of the above components (data, hardware and software) are meaningless. Users collect the data and operate and handle the hardware. The operator also feeds the data and arranges the data in order by executing the software.
Other components
1. People - database administrator, system developers, end users.
2. CASE tools - Computer-Aided Software Engineering (CASE) tools.
3. User interface - Microsoft Access, PowerBuilder.
4. Application programs - PowerBuilder script language, Visual Basic, C++, COBOL.
5. Repository - stores definitions of data (called metadata), screen and report formats, menu definitions, etc.
6. Database - stores the actual occurrences of data.
7. DBMS - provides tools to manage all of this: create data, maintain data, control security access to the data and to the repository, etc.
Q. 6. What are the various functions of DBMS?
Ans. These functions include support for at least all of the following:
Data definition: The DBMS must be able to accept data definitions (external schemas, the conceptual schema, the internal schema, and all associated mappings) in source form and convert them to the appropriate object form.
Data manipulation: The DBMS must be able to handle requests from users to retrieve, update, or delete existing data in the database, or to add new data to the database. In other words, the DBMS must include a data manipulation language (DML) processor component.
Data security and integrity: The DBMS must monitor user requests and reject any attempt to violate the security and integrity rules defined by the DBA.
Data recovery and concurrency: The DBMS - or else some other related software component, usually called the transaction manager - must enforce certain recovery and concurrency controls.
Data dictionary: The DBMS must provide a data dictionary function. The data dictionary can be regarded as a database in its own right (but a system database, rather than a user database). The dictionary contains data about the data (sometimes called metadata) - that is, definitions of other objects in the system - rather than just raw data. In particular, all the various schemas and mappings (external, conceptual, etc.) will physically be stored, in both source and object form, in the dictionary. A comprehensive dictionary will also include cross-reference information, showing, for instance, which programs use which pieces of the database, which users require which reports, which terminals are connected to the system, and so on. The dictionary might even - in fact, probably should - be integrated into the database it defines, and thus include its own definition. It should certainly be possible to query the dictionary just like any other database, so that, for example, it is possible to tell which programs and/or users are likely to be affected by some proposed change to the system.
Performance: It goes without saying that the DBMS should perform all of the functions identified above as efficiently as possible.
Q. 7. What are the advantages and disadvantages of a database approach?
Ans. ADVANTAGES OF DBMS One of the major advantages of using a database system is that the organization's data can be handled easily, with centralized management and control by the DBA. The main advantages of a database management system are given below:
1. Controlling Redundancy
In a DBMS there should be no uncontrolled redundancy (duplicate data). If any duplicate data arises, the DBA can control it and arrange the data in a non-redundant way. Data is stored on the basis of a primary key, which is always a unique key carrying non-redundant information; for example, the roll number is the primary key used to store student data. In traditional file processing, every user group maintains its own files; each group independently keeps files on its data, e.g. students. Therefore much of the data is stored twice or more. Redundancy leads to several problems: duplication of effort; storage space wasted when the same data is stored repeatedly; and files that represent the same data may become inconsistent (since updates are applied independently by each user group). We can instead use controlled redundancy.
2. Restricting Unauthorized Access A DBMS should provide a security and authorization subsystem. Some database users will not be authorized to access all the information in the database (e.g. financial data). Some users are allowed only to retrieve data; some users are allowed both to retrieve and to update the database.
3. Providing Persistent Storage for Program Objects and Data Structures The data structures provided by the DBMS must be compatible with the programming language's data structures. E.g. object-oriented DBMSs are compatible with programming languages such as C++ and Smalltalk, and the DBMS software automatically performs conversions between programming data structures and file formats.
4. Permitting Inferencing and Actions Using Deduction Rules Deductive database systems provide capabilities for defining deduction rules for inferring new information from the stored database facts.
5. Inconsistency can be reduced In a database system, to some extent, data may still be stored in an inconsistent way. Inconsistency is another form of duplication. Suppose that an employee Japneet, who works in the Computer department, is represented by two distinct entries in the database; in this way inconsistent data gets stored, and the DBA can remove it by using the DBMS.
6. Data can be shared In a database system data can easily be shared by different users. For example, student data can be shared by the teaching departments, the administrative block, the accounts branch, the laboratories, etc.
7. Standards can be enforced or maintained By using a database system, standards can be maintained in an organization. The DBA is the overall controller of the database system. A manual database makes standards hard to enforce, but when the DBA uses a DBMS and enters the data into the computer, standards can be enforced and maintained through the computerized system.
8. Security can be maintained Passwords can be applied in a database system, or files can be secured by the DBA. A database system also provides different coding techniques to encode the data, i.e. to protect it from unauthorized access, and it provides a login facility for securing the data against both accidental and intentional threats. Recovery procedures can also be maintained to restore the data by using the DBMS facilities.
9. Integrity can be maintained In a database system, data can be written or stored in an integrated way. Integration means unification and sequencing of data; in other words, the data contained in the database is both accurate and consistent. Data can be accessed if it is
compiled in a unique form. We can take a primary key and some secondary keys for the integration of data. Centralized control can also ensure that adequate checks are incorporated in the DBMS to provide data integrity.
10. Conflicts can be removed In a database system, data is written and arranged in a well-defined manner by the DBA, so there are no conflicts between the databases. The DBA selects the best file structure and access strategy to get better performance for the representation and use of the data.
11. Providing Multiple User Interfaces For example query languages, programming language interfaces, forms, menu-driven interfaces, etc.
12. Representing Complex Relationships Among Data A DBMS can represent complex relationships among data items.
13. Providing Backup and Recovery The DBMS also provides backup and recovery features.
DISADVANTAGES OF DBMS A database management system has many advantages, but because of some major problems that arise in using it, it also has disadvantages. These are explained as follows:
1. Cost A significant disadvantage of a DBMS is cost. In addition to the cost of purchasing or developing the software, the organization will also have to purchase or upgrade the hardware, so it becomes a costly system. Additional cost also occurs due to migration of data from one DBMS environment to another.
2. Problems associated with centralization Centralization also means that the data is accessible from a single source. Since the centralized data can be accessed by many users, weak controls leave the data exposed to unauthorized access, and data can be damaged or lost if the central site fails.
3. Complexity of backup and recovery Backup and recovery are fairly complex in a DBMS environment. In a DBMS, taking a backup of the data may affect the multi-user database system which is in operation. A damaged database can be recovered from the backup copy, but restoring it into a running concurrent multi-user database system is complicated.
4. Confidentiality, Privacy and Security When information is centralized and is made available to users from remote locations, the possibilities of abuse are often greater than in a conventional system. To reduce the chances of unauthorized users accessing sensitive information, it is necessary to take technical, administrative and, possibly, legal measures. Most databases store valuable information that must be protected against deliberate trespass and destruction.
5. Data Quality Since the database is accessible to users remotely, adequate controls are needed to control users updating data and to control data quality. With an increased number of users accessing data directly, there are enormous opportunities for users to damage the data. Unless there are suitable controls, the data quality may be compromised.
6. Data Integrity Since a large number of users could be using a database concurrently, technical safeguards are necessary to ensure that the data remains correct during operation. The main threat to data integrity comes from several different users attempting to update the same data at the same time. The database therefore needs to be protected against inadvertent changes by the users.
7. Enterprise Vulnerability Centralizing all the data of an enterprise in one database may mean that the database becomes an indispensable resource. The survival of the enterprise may depend on reliable information being available from its database. The enterprise therefore becomes vulnerable to the destruction of the database or to unauthorized modification of the database.
8. The Cost of using a DBMS Conventional data processing systems are typically designed to run a number of well-defined, preplanned processes. Such systems are often tuned to run efficiently for the processes that they were designed for. Although conventional systems are usually fairly inflexible, in that new applications may be difficult to implement and/or expensive to run, they are usually very efficient for the applications they are designed for. The database approach, on the other hand, provides a flexible alternative where new applications can be developed relatively inexpensively. The flexible approach is not without its costs, and one of these costs is the additional cost of running applications that the conventional system was designed for. Using standardized software is almost always less machine-efficient than specialized software.
Q. 8. List five significant differences between a file-processing system and a DBMS.
Ans. Before differentiating between file and database systems, there is a need to understand the DBMS and its components. Consider an organization that has a huge amount (collection) of data on its different departments, its employees, its products, sale and purchase orders, etc. Such data is accessed simultaneously by several different employees. Some users issue a number of queries and want answers quickly. If the data is stored in files, slow processing becomes a problem. If we try to deal with this kind of data management problem by storing the data in a collection of operating system files, the technique creates a number of problems or drawbacks, which are discussed below:
1. As we do not have 1000 GB of main memory (primary memory) to store the data, we store it on a permanent storage device (secondary memory) like a magnetic disk or magnetic tape. The file-oriented system fails in such cases, and we apply a database management system to store the data files permanently.
2. Even if we had such a large amount of primary memory on a 16-bit or 32-bit computer system, a file-based system would have problems using the data by direct or random addressing; we also cannot bring more than 2 GB or 4 GB of data directly into primary memory at a time, so a database program is needed to identify and locate the data.
3. Some programs are too lengthy and complex and cannot store large amounts of data in operating system files, but a database system makes this simple and fast.
4. We cannot change and access file-oriented data simultaneously, so we require a type of system which can be used to access a large amount of data concurrently.
5. We also cannot easily recall or recover file-oriented data, but a centralized database management system solves this type of problem.
6. A file-oriented operating system provides only a password mechanism for security, which is not successful when a number of users access the same data using the same login.
In the end we can say that a DBMS is a piece of software that is designed to make data processing faster and easier.
Q. 9. Describe the major advantages of a database system over a file system. Or: Discuss the DBMS and the file processing system, and give the limitations of the file processing system.
Ans. TRADITIONAL FILE PROCESSING Data are organized, stored, and processed in independent files of data records. In the traditional file processing approach, each business application was designed to use one or more specialized data files containing only specific types of data records.
TRADITIONAL FILE SYSTEM OR FILE-ORIENTED APPROACH The business computers of the 1980s were used for processing business records and producing information using the file-oriented approach, or file processing environment. At that time the system was reliable and faster than the manual system of record keeping and processing. In this system the data is organized in the form of different files. Since the system was a collection of files, we can say it was a file-oriented system. The following terms were commonly used in this approach and describe the features of a file-oriented system.
1. Master file The file that is created only once, i.e. at the start of computerization, or a file which rarely changes. For example, in a bank master file the account number, name and balance are entered only once and change infrequently.
2. File activity ratio The number of records processed in one run divided by the total number of records. For example, if we change 100 records in a bank file containing 200 records, then the file activity ratio is 100/200 = 0.5. It should be noted that this ratio is low for a master file.
3. Transaction file A file that is created and updated repeatedly at regular intervals of time. For example, the payroll file of employees is updated at the end of every month.
4. File volatility ratio The number of records updated in a transaction file divided by the total number of records. The file volatility ratio of a transaction file is very high.
5. Work file A temporary file that helps in sorting and merging records from one file to another.
6. File organization The arrangement of records in a particular order. There were three types of file organization: 1. Sequential 2. Direct 3. Indexed sequential.
7. Data island In this system each department has its own files designed for local applications. Each department has its own data processing staff, set of policies, working rules and report formats. This means programs depended on the file structure or format of the file: if the structure of a file changed, the program also had to be changed. The file-oriented approach is still used these days but has the following limitations:
LIMITATIONS OF THE FILE-ORIENTED APPROACH
Duplicate data Since all the files are independent of each other, some of the fields or files are stored more than once. Hence duplication is high in the file approach, whereas a DBMS has controlled duplication.
Separated and isolated data To make a decision, a user might need data from two separate files. First, analysts and programmers evaluated the
files to determine the specific data required from each file and the relationships between the data. Then applications could be written in a third-generation language to process and extract the needed data. Imagine the work involved if data from several files was needed!
Inconsistency In this system, data is not consistent. If a data item is changed, then all the files containing that data item need to be changed and updated properly. If all the files are not updated properly there is a high risk of inconsistency. A DBMS maintains data consistency.
Poor data integrity A file is said to have data integrity when its data items are accurate and are not stored in a contradictory or duplicate manner. It has been seen that file-oriented systems have poor data integrity control. Data integrity is achieved in a DBMS.
Every operation is programmable Processing tasks like searching, editing, deletion, etc. needed separate programs; there were no built-in functions available for these operations. A DBMS has ready-made commands for such operations.
Data inflexibility Program-data interdependency and data isolation limited the flexibility of file processing systems in providing users with ad hoc information requests. Because designing applications was so programming-intensive, MIS department staff usually restricted information requests. Therefore, users often resorted to manual methods to obtain needed information.
Concurrency problem This means using the same record at the same time. This problem was common in the file approach but can be controlled in a DBMS.
Application programs are dependent on the file format In a file processing system the physical formats of the files are coded into the programs. A change in a file means a change in the programs, and vice versa. There is no such problem in a DBMS.
Poor data security All the files are stored as flat or text files. These files can be easily located and accessed because the file approach has no data security.
Difficult to represent complex objects Some objects may require variable-length records, which are hard to handle with this approach. A DBMS has the capability to handle fixed-length records as well as variable-length records.
Cannot support heavy databases Large databases, such as those behind Internet applications, cannot be handled well by the file system; a DBMS like Oracle is used for heavy database applications.
Difficulty in representing data from the user's view To create useful applications for the user, data from various files must often be combined. In file processing it was difficult to determine the relationships between isolated data in order to meet user requirements.
PROBLEMS OF FILE PROCESSING The file processing approach finally became too cumbersome, costly, and inflexible to supply the information needed to manage modern businesses. It was replaced by the database management approach. File processing systems had the following major problems:
Data Redundancy Independent data files included a lot of duplicated data; the same data was recorded and stored in several files. This data redundancy caused problems when data had to be updated, since separate file
maintenance programs had to be developed and coordinated to ensure that each file was properly updated. Unfortunately, a lot of inconsistencies occurred among data stored in separate files.
Lack of Data Integration Having independent files made it difficult to provide end users with information for ad hoc requests that required accessing data stored in several different files. Special computer programs had to be written to retrieve data from each independent file. This was so difficult, time-consuming, and costly for some organizations that it was impossible to provide end users or management with such information.
Data Dependence In file processing systems, major components of the system - the organization of files, their physical locations on storage hardware, and the application software used to access those files - depended on one another in significant ways. Changes in the format and structure of data and records in a file required that changes be made to all of the programs that used that file. This program maintenance effort was a major burden of file processing systems.
Other Problems It was easy for data elements to be defined differently by different end users and applications. Integrity of the data was suspect because there was no control over their use and maintenance by authorized end users.
Q. 10. What are the various types of database users?
Ans. Without users, all of the above components (data, hardware and software) are meaningless. Users collect the data, and operate and handle the hardware; operators also feed the data and arrange it in order by executing the software. Users are mainly of four types. These are:
(a) Naive user A naive user has no knowledge of the database system or its supporting software. These users work at the end-user level, like laymen who have only a little knowledge of the computer system. They are mainly involved in collecting data in notebooks or on pre-designed forms. Automated teller machine (ATM) users are in this category. A naive user can work on any simple GUI-based, menu-driven system; non-computer-professional Internet users are also of this kind.
(b) End User or Data Entry Operators Data entry operators are preliminary computer-based users. The function of data entry operators is only to operate the computer (start/stop the computer) and feed or type the collected information (data) into a menu-driven application program and execute it according to the analyst's requirements. These users are also called online users. They communicate with the database directly via an online terminal or indirectly via a user interface. These users require a certain amount of expertise in the computer programming language, together with complete knowledge of computer operations.
(c) Application programmer Also called simply a programmer, the application programmer develops new projects, i.e. programs for particular applications, or modifies existing programs, working according to instructions given by the database administrator (DBA). An application programmer can handle programming languages like FORTRAN, COBOL, dBase, etc.
(d) DBA (Database Administrator) The DBA is a major user. The DBA may be a single person or a group of persons. The DBA is only the custodian of the data of the business firm or organization, not its owner, just as a bank manager is the DBA of a bank who takes care of the bank's money but does not use it.
Only the DBA can handle the information collected by the end users and give instructions to the application programmer for developing a new program or modifying an existing one. The DBA is also called the overall controller of the organization. In the computer department of a firm, either a system analyst or an EDP (Electronic Data Processing)
Manager works as the DBA. In other words, the DBA is the overall controller of the complete hardware and software.
RESPONSIBILITIES OF DBA As the DBA is the overall commander of a computer system, he or she has a number of duties; some of the major responsibilities are as follows:
1. The DBA controls the data, hardware and software and gives instructions to the application programmers, end users and naive users.
2. The DBA decides the information content of the database and the suitable database file structure for the arrangement of data, using the proper DDL techniques.
3. The DBA compiles the whole data in a particular order and sequence.
4. The DBA decides where data is to be stored, i.e. takes decisions about the storage structure.
5. The DBA decides which access strategy and technique should be used for accessing the data.
6. The DBA communicates with the users through appropriate meetings and co-operates with them.
7. The DBA also defines and applies authorization checks and validation procedures.
8. The DBA takes backups of the data on a backup storage device so that if data is lost it can be recovered, and also recovers damaged data.
9. The DBA changes the environment according to user or industry requirements and monitors performance.
10. The DBA should be a good decision-maker; the decisions taken by the DBA should be correct, accurate and efficient.
11. The DBA should have leadership qualities.
12. The DBA liaises with the users in the business to give them confidence about the availability of data.
Q. 11. Discuss the architecture of database management system.
Ans. DBMS ARCHITECTURE Many different frameworks have been suggested for the DBMS over the last several years. The generalized architecture of a database system is called the ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee) model. In 1972, a final report about databases was submitted by ANSI (American National Standards Institute) and SPARC (Standards Planning and Requirements Committee). According to this approach, three levels of a database system were suggested:
External view (individual user view)
Conceptual view (global or community user view)
Internal level (physical or storage view)
For the system to be usable, it must retrieve data efficiently. This concern has led to the design of complex data structures for the representation of data in the database. Since many database system users are not computer trained, developers hide the complexity from users through several levels of abstraction, to simplify users' interactions with the system. These three views or levels of the architecture are described below.
OBJECTIVES OF THREE-LEVEL ARCHITECTURE The database views were suggested for the following reasons, or objectives, of the levels of a database:
1. To make changes to the database easy when the environment requires them.
2. The external or user views do not depend upon any change made in another view. For example, changes in hardware, the operating system or the internal view should not change the external view.
3. The users of the database should not have to worry about the physical implementation and internal working of the database system.
4. The data should reside at one place and all the users should be able to access it as per their requirements.
5. The DBA can change the internal structure without affecting the users' views.
6. The database should be simple, and changes should be easy to make.
7. It is independent of all hardware and software.
All three levels are described below.
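Before the level-by-level description, here is a minimal SQL sketch of how the same data might appear at the three levels; the EMP table, the view names and the index are assumptions invented for illustration, and the internal-level facilities in particular vary from one DBMS to another.

-- Conceptual/logical level: the DBA defines the global structure of the data.
CREATE TABLE EMP (
    EMPNO  INT PRIMARY KEY,
    ENAME  VARCHAR(25),
    DEPTNO INT,
    SALARY DECIMAL(8,2)
);

-- External/view level: each user group is given only the portion it needs.
CREATE VIEW EMP_ACCOUNTS AS SELECT EMPNO, SALARY FROM EMP;           -- accounts department
CREATE VIEW EMP_PERSONNEL AS SELECT EMPNO, ENAME, DEPTNO FROM EMP;   -- personnel office

-- Internal/physical level: storage structures such as indexes (B-trees, hashing)
-- are chosen for fast access; users of the higher levels never see them.
CREATE INDEX EMP_DEPTNO_IDX ON EMP (DEPTNO);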
External/View level This is the highest level of abstraction, at which only those parts of the entire database which are of concern to a particular user are included. Despite the use of simpler structures at the logical level, some complexity remains because of the large size of the database. Many users of the database system will not be concerned with all this information; instead, such users need to access only a part of the database. So that their
interaction with the system is simplified, the view level of abstraction is defined. The system may provide many views for the same database. Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database. The overall design of the database is called the database schema. Schemas are changed infrequently, if at all. Database systems have several schemas, partitioned according to the levels of abstraction that we discussed. At the lowest level is the physical schema; at the intermediate level is the logical schema; and at the highest level is a subschema. The features of this view are:
The external or user view is at the highest level of the database architecture.
Here only one portion of the database is given to the user, and one portion may have many views.
Many users and programs can use the part of the database they are interested in.
By creating separate views of the database, we can maintain security.
Only limited access (read only, write only, etc.) can be provided in this view.
For example, the head of the accounts department is interested only in accounts and not in library information, while the library department is interested only in books, staff and students; but all such data (students, books, accounts, staff, etc.) is present in one place and every department can use it as per its needs.
Conceptual/Logical level Database administrators, who must decide what information is to be kept in the database, use this level of abstraction. One conceptual view represents the entire database, and there is only one conceptual view per database. The description of data at this level is in a format independent of its physical representation. It also includes features that specify the checks needed to retain data consistency and integrity. The features are:
The conceptual or logical view describes the structure required by many users.
Only the DBA can define it.
It is the global view seen by many users.
It is represented at the middle level of the three-level architecture.
It is defined by giving the name, type and length of each data item.
The CREATE TABLE command of Oracle creates this view.
It is independent of all hardware and software.
Internal/Physical level The lowest level of abstraction describes how the data are stored in the database, and what relationships exist among those data. The entire database is thus described in terms of a small number of relatively simple structures; although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. The features are:
It describes the actual or physical storage of data.
It stores the data on the hardware so that it can be stored and accessed in optimal time.
It is the third level in the three-level architecture.
It covers concepts like B-tree and hashing techniques for the storage of data; primary keys, secondary keys, pointers and sequences for data search; and data compression techniques.
It is represented as
FILE EMP [INDEX ON EMPNO, FIELDS = {EMPNO: BYTE(4), ENAME: BYTE(25)}]
Mapping between views
The conceptual/internal mapping defines the correspondence between the conceptual and internal views; it specifies the mapping from conceptual records to their stored counterparts.
An external/conceptual mapping defines the correspondence between a particular external view and the conceptual view.
A change to the storage structure definition means that the conceptual/internal mapping must be changed accordingly, so that the conceptual schema may remain invariant, achieving physical data independence. A change to the conceptual definition means that the conceptual/external mapping must be changed accordingly, so that the external schema may remain invariant, achieving logical data independence.
Q. 12. Write a note on Database Languages and Interfaces.
Ans. Some main types of languages and facilities are provided by a DBMS:
1. Programming Language
2. Data Manipulation Language
3. Data Definition Language
4. Schema Description Language
5. Sub-Schema Description Language
6. SQL (Structured Query Language)
1. Programming Language Programming languages like COBOL, FORTRAN, C, C++, Pascal, etc. have their own syntax and semantics. They all have a structured and logical form, so they are commonly used to solve general and scientific problems. Business-oriented problems can be solved with third-generation (3GL) and fourth-generation (4GL) languages.
2. DML The language that is used, on its own or from within the programming languages above, to manipulate data in the database is called the data manipulation language (DML). This language creates an interface (linkage) between the user and the application program; it is an extension of the programming language used to manipulate data in the database. DML involves retrieval of data from the database, insertion of new data into the database and deletion or modification of existing data. Some data manipulation operations are also called queries. A query is a statement in the DML that requests the retrieval of data from the database, i.e. a search for data according to the user's requirements. The subset of the DML used to pose queries is known as the query language. DML provides commands to select and retrieve data from the database; other DML commands insert, update and delete records. The commands have different syntax in different programming languages; for example, FORTRAN, COBOL, C, etc. provide this facility with the help of the database management system. The data manipulation functions provided by the DBMS can be invoked in an application program directly by procedural calls or by preprocessor statements handled by the compiler. The DML can be procedural, in which case the user specifies how the data is to be retrieved, or non-procedural, in which case the user indicates only what is to be retrieved. In both cases the DBMS works out the exact retrieval using the DML.
3. DDL
The database management system provides a facility known as the data definition language or data description language (DDL). DDL is used to define the conceptual (global) schema and also to give some details about how to implement this schema on the physical devices used to store the data. The definition includes all the entity sets and their associated attributes as well as the relationships among the entity sets. The definitions also include the constraints that are used by the DML. DDL also maintains metadata (data about the data in the database). The metadata is kept in a data dictionary, directory or system catalog that describes the data. The dictionary contains information about the data stored in the database and is consulted by the DBMS before any data manipulation operation. The DBMS maintains information on the file structures and uses access methods to reach the data efficiently. DDL works in support of the DML. We can also say that there is another language, the data sublanguage (DSL), which is the combination of both DML and DDL: DSL = DML + DDL.
4. Schema Description Language (SDL) or Schema It is necessary to describe the organization of the data in a formal manner. The logical and physical database descriptions are used by the DBMS software. The complete and overall description of the data is referred to as the schema. The terms schema and subschema were brought into DBMS usage by CODASYL (the Conference on Data Systems Languages committee) and by CODASYL's Database Task Group. The schema is also referred to as the conceptual model or global view (community view) of the data. For example, a complete description of the collected data covering all classes and student data, all employees (teaching and non-teaching) and the other data related to a college is called the schema of the college. We can say that we relate the whole college data logically, and this is called the schema.
5. Sub-Schema Description Language The term schema means an overall chart of the data items, types and record types stored in a database. The term subschema refers to an application programmer's view of the data he uses; a subschema is a part of the schema. Many different subschemas can be derived from one schema. An application programmer does not use the whole data, i.e. the full schema. For example, in an organization the purchase order for the maintenance department is a subschema of the whole schema of the purchase department in the whole industry. Two or more application programmers use different subschemas: one programmer, A, uses the subschema purchase-order, whereas programmer B uses the subschema supplier. Their operations and views differ according to their own subschemas, but the two subschemas can be combined on the basis of a common key.
6. Structured Query Language (SQL) SQL originated with System R, a relational prototype; the language was developed in 1974 at IBM's San Jose Research Center. The purpose of the language is to provide non-procedural commands which are used for validating the data and for searching the data. By using this language we can pose any query about the data. SQL grew out of the earlier SQUARE language and served both DDL and DML purposes for System R. SQL-based languages are also called relational languages and are used in commercial RDBMSs; some commonly used SQL-based systems are Oracle, INGRES, Sybase, etc. SQL resembles relational algebra and relational calculus in the relational system approach.
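To make the DDL/DML split concrete, here is a small hedged SQL sketch; the STUDENT table and its columns are invented for illustration and do not come from the notes above.

-- DDL: define the structure (schema) of the data
CREATE TABLE STUDENT (
    ROLLNO INT PRIMARY KEY,
    SNAME  VARCHAR(25),
    FEE    DECIMAL(7,2)
);

-- DML: manipulate the data held in that structure
INSERT INTO STUDENT (ROLLNO, SNAME, FEE) VALUES (15, 'Amrita', 1500);   -- add new data
UPDATE STUDENT SET FEE = 1800 WHERE ROLLNO = 15;                        -- modify existing data
DELETE FROM STUDENT WHERE ROLLNO = 15;                                  -- delete unwanted data

-- Query: the retrieval subset of the DML
SELECT ROLLNO, SNAME FROM STUDENT WHERE FEE > 1000;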
DBMS INTERFACES Types of interfaces provided by the DBMS include:
Menu-Based Interfaces for Web Clients or Browsing These present users with lists of options (menus) and lead the user through the formulation of a request.
A query is composed of selection options from the menus displayed by the system.
Forms-Based Interfaces These display a form to each user. The user can fill out the form to insert new data, or fill out only certain entries. They are designed and programmed for naive users as interfaces to canned transactions.
Graphical User Interfaces These display a schema to the user in diagram form; the user can specify a query by manipulating the diagram. GUIs use both forms and menus.
Natural Language Interfaces These accept requests in written English or other languages and attempt to understand them. The interface has its own schema and a dictionary of important words, and uses the schema and dictionary to interpret a natural language request.
Interfaces for Parametric Users Parametric users have a small set of operations they perform repeatedly. Analysts and programmers design and implement a special interface for each such class of naive users, often with a small set of commands to minimize the number of keystrokes required (i.e. function keys).
Interfaces for the DBA Systems contain privileged commands only for the DBA staff, including commands for creating accounts, setting parameters, authorizing accounts, changing the schema, reorganizing the storage structures, etc.
Q. 13. Describe the Classification of Database Management Systems.
Ans. Categories of DBMS
DBMS (Database Management System) It is software to manage databases. A DBMS is a software component or logical tool to handle databases. All queries from users about the data stored in the database are handled by the DBMS. There are many DBMSs available in the market, like dBase, FoxBASE, FoxPro, Oracle, Unify, Access, etc.
RDBMS (Relational Database Management System) Each database system uses an approach to store and maintain the data. For this purpose three data models were developed: the hierarchical model, the network model and the relational model. In the hierarchical model the data is arranged in the form of trees, in the network model the data is arranged using pointers and networks, and in the relational model the data is arranged in the form of tables. Data stored in the form of tables is easy to store, maintain and understand. Many DBMSs were developed using the hierarchical and network models. Any DBMS that uses the relational data model for data storage and modeling is called an RDBMS. In an RDBMS we can create relationships among tables and access information across tables, while the tables are stored in separate files and may or may not have identical structures. The RDBMS is based upon the rules given by Dr. Codd, known as Codd's rules.
HDBMS (Heterogeneous DBMS) In an RDBMS we store information related to the same kind of data, like student data, teacher data or employee data. In an HDBMS we store data in the database that is of entirely different kinds.
DDBMS (Distributed DBMS) During the 1950s and 1960s there was a trend to use independent or decentralized systems, with duplication of hardware and facilities. In a centralized database system, the DBMS and data reside at a single place and all control and location is limited to a single location, but the PCs are distributed
geographically. A distributed system uses parallel computing across multiple independent computers communicating over a network to accomplish a common objective or task. The types of hardware, programming languages, operating systems and other resources may vary drastically. It is similar to computer clustering, with the main difference being a wide geographic dispersion of the resources. For example, an organization may have an office in one building and many sub-buildings that are connected using a LAN. The current trend is towards distributed systems: a centralized system connected to intelligent remote sites, where each remote site has its own storage and processing capabilities, whereas in a purely centralized or networked system there is a single store.
OODBMS (Object-Oriented DBMS) Object-oriented database management systems (OODBMSs) have been developed to support new kinds of applications for which semantics and content are represented more efficiently with the object model. However, OODBMSs present two main problems:
Impedance mismatch: This is basically due to two reasons. Firstly, the operating system offers no suitable abstractions, so when a client object has to invoke a method offered by a server object and the two objects are not in the same address space, it is necessary to use the mechanisms offered by the operating system, and these mechanisms do not fit the object-oriented paradigm since they are oriented to communicating processes. To solve this problem, intermediate software is included (e.g. COM or CORBA). In the second place, an impedance mismatch is also caused every time the object-oriented application needs to use operating system services.
Interoperability problem between object models: Although different system elements use the object-oriented paradigm, an interoperability problem can exist between them. An application implemented in the C++ language, with the C++ object model, can easily interact with its own objects, but when it wants to use objects that have been created with another programming language or another object-oriented database, an interoperability problem appears.
Programming languages like C, FORTRAN and Pascal use the POP (procedure-oriented programming) approach to develop applications, but the current trend is towards OOP (object-oriented programming). Languages like C++, Java, C# (C Sharp) and Visual Basic 6, and systems like Oracle, use this approach. Many databases have been developed that follow this OO approach, like Oracle. A DBMS which follows the OOP approach is called an OODBMS.
Q. 14. Explain the difference between physical and logical data independence.
Ans. One of the biggest advantages of a database is data independence. It means we can change the schema at one level without affecting the schema at the next higher level, and hence change the structure of a database without affecting the data required by users and programs. This feature was not available in the file-oriented approach. There are two types of data independence:
1. Physical data independence
2. Logical data independence
Data Independence The ability to modify a schema definition at one level without affecting the schema definition at the next higher level is called data independence. There are two levels of data independence:
1. Physical data independence is the ability to modify the physical schema without causing application programs to be rewritten. Modifications at the physical level are occasionally necessary to improve performance.
It means we can change the physical storage level without affecting the conceptual or external view of the data; the changes are absorbed by the mapping techniques.
2. Logical data independence is the ability to modify the logical schema without causing application programs to be rewritten. Modifications at the logical level are necessary whenever the logical structure of the database is altered (for example, when money-market accounts are added to a banking system).
Logical data independence means that if we add new columns to a table, or remove columns from it, the user views and the programs that use them should not need to change. For example, consider two users A and B, both selecting empno and ename. If user B adds a new column salary to his view/table, it will not affect the external view of user A, although the internal view of the database has changed for both users A and B. User A can now also print the salary if required.
User A's External View
(View after adding the new column salary)
It means that if we change the logical level behind a view, programs which use that view need not be changed. Logical data independence is more difficult to achieve than physical data independence, since application programs are heavily dependent on the logical structure of the data that they access. In other words, logical data independence lets us change the conceptual (logical) level without affecting the external views of the data; the external/conceptual mapping absorbs the new changes.
Q. 15. What is physical data independence?
Ans. Physical data independence is the ability to modify the physical schema without causing application programs to be rewritten. Modifications at the physical level are occasionally necessary to improve performance. It means we can change the physical storage level without affecting the conceptual or external view of the data; the new changes are absorbed by the mapping techniques.
Q. 16. What do you mean by data redundancy?
Ans. Redundancy is unnecessary duplication of data, for example if the accounts department and the registration department both keep a student's name, number and address. Redundancy wastes space and duplicates the effort of maintaining the data. Redundancy also leads to inconsistency. Inconsistent data is data which contradicts itself, e.g. two different addresses for a given student number. Inconsistency cannot occur if each item of data is represented by a single entry (i.e. if there is no redundancy).
Controlled redundancy: Some redundancy may be desirable (for efficiency). The DBMS should be aware of it and take care of propagating updates to all copies of a data item. This is an objective that is not yet fully supported by current systems.
Q. 17. What do you mean by database schema?
Ans. It is necessary to describe the organization of the data in a formal manner. The logical and physical database descriptions are used by the DBMS software. The complete and overall description of the data is referred to as the schema. The terms schema and subschema were brought into DBMS usage by CODASYL
(Conference on Data Systems Languages committee) and by CODASYL's Database Task Group. The schema is also referred to as the conceptual model or global view (community view) of the data. For example, the complete description of the collected data of a college - all classes and student data, all employees (teaching and non-teaching) and the other data related to the college - is called the schema of the college. We can say that the whole college data, related logically, forms the schema.
Q. 18. Explain the distinctions among the terms primary key, candidate key and superkey. Or What is the significance of foreign key? Or What are the various keys?
Ans. Keys: A number of keys can be defined, but the most commonly used keys are explained below:
1. Primary Key: A key is a single attribute, or a combination of two or more attributes, of an entity that is used to identify one or more instances of the entity set. The attribute Roll # uniquely identifies an instance of the entity set STUDENT. It tells us about the student Amrita, having address 101, Kashmir Avenue and phone no. 112746, who has paid fees of 1500, on the basis of Roll No. 15. The value 15 is unique and identifies exactly one student; Roll No is therefore a unique attribute, and such a unique identifier is called the primary key. Primary key values cannot be duplicated. From the definition of candidate key, it should be clear that each relation must have at least one candidate key, even if it is the combination of all the attributes in the relation, since all tuples in a relation are distinct. Some relations may have more than one candidate key. As discussed earlier, the primary key of a relation is an arbitrarily but permanently selected candidate key. The primary key is important since it is the sole identifier for the tuples in a relation. Any tuple in a database may be identified by specifying the relation name, the primary key and its value. Also, for a tuple to exist in a relation it must be identifiable, and therefore it must have a primary key. The relational data model therefore imposes the following two integrity constraints:
(a) No component of a primary key value can be null;
(b) Attempts to change the value of a primary key must be carefully controlled.
The first constraint is necessary because if we want to store information about some entity, then we must be able to identify it, otherwise difficulties are likely to arise. For example, if a relation CLASS (STUNO, LECTURER, CNO) has (STUNO, LECTURER) as the primary key, then allowing tuples like (3123, NULL, CP302) and (NULL, SMITH, CP302) is going to lead to ambiguity, since the two tuples above may or may not be identical and the integrity of the database may be compromised. Unfortunately, some commercial database systems do not fully enforce the concept of primary key, so a database state in which the integrity of the database is violated is possible. The second constraint deals with changing primary key values. Since the primary key is the tuple identifier, changing it needs very careful control. Codd has suggested three possible approaches:
Method 1: Only a select group of users is authorised to change primary key values.
Method 2: Updates on primary key values are banned. If it were necessary to change a primary key, the tuple would first be deleted and then a new tuple with the new primary key value but the same other values would be inserted. Of course, this does require that the old values of the attributes be remembered and reinserted in the database.
Method 3: A different command for updating primary keys is made available. Making a distinction between altering the primary key and altering another attribute of a relation would remind users that care needs to be taken in updating primary keys.
2. Secondary Key: A key which does not give unique identification and may contain duplicate information is called a secondary key. For example, in a STUDENT entity, if Roll Number is the primary key, then the name of the student, the address, the phone number and the fees paid are all secondary keys. A secondary key is an attribute, or combination of attributes, that is not the primary key and may hold duplicate data. In other words, a secondary key is used after identification by the primary key; we can also identify data from a combination of secondary keys.
3. Super Key: If we add additional attributes to a primary key, the resulting combination still uniquely identifies an instance of the entity set. Such keys are called super keys; a primary key is therefore a minimal super key. For example, if Roll No is the primary key, then adding another attribute such as DOB (date of birth) gives the combination (Roll No, DOB), which still identifies each instance uniquely; such a key is a super key. Super keys are rarely used directly in a small database file, but the concept is useful because a super key describes every identifying combination of attributes in a relation.
4. Candidate Key: There may be two or more attributes, or combinations of attributes, that uniquely identify an instance of an entity set. These attributes or combinations of attributes are called candidate keys; each candidate key gives unique identification, and the primary key is chosen from among them. A candidate key may also be a combination of two or more attributes; e.g. if Roll No. and student Name are two different attributes, the combination (Roll No., Name) can serve as a candidate key, since it is unique and identifies a particular roll number and a particular name.
5. Alternative Key: A candidate key which is not the primary key is called an alternative key. For example, if Roll No. and Name are both candidate keys and Roll No. is chosen as the primary key, then Name works as the alternative key.
6. Foreign Key: Suppose there are relations SP (S#, P#, QTY), S (S#, SName, Status, City) and P (P#, PName, Color, Weight, City). The entity SP is defined as the relationship between the relations S and P. These two relations have S# and P# as the primary keys of S and P respectively. If we take P# as the primary key of SP, then S#, which is a primary key in its own relation S but does not act as the primary key in SP, is called a foreign key; if S# is taken as the primary key, then P# is the foreign key. Similarly, in a relation ASSIGNMENT with attributes Emp #, Prod # and Job #, if Emp # and Prod # form the primary key, then Job # acts as a foreign key.
Q. 19. What are the major functions of a database administrator?
Ans. RESPONSIBILITIES OF DBA: As we know, the DBA is the overall controller of the database system, so he/she has a number of duties; some of the major responsibilities are as follows:
1. The DBA controls the data, hardware and software, and gives instructions to application programmers, end users and naive users.
2. The DBA decides the information content of the database. He/she decides the suitable database file structures for the arrangement of data, using the proper DDL techniques.
3. The DBA compiles the whole data in a particular order and sequence.
4. The DBA decides where data is to be stored, i.e. takes decisions about the storage structure.
5. The DBA decides which access strategy and technique should be used for accessing the data.
6. The DBA communicates with users through appropriate meetings and co-operates with them.
7. The DBA defines and applies authorization checks and validation procedures.
8. The DBA takes backups of the data on backup storage devices so that, if data is lost, it can be recovered; the DBA also recovers damaged data.
9. The DBA changes the environment according to user or industry requirements and monitors performance.
10. The DBA should be a good decision-maker; the decisions taken should be correct, accurate and efficient.
11. The DBA should have leadership qualities.
12. The DBA liaises with the users in the business to give them confidence about the availability of data.
Q. 20. What do you mean by relationships? Explain different types of relationships.
Ans. Relationships: One table (relation) may be linked with another in what is known as a relationship. Relationships may be built into the database structure to facilitate the operation of relational joins at runtime.
1. A relationship between two tables is known as a one-to-many, parent-child or master-detail relationship, where an occurrence on the one (parent, master) table may have any number of associated occurrences on the many (child, detail) table. To achieve this, the child table must contain fields which link back to the primary key of the parent table. These fields on the child table are known as a foreign key, and the parent table is referred to as the foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding records on the child table, but it should not be possible for an entry on the child table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a foreign key which points back to the primary key. Note that these two keys cannot be comprised of exactly the same fields, otherwise a record could only ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key, but if this were changed it could result in the link to the child table being broken.
7. Some database engines allow relationships to be managed by rules known as referential integrity or foreign key constraints. These will prevent entries on child tables from being created if the foreign key does not exist on the parent table, or will deal with entries on child tables when the entry on the parent table is updated or deleted.
Relational Joins: The join operator is used to combine data from two or more relations (tables) in order to satisfy a particular query. Two relations may be joined when they share at least one common attribute. The join is implemented by considering each row in an instance of each relation. A row in relation R1 is joined to a
row in relation R2 when the value of the common attribute(s) is equal in the two relations. The join of two relations is often called a binary join. The join of two relations creates a new relation. The notation R1 x R2 indicates the join of relations R1 and R2. For example, consider the following:
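As a concrete sketch in SQL (the table definitions R1(A, B, C) and R2(B, D, E) and the sample values are assumed here to match the rows discussed below):

    CREATE TABLE R1 (A INT, B INT, C INT);
    CREATE TABLE R2 (B INT, D INT, E INT);
    INSERT INTO R1 VALUES (2, 4, 5);    -- sample tuple (2 4 5)
    INSERT INTO R2 VALUES (4, 7, 4);    -- sample tuple (4 7 4)
    INSERT INTO R2 VALUES (6, 2, 3);    -- sample tuple (6 2 3)
    -- Join R1 and R2 on the common attribute B
    SELECT R1.A, R1.B, R1.C, R2.D, R2.E
    FROM R1, R2
    WHERE R1.B = R2.B;                  -- returns the single row (2, 4, 5, 7, 4)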
Note that the instances of relations R1 and R2 contain the same data values for attribute B. Data normalisation is concerned with decomposing a relation (e.g. R(A,B,C,D,E)) into smaller relations (e.g. R1 and R2). The data values for attribute B in this context will be identical in R1 and R2: the instances of R1 and R2 are projections of the instances of R(A,B,C,D,E) onto the attributes (A,B,C) and (B,D,E) respectively. A projection does not eliminate data values: duplicate rows are removed, but this does not remove a data value from any attribute. The join of relations R1 and R2 is possible because B is a common attribute. The result of the join is:
The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the row (4 7 4) from relation R2. The two rows were joined since each contained the same value for the common attribute B. The row (2 4 5) was not joined to the row (6 2 3) since the values of the common attribute (4 and 6) are not the same. The relations joined in the preceding example shared exactly one common attribute. However, relations may share multiple common attributes. All of these common attributes must be used in creating a join. For example, the instances of relations R1 and R2 in the following example are joined using the common attributes B and C: Before the join:
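A similar hedged sketch for a join over two common attributes; the schemas R1(A, B, C) and R2(B, C, D) and the values are again assumed to match the rows discussed next:

    -- a fresh pair of tables for this second example
    CREATE TABLE R1 (A INT, B INT, C INT);
    CREATE TABLE R2 (B INT, C INT, D INT);
    INSERT INTO R1 VALUES (6, 1, 4);
    INSERT INTO R2 VALUES (1, 4, 9);
    INSERT INTO R2 VALUES (1, 2, 1);
    -- Both common attributes B and C must match
    SELECT R1.A, R1.B, R1.C, R2.D
    FROM R1, R2
    WHERE R1.B = R2.B AND R1.C = R2.C;   -- returns only (6, 1, 4, 9)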
The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row (1 4 9) from relation R2. The join was created since the common set of attributes (B and C) contained identical values (1 and 4). The row (6 1 4) from R1 was not joined to the row (1 2 1) from R2 since the common attributes did not share identical values - (1 4) in R1 and (1 2) in R2. The join operation provides a method for reconstructing a relation that was decomposed into two relations during the normalisation process. The join of two rows, however, can create a new row that was not a member of the original relation. Thus invalid information can be created during the join process.
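A compact SQL illustration of how a join can manufacture such invalid rows. This is a simplified sketch with hypothetical TAKES and TEACHES tables, not the R1-R4 relations used in the example below:

    -- Original facts: Smith is taught Maths by Jenkins, Jones is taught Maths by Goldman
    CREATE TABLE TAKES   (STUDENT VARCHAR(20), COURSE VARCHAR(20));
    CREATE TABLE TEACHES (COURSE VARCHAR(20), INSTRUCTOR VARCHAR(20));
    INSERT INTO TAKES   VALUES ('Smith', 'Maths');
    INSERT INTO TAKES   VALUES ('Jones', 'Maths');
    INSERT INTO TEACHES VALUES ('Maths', 'Jenkins');
    INSERT INTO TEACHES VALUES ('Maths', 'Goldman');
    -- Joining on COURSE yields four rows, including the spurious
    -- ('Smith', 'Maths', 'Goldman') and ('Jones', 'Maths', 'Jenkins')
    SELECT TAKES.STUDENT, TAKES.COURSE, TEACHES.INSTRUCTOR
    FROM TAKES, TEACHES
    WHERE TAKES.COURSE = TEACHES.COURSE;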
Now suppose that a list of courses with their corresponding room numbers is required. Relations R1 and R4 contain the necessary information and can be joined using the attribute HOUR. The result of this join is:
This join creates the following invalid information (denoted by the coloured rows): Smith, Jones, and Brown take the same class at the same time from two different instructors in two different rooms. Jenkins (the Maths teacher) teaches English. Goldman (the English teacher) teaches Maths. Both instructors teach different courses at the same time. Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result would be:
Jenkins teaches Math I and Algebra simultaneously at both 8:00 and 9:00. A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting relation with R4 (using both INSTRUCTOR and HOUR). The result would be:
Extracting the COURSE and ROOM attributes (and eliminating the duplicate row produced for the English course) would yield the desired result:
The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless join property. A relational database is in 4th normal form when the lossless join property can be used to answer unanticipated queries. However, the choice of joins must be evaluated carefully. Many different sequences of joins will recreate an instance of a relation; some sequences are more desirable since they result in the creation of less invalid data during the join operation. Suppose that a relation is decomposed using functional dependencies and multi-valued dependencies. Then at least one sequence of joins on the resulting relations exists that recreates the original instance with no invalid data created during any of the join operations. For example, suppose that a list of grades by room number is desired. This question, which was probably not anticipated during database design, can be answered without creating invalid data by either of the following two join sequences:
The required information is contained within relations R2 and R4, but these relations cannot be joined directly. In this case the solution requires joining all 4 relations. The database may require a lossless join relation, which is constructed to assure that any ad hoc inquiry can be answered with relational operators. This relation may contain attributes that are not logically related to each other; this occurs because the relation must serve as a bridge between the other relations in the database. For example, the lossless join relation will contain all attributes that appear only on the left side of a functional dependency. Other attributes may also be required, however, in developing the lossless join relation. Consider the relational schema R(A, B, C, D) with the functional dependencies A → B and C → D. The relations R1(A, B) and R2(C, D) are in 4th normal form. A third relation, R3(A, C), however, is required to satisfy the lossless join property; this relation can be used to join attributes B and D. This is accomplished by joining relations R1 and R3 and then joining the result to relation R2. No invalid data is created during these joins. The relation R3 is the lossless join relation for this database design. A relation is usually developed by combining attributes about a particular subject or entity. The lossless join relation, however, is developed to represent a relationship among various relations. The lossless join relation may be difficult to populate initially and difficult to maintain, a result of including attributes that are not logically associated with each other. The attributes within a lossless join relation often contain multi-valued dependencies. Consideration of 4th normal form is important in this situation: the lossless join relation can sometimes be decomposed into smaller relations by eliminating the multi-valued dependencies. These smaller relations are easier to populate and maintain.
Q. 21. What is an ER-diagram? Construct an ER diagram for a hospital with a set of patients and a set of doctors. Associate with each patient a log of the various tests and examinations conducted. Or Discuss in detail the ER diagram. Or What is a one to many relationship? Give examples. Or Draw an ER diagram for a library management system; make suitable assumptions. Describe the various symbols used in an ER diagram. Or Construct an ER diagram for a university registrar's office. The office maintains data about each class, including the instructor, the enrollment and the time and place of the class meetings. For each student-class pair, a grade is recorded. Also design a relational database for the said E-R diagram.
Ans. The E-R model grew out of the exercise of using commercially available DBMSs to model application databases. Earlier DBMSs were based on the hierarchical and network approaches; E-R is a generalization of these models. Although it has some means of describing the physical database model, it is basically useful in the design of the logical database model. This analysis is then used to organize data as relations, normalize the relations and finally obtain a relational database model. The entity-relationship model for data uses three features to describe data. These are:
1. Entities, which specify distinct real-world items in an application.
2. Relationships, which connect entities and represent meaningful dependencies between them.
3. Attributes, which specify properties of entities and relationships.
We illustrate these terms with an example. A vendor supplying items to a company, for example, is an entity. The item he supplies is another entity. A vendor and an item are related in the sense that a vendor supplies an item; the act of supplying defines a relationship between a vendor and an item. An entity set is a collection of similar entities, so we can define a vendor set and an item set. Each member of an entity set is described by some attributes. For example, a vendor may be described by the attributes (vendor code, vendor name, address), and an item may be described by the attributes (item code, item name). A relationship can also be characterized by a number of attributes. We can think of the relationship supply between vendor and item entities; the relationship supply can be described by the attributes (order no., date of supply).
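One possible relational sketch of these entity and relationship sets in SQL; the table and column names are an assumption, not something fixed by the E-R model itself:

    CREATE TABLE VENDOR (
      VENDOR_CODE  VARCHAR(10) PRIMARY KEY,
      VENDOR_NAME  VARCHAR(40),
      ADDRESS      VARCHAR(80)
    );
    CREATE TABLE ITEM (
      ITEM_CODE    VARCHAR(10) PRIMARY KEY,
      ITEM_NAME    VARCHAR(40)
    );
    -- The relationship "supply" carries its own attributes (order no., date of supply)
    CREATE TABLE SUPPLY (
      ORDER_NO       VARCHAR(10) PRIMARY KEY,
      VENDOR_CODE    VARCHAR(10) REFERENCES VENDOR(VENDOR_CODE),
      ITEM_CODE      VARCHAR(10) REFERENCES ITEM(ITEM_CODE),
      DATE_OF_SUPPLY DATE
    );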
Relationship between Entity Sets: The relationship between entity sets may be many-to-many (M:N), one-to-many (1:M), many-to-one (M:1) or one-to-one (1:1). The 1:1 relationship between entity sets E1 and E2 indicates that for each entity in either set there is at most one entity in the second set associated with it. The 1:M relationship from entity set E1 to E2 indicates that for an occurrence of an entity from the set E1, there could be zero, one or more entities from the entity set E2 associated with it. Each entity in E2 is
associated with at most one entity in the entity set E1. In the M:N relationship between entity sets E1 and E2, there is no restriction on the number of entities in one set associated with an entity in the other set. The database structure, employing the E-R model, is usually shown pictorially using an entity-relationship (E-R) diagram. To illustrate these different types of relationships consider the following entity sets: DEPARTMENT, MANAGER, EMPLOYEE and PROJECT. The relationship between a DEPARTMENT and a MANAGER is usually one-to-one: there is only one manager per department and a manager manages only one department. This relationship between entities is shown in the figure; each entity is represented by a rectangle and the relationship between them is indicated by a direct line. The relationship from MANAGER to DEPARTMENT and from DEPARTMENT to MANAGER is 1:1 in both directions. Note that a one-to-one relationship between two entity sets does not imply that for an occurrence of an entity from one set, at any time, there must be an occurrence of an entity in the other set. In the case of an organization, there could be times when a department is without a manager or when an employee who is classified as a manager is without a department to manage. The figure shows some instances of one-to-one relationships between the entities DEPARTMENT and MANAGER.
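One way to sketch such a 1:1 relationship in SQL is to give the foreign key a UNIQUE constraint, so that a manager can manage at most one department; the table and column names below are illustrative:

    CREATE TABLE MANAGER (
      MGR_ID   INT PRIMARY KEY,
      MGR_NAME VARCHAR(40)
    );
    CREATE TABLE DEPARTMENT (
      DEPT_ID   INT PRIMARY KEY,
      DEPT_NAME VARCHAR(40),
      MGR_ID    INT UNIQUE REFERENCES MANAGER(MGR_ID)  -- NULL allowed: a department may be without a manager
    );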
A one-to-many relationship exists from the entity MANAGER to the entity EMPLOYEE, because there are several employees reporting to the manager. As we just pointed out, there could be an occurrence of the entity type MANAGER having zero occurrences of the entity type EMPLOYEE reporting to him or her. The reverse relationship, from EMPLOYEE to MANAGER, is many-to-one, since many employees may be supervised by a single manager. However, given an instance of the entity set EMPLOYEE, there can be only one instance of the entity set MANAGER to whom that employee reports (assuming that no employee reports to more than one manager). The relationship between the entities is illustrated in the figures, which show some instances of this relationship.
Figure: 1:M Relationship
Figure: Instances of 1: M Relationship The relationship between the entity EMPLOYEE and the entity PROJECT can be derived as follows: Each employee could be involved in a number of different projects, and a number of employees could be
working on a given project. This relationship between EMPLOYEE and PROJECT is many-to-many. It is illustrated in the figures, which show some instances of such a relationship.
Figure: M:N Relationship
Figure: Instances of M:N Relationship In the entity-relationship (E-R) diagram, entities are represented by rectangles, relationships by a diamond-shaped box and attributes by ellipses or ovals. The following E-R diagram for vendor, item and their relationship is illustrated in Figure (a).
Figure (a): E-R diagram for vendor, item and their relationship
Representation of Entity Sets in the form of Relations: Entity-relationship diagrams are useful for representing the relationships among entities; they show the logical model of the database. E-R diagrams allow us to have an overview of the important entities for developing an information system and of their relationships. Having obtained the E-R diagrams, the next step is to replace each entity set and relationship set by a table or relation. Each table has a name; the name used is the entity name. Each table has a number of rows and columns: each row corresponds to a member of the entity set, and each column corresponds to an attribute. Thus, in the E-R diagram, the vendor entity is replaced by the table below.
Table: Table for the Entity Vendor
The above table is also known as a relation. Vendor is the relation name. Each row of a relation is called a tuple. The titles used for the columns of a relation are known as relation attributes. Each tuple in
the above example describes one vendor. Each element of a tuple gives a specific property of that vendor, and each property is identified by the title used for the attribute column. In a relation the rows may be in any order, and the columns may also be depicted in any order; no two rows can be identical. Since it is inconvenient to show the whole table corresponding to a relation, a more concise notation is used to depict a relation: it consists of the relation name and its attributes, with the identifier of the relation shown in bold face. A specified value of a relation identifier uniquely identifies a row of the relation. If a relationship is M:N, then the identifier of the relationship entity is a composite identifier which includes the identifiers of the entity sets that are related. On the other hand, if the relationship is 1:N, then the identifier of the relationship entity is the identifier of one of the entity sets in the relationship. For example, the relations and identifiers corresponding to the E-R diagram of the figure are as shown:
Figure: E-R Diagram for Teacher, Student and their relationship
Teacher (Teacher-id, name, department, address)
Teaches (Teacher-id, Student-id)
Student (Student-id, name, department, address)
One may ask why an entity set is being represented as a relation. The main reasons are the ease of storing relations as flat files in a computer and, more importantly, the existence of a sound theory of relations, which ensures good database design. The raw relations obtained as a first step in the above examples are transformed into normalized relations. The rules for these transformations, called normalization, are based on sound theoretical principles and ensure that the final normalized relations reduce duplication of data, ensure that no mistakes occur when data are added or deleted, and simplify retrieval of the required data.
Q. 22. Discuss the relational approach of database management systems. Explain with the help of suitable relational operations to demonstrate insert, delete and update functions. Or What is the relational model? Compare and contrast it with the network and hierarchical models.
Ans. Database models are collections of conceptual tools for describing data, data semantics and data constraints. A DBMS has a number of ways to represent data; some important and commonly used models are of four types, among which three are mainly used. These are:
I. Relational Model or Relational Approach
II. Hierarchical Model or Hierarchical Approach
III. Network Model or Network Approach
I. Relational Data Model
The relational data model has been developed from deep research, and has been tested and refined through many stages. This model has the advantages that it is simple to implement and easy to understand, and we can express queries in it using a query language. In this model a relation is constructed by setting the association among the attributes of an entity as well as the relationships among different entities. One of the main reasons for introducing this model was to increase the productivity of application programmers by eliminating the need to change application programs when a change is made to the database; the user need not know the exact physical structure. The data structure used in the model represents both entities and the relationships between them. We can explain the relational view of data with the following example. Suppose there are three tables in which data is organized: the Supplier table (S table or S relation), the Part table (P table or P relation) and the Shipment table (SP table or SP relation). The S table has the fields, or attributes, supplier number (S#), supplier name, status of the supplier and the city in which the supplier resides. Similarly, the P table has the fields part number (P#), part name, part color, weight of the part and the location where the part is stored. The SP table contains the fields supplier number (S#), part number (P#) and the quantity which the supplier can ship. Each supplier has a unique supplier number S# and, similarly, each part has a unique part number P#. These three tables are called relational tables. The S table is also called the S relation because it gives the relationship between different attributes. These attributes are field names and appear as columns; the rows of such a table are called tuples. The pool of values from which the actual values appearing in a given column are drawn is called a domain. For example, in the S table, S#, SName and Status are attributes, and the supplier numbers such as S1, S2, S3 are values drawn from the domain of S#. A relational table or relation can be defined as follows.
Definition: A relation represented by a table having n columns, defined on domains D1, D2, ..., Dn, is a subset of the cartesian product D1 x D2 x ... x Dn. Another definition: given a collection of sets D1, D2, D3, ..., Dn, R is a relation on these n sets if it is a set of ordered n-tuples such that the value of each attribute belongs to its corresponding set D1, D2, ..., Dn. These three relations are represented by the diagrams:
S table (Entity) or S Relation:
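A SQL sketch of the three relations, followed by insertion, modification and deletion on the S table; the column names follow the description above and the sample values are assumed:

    CREATE TABLE S (
      SNO    VARCHAR(5) PRIMARY KEY,   -- supplier number S#
      SNAME  VARCHAR(30),
      STATUS INT,
      CITY   VARCHAR(30)
    );
    CREATE TABLE P (
      PNO    VARCHAR(5) PRIMARY KEY,   -- part number P#
      PNAME  VARCHAR(30),
      COLOR  VARCHAR(15),
      WEIGHT INT,
      CITY   VARCHAR(30)
    );
    CREATE TABLE SP (
      SNO VARCHAR(5) REFERENCES S(SNO),  -- foreign key to the S relation
      PNO VARCHAR(5) REFERENCES P(PNO),  -- foreign key to the P relation
      QTY INT,
      PRIMARY KEY (SNO, PNO)
    );
    -- Insert, update and delete on the S relation
    INSERT INTO S VALUES ('S1', 'Smith', 20, 'London');
    UPDATE S SET STATUS = 30 WHERE SNO = 'S1';
    DELETE FROM S WHERE SNO = 'S1';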
As in the S table, insertion, deletion and modification can be done easily.
II. Hierarchical Model
It is a tree structure: it has one root and many branches, and we call it a parent-child relationship. In this model a single file has relations with many files; we can say that it is the arrangement of individual
data with group data. In an organization chart the manager is the parent (root) and the employees working under the manager are the children. The representation of this model is expressed by linking different tables. Such a representation is suitable when one record is linked to many others, but it can sometimes create ambiguity in designing and defining the associations and relationships between records.
SP table (Entity) or SP Relation:
In the hierarchical approach, insertion can be done only if a child has a parent, and insertion on the child side is easy. Deletion is also straightforward, but you cannot delete a parent that still has one or more children, and in the parent-child relationship updation of both parent and child is difficult.
III. Network Approach
It is a complex approach of DBMS. In this model we link all the records by using chains or pointers, and it supports many-to-many relationships. The network approach is used when there is more than one relationship in the database system. It starts from one point and, after connecting similar types of data, returns back to the same record. The network approach is more symmetric than the hierarchical structure. In the network model insertion at any point is complex: we can insert only by creating a new record linked to other records. Deletion is also complex: if we delete a record the chain is disconnected and the structure breaks. Updation is complex as well, because records are connected with each other and cannot be changed in isolation.
Difference between Relational, Hierarchical and Network Approaches:
(A) Relational Approach: The Relational Approach (RA) has relationships between different entities and among the attributes of a particular entity. RA is in tabular form. RA models one-to-one relationships directly. RA tables are in asymmetric form. Insertion, deletion and updation in a relational table are very easy. Languages used in RA are SQL, Ingres, Oracle and Sybase. RA is simple in nature. The relational approach creates relationships between different entities and between different attributes in the same entity; it is the best of these models for representing data.
(B) Hierarchical Approach: The Hierarchical Approach (HA) creates a linkage between two or more entities. HA has parent-child relationships and models one-to-many relationships. The HA relationship is in symmetric form, defined by parents and their children. Insertion, deletion and updation are a little more difficult than in RA. HA uses the IMS language. It is complex in nature.
(C) Network Approach: The Network Approach (NA) has chains among many entities, using chaining or pointer techniques. NA has many-to-many relationships. The NA relationship is in completely symmetric form because it has one chain symmetry. Insertion, deletion and updation are very difficult. NA has the DBTG (Database Task Group) set, having different classes and members. It is more complex than RA and HA.
Q. 23. What is the usage of unified modelling language (UML)?
Ans. UML is a graphical language for visualizing, specifying, constructing and documenting the artifacts of an object-oriented, software-intensive system.
Q. 24. What are graphical user interfaces?
Ans. A graphical user interface (GUI), sometimes pronounced "gooey", is a method of interacting with a computer through a metaphor of direct manipulation of graphical images and widgets in addition to text. GUIs display visual elements such as icons, windows and other gadgets.
Q. 25. Define the term dangling pointer.
Ans. A pointer that points to nothing (for example, to a record that has been deleted) is called a dangling pointer.
Q. 26. Write a short note on Mapping.
Ans. Mappings: The conceptual/internal mapping defines the correspondence between the conceptual and internal views; it specifies the mapping from conceptual records to their stored counterparts. An external/conceptual mapping defines the correspondence between a particular external view and the conceptual view. A change to the storage structure definition means that the conceptual/internal mapping must be changed accordingly, so that the conceptual schema may remain invariant, achieving physical data independence. A change to the conceptual definition means that the conceptual/external mapping must be changed accordingly, so that the external schema may remain invariant, achieving logical data independence.
Q. 27. Distinguish between RDBMS and DBMS.
Ans.
Chapter 2 : Relational Model, Language and Systems
Q. 1. What is relational algebra?
Ans. Relational Algebra and Relational Calculus are two approaches to specifying manipulations on relational databases. The distinction between them is somewhat analogous to that between procedural and declarative programming. Relational Algebra is equivalent to Relational Calculus, in that every expression in one has an equivalent expression in the other. Thus, relational completeness of a database language can also be established by showing that it can define any relation expressible in Relational Algebra. Relational Algebra comprises a set of basic operations. An operation is the application of an operator to one or more source (or input) relations to produce a new relation as a result. More abstractly, we can think of such an operation as a function that maps arguments from specified domains to a result in a specified range; in this case, the domain and range happen to be the same, i.e. relations.
Q. 2. Define relational algebra. Explain the various traditional set operations and relational operations of it. Or Discuss the basic operations that can be performed using relational algebra and SQL.
Ans. Relational Algebra comprises a set of basic operations. An operation is the application of an operator to one or more source (or input) relations to produce a new relation as a result. More abstractly, we can think of such an operation as a function that maps arguments from specified domains to a result in a specified range; in this case, the domain and range happen to be the same, i.e. relations. Relational Algebra is a procedural language: it specifies the operations to be performed on existing relations to derive result relations, and therefore it defines the complete schema for each of the result relations. The relational algebraic operations can be divided into basic set-oriented operations and relational-oriented operations. The former are the traditional set operations; the latter are those for performing joins, selection, projection and division. Relational Algebra is a collection of operations to manipulate relations: each operation takes one or more relations as its operands and produces another relation as its result. Some mainly used operations are join, selection and projection.
Basic Operations: The basic operations are the traditional set operations: union, difference, intersection and cartesian product. Three of these four basic operations - union, intersection and difference - require that the operand relations be union compatible. Two relations are union compatible if they have the same degree (number of attributes) and a one-to-one correspondence of attributes, with corresponding attributes defined over the same domain. The cartesian product can be defined on any two relations. Two relations P and Q are said to be union compatible if both P and Q are of the same degree n and the domains of the corresponding n attributes are identical, i.e. if P = P[P1, ..., Pn] and Q = Q[Q1, ..., Qn], then Dom(Pi) = Dom(Qi) for i = 1, 2, ..., n, where Dom(Pi) represents the domain of the attribute Pi.
Some basic operations used in Relational Algebra are:
Traditional Set Operations, subdivided as:
(a) UNION (b) INTERSECTION
(c) DIFFERENCE (d) CARTESIAN PRODUCT
Relational Set Operators, similarly subdivided as:
(a) PROJECTION (b) SELECTION (c) JOIN (d) DIVISION
Traditional Set Operations
(i) UNION (U): The union of two relations A and B is done by the UNION command as A UNION B. It is the set of all tuples belonging to either A or B or both. Let us consider the sets A and B as:
Table A:
Table B:
If P and Q are two sets, then R, the resultant set of the union operation, is represented as R = P U Q. For example, let A be the set of supplier tuples for suppliers located in London and B be the set of supplier tuples for suppliers who supply part P1. Then A UNION B is the set of supplier tuples for suppliers who are either located in London or supply part P1 (or both). It is denoted by the symbol U (union), and we can write it as A U B (in mathematical form).
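The same example can be expressed in SQL, assuming the S and SP tables sketched earlier (suppliers located in London UNION suppliers of part P1):

    SELECT SNO FROM S  WHERE CITY = 'London'
    UNION
    SELECT SNO FROM SP WHERE PNO = 'P1';   -- duplicates are removed automatically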
(ii) Intersection: The intersection operation selects the common tuples from the two relations. It is denoted by the symbol ∩. The intersection of two relations A and B is written A ∩ B, or A INTERSECT B. For example, if A and B are two sets, then the intersection of the two is the resultant set R1.
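In SQL the intersection can be written with INTERSECT; continuing the same assumed supplier example, this keeps the suppliers that satisfy both conditions:

    SELECT SNO FROM S  WHERE CITY = 'London'
    INTERSECT
    SELECT SNO FROM SP WHERE PNO = 'P1';   -- suppliers in London who also supply P1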
(iii) Difference (-): The difference operation removes the common tuples from the first relation. The difference of two relations A and B is defined as A MINUS B: it is the set of all tuples belonging to set A but not belonging to B. It is denoted by (-) and we can write it as A - B. For example, for the two sets A and B above, the difference A - B contains the tuples of A that do not appear in B.
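In SQL the difference is written with EXCEPT (MINUS in Oracle); continuing the same assumed supplier example:

    SELECT SNO FROM S  WHERE CITY = 'London'
    EXCEPT                                 -- use MINUS in Oracle
    SELECT SNO FROM SP WHERE PNO = 'P1';   -- London suppliers who do not supply P1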
(iv) Cartesian Product: It is denoted by X or x (cross). The cartesian product of two relations A and B is defined as A TIMES B, or A x B. The extended cartesian, or simply the cartesian, product of two relations is the concatenation of tuples belonging to the two relations: a new resultant relation schema is created consisting of all possible combinations of the tuples, represented as R = P x Q. For example, let A be the set of all supplier numbers and B the set of all part numbers. Then A TIMES B is the set of all possible supplier number / part number pairs.
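In SQL the cartesian product corresponds to a CROSS JOIN (or simply listing both tables with no join condition), assuming the S and P tables sketched earlier:

    SELECT S.SNO, P.PNO
    FROM S CROSS JOIN P;   -- every supplier number paired with every part number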
Relational Set Operations: The basic set operations, which provide a very limited data manipulation facility, have been supplemented by the definition of the following operations:
(i) Projection (π): The projection of a relation is defined as the projection of all its tuples over some set of attributes, i.e. it yields a vertical subset of a relation. The projection operation is used either to reduce the number of attributes in the result or to reorder attributes. For example, if P is the table, then we can project on the field Name and get a resultant projected table containing only the Name column.
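In SQL, projection corresponds to listing only the wanted columns; assuming a student table P with fields RN (roll number) and Name:

    SELECT DISTINCT NAME FROM P;   -- vertical subset: only the Name attribute, duplicates removed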
(ii) Selection (σ): Selection is the selection of some tuples based on some condition; it yields a horizontal subset of a relation. It is denoted by σ and reduces the number of tuples of a relation. For example, if P is the relation, then R is the resultant table after selection on P, where the condition is to select all tuples having roll no. < 105:
R = σ(RN < 105)(P)
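The same selection written in SQL, for the assumed table P(RN, Name):

    SELECT * FROM P WHERE RN < 105;   -- horizontal subset: only tuples with roll no. below 105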
(iii) Join: The join operator allows the combining of two relations to form a single new relation. Joins are of three types: (i) theta join, (ii) natural join, (iii) equi join. A theta join joins two tables on the basis of a condition that may use any comparison operator; an equi join is a theta join whose condition uses equality on common fields; a natural join is an equi join over the common attributes in which the duplicate column is removed. For example, the tables S and P can be joined on the CITY field with the condition S.CITY = P.CITY.
(iv) Division (÷): The division operator divides a dividend relation A of degree m + n by a divisor relation B of degree n, producing a result relation of degree m. Suppose A is a relational table of suppliers and the parts they supply, and B is a relational table of parts; then A DIVIDED BY B gives the resultant table R of suppliers associated with every part in B.
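Hedged SQL sketches of these operators, using the S, P and SP tables assumed earlier. The equi join on CITY is direct; relational division has no single SQL keyword and is usually phrased with a double NOT EXISTS:

    -- Equi join: suppliers and parts located in the same city
    SELECT S.SNO, P.PNO, S.CITY
    FROM S JOIN P ON S.CITY = P.CITY;

    -- Division: suppliers who supply every part listed in P
    SELECT S.SNO
    FROM S
    WHERE NOT EXISTS (
      SELECT * FROM P
      WHERE NOT EXISTS (
        SELECT * FROM SP
        WHERE SP.SNO = S.SNO AND SP.PNO = P.PNO
      )
    );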
Q. 3. What are single-valued and multivalued attributes?
Ans. Multi-Valued Dependencies and Fourth Normal Form: BCNF was proposed as a simpler form of 3NF, but it was found to be stricter than 3NF, because every relation in BCNF is also in 3NF while a relation in 3NF is not necessarily in BCNF.
Definition: A normalized relation scheme R<S, F> is in BCNF if, for every nontrivial FD in F of the form X → A where X ⊆ S and A ∈ S, X is a super key of R. BCNF is a special case of 3NF.
Features (the situations in which 3NF and BCNF can differ):
1. The key attributes (candidate keys) are composite (there is no single attribute which identifies a record).
2. There is more than one candidate key.
3. The candidate keys overlap, i.e. they have at least one attribute in common.
A table that exhibits the three features mentioned above is exactly the kind of table for which BCNF must be checked in addition to 3NF; if the features are absent, a table in 3NF is automatically in BCNF.
For example, Course ID is such an overlapping attribute; hence the relation shown is in 3NF and also in BCNF. A relation schema R(S, F) is in BCNF (S = set of attributes, F = set of functional dependencies) if, for every set of attributes X that is a subset of S and every attribute Y that belongs to S,
one of the following two conditions holds:
(i) either Y belongs to X (so X → Y is a trivial dependency),
(ii) or X is a super key.
Here, trivial dependency means a functional dependency whose right-hand side is a subset of its left-hand side, and super key means a candidate (primary) key combined with any additional attributes.
It is in 1NF by definition. It is in 2NF since all non-key attributes are dependent on the entire key. It is in 3NF because it has no transitive dependencies. It is not in BCNF, however, because it has a determinant, FName, that is not a candidate key.
STU-ADV, key: (SID, FName). ADVSUBJ, key: (FName).
Relations in BCNF: Now we can say that a relation is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant. Less formally, a relation is in BCNF if and only if the only determinants are candidate keys.
Q. 4. Define the term data manipulation language.
Ans. DML (Data Manipulation Language): The language used to retrieve, insert, update and delete the data stored in the database is called the data manipulation language (DML); DML statements may be embedded in a host programming language or issued interactively.
DDL (Data Definition Language): Database management systems provide a facility known as the Data Definition Language, or data description language (DDL). DDL can be used to define the conceptual (global) schema and also to give some details about how to implement this schema on the physical devices used to store the data.
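A minimal sketch of the distinction, with a hypothetical STUDENT table: the first statement is DDL, the remaining ones are DML.

    -- DDL: defines a schema object
    CREATE TABLE STUDENT (
      ROLL_NO INT PRIMARY KEY,
      NAME    VARCHAR(30)
    );
    -- DML: manipulates the data held in that object
    INSERT INTO STUDENT VALUES (101, 'Amrita');
    UPDATE STUDENT SET NAME = 'Amrita K' WHERE ROLL_NO = 101;
    DELETE FROM STUDENT WHERE ROLL_NO = 101;
    SELECT * FROM STUDENT;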
Q. 5. What is RDBMS?
Ans. Relational Database Management Systems (RDBMSs) are database management systems that maintain data records and indices in tables. Relationships may be created and maintained across and among the data and tables. In recent years, database management systems (DBMSs) have established themselves as the primary means of data storage for information systems ranging from large commercial transaction processing applications to PC-based desktop applications. At the heart of most of today's information systems is a relational database management system (RDBMS). RDBMSs have been the workhorse of data management operations for over a decade and continue to evolve and mature, providing sophisticated storage, retrieval and distribution functions to enterprise-wide data processing and information management systems. Compared to file systems, relational database management systems provide organizations with the capability to easily integrate and leverage massive amounts of operational data into meaningful information systems. The evolution of high-powered database engines such as Oracle7 has fostered the development of advanced enabling technologies including client/server, data warehousing and online analytical processing, all of which comprise the core of today's state-of-the-art information management systems. A Relational Database Management System is a software package which manages a relational database, optimized for rapid and flexible retrieval of data; it is also called a database engine. In other words, a Relational Database Management System is a computer program that lets you store, index and retrieve tables of data. The simplest way to look at an RDBMS is as a spreadsheet that multiple users can update; one of the most important things an RDBMS provides is transactions. A Relational Database Management System is used to store, process and manage data arranged in relational tables, and is often used for transaction processing and data warehouses. An RDBMS has the ability to access data organized in tabular files that can be related to each other by a common field (item), and it can recombine the data items from different files, providing powerful tools for data usage. Relational databases are powerful because they require few assumptions about how data is related or how it will be extracted from the database; as a result, the same database can be viewed in many different ways. Almost all full-scale database systems are RDBMSs. In a relational DBMS (like Oracle) the database is organized and accessed according to the relationships between data items: relationships between data items are expressed by means of tables, and interdependencies among these tables are expressed by data values rather than by pointers. This allows a high degree of data independence. Some of the best-known RDBMSs include Oracle, Informix, Sybase, PostgreSQL and Microsoft Access.
Characteristics of a Relational Database
Relational databases consist of one or more tables; these can be joined by the database software in queries. Each table consists of rows and fields. Each table is about one aspect (or subject) of the database; thus contexts and finds are different subjects and are in different tables. Each row corresponds to one instance of the subject of the table; thus each row is about one context. Each row must be unique, which is a logical result of the row being about one instance; if you have duplicate rows, the results of searching are unpredictable.
Each field corresponds to a variable and is named to indicate its role; for example, finds have a name and a size/weight. Each cell (where the fields and rows intersect) contains only one value. This is important because otherwise it is not possible to search properly. Using a relational
database, if you find a need for two values per cell, the design has to be altered. If fields in different tables have the same range of values and are thus about the same object, there is an association between the fields and thus between the tables; such fields are called keys. The rows corresponding to matching values can be retrieved from different tables.
Q. 6. What do you mean by Relational Constraints?
Ans. The integrity of the data in a relational database must be maintained as multiple users access and change the data. Whenever data is shared, there is a need to ensure the accuracy of the values within database tables. The term data integrity has the following meanings:
1. The condition in which data is identically maintained during any operation, such as transfer, storage and retrieval.
2. The preservation of data for its intended use.
3. Relative to specified operations, the a priori expectation of data quality.
Another aspect of data integrity is the assurance that data can only be accessed and altered by those authorized to do so. Data integrity means, in part, that you can correctly and consistently navigate and manipulate the tables in the database. There are two basic rules to ensure data integrity: entity integrity and referential integrity. The entity integrity rule states that the value of the primary key can never be a null value (a null value is one that has no value and is not the same as a blank). Because a primary key is used to identify a unique row in a relational table, its value must always be specified and should never be unknown. The entity integrity rule requires that insert, update and delete operations maintain the uniqueness and existence of all primary keys. The referential integrity rule states that if a relational table has a foreign key, then every value of the foreign key must either be null or match the values in the relational table in which that foreign key is a primary key.
Types of Data Integrity
1. Null Rule: A null rule is a rule defined on a single column that allows or disallows inserts or updates of rows containing a null (the absence of a value) in that column.
2. Unique Column Values: A unique value rule defined on a column (or set of columns) allows the insert or update of a row only if it contains a unique value in that column (or set of columns).
3. Primary Key Values: A primary key value rule defined on a key (a column or set of columns) specifies that each row in the table can be uniquely identified by the values in the key.
4. Referential Integrity Rules: A referential integrity rule is a rule defined on a key (a column or set of columns) in one table that guarantees that the values in that key match the values in a key in a related table (the referenced value). Referential integrity also includes the rules that dictate what types of data manipulation are allowed on referenced values and how these actions affect dependent values.
Rules for Referential Integrity: The rules associated with referential integrity are:
Restrict: Disallows the update or deletion of referenced data.
Set to Null: When referenced data is updated or deleted, all associated dependent data is set to NULL.
Set to Default: When referenced data is updated or deleted, all associated dependent data is set to a default value.
Cascade: When referenced data is updated, all associated dependent data is correspondingly updated. When a referenced row is deleted, all associated dependent rows are deleted.
No Action: Disallows the update or deletion of referenced data. This differs from RESTRICT in that it is checked at the end of the statement, or at the end of the transaction if the constraint is deferred. (Oracle uses No Action as its default action.)
Complex Integrity Checking: Complex integrity checking is a user-defined rule for a column (or set of columns) that allows or disallows inserts, updates or deletes of a row based on the value it contains for the column (or set of columns).
Integrity Constraints Description
An integrity constraint is a declarative method of defining a rule for a column of a table. Oracle supports the following integrity constraints:
NOT NULL constraints for the rules associated with nulls in a column
UNIQUE key constraints for the rule associated with unique column values
PRIMARY KEY constraints for the rule associated with primary identification values
FOREIGN KEY constraints for the rules associated with referential integrity; Oracle supports the use of FOREIGN KEY integrity constraints to define the referential integrity actions, including update and delete No Action, delete CASCADE, and delete SET NULL
CHECK constraints for complex integrity rules
You cannot enforce referential integrity using declarative integrity constraints if child and parent tables are on different nodes of a distributed database. However, you can enforce referential integrity in a distributed database using database triggers (see next section).
Advantages of Integrity Constraints: Integrity constraints have advantages over the alternatives, which include enforcing business rules in the code of a database application, using stored procedures to completely control access to data, and enforcing business rules with triggered stored database procedures.
Types of Integrity Constraints: You can use the following integrity constraints to impose restrictions on the input of column values: NOT NULL integrity constraints, UNIQUE key integrity constraints, PRIMARY KEY integrity constraints, referential integrity constraints and CHECK integrity constraints.
1. NOT NULL Integrity Constraints: By default, all columns in a table allow nulls (null means the absence of a value). A NOT NULL constraint requires that a column of a table contain no null values. For example, you can define a NOT NULL constraint to require that a value be input in the last name column for every row of the employees table.
2. UNIQUE Key Integrity Constraints: A UNIQUE key integrity constraint requires that every value in a column or set of columns (key) be unique; that is, no two rows of a table may have duplicate values in the specified column or set of columns.
Unique Keys: The columns included in the definition of the UNIQUE key constraint are called the unique key. Unique key is often incorrectly used as a synonym for the terms UNIQUE key constraint or UNIQUE index. However, note that key refers only to the column or set of columns used in
the definition of the integrity constraint. If the UNIQUE key consists of more than one column, then that group of columns is said to be a composite unique key. A UNIQUE key constraint of this kind lets you enter an area code and telephone number any number of times, but the combination of a given area code and a given telephone number cannot be duplicated in the table; this eliminates unintentional duplication of a telephone number.
UNIQUE Key Constraints and Indexes: Oracle enforces unique integrity constraints with indexes. For example, Oracle enforces a UNIQUE key constraint by implicitly creating a unique index on the composite unique key. Therefore, composite UNIQUE key constraints have the same limitations imposed on composite indexes: up to 32 columns can constitute a composite unique key.
Combining UNIQUE Key and NOT NULL Integrity Constraints: UNIQUE key constraints allow the input of nulls unless you also define NOT NULL constraints for the same columns. In fact, any number of rows can include nulls for columns without NOT NULL constraints, because nulls are not considered equal to anything. A null in a column (or in all columns of a composite UNIQUE key) always satisfies a UNIQUE key constraint. Columns with both unique keys and NOT NULL integrity constraints are common; this combination forces the user to enter values in the unique key and also eliminates the possibility that any new row's data will ever conflict with an existing row's data.
3. PRIMARY KEY Integrity Constraints: Each table in the database can have at most one PRIMARY KEY constraint. The values in the group of one or more columns subject to this constraint constitute the unique identifier of the row; in effect, each row is named by its primary key values. The Oracle implementation of the PRIMARY KEY integrity constraint guarantees that both of the following are true: no two rows of a table have duplicate values in the specified column or set of columns, and the primary key columns do not allow nulls (that is, a value must exist for the primary key columns in each row).
Primary Keys: The columns included in the definition of a table's PRIMARY KEY integrity constraint are called the primary key. Although it is not required, every table should have a primary key so that each row in the table can be uniquely identified and no duplicate rows exist in the table.
PRIMARY KEY Constraints and Indexes: Oracle enforces all PRIMARY KEY constraints using indexes. The primary key constraint created for the deptno column, for instance, is enforced by the implicit creation of a unique index on that column and a NOT NULL constraint for that column. Composite primary key constraints are limited to 32 columns, which is the same limitation imposed on composite indexes. The name of the index is the same as the name of the constraint. Also, you can specify the storage options for the index by including the ENABLE clause in the CREATE TABLE or ALTER TABLE statement used to create the constraint. If a usable index exists when a primary key constraint is created, then the primary key constraint uses that index rather than implicitly creating a new one.
4. Referential Integrity Constraints: Different tables in a relational database can be related by common columns, and the rules that govern the relationship of the columns must be maintained. Referential integrity rules guarantee that these relationships are preserved.
A referential integrity constraint requires that for each row of a table, the value in the foreign key matches a value in a parent key. Self-Referential Integrity Constraints: Another type of referential integrity constraint is called a self-referential integrity constraint. This type of foreign key references a parent key in the same table. For example, such a constraint can ensure that every value in the mgr column of the emp table corresponds to a value that currently exists in the empno column of the same table, but not necessarily in the same row, because every manager must also be an employee. This integrity constraint eliminates the possibility of erroneous employee numbers in the mgr column. Nulls and Foreign Keys: The relational model permits the value of a foreign key either to match the referenced primary or unique key value, or to be null. If any column of a composite foreign key is null, then the non-null portions of the key do not have to match any corresponding portion of a parent key. Q. 7. What is the difference between the Relational Algebra and the Relational Calculus? Ans. 1. Relational algebra operations manipulate relations and provide expressions in the form of queries, whereas relational calculus queries are formed on the basis of formulas over tuples or domains. 2. RA has operators such as join, union, intersection, division, difference, projection, selection etc., whereas RC has tuple-oriented and domain-oriented expressions. 3. RA is a procedural language whereas RC is a non-procedural query system. 4. The expressive power of RA and RC is equivalent. This means any query that can be expressed in RA can be expressed by a formula in RC. 5. Any RC formula can be translated into an algebraic query. 6. Modification of queries is easier in RA than in RC. 7. RA has a mathematical form and no specific query language of its own; RC also has a mathematical form and has a query language, QUEL, based on it. 8. Relational algebra is easier to manipulate and understand than RC. 9. RA queries map directly to a sequence of operations, which makes them more straightforward for a system to evaluate than RC queries. 10. RC queries are formed as well-formed formulas (WFFs), whereas RA does not form any formula. Q. 8. Write a note on SQL basic queries. Ans. Structured Query Language (SQL) is the language used to manipulate relational databases. SQL is tied very closely with the relational model.
In the relational model, data is stored in structures called relations or tables. Each table has one or more attributes or columns that describe the table. In relational databases, the table is the fundamental building block of a database application. Tables are used to store data on Employees, Equipment, Materials, Warehouses, Purchase Orders, Customer Orders, etc. Columns in the Employee table, for example, might be Last Name, First Name, Salary, Hire Date, Social Security Number, etc. SQL statements are issued for the purpose of: Data definition - Defining tables and structures in the database (DB). Data manipulation - Inserting new data, Updating existing data, Deleting existing data, and Querying the Database ( Retrieving existing data from the database). Another way to say this is the SQL language is actually made up of 1) the Data Definition Language (DDL) used to create, alter and drop scema objects such as tables and indexes, and 2) The Data Manipulation Language (DML) used to manipulate the data within those schema objects. SQL*Plus commands allow a user to manipulate and submit SQL statements. Specifically, they enable a user to: Enter, edit, store, retrieve, and run SQL statements List the column definitions for any table Format, perform calculations on, store, and print query results in the form of reports Access and copy data between SQL databases The following is a list of SQL*Plus commands and their functions. The most commonly used commands are emphasized in italics: Execute the current SQL statement in the buffer - same as RUN ACCEPT - Accept a value from the user and place it into a variable APPEND - Add text to the end of the current line of the SQL statement in the buffer. AUTOTRACE - Trace the execution plan of the SQL statement and gather statistics BREAK - Set the formatting behavior for the output of SQL statements BTITLE - Place a title on the bottom of each page in the printout from a SQL statement CHANGE - Replace text on the current line of the SQL statement with new text CLEAR - Clear the buffer COLUMN - Change the appearance of an output column from a query COMPUTE - Does calculations on rows returned from a SQL statement CONNECT - Connect to another Oracle database or to the same Oracle database under a different user name COPY - Copy data from one table to another in the same or different databases DEL - Delete the current line in the buffer DESCRIBE - List the columns with data types of a table (Can be abbreviated as DESC) EDIT - Edit the current SQL statement in the buffer using an external editor such as vi or emacs EXIT - Exit the SQL*Plus program GET - Load a SQL statement into the buffer but do not execute it HELP - Obtain help for a SQL*Plus command (In some installations) HOST - Drop to the operating system shell INPUT - Add one or more lines to the SQL statement in the buffer LIST - List the current SQL statement in the buffer QUIT - Exit the SQL*Plus program REMARK - Place a comment following the REMARK keyword RUN - Execute the current SQL statement in the buffer SAVE - Save the current SQL statement to a script file SET - Set an environment variable to a new value
SHOW - Show the current value of an environment variable SPOOL - Send the output from a SQL statement to a file START - Load a SQL statement located in a script file and then run that SQL statement TIMING - Used to time the execution of SQL statements for performance analysis TITLE -Place a title on the top of each page in the printout from a SQL statement UNDEFINE - Delete a user defined variable Q 9. What are the various features of SQL? Ans. SQL Features 1. It is meant to be an English like Language using set English phrases to manipulate the database How well it achieves this is questionable 2. It is non procedural. You specify the information required not the navigation and operations required to access the data. Each RDBMS has an inbuilt query optimiser which parses your SQL statements and works out the optimum path to the required data. 3. When you query data, all the rows affected by your statement are dealt with in one go as a set, they are not dealt with separately. The work area that holds the set is known as a CURSOR. 4. SQL encompasses a range of uses and users. DBAs, application programmers, management and end users can use SQL. 5. It provides commands for the following tasks : querying data inserting, updating and deleting data creating, modifying and deleting database objects controlling access to the database and database objects guaranteeing database consistency monitoring database performance and configuration Q. 10. What is a trigger? Ans. Triggers are special stored procedures that are executed when a table undergoes an INSERT, a DELETE, or an UPDATE operation. Triggers often enforce referential integrity and can also call other stored procedures. Or Triggers are parameter-less procedures that are triggered (fired) either before or after inserting, updating or deleting rows from a table. Because they are fired by the event and not by choice they cannot have parameters Q. 11. What is the difference between a procedural and a non-procedural language? Ans. Non procedural Can be used on its own to specify complex database operations. DMBSs allow DML statements to be entered interactively from a terminal, or to be embedded in a programming language. If the commands are embedded in a general purpose programming language, the statements must be identified so they can be extracted by a pre-compiler and processed by the DBMS. Procedural Must be embedded in a general purpose programming language. Typically retrieves individual records or objects from the database and processes each separately. Therefore it needs to use programming language constructs such as loops. Low-level DMLs are also called record at a time DMLS because of this. High-level DMLs, such as SQL can specify and retrieve many records in a single DML statement, and are called set at a time or set oriented DMLs.
High-level languages are often called declarative, because the DML specifies what to retrieve, rather than how to retrieve it. Q. 12. Consider the following employee database, where the primary keys are underlined. Employee (person-name, street, city) Works (person-name, company-name, salary) Company (company-name, city) Managers (person-name, manager-name) Give an expression in SQL for each of the following queries. (i) Find the names of all employees who work for First Bank Corporation and live in Las Vegas. (ii) Find the names, street address and cities of residences of all employees who work for First Bank Corporation and earn more than $10000. (iii) Find all employees who do not work for First Bank Corporation. (iv) Find the company that has the smallest payroll. (v) Find all employees in the database who do not live in the same cities and on the same streets as do their managers. Ans. (i) SELECT e.person_name FROM employee e, works w WHERE e.person_name = w.person_name AND w.company_name = 'First Bank Corporation' AND e.city = 'Las Vegas'; (ii) SELECT e.person_name, e.street, e.city FROM employee e, works w WHERE e.person_name = w.person_name AND w.company_name = 'First Bank Corporation' AND w.salary > 10000; (iii) SELECT person_name FROM works WHERE company_name <> 'First Bank Corporation'; (iv) SELECT company_name FROM works GROUP BY company_name HAVING SUM(salary) <= ALL (SELECT SUM(salary) FROM works GROUP BY company_name); (v) SELECT e.person_name FROM employee e, managers m, employee me WHERE e.person_name = m.person_name AND m.manager_name = me.person_name AND (e.city <> me.city OR e.street <> me.street); Q. 13. Consider the following relational database and give an expression in relational algebra to express each of the following queries: employee (person-name, street, city) works (person-name, company-name, salary) company (company-name, city) managers (person-name, manager-name) (a) Find the names of all employees who work for First Bank Corporation. (b) Find the names and cities of residences of all employees who work for First Bank Corporation. (c) Find the names of all employees who do not work for First Bank Corporation. (d) Find names of all employees who earn more than $10000 per annum. (e) Find names of all employees who earn more than every employee of Small Bank Corporation. Ans. (a) Π person-name (σ company-name = 'First Bank Corporation' (works)) (b) Π person-name, city (employee ⋈ σ company-name = 'First Bank Corporation' (works)) (c) Π person-name (works) - Π person-name (σ company-name = 'First Bank Corporation' (works)) (d) Π person-name (σ salary > 10000 (works)) (e) Π person-name (works) - Π w1.person-name (σ w1.salary <= w2.salary (ρ w1 (works) × σ company-name = 'Small Bank Corporation' (ρ w2 (works)))) Q. 14. List any two procedural programming languages. Ans. 1. PL/SQL (Oracle) 2. SQL PL (DB2) 3. PL/pgSQL (PostgreSQL).
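As a hedged illustration of the trigger definition given in Q. 10 (written in PL/SQL, one of the languages listed in Q. 14), the following minimal sketch uses the works table from Q. 12; the trigger name and the rule it enforces are assumptions made only for this example.

    CREATE OR REPLACE TRIGGER trg_check_salary
    BEFORE INSERT OR UPDATE ON works
    FOR EACH ROW                 -- row-level: fires once for every affected row
    BEGIN
        IF :NEW.salary < 0 THEN
            RAISE_APPLICATION_ERROR(-20001, 'Salary cannot be negative');
        END IF;
    END;
    /

Note that the trigger declares no parameters; it is fired by the INSERT or UPDATE event itself, which is exactly the point made in Q. 10.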
Q. 15. What are row triggers? Ans. A row-level trigger is fired each time the table is affected by the triggering statement; e.g., if an update statement updates multiple rows of a table, a row trigger is fired once for each row affected by the update statement. If a triggering statement affects no rows, a row trigger is not executed at all. Q. 16. Define the term DDL. Ans. DDL stands for Data Definition Language. Q. 17. Consider the following relational database Employee (person-name, street, city) Works (person-name, company-name, salary) Company (company-name, city) Manager (person-name, manager-name) Ans. Similar Question Chapter-2, Question No. 22. Q. 18. Give relational algebra expression for each of the following queries: (a) Find the names of all employees who work for First Bank Corporation. (b) Find names, cities of residence of all employees who work for First Bank Corporation and earn more than $10,000. Ans. Similar Question Chapter-2, Question No. 13. Q. 19. Define View. Ans. A view is a method of organising table data to meet a specific need. Views are based on a SELECT statement, which derives their data from real tables. CREATE VIEW Create a new view based on one or more tables in the database. The table names must already exist. The new view name must not exist. CREATE VIEW has the following syntax:
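The syntax box that followed in the original is not reproduced; a typical, simplified form (the view, table and column names are illustrative) is:

    CREATE VIEW view_name AS
        SELECT column1, column2
        FROM   table_name
        WHERE  condition;

    -- for example, using the employee table of Q. 12:
    CREATE VIEW employee_cities AS
        SELECT person_name, city
        FROM   employee;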
Additional information on the SELECT statement and SQL queries can be found in the next section. Note that an ORDER BY clause may not be added to the SQL SELECT statement when defining a view. In general, views are read-only. That is, one may query a view, but it is normally the case that views cannot be operated on with INSERT, UPDATE or DELETE. This is especially true in cases where a view joins two or more tables together or contains an aggregate function. DROP VIEW Drop a view from the database. The view name must already exist in the database. The syntax for the DROP VIEW command is: DROP VIEW view_name; Q. 20. Define entity and attribute. Or What do you mean by Entities and Attributes? Ans. Entities and their Attributes:
Entities are the basic units in modeling classes of concrete (real) or abstract objects. Entities may have concrete existence or may represent ideas or concepts; e.g. a building, a room, a chair, an employee etc. are all different entities. An entity type or entity set is a group of similar objects of an organization, which is used for maintaining the data. Examples of entity sets are transactions, job positions, employees, inventories of raw and finished products, students, academic staff, non-academic staff, managers etc. An object can belong to different entity sets simultaneously. A person can be a student as well as a part-time employee. Consider the modeling of a flight crew. It consists of a group of individuals employed by an organization who belong to the entity sets EMPLOYEE and PERSON. The individual members of the flight crew have different skills and functions. So the entity set EMPLOYEE is given the attribute skill, with its possible values; that is, the entity set EMPLOYEE is associated with the attribute skill. To store data on an entity set, we have to create a model for it. For example, employees of an organization are modeled by the entity set EMPLOYEE. We must have some properties as characteristics of an employee that may be useful to the organization. Some of these properties are employee name, employee no., employee address, employee skill and employee pay. The properties that characterize an entity set are called its attributes. An attribute is also referred to by the terms data item, data element, data field, item, elementary item or object property.
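In a relational database each entity set is typically stored as a table and each of its attributes as a column. A minimal sketch for the EMPLOYEE entity set described above follows; the table name, column names and types are assumptions made only for illustration.

    CREATE TABLE employee_info (
        employee_no      NUMBER PRIMARY KEY,   -- attribute: employee number
        employee_name    VARCHAR2(40),         -- attribute: employee name
        employee_address VARCHAR2(80),         -- attribute: employee address
        employee_skill   VARCHAR2(30),         -- attribute: employee skill
        employee_pay     NUMBER                -- attribute: employee pay
    );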
Chapter 3 : Database Design Theory And Methodology (Part 1) Q. 1. What is normalization? Discuss various normal forms with the help of examples. Ans. Normalization is a design technique that is widely used as a guide in designing relational databases. Normalization is essentially a two-step process that puts data in tabular form by removing repeating groups and then removes duplicated data from the relational tables. Normalization theory is based on the concept of normal forms. A relational table is said to be in a particular normal form if it satisfies a certain set of constraints. There are currently five normal forms that have been defined. In this section, we will cover the first three normal forms, which were defined by E. F. Codd. Significance of Normalization: Improves update efficiency. Removes many causes of anomalous data dependencies. Allows better checks for consistency. Is (usually) better for query handling, although there are computational penalties in some SQL operations. Normalization is also significant for the following reasons: 1. To make it feasible to represent any relation in the database. 2. To obtain powerful relational retrieval using the relational operators. 3. To free relations from undesirable insertion, update and deletion anomalies. 4. To reduce the need for restructuring the relations as new data types are introduced. Normalization Avoids: Duplication of Data - The same data is listed in multiple lines of the database. Insert Anomaly - A record about an entity cannot be inserted into the table without first inserting information about another entity - Cannot enter a customer without a sales order. Delete Anomaly - A record cannot be deleted without deleting a record about a related entity. Cannot delete a sales order without deleting all of the customer's information.
Update Anomaly - Cannot update information without changing information in many places. To update customer information, it must be updated for each sales order the customer has placed. Before Normalization: 1. Begin with a list of all of the fields that must appear in the database. Think of this as one big table. 2. Do not include computed fields. 3. One place to begin getting this information is from a printed document used by the system. 4. Additional attributes besides those for the entities described on the document can be added to the database. Normal Forms The normalization process, as first proposed by Codd (1972), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form. The process, which proceeds in a top-down fashion by evaluating each relation against the criteria for the normal forms and decomposing relations as necessary, can thus be considered as relational design by analysis. Initially, Codd proposed three normal forms, which he called first, second and third normal form. A stronger definition of 3NF, called Boyce Codd normal form (BCNF), was proposed later by Boyce and Codd. All these normal forms are based on the functional dependencies among the attributes of a relation. Later, 4NF and 5NF were proposed, based on the concepts of multivalued dependencies and join dependencies, respectively. Need of Normalization: Normalization of data can hence be looked upon as a process of analyzing the given relation schemas, based on their FDs and primary keys, to achieve the desirable properties of: 1. Minimizing redundancy. 2. Minimizing insertion, deletion, and update anomalies. Normal forms are based on the primary key. Normalization: It is the process of structuring an unstructured relation into a structured one with the purpose of removing redundancy and anomalies. First Normal Form (1NF) Definition: A relation schema is said to be in 1NF if the values in the domain of each attribute of the relation are atomic. In other words, only one value is associated with each attribute and the value is not a set of values or a list of values. A database schema is in 1NF if every relation schema included in the database schema is in 1NF. A relation is in 1NF if and only if all underlying domains contain scalar values only. Here scalar implies atomicity, meaning there should be a single value at the intersection of each row and column, as shown in the FIRST relation obtained from the original relation.
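The FIRST relation referred to above appeared as a figure that is not reproduced here. Its heading and the dependencies assumed in the discussion that follows can be summarized as below; the sample rows of the original figure are not repeated, but the later text refers to supplier S1 located in London (status 20), S3 located in Paris and supplying part P2, and S5 located in Athens (status 30).

    FIRST (S#, STATUS, CITY, P#, QTY)        primary key: {S#, P#}
        S# → CITY            (each supplier is located in exactly one city)
        CITY → STATUS        (each city has one status, so S# → STATUS holds transitively)
        {S#, P#} → QTY       (each shipment has one quantity)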
But problems occur with each of the three operations. INSERT: We cannot insert the fact that a particular supplier is located in a particular city until that supplier supplies at least one part. The FIRST relation does not show that supplier S5 is located in Athens. The reason is that, until S5 supplies some part, we have no appropriate primary key value. DELETE: If we delete the only FIRST tuple for a particular supplier, we destroy not only the shipment connecting that supplier to some part but also the information that the supplier is located in a particular city. For example, if we delete the FIRST tuple with S# value S3 and P# value P2, we lose the information that S3 is located in Paris. UPDATE: The city value for a given supplier appears in FIRST many times, in general. This redundancy causes update problems. For example, if supplier S1 moves from London to Amsterdam, we are faced with either the problem of searching FIRST to find every tuple connecting S1 and London (and changing it) or the possibility of producing an inconsistent result (the city for S1 might be given as Amsterdam in one tuple, London in another). Therefore, to overcome this problem we move to 2NF. Before proceeding to the next form, let us denote: R = relation scheme, S = set of attributes, F = set of functional dependencies. Second Normal Form (2NF) Definition: A relation schema R<S, F> is in second normal form (2NF) if it is in 1NF and if all nonprime attributes are fully functionally dependent on the relation key(s). A database schema is in 2NF if every relation schema included in the database schema is in 2NF. Feature: 1. A relation is in 2NF if it is in 1NF and every nonkey attribute is fully dependent on the key. 2. If the key is a single attribute then the relation is automatically in 2NF.
Second Normal Form (definition assuming only one candidate key, which is thus the primary key): A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key. So we decompose the FIRST relation into two tables, SECOND and SP. It should be clear that the revised structure overcomes all the problems with update operations sketched earlier. INSERT: We can insert the information that S5 is located in Athens, even though S5 does not currently supply any parts, by simply inserting the appropriate tuple into SECOND. DELETE: We can delete the shipment connecting S3 and P2 by deleting the appropriate tuple from SP; we do not lose the information that S3 is located in Paris. UPDATE: The S# - CITY redundancy has been eliminated. Thus we can change the city for S1 from London to Amsterdam by changing it once and for all in the relevant SECOND tuple. Still, we have problems with these operations in the following ways: INSERT: We cannot insert the fact that a particular city has a particular status; e.g., we cannot state that any supplier in Rome must have a status of 50 until we have some supplier actually located in that city. DELETE: If we delete the only SECOND tuple for a particular city, we destroy not only the information for the supplier concerned but also the information that the city has that particular status. For example, if we delete the SECOND tuple for S5, we lose the information that the status for Athens is 30. UPDATE: The status for a given city appears in SECOND many times, in general (the relation still contains some redundancy). Thus, if we need to change the status for London from 20 to 30, we are faced with either the problem of searching SECOND to find every tuple for London (and changing it) or the possibility of producing an inconsistent result (the status in London might be given as 20 in one tuple and 30 in another). Again, to overcome such problems we replace the original relation (SECOND, in this case) by two projections, making 3NF. Third Normal Form (3NF) Definition: A relation schema R<S, F> is in 3NF if, for all nontrivial functional dependencies in F of the form X → A, either X contains a key (i.e., X is a superkey) or A is a prime attribute. A database schema is in 3NF if every relation schema included in the database schema is in 3NF. Feature: A relation R is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.
(3NF) (Definition assuming only one candidate key, which is thus the primary key): A relation is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (No transitive dependencies implies no mutual dependencies.) Relations SC and CS are both in 3NF.
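A minimal SQL sketch of the decomposition just described, using the table names SP, SC and CS that this answer refers to (column types are assumptions made only for illustration):

    -- 2NF step: FIRST is split into SECOND (supplier facts) and SP (shipments)
    CREATE TABLE SP (
        s#   VARCHAR2(5),
        p#   VARCHAR2(5),
        qty  NUMBER,
        PRIMARY KEY (s#, p#)
    );

    -- 3NF step: SECOND is split to remove the transitive dependency S# -> CITY -> STATUS
    CREATE TABLE SC (
        s#    VARCHAR2(5)  PRIMARY KEY,
        city  VARCHAR2(20)
    );

    CREATE TABLE CS (
        city    VARCHAR2(20) PRIMARY KEY,
        status  NUMBER
    );

SC records where each supplier is located and CS records the status of each city, so each of those facts is now stored exactly once.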
Thus, by such a decomposition we have removed the transitive dependency from relation SECOND. Boyce Codd Normal Form (BCNF) A relation is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant. Or, less formally, BCNF (informal definition): A relation is in BCNF if and only if the only determinants are candidate keys. Relations FIRST and SECOND, which are not in 3NF, are not in BCNF either; also note that relations SP, SC and CS, which were in 3NF, are also in BCNF. Relation FIRST contains three determinants, namely S#, CITY, and {S#, P#}; of these, only {S#, P#} is a candidate key, so FIRST is not in BCNF. Similarly, SECOND is not in BCNF either, because the determinant CITY is not a candidate key. Relations SP, SC and CS, on the other hand, are each in BCNF, because in each case the (single) candidate key is the only determinant in the relation. BCNF was originally proposed as a simpler form of 3NF, but it was found to be stricter than 3NF, because every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF. Definition: A normalized relation scheme R<S, F> is in BCNF if for every nontrivial FD in F of the form X → A, where X ⊆ S and A ∈ S, X is a superkey of R. BCNF is a special (stricter) case of 3NF. Features: 1. Key attributes (candidate keys) are composite (there is no single key which identifies a record). 2. More than one candidate key is present. 3. In each candidate key at least one attribute is overlapping.
If any table follows the above mentioned three features of BCNF, then we will say that this table is in BCNF. 4. Explanation:
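The worked example behind this explanation appeared as a figure that is not reproduced. A plausible reading, offered only as an assumption consistent with the relation names used below, is a relation ADVISING (SID, FName, Subject), where SID identifies a student, FName is the first name of the student's advisor, and each advisor advises in exactly one subject; then FName → Subject holds, the candidate keys are {SID, Subject} and {SID, FName}, and the relation is decomposed into STU-ADV and ADVSUBJ as noted below.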
It is in 1NF by definition. It is in 2NF since all nonkey attributes are dependent on the entire key. It is in 3NF because it has no transitive dependencies. It is not in BCNF because it has a determinant, FName, that is not a candidate key.
STU-ADV Key: (SID, FName) ADVSUBJ Key: (FName) Relations in BCNF Now we can say that a relation is in BCNF if and only if every nontrivial left-irreducible FD has a candidate key as its determinant. Or, less formally, a relation is in BCNF if and only if the only determinants are candidate keys. Fourth Normal Form (4NF) Definition: Given a relation schema R such that the set D of FDs and MVDs is satisfied, consider sets of attributes X and Y where X ⊆ R and Y ⊆ R. The relation schema R is in fourth normal form (4NF) if, for all multivalued dependencies of the form X →→ Y in D+, either X →→ Y is a trivial MVD or X is a superkey of R. A database schema is in 4NF if all relation schemas included in the database schema are in 4NF. Join Dependencies and Fifth Normal Form
So far in this chapter we have assumed that the sole operation necessary or available in the further normalization process is the replacement of a relation in a non-loss way by exactly two of its projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to discover that there exist relations that cannot be non-loss-decomposed into two projections but can be non-loss-decomposed into three (or more). To coin an ugly but convenient term, we will describe such a relation as n-decomposable (for some n > 2), meaning that the relation in question can be non-loss-decomposed into n projections but not into m for any m < n. A relation that can be non-loss-decomposed into two projections we will call 2-decomposable. In short: 1NF A relation is in 1NF if it contains no repeating groups. To convert an unnormalised relation to 1NF either: Flatten the table and change the primary key, or Decompose the relation into smaller relations, one for the repeating groups and one for the non-repeating groups. Remember to put the primary key from the original relation into both new relations. This option is liable to give the best results. 2NF A relation is in 2NF if it contains no repeating groups and no partial key functional dependencies. Rule: A relation in 1NF with a single key field must be in 2NF. To convert a relation with partial functional dependencies to 2NF, create a set of new relations: One relation for the attributes that are fully dependent upon the key. One relation for each part of the key that has partially dependent attributes. 3NF A relation is in 3NF if it contains no repeating groups, no partial functional dependencies, and no transitive functional dependencies. To convert a relation with transitive functional dependencies to 3NF, remove the attributes involved in the transitive dependency and put them in a new relation. Rule: A relation in 2NF with only one non-key attribute must be in 3NF. In a normalized relation a non-key field must provide a fact about the key, the whole key and nothing but the key. Relations in 3NF are sufficient for most practical database design problems. However, 3NF does not guarantee that all anomalies have been removed. Q. 2. What are multivalued dependencies? Or Define the term functional dependency. Ans. FUNCTIONAL DEPENDENCIES Introduction Basically, a functional dependence (usually abbreviated FD) is a many-to-one relationship from one set of attributes to another within a given relation. In the shipments relation SP, for example, there is a functional dependence from the set of attributes {S#, P#} to the set of attributes {QTY}; what this means is that each value of the attribute pair {S#, P#} has precisely one corresponding value of the attribute QTY. FDs provide a basis for a scientific attack on a number of practical problems. This is because FDs possess a rich set of interesting formal properties, which make it possible to treat the problems in question in a formal and rigorous manner. The terms functional dependence and functional dependency are used interchangeably in the technical literature. Customary English usage would suggest that the term dependence be used for
the FD concept per se and would reserve the term dependency for the object that depends. But we very frequently need to refer to FDs in the plural, and dependencies seems to trip off the tongue more readily than dependences; hence our use of both terms. Basic definitions In order to illustrate the ideas of the present section, we make use of a slightly revised version of the shipments relation, one that includes, in addition to the usual attributes S#, P# and QTY, an attribute CITY, representing the city for the relevant supplier. We will refer to this revised relation as SCP to avoid confusion. A possible tabulation of relation SCP is given in Fig. Now, it is very important in this area, as in so many others, to distinguish clearly between a. the value of a given relation (i.e., relation variable) at a given point in time and b. the set of all possible values that the given relation (variable) might assume at different times. In what follows, we will first define the concept of functional dependency as it applies to Case a. and then extend it to apply to Case b. Here then is the definition for Case (a). Let R be a relation, and let X and Y be arbitrary subsets of the set of attributes of R. Then we say that Y is functionally dependent on X, in symbols X → Y (read X functionally determines Y, or simply X arrow Y), if and only if each X-value in R has associated with it precisely one Y-value in R. In other words, whenever two tuples of R agree on their X-value, they also agree on their Y-value. For example, the tabulation of relation SCP shown in Fig. 9.1 satisfies the FD {S#} → {CITY}, because every SCP tuple with a given S# value also has the same CITY value. Indeed it also satisfies several more FDs, the following among them:
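(The original list appeared as a figure. In its place, here are some FDs that any such tabulation must satisfy, given that {S#, P#} is the primary key and that {S#} → {CITY} holds:)
    {S#, P#} → {QTY}
    {S#, P#} → {CITY}
    {S#, P#} → {CITY, QTY}
    {S#, P#} → {S#}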
(Exercise: Check these.) The left-hand side and right-hand side of an FD are sometimes called the determinant and the dependent, respectively. As the definition states, the determinant and dependent are both sets of attributes. When the set contains just one attribute, however, i.e. when it is a singleton set, we will often drop the set brackets and write just, e.g., S# → CITY.
As already explained, the foregoing definitions apply to Case (a), i.e. to individual relation values. However, when we consider relation variables, in particular when we consider base relations, we are usually interested not so much in the FDs that happen to hold in the particular value that the variable happens to have at some particular time, but rather in those FDs that hold for all possible values of that variable. In the case of SCP, for example, the FD S# → CITY holds for all possible values of SCP, because, at any given time, a given supplier has precisely one corresponding city, and so any two tuples appearing in SCP at the same time with the same supplier number must necessarily have the same city as well. In fact, the statement that this FD holds for all time (i.e., for all possible values of SCP) is an integrity constraint for SCP: it places limits on the values that SCP can legitimately assume. Here then is the Case (b) definition of functional dependency (the extensions over the Case (a) definition are shown in boldface). Let R be a relation variable, and let X and Y be arbitrary subsets of the set of attributes of R. Then we say that Y is functionally dependent on X, in symbols X → Y (read X functionally determines Y, or simply X arrow Y), if and only if, in every possible legal value of R, each X value has associated with it precisely one Y value. In other words, in every possible legal value of R, whenever two tuples agree on their X-values, they also agree on their Y-values. Henceforth, we will usually take the term functional dependency to have this latter, more demanding, time-independent meaning (barring explicit statements to the contrary). Here are some time-independent FDs that apply to the relation variable SCP:
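(The list appeared as a figure; representative examples, based on the semantics just described, are:)
    {S#} → {CITY}
    {S#, P#} → {QTY}
    {S#, P#} → {CITY}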
Notice in particular that the following FDs, which do hold in the sample tabulation of Fig. 3.8, do not hold for all time.
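(The figure is not reproduced; the FD discussed immediately below is:)
    {S#} → {QTY}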
In other words, the statement that (e.g.) every shipment for a given supplier has the same shipment quantity happens to be true for the sample values in Fig. 3.8, but it is not true for all possible legal values of SCP. It is worth pointing out that if X is a candidate key of relation R (in particular, if it is the primary key), then all attributes Y of relation R must necessarily be functionally dependent on X (this fact follows from the definition of candidate key). In the usual parts relation P, for example, we must necessarily have {P#} → {every attribute of P}, since P# is the primary key. In fact, if relation R satisfies the FD A → B and A is not a candidate key, then R will involve some redundancy. In the case of relation SCP, for example, the fact that a given supplier is located in a given city appears many times, in general (see Fig. 3.8). Now, even if we restrict our attention to FDs that hold for all time, the set of FDs satisfied by all legal values of a given relation can still be very large, as the SCP example suggests. What we would like is to find a much smaller set of FDs that implies all of the others. Why is this objective desirable? One reason is that (as already stated) FDs represent integrity constraints, and hence the DBMS needs to check them when updates are performed. Given a particular
set S of FDs, therefore, it is desirable to find some other set T that is (ideally) much smaller than S and has the property that every FD in S is implied by the FDs in T. If such a set T can be found, it is sufficient that the DBMS enforce the FDs in T, and the FDs in S will then be enforced automatically. The problem of finding such a set T is thus of considerable practical interest. Trivial and Nontrivial Dependencies Note: In the remainder of this section, we will occasionally abbreviate functional dependency to just dependency; similarly for functionally dependent on, functionally determines, etc. One obvious way to reduce the size of the set of FDs we have to deal with is to eliminate the trivial dependencies. An FD is trivial if it cannot possibly fail to be satisfied; in fact, an FD is trivial if and only if the right-hand side is a subset (not necessarily a proper subset) of the left-hand side. As the name implies, trivial dependencies are not very interesting in practice; we are usually more interested in nontrivial dependencies (which are, of course, precisely the ones that are not trivial), because these are the ones that correspond to genuine integrity constraints. When we are dealing with formal dependency theory, however, we cannot necessarily assume that all dependencies are nontrivial. Closure of a Set of Dependencies As already suggested, certain FDs imply others. As a simple example, the FD {S#, P#} → {CITY, QTY} implies both the following FDs: {S#, P#} → {CITY} and {S#, P#} → {QTY}. As a more complex example, suppose we have a relation R with three attributes A, B, and C, such that the FDs A → B and B → C both hold in R. Then it is easy to see that the FD A → C also holds in R. The FD A → C here is an example of a transitive FD: C is said to depend on A transitively, via B. The set of all FDs that are implied by a given set S of FDs is called the closure of S, and is denoted S+. Clearly we need a way of computing S+ from S. The first attack on this problem appeared in a paper by Armstrong, which gave a set of rules of inference (more usually called Armstrong's axioms) by which new FDs can be inferred from given ones. Those rules can be stated in a variety of equivalent ways, one of the simplest of which is as follows. Armstrong's inference rules: Let A, B, and C be arbitrary subsets of the set of attributes of the given relation R, and let us agree to write (e.g.) AB to mean the union of A and B. Then: 1. Reflexivity: If B is a subset of A, then A → B. 2. Augmentation: If A → B, then AC → BC. 3. Transitivity: If A → B and B → C, then A → C. Each of these three rules can be directly proved from the definition of functional dependence (the first is just the definition of a trivial dependence, of course). Moreover, the rules are complete, in the sense that, given a set S of FDs, all FDs implied by S can be derived from S using the rules. They are also sound, in the sense that no additional FDs (i.e., FDs not implied by S) can be so derived. In other words, the rules can be used to derive precisely the closure S+. Several further rules can be derived from the three given above, the following among them. These additional rules can be used to simplify the practical task of computing S+ from S. (D is another arbitrary subset of the set of attributes of R.) 4. Self-determination: A → A. 5. Decomposition: If A → BC, then A → B and A → C. 6. Union: If A → B and A → C, then A → BC.
7. Composition: If A → B and C → D, then AC → BD. And Darwen proves the following rule, which he calls the General Unification Theorem: 8. If A → B and C → D, then A ∪ (C - B) → BD (where ∪ denotes union and - denotes set difference). The name General Unification Theorem refers to the fact that several of the earlier rules can be seen as special cases. Example: Suppose we are given relation R with attributes A, B, C, D, E, F, and the FDs A → BC, B → E, CD → EF. Observe that we are extending our notation slightly (though not incompatibly) by writing, e.g., BC for the set consisting of attributes B and C; previously BC would have meant the union of B and C, where B and C were sets of attributes. Note: If you would prefer a more concrete example, take A as employee number for a project directed by that manager (unique within manager), E as department name, and F as percentage of time allocated by the specified manager to the specified project. We now show that the FD AD → F holds in R, and so is a member of the closure of the given set:
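(The derivation appeared as a figure; it can be reconstructed from the inference rules as follows:)
    1. A → BC            (given)
    2. A → C             (1, decomposition)
    3. AD → CD           (2, augmentation by D)
    4. CD → EF           (given)
    5. AD → EF           (3 and 4, transitivity)
    6. AD → F            (5, decomposition)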
Closure of a Set of Attributes We have not yet given an effective algorithm for computing the closure of a given set S of FDs. However, in this section we give an effective way of determining whether a given (specified) FD is in that closure. We begin our discussion with the notion of a superkey. A superkey for a relation R is a set of attributes of R that includes at least one candidate key of R as a subset (not necessarily a proper subset, of course). The definition of superkey can thus be derived from that of candidate key by simply deleting the irreducibility requirement. It follows immediately that the superkeys for a given relation R are precisely those subsets K of the set of attributes of R such that the functional dependency K → A holds true for every attribute A of R. Now suppose we know the FDs that hold for some given relation, and we need to determine the candidate keys for that relation. The candidate keys are, by definition, those superkeys that are irreducible. So determining whether or not a given set of attributes K is a superkey is a big step toward determining whether K is in fact a candidate key. To determine whether K is a superkey, we need to determine whether the set of all attributes functionally dependent on K is in fact the set of all attributes of R, and so, given a set S of FDs that hold in R, we need a way of determining the set of all attributes of R that are functionally dependent on K, the so-called closure K+ of K under S. A simple algorithm for computing this closure is given in Fig. Example: Suppose we are given relation R with attributes A, B, C, D, E, F, and FDs:
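(The FDs appeared as a figure; they can be recovered from the walkthrough below as:)
    A → BC
    E → CF
    B → E
    CD → EF
The algorithm referred to above simply initializes the result to K and then repeatedly scans the FDs, adding the right-hand side of any FD whose left-hand side is already contained in the result, stopping when a complete pass adds nothing.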
We now compute the closure {A,B}+ of the set of attributes {A,B} under this set of FDs. 1. We initialize the result CLOSURE[K,S] to {A,B}. 2. We now go round the inner loop four times, once for each of the given FDs. On the first iteration (for the FD A → BC), we find that the left-hand side is indeed a subset of CLOSURE[K,S] as computed so far, so we add attributes (B and) C to the result. CLOSURE[K,S] is now the set {A,B,C}.
3. On the second iteration (for the FD E → CF), we find that the left-hand side is not a subset of the result as computed so far, which thus remains unchanged. 4. On the third iteration (for the FD B → E), we add E to CLOSURE[K,S], which now has value {A,B,C,E}. 5. On the fourth iteration (for the FD CD → EF), CLOSURE[K,S] remains unchanged. 6. Now we go round the inner loop four times again. On the first iteration, the result does not change; on the second, it expands to {A,B,C,E,F}; on the third and fourth, it does not change. 7. Now we go round the inner loop four times again. CLOSURE[K,S] does not change, and so the whole process terminates, with {A,B}+ = {A,B,C,E,F}. Note, therefore, that {A,B} is not a superkey (and hence not a candidate key a fortiori). An important corollary of the foregoing is as follows: Given a set S of FDs, we can easily tell whether a specific FD X → Y follows from S, because that FD will follow if and only if Y is a subset of the closure X+ of X under S. In other words, we now have a simple way of determining whether a given FD X → Y is in the closure S+ of S. Irreducible Sets of Dependencies Let S1 and S2 be two sets of FDs. If every FD implied by S1 is implied by the FDs in S2, i.e., if S1+ is a subset of S2+, we say that S2 is a cover for S1. What this means is that if the DBMS enforces the constraints represented by the FDs in S2, then it will automatically be enforcing the FDs in S1. Next, if S2 is a cover for S1 and S1 is a cover for S2, i.e., if S1+ = S2+, we say that S1 and S2 are equivalent. Clearly, if S1 and S2 are equivalent, then if the DBMS enforces the constraints represented by the FDs in S2, it will automatically be enforcing the FDs in S1, and vice versa. Now we define a set S of FDs to be irreducible if and only if it satisfies the following three properties: 1. The right-hand side (the dependent) of every FD in S involves just one attribute (i.e. is a singleton set). 2. The left-hand side (the determinant) of every FD in S is irreducible in turn, meaning that no attribute can be discarded from the determinant without changing the closure (i.e. without converting S into some set not equivalent to S). We will say that such an FD is left-irreducible. 3. No FD in S can be discarded from S without changing the closure (i.e. without converting S into some set not equivalent to S). For example, consider the familiar parts relation P. The following FDs (among others) hold in that relation:
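(The list appeared as a figure. Assuming the usual attributes of the parts relation P, namely P#, PNAME, COLOR, WEIGHT and CITY, it would consist of FDs such as:)
    P# → PNAME
    P# → COLOR
    P# → WEIGHT
    P# → CITY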
This set of FDs is easily seen to be irreducible: the right-hand side is a single attribute in each case, the left-hand side is obviously irreducible in turn, and none of the FDs can be discarded without changing the closure (i.e., without losing some information). By contrast, the following sets of FDs are not irreducible.
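(The sets appeared as a figure. Illustrative examples, constructed only to violate the three conditions above and not taken from the original, are: a set containing P# → {PNAME, COLOR}, whose right-hand side is not a singleton; a set containing {P#, PNAME} → COLOR, whose determinant is not left-irreducible; and a set containing P# → P# alongside the four FDs above, since that trivial FD can be discarded without changing the closure.)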
We now claim that for every set of FDs, there exists at least one equivalent set that is irreducible. In fact, this is easy to see. Let the original set of FDs be S. Thanks to the decomposition rule, we can assume without loss of generality that every FD in S has a singleton right-hand side. Next, for each FD f in S, we examine each attribute A in the left-hand side of f; if S and the set of FDs obtained by eliminating A from the left-hand side of f are equivalent, we delete A from the left-hand side of f. Then, for each FD f remaining in S, if S and S - {f} are equivalent, we delete f from S. The final set is irreducible and is equivalent to the original set S. Example: Suppose we are given relation R with attributes A, B, C, D, and FDs:
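(The FDs appeared as a figure; they can be reconstructed from the steps that follow as:)
    A → BC
    B → C
    A → B
    AB → C
    AC → D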
We now compute an irreducible set of FDs that is equivalent to this given set. 1. The first step is to rewrite the FDs such that each one has a singleton right-hand side:
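(Rewritten with singleton right-hand sides, again reconstructed from the steps that follow, the FDs become:)
    A → B
    A → C
    B → C
    A → B
    AB → C
    AC → D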
We observe immediately that the FD A → B occurs twice, so one occurrence can be eliminated. 2. Next, attribute C can be eliminated from the left-hand side of the FD AC → D: because we have A → C, we get A → AC by augmentation, and we are given AC → D, so A → D by transitivity; thus the C on the left-hand side (LHS) of AC → D is redundant.
3. Next, we observe that the FD AB → C can be eliminated, because again we have A → C, so AB → CB by augmentation, and hence AB → C by decomposition. 4. Finally, the FD A → C is implied by the FDs A → B and B → C (transitivity), so it can also be eliminated. We are left with:
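(The resulting set, as implied by the eliminations above, is:)
    A → B
    B → C
    A → D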
This set is irreducible. A set I of FDs that is irreducible and is equivalent to some other set S of FDs is said to be an irreducible cover for S. Thus, given some particular set S of FDs that need to be enforced, it is sufficient for the system to find and enforce an irreducible cover I instead. We should make it clear, however, that a given set of FDs does not necessarily have a unique irreducible cover. Q. 3. What is query optimization? Ans. Query optimization refers to techniques that apply heuristic rules to modify the internal representation of a query, which is usually in the form of a query tree or a query graph data structure, to improve its expected performance. The parser of a high-level query first generates an initial internal representation, which is then optimized according to heuristic rules. Following that, a query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query. One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations. This is because the size of the file resulting from a binary operation, such as JOIN, is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied before a join or other binary operation. Q. 4. What are the various guidelines for database design? Ans. The process of database design can be stated as follows: Design the logical and physical structure of one or more databases to accommodate the information needs of the users in an organisation for a defined set of applications. The goals of database design are multiple: 1. Satisfy the information content requirements of the specified users and applications. 2. Provide a natural and easy-to-understand structuring of the information. 3. Support processing requirements and any performance objectives such as response time, processing time, and storage space. These goals are very hard to accomplish and measure, and they involve an inherent tradeoff: if one attempts to achieve more naturalness and understandability of the model, it may be at the cost of performance. The problem is aggravated because the database design process often begins with informal and poorly defined requirements. In contrast, the result of the design activity is a rigidly defined database schema that cannot be modified easily once the database is implemented. We can identify six main phases of the database design process: (i) Requirements collection and analysis (ii) Conceptual database design (iii) Choice of a DBMS (iv) Data model mapping (also called logical database design) (v) Physical database design (vi) Database system implementation and tuning.
Q. 5 Discuss the concepts of normalization in detail. Ans. NORMAL FORMS BASED ON PRIMARY KEYS Normalization In very simple words normalization is a technique which helps to determine the most appropriate grouping of data items into records, segments or tuples. This is necessary as the data items are arranged in tables which indicate the structure, relationship integrity in the relational databases. Normal Forms The normalization process as first proposed by Codd (1972), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form. The process, which proceed in a top-down fashion by evaluating each relation against the criteria for normal form decomposing relation, as necessary, can thus be considered as relational design by analysis. Initially, Codd proposed three normal forms, which he called first, second and third normal form. A stronger definition of 3NFcalled Boyce Codd normal form (BCNF)was proposed later by Boyce Codd. All these normal forms are based on the functional dependencies among the attributes of a relation. Later, a 4NF and 5NF were proposed, base on the concept of mutivalued dependencies and join dependencies, respectively. Need of Normalization Normalization of data can hence be looked upon as a. process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of: 1. Minimizing redundancy. 2. Minimizing the insertion, deletion, and updation. Normal forms are based on primary key. Normalization: It is the process of structuring an unstructured relation into structural one with the purpose of removing redundancy and anomalies. First Normal Form (INF) Definition: A relation schema is said to be in INF if the values in the domain of each attribute of the relation are atomic. In other words, only one value is associated with each attribute and the value is not a set of values or a list of values. A database schema is in INF if every relation schema included in database scheme is in INF. A relation is in INF if and only if all underlying domains contain scalar values only. Here scalar is atomic ity, meaning there should be single value at the intersection of each row and column as shown in the FIRST relation obtained by original relation
But problems occur with each of the three operations. INSERT: We cannot insert the fact that a particular supplier is located in a particular city until that supplier supplies at least one part. The FIRST relation does not show that supplier S5 is located in Athens. The reason is that, until S5 supplies some part, we have no appropriate primary key value. DELETE: If we delete the only FIRST tuple for a particular supplier, we destroy not only the shipment connecting that supplier to some part but also the information that the supplier is located in a particular city. For example, if we delete the FIRST tuple with S# value S3 and P# value P2, we lose the information that S3 is located in Paris. UPDATE: The city value for a given supplier appears in FIRST many times, in general. This redundancy causes update problems. For example, if supplier S1 moves from London to Amsterdam, we are faced with either the problem of searching FIRST to find every tuple connecting S1 and London (and changing it) or the possibility of producing an inconsistent result (the city for S1 might be given as Amsterdam in one tuple, London in another). Therefore, to overcome this problem we move to 2NF. Before proceeding to the next form, let us denote: R = relation scheme, S = set of attributes, F = set of functional dependencies. Second Normal Form (2NF) Definition: A relation schema R<S, F> is in second normal form (2NF) if it is in 1NF and if all nonprime attributes are fully functionally dependent on the relation key(s). A database schema is in 2NF if every relation schema included in the database schema is in 2NF. Feature: 1. A relation is in 2NF if it is in 1NF and every nonkey attribute is fully dependent on the key. 2. If the key is a single attribute then the relation is automatically in 2NF.
Chapter 3 - Part 2
Second Normal Form (definition assuming only one candidate key, which is thus the primary key): A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key. So we decompose the FIRST relation into two tables, SECOND and SP. It should be clear that the revised structure overcomes all the problems with update operations sketched earlier. INSERT: We can insert the information that S5 is located in Athens, even though S5 does not currently supply any parts, by simply inserting the appropriate tuple into SECOND. DELETE: We can delete the shipment connecting S3 and P2 by deleting the appropriate tuple from SP; we do not lose the information that S3 is located in Paris. UPDATE: The S# - CITY redundancy has been eliminated. Thus we can change the city for S1 from London to Amsterdam by changing it once and for all in the relevant SECOND tuple. Still, we have problems with these operations in the following ways: INSERT: We cannot insert the fact that a particular city has a particular status; e.g., we cannot state that any supplier in Rome must have a status of 50 until we have some supplier actually located in that city. DELETE: If we delete the only SECOND tuple for a particular city, we destroy not only the information for the supplier concerned but also the information that the city has that particular status. For example, if we delete the SECOND tuple for S5, we lose the information that the status for Athens is 30. UPDATE: The status for a given city appears in SECOND many times, in general. 3. Course ID is such an attribute which is overlapping; hence the relation shown is in 3NF and also in BCNF. A relation schema R<S, F> is in BCNF (S = set of attributes, F = set of functional dependencies) if, for a set of attributes X which is a subset of S and an attribute Y which belongs to S,
one of the following two conditions holds: i. either Y is a subset of X, so that X → Y is a trivial FD, ii. or X is a superkey. Trivial dependency: An FD whose right-hand side is a subset of its left-hand side is known as a trivial dependency,
e.g. {S#, P#} → S#. Superkey: A superkey is formed by adding any attribute(s) to a candidate (primary) key.
It is in 1NF by definition. It is in 2NF since all nonkey attributes are dependent on the entire key. It is in 3NF because it has no transitive dependencies. (The relation SECOND, however, still contains some redundancy.) Thus, if we need to change the status for London from 20 to 30, we are faced with either the problem of searching SECOND to find every tuple for London (and changing it) or the possibility of producing an inconsistent result (the status in London might be given as 20 in one tuple and 30 in another). Again, to overcome such problems we replace the original relation (SECOND, in this case) by two projections, making 3NF. Third Normal Form (3NF) Definition: A relation schema R<S, F> is in 3NF if, for all nontrivial functional dependencies in F of the form X → A, either X contains a key (i.e., X is a superkey) or A is a prime attribute. A database schema is in 3NF if every relation schema included in the database schema is in 3NF. Feature: A relation R is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (3NF) (Definition assuming only one candidate key, which is thus the primary key): A relation is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (No transitive dependencies implies no mutual dependencies.) Relations SC and CS are both in 3NF.
Thus, by such a decomposition we have removed the transitive dependency from relation SECOND. Boyce Codd Normal Form (BCNF)
A relation is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant. Or, less formally, BCNF (informal definition): A relation is in BCNF if and only if the only determinants are candidate keys. Relations FIRST and SECOND, which are not in 3NF, are not in BCNF either; also note that relations SP, SC and CS, which were in 3NF, are also in BCNF. Relation FIRST contains three determinants, namely S#, CITY, and {S#, P#}; of these, only {S#, P#} is a candidate key, so FIRST is not in BCNF. Similarly, SECOND is not in BCNF either, because the determinant CITY is not a candidate key. Relations SP, SC and CS, on the other hand, are each in BCNF, because in each case the (single) candidate key is the only determinant in the relation. BCNF was originally proposed as a simpler form of 3NF, but it was found to be stricter than 3NF, because every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF. Definition: A normalized relation scheme R<S, F> is in BCNF if for every nontrivial FD in F of the form X → A, where X ⊆ S and A ∈ S, X is a superkey of R. BCNF is a special (stricter) case of 3NF. Features: 1. Key attributes (candidate keys) are composite (there is no single key which identifies a record). 2. More than one candidate key is present. 3. In each candidate key at least one attribute is overlapping.
Only when a table exhibits all three of the features above does 3NF fail to guarantee BCNF, so such a table must be checked against the BCNF definition directly. Explanation:
1. It is not in BCNF because it has a determinant, FName, that is not a candidate key.
STU-ADV — key: (SID, FName); ADV-SUBJ — key: (FName). (Relations in BCNF.) Now we can say that a relation is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant. Or, less formally, a relation is in BCNF if and only if the only determinants are candidate keys. Fourth Normal Form (4NF) Definition: Given a relation schema R and a set D of FDs and MVDs that hold on R, consider sets of attributes X and Y with X ⊆ R, Y ⊆ R. The relation schema R is in fourth normal form (4NF) if, for all multivalued dependencies of the form X →→ Y in D+, either X →→ Y is a trivial MVD or X is a superkey of R. A database schema is in 4NF if all relation schemas included in the database schema are in 4NF. Multi-Valued Dependencies and Fourth Normal Form The 3NF and BCNF normal forms serve the purpose well most of the time. However, there are occasions where higher normal forms must be considered. The next higher form of normalization is the fourth normal form. It makes use of a new kind of dependency, called a multi-valued dependency (MVD); MVDs are a generalization of FDs. Likewise, the definition of fifth normal form makes use of another new kind of dependency, called a join dependency (JD); JDs in turn are a generalization of MVDs. Multi-valued dependency: Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A, in symbols A →→ B (read "A multi-determines B", or simply "A double-arrow B"), if and only if, for every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value. To understand it, we will take an example. Suppose we are given a relation HCTX (H for hierarchy) containing information about courses, teachers, and texts, in which the attributes corresponding to teachers and texts are relation-valued (see Fig.). As you can see, each HCTX tuple consists of a course name, plus a relation containing teacher names, plus a relation containing text names (two such tuples are shown in the figure). The intended meaning of such a tuple is that the specified course can be taught by any of the specified teachers and uses all of the specified texts as references. We assume that, for a given course, there can exist any number of corresponding teachers and any number of corresponding texts. Moreover, we also assume (perhaps not very realistically!) that teachers and texts are quite independent of one another; that is, no matter which teacher actually teaches any particular offering of a given
course, the same texts are used. Finally, we also assume that a given teacher or a given text can be associated with any number of courses.
Now suppose that we want to eliminate the relation-valued attributes. One way to do this is simply to replace relation HCTX by a relation CTX with three scalar attributes COURSE, TEACHER, and TEXT, as indicated in Fig. 3.10. As you can see from the figure, each tuple of HCTX gives rise to m * n tuples in CTX, where m and n are the cardinalities of the TEACHERS and TEXTS relations in that HCTX tuple. Note that the resulting relation CTX is "all key" (the sole candidate key for HCTX, by contrast, was just {COURSE}).
Figure: Value for relation CTX corresponding to the HCTX value in Fig. The meaning of relation CTX is basically as follows: a tuple (COURSE: c, TEACHER: t, TEXT: x) appears in CTX if and only if course c can be taught by teacher t and uses text x as a reference. Observe that, for a given course, all possible combinations of teacher and text appear; that is, CTX satisfies the (relation) constraint: if tuples (c, t1, x1) and (c, t2, x2) both appear, then tuples (c, t1, x2) and (c, t2, x1) both appear also. Now, it should be apparent that relation CTX involves a good deal of redundancy, leading as usual to certain update anomalies. For example, to add the information that the physics course can be taught by a new teacher, it is necessary to insert two new tuples, one for each of the two texts. Can we avoid such problems? Well, it is easy to see that: 1. The problems in question are caused by the fact that teachers and texts are completely independent of one another; 2. Matters would be much improved if CTX were decomposed into its two projections, call them CT and CX, on {COURSE, TEACHER} and {COURSE, TEXT} respectively (see Fig.).
To add the information that the physics course can be taught by a new teacher, all we have to do now is insert a single tuple into relation CT. (Note that relation CTX can be recovered by joining CT and CX back together again, so the decomposition is non-loss.) Thus, it does seem reasonable to suggest that there should be a way of further normalizing a relation like CTX. Note: At this point, you might object that the redundancy in CTX was unnecessary in the first place, and hence that the corresponding update anomalies were unnecessary too. More specifically, you might suggest that CTX need not include all possible TEACHER/TEXT combinations for a given course; for example, two tuples are obviously sufficient to show that the physics course has two teachers and two texts. The problem is, which two tuples? Any particular choice leads to a relation having a very unobvious interpretation and very strange update behavior (try stating the predicate for such a relation, i.e., try stating the criteria for deciding whether or not some given update is an acceptable operation on that relation).
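A minimal SQL sketch of this decomposition follows. Column types, the sample data, and the column name TEXTBOOK (standing in for the TEXT attribute, since TEXT is also a type name in some SQL dialects) are assumptions; the course/teacher/text content follows the running example.
CREATE TABLE CT (
  COURSE  VARCHAR(20),
  TEACHER VARCHAR(20),
  PRIMARY KEY (COURSE, TEACHER)
);
CREATE TABLE CX (
  COURSE   VARCHAR(20),
  TEXTBOOK VARCHAR(30),       -- the TEXT attribute of the example
  PRIMARY KEY (COURSE, TEXTBOOK)
);
-- Adding a new physics teacher now takes a single insert:
INSERT INTO CT (COURSE, TEACHER) VALUES ('Physics', 'Green');
-- CTX can always be recovered as the join of the two projections,
-- which is what makes the decomposition non-loss:
SELECT CT.COURSE, CT.TEACHER, CX.TEXTBOOK
FROM CT JOIN CX ON CT.COURSE = CX.COURSE;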
Figure: Values for relations CT and CX corresponding to the CTX value in Fig. Informally, therefore, it is obvious that the design of CTX is bad and the decomposition into CT and CX is better. The trouble is, however, that these facts are not formally obvious. Note in particular that CTX satisfies no functional dependencies at all (apart from trivial ones such as COURSE → COURSE); in fact, CTX is in BCNF, since, as already noted, it is all key, and any all-key relation must necessarily be in BCNF. (Note that the two projections CT and CX are also all key and hence in BCNF.) The ideas discussed so far are therefore of no help with the problem at hand. The existence of problematical BCNF relations like CTX was recognized very early on, and the way to deal with them was also understood, at least intuitively. However, it was not until 1977 that these intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion of multi-valued dependencies (MVDs). Multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist MVDs that are not FDs). In the case of relation CTX there are two MVDs that hold: COURSE →→ TEACHER and COURSE →→ TEXT. Note the double arrows; the MVD A →→ B is read as "B is multi-dependent on A" or, equivalently, "A multi-determines B". Let us concentrate on the first MVD, COURSE →→ TEACHER. Intuitively, what this MVD means is that, although a course does not have a single corresponding teacher (i.e., the functional dependency COURSE → TEACHER does not hold), nevertheless each course does have a well-defined set of corresponding teachers. By "well-defined" here we mean, more precisely, that for a given course c and a given text x, the set of teachers t matching the pair (c, x) in CTX depends on the value c alone; it makes no difference which particular value of x we choose. The second MVD, COURSE →→ TEXT, is interpreted analogously. Here then is the formal definition of multi-valued dependency: Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A, in symbols:
A →→ B (read "A multi-determines B", or simply "A double-arrow B"), if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value. It is easy to show that, given the relation R{A, B, C}, the MVD A →→ B holds if and only if the MVD A →→ C also holds. MVDs always go together in pairs in this way. For this reason it is common to represent them both in one statement, thus: A →→ B | C. For example: COURSE →→ TEACHER | TEXT. We stated above that multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the set of dependent (right-hand side) values matching a given determinant (left-hand side) value is always a singleton set. Thus, if A → B, then certainly A →→ B. Returning to our original CTX problem, we can now see that the trouble with relations such as CTX is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that it is precisely the existence of those MVDs that leads to the necessity of, for example, inserting two tuples to add another physics teacher. Those two tuples are needed in order to maintain the integrity constraint that is represented by the MVD.) The two projections CT and CX do not involve any such MVDs, which is why they represent an improvement over the original design. We would therefore like to replace CTX by those two projections, and an important theorem proved by Fagin allows us to make exactly that replacement: Theorem (Fagin): Let R{A, B, C} be a relation, where A, B and C are sets of attributes. Then R is equal to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVD A →→ B | C. Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial MVD A →→ B is satisfied (an MVD A →→ B is trivial if either A is a superset of B or the union of A and B is the entire heading), then all attributes of R are functionally dependent on A. In other words, the only nontrivial dependencies (FDs or MVDs) are of the form K → X (i.e., a functional dependency from a superkey K to some other attribute X). Equivalently, R is in 4NF if it is in BCNF and all MVDs in R are in fact FDs out of keys. Note in particular, therefore, that 4NF implies BCNF. Relation CTX is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD out of a key. The two projections CT and CX are both in 4NF, however. Thus 4NF is an improvement over BCNF in that it eliminates another form of undesirable dependency. What is more, 4NF is always achievable; that is, any relation can be non-loss-decomposed into an equivalent collection of 4NF relations. Join Dependencies and Fifth Normal Form So far we have assumed that the sole operation necessary or available in the further normalization process is the replacement of a relation in a non-loss way by exactly two of its projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to discover that there exist relations that cannot be non-loss-decomposed into two projections but can be non-loss-decomposed into three (or more). To coin an ugly but convenient term, we will describe such a relation as n-decomposable (for some n > 2), meaning that the relation in question can be non-loss-decomposed into n projections but not into m for any m < n.
A relation that can be non-loss-decomposed into two projections we will call 2-decomposable.
Consider relation SPJ from the suppliers-parts-projects database (but ignore QTY for simplicity); a sample value is shown at the top of Fig. 3.10. Note that relation SPJ is all key and involves no nontrivial FDs or MVDs at all, and is therefore in 4NF. Note too that the Fig. also shows: (a) the three binary projections SP, PJ, and JS corresponding to the SPJ relation value shown at the top of the figure; (b) the effect of joining the SP and PJ projections (over P#); (c) the effect of joining that result and the JS projection (over J# and S#).
Figure: Relation SPJ is the join of all three of its binary projections but not of any two. Observe that the result of the first join is to produce a copy of the original plus one additional (spurious) tuple, and the effect of the second join is then to eliminate that spurious tuple, thereby bringing us back to the original SPJ relation. In other words, the original SPJ relation is 3-decomposable. Note: The net result is the same whichever pair of projections we choose for the first join, though the intermediate result is different in each case. Exercise: check this claim. Now, the example of Fig. 3.12 is of course expressed in terms of relations. However, the 3-decomposability of SPJ could be a more fundamental, time-independent property, i.e., a property satisfied by all legal values of the relation, if the relation satisfies a certain time-independent integrity constraint. To understand what that constraint must be, observe first that the statement "SPJ is equal to the join of its three projections SP, PJ, and JS" is precisely equivalent to the following statement: if the pair (S1, P1) appears in SP and the pair (P1, J1) appears in PJ and the pair (J1, S1) appears in JS, then the triple (S1, P1, J1) appears in SPJ, because the triple (S1, P1, J1) obviously appears in the join of SP, PJ, and JS. (The converse of this statement, that if (S1, P1, J1) appears in SPJ then (S1, P1) appears in projection SP, etc., is clearly true for any degree-3 relation SPJ.) Since (S1, P1) appears in SP if and only if (S1, P1, J2) appears in SPJ for some J2, and similarly for (P1, J1) and (J1, S1), we can rewrite the statement above as a constraint on SPJ: if (S1, P1, J2), (S2, P1, J1), (S1, P2, J1) appear in SPJ,
then (S1, P1, J1) also appears in SPJ. And if this statement is true for all time, i.e., for all possible legal values of relation SPJ, then we do have a time-independent constraint on the relation (albeit a rather bizarre one). Notice the cyclic nature of that constraint: if S1 is linked to P1, P1 is linked to J1, and J1 is linked back to S1, then S1, P1 and J1 must all coexist in the same tuple. A relation will be n-decomposable for some n > 2 if and only if it satisfies some such (n-way) cyclic constraint. Suppose then that relation SPJ does in fact satisfy that time-independent constraint (the sample values in Fig. 3.11 are consistent with this hypothesis). For brevity, let us agree to refer to that constraint as Constraint 3D (3D for 3-decomposable). What does Constraint 3D mean in real-world terms? Let us try to make it a little more concrete by giving an example. The constraint says that, in the portion of the real world that relation SPJ is supposed to represent, it is a fact that if (for example) (a) Smith supplies monkey wrenches, and (b) monkey wrenches are used in the Manhattan project, and (c) Smith supplies the Manhattan project, then (d) Smith supplies monkey wrenches to the Manhattan project. Note that (a), (b), and (c) together normally do not imply (d). We are saying there is no trap here because an additional real-world constraint is in effect, namely Constraint 3D, that makes the inference of (d) from (a), (b), and (c) valid in this particular case. To return to the main topic of discussion: because Constraint 3D is satisfied if and only if the relation concerned is equal to the join of certain of its projections, we refer to that constraint as a join dependency (JD). A JD is a constraint on the relation concerned, just as an MVD or an FD is a constraint on the relation concerned. Join dependency: Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies the JD * {A, B, ..., Z} (read "star A, B, ..., Z") if and only if every possible legal value of R is equal to the join of its projections on A, B, ..., Z. For example, if we agree to use SP to mean the subset {S#, P#} of the set of attributes of SPJ, and similarly for PJ and JS, then relation SPJ satisfies the JD * {SP, PJ, JS}. We have seen, then, that relation SPJ, with its JD * {SP, PJ, JS}, can be 3-decomposed. The question is, should it be? And the answer is probably yes. Relation SPJ (with its JD) suffers from a number of problems over update operations, problems that are removed when it is 3-decomposed.
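A minimal SQL sketch of the 3-decomposition follows. Here SP, PJ and JS denote the binary projections of SPJ as in the text (not the earlier shipment relation SP); column types are assumptions.
CREATE TABLE SP (SNO CHAR(3), PNO CHAR(3), PRIMARY KEY (SNO, PNO));
CREATE TABLE PJ (PNO CHAR(3), JNO CHAR(3), PRIMARY KEY (PNO, JNO));
CREATE TABLE JS (JNO CHAR(3), SNO CHAR(3), PRIMARY KEY (JNO, SNO));
-- SPJ is recovered only by joining all three projections; joining any two
-- of them can produce a spurious tuple, which the third join eliminates:
SELECT SP.SNO, SP.PNO, PJ.JNO
FROM SP
JOIN PJ ON SP.PNO = PJ.PNO
JOIN JS ON JS.JNO = PJ.JNO AND JS.SNO = SP.SNO;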
Fig.: Sample update problems in SPJ. Fagin's theorem, to the effect that R{A, B, C} can be non-loss-decomposed into its projections on {A, B} and {A, C} if and only if the MVDs A →→ B and A →→ C hold in R, can now be restated as follows: R{A, B, C} satisfies the JD * {AB, AC} if and only if it satisfies the MVD A →→ B | C. Since this theorem can be taken as a definition of multi-valued dependency, it follows that an MVD is just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Formally, we have: A →→ B | C if and only if * {AB, AC}. Note: It follows from the definition that join dependencies are the most general form of dependency possible (using, of course, the term "dependency" in a very special sense). That is, there does not exist a still higher form of dependency such that JDs are merely a special case of that higher form, so long as we restrict our attention to dependencies that deal with a relation being decomposed via projection and recomposed via join. (However, if we permit other decomposition and recomposition operators, then other types of dependencies might come into play.) Returning now to our example, we can see that the problem with relation SPJ is that it involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is possible, and probably desirable, to decompose such a relation into smaller components, namely into the projections specified by the join dependency. That decomposition process can be repeated until all resulting relations are in fifth normal form, which we now define: Fifth normal form: A relation R is in 5NF, also called projection-join normal form (PJNF), if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R. (A JD * {A, B, ..., Z} is trivial if and only if one of the projections A, B, ..., Z is the identity projection of R, i.e., the projection over all attributes of R.) Note: We explain below what it means for a JD to be implied by candidate keys. Relation SPJ is not in 5NF: it satisfies a certain join dependency, namely Constraint 3D, that is certainly not implied by its sole candidate key (that key being the combination of all of its attributes). To state this differently, relation SPJ is not in 5NF because (a) it can be 3-decomposed and (b) that 3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a candidate key. By contrast, after 3-decomposition, the three projections SP, PJ, and JS are each in 5NF, since they do not involve any nontrivial JDs at all. Although it might not yet be obvious (because we have not yet explained what it means for a JD to be implied by candidate keys), it is a fact that any relation in 5NF is automatically in 4NF also, because (as we have seen) an MVD is a special case of a JD. In fact, any MVD that is implied by a candidate key must be an FD in which that candidate key is the determinant. It is also a fact that any given relation can be non-loss-decomposed into an equivalent collection of 5NF relations; that is, 5NF is always achievable. We now explain what it means for a JD to be implied by candidate keys. First we consider a simple example. Suppose once again that the familiar SUPPLIERS relation S has two candidate keys, {S#} and {SNAME}. Then that relation satisfies several join dependencies; for example, it satisfies the JD * {{S#, SNAME, STATUS}, {S#, CITY}}. That is, relation S is equal to the join of its projections on {S#, SNAME, STATUS} and {S#, CITY}, and hence can be non-loss-decomposed into those projections. (This fact does not mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the fact that {S#} is a candidate key. Likewise, relation S also satisfies the JD * {{S#, SNAME}, {S#, STATUS}, {SNAME, CITY}}. This JD is implied by the fact that {S#} and {SNAME} are both candidate keys. As the foregoing example suggests, a given JD * {A, B, ..., Z} is implied by candidate keys if and only if each of A, B, ..., Z is in fact a superkey for the relation in question.
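The SUPPLIERS example can be illustrated with a small SQL sketch (column types are assumptions; the two candidate keys S# and SNAME follow the text):
CREATE TABLE S (
  SNO    CHAR(3)     NOT NULL,
  SNAME  VARCHAR(20) NOT NULL,
  STATUS INTEGER,
  CITY   VARCHAR(20),
  PRIMARY KEY (SNO),
  UNIQUE (SNAME)
);
-- Because {S#} is a candidate key, S is losslessly equal to the join of
-- these two projections, i.e. it satisfies *{{S#,SNAME,STATUS},{S#,CITY}}:
SELECT P1.SNO, P1.SNAME, P1.STATUS, P2.CITY
FROM (SELECT SNO, SNAME, STATUS FROM S) AS P1
JOIN (SELECT SNO, CITY FROM S) AS P2 ON P1.SNO = P2.SNO;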
Given a relation R, we can decompose it into 5NF as long as we know all candidate keys and all JDs in R. However, discovering all the JDs might itself be a nontrivial operation. That is, whereas it is relatively easy to identify FDs and MVDs (because they have a fairly straightforward real-world interpretation), the same cannot be said for JDs that are not MVDs and not FDs, because the intuitive meaning of such JDs might not be obvious. Hence the process of determining when a given relation is in 4NF but not in 5NF, and so could
probably be decomposed to advantage, is still unclear. Experience suggests that such relations are pathological cases and likely to be rare in practice. In conclusion, we note that it follows from the definition that 5NF is the ultimate normal form with respect to projection and join (which accounts for its alternative name, projection-join normal form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking projections. For if a relation is in 5NF, the only join dependencies are those that are implied by candidate keys, and so the only valid decompositions are ones that are based on those candidate keys. (Each projection in such a decomposition will consist of one or more of those candidate keys, plus zero or more additional attributes.) For example, the SUPPLIERS relation S is in 5NF. It can be further decomposed in several non-loss ways, as we saw earlier, but every projection in any such decomposition will still include one of the original candidate keys, and hence there does not seem to be any particular advantage in further reduction. The Normalization Procedure Summarized: Up to this point we have been concerned with the technique of non-loss decomposition as an aid to database design. The basic idea is as follows: given some 1NF relation R and some set of FDs, MVDs, and JDs that apply to R, we systematically reduce R to a collection of smaller (i.e., lower-degree) relations that are equivalent to R in a certain well-defined sense but are also in some way more desirable. (The original relation R might have been obtained by first eliminating certain relation-valued attributes.) Each step of the reduction process consists of taking projections of the relations resulting from the preceding step. The given constraints are used at each step to guide the choice of which projections to take next. The overall process can be stated informally as a set of rules, thus: 1. Take projections of the original 1NF relation to eliminate any FDs that are not irreducible. This step will produce a collection of 2NF relations. 2. Take projections of those 2NF relations to eliminate any transitive FDs. This step will produce a collection of 3NF relations. 3. Take projections of those 3NF relations to eliminate any remaining FDs in which the determinant is not a candidate key. This step will produce a collection of BCNF relations.
Q. 1. What is database security? Explain the mechanisms for maintaining database security. Ans. Security in a database involves both policies and mechanisms to protect the data and ensure that it is not accessed, altered or deleted without proper authorization. As information becomes increasingly important in an organization, more and more databases are created day to day, so every database should be secured against unauthorized access or manipulation by unknown persons. Data has to be protected in the database. There are two dimensions to the protection of data in the database. First, a certain class of data is available only to those persons who are authorized to access it; this makes the data confidential (e.g. the medical records of patients in a hospital are accessible only to health care officers). Second, the data must be protected from accidental or intentional corruption or destruction (e.g. data on national defense is vital to the security of a state, as is the safety of the data controlling processing in a chemical plant). In addition to the economic or strategic reasons for protecting data from unauthorized access, corruption or destruction, there is a privacy dimension to data security and integrity. Security and Integrity Threats: Some types of threats can only be addressed using social, behavioural and control mechanisms. The threats are either accidental or intentional, so two types of security and integrity threats arise. Accidental security and integrity threats: 1. A user may get access to a portion of the database which other users cannot access; if that user damages part of the data accidentally, the whole of the data may be corrupted, e.g. if an application programmer accidentally deletes some function or subroutine, the whole program in the database will be affected. 2. Sometimes the failure of one component affects the whole of the data; for example, if the power supply fails during transaction processing, the computed data will not be transferred to the storage device and so will be lost. Proper recovery procedures are normally used to recover from failures occurring during transaction processing. 3. Sometimes concurrent processing or concurrent usage of data causes problems, and data may be lost or damaged. 4. Sometimes a system error occurs: a dial-in user may be assigned the identity of another dial-in user who was disconnected accidentally or who hung up without going through a log-off procedure. 5. Sometimes improper authorization causes problems, which could lead to database security and/or integrity violations. 6. Hardware failure also causes destruction of data, so protection against hardware failure is needed for security and integrity. Malicious or Intentional Security and Integrity Threats: Some intentional security and integrity threats are:
1. A computer system operator or system programmer can intentionally bypass the normal security and integrity mechanisms, alter or destroy the data in the database, or make unauthorized copies of sensitive data. 2. An unauthorized user can get access to a secure terminal or to the password of an authorized user and compromise the database. Such a user could also destroy the database files. 3. Authorized users could pass on sensitive information under pressure or for personal gain. 4. System and application programmers could bypass normal security in their programs by directly accessing database files and making changes and copies for illegal use. 5. An unauthorized person can get access to the computer system, physically or by using a communication channel, and compromise the database. Protection: Four levels of defense (protection) are generally recognized for database security. These are: (a) Human factors: the ethical, legal and social environments. An organization depends on these to provide a certain degree of protection. (b) Physical security: mechanisms include appropriate locks and keys and entry logs to the computing facility and terminals. Security of the physical storage devices (magnetic tapes, disk packs, etc.) must be maintained within the organization and while they are being transported from one location to another. User identification and passwords have to be kept confidential; otherwise an unauthorized user may compromise the database. (c) Administrative controls: the security and access control policies that determine what information will be accessible to which class of users and the type of access that will be allowed to that class. (d) OS and DBMS mechanisms: the operating system protects data and programs in both primary and secondary memory, and user identities are established by the operating system. The DBMS provides transaction management, auditing and recovery of data during logging, and also enforces integrity constraints and validation procedures to check users and operations. Methods of Protection: Protection is the branch of security concerned with keeping data safe from unauthorized access by means of different mechanisms and techniques. Some protection methods are: (a) Identification and authentication: The authorization mechanism prepares a user profile for each user and indicates the portion of the database accessible to that user and the mode of access allowed. The enforcement of the security policies in the database system requires that the system knows the identity of the user making the request. So before making any request the user has to identify himself or herself to the system and authenticate that identification to confirm that the user is in fact the correct person. The simplest and most common authentication scheme is a password: the user enters a user name or number and then authenticates himself or herself with the password. These are normally used once, at the initial sign-on to the system, but for sensitive and important data the authentication/identification procedure can be repeated at every step. Sometimes badges, cards or keys are used for access.
(b) Distributed system protection: For the protection of data, security enforcement in a distributed system can be enhanced by distribution itself. Sensitive information can be fragmented and stored at dispersed sites; the leakage of some portion of the fragmented data may not be as disastrous as the leakage of the unfragmented data. Also, with distribution, different sites can have different levels of security and protection of data. (c) Cryptography and encryption: Suppose a defence organization wants to transmit a message in a protected way, such as: "Tanks are coming towards AMRITSAR". One method of transmitting this message is to substitute a different character of the alphabet for each character in the message. If we ignore the spaces between words and the punctuation, and make the substitution by shifting each character, the above message can be transformed into something like: Sbolrbsfdpnnjohupxbsebn&julbs. This process is cryptography, also called encryption of data; data should be encrypted before transmission. This is one of the best forms of protection. System Integrity: Integrity implies that any properly authorized access, alteration or deletion of data in the database does not change the validity of the data. Security and integrity are distinct but related concepts; in practice integrity depends on security. The mechanism that is applied to ensure that the data in the database is correct and consistent is called data integrity. Integrity also covers the repair of data that has been damaged by an unauthorized person. Data Integrity: This requires guarding against invalid database operations. An operation here means any action performed on behalf of a user or application program that modifies the state of the database; such operations result from actions such as update, insert or delete. Database integrity involves the correctness of data. This correctness has to be preserved in the presence of concurrent operations, errors in users' operations and application programs, and failures in hardware and software. Integrity mechanisms include recovery of lost or damaged data and checks on the data and information stored in memory. To maintain the consistency and validity of the data, the database has to enforce several types of integrity constraints. Integrity constraints are hard to maintain when the constraint rules are coded in every application program; centralizing the integrity checking directly under the DBMS reduces duplication and ensures the consistency and validity of the database. The centralized integrity constraints can be maintained in a system catalog (data dictionary) and can be accessible to the database users via the query language. Q. 2. What is database security? Ans. Security in a database involves both policies and mechanisms to protect the data and ensure that it is not accessed, altered or deleted without proper authorization. As information becomes increasingly important in an organization, more and more databases are created day to day, so every database should be secured against unauthorized access or manipulation by unknown persons. Data has to be protected in the database. There are two dimensions to the protection of data in the database. First, a certain class of data is available only to those persons who are authorized to access it. This makes the
data confidential, e.g. the medical records of patients in a hospital are accessible only to health care officers. Second, the data must be protected from accidental or intentional corruption or destruction, e.g. data on national defense is vital to the security of a state, as is the safety of the data controlling processing in a chemical plant. In addition to the economic or strategic reasons for protecting data from unauthorized access, corruption or destruction, there is a privacy dimension to data security and integrity. Q. 3. Discuss the concept of transaction in detail. Or What are the desirable properties of a transaction? Ans. A transaction is a logical unit of work that must be either entirely completed or aborted; no intermediate states are acceptable. Most real-world database transactions are formed by two or more database requests. A database request is the equivalent of a single SQL statement in an application program or transaction. A transaction that changes the contents of the database must alter the database from one consistent database state to another. To ensure consistency of the database, every transaction must begin with the database in a known consistent state.
A transaction is a logical unit of work that contains one or more SQL statements. A transaction is an atomic unit: the effects of all the SQL statements in a transaction are either all committed (applied to the database) or all rolled back (undone from the database). A transaction begins with the first executable SQL statement. A transaction ends when it is committed or rolled back, either explicitly with a COMMIT or ROLLBACK statement or implicitly when a DDL statement is issued. Consider a banking database: when a bank customer transfers money from a savings account to a checking account, the transaction can consist of three separate operations: decrement the savings account, increment the checking account, and record the transfer in the transaction journal. If all three SQL statements can be performed so as to maintain the accounts in proper balance, the effects of the transaction can be applied to the database. However, if a problem such as insufficient funds, an invalid account number, or a hardware failure prevents one or two of the statements in the transaction from completing, the entire transaction must be rolled back so that the balance of all accounts is correct. Figure 2 illustrates the banking transaction example.
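A minimal SQL sketch of this transfer is shown below. The table and column names (ACCOUNTS, JOURNAL, ACCT_NO, BALANCE) and the amount are assumptions made for illustration only; the transaction-start keyword also varies between DBMS products (START TRANSACTION, BEGIN, BEGIN TRANSACTION).
-- Assumed schema: ACCOUNTS(ACCT_NO, BALANCE), JOURNAL(ACCT_FROM, ACCT_TO, AMOUNT, TX_DATE)
START TRANSACTION;
UPDATE ACCOUNTS SET BALANCE = BALANCE - 500 WHERE ACCT_NO = 'SAV-1001';
UPDATE ACCOUNTS SET BALANCE = BALANCE + 500 WHERE ACCT_NO = 'CHK-1001';
INSERT INTO JOURNAL (ACCT_FROM, ACCT_TO, AMOUNT, TX_DATE)
VALUES ('SAV-1001', 'CHK-1001', 500, CURRENT_DATE);
COMMIT;   -- make all three changes permanent together
-- If any statement fails (e.g. insufficient funds caught by a CHECK
-- constraint), issue ROLLBACK instead, and none of the changes survive.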
Statement Execution and Transaction Control: A SQL statement that runs successfully is different from a committed transaction. Executing successfully means that a single statement was parsed, found to be a valid SQL construction, and run without error as an atomic unit (for example, all rows of a multi-row update are changed). However, until the transaction that contains the statement is committed, the transaction can be rolled back, and all of the changes of the statement can be undone. A statement, rather than a transaction, runs successfully. Committing means that a user has explicitly or implicitly requested that the changes in the transaction be made permanent. An explicit request means that the user issued a COMMIT statement. An implicit request can be made through normal termination of an application or through a data definition language statement, for example. The changes made by the SQL statements of your transaction become permanent and visible to other users only after your transaction has been committed, and only other users' transactions that started after yours will see the committed changes. You can name a transaction using the SET TRANSACTION ... NAME statement before you start the transaction; this makes it easier to monitor long-running transactions and to resolve in-doubt distributed transactions. If at any time during execution a SQL statement causes an error, all effects of the statement are rolled back. The effect of the rollback is as if that statement had never been run. This operation is a statement-level rollback. Errors discovered during SQL statement execution cause statement-level rollbacks. An example of such an error is attempting to insert a duplicate value into a primary key. Single SQL statements involved in a deadlock (competition for the same data) can also cause a statement-level rollback. Errors discovered during SQL statement parsing, such as a syntax error, occur before the statement has been run, so they do not cause a statement-level rollback. A SQL statement that fails causes the loss only of any work it would have performed itself; it does not cause the loss of any work that preceded it in the current transaction. If the statement is a DDL statement, then the implicit commit that immediately preceded it is not undone. The user can also request
a statement-level rollback by issuing a ROLLBACK statement. Note that users cannot directly refer to implicit savepoints in rollback statements. Resumable Space Allocation: Some DBMSs provide a means for suspending, and later resuming, the execution of large database operations in the event of space allocation failures. This enables an administrator to take corrective action, instead of the database server returning an error to the user. After the error condition is corrected, the suspended operation automatically resumes. This feature is called resumable space allocation, and the statements that are affected are called resumable statements. A statement runs in resumable mode only when the client explicitly enables resumable semantics for the session using the ALTER SESSION statement. Resumable space allocation is suspended when one of the following conditions occurs: an out-of-space condition, a maximum-extents-reached condition, or a space-quota-exceeded condition. For non-resumable space allocation, these conditions result in errors and the statement is rolled back. Suspending a statement automatically results in suspending the transaction; thus all transactional resources are held through a statement suspend and resume. When the error condition disappears (for example, as a result of user intervention or perhaps sort space released by other queries), the suspended statement automatically resumes execution. Process of a Transaction: A DBMS must provide a transaction processing system (TP system) to guarantee that if a transaction executes some updates and then a failure occurs, for whatever reason, before the transaction reaches its termination, those updates will be undone. Therefore the transaction either executes in its entirety or is totally cancelled. Transaction processing systems provide tools to help software development for applications that involve querying and updating databases. The term TP system is generally taken to mean a complete system, including application generators, one or more database systems, utilities and networking software. Within a TP system there is a core collection of services, called the TP monitor, that coordinates the flow of transactions through the system. (Fig.)
In order to work properly, a transaction-processing system needs the following system properties: High availability: the system must be on-line and operational while the enterprise is functioning. High reliability: it correctly tracks state, does not lose data, and controls concurrency. High throughput: many users means many transactions per second. Low response time: users are on-line and waiting. Long lifetime: complex systems are not easily replaced, so they must be designed to be easily extended as the needs of the enterprise change. Security: sensitive information must be carefully protected, since the system is accessible to many users. Roles in the Design, Implementation, and Maintenance of a TPS
The system analyst specifies the system using input from the customer and provides a complete description of its functionality from the customers' and users' point of view. The database designer specifies the structure of the data that will be stored in the database. The application programmer implements the application programs (transactions) that access the data and support the enterprise rules. The database administrator maintains the database once the system is operational: space allocation, performance optimization, database security. The system administrator maintains the transaction-processing system, monitors the interconnection of hardware and software modules, and deals with failures and congestion. A transaction begins when the first executable SQL statement is encountered. An executable SQL statement is a SQL statement that generates calls to an instance, including DML and DDL statements. When a transaction begins, the DBMS assigns the transaction to an available undo tablespace or rollback segment to record the rollback entries for the new transaction. A transaction ends when any of the following occurs: a user issues a COMMIT or ROLLBACK statement without a SAVEPOINT clause; a user runs a DDL statement such as CREATE, DROP, RENAME, or ALTER (if the current transaction contains any DML statements, the DBMS first commits the transaction and then runs and commits the DDL statement as a new, single-statement transaction); a user disconnects from the DBMS (the current transaction is committed); a user process terminates abnormally (the current transaction is rolled back). After one transaction ends, the next executable SQL statement automatically starts the following transaction. Note that applications should always explicitly commit or roll back transactions before program termination. Committing Transactions: Committing a transaction means making permanent the changes performed by the SQL statements within the transaction. Before a transaction that modifies data is committed, the following has occurred: the DBMS has generated rollback segment records in buffers in the SGA that store rollback segment data (the rollback information contains the old data values changed by the SQL statements of the transaction); the DBMS has generated redo log entries in the redo log buffer of the SGA (the redo log records contain the change to the data block and the change to the rollback block, and these changes may go to disk before the transaction is committed); the changes have been made to the database buffers of the SGA (these changes too may go to disk before the transaction is committed). Note that the data changes for a committed transaction, stored in the database buffers of the SGA, are not necessarily written immediately to the data files by the database writer (DBWn) background process; this writing takes place when it is most efficient for the database to do so, and it can happen before the transaction commits or, alternatively, some time after the transaction commits. When a transaction is committed, the following occurs: the internal transaction table for the associated rollback segment records that the transaction has committed, and the corresponding unique system change number (SCN) of the transaction is assigned and recorded in the table; the log writer process (LGWR) writes the redo log entries in the SGA's redo log buffers to the online redo log file, and also writes the transaction's SCN to the online redo log file. This atomic event constitutes the commit of the transaction.
The DBMS then releases the locks held on rows and tables and marks the transaction complete. Rollback of Transactions: Rolling back means undoing any changes to data that have been performed by SQL statements within an uncommitted transaction. The DBMS uses undo tablespaces or rollback segments to store old values; the redo log contains a record of changes. The DBMS lets you roll back an entire uncommitted transaction. Alternatively, you can roll back the trailing portion of an uncommitted transaction to a marker called a savepoint. All types of rollback use the same procedures: statement-level rollback (due to a statement or deadlock execution error); rollback to a savepoint; rollback of a transaction due to user request; rollback of a transaction due to abnormal process termination; rollback of all outstanding transactions when an instance terminates abnormally; rollback of incomplete transactions during recovery. In rolling back an entire transaction, without referencing any savepoints, the following occurs: 1. The DBMS undoes all changes made by all the SQL statements in the transaction by using the corresponding undo tablespace or rollback segment. 2. The DBMS releases all the transaction's locks on data. 3. The transaction ends. Savepoints in Transactions: You can declare intermediate markers called savepoints within the context of a transaction. Savepoints divide a long transaction into smaller parts. Using savepoints, you can arbitrarily mark your work at any point within a long transaction; you then have the option later of rolling back work performed after a declared savepoint but before the current point in the transaction. For example, you can use savepoints throughout a long, complex series of updates, so if you make an error, you do not need to resubmit every statement. Savepoints are similarly useful in application programs. If a procedure contains several functions, you can create a savepoint before each function begins; then, if a function fails, it is easy to return the data to its state before the function began and re-run the function with revised parameters or perform a recovery action. After a rollback to a savepoint, the DBMS releases the data locks obtained by the rolled-back statements. Other transactions that were waiting for the previously locked resources can proceed, and other transactions that want to update previously locked rows can do so. When a transaction is rolled back to a savepoint, the following occurs: the DBMS rolls back only the statements run after the savepoint; the DBMS preserves the specified savepoint, but all savepoints that were established after the specified one are lost; the DBMS releases all table and row locks acquired since that savepoint but retains all data locks acquired prior to the savepoint.
The transaction remains active and can be continued. Note that whenever a session is waiting on a transaction, a rollback to a savepoint does not free row locks. To make sure a transaction does not hang if it cannot obtain a lock, use FOR UPDATE ... NOWAIT before issuing UPDATE or DELETE statements. Transaction Naming: You can name a transaction using a simple and memorable text string. This name is a reminder of what the transaction is about. Transaction names replace commit comments for distributed transactions, with the following advantages: it is easier to monitor long-running transactions and to resolve in-doubt distributed transactions; you can view transaction names along with transaction IDs in applications (for example, a database administrator can view transaction names in Enterprise Manager when monitoring system activity); transaction names are written to the transaction auditing redo record, so LogMiner can use transaction names to search for a specific transaction among the transaction auditing records in the redo log; you can use transaction names to find a specific transaction in data dictionary tables such as V$TRANSACTION. Name a transaction using the SET TRANSACTION ... NAME statement before you start the transaction. When you name a transaction, you associate the transaction's name with its ID. Transaction names do not have to be unique; different transactions can have the same transaction name at the same time, even by the same owner. You can use any name that enables you to distinguish the transaction. The Two-Phase Commit Mechanism: In a distributed database, the DBMS must coordinate transaction control over a network and maintain data consistency, even if a network or system failure occurs. A distributed transaction is a transaction that includes one or more statements that update data on two or more distinct nodes of a distributed database. A two-phase commit mechanism guarantees that all database servers participating in a distributed transaction either all commit or all roll back the statements in the transaction. A two-phase commit mechanism also protects implicit DML operations performed by integrity constraints, remote procedure calls, and triggers. The two-phase commit mechanism is completely transparent to users who issue distributed transactions; in fact, users need not even know the transaction is distributed. A COMMIT statement denoting the end of a transaction automatically triggers the two-phase commit mechanism to commit the transaction. No coding or complex statement syntax is required to include distributed transactions within the body of a database application. The recoverer (RECO) background process automatically resolves the outcome of in-doubt distributed transactions, i.e. distributed transactions in which the commit was interrupted by some type of system or network failure. After the failure is repaired and communication is re-established, the RECO process of each local DBMS server automatically commits or rolls back any in-doubt distributed transactions consistently on all involved nodes. In the event of a long-term failure, the DBMS allows each local administrator to manually commit or roll back any distributed transactions that are in doubt as a result of the failure. This option enables the local
database administrator to free any locked resources that are held indefinitely as a result of the long-term failure. If a database must be recovered to a point in the past, the DBMS's recovery facilities enable database administrators at other sites to return their databases to the earlier point in time as well. This operation ensures that the global database remains consistent. Read/Write Operations: Since a transaction is a general program, there is an enormous number of potential operations that a transaction can perform. However, there are only two really important operations: 1. read(A, t) (or read(A) when t is not important): reads database element A into local variable t. 2. write(A, t) (or write(A) when t is not important): writes the value of local variable t to database element A. We will assume that the buffer manager ensures that the database element is in memory. We could make the memory management more explicit by using the following operations: 3. input(A): reads database element A into a local memory buffer. 4. output(A): copies the block containing A to disk. Let us consider an example to understand the use of the read and write operations. Suppose that we want to transfer $50 from account A to account B; then the set of operations performed is: 1. read(A, t) 2. t = t - 50 3. write(A, t) 4. read(B, t) 5. t = t + 50 6. write(B, t). The first step reads the amount in account A into the local variable t with the help of the read operation. In step 2, we reduce the value of t by 50. Step 3 writes the updated value back to account A with the help of the write operation. In step 4, the value of account B is read into local variable t, which is incremented by 50 in step 5. The value of t is written to account B in step 6 with the help of the write operation. Transaction Properties (ACID Properties): Any change to system state within a transaction boundary has to ensure that the change leaves the system in a stable and consistent state. A transactional unit of work is one in which the following four fundamental transactional properties are satisfied: atomicity, consistency, isolation, and durability (ACID). We will examine each property in detail. Atomicity: The results of a transaction's execution are either all committed or all rolled back; all changes take effect, or none do. It is common to refer to a transaction as a unit of work. In describing a transaction as a unit of work, we are describing one fundamental property of a transaction: that the activities within it must be considered indivisible, that is, atomic. A Flute Bank customer may interact with Flute's ATM and transfer money from a checking account to a savings account. Within the Flute Bank software system, a transfer transaction involves two actions: debit of the checking account and credit to the savings account. For the transfer transaction to be successful, both actions must complete successfully. If either one fails, the transaction fails. The atomic property of
transactions dictates that all individual actions that constitute a transaction must succeed for the transaction to succeed and, conversely, that if any individual action fails, the transaction as a whole must fail. As an example, consider two transactions: T1: BEGIN A = A + 100, B = B - 100 END; T2: BEGIN A = 1.06*A, B = 1.06*B END. Intuitively, the first transaction transfers $100 from B's account to A's account, and the second credits both accounts with a 6% interest payment. There is no guarantee that T1 will execute before T2 or vice versa, if both are submitted together. However, the net effect must be equivalent to these two transactions running serially in some order, for example T1 (A = A + 100, B = B - 100) followed by T2 (A = 1.06*A, B = 1.06*B). That is acceptable. But the statements of T1 and T2 may also be interleaved; the DBMS's view of such a schedule is T1: read(A), write(A), read(B), write(B) interleaved with T2: read(A), write(A), read(B), write(B). Fig.: A possible interleaving (schedule). Consistency: The database is transformed from one valid state to another valid state. This defines a transaction as legal only if it obeys user-defined integrity constraints. Illegal transactions aren't allowed and, if an integrity constraint can't be satisfied, the transaction is rolled back. For example, suppose that you define a rule that, after a transfer of more than $10,000 out of the country, a row is added to an audit table so that you can prepare a legally required report for the IRS. Perhaps for performance reasons that audit table is stored on a separate disk from the rest of the database. If the audit table's disk is off-line and can't be written, the transaction is aborted. A database or other persistent store usually defines referential and entity integrity rules to ensure that data in the store is consistent. A transaction that changes the data must ensure that the data remains in a consistent state, i.e. that data integrity rules are not violated, regardless of whether the transaction succeeded or failed. The data in the store may not be consistent during the transaction, but the inconsistency is invisible to other transactions, and consistency must be restored when the transaction completes. Isolation: When multiple transactions are in progress, one transaction may want to read data another transaction has changed but not committed. Until the transaction commits, the changes it has made should be treated as transient state, because the transaction could roll back the change. If other transactions read intermediate or transient states caused by a transaction in progress, additional application logic must be executed to handle the effects of some transactions having read potentially erroneous data. The isolation property of transactions dictates how concurrent transactions that act on the same subset of data behave. That is, the isolation property determines the degree to which the effects of multiple transactions, acting on the same subset of application state, are isolated from each other.
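The two transactions T1 and T2 above can be written as ordinary SQL transactions; whether their interleaved execution is safe is exactly what the isolation levels discussed next control. The ACCOUNTS table and OWNER column are assumptions introduced only for this sketch.
-- T1: move $100 from B's account to A's
START TRANSACTION;
UPDATE ACCOUNTS SET BALANCE = BALANCE + 100 WHERE OWNER = 'A';
UPDATE ACCOUNTS SET BALANCE = BALANCE - 100 WHERE OWNER = 'B';
COMMIT;
-- T2: credit 6% interest to both accounts
START TRANSACTION;
UPDATE ACCOUNTS SET BALANCE = BALANCE * 1.06 WHERE OWNER = 'A';
UPDATE ACCOUNTS SET BALANCE = BALANCE * 1.06 WHERE OWNER = 'B';
COMMIT;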
At the lowest level of isolation, a transaction may read data that is in the process of being changed by another transaction but that has not yet been committed. If the first transaction is rolled back, the transaction that read the data would have read a value that was never committed. This level of isolation, read uncommitted (or "dirty read"), can cause erroneous results but ensures the highest concurrency. An isolation level of read committed ensures that a transaction can read only data that has been committed. This level of isolation is more restrictive (and consequently provides less concurrency) than read uncommitted and helps avoid the problem associated with it. An isolation level of repeatable read signifies that a transaction that has read a piece of data is guaranteed that the data will not be changed by another transaction until the transaction completes. The name "repeatable read" comes from the fact that a transaction with this isolation level can read the same data repeatedly and be guaranteed to see the same value. The most restrictive form of isolation is serializable. This level combines the properties of the repeatable-read and read-committed isolation levels, effectively ensuring that transactions that act on the same piece of data are serialized and will not execute concurrently. The isolation portion of the ACID properties is needed when there are concurrent transactions, i.e. transactions that occur at the same time, such as multiple users accessing shared objects. This situation is illustrated at the top of the figure as activities occurring over time. The safeguards used by a DBMS to prevent conflicts between concurrent transactions are referred to as isolation.
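Isolation levels are normally chosen per transaction. A minimal sketch using standard SQL syntax follows; the levels actually supported, and the default level, vary between DBMS products.
-- Applies to the transaction that starts next, so issue it just before
-- the transaction's first statement:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- A stricter choice, e.g. for the catalog-editing scenario described below:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;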
Fig. Concurrently executing transactions
As an example, if two people are updating the same catalog item, it is not acceptable for one person's changes to be clobbered when the second person saves a different set of changes. Both users should be able to work in isolation, each working as though he or she were the only user; each set of changes must be isolated from those of the other users. An important concept for understanding isolation is serializability. Transactions are serializable when the effect on the database is the same whether the transactions are executed in serial order or in an interleaved fashion. As you can see at the top of the figure, Transaction 1 through Transaction 3 are executing concurrently over time. The effect on the DBMS is that the transactions may execute in serial order based on consistency and isolation requirements. If you look at the bottom of the figure, you can see several ways in which these transactions may execute. It is important to note that
a serialized execution does not imply that the first transaction will automatically be the one that terminates before the other transactions in the serial order.
Degrees of Isolation
Degree 0: A transaction does not overwrite dirty data of other transactions (data updated by another, as yet uncommitted, user or process).
Degree 1: Degree 0, plus a transaction does not commit any writes until it completes all its writes (until the end of the transaction).
Degree 2: Degree 1, plus a transaction does not read dirty data from other transactions.
Degree 3: Degree 2, plus other transactions do not dirty (overwrite) data read by a transaction before that transaction commits.
These were originally described as degrees of consistency by Jim Gray.
For example, consider two transactions: the first transfers $100 from B's account to A's, and the second credits both accounts with 6% interest. Assume that A and B each hold $1000 at first. What, then, are the legal outcomes of running T1 and T2? There is no guarantee that T1 will execute before T2 or vice versa if both are submitted together. Now consider a possible interleaved schedule; a short calculation of the two legal serial outcomes, against which any interleaving must be judged, is sketched below.
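A minimal Python sketch (my addition) of the only two legal final states, namely those produced by some serial order of T1 and T2 starting from A = B = $1000:

def t1(a, b):   # transfer $100 from B's account to A's
    return a + 100, b - 100

def t2(a, b):   # credit 6% interest to both accounts
    return a * 1.06, b * 1.06

a0, b0 = 1000.0, 1000.0
a1, b1 = t2(*t1(a0, b0))
a2, b2 = t1(*t2(a0, b0))
print("T1 then T2:", round(a1, 2), round(b1, 2))   # 1166.0 954.0
print("T2 then T1:", round(a2, 2), round(b2, 2))   # 1160.0 960.0
# Any interleaving the DBMS accepts must be equivalent to one of these two outcomes.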
Durability
Once committed (completed), the results of a transaction are permanent and survive future system and media failures. If the airline reservation system computer gives you seat 22A and crashes a millisecond later, it won't have forgotten that you are sitting in 22A and will not also give the seat to someone else. Furthermore, if a programmer spills coffee into a disk drive, it will be possible to install a new disk and recover the transactions up to the coffee spill, showing that you had seat 22A. The durability property of transactions refers to the fact that the effect of a transaction must endure beyond the life of the transaction and the application. That is, state changes made within a transactional boundary must be persisted onto permanent storage media, such as disks, databases, or file systems. If the application fails after the transaction has committed, the system should guarantee that the effects of the transaction will be visible when the application restarts. Transactional resources are also recoverable: should the persisted data be destroyed, recovery procedures can be executed to recover the data to a point in time (provided the necessary administrative tasks were properly executed). Any change committed by one transaction must remain durable until another valid transaction changes the data.
States of Transaction
During its execution, a transaction can be in many states. These states indicate the status of the transaction. The various states in which a transaction can be are:
1. Active: This is the initial state of a transaction. The transaction stays in this state while it is executing. A transaction enters the active state when the first query or update is encountered. Data is processed in a buffer or on disk.
2. Partially committed: A transaction is partially committed after its final statement has been executed; it changes its state from active to partially committed. A transaction enters this state immediately before the commit. All operations are completed (in the memory buffer or on disk) and wait to be finalized.
3. Failed: A transaction enters the failed state after the discovery that normal execution can no longer proceed; it changes its state from active to failed. A transaction enters this state when it is interrupted by an event such as a program exception or a system failure.
4. Aborted: A transaction is aborted after it has been rolled back and the database restored to its state prior to the transaction. A transaction enters this state after a rollback or at system recovery. All updates made by the transaction are rolled back and the database is restored to the state prior to the start of the transaction. There are two options after an abort:
Restart the transaction: selected only if there is no internal logical error.
Kill the transaction: selected if there is a problem with the transaction itself.
5. Committed: The commit state occurs after successful completion (terminated may also be considered a transaction state). A transaction enters this state after the commit. Updates are guaranteed to be permanent.
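As a textual companion to the state diagram in the figure below, here is a small Python sketch (my addition; the state names mirror the list above, not any real DBMS API):

# Legal transitions between the transaction states described above.
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "aborted":             set(),   # the system may then restart or kill the transaction
    "committed":           set(),
}

def can_move(state, new_state):
    return new_state in TRANSITIONS[state]

assert can_move("active", "partially committed")
assert can_move("partially committed", "committed")
assert not can_move("failed", "committed")   # a failed transaction can only be aborted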
Fig. Transaction State Diagram
Advantages of Concurrent Execution of Transactions
Concurrent execution of transactions means executing more than one transaction at the same time. The schedule shown in the figure represents an interleaved execution of two transactions. Ensuring transaction isolation while permitting such concurrent execution is difficult.
Thus, multiple transactions are allowed to run concurrently in the system. The advantages of concurrent execution of transactions are:
Increased processor utilization: If the system is executing only one transaction, the processor might not always be busy; for example, if the only transaction in the system is waiting for the completion of some I/O operation, the processor is also waiting and thus doing no work. If the system is executing more than one transaction at the same time, the processor can stay busy executing one or the other transaction.
Increased disk utilization.
Better transaction throughput: One transaction can be using the CPU while another is reading from or writing to the disk.
Reduced average response time for transactions: Short transactions need not wait behind long ones.
Reduced average turnaround time for transactions: Turnaround time is the interval between transaction submission and transaction completion. As more than one transaction executes at the same time, the average turnaround time is reduced.
Reduced average wait time for transactions: More transactions are completed in less time.
Q. 4. What are the various Locking Techniques for Concurrency Control?
Ans. A lock is a variable associated with a data item in the database that describes the status of that item with respect to possible access operations on it. Locks enable a multi-user DBMS to maintain the integrity of transactions by isolating a transaction from others executing concurrently. Locks are particularly critical in write-intensive and mixed (read/write) workloads, because they can prevent the inadvertent loss of data or consistency problems with reads. Consider the lost update situation that could occur if a DBMS did not lock data: two transactions read the same bank account balance, each intending to add money to it. However, because the second transaction bases its update on the original balance, the money deposited by the first is lost; a short sketch of this scenario is given below. We could have avoided this scenario if the DBMS had appropriately locked the balance on behalf of the first transaction in preparation for its update; the second transaction would have waited, thereby using the updated balance as the basis for its work. As with locking data in preparation for writes, locking data for reads can be important in certain situations, preventing inconsistent analysis of the database. While DBMSs use exclusive locks for writes, share locks are commonly used for reads. Share locks enable other concurrently executing transactions to read the same data but prohibit any transaction from writing the chosen data.
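A minimal Python sketch (my addition, with made-up amounts) of the lost update scenario described above, where two transactions both read the balance before either writes it back:

balance = 100

snapshot_t1 = balance        # T1 reads the balance (100)
snapshot_t2 = balance        # T2 reads the balance (100), before T1 writes
balance = snapshot_t1 + 50   # T1 writes 150
balance = snapshot_t2 + 70   # T2 writes 170, overwriting T1's update

print(balance)               # 170, instead of the correct 220; T1's deposit is lost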
Consider a situation that might occur without share locks. The first transaction reads the balances of multiple accounts (perhaps through multiple queries) with the intention of calculating an accurate sum. Another transaction transfers money from one account to another during this process. The timing of this work is such that the first transaction reads only part of the effect of the transfer, making its sum total inconsistent with what it should be. If share locks were held by the first transaction until transaction commit, this inconsistent analysis would not occur.
Serializability is an important concept associated with locking. It guarantees that the work of concurrently executing transactions will leave the database in the same consistent state as it would have been in had those transactions executed serially. This requirement is the ultimate criterion for database consistency and is the motivation for the two-phase locking protocol, which dictates that no new locks can be acquired on behalf of a transaction after the DBMS releases a lock held by that transaction. In practice, this protocol generally means that locks are held until commit time. Aside from their integrity implications, locks can have a significant impact on performance. While it may benefit a given application to lock a large amount of data (perhaps one or more tables) and hold these locks for a long period of time, doing so inhibits concurrency and increases the likelihood that other applications will have to wait for locked resources. Yet locking only small amounts of data and releasing these locks quickly may be inappropriate for some applications, increasing the overhead associated with transaction processing. In addition, certain integrity problems can arise if a single transaction acquires locks after some have already been released.
Need for Locks
Locks are required for the following reasons. We need isolation (the I of ACID): give each transaction the illusion that there are no concurrent updates, hide concurrency anomalies, and do it automatically (the system does not know transaction semantics). The goal of locking is to provide concurrency with an execution equivalent to some serial execution of the system; the outcome is not deterministic, just a consistent transformation.
Locks are a popular approach to concurrency control. Transactions request and acquire locks on data items which they wish to access and which they do not want other transactions to update; that is, a lock locks other transactions out. Transactions cannot access data items unless they hold the appropriate lock. Most locking protocols are based on two types of locks:
WRITE (or exclusive) locks: if a transaction holds a write lock on an item, no other transaction may acquire a read or write lock on that item.
READ (or shared) locks: if a transaction holds a read lock on an item, no other transaction may acquire a write lock on that item.
Transactions go into a WAIT state until the required lock is available. Acquisition of locks is the responsibility of the transaction management subsystem. For strict schedules, i.e. simple recovery,
transactions should hold all exclusive locks until COMMIT or ROLLBACK time. Thus no transaction can read or update an item until the last transaction that updated it has committed and released the exclusive lock.
LOCK MANAGEMENT
The part of the DBMS that keeps track of the locks issued to transactions is called the lock manager. The lock manager maintains a lock table, which is a hash table with the data object identifier as the key. The DBMS also maintains a descriptive entry for each transaction in a transaction table, and among other things, the entry contains a pointer to a list of locks held by the transaction. A lock table entry for an object, which can be a page, a record, and so on, depending on the DBMS, contains the following information: the number of transactions currently holding a lock on the object (this can be more than one if the object is locked in shared mode), the nature of the lock (shared or exclusive), and a pointer to a queue of lock requests.
Implementing Lock and Unlock Requests
According to the Strict 2PL protocol, before a transaction T reads or writes a database object O, it must obtain a shared or exclusive lock on O and must hold on to the lock until it commits or aborts. When a transaction needs a lock on an object, it issues a lock request to the lock manager:
1. If a shared lock is requested, the queue of requests is empty, and the object is not currently locked in exclusive mode, the lock manager grants the lock and updates the lock table entry for the object (indicating that the object is locked in shared mode and incrementing the number of transactions holding a lock by one).
2. If an exclusive lock is requested and no transaction currently holds a lock on the object (which also implies the queue of requests is empty), the lock manager grants the lock and updates the lock table entry.
3. Otherwise, the requested lock cannot be immediately granted, and the lock request is added to the queue of lock requests for this object. The transaction requesting the lock is suspended.
When a transaction aborts or commits, it releases all its locks. When a lock on an object is released, the lock manager updates the lock table entry for the object and examines the lock request at the head of the queue for this object. If this request can now be granted, the transaction that made the request is woken up and given the lock. Indeed, if there are several requests for a shared lock on the object at the front of the queue, all of these requests can now be granted together. Note that if T1 has a shared lock on O and T2 requests an exclusive lock, T2's request is queued. Now, if T3 requests a shared lock, its request enters the queue behind that of T2, even though the requested lock is compatible with the lock held by T1. This rule ensures that T2 does not starve, that is, wait indefinitely while a stream of other transactions acquire shared locks and thereby prevent T2 from getting the exclusive lock that it is waiting for.
Atomicity of Locking and Unlocking
The implementation of the lock and unlock commands must ensure that these are atomic operations. To ensure atomicity of these operations when several instances of the lock manager code can execute concurrently, access to the lock table has to be guarded by an operating system synchronization mechanism such as a semaphore. To understand why, suppose that a transaction requests an exclusive lock. The lock manager checks and finds that no other transaction holds a lock on the object and therefore decides to grant the request.
But in the meantime, another transaction might have requested and received a conflicting lock! To prevent this, the entire sequence of actions in a lock request call
(checking to see if the request can be granted, updating the lock table, etc.) must be implemented as an atomic operation. The DBMS maintains a transaction table, which contains (among other things) a list of the locks currently held by a transaction. This list can be checked before requesting a lock, to ensure that the same transaction does not request the same lock twice. However, a transaction may need to acquire an exclusive lock on an object for which it already holds a shared lock. Such a lock upgrade request is handled specially: the write lock is granted immediately if no other transaction holds a shared lock on the object, and the request is inserted at the front of the queue otherwise. The rationale for favoring the transaction in this way is that it already holds a shared lock on the object; queuing it behind another transaction that wants an exclusive lock on the same object would cause both transactions to wait for each other and therefore be blocked forever.
We have concentrated thus far on how the DBMS schedules transactions based on their requests for locks. This interleaving interacts with the operating system's scheduling of processes' access to the CPU and can lead to a situation called a convoy, where most of the CPU cycles are spent on process switching. The problem is that a transaction T holding a heavily used lock may be suspended by the operating system. Until T is resumed, every other transaction that needs this lock is queued. Such queues, called convoys, can quickly become very long; a convoy, once formed, tends to be stable. Convoys are one of the drawbacks of building a DBMS on top of a general-purpose operating system with preemptive scheduling.
In addition to locks, which are held over a long duration, a DBMS also supports short-duration latches. Setting a latch before reading or writing a page ensures that the physical read or write operation is atomic; otherwise, two read/write operations might conflict if the objects being locked do not correspond to disk pages (the units of I/O). Latches are unset immediately after the physical read or write operation is completed.
TYPES OF LOCKS
Any data that are retrieved by a user for updating must be locked, or denied to other users, until the update is completed or aborted. Locking can be done at different levels; these levels include database, table, record, and field. Data items can be locked in two modes:
1. Exclusive (X) lock: The data item can be both read and written. An X-lock is requested using the lock-X instruction. It prevents another transaction from reading a record until it is unlocked.
2. Shared (S) lock: The data item can only be read. An S-lock is requested using the lock-S instruction. It allows other transactions to read a record or other resource.
Binary Locks
A binary lock has only two states: locked (1) or unlocked (0). If an object is locked by a transaction, no other transaction can use that object. If an object is unlocked, any transaction can lock the object for its use. A transaction must unlock the object after its termination. Every transaction requires a lock and unlock operation for each data item that is accessed. Binary locks are simple but restrictive. Checking is done before an entry is made, waiting is done when the object is found locked, and unlocking is done after use.
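The grant rules and FIFO queueing described under Lock Management above can be illustrated with a toy lock table in Python (my addition; not a real DBMS interface):

from collections import deque

class LockTable:
    def __init__(self):
        self.holders = {}   # object -> (mode, set of transaction ids)
        self.queue = {}     # object -> deque of waiting (transaction, mode) requests

    def request(self, txn, obj, mode):           # mode is "S" or "X"
        held = self.holders.get(obj)
        waiting = self.queue.setdefault(obj, deque())
        if held is None and not waiting:
            self.holders[obj] = (mode, {txn})    # rules 1 and 2: grant immediately
            return "granted"
        if mode == "S" and held is not None and held[0] == "S" and not waiting:
            held[1].add(txn)                     # rule 1: share the S lock
            return "granted"
        waiting.append((txn, mode))              # rule 3: queue the request (FIFO)
        return "waiting"

lt = LockTable()
print(lt.request("T1", "O", "S"))   # granted
print(lt.request("T2", "O", "X"))   # waiting
print(lt.request("T3", "O", "S"))   # waiting, queued behind T2 so T2 cannot starve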
Share/Exclusive Locks
An exclusive lock exists when access is specifically reserved for the transaction that locked the object. The exclusive lock must be used when the potential for conflict exists. An exclusive lock is issued when a transaction wants to write (update) a data item and no locks are currently held on that data item. There are two basic requirements of locking: READ operations (such as SELECT and FETCH) must acquire a SHARE lock before rows can be retrieved, and WRITE operations (such as UPDATE, INSERT, and DELETE) must acquire an EXCLUSIVE lock before rows can be modified.
A SHARE (S) lock permits reading by other users. No other transaction may modify the data that is locked with an S lock. When an S lock is obtained at the table level, the transaction can read all rows in the table; no row- or page-level locks are acquired when the transaction reads a row (the S lock at the table level covers all of the rows in the table, so additional locks are not necessary). When an S lock is obtained at the page level, the transaction can read all rows on the page; no row-level locks are acquired when the transaction reads a row (the S lock at the page level covers all of the rows on the page). When an S lock is obtained at the row level, the transaction can read the row.
An EXCLUSIVE (X) lock prevents access by any other user. An X lock is the strongest type of lock. No other transaction may read or modify the data that is locked with an X lock. An X lock must be obtained (either at the table, page, or row level) when user data is updated, inserted, or deleted. When an X lock is obtained at the table level, the transaction can read and modify all rows in the table; no row- or page-level locks are acquired when the transaction reads or modifies a row. When an X lock is obtained at the page level, the transaction can read and modify all rows on the page; no row-level locks are acquired when the transaction reads or modifies a row. When an X lock is obtained at the row level, the transaction can read and modify the row.
Disadvantages of Locking
Pessimistic concurrency control has a number of key disadvantages, particularly in distributed systems:
Overhead: Locking costs, and you pay even if no conflict occurs. Even read-only actions must acquire locks. High overhead forces careful choices about lock granularity.
Low concurrency: If locks are too coarse, they reduce concurrency unnecessarily. The need for strict 2PL to avoid cascading aborts makes it even worse.
Low availability: A client cannot make progress if the server or lock holder is temporarily unreachable.
Deadlock.
TWO PHASE LOCKING PROTOCOL
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. The rules for the two-phase locking protocol are:
Two transactions cannot hold conflicting locks.
No unlock operation can precede a lock operation in the same transaction.
No data are affected until all locks are obtained, that is, until the transaction reaches its lock point.
The two-phase locking protocol ensures conflict-serializable schedules. It has two phases:
Phase 1: Growing phase. The transaction may obtain locks but may not release locks.
Phase 2: Shrinking phase. The transaction may release locks but may not obtain locks.
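A small Python sketch (my addition) that checks whether a transaction's sequence of lock and unlock actions obeys the two-phase rule, i.e. that no lock is requested after the first unlock:

def is_two_phase(actions):
    # actions is a list such as ["lock-S(A)", "lock-X(B)", "unlock(A)", "unlock(B)"]
    shrinking = False
    for action in actions:
        if action.startswith("unlock"):
            shrinking = True
        elif shrinking:              # a lock request in the shrinking phase
            return False
    return True

print(is_two_phase(["lock-S(A)", "lock-X(B)", "unlock(A)", "unlock(B)"]))  # True
print(is_two_phase(["lock-S(A)", "unlock(A)", "lock-S(B)", "unlock(B)"]))  # False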
The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point at which a transaction acquires its final lock). Two-phase locking does not ensure freedom from deadlocks. Cascading rollback is possible under two-phase locking; to avoid it, a modified protocol called strict two-phase locking is followed, in which a transaction must hold all its exclusive locks until it commits or aborts. Rigorous two-phase locking is even stricter: here all locks are held until commit or abort, and transactions can be serialized in the order in which they commit. There can be conflict-serializable schedules that cannot be obtained if two-phase locking is used. However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense: given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Example of a transaction performing locking:
Begin(T1)
Lock(T1, A, S)
Read(T1, A)
Unlock(T1, A)
Begin(T2)
Lock(T2, B, S)
Read(T2, B)
Unlock(T2, B)
Display(A + B)
Commit(T1)
Commit(T2)
Locking as above is not sufficient to guarantee serializability: if A and B are updated in between the read of A and the read of B, the displayed sum will be wrong.
Two-phase locking with lock conversions:
First phase: can acquire a lock-S on an item, can acquire a lock-X on an item, and can convert a lock-S to a lock-X (upgrade).
Second phase: can release a lock-S, can release a lock-X, and can convert a lock-X to a lock-S (downgrade).
This protocol assures serializability, but it still relies on the programmer to insert the various locking instructions.
IMPLEMENTATION OF LOCKING
A lock manager can be implemented as a separate process to which transactions send lock and unlock requests. The lock manager replies to a lock request by sending a lock grant message (or a message asking the transaction to roll back, in case of a deadlock). The requesting transaction waits until its request is answered. The lock manager maintains a data structure called a lock table to record granted locks and pending requests. The lock table is usually implemented as an in-memory hash table indexed on the name of the data item being locked. In the lock table figure, black rectangles indicate granted locks and white ones indicate waiting requests. The lock table also records the type of lock granted or requested. A new request is added to the end of the queue of requests for the data item and granted if it is compatible with all earlier locks. Unlock requests result in the request being deleted, and later requests are checked to see if they can now be granted. If a transaction aborts, all waiting or granted requests of the transaction are deleted; the lock manager may keep a list of locks held by each transaction to implement this efficiently.
Problems With Two Phase Locking Protocol
Consider the partial schedule
Fig. An example schedule
Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A. Such a situation is called a deadlock. To handle a deadlock, one of T3 or T4 must be rolled back and its locks released.
The potential for deadlock exists in most locking protocols; deadlocks are a necessary evil. Starvation is also possible if the concurrency control manager is badly designed. For example, a transaction may be waiting for an X-lock on an item while a sequence of other transactions request and are granted an S-lock on the same item, or the same transaction may be repeatedly rolled back due to deadlocks. The concurrency control manager can be designed to prevent starvation.
PRECEDENCE GRAPH
A precedence graph is used for testing serializability of a schedule. There is one node for each transaction in the schedule.
If the precedence graph has a cycle, the schedule is not serializable. If it has no cycle, any ordering of the transactions which obeys the arrows is an equivalent serial schedule, so the schedule is serializable (a small cycle-check sketch is given below).
Solution of the Inconsistency Problem
The problem of inconsistent analysis can be solved with the help of locks. Let us understand this with the help of the following example. Initial values: A = $400, B = $500, and C = $300.
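A minimal Python sketch (my addition) of the precedence-graph test: build the edges Ti -> Tj, meaning "Ti must precede Tj", and look for a cycle.

def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in visiting:
            return True                  # back edge: cycle found
        if node in done:
            return False
        visiting.add(node)
        found = any(visit(n) for n in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in graph)

print(has_cycle([("T1", "T2")]))                 # False: serializable as T1 then T2
print(has_cycle([("T1", "T2"), ("T2", "T1")]))   # True: not conflict serializable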
Note that the locking example above (with initial values A = $400, B = $500, C = $300) leads to a deadlock. Still, it is an acceptable solution because the ACID properties are preserved.
Q. 5. What is Concurrency Control Based on Timestamp Ordering?
Ans. Concurrency control: Concurrency control is a method used to ensure that database transactions are executed in a safe manner. It is the process of managing operations against a database so that data operations do not interfere with each other in a multi-user environment. One example of a shared database used in an on-line manner is an airline reservation database accessed by many agents from their terminals. A database could also be accessed in
batch mode, and it may be used concurrently with the on-line mode. Sharing the database for read-only access does not cause any problem, but if one of the transactions running concurrently tries to modify the same data item, it could lead to inconsistencies. Further, if more than one transaction is allowed to simultaneously modify a data item in the database, it could lead to an incorrect value for the data item and an inconsistent database. For example, suppose that two ticket agents access the on-line reservation system simultaneously to see whether a seat is available on a given flight; if both agents make a reservation against the last available seat on that flight, the flight is overbooked and the data is left in an inconsistent state. Concurrent processing of programs, processes or jobs is similar to multiprogramming, i.e. a number of jobs or programs are processed simultaneously to achieve their independent and different goals according to their own requirements. Concurrency problems arise because, even when each transaction is correct on its own, concurrent processing can leave the database inconsistent after the transactions complete. In the case of concurrent operation, where a number of transactions are running and using the database, we cannot make any assumption about the order in which the statements belonging to different transactions will be executed. The order in which these statements are executed is called a schedule, and a schedule produced by such interleaved processing is called a concurrent schedule. Some problems that occur during scheduling and concurrent processing are:
(a) Lost update problem: Consider the two transactions given below, both accessing the same data item A. Each of these transactions modifies the data item and writes it back. Then we see that the concurrent processing of the modification of the value of A can cause one of the updates to be lost.
(b) Inconsistent Read Problem: The lost update problem was caused by concurrent modification of the same data item. However, concurrency can also cause problems when only one transaction modifies a given set of data while that set of data is being used by another transaction. For example, suppose two transactions T5 and T6 occur in a schedule, and A and B are data items with integer values. If the two are processed concurrently, one may be reading the data while the other is modifying it, so the reader can see an inconsistent mixture of old and new values and cannot tell which data is correct.
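A minimal Python sketch (my addition, with made-up values; the transaction names mirror T5 and T6 above) of such an inconsistent read: T6 moves 10 from A to B while T5 is summing A and B.

A, B = 50, 50

a_read = A        # T5 reads A = 50
A -= 10           # T6: A := A - 10
B += 10           # T6: B := B + 10, and T6 commits
b_read = B        # T5 reads B = 60

print(a_read + b_read)   # 110, although A + B is 100 both before and after T6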
(c) Phantom Phenomenon: Consider an organization where parts are purchased and kept in stock, and parts are withdrawn from stock and used by a number of projects. To check the extent of loss, we want to verify whether the current quantity of some part purchased and received equals the current quantity of that part in stock plus the current quantity in use by the various projects. The phantom problem arises when additional records are added while such a query is running, so the additional information affects the transaction and the query during concurrent processing. This problem can be prevented by using locking: locking such records also prevents the addition of phantom records.
(d) Semantics of Concurrent Transactions: If we take two different transactions in different orders, it is not necessary that the two transactions are commutative. Suppose two transactions.
For example, even though the syntax of the operations is the same, the semantics of these transactions differ during concurrent processing. An important part of concurrency is serial execution, or serializability. When we order some independent transactions in a schedule so that they execute serially, this type of execution is called serial execution. Several problems of concurrent processing are removed by serial execution, by setting or ordering the operations in a particular sequence.
Serializability
A non-serial schedule that is equivalent to some serial execution of transactions is called a serializable schedule. For example, of the three schedules written below, schedule 3 is a serializable schedule and is equivalent to schedule 1 and schedule 2. The purpose of serializable scheduling is to find non-serial schedules that allow the transactions to execute concurrently without interfering with one another, and that therefore produce a database state that could be produced by a serial execution. Note that serializability also removes the problem of inconsistency.
Definition: A given interleaved execution of some transactions is said to be serializable if it produces the same results as some serial execution of the transactions. In serializability, the ordering of read and write operations is important and must be considered before any operation, to avoid any confusion or inconsistency. The serializability test can be explained and solved with the help of a precedence graph.
Solution to these problems
If all schedules in a concurrent environment are restricted to serializable schedules, the result obtained will be consistent with some serial execution of the transactions and will be considered correct. However, testing the serializability of a schedule is not only expensive but sometimes impractical. Thus one of the following concurrency control schemes is applied in a concurrent database environment to ensure that the schedules produced by concurrent transactions are serializable:
(i) Locking
(ii) Timestamp based ordering
(iii) Optimistic scheduling
(iv) Multi-version technique
(i) Locking
From the point of view of locking, a database can be considered as being made up of a set of data items. A lock is a variable associated with each such data item; manipulating the value of the lock is called locking. The value of the lock variable is used in the locking scheme to control the concurrent access and manipulation of the associated data item. Locking is done by a subsystem of the DBMS called the lock manager. There are two types of lock:
(a) Exclusive lock: An exclusive lock is also called an update or write lock. The intention of this mode of locking is to provide exclusive use of a data item to one transaction. If a transaction T locks a data item Q in exclusive mode, no other transaction can access, or even read, Q until the lock is released by T.
(b) Shared lock: A shared lock is also called a read lock. Any number of transactions can concurrently lock and access a data item in shared mode, but none of these transactions can modify the data item. A data item locked in shared mode cannot be locked in exclusive mode until the shared lock is released by all transactions holding it, and a data item locked in exclusive mode cannot be locked in shared mode until the exclusive lock on the data item is released.
Two-Phase Locking: Another method of locking is called two-phase locking. Here, once a lock is released, no additional locks are requested; in other words, the release of locks is delayed until all the locks on all data items required by the transaction have been acquired. It has two phases: a growing phase, in which the number of locks increases from zero to the maximum for the transaction, and a contracting phase, in which the number of locks held decreases from the maximum to zero. Both of these phases are monotonic, i.e. the number of locks only increases in the first phase and only decreases in the second phase. Once a transaction starts releasing locks, it is not allowed to request any further locks. In this way a transaction is obliged to request all locks it may need during its life before it releases any; this controls and lowers the degree of concurrency.
(ii) Timestamp based ordering
In the timestamp-based method, a serial order is created among the concurrent transactions by assigning to each transaction a unique, non-decreasing number. The usual value assigned to each transaction is the system clock value at the start of the transaction; hence the name timestamp ordering. There are two types of timestamp: (a) the write timestamp and (b) the read timestamp. A variation of this scheme used in a distributed environment appends the site of a transaction to the system-wide clock value; this value can then be used in deciding the order in which a conflict between two transactions is resolved. A transaction with a smaller timestamp value is considered to be older than a transaction with a larger timestamp value. A data item X is thus represented by a triple X: (X, Wx, Rx), where:
X: the value of the data item.
Wx: the write timestamp value, the largest timestamp value of any transaction that was allowed to write a value of X.
Rx: the read timestamp value, the largest timestamp value of any transaction that was allowed to read the current value of X.
(iii) Optimistic Scheduling
In the optimistic scheduling scheme, the philosophy is to assume that all data items can be successfully updated at the end of a transaction and to read in the values of data items without any locking. Reading is done when required, and if any data item is found to be inconsistent at the end of a transaction, the transaction is rolled back (using the recovery procedure of the DBMS). In optimistic scheduling each transaction has three phases:
(a) The read phase: This phase starts with the activation of the transaction; all data items are read into local variables, and any modifications are made only to those local copies. This phase ends with the request to commit.
(b) The validation phase: When the transaction asks to commit, a check is made that its modifications do not conflict with those of other transactions; if the check fails, the transaction is rolled back.
(c) The write phase: When the transaction passes the validation phase, its changes are written to the database on secondary storage.
An optimistic scheme does not use locks, so it is deadlock free, even though starvation can still occur.
(iv) Multi-version technique
This is also called the time-domain addressing scheme, which follows the accounting principle of never overwriting data: any change is achieved by entering a compensating entry, e.g. a change to X is achieved by making a new copy, or version, of data item X. Hence the name multi-version. In this way a history of the evolution of the value of a data item is recorded in the database. With the multi-version technique, write operations can occur concurrently, since they do not overwrite each other, and a read operation can read any version.
Q. 6. What do you mean by Database Recovery Techniques?
Ans. Recovery: A computer system is an electromechanical device subject to failures of various types. Recovery is the procedure through which data that has been lost during processing, or due to a failure of any type, can be collected, recalled or accessed again. The types of failures that the computer system is likely to be subjected to include failures of components or subsystems, software failures, power failures, accidents, and natural or man-made disasters. Database recovery techniques are methods of bringing the database back to a valid and original form after it has been
damaged by any failure. The aim of a recovery scheme is to allow database operations to be resumed after a failure with minimum loss of information and at an economically justifiable cost. Recovery schemes can be classified as forward or backward recovery; database systems use the latter scheme to recover from errors.
(a) Forward Error Recovery: In this scheme, when a particular error in the system is detected, the recovery system makes an accurate assessment of the state of the system and then makes appropriate adjustments based on the anticipated result, so that the system becomes error free. The aim of the adjustment is to restore the system so that the effects of the error are cancelled and the system can continue to operate. This scheme is not applicable to unanticipated errors.
(b) Backward Error Recovery: In this scheme no attempt is made to extrapolate to an error-free state. Instead, the system is reset to some previous correct state that is known to be free of any errors. For example, backups taken on floppy disks can be used for backward recovery.
Recovery in a centralized DBMS: In a centralized DBMS, if there is a failure, several methods can be used to recover the data:
(i) Set a transaction marker for transaction identification, through which we can access the data.
(ii) Apply operations on records; these operations are insert, delete and modify.
(iii) Keep a log of the transactions in the system.
In an on-line database system, for example an airline reservation system, there could be hundreds of transactions handled per minute, and the log for this type of database contains a very large volume of information. A scheme called checkpointing is used to limit the volume of log information that has to be handled and processed in the event of a system failure involving the loss of volatile information. The checkpoint scheme is an additional component of the logging scheme described above. A checkpoint operation, performed periodically, copies log information onto stable storage.
Q. 7. What are the various Database Security Issues?
Ans. 1. A computer system operator or system programmer can intentionally bypass the normal security and integrity mechanisms, alter or destroy the data in the database, or make unauthorized copies of sensitive data.
2. An unauthorized user can get access to a secure terminal or to the password of an authorized user and compromise the database. Such a user could also destroy the database files.
3. Authorized users could pass on sensitive information under pressure or for personal gain.
4. System and application programmers could bypass normal security in their programs by directly accessing database files and making changes and copies for illegal use.
5. An unauthorized person can get access to the computer system, physically or through a communication channel, and compromise the database.
Q. 8. What do you mean by the term deadlock?
Ans. Consider the following example: transaction T1 gets an exclusive lock on object A, T2 gets an exclusive lock on B, T1 requests an exclusive lock on B and is queued, and T2 requests an exclusive lock on A and is queued. Now T1 is waiting for T2 to release its lock and T2 is waiting for T1 to release its lock. Such a cycle of transactions waiting for locks to be released is called a deadlock. Clearly, these two transactions will make no further progress. Worse, they hold locks that may be required by other transactions. The DBMS must either prevent or detect (and resolve) such deadlock situations.
Deadlock Prevention
We can prevent deadlocks by giving each transaction a priority and ensuring that lower priority transactions are not allowed to wait for higher priority transactions (or vice versa). One way to assign priorities is to give each transaction a timestamp when it starts up. The lower the timestamp, the higher the transaction's priority; that is, the oldest transaction has the highest priority. If a transaction Ti requests a lock and transaction Tj holds a conflicting lock, the lock manager can use one of the following two policies:
Wait-die: If Ti has higher priority, it is allowed to wait; otherwise it is aborted.
Wound-wait: If Ti has higher priority, abort Tj; otherwise Ti waits.
In the wait-die scheme, lower priority transactions can never wait for higher priority transactions. In the wound-wait scheme, higher priority transactions never wait for lower priority transactions. In either case no deadlock cycle can develop (a short sketch of both policies is given below). A subtle point is that we must also ensure that no transaction is perennially aborted because it never has a sufficiently high priority. (Note that in both schemes, the higher priority transaction is never aborted.) When a transaction is aborted and restarted, it should be given the same timestamp that it had originally. Reissuing timestamps in this way ensures that each transaction will eventually become the oldest transaction, and thus the one with the highest priority, and will get all the locks that it requires. The wait-die scheme is nonpreemptive; only a transaction requesting a lock can be aborted. As a transaction grows older (and its priority increases), it tends to wait for more and more younger transactions. A younger transaction that conflicts with an older transaction may be repeatedly aborted (a disadvantage with respect to wound-wait); on the other hand, a transaction that has all the locks it needs will never be aborted for deadlock reasons (an advantage with respect to wound-wait, which is preemptive).
Deadlock Detection
Deadlocks tend to be rare and typically involve very few transactions. This observation suggests that rather than taking measures to prevent deadlocks, it may be better to detect and resolve deadlocks as they arise. In the detection approach, the DBMS must periodically check for deadlocks. When a transaction Ti is suspended because a lock that it requests cannot be granted, it must wait until all transactions Tj that currently hold conflicting locks release them. The lock manager maintains a structure called a waits-for graph to detect deadlock cycles. The nodes correspond to active transactions, and there is an arc from Ti to Tj if (and only if) Ti is waiting for Tj to release a lock. The lock manager adds edges to this graph when it queues lock requests and removes edges when it grants lock requests. Observe that the waits-for graph describes all active transactions, some of which will eventually abort. If there is an edge from Ti to Tj in the waits-for graph, and both Ti and Tj eventually commit, there will be an edge in the opposite direction (from Tj to Ti) in the precedence graph (which involves only committed transactions). The waits-for graph is periodically checked for cycles, which indicate deadlock. A deadlock is resolved by aborting a transaction that is on a cycle and releasing its locks; this action allows some of the waiting transactions to proceed.
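A minimal Python sketch (my addition) of the two prevention policies. A lower timestamp means an older transaction and hence higher priority; each function reports what happens to the requester Ti when its lock request conflicts with a lock held by Tj:

def wait_die(ts_i, ts_j):
    return "Ti waits" if ts_i < ts_j else "Ti is aborted (dies)"

def wound_wait(ts_i, ts_j):
    return "Tj is aborted (wounded)" if ts_i < ts_j else "Ti waits"

print(wait_die(ts_i=5, ts_j=9))    # Ti is older: Ti waits
print(wait_die(ts_i=9, ts_j=5))    # Ti is younger: Ti is aborted (dies)
print(wound_wait(ts_i=5, ts_j=9))  # Ti is older: Tj is aborted (wounded)
print(wound_wait(ts_i=9, ts_j=5))  # Ti is younger: Ti waits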
Fig. Waits-for graph before and after deadlock
As an alternative to maintaining a waits-for graph, a simplistic way to identify deadlocks is to use a timeout mechanism: if a transaction has been waiting too long for a lock, we can assume (pessimistically) that it is in a deadlock cycle and abort it.
Q. 9. What is serializability of schedules?
Ans. Serializability: A non-serial schedule that is equivalent to some serial execution of transactions is called a serializable schedule. For example, of the three schedules written below, schedule 3 is a serializable schedule and is equivalent to schedule 1 and schedule 2. The purpose of serializable scheduling is to find non-serial schedules that allow the transactions to execute concurrently without interfering with one another, and that therefore produce a database state that could be produced by a serial execution. Note that serializability also removes the problem of inconsistency.
Definition: A given interleaved execution of some transactions is said to be serializable if it produces the same results as some serial execution of the transactions. In serializability, the ordering of read and write operations is important and must be considered before any operation, to avoid any confusion or inconsistency. The serializability test can be explained and solved with the help of a precedence graph.
Q. 10. Define the concept of aggregation. Give two examples of where this concept is useful.
Ans. Selecting data in groups of records is called aggregation. Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession or income.
Q. 11. Compare the shadow-paging recovery scheme with log-based recovery schemes.
Ans. Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state. Consider a transaction Ti that transfers $50 from account A to account B; the goal is either to perform all database modifications made by Ti or none at all. Several output operations may be required for Ti (to output A and B), and a failure may occur after one of these modifications has been made but before all of them are made. To ensure atomicity despite failures, we first output information describing the modifications to stable storage, without modifying the database itself, and only then start modifying the database. We study two approaches:
1. Log-based recovery
2. Shadow paging
We assume (initially) that transactions run serially, that is, one after the other.
A log:
Is a sequence of log records.
Maintains a record of update activities on the database.
Is kept on stable storage.
When transaction Ti starts, it writes the log record <Ti start>. Before Ti executes write(X), it writes the log record <Ti, X, V1, V2>, where V1 is the value of X before the write (for undo) and V2 is the value to be written to X. When Ti commits, it writes the log record <Ti commit>. When Ti aborts, it writes the log record <Ti abort>. We assume for now that log records are written directly to stable storage (that is, they are not buffered). The two approaches using logs are:
1. Deferred database modification
2. Immediate database modification
Deferred Database Modification
This approach records all modifications in the log and defers all the writes until after partial commit. The transaction starts by writing the <T1 start> record to the log. A write(X) operation results in writing a log record <T1, X, V>, where V is the new value for X (the old value is not needed for this scheme). The write is not performed on X at this time, but is deferred. When T1 partially commits, <T1 commit> is written to the log. Finally, the log records are read and used to actually execute the previously deferred writes. During recovery after a crash, a transaction needs to be redone if and only if both <T1 start> and <T1 commit> are in the log. Redoing a transaction Ti (redo Ti) sets the value of all data items updated by the transaction to the new values. Crashes can occur while the transaction is executing the original updates, or while recovery action is being taken.
Example transactions T0 and T1 (T0 executes before T1):
T0: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B)
T1: read(C); C := C - 100; write(C)
Let the original value of A be 1000, that of B be 2000 and that of C be 700. Let us handle the cases when crashes occur at three different instants, as shown in (a), (b), (c).
Log at three instants of time (a), (b), (c):
Crash at (a): no redo actions need to be taken.
Crash at (b): redo(T0) must be performed, since <T0 commit> is present.
Crash at (c): redo(T0) must be performed, followed by redo(T1), since <T0 commit> and <T1 commit> are present.
Immediate Database Modification
This approach allows database updates by an uncommitted transaction, so undoing may be needed. Update log records must have both the old value and the new value, and an update log record must be written before the database item is written. We assume that the log record is output directly to stable storage. Output of updated blocks can take place at any time before or after transaction commit, and the order in which blocks are output can be different from the order in which they are written.
Immediate Database Modification Example
The recovery procedure has two operations instead of one:
undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti.
redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti.
Both operations must be idempotent; that is, even if the operation is executed multiple times the effect is the same as if it were executed once. This is needed because operations may get re-executed during recovery. When recovering after a failure, transaction Ti needs to be undone if the log contains the record <Ti start> but does not contain the record <Ti commit>; transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>. Undo operations are performed first, then redo operations.
Immediate Database Modification Recovery
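A minimal Python sketch (my addition) of this undo-then-redo procedure, replaying a toy log of ("start", T), ("update", T, item, old, new) and ("commit", T) records; the record layout is illustrative, not a real DBMS log format:

def recover(log, db):
    committed = {r[1] for r in log if r[0] == "commit"}
    started = {r[1] for r in log if r[0] == "start"}
    # undo: scan backwards, restoring old values of transactions that never committed
    for r in reversed(log):
        if r[0] == "update" and r[1] in started - committed:
            _, txn, item, old, new = r
            db[item] = old
    # redo: scan forwards, re-applying new values of committed transactions
    for r in log:
        if r[0] == "update" and r[1] in committed:
            _, txn, item, old, new = r
            db[item] = new
    return db

log = [("start", "T0"), ("update", "T0", "A", 1000, 950),
       ("update", "T0", "B", 2000, 2050), ("commit", "T0"),
       ("start", "T1"), ("update", "T1", "C", 700, 600)]       # T1 did not commit
print(recover(log, {"A": 950, "B": 2050, "C": 600}))            # C is restored to 700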
The recovery actions in cases (a), (b) and (c) of the example are:
(a) undo(T0): B is restored to 2000 and A to 1000.
(b) undo(T1) and redo(T0): C is restored to 700, and then A and B are set to 950 and 2050 respectively.
(c) redo(T0) and redo(T1): A and B are set to 950 and 2050 respectively; then C is set to 600.
Checkpoints
Problems with the recovery procedure: searching the entire log is time consuming, and we might unnecessarily redo transactions that have already output their updates to the database. The recovery procedure is streamlined by periodically performing a checkpointing procedure:
Output all log records currently residing in main memory onto stable storage.
Output all modified buffer blocks to the disk.
Write a log record <checkpoint> onto stable storage.
If a system failure occurs, during recovery we need to consider only the most recent transaction Ti that started before the checkpoint, and the transactions that started after Ti.
Log-based recovery
The log, sometimes called the trail or journal, is a history of actions executed by the DBMS. Physically, the log is a file of records stored in stable storage, which is assumed to survive crashes; this durability can be achieved by maintaining two or more copies of the log on different disks (perhaps in different locations), so that the chance of all copies of the log being simultaneously lost is negligibly small. The most recent portion of the log, called the log tail, is kept in main memory and is periodically forced to stable storage. This way, log records and data records are written to disk at the same granularity (pages or sets of pages). Every log record is given a unique id called the log sequence number (LSN). As with any record id, we can fetch a log record with one disk access given the LSN. Further, LSNs should be assigned in monotonically increasing order; this property is required for the ARIES recovery algorithm. If the log is a sequential file, in principle growing indefinitely, the LSN can simply be the address of the first byte of the log record. For recovery purposes, every page in the database contains the LSN of the most recent log record that describes a change to this page; this LSN is called the page LSN. A log record is written for each of the following actions:
1. Updating a page: After modifying the page, an update type record (described later in this section) is appended to the log tail. The page LSN of the page is then set to the LSN of the update log record. (The page must be pinned in the buffer pool while these actions are carried out.)
2. Commit: When a transaction decides to commit, it force-writes a commit type log record containing the transaction id. That is, the log record is appended to the log, and the log tail is written to stable storage, up to and including the commit record. The transaction is considered to have committed at the instant that its commit log record is written to stable storage. (Some additional steps must be taken, e.g., removing the transaction's entry in the transaction table; these follow the writing of the commit log record.)
3. Abort: When a transaction is aborted, an abort type log record containing the transaction id is appended to the log, and Undo is initiated for this transaction.
4. End: As noted above, when a transaction is aborted or committed, some additional actions must be taken beyond writing the abort or commit log record. After all these additional steps are completed, an end type log record containing the transaction id is appended to the log.
5. Undoing an update: When a transaction is rolled back (because the transaction is aborted, or during recovery from a crash), its updates are undone. When the action described by an update log record is undone, a compensation log record, or CLR, is written.
Every log record has certain fields: prevLSN, transID, and type. The set of all log records for a given transaction is maintained as a linked list going back in time, using the prevLSN field; this list must be updated whenever a log record is added. The transID field is the id of the transaction generating the log record, and the type field obviously indicates the type of the log record. Additional fields depend on the type of the log record. We have already mentioned the additional contents of the various log record types, with the exception of the update and compensation log record types, which we describe next.
Update Log Records: The pageID field is the page id of the modified page; the length in bytes and the offset of the change are also included. The before-image is the value of the changed bytes before the change; the after-image is the value after the change. An update log record that contains both before- and after-images can be used to redo the change and to undo it. In certain contexts, which we will not discuss further, we can recognize that the change will never be undone (or, perhaps, redone). A redo-only update log record will contain just the after-image; similarly, an undo-only update record will contain just the before-image.
Log File
Contains information about all updates to the database: transaction records and checkpoint records. Transaction records contain:
Transaction identifier.
Type of log record (transaction start, insert, update, delete, abort, commit).
Identifier of the data item affected by the database action (insert, delete, and update operations).
Before-image of the data item.
After-image of the data item.
Log management information.
A technique often used to perform recovery is the transaction log or journal:
1. It records information about the progress of transactions in a log since the last consistent state.
2. The database therefore knows the state of the database before and after each transaction.
3. Every so often the database is returned to a consistent state, and the log may be truncated to remove committed transactions.
4. When the database is returned to a consistent state, the process is often referred to as checkpointing.
Q. 12. What do you understand by a distributed database?
Ans. Distributed database technology is a recent development within the overall database field. A distributed database can be defined as a system consisting of data whose different parts are under the control of separate DBMSs running on interconnected computers. Each system has autonomous processing capability and serves local applications. Each system participates in the more global
Q. 12. What do you understand by a distributed database?
Ans. Distributed database technology is a recent development within the overall database field. A distributed database can be defined as a system consisting of data with different parts under the control of separate DBMSs running on interconnected computers. Each system has autonomous processing capability and serves local applications; each system also participates in more global applications. A distributed database is capable of handling both local and global transactions. Distributed databases are handled or controlled by a DDBMS (Distributed DBMS).
A distributed database system is also defined as one that is not stored at a single physical location; it is spread across a network of computers that are geographically dispersed and connected by communication links. A distributed database supports sharing of data, is highly available and reliable, and allows incremental growth (addition of data). A query in a distributed database is divided into sub-queries, and all the sub-queries are evaluated in parallel.
For example, consider a banking system in which the customer account database is distributed across the bank's branch offices, such that each individual customer can process his data or record at the local branch. In other words, data is stored at all the locations and any customer can access his data from any location via the communication network; because the customer data is distributed to all the locations, we call it a distributed database. One more advantage of a distributed database system is that it looks like a centralized system to the user. For example, the Indian Railway reservation system uses a distributed database, which can be accessed at any location by any station.
Challenges to Distributed Systems
> Monotonicity: Once something is published in an open distributed system, it cannot be taken back.
> Pluralism: Different subsystems of an open distributed system include heterogeneous, overlapping and possibly conflicting information. There is no central arbiter of truth in open distributed systems.
> Unbounded nondeterminism: Asynchronously, different subsystems can come up and go down, and communication links can come in and go out between subsystems of an open distributed system. Therefore the time that it will take to complete an operation cannot be bounded in advance.
A scalable system is one that can easily be altered to accommodate changes in the number of users, resources and computing entities attached to it. Scalability can be measured in three different dimensions:
Load scalability: A distributed system should make it easy for us to expand and contract its resource pool to accommodate heavier or lighter loads.
Geographic scalability: A geographically scalable system is one that maintains its usefulness and usability, regardless of how far apart its users or resources are.
Administrative scalability: No matter how many different organizations need to share a single distributed system, it should still be easy to use and manage.
Some loss of performance may occur in a system that allows itself to scale in one or more of these dimensions.
A multiprocessor system is simply a computer that has more than one CPU on its motherboard. If the operating system is built to take advantage of this, it can run different processes, or different threads belonging to the same process, on different CPUs. Over the years, many different multiprocessing options have been explored for use in distributed computing. Intel CPUs employ a technology called Hyperthreading that allows more than one thread (usually two) to run on the same CPU. The most recent Sun UltraSPARC T1, Athlon 64 X2 and Intel Pentium D processors feature multiple processor cores to also increase the number of concurrent threads they can run.
A multicomputer system is a system made up of several independent computers interconnected by a telecommunications network. Multicomputer systems can be homogeneous or heterogeneous: A homogeneous distributed system is one where all CPUs are similar and are connected by a single type of
network. They are often used for parallel computing, which is a kind of distributed computing where every computer works on different parts of a single problem. In contrast, a heterogeneous distributed system is one that can be made up of all sorts of different computers, possibly with vastly differing memory sizes, processing power and even basic underlying architecture. Heterogeneous systems are in widespread use today, with many companies adopting this architecture due to the speed with which hardware becomes obsolete and the cost of upgrading a whole system simultaneously.
Various hardware and software architectures are used for distributed computing. At a lower level, it is necessary to interconnect multiple CPUs with some sort of network, regardless of whether that network is printed onto a circuit board or made up of several loosely-coupled devices and cables. At a higher level, it is necessary to interconnect processes running on those CPUs with some sort of communication system.
Client-server: Smart client code contacts the server for data, then formats and displays it to the user. Input at the client is committed back to the server when it represents a permanent change.
3-tier architecture: Three-tier systems move the client intelligence to a middle tier so that stateless clients can be used. This simplifies application deployment. Most web applications are 3-tier.
N-tier architecture: N-tier refers typically to web applications which further forward their requests to other enterprise services. This type of application is the one most responsible for the success of application servers.
Tightly coupled (clustered): refers typically to a set of highly integrated machines that run the same process in parallel, subdividing the task into parts that are handled individually by each machine and then put back together to make the final result.
Peer-to-peer: an architecture where there is no special machine or machines that provide a service or manage the network resources. Instead, all responsibilities are uniformly divided among all machines, known as peers.
Service oriented: the system is organized as a set of highly reusable services that can be offered through standardized interfaces.
Mobile code: based on the architectural principle of moving processing closest to the source of data.
Replicated repository: the repository is replicated among the distributed system to support online/offline processing, provided the lag in data updates is acceptable.
Distributed computing implements a kind of concurrency. The types of distributed computers are based on Flynn's taxonomy of systems: single instruction, single data (SISD); multiple instruction, single data (MISD); single instruction, multiple data (SIMD); and multiple instruction, multiple data (MIMD).
Q. 13. Write short notes on the following: (a) Multiple Granularity (b) Transaction Processing Systems
Ans. (a) Multiple Granularity
Another specialized locking strategy is called multiple-granularity locking, and it allows us to efficiently set locks on objects that contain other objects. For instance, a database contains several files, a file is a collection of pages, and a page is a collection of records. A transaction that expects to access most of the pages in a file should probably set a lock on the entire file, rather than locking individual pages (or records) as and when it needs them.
Doing so reduces the locking overhead considerably. On the other hand, other transactions that require access to parts of the file - even parts that are not needed by this transaction - are blocked. If a transaction accesses relatively few pages of the file, it is better to lock only those pages. Similarly, if a
transaction accesses several records on a page, it should lock the entire page, and if it accesses just a few records, it should lock just those records.
The question to be addressed is how a lock manager can efficiently ensure that a page, for example, is not locked by a transaction while another transaction holds a conflicting lock on the file containing the page (and therefore, implicitly, on the page). The idea is to exploit the hierarchical nature of the contains relationship. A database contains a set of files, each file contains a set of pages, and each page contains a set of records. This containment hierarchy can be thought of as a tree of objects, where each node contains all its children. (The approach can easily be extended to cover hierarchies that are not trees, but we will not discuss this extension.) A lock on a node locks that node and, implicitly, all its descendants. (Note that this interpretation of a lock is very different from B+ tree locking, where locking a node does not lock any descendants implicitly!)
In addition to shared (S) and exclusive (X) locks, multiple-granularity locking protocols also use two new kinds of locks, called intention shared (IS) and intention exclusive (IX) locks. IS locks conflict only with X locks; IX locks conflict with S and X locks. To lock a node in S (respectively X) mode, a transaction must first lock all its ancestors in IS (respectively IX) mode. Thus, if a transaction locks a node in S mode, no other transaction can have locked any ancestor in X mode; similarly, if a transaction locks a node in X mode, no other transaction can have locked any ancestor in S or X mode. This ensures that no other transaction holds a lock on an ancestor that conflicts with the requested S or X lock on the node.
A common situation is that a transaction needs to read an entire file and modify a few of the records in it; that is, it needs an S lock on the file and an IX lock so that it can subsequently lock some of the contained objects in X mode. It is useful to define a new kind of lock called an SIX lock that is logically equivalent to holding an S lock and an IX lock. A transaction can obtain a single SIX lock (which conflicts with any lock that conflicts with either S or IX) instead of an S lock and an IX lock.
A subtle point is that locks must be released in leaf-to-root order for this protocol to work correctly. To see this, consider what happens when a transaction Ti locks all nodes on a path from the root (corresponding to the entire database) to the node corresponding to some page p in IS mode, locks p in S mode, and then releases the lock on the root node. Another transaction Tj could now obtain an X lock on the root. This lock implicitly gives Tj an X lock on page p, which conflicts with the S lock currently held by Ti.
Multiple-granularity locking must be used with 2PL in order to ensure serializability. 2PL dictates when locks can be released. At that time, locks obtained using multiple-granularity locking can be released and must be released in leaf-to-root order.
Finally, there is the question of how to decide what granularity of locking is appropriate for a given transaction. One approach is to begin by obtaining fine granularity locks (e.g., at the record level) and, after the transaction requests a certain number of locks at that granularity, to start obtaining locks at the next higher granularity (e.g., at the page level). This procedure is called lock escalation.
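The sketch below is a rough Python illustration of the compatibility rules described above (IS conflicts only with X; IX conflicts with S and X; SIX behaves like S plus IX) and of the rule that ancestors are locked in intention mode first. The parent table, the lock table and the helper functions are assumptions made for the example, and the sketch deliberately omits details such as releasing partially acquired locks on failure.

# Sketch of multiple-granularity lock compatibility, assuming a simple
# database -> file -> page hierarchy held as parent pointers.
COMPATIBLE = {   # True means the two modes can be held together on one node
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "SIX"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "SIX"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "SIX"): False, ("S",  "X"): False,
    ("SIX","IS"): True,  ("SIX","IX"): False, ("SIX","S"): False, ("SIX","SIX"): False, ("SIX","X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "SIX"): False, ("X",  "X"): False,
}

parent = {"page1": "file1", "file1": "db"}      # containment hierarchy (assumption)
locks = {}                                      # node -> list of (txn, mode)

def can_lock(node, mode):
    return all(COMPATIBLE[(held, mode)] for _, held in locks.get(node, []))

def lock(txn, node, mode):
    # lock all ancestors in the matching intention mode first (root to leaf);
    # a real lock manager would release these again if the request fails
    path, n = [], node
    while n in parent:
        n = parent[n]
        path.append(n)
    intention = "IS" if mode == "S" else "IX"
    for ancestor in reversed(path):
        if not can_lock(ancestor, intention):
            return False
        locks.setdefault(ancestor, []).append((txn, intention))
    if not can_lock(node, mode):
        return False
    locks.setdefault(node, []).append((txn, mode))
    return True

print(lock("T1", "page1", "S"))   # True: IS on db and file1, S on page1
print(lock("T2", "file1", "X"))   # False: T1's IS lock on file1 conflicts with X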
(b) Transaction Processing Systems: Same answer as Question No. 3.
Q. 14. What are the desirable properties of transactions in a database?
Ans. Same answer as Question No. 3.
Q. 15. What are locking techniques for concurrency control? Explain.
Ans. Concurrency control is a method used to ensure that database transactions are executed in a safe manner; it is the process of managing simultaneous operations against a database so that data operations do not interfere with each other in a multi-user environment.
One example of a shared database used in an on-line manner is an airline reservation database that is used by many agents accessing it from their terminals. A database could also be accessed in batch mode concurrently with the on-line mode. Sharing the database for read-only access does not cause any problem, but if one of the transactions running concurrently tries to modify a data item, it could lead to inconsistencies. Further, if more than one transaction is allowed to simultaneously modify a data item, it could lead to incorrect values for the data item, and an inconsistent database will be created. For example, suppose that two ticket agents access the online reservation system simultaneously to see whether a seat is available on a given flight; if both agents make a reservation against the last available seat on that flight, the flight is overbooked and the data is left in an inconsistent state.
Concurrent processing of programs or jobs is similar to multiprogramming, i.e. a number of jobs or programs are processed simultaneously to achieve their independent and different goals according to their own requirements. Even when each individual transaction is correct, concurrent processing can leave the database inconsistent after the transactions complete. In the case of concurrent operation, where a number of transactions are running and using the database, we cannot make any assumption about the order in which the statements belonging to different transactions will be executed. The order in which these statements are executed is called a schedule, and a schedule produced by interleaving the statements of concurrently executing transactions is called a concurrent schedule.
Some problems that occur during scheduling and concurrent processing are:
(a) Lost update problem: Consider two transactions that access the same data item A. Each of these transactions modifies the data item and writes it back. Concurrent processing of the modifications of A can create a problem: the value written by one transaction is lost, overwritten by the other transaction's update.
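A minimal sketch of the lost update problem follows; the variable A and the interleaving of steps are assumptions chosen to mirror the read-modify-write pattern described above.

# Sketch of the lost update problem: two transactions read the same
# value of A, update their local copies, and write back; one update is lost.
A = 100                     # shared data item (assumption: a simple balance)

t1_local = A                # T1: read(A) -> 100
t2_local = A                # T2: read(A) -> 100

t1_local = t1_local - 50    # T1 withdraws 50
t2_local = t2_local + 30    # T2 deposits 30

A = t1_local                # T1: write(A) -> A = 50
A = t2_local                # T2: write(A) -> A = 130, T1's update is lost

print(A)                    # 130, although the correct final value is 80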
(b) Inconsistent read problem: The lost update problem was caused by concurrent modification of the same data item. However, concurrency can also cause problems when only one transaction modifies a given set of data while that set of data is being used by another transaction. For example, suppose two transactions T5 and T6 occur in a schedule, and A and B are data items with integer values. If both transactions are processed concurrently, one may be reading the data while the other is modifying it, so the reading transaction can observe a mixture of old and new values and it becomes unclear which of the values it read are correct.
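The following sketch shows such an inconsistent read: T6 transfers an amount between A and B while T5 sums them, so T5 observes a total that no consistent state ever had. The values and the interleaving are illustrative assumptions.

# Sketch of the inconsistent read problem: T5 computes A + B while
# T6 is moving 100 from A to B, so T5 sees a total that never existed.
A, B = 500, 500

t5_sum = A              # T5: read(A) -> 500
A = A - 100             # T6: withdraw 100 from A
B = B + 100             # T6: deposit 100 into B
t5_sum += B             # T5: read(B) -> 600

print(t5_sum)           # 1100, although the consistent total is always 1000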
(c) The phantom phenomenon: To illustrate the phantom phenomenon, consider an organization where parts are purchased and kept in stock. The parts are withdrawn from stock and used by a number of projects.
To check the extent of loss, we want to verify whether the current quantity of some part purchased and received is equal to the current quantity of that part in stock plus the current quantity of that part used by the various projects. The phantom problem arises when additional records are inserted while such a query is running concurrently: the newly added (phantom) records may or may not be reflected in the result. This problem can be prevented by using locking; locking records of this type also prevents the addition of phantom records.
(d) Semantics of concurrent transactions: If two different transactions are executed in different orders, it is not guaranteed that the results are the same; that is, transactions are not always commutative. For example, the two transactions A = (A + 10) + 20 and A = (A + 20) + 10 give the same result, but Salary = (Salary + 1000) * 1.1 and Salary = (Salary * 1.1) + 1000 do not. The syntax of the operations looks the same, but the semantics of these transactions differ, so their ordering during concurrent processing matters.
An important concept in concurrency is serial execution, or serializability. When independent transactions in a schedule are ordered such that they execute one after another, the execution is called serial execution. Some problems of concurrent processing are removed by serial execution, by ordering the operations in a particular sequence.
Serializability: A non-serial schedule that is equivalent to some serial execution of the transactions is called a serializable schedule. For example, a non-serial schedule (say schedule-3) is a serializable schedule if it is equivalent to a serial schedule such as schedule-1 or schedule-2. The purpose of serializable scheduling is to find non-serial schedules that allow the transactions to execute concurrently without interfering with one another, and therefore produce a database state that could be produced by a serial execution. Note that serializability also removes the problem of inconsistency.
Definition: A given interleaved execution of some transactions is said to be serializable if it produces the same results as some serial execution of the transactions. In serializability, the ordering of the read and write operations is what matters, since it determines whether conflicts and inconsistencies can arise. A serializability test can be carried out with the help of a precedence graph: the schedule is conflict serializable exactly when the graph contains no cycle.
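As a hedged sketch of the precedence-graph test, the code below builds the graph from a small schedule of read/write steps and reports the schedule as serializable only if the graph is acyclic. The example schedule, the tuple format and the helper functions are assumptions for illustration.

# Sketch of a conflict-serializability test: build a precedence graph from a
# schedule of (transaction, operation, item) steps and look for a cycle.
schedule = [            # example schedule (assumption): R = read, W = write
    ("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A"),
]

def precedence_edges(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # conflicting operations: same item, different transactions, at least one write
            if x == y and ti != tj and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visited, stack = set(), set()
    def dfs(node):
        visited.add(node); stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in stack or (nxt not in visited and dfs(nxt)):
                return True
        stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)

edges = precedence_edges(schedule)
print(edges)                                   # {('T1', 'T2'), ('T2', 'T1')}
print("serializable:", not has_cycle(edges))   # False: the graph has a cycle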
Solution to these problems: If all schedules in a concurrent environment are restricted to serializable schedules, the result obtained will be consistent with some serial execution of the transactions and will be considered correct. However, testing the serializability of a schedule is not only expensive but sometimes impractical. Thus one of the following concurrency control schemes is applied in a concurrent database environment to ensure that the schedules produced by concurrent transactions are serializable:
(i) Locking scheme
(ii) Timestamp-based ordering
(iii) Optimistic scheduling
(iv) Multiversion technique
(i) Locking: From the point of view of locking, a database can be considered as being made up of a set of data items. A lock is a variable associated with each such data item, and manipulating the value of the lock is called locking. The value of the lock variable is used in the locking scheme to control concurrent access to and manipulation of the associated data item. Locking is done by a subsystem of the DBMS called the lock manager. There are two types of lock:
(a) Exclusive lock: An exclusive lock is also called an update or write lock. The intention of this mode of locking is to provide exclusive use of a data item to one transaction. If a transaction T locks a data item Q in exclusive mode, no other transaction can access Q, not even to read it, until the lock is released by T.
(b) Shared lock: A shared lock is also called a read lock. Any number of transactions can concurrently lock and access a data item in shared mode, but none of these transactions can modify the data item. A data item locked in shared mode cannot be locked in exclusive mode until the shared lock is released by all transactions holding the lock, and a data item locked in exclusive mode cannot be locked in shared mode until the exclusive lock is released.
Two-phase locking: Another method of locking is called two-phase locking (2PL). In this method, once a lock is released, no additional locks may be requested; in other words, the release of locks is delayed until all the locks on all data items required by the transaction have been acquired. It has two phases: a growing phase, in which the number of locks held increases from zero to the maximum for the transaction, and a contracting phase, in which the number of locks held decreases from the maximum to zero. Both phases are monotonic: the number of locks only increases in the first phase and only decreases in the second. Once a transaction starts releasing locks, it is not allowed to request any further locks. In this way a transaction is obliged to request all the locks it may need during its life before it releases any, which controls, and lowers, the degree of concurrency.
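The following is a minimal sketch of the two-phase locking discipline with shared and exclusive locks; the Transaction class, the in-memory lock table and the error raised on a late lock request are assumptions made for illustration, not a full lock manager (for example, it rejects conflicting requests instead of blocking).

# Sketch of two-phase locking: a transaction acquires all the locks it needs
# (growing phase) and, once it starts releasing them, requests no more
# (shrinking phase).
class LockError(Exception):
    pass

lock_table = {}          # item -> (mode, set of holding transactions)

class Transaction:
    def __init__(self, name):
        self.name = name
        self.shrinking = False   # becomes True once the first lock is released

    def lock(self, item, mode):
        if self.shrinking:
            raise LockError("2PL violation: cannot lock after releasing a lock")
        current = lock_table.get(item)
        if current is None:
            lock_table[item] = (mode, {self.name})
        elif current[0] == "S" and mode == "S":
            current[1].add(self.name)          # shared locks are compatible
        else:
            raise LockError(item + " is locked in a conflicting mode")

    def unlock(self, item):
        self.shrinking = True                  # entering the shrinking phase
        mode, holders = lock_table[item]
        holders.discard(self.name)
        if not holders:
            del lock_table[item]

t = Transaction("T1")
t.lock("A", "X")
t.lock("B", "S")
t.unlock("A")
try:
    t.lock("C", "S")                           # violates two-phase locking
except LockError as e:
    print(e)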
(ii) Timestamp-based ordering: In the timestamp-based method, a serial order is created among the concurrent transactions by assigning to each transaction a unique, non-decreasing number. The value usually assigned to each transaction is the system clock value at the start of the transaction; hence the name timestamp ordering. A variation of this scheme, used in a distributed environment, appends the site identifier of the transaction to the system-wide clock value; this value can then be used in deciding the order in which a conflict between two transactions is resolved. A transaction with a smaller timestamp value is considered to be older than a transaction with a larger timestamp value. There are two types of timestamp: (a) the write timestamp and (b) the read timestamp. A data item X is thus represented by a triple {X, Wx, Rx}, where:
X : the value of the data item.
Wx : the write timestamp value, the largest timestamp value of any transaction that was allowed to write a value of X.
Rx : the read timestamp value, the largest timestamp value of any transaction that was allowed to read the current value of X.
(iii) Optimistic scheduling: In the optimistic scheduling scheme, the philosophy is to assume that all data items can be successfully updated at the end of a transaction, and to read in the values of data items without any locking. Reading is done when required, and if any data item is found to be inconsistent at the end of a transaction, the transaction is rolled back (using the recovery procedures of the DBMS). In optimistic scheduling, each transaction has three phases:
(a) The read phase: This phase starts with the activation of the transaction. All data items are read into local variables, and any modifications are made only to these local copies. The phase ends when the transaction asks to commit.
(b) The validation phase: When the transaction asks to commit, the system checks whether the data items it read or modified conflict with other concurrent transactions; if the check fails, the transaction is rolled back.
(c) The write phase: If the transaction passes the validation phase, the updates held in its local copies are written to the database on secondary storage.
An optimistic scheme does not use locks and so it is deadlock-free, even though starvation can still occur.
(iv) Multiversion technique: This is also called the time-domain addressing scheme, and it follows the accounting principle of never overwriting an entry. Any change is made by entering a compensating record: an update to a data item X is made by creating a new copy, or version, of X, which is why the technique is called multiversion. In this way a history of the evolution of the value of the data item is recorded in the database. With the multiversion technique, write operations can occur concurrently, since they do not overwrite each other, and a read operation can read any version.
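A small sketch of the basic timestamp-ordering rules on the {X, Wx, Rx} triple described under (ii) is given below. Rejecting (and, in a real system, rolling back and restarting) the requesting transaction is one common policy; the item values and timestamps are assumptions for the example.

# Sketch of basic timestamp ordering using the {value, Wx, Rx} triple.
items = {"X": {"value": 10, "Wx": 0, "Rx": 0}}   # Wx/Rx: largest write/read TS so far

def read(ts, name):
    item = items[name]
    if ts < item["Wx"]:
        return None, "reject: a younger transaction already wrote this item"
    item["Rx"] = max(item["Rx"], ts)
    return item["value"], "ok"

def write(ts, name, value):
    item = items[name]
    if ts < item["Rx"] or ts < item["Wx"]:
        return "reject: a younger transaction already read or wrote this item"
    item["value"], item["Wx"] = value, ts
    return "ok"

print(write(5, "X", 42))    # ok: TS 5 is newer than Rx = 0 and Wx = 0
print(read(3, "X"))         # rejected: transaction 3 is older than the writer (TS 5)
print(read(7, "X"))         # ok, sets Rx = 7
print(write(6, "X", 99))    # rejected: the item was read by a younger transaction (TS 7)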
Q. 16. Describe the usefulness of granting privileges to the users.
Ans. You use GRANT to assign roles or system privileges to roles or users. The same command works whether you are assigning these roles or system privileges to an individual user or to a role that in turn can be assigned to many users.
The GRANT command takes the following syntax:
GRANT role or system privilege [, role or system privilege] TO user or role or PUBLIC [, user or role] [WITH ADMIN OPTION]
The GRANT command can take any number of system privileges and roles and assign them to any number of users or roles. By specifying that you want to grant a role or system privilege to PUBLIC, you are specifying that you want that role or privilege to be granted to all users in the system.
The REVOKE command is just the opposite of the GRANT command; it takes a role or system privilege away from a user or role:
REVOKE role or system privilege [, role or system privilege] FROM user or role or PUBLIC [, user or role]
Q. 18. Define the term Generalization.
Ans. Generalization. Generalization can be seen as a simplification of data, i.e. bringing the data from an unnormalized form to a normalized form. It is a bottom-up design process that combines a number of entity sets sharing the same features into a higher-level entity set. Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way, and the terms are often used interchangeably.
Q. 19. What is cascading rollback?
Ans. A cascading rollback occurs in database systems when a transaction (T1) causes a failure and a rollback must be performed. Other transactions dependent on T1's actions must also be rolled back due to T1's failure, thus causing a cascading effect: one transaction's failure causes many to fail.
Q. 20. Explain various recovery techniques based on deferred update.
Ans. Refer to Q. No. 11.
Q. 21. What is a log record? What fields does it contain? How is it used for database recovery?
Ans. Refer to Q. No. 11.
Q. 22. What does a role name signify? Explain with example.
Ans. Oracle provides for easy and controlled privilege management through roles. Roles are named groups of related privileges that you grant to users or other roles. Roles are designed to ease the administration of end-user system and object privileges. However, roles are not meant to be used by application developers, because the privileges needed to access objects within stored programmatic constructs must be granted directly. The following properties of roles allow for easier privilege management within a database:
Reduced privilege administration: Rather than explicitly granting the same set of privileges to several users, you can grant the privileges for a group of related users to a role, and then only the role needs to be granted to each member of the group.
Dynamic privilege management: If the privileges of a group must change, only the privileges of the role need to be modified. The security domains of all users granted the group's role automatically reflect the changes made to the role.
Selective availability of privileges: You can selectively enable or disable the roles granted to a user. This allows specific control of a user's privileges in any given situation.
Application awareness: Because the data dictionary records which roles exist, you can design database applications to query the dictionary and automatically enable (and disable) selective roles when a user attempts to execute the application via a given username.
Application-specific security: You can protect role use with a password. Applications can be created specifically to enable a role when supplied the correct password. Users cannot enable the role if they do not know the password.
Data Masking
This Chapter is Written & Contributed By: Er. Mansimran Singh, B.E., CSE
Q.1 What is data masking?
Sol: Data masking is the process of de-identifying (masking) specific data elements within data stores. It ensures that sensitive data is replaced with realistic but not real data. The result is that sensitive information is not available to users outside of authorized environments. Data masking is done while provisioning non-production environments, so that copies created to support test, analysis, and development processes do not expose sensitive information. Without masking, these copies risk exposing sensitive data. De-identification or masking is the process of replacing sensitive data with randomly generated data that is valid and functional for application processing, but is not associated with the original record. dgmasker preserves the application and relational integrity of the data set, and transforms it to meet application business rules. In data masking, the format of the data remains the same; only the values are changed. The data may be altered in a number of ways, including encryption, character shuffling, and character or word substitution. Whatever method is chosen, the values must be changed in some way that makes detection or reverse engineering impossible.
Data masking is not the same thing as restricting the visibility of information in production databases from people who are not authorized to see it. In that situation, the data is actually present in the database and is simply not visible to the unauthorized. There are many good and justifiable reasons for taking this approach in a production system, but adopting a "data is present but hidden" approach to the protection of data in test and development databases is a recipe for trouble. The reason is that strict controls are in place in production databases, and these can present a carefully managed view. Test and development systems are different: typically they are an environment in which access is much wider, information is visible to more people, and those people often have greater privileges and low-level access. From a data visibility standpoint, a test or dev system in which the data is present but hidden is a system which sooner or later will expose its data. In general, a reasonable security assumption is that the more people who have access to the information, the greater the inherent risk of the data being compromised. Modifying the existing data in such a way as to remove all identifying, distinguishing characteristics, while keeping it usable as a test system, can provide a valuable layer of security for test and development databases.
Q.2 Why Mask Data?
Sol: Legal Requirements
The regulatory environment surrounding the duties and obligations of a data holder to protect the information they maintain is becoming increasingly rigorous in just about every legal jurisdiction. It is a pretty safe assumption that the standards for the security and maintenance of data will become increasingly strict in the future.
Loss of Confidence and Public Relations Disasters
It can reasonably be said, in most locations, that if a data escape happens at your organization, the formal legal sanctions applied by governmental bodies are not the only problem you will be facing. Possibly it may not even be the biggest of your immediate worries. Inappropriate data exposure, whether accidental or malicious, can have devastating consequences. Often the costs of such an event, both actual
and unquantifiable, can far exceed any fines levied for the violation of the rules. For example, what will it cost the organization if potential customers are not willing to provide sensitive information to your company because they read an article about a data escape in the newspaper? Dealing with the public relations aftermath of seeing the company's name in the press will not be cheap. It also does not take much imagination to realize that senior management are not going to be happy about having to give a press conference to reassure the public. The public relations costs of a data escape usually far exceed the sanctions levied by governmental organizations.
Malicious Exposure
Most people think the major risk to the information they hold is external entities (and organized syndicates) out to break in and steal the data. The assumption then follows that protecting the network and firewalls is the appropriate and sufficient response. There is no denying that such protection is necessary; however, it has been shown that in many cases the data is stolen by malicious insiders who have been granted access to the data. No firewall can keep an insider from acquiring data under such circumstances. However, by reducing the number of databases with unmasked information, the overall risk of exposure is mitigated. External hackers, if they get through the network security, will have far fewer usable targets, and a far greater proportion of the inside personnel will have no access to the real data.
Accidental Exposure
The risk of accidental exposure of information is often neglected when considering the security risks associated with real test data. Often it is thought that there is no point in masking the test data because everybody has access to production anyway. Not so; the risks associated with an accidental exposure of the data remain. Often just masking the most sensitive information (credit card numbers, customer email addresses, etc.) is enough to somewhat mitigate the damage associated with accidental exposure, and the masked databases remain just as functional.
Data Masking Architectures
Fundamentally, there are two basic types of architectures used in the design of data masking software.
On-the-Fly, Server-to-Server, Data Masking Architectures
In this architecture the data does not exist in the target database prior to masking. The anonymization rules are applied as part of the process of moving the data from the source to the target. Often this type of masking is integrated into the cloning process which creates the target database.
Q.3: What are the Advantages & Disadvantages of Data Masking?
Sol: Advantages
The data is never present in an unmasked form in the target database.
Disadvantages
Any errors in the process necessarily interrupt the transfer of the data.
The ability to mask data after the transfer has completed can be troublesome. This might happen in cases where the masked target database has been built and it is subsequently decided that a specific column of information really needs to be masked. In this case, the masking software needs to have in-situ masking capabilities (see below) or the entire clone and masking operation will need to be repeated.
The ability to use alternative, perhaps preferred, tools to perform the cloning operation is impacted.
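As a rough sketch of the on-the-fly, server-to-server architecture discussed above, the code below masks rows while they are copied from a source to a target, so unmasked data never lands in the target. The row format, the substitution dataset and the masking rules are assumptions made for illustration only.

# Sketch of on-the-fly (server-to-server) masking: rows are masked while they
# are copied, so the target never holds unmasked data. Plain lists stand in
# for the source and target database connections.
import random

SURNAMES = ["Lee", "Patel", "Garcia", "Khan", "Smith"]   # substitution dataset (assumption)

source_rows = [
    {"id": 1, "surname": "Chandra", "card": "4346645400205379"},
    {"id": 2, "surname": "Osborne", "card": "4493923873155787"},
]
target_rows = []

def mask(row):
    masked = dict(row)
    masked["surname"] = random.choice(SURNAMES)                    # substitution
    masked["card"] = row["card"][:4] + "X" * 8 + row["card"][-4:]  # masking out
    return masked

for row in source_rows:          # the copy and the masking happen in one pass
    target_rows.append(mask(row))

print(target_rows[0])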
Q.4: Explain "In-Situ Data Masking Architectures" Sol: In this style, the clone of the database to be masked is created by other means and the software simply operates on the cloned database. There are two types of in-situ masking: masking rules which are executed and controlled as a standalone entity on the target and data masking rules which are controlled by a different system which then connects to the target and controls the execution of the rules. Advantages It is possible to apply additional masking operations at any time. The masking operations are separate from the copy process so existing cloning solutions can be used and the data masking rules are possibly simpler to maintain. Disadvantages The data is present in an unmasked state in the target database and hence additional security measures will be required during that time. Q.5: Explain Data Masking Technique. Sol: Substitution This technique consists of randomly replacing the contents of a column of data with information that looks similar but is completely unrelated to the real details. For example, the surnames in a customer database could be sanitized by replacing the real last names with surnames drawn from a largish random list. Substitution is very effective in terms of preserving the look and feel of the existing data. The downside is that a largish store of substitutable information must be available for each column to be substituted. For example, to sanitize surnames by substitution, a list of random last names must be available. Then to sanitize telephone numbers, a list of phone numbers must be available. Frequently, the ability to generate known invalid data (credit card numbers that will pass the checksum tests but never work) is a nice-to-have feature. Substitution data can sometimes be very hard to find in large quantities - however any data masking software should contain datasets of commonly required items. When evaluating data masking software the size, scope and variety of the datasets should be considered. Another useful feature to look for is the ability to build your own custom datasets and add them for use in the masking rules. Shuffling Shuffling is similar to substitution except that the substitution data is derived from the column itself. Essentially the data in a column is randomly moved between rows until there is no longer any reasonable correlation with the remaining information in the row. There is a certain danger in the shuffling technique. It does not prevent people from asking questions like I wonder if so-and-so is on the supplier list? In other words, the original data is still present and sometimes meaningful questions can still be asked of it. Another consideration is the algorithm used to shuffle the data. If the shuffling method can be determined, then the data can be easily un-shuffled. For example, if the shuffle algorithm simply ran down the table swapping the column data in between every group of two rows it would not take much work from an interested party to revert things to their unshuffled state. Shuffling is rarely effective when used on small amounts of data. For example, if there are only 5 rows in a table it probably will not be too difficult to figure out which of the shuffled data really belongs to which
row. On the other hand, if a column of numeric data is shuffled, the sum and average of the column still work out to the same amount. This can sometimes be useful. Shuffle rules are best used on large tables and leave the look and feel of the data intact. They are fast, but great care must be taken to use a sophisticated algorithm to randomize the shuffling of the rows.
Number and Date Variance
The number variance technique is useful on numeric or date data. Simply put, the algorithm involves modifying each number or date value in a column by some random percentage of its real value. This technique has the nice advantage of providing a reasonable disguise for the data while still keeping the range and distribution of values in the column within existing limits. For example, a column of salary details might have a random variance of 10% placed on it: some values would be higher, some lower, but all would be not too far from their original range. Date fields are also a good candidate for variance techniques. Birth dates, for example, could be varied within an arbitrary range of 120 days, which effectively disguises the personally identifiable information while still preserving the distribution. The variance technique can prevent attempts to discover true records using known date data, or the exposure of sensitive numeric or date data.
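A minimal sketch of the number and date variance technique follows; the 10% and 120-day ranges echo the examples above, and the helper names and sample values are assumptions.

# Sketch of the number and date variance technique: each value is moved by a
# random amount within a fixed percentage or day range, keeping the overall
# distribution realistic.
import random
from datetime import date, timedelta

def vary_number(value, pct=0.10):
    # move the value by up to +/- pct of itself
    return round(value * (1 + random.uniform(-pct, pct)), 2)

def vary_date(d, days=120):
    # move the date by up to +/- the given number of days
    return d + timedelta(days=random.randint(-days, days))

salaries = [42000, 55000, 61000]
print([vary_number(s) for s in salaries])       # each within 10% of the original
print(vary_date(date(1980, 6, 15)))             # a birth date shifted by <= 120 days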
Encryption
This technique offers the option of leaving the data in place and visible to those with the appropriate key, while remaining effectively useless to anybody without the key. This would seem to be a very good option, yet, for anonymous test databases, it is one of the least useful techniques. The advantage of having the real data available to anybody with the key is actually a major disadvantage in a test or development database. The optional visibility provides no major advantage in a test system, and the encryption password only needs to escape once and all of the data is compromised. Of course, you can change the key and regenerate the test instances, but outsourced, stored or saved copies of the data are all still available under the old password. Encryption also destroys the formatting and the look and feel of the data. Encrypted data rarely looks meaningful; in fact, it usually looks like binary data. This sometimes leads to character set issues when manipulating encrypted varchar fields. Certain types of encryption impose constraints on the data format as well. In effect, this means that the fields must be extended with a suitable padding character which must then be stripped off at decryption time. The strength of the encryption is also an issue. Some encryption is more secure than others. According to the experts, most encryption systems can be broken; it is just a matter of time and effort. In other words, not very much will keep the national security agencies of largish countries from reading your files should they choose to do so. This may not be a big worry if the requirement is to protect proprietary business information. Never, ever, use a simplistic encryption scheme designed by amateurs. For example, one in which the letter A is replaced by X and the letter B by M, etc., is trivially easy to decrypt based on letter frequency probabilities. In fact, first-year computer science students are often asked to write such programs as assignments.
Nulling Out/Truncating
Simply deleting a column of data by replacing it with NULL values is an effective way of ensuring that it is not inappropriately visible in test environments. Unfortunately, it is also one of the least desirable options from a test database standpoint. Usually the test teams need to work on the data, or at least a realistic approximation of it.
For example, it is very hard to write and test customer account maintenance forms if the customer name, address and contact details are all NULL values. NULLing or truncating data is useful in circumstances where the data is simply not required, but it is rarely useful as the entire data sanitization strategy.
Masking Out Data
Masking data, besides being the generic term for the process of data anonymization, means replacing certain fields with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front-end screens and reports. For example, a column of credit card numbers might look like:
4346 6454 0020 5379
4493 9238 7315 5787
4297 8296 7496 8724
and after the masking operation the information would appear as:
4346 XXXX XXXX 5379
4493 XXXX XXXX 5787
4297 XXXX XXXX 8724
The masking characters effectively remove much of the sensitive content from the record while still preserving the look and feel. Take care to ensure that enough of the data is masked to preserve security. It would not be hard to regenerate the original credit card number from a masking operation such as 4297 8296 7496 87XX, since the numbers are generated with a specific and well-known checksum algorithm. Also, care must be taken not to mask out potentially required information. A masking operation such as XXXX XXXX XXXX 5379 would strip the card issuer details from the credit card number. This may, or may not, be desirable. If the data is in a specific, invariable format, then masking out is a powerful and fast option. If numerous special cases must be dealt with, then masking can be slow, extremely complex to administer, and can potentially leave some data items inappropriately masked.
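To illustrate the masking-out technique on card numbers formatted as above, here is a minimal sketch that keeps the first and last four digits and the 4-4-4-4 grouping; the function name and the sample values are assumptions for the example.

# Sketch of the masking-out technique on formatted card numbers: the middle
# eight digits are replaced with X while the grouping and the first and last
# four digits are preserved, matching the example above.
def mask_card(card):
    digits = card.replace(" ", "")
    masked = digits[:4] + "X" * 8 + digits[-4:]
    # restore the original 4-4-4-4 grouping
    return " ".join(masked[i:i + 4] for i in range(0, 16, 4))

for card in ["4346 6454 0020 5379", "4493 9238 7315 5787", "4297 8296 7496 8724"]:
    print(mask_card(card))
# 4346 XXXX XXXX 5379
# 4493 XXXX XXXX 5787
# 4297 XXXX XXXX 8724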