10 File Organization in DBMS

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 15

What is File Organization in DBMS?

The File is a set of documents. We can get to the records by using the primary key. The sort of file organization that was employed for a
given set of records can determine the type and frequency of access.
A logical relationship between distinct records is referred to as file organization. This method specifies how disc blocks are mapped to file
records. The word “file organization” refers to the method by which records are organized into blocks and then placed on a storage media.
The first method of mapping a database to a file is to employ many files, each containing only one fixed-length entry. Another option is to
arrange our files so that we can store data of various lengths. Fixed-length record files are easier to implement than variable-length record
files.
Objectives of File Organization in Data Structure
Here are some ways in which file organization helps:
 It has an ideal record selection, which means records can be selected as quickly as feasible.
 Insert, delete, and update transactions on records should be simple and rapid.
 Duplicate records cannot be created by inserting, updating, or deleting records.
 Records should be stored efficiently to save money on storage.
Types of File Organization
There are a variety of ways to organize your files. On the basis of access or selection, these methods offer advantages and disadvantages.
The programmers choose the finest file organization strategy for their needs when it comes to file organization.

The following are the different types of file organization:


Sequential File Organization
This is the most straightforward technique of file arrangement. Files are saved in this method in sequential order. Read more on Sequential
File Organization here.
Heap File Organization
It is the most fundamental and basic type of organizational structure. It’s based on data chunks. The records are inserted at the end of the
file in the heap file organization. The ordering and sorting of records are not required when the entries are added. Read more on Heap File
Organization here.
Hash File Organization
The computation of the hash function on some of the fields of the records is used by Hash File Organization. The output of the hash function
defines the position of the disc block where the records will be stored. Read more on Hash File Organization here.
B+ File Organization
The advanced way of an indexed sequential access mechanism is the B+ tree file organization. In File, records are stored in a tree-like
structure. Read more on B+ File Organization here.
Indexed Sequential Access Method or ISAM
ISAM (Advanced Sequential File Organizing Approach) is an advanced sequential file organization method. Records are stored in the file
using the primary key in this way. For each primary key, an index value is created and mapped to the record. The address of the record in
any file is contained in this index. Read more on ISAM here.
Cluster File Organization
Clusters are created when two or more records are saved in the same file. There will be two or more than two tables in the very same data
block in these files, and key attributes that are used to link these tables together will only be kept once. Read more on Cluster File
Organization here.
Index data structure
Indexing in DBMS
o Indexing is used to optimize the performance of a database by minimizing the number of disk accesses required
when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a database table quickly.
Index structure:
Indexes can be created using some database columns.

o The first column of the database is the search key that contains a copy of the primary key or candidate key of the
table. The values of the primary key are stored in sorted order so that the corresponding data can be accessed
easily.
o The second column of the database is the data reference. It contains a set of pointers holding the address of the
disk block where the value of the particular key can be found.
AD

Indexing Methods

Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known as ordered indices.
Example: Suppose we have an employee table with thousands of record and each of which is 10 bytes long. If their IDs start
with 1, 2, 3....and so on and we have to search student with ID-543.
o In the case of a database with no index, we have to search the disk block from starting till it reaches 543. The DBMS
will read the record after reading 543*10=5430 bytes.
o In the case of an index, we will search using indexes and the DBMS will read the record after reading 542*2= 1084
bytes which are very less compared to the previous case.
AD

Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as primary indexing. These
primary keys are unique to each record and contain 1:1 relation between the records.
o As primary keys are stored in sorted order, the performance of the searching operation is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It makes searching faster.
o In this, the number of records in the index table is same as the number of records in the main table.
o It needs more space to store index record itself. The index records have the search key and a pointer to the actual
record on the disk.
Sparse index
o In the data file, index record appears only for a few items. Each item points to a block.
o In this, instead of pointing to each record in the main table, the index points to the records in the main table in a
gap.

Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary key
columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the unique value and create
index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for these group.
Example: suppose a company contains several employees in each department. Suppose we use a clustering index, where all
employees which belong to the same Dept_ID are considered within a single cluster, and index pointers point to the cluster
as a whole. Here Dept_Id is a non-unique key.

The previous schema is little confusing because one disk block is shared by records which belong to the different cluster. If
we use separate disk block for separate clusters, then it is called better technique.
Secondary Index
In the sparse indexing, as the size of the table grows, the size of mapping also grows. These mappings are usually kept in the
primary memory so that address fetch should be faster. Then the secondary memory searches the actual data based on the
address got from mapping. If the mapping size grows then fetching the address itself becomes slower. In this case, the
sparse index will not be efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In this method, the huge range
for the columns is selected initially so that the mapping size of the first level becomes small. Then each range is further
divided into smaller ranges. The mapping of the first level is stored in the primary memory, so that address fetch is faster.
The mapping of the second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, then it will search the highest entry which is smaller than
or equal to 111 in the first level index. It will get 100 at this level.
o Then in the second index level, again it does max (111) <= 111 and gets 110. Now using the address 110, it goes to
the data block and starts searching each record till it gets 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also done in the same manner.
Comparison of file organization
Tree structured indexing
Sequential File Organization
This method is the easiest method for file organization. In this method, files are stored sequentially. This method can be
implemented in two ways:

1. Pile File Method:


o It is a quite simple method. In this method, we store the record in a sequence, i.e., one after another. Here, the
record will be inserted in the order in which they are inserted into tables.
o In case of updating or deleting of any record, the record will be searched in the memory blocks. When it is found,
then it will be marked for deleting, and the new record is inserted.

Insertion of the new record:


Suppose we have four records R1, R3 and so on upto R9 and R8 in a sequence. Hence, records are nothing but a row in the
table. Suppose we want to insert a new record R2 in the sequence, then it will be placed at the end of the file. Here, records
are nothing but a row in any table.

2. Sorted File Method:


o In this method, the new record is always inserted at the file's end, and then it will sort the sequence in ascending or
descending order. Sorting of records is based on any primary key or any other key.
o In the case of modification of any record, it will update the record and then sort the file, and lastly, the updated
record is placed in the right place.
AD

AD

Insertion of the new record:


Suppose there is a preexisting sorted sequence of four records R1, R3 and so on upto R6 and R7. Suppose a new record R2
has to be inserted in the sequence, then it will be inserted at the end of the file, and then it will sort the sequence.

Pros of sequential file organization


o It contains a fast and efficient method for the huge amount of data.
o In this method, files can be easily stored in cheaper storage mechanism like magnetic tapes.
o It is simple in design. It requires no much effort to store the data.
o This method is used when most of the records have to be accessed like grade calculation of a student, generating
the salary slip, etc.
o This method is used for report generation or statistical calculations.
Cons of sequential file organization
o It will waste time as we cannot jump on a particular record that is required but we have to move sequentially which
takes our time.
o Sorted file method takes more time and space for sorting the records.

Indexed sequential access method (ISAM)


ISAM method is an advanced sequential file organization. In this method, records are stored in the file using the primary key.
An index value is generated for each primary key and mapped with the record. This index contains the address of the record
in the file.

If any record has to be retrieved based on its index value, then the address of the data block is fetched and the record is
retrieved from the memory.
Pros of ISAM:
o In this method, each record has the address of its data block, searching a record in a huge database is quick and
easy.
o This method supports range retrieval and partial retrieval of records. Since the index is based on the primary key
values, we can retrieve the data for the given range of value. In the same way, the partial value can also be easily
searched, i.e., the student name starting with 'JA' can be easily searched.

Cons of ISAM
o This method requires extra space in the disk to store the index value.
o When the new records are inserted, then these files have to be reconstructed to maintain the sequence.
o When the record is deleted, then the space used by it needs to be released. Otherwise, the performance of the
database will slow down.

B+ Tree File Organization


B+ Tree is an advanced method of ISAM file organization. It uses the same concept of key-index, but in a tree
like structure. B+ tree is similar to binary search tree, but it can have more than two leaf nodes. It stores all the
records only at the leaf node. Intermediary nodes will have pointers to the leaf nodes. They do not contain any
data/records.
Consider a student table below. The key value here is STUDENT_ID. And each record contains the details of
each student along with its key value and the index/pointer to the next value. In a B+ tree it can be represented
as below.
Please note that the leaf node 100 means, it has name and address of student with ID 100, as we saw in R1,
R2, R3 etc above.
From the above B+ tree structure, it is evident that
 There is one main node called root of the tree – 105 is the root here.
 There is an intermediary layer with nodes. They do not have actual records stored. They are all
pointers to the leaf node. Only the leaf node contains the data in sorted order.
 The nodes to the left of the root nodes have prior values of root and nodes to the right have next
values of the root. i.e.; 102 and 108 respectively.
 There is one final node, called leaf node, which has only values. i.e.; 100, 101, 103, 104, 106 and 107
 All the leaf nodes are balanced – all the leaf nodes at same distance from the root node. Hence
searching any record is easier.
 Searching any record is linear in this case. Any record can be traversed through single path and
accessed easily.
 Since the intermediary nodes have only pointers to the leaf node, the tree structure is of shorter height.
Shorter the height, faster is the traversal and hence the retrieval of records.
Advantages of B+ Trees
 Since all records are stored only in the leaf node and are sorted sequential linked list, searching is
becomes very easy.
 Using B+, we can retrieve range retrieval or partial retrieval. Traversing through the tree structure
makes this easier and quicker.
 As the number of record increases/decreases, B+ tree structure grows/shrinks. There is no restriction
on B+ tree size, like we have in ISAM.
 Since it is a balance tree structure, any insert/ delete/ update does not affect the performance.
 Since we have all the data stored in the leaf nodes and more branching of internal nodes makes height
of the tree shorter. This reduces disk I/O. Hence it works well in secondary storage devices.
Disadvantages of B+ Trees
 This method is less efficient for static tables.
What is B+ Tree
A balanced binary search tree is the B+ Tree. It uses a multilevel indexing system. Leaf nodes in the B plus tree represent actual data
references. The B plus tree keeps all of the leaf nodes at the same height. A link list is used to connect the leaf nodes in the B+ Tree. As a
result, a B+ Tree can allow both random and sequential access.
Multilevel indexing is a crucial topic to grasp before understanding B+ Tree in the data structure. The index of indices is formed in multilevel
indexing, as shown in the diagram below. It facilitates and accelerates data access.
B+ Tree Properties
 The leaves are all at the same height.
 There are at least two children in the root.
 Except for root, each node can have a max of m children and a min of m/2 children.
 A max of m – 1 keys and a min of m/2 – 1 keys can be stored in each node.
Both keys and records can be placed in the internal and leaf nodes of a B Tree. In a B+ Tree, records or data can only be kept on the leaf
nodes, whereas key values can only be placed on the internal nodes.
To make search queries more efficient, the leaf nodes of the B plus tree in the data structure are connected together in the form of singly-
linked lists.
B+ Trees are used to store vast amounts of data that are too large to fit in the main memory. The internal nodes of the B+ Tree (the keys to
access records) are securely stored memory, whereas leaf nodes are placed in the secondary memory due to the restricted amount of main
memory.
B plus tree internal nodes are often referred to as index nodes. The following diagram depicts a B+ Tree of order 3.

Pros of B+ Tree
 In an equal number of disc accesses, records can be retrieved.
 In comparison to the B tree, the B+ Tree’s height stays balanced and is lower.
 The stored data in a B+ Tree can be accessed both sequentially and directly.
 Indexing is done via keys.
 Because the data is only stored on the leaf nodes, search queries are faster.
B+ Tree Structure
 Every leaf node in the B+ Tree is at the same distance from the root node. The B+ Tree has an order of n, with n being the same
for all B plus trees.
 It has a leaf node and an internal node.
Internal Node
 Except for the root node, an internal node present in the B+ Tree can at least have n/2 record pointers.
 The internal node of the tree can only have n pointers.
Leaf Node
 At least n/2 key values and n/2 record pointers can be found in the B plus tree’s leaf node.
 A leaf node can only have n record pointers and n key values.
 Every B+ Tree leaf node has a block pointer P that points to the next leaf node.
Searching Records in B+ Tree
Suppose that we need to find 55 in the B+ Tree structure below. We’ll start by fetching for the intermediary node, which will lead to the leaf
node, which may have a record of 55.
So we’ll identify a branch between 50 to 75 nodes in the intermediary node. Then we’ll be redirected to the 3rd leaf node at the conclusion. In
this case, the DBMS will use a sequential search to locate 55.

Insertion in B+ Tree
Suppose that we wish to add a record 60 to the structure below. After 55, it will reach the third leaf node. It’s a balanced tree, and one of its
leaf nodes is already full; thus, we can’t put 60 in there.
We must split the leaf node in this scenario so that it may be added to the tree without impacting the fill factor, balance or order.

The values of the 3rd leaf node are (50, 55, 60, 65, 70), and the current root node is 50. To keep the tree’s equilibrium, we’ll divide the leaf
node in half in the middle. As a result, (50, 55) & (60, 65, 70) can be grouped into two leaf nodes.
The intermediate node can’t branch from 50 if these two must be leaf nodes. It must have 60 added to it, and thereafter links to a new leaf
node will be available.
When there is an overflow, we can enter an entry in this manner. In a typical scenario, finding the node in which it fits and then placing it in
that leaf node is extremely simple.
Deletion in B+ Tree
Imagine that we want to remove 60 from the previous example. In this scenario, we must subtract 60 from both the intermediate node and
the fourth leaf node. The tree will not satisfy the B plus tree rule if it is removed from the intermediate node. As a result, we must alter it in
order to achieve a balanced tree.
It will look like the following after eliminating node 60 from the B+ Tree and rearranging the nodes:

Hash based indexing


In a huge database structure, it is very inefficient to search all the index values and reach the desired data. Hashing technique
is used to calculate the direct location of a data record on the disk without using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the hashing function. The memory
location where these records are stored is known as data bucket or data blocks.
In this, a hash function can choose any of the column value to generate the address. Most of the time, the hash function uses
the primary key to generate the address of the data block. A hash function is a simple mathematical function to any complex
mathematical function. We can even consider the primary key itself as the address of the data block. That means each row
whose address will be the same as a primary key stored in the data block.
The above diagram shows data block addresses same as primary key value. This hash function can also be a simple
mathematical function like exponential, mod, cos, sin, etc. Suppose we have mod (5) hash function to determine the address
of the data block. In this case, it applies mod (5) hash function on the primary keys and generates 3, 3, 1, 4 and 2
respectively, and records are stored in those data block addresses.
AD

Types of Hashing:

Static Hashing
In static hashing, the resultant data bucket address will always be the same. That means if we generate an address for
EMP_ID =103 using the hash function mod (5) then it will always result in same bucket address 3. Here, there will be no
change in the bucket address.
Hence in this static hashing, the number of data buckets in memory remains constant throughout. In this example, we will
have five data buckets in the memory used to store the data.

Operations of Static Hashing


o Searching a record
When a record needs to be searched, then the same hash function retrieves the address of the bucket where the data is
stored.
o Insert a Record
AD

When a new record is inserted into the table, then we will generate an address for a new record based on the hash key and
record is stored in that location.
o Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted. Then we will delete the records for that
address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data record is updated.
If we want to insert some new record into the file but the address of a data bucket generated by the hash function is not
empty, or data already exists in that address. This situation in the static hashing is known as bucket overflow. This is a
critical situation in this method.
To overcome this situation, there are various methods. Some commonly used methods are as follows:

1. Open Hashing
When a hash function generates an address at which data is already stored, then the next bucket will be allocated to it. This
mechanism is called as Linear Probing.
For example: suppose R3 is a new address which needs to be inserted, the hash function generates address as 112 for R3.
But the generated address is already full. So the system searches next available data bucket, 113 and assigns R3 to it.

2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash result and is linked after the previous one. This
mechanism is known as Overflow chaining.
For example: Suppose R3 is a new address which needs to be inserted into the table, the hash function generates address as
110 for it. But this bucket is full to store the new data. In this case, a new bucket is inserted at the end of 110 buckets and is
linked to it.
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like bucket overflow.
o In this method, data buckets grow or shrink as the records increases or decreases. This method is also known as
Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor performance.

How to search a key


o First, calculate the hash address of the key.
o Check how many bits are used in the directory, and these bits are called as i.
o Take the least significant i bits of the hash address. This gives an index of the directory.
o Now using the index, go to the directory and find bucket address where the record might be.
AD

How to insert a new record


o Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the prefix of their hash address:

The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits of 5 and 6 are 01, so it will go into bucket B1.
The last two bits of 1 and 3 are 10, so it will go into bucket B2. The last two bits of 7 are 11, so it will go into B3.

AD

Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, it must go into the first bucket. But bucket B1 is full, so it will get split.
o The splitting will separate 5, 9 from 6 since last three bits of 5, 9 are 001, so it will go into bucket B1, and the last
three bits of 6 are 101, so it will go into bucket B5.
o Keys 2 and 4 are still in B0. The record in B0 pointed by the 000 and 100 entry because last two bits of both the
entry are 00.
o Keys 1 and 3 are still in B2. The record in B2 pointed by the 010 and 110 entry because last two bits of both the
entry are 10.
o Key 7 are still in B3. The record in B3 pointed by the 111 and 011 entry because last two bits of both the entry are
11.

Advantages of dynamic hashing


o In this method, the performance does not decrease as the data grows in the system. It simply increases the size of
memory to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data. There will not be any unused memory
lying.
o This method is good for the dynamic database where data grows and shrinks frequently.

Disadvantages of dynamic hashing


o In this method, if the data size increases then the bucket size is also increased. These addresses of data will be
maintained in the bucket address table. This is because the data address will keep changing as buckets grow and
shrink. If there is a huge increase in data, maintaining the bucket address table becomes tedious.
o In this case, the bucket overflow situation will also occur. But it might take little time to reach this situation than
static hashing.

You might also like