Advance Data Structure


BACHELOR OF COMPUTER APPLICATIONS

BCA-PC(L)-235

ADVANCE DATA STRUCTURES

Directorate of Distance Education


Guru Jambheshwar University of
Science & Technology
Hisar - 125001
CONTENTS

1. BASICS OF TREES

2. SPECIAL TREES

3. HASHING

4. GRAPH

5. SORTING METHODS

6. SEARCHING TECHNIQUES

Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 1

BASICS OF TREES
REVISED/UPDATED SLM BY KAPILA DEVI

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Tree
1.4 Binary Trees
1.4.1 Binary Representation of Tree
1.4.2 Binary Tree Traversal
1.4.3 Implementation of Binary Trees
1.4.4 Deletion from a Binary Tree
1.5 Threaded Binary Trees
1.5.1 Implementation of Threaded Binary Tree
1.5.2 Inorder Traversing of Threaded Binary Tree
1.6 Check Your Progress
1.7 Summary
1.8 Keywords
1.9 Self Assessment Test
1.10 Answer to Check Your Progress
1.11 References/Suggested Reading

1.1 LEARNING OBJECTIVES

After studying this unit, you should be able to:


1. Describe binary tree
2. Describe binary representation of trees
3. Describe implementation of binary tree
4. Describe methods of traversing binary tree
5. Describe how to delete a node from a binary tree.
6. Define threaded binary trees
7. Describe the implementation of threaded binary tree
8. Describe AVL trees
9. Describe how to build height balanced trees
10. Describe multiway search trees
11. Describe 'B' trees
12. Describe Insertion and deletion of nodes in a 'B' tree.

1.2 INTRODUCTION

In the last few units we studied about linear data structures. In many realistic
situations we need to represent data and functioning which are not linear in nature.
For that purpose, we need some non-linear data structures. A tree structure means
that the data is organized such that items of information are related by branches. So
far we have been studying mainly linear types of data structures such as arrays,
lists, stacks and queues, now we will study a non-linear data structure called a
Tree. This structure is mainly used to represent data containing a hierarchical
relationship between elements, familiar examples of such structure are: family
trees, the hierarchy of positions in an organization, an algebraic expression
involving operations for which certain rules of precedence are prescribed etc. For
example, suppose we wish to use a data structure to represent a person and all of
his or her descendants. Assume that the person's name is Alan and he has 3

children, John, Joe and Johnson. Also suppose that Joe has 3 children, Alec, Mary
and Chris and Johnson has one child Peter. We can represent Alan and his
descendants quite naturally with the tree structure shown in Figure 1.1.

Fig 1.1

1.3 TREE
Definition of Tree: A Tree is a finite set of one or more nodes such that there is a
specially designated node called the root, and the remaining nodes are partitioned
into n ≥ 0 disjoint sets S1, ..., Sn, where each of these sets is a tree. S1, ..., Sn are
called the subtrees of the root. If we look at Figure 1.1 we see that the root of the
tree is Alan. The tree has three subtrees whose roots are John, Joe and Johnson.
The condition that S1, ..., Sn be disjoint sets prohibits subtrees from ever
connecting together.
Node: A node stands for the item of information plus the branches to other items.
Consider the Tree of Figure 1.2; it has 13 nodes.
Degree: The number of subtrees of a node is called its degree. In Figure 1.2 the
degree of node 1 is 3.
Leaf or Terminal Nodes: Nodes that have degree zero are called leaf or terminal
nodes. In Figure 1.2, nodes 6, 7, 9, 10, 11, 12 and 13 are 'Leaf' nodes; the other
nodes of the Tree are called 'Non-leaf' nodes.
Children: The roots of the subtrees of a node I are called the children of node I. I
is the 'parent' of its children.
Siblings: Children of the same parent are called 'siblings'. In Figure 1.2, nodes 8,
9 and 10 are siblings, just as 11 and 12 are.
Level: The 'level' of a node is defined by initially letting the root be at level 1. If a
node is at level l, then its children are at level l+1.
Height or Depth: The height or depth of a Tree is defined as the maximum level of
any node in the Tree.
Forest: A 'Forest' is a set of n ≥ 0 disjoint trees.

Fig 1.2 An Example Tree

1.4 BINARY TREE

A Binary Tree is a finite set of elements that is either empty or is partitioned into
three disjoint subsets. The first subset contains a single element called the Root of
the tree. The other two subsets are themselves Binary Trees, called the left and
right subtrees of the original tree. A left or right subtree can be empty. Figure 1.3
shows a typical Binary Tree. A node of a Binary Tree can have at most two
branches.
If A is the root of a Binary Tree and B, C are the roots of its left and right subtrees
respectively, then A is said to be the father of B and C, and B and C are called the
Left and Right Sons respectively. If every non-leaf node in a Binary Tree has
non-empty Left and Right subtrees, the tree is termed a Strictly Binary Tree. The
Binary Tree of Figure 1.4 is a Strictly Binary Tree. A Strictly Binary Tree with n
leaves always contains 2n - 1 nodes.
A Complete Binary Tree is a Strictly Binary Tree of depth d all of whose leaves
are at level d. Figure 1.5 represents a Complete Binary Tree.

Fig. 1.3 A Binary Tree

Fig. 1.4 Strictly Binary Tree

Fig. 1.5 The Complete Binary Tree
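The claim above, that a strictly binary tree with n leaves contains 2n - 1 nodes, can be checked with two short counting functions. This is a minimal sketch; the node layout and names are illustrative, not from the text.

```c
#include <stddef.h>

/* Minimal node layout (illustrative) for checking the node-count claim. */
struct node { struct node *left, *right; };

/* Total number of nodes in the tree. */
int count_nodes(struct node *t) {
    if (t == NULL) return 0;
    return 1 + count_nodes(t->left) + count_nodes(t->right);
}

/* Number of leaf (degree-zero) nodes. */
int count_leaves(struct node *t) {
    if (t == NULL) return 0;
    if (t->left == NULL && t->right == NULL) return 1;
    return count_leaves(t->left) + count_leaves(t->right);
}
```

For any strictly binary tree built by hand, `count_nodes` should equal `2 * count_leaves - 1`.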

1.4.1 BINARY REPRESENTATION OF TREES


Lemma: If T is a k-ary tree (i.e. a tree of degree k) with n nodes, each having a
fixed size, then n(k-1)+1 of the nk link fields are zero, n ≥ 1.
From the above Lemma it can be seen that for a 3-ary tree more than 2/3 of the
link fields are zero. The proportion of zero links approaches 1 as the degree of the
tree increases. A possible node structure of a k-ary tree will be:

Fig. 1.6
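For the binary case (k = 2) the Lemma says that n + 1 of the 2n link fields are zero (NULL). A short sketch to verify the count on a concrete tree; the node layout here is illustrative.

```c
#include <stddef.h>

/* Node of a k-ary tree with k = 2 (illustrative layout). */
struct knode { struct knode *link[2]; };

/* Total number of nodes. */
int nodes(struct knode *t) {
    if (t == NULL) return 0;
    return 1 + nodes(t->link[0]) + nodes(t->link[1]);
}

/* Count the link fields that are zero (NULL) over the whole tree. */
int null_links(struct knode *t) {
    int z = 0, i;
    if (t == NULL) return 0;
    for (i = 0; i < 2; i++)
        z += (t->link[i] == NULL) ? 1 : null_links(t->link[i]);
    return z;
}
```

Any binary tree with n nodes should give `null_links == nodes + 1`, i.e. n(k-1)+1 with k = 2.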

The importance of using Binary Trees to represent trees is that, for Binary Trees,
only about 1/2 of the link fields are zero. In arriving at the binary tree
representation of a tree we shall implicitly make use of the fact that the order of
the children of a node is not important. The "parent-child" relationship between
the nodes of a tree is converted into the "leftmost-child, next-right-sibling"
relationship: every node has at most one leftmost child and at most one next right
sibling. The rchild of the root node of every resulting binary tree will be empty.
Example: Suppose we have a tree as given in the Fig 1.7

Fig.1.7
The binary tree corresponding to the tree of Fig 1.7 is obtained by connecting
together all the siblings of a node and deleting all links from a node to its children
except for the link to its leftmost child. The corresponding node structure is shown
below:

Using the transformation defined above we obtain the representation for the tree as
shown in Fig 1.8

Fig. 1.8
On tilting it roughly 45° clockwise we get Fig 1.9.

Fig. 1.9
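The leftmost-child, next-right-sibling relationship described above can be captured in a node structure like the following. This is a sketch; the field names are this sketch's own, not from the text.

```c
#include <stddef.h>

/* Illustrative node for the leftmost-child, next-right-sibling form. */
struct gtnode {
    char data;
    struct gtnode *leftmost_child;  /* becomes the binary tree's left link  */
    struct gtnode *right_sibling;   /* becomes the binary tree's right link */
};

/* Number of children of p, found by walking the sibling chain. */
int child_count(struct gtnode *p) {
    int n = 0;
    struct gtnode *c;
    for (c = p->leftmost_child; c != NULL; c = c->right_sibling)
        n++;
    return n;
}
```

A parent with children B, C, D keeps only one child pointer (to B); C and D are reached through B's sibling chain, which is exactly the right-link chain of the resulting binary tree.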

1.4.2 BINARY TREE TRAVERSAL


The traversal of a Binary Tree is to visit each node in the tree exactly once. A full
traversal produces a linear order for the information in a tree. When traversing a
Binary Tree, we want to treat each node and its subtrees in the same fashion. If we
let L, D, R stand for moving left, printing the data, and moving right when at a node
then there are six possible combinations of traversal: LDR, LRD, DLR, DRL, RDL
and RLD. If the convention is adopted that we traverse left before right, then only
three traversals remain, i.e. LDR, LRD and DLR. To these, the names Inorder,
Postorder and Preorder are assigned, because there is a natural correspondence
between these traversals and producing the Infix, Postfix and Prefix forms of an
expression.
Preorder
To traverse a non-empty tree in preorder we perform the following three operations:
a. Visit the root
b. Traverse the left subtree in Preorder
c. Traverse the right subtree in Preorder.
Inorder
a. Traverse the left subtree in Inorder
b. Visit the root
c. Traverse the right subtree in Inorder.
Postorder
a. Traverse the Left subtree in Postorder.
b. Traverse the right subtree in Postorder.
c. Visit the root.
Example: Consider the Tree of Figure 1.10.

Fig. 1.10 Binary Tree


• In Inorder Traversal the order will be:
A/B**C*D+E
which is the Infix form of the expression.
• In Postorder Traversal the order will be:
ABC**/D*E+
which is the Postfix form of the expression.
• In Preorder Traversal the order will be:
+*/A**BCDE
which is the Prefix form of the expression.
The above mentioned Traversal Methods can be implemented either recursively or
non-recursively.
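As a sketch of the non-recursive route, inorder traversal can be driven by an explicit stack instead of recursion. The fixed stack size, the output array, and the names here are assumptions of this sketch.

```c
#include <stddef.h>

struct btree { int data; struct btree *left, *right; };

/* Iterative inorder traversal with an explicit stack. Visited values
   are stored into out[]; the function returns how many were visited. */
int inorder_iter(struct btree *s, int *out) {
    struct btree *stack[64];   /* assumed depth bound for this sketch */
    struct btree *p = s;
    int top = 0, n = 0;
    while (p != NULL || top > 0) {
        while (p != NULL) {    /* push and go as far left as possible */
            stack[top++] = p;
            p = p->left;
        }
        p = stack[--top];      /* visit the node */
        out[n++] = p->data;
        p = p->right;          /* then traverse its right subtree */
    }
    return n;
}
```

The stack replays exactly the deferred "visit me after my left subtree" steps that the recursive version keeps on the call stack.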

1.4.3 IMPLEMENTATION OF BINARY TREES


A node (as in Figure 1.11) of a binary tree can be represented as:
struct btree
{
    int data;
    struct btree *left;
    struct btree *right;
};

Fig. 1.11 Node of a Binary Tree


In the above structure two pointers of struct btree type are defined. One points to
the left child of this node and other points to the right child of the node.
The algorithm to implement a Binary Tree proceeds in two phases. The first phase
builds the binary tree, and the second traverses the tree. In the first phase, as each
number is read for insertion into the Binary Tree, it is compared with the contents
of a node in the tree; a left branch is taken if the number is smaller than the
contents of the node, and a right branch if it is greater than or equal to the contents
of the node. Code to implement this is given below:
void insert(struct btree **s, int num)
{   /* s points to the root pointer of the existing tree */
    struct btree *temp, *p;
    temp = malloc(sizeof(struct btree));
    temp->data = num;
    temp->left = temp->right = NULL;
    if (*s == NULL)               /* if current tree is empty */
        *s = temp;
    else
    {
        p = *s;
        while (1)
        {
            if (p->data <= num)   /* if num is greater or equal, go right */
            {
                if (p->right == NULL)
                    break;
                p = p->right;
            }
            else                  /* else go left */
            {
                if (p->left == NULL)
                    break;
                p = p->left;
            }
        }
        if (p->data <= num)
            p->right = temp;
        else
            p->left = temp;
    }
}
The functioning of the above code can be explained through an example. Suppose
we have to make a tree with the following data:
5, 3, 7, 8, 2, 4, 6, 1.
When 5 comes, the tree is empty, hence a tree is formed having only one node,
pointed to by temp and by the root pointer. Both the child pointers of this node are
NULL. The next number is 3. In the loop this number is compared with 5;
because 5 is greater than 3, control moves to the left child of 5, which is NULL.
After exiting from the loop, 3 is attached to the left of 5. The next number is 7,
which is greater than 5; hence it is attached to the right of 5. Proceeding in the
same manner, the Binary Tree of Figure 1.12 is produced.

Fig. 1.12
Such a binary tree has the property that all elements in the left subtree of a node n
are less than the contents of n, and all elements in the right subtree of n are greater
than or equal to the contents of n. A binary tree with this property is called a
Binary Search Tree.
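The two phases can be exercised end to end with a small driver. This sketch condenses the insertion rule described above and collects an inorder walk into an array; the function names and the array-collecting helper are this sketch's own.

```c
#include <stdlib.h>

struct btree { int data; struct btree *left, *right; };

/* Compact version of the insertion rule above: smaller keys go left,
   greater-or-equal keys go right. */
void bst_insert(struct btree **s, int num) {
    struct btree *temp = malloc(sizeof(struct btree));
    struct btree *p;
    temp->data = num;
    temp->left = temp->right = NULL;
    if (*s == NULL) { *s = temp; return; }
    for (p = *s; ; ) {
        if (p->data <= num) {
            if (p->right == NULL) { p->right = temp; return; }
            p = p->right;
        } else {
            if (p->left == NULL) { p->left = temp; return; }
            p = p->left;
        }
    }
}

/* Inorder walk into out[]; for a binary search tree this yields the
   keys in ascending order. */
int collect(struct btree *s, int *out, int n) {
    if (s == NULL) return n;
    n = collect(s->left, out, n);
    out[n++] = s->data;
    return collect(s->right, out, n);
}
```

Inserting 5, 3, 7, 8, 2, 4, 6, 1 and then walking the tree inorder yields 1 through 8 in ascending order, confirming the Binary Search Tree property.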
In the second phase we can traverse the tree in any of the three fashions (i.e.
Inorder, Preorder, Postorder). A recursive method can be used to traverse a Binary
Tree. The method starts by checking whether the tree is empty or not. If the tree is
not empty, i.e. our pointer s of struct btree type does not point to NULL, then we
proceed. Depending upon the desired fashion of traversing, we call the traversing
function recursively, passing either the left child pointer or the right child pointer
of the node as an argument. The functions to implement this are given below:
/* Inorder Traversal of a Binary Search Tree */
void inorder(struct btree *s)
{
    if (s != NULL)
    {
        inorder(s->left);
        printf("%d ", s->data);
        inorder(s->right);
    }
    else
        return;
}
/* Preorder Traversal of a Binary Search Tree */
void preorder(struct btree *s)
{
    if (s != NULL)
    {
        printf("%d ", s->data);
        preorder(s->left);
        preorder(s->right);
    }
    else
        return;
}
/* Postorder Traversal of a Binary Search Tree */
void postorder(struct btree *s)
{
    if (s != NULL)
    {
        postorder(s->left);
        postorder(s->right);
        printf("%d ", s->data);
    }
    else
        return;
}

1.4.4 DELETION FROM A BINARY TREE


One of the important operations associated with any data structure is deletion of a
specified data item. Assume that the data item we wish to delete from the Binary
Tree is passed to the delete function. There are four possible cases:
a. No node contains the specified data
b. The node containing the data has no children
c. The node containing the data has exactly one child
d. The node containing the data has two children.
In case (a) we only need to print a message that the data item is not present
in the tree.
Figure 1.13 represents case (b). We have to delete the node containing data 4.
The node is a leaf node; since it has no children, it can be deleted by making its
parent point to NULL. Which of the two links (Left or Right) of its parent node is
set to NULL depends upon whether the node being deleted is a left child or a right
child of its parent.

Fig. 1.13
Figure 1.14 represents case (c), in which the node to be deleted has one child; we
have to adjust the pointer of the parent of the node to be deleted such that after
deletion it points to the child of the node being deleted. In Figure 1.14(a) the node
with data 1 is to be deleted. After deletion the left pointer of node 6 is made to
point to the child of node 1, i.e. node 4, and the structure is then as shown in
Figure 1.14(b).

Fig. 1.14
In case (d) the value is replaced by the smallest value in the right subtree or the
largest key value in the left subtree; subsequently the emptied node is deleted
recursively. Figure 1.15 represents this case. If the node with data value 6 is to be
deleted, then first its value is replaced by the smallest value in its right subtree.
Adjustment is done for the parent of this node (with the smallest value) according
to case (b) or (c), and then the node is deleted.

Fig. 1.15
Function for implementing the deletion procedure is given below:
void delete(struct btree **root, int num)
{
    int found;
    struct btree *par, *x, *xsucc;

    if (*root == NULL)
    {
        printf("\nEmpty Tree");
        return;
    }

    par = x = NULL;
    found = search(*root, num, &par, &x);
    /* search() locates the node to be deleted and its parent */
    if (found == 0)
    {
        printf("\nData to be deleted, not found");
        return;
    }

    /* If the node to be deleted has two children, copy the data of its
       inorder successor into it, then delete the successor instead */
    if (x->left != NULL && x->right != NULL)
    {
        par = x;
        xsucc = x->right;
        while (xsucc->left != NULL)
        {
            par = xsucc;
            xsucc = xsucc->left;
        }
        x->data = xsucc->data;
        x = xsucc;
    }

    /* If the node to be deleted has no child */
    if (x->left == NULL && x->right == NULL)
    {
        if (par == NULL)            /* deleting the root itself */
            *root = NULL;
        else if (par->right == x)
            par->right = NULL;
        else
            par->left = NULL;
        free(x);
        return;
    }

    /* If the node to be deleted has only a right child */
    if (x->left == NULL && x->right != NULL)
    {
        if (par == NULL)
            *root = x->right;
        else if (par->left == x)
            par->left = x->right;
        else
            par->right = x->right;
        free(x);
        return;
    }

    /* If the node to be deleted has only a left child */
    if (x->left != NULL && x->right == NULL)
    {
        if (par == NULL)
            *root = x->left;
        else if (par->left == x)
            par->left = x->left;
        else
            par->right = x->left;
        free(x);
        return;
    }
}
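The deletion procedure relies on a search() helper that the text does not list. One possible shape for it is sketched below; the exact signature is an assumption, chosen so that it reports both the matching node and its parent through pointer arguments.

```c
#include <stddef.h>

struct btree { int data; struct btree *left, *right; };

/* Assumed helper for the deletion procedure: locate the node holding
   num, fill in *x (the node) and *par (its parent, NULL for the root),
   and return 1 if found, 0 otherwise. */
int search(struct btree *root, int num,
           struct btree **par, struct btree **x) {
    struct btree *p = root, *q = NULL;
    while (p != NULL) {
        if (p->data == num) { *par = q; *x = p; return 1; }
        q = p;
        p = (num < p->data) ? p->left : p->right;
    }
    *par = *x = NULL;
    return 0;
}
```

The walk follows the same smaller-left, greater-or-equal-right rule used when the tree was built, so it takes at most one branch per level.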

1.5 THREADED BINARY TREES

A Binary Tree has the following drawbacks:

1. A Binary Tree with N nodes has more NULL links (N+1) than actual pointers
(N-1), out of 2N total links.
2. There is no way of backtracking (i.e. we cannot traverse to the nodes above the
node we have reached).
If we can use these links cleverly we can remove all the drawbacks of a Binary
Tree.
To make use of these NULL links, they can be replaced by pointers, called
threads, to other nodes in the tree. If the right child of any node p is NULL, then
the pointer to this right child is replaced by a pointer to the node which would be
printed after p when traversing the tree in Inorder, i.e. its Inorder Successor. A
NULL left-child link at node p is replaced by a pointer to the node which
immediately precedes p in Inorder, i.e. its Inorder Predecessor.
Threads and normal pointers are distinguished by adding two boolean fields to the
record (structure) of a node, i.e. lthread and rthread. If the left thread of any node
is true then the lchild link of this node contains a thread; otherwise it contains a
pointer to the left child.
Fig 1.16 shows a Binary Tree, and Fig 1.17 (a) and (b) show its right and left
threads:
Example:

Fig. 1.16

Fig.1.17
In order to avoid dangling pointers the entire tree is treated as the left subtree of a
head node. As shown in Fig 1.18, the structure of a node of a Threaded Binary
Tree will be as shown below.

Fig 1.18
This structure can be declared as:
struct thtree
{
    int lthread;
    int rthread;
    int data;
    struct thtree *lchild;
    struct thtree *rchild;
};

1.5.1 IMPLEMENTATION OF THREADED BINARY TREE


Each node will contain data, a left pointer, a true or false value for the left thread,
a right pointer, and a true or false value for the right thread.
Initially, when the tree is empty, we create a head node. We store garbage in its
data field; its right link points to itself, and the entire tree is treated as the left
subtree of this node. The function to implement this is given below:
void insert(struct thtree **s, int x)
{
    struct thtree *head, *p, *z;

    /* allocating a new node */
    z = malloc(sizeof(struct thtree));
    z->lthread = TRUE;       /* indicates a thread */
    z->data = x;             /* assign new data */
    z->rthread = TRUE;       /* indicates a thread */

    /* if tree is empty */
    if (*s == NULL)
    {
        head = malloc(sizeof(struct thtree));
        head->lthread = FALSE;
        head->lchild = z;      /* z becomes left child of the head node */
        head->data = -9999;    /* garbage */
        head->rthread = FALSE;
        head->rchild = head;   /* right link of the head points to itself */
        *s = head;
        z->lchild = head;      /* left thread to head */
        z->rchild = head;      /* right thread to head */
    }
    else    /* if tree is not empty */
    {
        head = *s;
        p = head->lchild;
        /* traverse until a thread is found */
        while (p != head)
        {
            if (p->data > x)
            {
                if (p->lthread != TRUE)   /* a real link: keep going left */
                    p = p->lchild;
                else
                {
                    z->lchild = p->lchild;
                    p->lchild = z;
                    p->lthread = FALSE;   /* indicates a link now */
                    z->rthread = TRUE;
                    z->rchild = p;        /* right thread to inorder successor */
                    return;
                }
            }
            else
            {
                if (p->rthread != TRUE)   /* a real link: keep going right */
                    p = p->rchild;
                else
                {
                    z->rchild = p->rchild;
                    p->rchild = z;
                    p->rthread = FALSE;   /* indicates a link now */
                    z->lthread = TRUE;
                    z->lchild = p;        /* left thread to inorder predecessor */
                    return;
                }
            }
        }
    }
}

Fig. 1.19

1.5.2 INORDER TRAVERSING OF THREADED BINARY TREE


Because we can traverse upward through threads, we don't need to store nodes
temporarily on a stack. The while loop terminates when our pointer reaches the
head node (passed as root) through a thread. The function to implement this is
given below:
void inorder(struct thtree *root)   /* root is the head node */
{
    struct thtree *p;
    p = root->lchild;
    while (p != root)
    {
        while (p->lthread == FALSE)   /* follow real left links */
            p = p->lchild;
        printf("%d ", p->data);
        while (p->rthread == TRUE)    /* follow right threads */
        {
            p = p->rchild;
            if (p == root)
                break;
            printf("%d ", p->data);
        }
        p = p->rchild;
    }
}
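The scheme above can be exercised end to end with a condensed, self-contained demo. The TRUE/FALSE macros, function names, and the array-collecting variant of the traversal are this sketch's own; the threading logic follows the description in the text.

```c
#include <stdlib.h>

#define TRUE  1
#define FALSE 0

struct thtree {
    int lthread, rthread, data;
    struct thtree *lchild, *rchild;
};

/* Insert x into a threaded tree hung from a head node (created on
   first insert), threading the new node to its inorder neighbours. */
void th_insert(struct thtree **s, int x) {
    struct thtree *head, *p;
    struct thtree *z = malloc(sizeof(struct thtree));
    z->lthread = z->rthread = TRUE;
    z->data = x;
    if (*s == NULL) {                  /* create the head node */
        head = malloc(sizeof(struct thtree));
        head->lthread = FALSE;
        head->rthread = FALSE;
        head->data = -9999;            /* garbage */
        head->lchild = z;
        head->rchild = head;           /* right link points to itself */
        z->lchild = z->rchild = head;
        *s = head;
        return;
    }
    head = *s;
    p = head->lchild;
    while (p != head) {
        if (x < p->data) {
            if (p->lthread != TRUE) p = p->lchild;
            else {                     /* hang z off p's left thread */
                z->lchild = p->lchild;
                z->rchild = p;
                p->lchild = z;
                p->lthread = FALSE;
                return;
            }
        } else {
            if (p->rthread != TRUE) p = p->rchild;
            else {                     /* hang z off p's right thread */
                z->rchild = p->rchild;
                z->lchild = p;
                p->rchild = z;
                p->rthread = FALSE;
                return;
            }
        }
    }
}

/* Inorder walk into out[], following threads instead of a stack;
   returns the number of values visited. */
int th_inorder(struct thtree *head, int *out) {
    int n = 0;
    struct thtree *p = head->lchild;
    while (p != head) {
        while (p->lthread == FALSE) p = p->lchild;
        out[n++] = p->data;
        while (p->rthread == TRUE) {
            p = p->rchild;
            if (p == head) return n;
            out[n++] = p->data;
        }
        p = p->rchild;
    }
    return n;
}
```

Inserting 5, 3, 7, 1, 4 and walking the threads inorder visits the keys in ascending order, with no stack and no recursion.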

1.6 CHECK YOUR PROGRESS

1. Fill in the blanks:


a. If a node is at level l, children are at level____
b. A node of _____can have at the most two branches.
c. A strictly binary tree with n leaves always contains _____ nodes.
d. A ________ traversal produces a linear order for the information in a
tree.
e. A binary search tree can be traversed in _____, ______or_______.
f. For a binary tree, every node has at most one ____ and at most one ____.
g. A binary tree has more ____than actual pointers.
h. When a tree is empty we create____.
i. A node will be called ________ if the longest path in its left subtree is
longer than the longest path in its right subtree.
j. A node will be called ______ if the longest path in both of its subtrees is
equal.
k. B-trees are forced to grow at their _____.
l. While inserting elements in an AVL tree we have to perform _____to
maintain it as height balanced.
m. We have to ____ the node if before insertion it is full up to capacity.

1.7 SUMMARY

1. Tree is one of the most important non-linear data structures.


2. Tree is used to represent hierarchical relationship between data items.
3. A Binary Tree is a tree in which each node has at most two children, i.e. a left
child and a right child.
4. Left child of any node is the root of left subtree of that node, similarly right
child of any node is the root of right subtree of that node.
5. A Binary Search Tree T is a binary tree in which all identifiers in the left
subtree of T are less than the identifier in the root node and all identifiers in
the right subtree of T are greater than the identifier in the root.
6. A Binary Search Tree can be traversed in 3 ways:
a. Preorder Traversal in which root is traversed first and then left subtree
and then right subtree.
b. Inorder Traversal in which left subtree is traversed first then root and
then right subtree.
c. Postorder Traversal in which left subtree is traversed first then right
subtree and then root.
7. When deleting any node from Binary Search Tree we replace this node by
either the highest labeled node in left subtree or lowest labeled node in right
subtree.
8. Height Balance Trees are balanced in terms of height of its left and right
subtrees. As a result of the balanced nature of this type of tree, dynamic
retrievals, insertions and deletions can be performed in less time.
9. A tree is height balanced if the balancing factor of each of its nodes is -1, 0 or
+1, where the balancing factor is the difference of the heights of its left and
right subtrees.
10. While inserting elements in an AVL tree we have to perform rotations to
maintain it as height balanced.
11. If the balancing factor is +2 then a clockwise rotation is performed; if it is -2
then an anticlockwise rotation is performed.

12. In an M-way search Tree any node has at most m children and m-1 keys.
13. A 'B' tree of order m can have at most m children and m-1 keys, and every
node other than the root and the leaves has at least [m/2] children.
14. We have to split a node if, before insertion, it is already full to capacity.
15. A 'B' tree grows in the upward direction and its level increases only after
splitting.

1.8 KEYWORDS

Tree: A finite set of one or more nodes used to represent data containing a
hierarchical relationship between elements.
Node: The item of information plus the branches to other items
Degree: The number of subtrees of a node
Leaf: Nodes that have degree zero
Siblings: Children of the same parent
Children: The roots of the subtrees of a node
Forest: A set of n ≥ 0 disjoint trees.
Binary tree: A finite set of elements that is either empty or is partitioned into three
disjoint subsets.
Complete Binary Tree: A strictly Binary tree of depth ‘d’ whose all leaves are at
level d.
Binary Search tree: A binary tree with the condition that either it is empty or each
node in the tree contains an identifier and all identifiers in its left subtree are less
than the identifier in its root node, all identifiers in its right subtree are greater than
the identifier in its root node.
Height Balanced Tree: An empty tree is height balanced. A non-empty binary tree
with HL and HR as its left and right subtrees is height balanced if HL and HR are
height balanced and |hl - hr| ≤ 1, where hl and hr are the heights of HL and HR
respectively.
'B' Tree: An M-way search tree that is either empty or is of height ≥ 1 and
satisfies the following properties:
1. The root node has at least 2 children
2. All nodes other than the root node and failure nodes have at least [m/2]
children.
3. All failure nodes are at the same level.

1.9 SELF ASSESSMENT TEST

1. What is a non-linear data structure, and in which cases is it preferable to
represent data through a non-linear data structure?
2. What is a Tree?
3. What is a Binary Tree? State different characteristics associated with the
Binary Trees.
4. Write short notes on the following:
a. Strictly Binary Tree
b. Complete Binary Tree
c. Height/Depth of a Tree
5. Write a procedure to find the depth of a Binary Tree.
6. Write a program to create a Binary Tree. Also write a function to insert a node
into a Binary Tree.
7. Write down the recursive functions to traverse a Binary Search Tree in
a. Inorder
b. Preorder
c. Postorder.
8. Write a non-recursive function to traverse a Binary Search Tree in pre-order
manner.
9. Write a recursive "C" function to search for an element in a given Binary
Search Tree. Write a non-recursive version for the same.
10. Describe the method for deleting any node from a Binary Search Tree.
11. Write a procedure which deletes all the leaf nodes from Binary Search Tree.
12. Given a Binary Search Tree of names prove that an Inorder Traversal will
always print the names in alphabetical order.

13. How does the use of Threads remove all the drawbacks of Binary Search Trees?
14. Write a function to create a Threaded Binary Tree.
15. Insert the following keys in an empty AVL Tree: MARCH, MAY,
NOVEMBER, AUGUST, APRIL, JANUARY, DECEMBER, JULY,
FEBRUARY, JUNE, OCTOBER and SEPTEMBER
16. Insert the following keys into a 'B' Tree of order 3:
a, f, b, k, h, m, e, s, r, e, l, n, u, p.
17. Explain the method for deleting an element from a 'B' Tree. Explain when we
require merging of nodes while deleting with examples.

1.10 ANSWER TO CHECK YOUR PROGRESS

1. Answer to fill in the blanks


a. l+1
b. binary tree
c. 2n-1
d. full
e. preorder, Inorder, postorder.
f. Left most child, right sibling
g. Null links
h. Head node
i. left heavy
j. balanced
k. root
l. rotation
m. split.

1.11 REFERENCES/SUGGESTED READING
1. An Introduction to Data Structures and Algorithms, Birkhäuser Boston /
Springer, New York.
2. Seymour Lipschutz, "Data Structures", Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, "Fundamentals of Data Structures in C",
Orient Longman.
4. Tremblay, J.P. and Sorenson, P.G., "An Introduction to Data Structures with
Applications", McGraw-Hill International Student Edition, New York.
5. Yedidyah Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum, "Data
Structures using C", Prentice Hall of India Pvt. Ltd., New Delhi.
6. Mark Allen Weiss, "Data Structures and Algorithm Analysis in C",
Addison-Wesley (An Imprint of Pearson Education), Mexico City;
Prentice-Hall of India Pvt. Ltd., New Delhi.

Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 2

SPECIAL TREES
REVISED /UPDATED SLM BY KAPILA DEVI

STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Height Balanced Trees or AVL Trees
2.3.1 Building Height Balanced Trees
2.4 Multiway Search Trees
2.5 B Tree
2.5.1 Insertion into A B Tree
2.5.2 Deletion from A B Tree
2.6 Check your progress
2.7 Summary
2.8 Keywords
2.9 Review Questions
2.10 References/Suggested Reading

2.1 LEARNING OBJECTIVES

After studying this unit, you should be able to:


1. Describe AVL trees
2. Describe how to build height balanced trees
3. Describe multiway search trees
4. Describe 'B' trees
5. Describe Insertion and deletion of nodes in a 'B' tree.

2.2 INTRODUCTION

In the last few units we studied about linear data structures. In many realistic
situations we need to represent data and functioning which are not linear in nature.
For that purpose, we need some non-linear data structures. A tree structure means
that the data is organized such that items of information are related by branches. So
far we have been studying mainly linear types of data structures such as arrays,
lists, stacks and queues, now we will study a non-linear data structure called a
Tree. This structure is mainly used to represent data containing a hierarchical
relationship between elements, familiar examples of such structure are: family
trees, the hierarchy of positions in an organization, an algebraic expression
involving operations for which certain rules of precedence are prescribed etc. For
example, suppose we wish to use a data structure to represent a person and all of
his or her descendants. Assume that the person's name is Alan and he has 3
children, John, Joe and Johnson. Also suppose that Joe has 3 children, Alec, Mary
and Chris and Johnson has one child Peter. We can represent Alan and his
descendants quite naturally with the tree structure shown in Figure 2.1.

Fig 2.1

2.3 HEIGHT BALANCED TREES OR AVL TREES

Adelson-Velskii and Landis in 1962 introduced a binary tree structure that is
balanced with respect to the heights of its subtrees. As a result of the balanced
nature of this type of tree, dynamic retrievals can be performed in O(log n) time if
the tree has n nodes in it. At the same time a new identifier can be entered into or
deleted from such a tree in O(log n) time. The resulting tree remains height
balanced. The tree structure introduced by them is given the name AVL tree.
A binary tree of height h is completely balanced if all leaves occur at nodes of
level h or h-1 and if all nodes at levels lower than h-1 have two children. A tree is
height balanced if, for each node in the tree, the height of the left subtree differs
from the height of the right subtree by no more than 1.

Fig 2.2

Fig 2.3
The tree in Fig 2.2 is height balanced, but it is not completely balanced. The tree
in Fig 2.3 is height balanced, and completely balanced as well. An almost height
balanced tree is called an AVL tree. An AVL tree may or may not be perfectly
balanced.
We can give a recursive definition of AVL trees as given below:
Definition
An empty tree is height balanced. If T is a non-empty binary tree with HL and HR
as its left and right subtrees, then T is height balanced if:
i) HL and HR are height balanced, and
ii) |hl - hr| ≤ 1, where hl and hr are the heights of HL and HR respectively.
In order to prevent a tree from becoming unbalanced we associate a 'balance
factor' with each node of a height balanced tree. This indicator will contain one of
three values, which we denote as Left (L), Right (R) or Balanced (B), according to
the following definitions:
• Left: A node will be called 'left heavy' if the longest path in its left subtree is
one longer than the longest path in its right subtree.
• Balanced: A node will be called 'balanced' if the longest paths in both of its
subtrees are equal.
• Right: A node will be called 'right heavy' if the longest path in its right subtree
is one longer than the longest path in its left subtree.
In a balanced tree each node must be in one of these three states. If there exists a
node in a tree where this is not true, then such a tree is said to be unbalanced. In
other words, a tree is said to be height balanced if all of its nodes have a balance
factor of -1, 0 or +1.
Fig 2.4 gives examples of trees which are balanced; Fig 2.5 gives examples of
unbalanced trees.

Figure 2.4

Fig 2.5

2.3.1 BUILDING HEIGHT BALANCED TREES


An AVL tree is a Binary Search Tree in which the heights of the left and right
subtrees of the root differ by at the most 1 and in which the left and right subtrees
are again AVL trees.
We can insert a new node into an AVL tree by first using the usual binary tree
insertion algorithm comparing the key of the new node with that in the root and
inserting the new node into the left or right subtree as appropriate. The AVL
technique used to keep a tree in height balance requires that each time an insertion
is made according to the insertion rule we must follow the given steps:
• Let the node to be inserted travel down the appropriate branch, keeping track
along the way of the deepest-level node on that branch that has a balance factor
of +1 or -1. This particular node is called the pivot node.
• Inclusive of and below the pivot node, recompute all balance factors along the
insertion path. It can be shown that no nodes other than these can possibly
change their factors using the AVL method.
• Determine whether the absolute value of the pivot node's balance factor
switched from 1 to 2.
• If there was such a switch, perform a manipulation of tree pointers centered at
the pivot node to bring the tree back into height balance. Because the visual
effect of this pointer manipulation is to "rotate" the subtree whose root is the
pivot node, the operation is frequently referred to as an AVL rotation.
Each node of an AVL tree has the property that the height of its left subtree is
either one more than, equal to, or one less than the height of its right subtree.
Balance Factor (BF) = (height of left subtree - height of right subtree)
• If the two subtrees are of the same height, BF = 0
• If the right subtree is higher, BF = -1
• If the left subtree is higher, BF = +1

Fig 2.6 Tree having a Balance Factor at each Node


The addition of a node to a balanced binary search tree could unbalance it. The
rebalancing can be carried out using four kinds of rotations.
LL, RR, LR, RL
LL and RR are symmetric, and LR and RL are symmetric. Rotations are
characterized by the nearest ancestor A of the inserted node Y whose balance
factor becomes ±2. The following characterization of rotation types is obtained.
 LL New Node Y is inserted in the left subtree of the left subtree of A.
 LR Y is inserted in the right subtree of the left subtree of A.
 RR Y is inserted in the right subtree of the right subtree of A.
 RL Y is inserted into the left subtree of the right subtree of A.

An algorithm to implement an LL rotation of a subtree rooted at pivot is as
follows:
p = left(pivot);
q = right(p);
right(p) = pivot;
left(pivot) = q;
pivot = p;
// readjust the BF
bf(pivot) = 0;
bf(right(pivot)) = 0;
Algorithm to implement RR rotation is as given.
p = right(pivot);
q = left(p);
left(p) = pivot;
right(pivot) =q;
pivot = p;
bf(pivot) = 0;
bf(left(pivot)) = 0;
Algorithm to implement LR rotation is given below:
x = left(pivot);
y = right(x);
left(pivot) = right(y);
right(x) = left(y);
left(y) = x;
right(y) = pivot;

pivot = y;
if bf(pivot) = 0 then
bf(right(pivot)) = 0;
bf(left(pivot)) = 0;
else if bf(pivot) = 1 then
bf(pivot) = 0;
bf(left(pivot)) = 0;
bf(right(pivot)) = -1;
else
bf(pivot) = 0;
bf(right(pivot)) = 1;
bf(left(pivot)) = 0;
endif;
endif;
Algorithm to implement RL rotation is given below:
x = right(pivot);
y = left(x);
right(pivot) = left(y);
left(x) = right(y);
right(y) = x;
left(y) = pivot;
pivot = y;
if bf(pivot) = 0 then
bf(left(pivot)) = 0;
bf(right(pivot)) = 0;
else if bf(pivot) = 1 then
bf(pivot) = 0;
bf(left(pivot)) = 0;
bf(right(pivot)) = -1;
else
bf(pivot) = 0;
bf(left(pivot)) = 1;
bf(right(pivot)) = 0;
endif;
endif;
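The four rotation cases can be combined into one working insertion routine. The following is a minimal Python sketch, an illustration rather than the book's pseudocode: instead of storing balance factors it stores heights and recomputes BF = height(left) - height(right) on demand, which produces the same rotations as above.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n):
    return n.height if n else 0

def bf(n):
    # balance factor: height of left subtree minus height of right subtree
    return height(n.left) - height(n.right)

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))
    return n

def rotate_right(pivot):                 # used for the LL case
    p = pivot.left
    pivot.left, p.right = p.right, pivot
    update(pivot)
    return update(p)

def rotate_left(pivot):                  # used for the RR case
    p = pivot.right
    pivot.right, p.left = p.left, pivot
    update(pivot)
    return update(p)

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    update(root)
    if bf(root) > 1:                     # left heavy
        if key < root.left.key:          # LL
            return rotate_right(root)
        root.left = rotate_left(root.left)       # LR: rotate the left child first
        return rotate_right(root)
    if bf(root) < -1:                    # right heavy
        if key > root.right.key:         # RR
            return rotate_left(root)
        root.right = rotate_right(root.right)    # RL: rotate the right child first
        return rotate_left(root)
    return root

def inorder(n):
    return inorder(n.left) + [n.key] + inorder(n.right) if n else []
```

Inserting 10, 20, 30 in that order triggers an RR rotation and makes 20 the root; further insertions keep every node's balance factor within -1 to +1.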
Example: Creation of an AVL Tree
Assume that the insertions are made in the order – March, May, November, August,
April, January, December, July, February, June, October and September.
New identifier After Insertion After Balancing
1. March No rebalancing needed

2. May No rebalancing needed

3. November

4. August No rebalancing needed

5. April

6. January

7. December No rebalancing needed

8. July No rebalancing needed


9. February

10. June

2.4 MULTIWAY SEARCH TREES

Binary Search Trees generalize directly to Multiway Search Trees (M-way Search
Trees), in which, for some integer m called the order of the tree, each node has at the
most m children. If k ≤ m is the number of children, then the node contains exactly
k-1 keys, which partition all the keys into k subsets. If some of these subsets are
empty, then the corresponding children in the tree are empty.
An M-way Search Tree, T, is a tree in which all nodes are of degree ≤ m. If T is
empty (i.e. T = 0) then T is an M-way Search Tree. When T is not empty it has the
following properties:
1. T is a node of the type
n, A0, (K1, A1), (K2, A2), ..., (Kn, An)
where the Ai, 0 ≤ i ≤ n, are pointers to the subtrees of T, the Ki, 1 ≤ i ≤ n, are
key values, and 1 ≤ n < m.
2. Ki < Ki+1, 1 ≤ i < n.
3. All key values in the subtree Ai are less than the key value Ki+1, 0 ≤ i < n.
4. All key values in the subtree An are greater than Kn.
5. The subtrees Ai, 0 ≤ i ≤ n, are also M-way Search Trees.

Node Schematic Format
a 2, b, (20, c), (40, d)
b 2, 0, (10, 0), (15, 0)
c 2, 0, (25, 0), (30, e)
d 2, 0, (45, 0), (50, 0)
e 1, 0, (35, 0)
Fig 2.7 Example of 3-Way Search Tree
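The node format above leads directly to a search procedure: scan the keys of a node in order until the search value matches a key or is bracketed between two keys, then descend into the corresponding subtree. A short Python sketch follows; the (keys, children) pair representation is an assumption made for illustration, and the tree built below is the 3-way search tree of Fig 2.7.

```python
# A node is a pair (keys, children), with len(children) == len(keys) + 1.
# An empty subtree is represented by None.
def mway_search(node, x):
    if node is None:                 # reached an empty subtree: x is absent
        return False
    keys, children = node
    i = 0
    while i < len(keys) and x > keys[i]:
        i += 1                       # find the first key >= x
    if i < len(keys) and x == keys[i]:
        return True
    return mway_search(children[i], x)

# The 3-way search tree of Fig 2.7
e = ([35], [None, None])
b = ([10, 15], [None, None, None])
c = ([25, 30], [None, None, e])
d = ([45, 50], [None, None, None])
a = ([20, 40], [b, c, d])
```

Searching for 35 descends a → c → e, exactly the path the key values dictate, while searching for 22 falls off the tree at an empty subtree of c.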

If any M-way Search Tree satisfies the given three conditions:


 No empty subtrees appear above the leaves (so that the division of keys into
subsets is as efficient as possible).
 All leaves be on the same level (so that searches will be guaranteed to terminate
with about the same number of accesses).
 Every node (except the leaves) have at least half as many children as the
maximum possible.
then the M-way Search Tree is called Balanced M-way Search Tree.

2.5 B TREE

Definition: A B-tree, T, of order m is an M-way Search Tree that is either empty or


is of height ≥ 1 and satisfies the following properties:
1. The root node has at least 2 children.
2. All nodes other than the root node and failure nodes have at least [m/2] children.
3. All failure nodes are at the same level.
NOTE: A failure node represents a node which can be reached during the search
only if the value X being searched for, is not in the tree.
 A B-tree is also known as the balanced sort tree. It finds its use in external
sorting. It is not a binary tree.
 To reduce the disk accesses, several conditions of the tree must be true:
o The height of the tree must be kept to a minimum.
o There must be no empty subtree above the leaves of the tree.

o The leaves of the tree must all be at the same level and
o All nodes except the leaves must have at least minimum of children.
B-Tree of order M has following properties:
1. Each node has a maximum of M children and a minimum of M/2 children. Root
node can have minimum two to the maximum number of children.
2. Each node has one fewer key than children with a maximum of M -1 keys.
3. Keys are arranged in a defined order within the node. All keys in the subtree to
the left of a key are predecessors of the key and that on the right are successors
of the key.
4. When a new key is to be inserted into a full node, the node is split into two
nodes, and the key with the medium value is inserted in the pr esent node. In case
the parent node is the root, a new root is created.
5. All leaves are at the same level, i.e. there is no empty subtree above the level of
the leaves.
B-trees of high order are desirable since they result in a reduction in the number of
disk accesses needed to search an index. If the index has N entries then a 'B' tree of
order m=N+1 would have only one level. This choice of m clearly is not reasonable
since by assumption the index is too large to fit in internal memory. Consequently,
the single node representing the index cannot be read into memory and processed.
In arriving at a reasonable choice of m, we must keep in mind that we are really
interested in minimizing the total amount of time needed to search the B -tree for a
value x.

2.5.1 INSERTION INTO A 'B' TREE


A B-tree starts with a single root node (which is also a leaf node) at level 0 (zero).
Once the root node is full with m-1 search key values and we attempt to insert
another entry in the tree, the root node splits into two nodes at level 1.
B-trees are not allowed to grow at their leaves; instead, they are forced to grow at
the root. First, a search is made to see if the new key is in the tree. This search will
(if the key is truly new) terminate in failure at a leaf.
The new key is then added to the leaf node. If the node was not previously full, then
the insertion is finished. When the key is added to a full node, then the node splits
into two nodes at the same level, except that the median key is not put into either of
the two new nodes, but instead sent up the tree to be inserted into the parent node.
When a key is added to a full root, then the root splits in two and the median key
sent upward becomes a new root.
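The split-and-push-up process can be sketched in Python. The code below is an illustration of this edit, not the book's algorithm: it uses the common "split full nodes on the way down" variant, so the root splits exactly as described, though intermediate tree shapes can differ slightly from a strict split-on-overflow implementation.

```python
import bisect

class BNode:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    def __init__(self, order):               # order m: at most m children, m-1 keys
        self.m, self.root = order, BNode()

    def _split(self, parent, i):
        """Split the full child i of parent; the median key moves up into parent."""
        child = parent.children[i]
        mid = (self.m - 1) // 2
        right = BNode(leaf=child.leaf)
        parent.keys.insert(i, child.keys[mid])
        right.keys, child.keys = child.keys[mid + 1:], child.keys[:mid]
        if not child.leaf:
            right.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]
        parent.children.insert(i + 1, right)

    def insert(self, key):
        if len(self.root.keys) == self.m - 1:    # the tree grows at the root
            old = self.root
            self.root = BNode(leaf=False)
            self.root.children.append(old)
            self._split(self.root, 0)
        node = self.root
        while not node.leaf:                      # descend, splitting full nodes
            i = bisect.bisect(node.keys, key)
            if len(node.children[i].keys) == self.m - 1:
                self._split(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, key)             # finally insert into a leaf

def keys_inorder(node):
    if node.leaf:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += keys_inorder(node.children[i]) + [k]
    return out + keys_inorder(node.children[-1])
```

With order 5, inserting a, g, f, b, k reproduces step 2 of the worked example below: the fifth key forces a split and the median 'f' becomes the new root.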
Example 1
To grow a tree of order 5, with the following keys:
a, g, f, b, k, d, h, m, j, e, s, i, r, x, c, l, n, t, u, p
The tree is initially empty:
1. a, g, f, b:

The first four keys will be inserted into one node. They are sorted into proper
order as they are inserted.
2. k:

Now there is no room left for the fifth key 'k', so its insertion causes a node to
split into two, and the median key 'f' moves up to create a new node, which is
the new root.
3. d, h, m:

Since the split nodes are only half full, the next three keys can be inserted
without difficulty. These insertions can require rearrangement of the keys
within the node.
4. j:

Insertion of key 'j' again splits a node, and this time it is 'j' itself that is the
median key; it therefore moves up to join 'f' in the root.
5. e, s, i, r:

6. x:

7. c, l, n, t, u:

8. p:

The insertion of 'p' first splits the node originally containing k, l, m, n, sending the
median key m upwards into the node containing c, f, j, r, which is already full.
Hence this node in turn splits, and a new root containing 'j' is created.

2.5.2 DELETION FROM A 'B' TREE
If deletion of a value causes a node to be less than half full, it is combined with its
neighbouring nodes and this can propagate all the way to the root.
If the leaf contains more than the minimum of entries, then any one entry can be
deleted with no further action. If the leaf contains the minimum number, then we
first look at the two leaves (or, in the case of a node on the outside, one leaf) that
are immediately adjacent to each other and are children of the same node. If one of
these has more than the minimum number of entries, then one of them can be moved
into the parent node, and the entry from the parent node is moved into the leaf
where the deletion is occurring.
If the adjacent leaf has only the minimum number of entries, then the two leaves
and the median entry from the parent can all be combined as one leaf, which will
contain no more than the maximum number of entries allowed. If this step leaves
the parent node with too few entries, then the process propagates upward.
In the limiting case, the last entry is removed from the root, and then the height of
the tree decreases. If the key to be deleted is present in a non-leaf node, and its
immediate successor or predecessor is in a leaf, then the immediate predecessor or
successor is promoted into the position occupied by the deleted key, and the key is
deleted from the leaf.
Example of deletion from a B-tree:
1. Delete h, r:

2. Delete p:

Fig. 2.8 Deletion from a B-Tree


2.6 CHECK YOUR PROGRESS

1. Fill in the blanks:


a. If a node is at level l, its children are at level ____.
b. A node of a _____ can have at the most two branches.
c. A strictly binary tree with n leaves always contains _____ nodes.
d. A ________ traversal produces a linear order for the information in a tree.
e. A binary search tree can be traversed in _____, ______ or _______.
f. For a binary tree, every node has at most one ____ and at most one ____.
g. A binary tree has more ____ than actual pointers.
h. When a tree is empty we create____.
i. A node will be called ________ if the longest path in its left subtree is
longer than the longest path in its right subtree.
j. A node will be called ______ if the longest path in both of its subtrees is
equal.
k. B-trees are forced to grow at their _____.
l. While inserting elements in an AVL tree we have to perform _____to
maintain it as height balanced.
m. We have to____the node if before insertion it is full upto capacity.

2.7 SUMMARY

1. Tree is one of the most important non-linear data structures.


2. Tree is used to represent hierarchical relationship between data items.
3. Binary Tree is a tree in which each node has only two children i.e. left child
and right child.
4. Left child of any node is the root of left subtree of that node, similarly right
child of any node is the root of right subtree of that node.
5. A Binary Search Tree T is a binary tree in which all identifiers in the left
subtree of T are less than the identifier in the root node and all identifiers in
the right subtree of T are greater than the identifier in the root.
6. A Binary Search Tree can be traversed in 3 ways:
a. Preorder Traversal in which root is traversed first and then left subtree
and then right subtree.
b. Inorder Traversal in which left subtree is traversed first then ro ot and
then right subtree.
c. Postorder Traversal in which left subtree is traversed first then right
subtree and then root.
7. When deleting any node from Binary Search Tree we replace this node by
either the highest labeled node in left subtree or lowes t labeled node in right
subtree.
8. Height Balance Trees are balanced in terms of height of its left and right
subtrees. As a result of the balanced nature of this type of tree, dynamic
retrievals, insertions and deletions can be performed in less time.
9. A tree is height balanced if the balance factor of each of its nodes lies
between -1 and +1 inclusive, where the balance factor is the difference of
the heights of its left and right subtrees.
10. While inserting elements in an AVL tree we have to perform rotations to
maintain it as height balanced.
11. If balancing factor is +2 then clockwise rotation is performed, if balancing
factor is -2 then anticlockwise rotation is performed.
12. In an M-way Search Tree any node has at the most m children and m-1 keys.
13. A 'B' tree of order m can have at the most m children and m-1 keys, and
every node except the root and the leaves has at least ⌈m/2⌉ children.
14. We have to split the node if before insertion it is full upto its capacity.
15. A 'B' tree grows in upward direction and the level increases only after
splitting.

2.8 KEYWORDS

Tree: A finite set of one or more nodes used to represent data containing a
hierarchical relationship between elements.
Node: The Item of information
Degree: The number of subtrees of a node
Leaf: Nodes that have degree zero
Siblings: Children of same parent
Children: The roots of the subtrees of a node.
Forest: A set of n ≥ 0 disjoint trees.
Binary tree: A finite set of elements that is either empty or is partitioned into three
disjoint subsets.
Complete Binary Tree: A strictly binary tree of depth 'd' all of whose leaves are at
level d.
Binary Search tree: A binary tree with the condition that either it is empty or each
node in the tree contains an identifier and all identifiers in its left subtree are less
than the identifier in its root node, all identifiers in its right subtree are greater than
the identifier in its root node.
Height Balanced Tree: An empty tree is height balanced. A non-empty binary tree
with TL and TR as its left and right subtrees is height balanced if TL and TR are
height balanced and |hL - hR| ≤ 1, where hL and hR are the heights of TL and TR
respectively.
'B' Tree: An M-way search tree that is either empty or is of height ≥ 1 and satisfies
the following properties:
1. The root node has at least 2 children
2. All nodes other than the root node and failure nodes have at least [m/2] children.
3. All failure nodes are at the same level.

2.9 SELF ASSESSMENT TEST

1. What is non-linear data structure and in which cases it is preferable to


represent data through non-linear data structure.
2. What is a Tree?
3. What is a Binary Tree? State different characteristics associated with the
Binary Trees.
4. Write short notes on the following:
a. Strictly Binary Tree
b. Complete Binary Tree
c. Height/Depth of a Tree
5. Write a procedure to find the depth of a Binary Tree.
6. Write a program to create a Binary Tree. Also write a function to insert a node
into a Binary Tree.
7. Write down the recursive functions to traverse a Binary Search Tree in
a. Inorder
b. Preorder
c. Postorder.
8. Write a non-recursive function to traverse a Binary Search Tree in pre-order
manner.
9. Write a recursive "C" function to search for an element in a given Binary
search Tree. Write a non-recursive version for the same.
10. Describe the method for deleting any node from a Binary Search Tree.
11. Write a procedure which deletes all the leaf nodes from Binary Search Tree.
12. Given a Binary Search Tree of names prove that an Inorder Traversal will
always print the names in alphabetical order.
13. How does the use of threads remove all the drawbacks of Binary Search Trees?
14. Write a function to create a Threaded Binary Tree.
15. Insert the following keys in an empty AVL Tree: MARCH, MAY,
NOVEMBER, AUGUST, APRIL, JANUARY, DECEMBER, JULY,

FEBRUARY, JUNE, OCTOBER and SEPTEMBER
16. Insert the following keys into a 'B' Tree of order 3:
a, f, b, k, h, m, e, s, r, e, l, n, u, p.
17. Explain the method for deleting an element from a 'B' Tree. Explain when we
require merging of nodes while deleting with examples.

2.10 ANSWER TO CHECK YOUR PROGRESS

1. Answer to fill in the blanks


a. l+1
b. binary tree
c. 2n-1
d. full
e. preorder, Inorder, postorder.
f. Left most child, right sibling
g. Null links
h. Head node
i. left heavy
j. balanced
k. root
l. rotation
m. split.

2.11 REFERENCES/SUGGESTED READING

1. Birkhanser-Boston, An Introduction to Data Structures and Algorithms,


Springer-New York
2. Seymour Lipschutz, “Data Structure”, Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, “Fundamentals of Data Structure in C”,
Orient Longman.
4. Trembley, J.P. and Sorenson, P.G., "An Introduction to Data Structures with
Applications", McGraw-Hill International Student Edition, New York.
5. Yedidyan Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum, “Data
Structures using C”, Prentice Hall of India Pvt. Ltd., New Delhi.
6. Mark Allen Weiss, "Data Structures and Algorithm Analysis in C", Addison-
Wesley (An Imprint of Pearson Education), Mexico City; Prentice-Hall of
India Pvt. Ltd., New Delhi.

Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 3

HASHING
REVISED /UPDATED SLM BY KAPILA DEVI

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Hash Tables
3.4 Hashing Function
3.4.1 Mid Square
3.4.2 Division
3.4.3 Folding
3.4.4 Digit Analysis
3.5 Terms Associated with Hash Tables
3.5.1 Identifier Density
3.5.2 Loading Factor
3.5.3 Synonyms
3.5.4 Collision
3.6 Bucket Overflow
3.7 Handling of bucket Overflows
3.8 Hash Indices
3.9 Check your Progress
3.10 Summary
3.11 Keywords

3.12 Self-Assessment Test
3.13 Answers to Check Your Progress
3.14 References/Suggested Reading

3.1 LEARNING OBJECTIVES

After studying this unit, you should be able to:


1. Define hash tables and the terms associated with it.
2. Describe hashing function
3. Describe bucket overflow and its handling
4. Describe open and closed hashing
5. Describe hash indices

3.2 INTRODUCTION

In this unit we will study hashing. One disadvantage of sequential file
organization is that we must either access an index structure to locate data
or use binary search, both of which result in more I/O operations. File
organization based on the technique of hashing allows us to avoid accessing
an index structure.
 Hashing is the process of mapping a large set of data items to a smaller
table with the help of a hashing function.
 Hashing is also known as Hashing Algorithm or Message Digest
Function.
 It is a technique to convert a range of key values into a range of indexes
of an array.
 It is used to facilitate the next level searching method when compared
with the linear or binary search.
 Hashing allows us to update and retrieve any data entry in constant time,
O(1).

 Constant time O(1) means the operation does not depend on the size of
the data.
 Hashing is used with a database to enable items to be retrieved more
quickly.
 It is used in the encryption and decryption of digital signatures.

3.3 HASH TABLES

In tree tables, the search for an identifier key is carried out via a sequence of
comparisons. Hashing differs from this in that the address or location of an
identifier, X, is obtained by computing some arithmetic function f of X. f(X)
gives the address of X in the table. This address will be referred to as the hash
or home address of X.
The memory available to maintain the symbol table is assumed to be
sequential. This memory is referred to as the hash table, HT. The hash table is
partitioned into b buckets, HT(0), ... HT(b -1). Each bucket is capable of
holding records. Thus, a bucket is said to consist of s slots, each slot being
large enough to hold 1 record.
Usually s = 1 and each bucket can hold exactly 1 record. A hashing function,
f(x), is used to perform an identifier transformation on X. f(x) maps the set of
possible identifiers onto the integers 0 through b-1.
We use the term bucket to denote a unit of storage that can store one or more
records. A bucket is typically a disk block but could be chosen to be smaller or
larger than a disk block.

3.4 HASHING FUNCTION

A hashing function f transforms an identifier X into a bucket address in the
hash table. The desired properties of such a function are that it should be
easily computable and that it should minimize the number of collisions.
If X is an identifier chosen at random from the identifier space and the
probability that f(x) = i is 1/b for all buckets i, then a random X has an equal
chance of hashing into any of the b buckets. A hash function satisfying this
property will be termed a "Uniform Hash Function".
The worst possible hash function maps all search key values to the same
bucket. This function is undesirable because all the records have to be kept in
the same bucket.
An ideal hash function distributes the stored keys uniformly across all the
buckets so that every bucket has the same numb er of records. We want to
choose a hash function that assigns search key values to buckets such that the
following holds:
 The distribution is uniform, that is, each bucket is assigned the same
number of search key values from the set of all possible search key values.
 The distribution is random, that is, in the average case, each bucket will
have nearly the same number of values assigned to it, regardless of the
actual distribution of search key values.

CHARACTERISTICS OF GOOD HASHING FUNCTION


1. The hash function should generate different hash values for the similar
string.
2. The hash function is easy to understand and simple to compute.
3. The hash function should produce the keys which will get distributed,
uniformly over an array.
4. A number of collisions should be less while placing the data in the hash
table.
5. The hash function is a perfect hash function when it uses all the input
data.
Example:
Suppose we have a set of strings {“abc”, “def”, “ghi”} that we’d like to store
in a table. Our objective here is to find or update them quickly from a table,
actually in O(1). We are not concerned about ordering them or maintaining
any order at all. Let us think of a simple schema to do this. Suppose we assign

“a” = 1, “b” =2, … etc to all alphabetical characters. We can then simply
compute a number for each of the strings by using the sum of the characters as
follows.
“abc” = 1 + 2 + 3 = 6, “def” = 4 + 5 + 6 = 15, “ghi” = 7 + 8 + 9 = 24. If we
assume that we have a table of size 5 to store these strings, we can compute
the location of each string by taking the sum mod 5. So we will store “abc” in
6 mod 5 = 1, “def” in 15 mod 5 = 0, and “ghi” in 24 mod 5 = 4, i.e. in
locations 1, 0 and 4 as follows.
Index:   0     1     2     3     4
Entry:   def   abc   -     -     ghi
Now the idea is that if we are given a string, we can immediately compute the
location using a simple hash function: the sum of the characters mod the table
size. Using this hash value, we can search for the string. This seems to be a
great way to store a dictionary. Therefore, the idea of hashing seems to be a
great way to store pairs of (key, value) in a table.
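The scheme just described can be checked in a few lines of Python (using the same character values a = 1, b = 2, ... as above):

```python
def naive_hash(s, table_size=5):
    # sum of character values (a=1, b=2, ...) mod the table size
    return sum(ord(c) - ord('a') + 1 for c in s) % table_size

table = [None] * 5
for word in ("abc", "def", "ghi"):
    table[naive_hash(word)] = word

print(table)   # ['def', 'abc', None, None, 'ghi']
```

Each string lands in the slot computed above: "abc" in 1, "def" in 0 and "ghi" in 4.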
Problem with Hashing
The method discussed above seems too good to be true as we begin to think
more about the hash function. First of all, the hash function we used, that is,
the sum of the letters, is a bad one.
letters, “abc”, “bac” etc in the set, we will end up with the same value for the
sum and hence the key. In this case, the strings would hash into the same
location, creating what we call a “collision”. This is obviously not a good
thing. Secondly, we need to find a good table size, preferably a prime number,
so that even if the sums are different, collisions can be avoided when we take
the mod of the sum to find the location. So we ask two questions.
Question 1: How do we pick a good hash function?
Question 2: How do we deal with collisions?
The problem of storing and retrieving data in O(1) time comes down to
answering the above questions. Picking a “good” hash function is key to
successfully implementing a hash table. What we mean by “good” is that the
function must be easy to compute and should avoid collisions as much as
possible. If the function is hard to compute, then we lose the advantage gained
for lookups in O(1). Even if we pick a very good hash function, we still will
have to deal with “some” collisions.
Finding a good hash Function
It is difficult to find a “perfect” hash function, that is, a function that has no
collisions. But we can do better by choosing hash functions as follows. Suppose
we need to store a dictionary in a hash table. A dictionary is a set of strings,
and we can define a hash function as follows. Assume that S is a string of
length n:
S = S1 S2 ... Sn
H(S) = H(“S1 S2 ... Sn”) = S1 + p·S2 + p²·S3 + ... + p^(n-1)·Sn
where p is a prime number. Obviously, each string will lead to a unique
number, but when we take the number mod the table size, it is still possible
that we may have collisions, though likely fewer than with a naïve hash
function like the sum of the characters.
Although the above function minimizes collisions, we still have to deal
with the fact that the function must be easy to compute. Rather than directly
computing the above function, we can reduce the number of computations by
rearranging the terms as follows.
H(S) = S1 + p·(S2 + p·(S3 + ... + p·(Sn-1 + p·Sn)...))
This rearrangement of terms allows us to compute a good hash value quickly.
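A short Python sketch of this polynomial hash, evaluated in the rearranged (Horner) form; the character values a = 1, b = 2, ... and the choices p = 31 and table size 101 are assumptions made for illustration:

```python
def poly_hash(s, p=31, table_size=101):
    # H(S) = S1 + p*S2 + p^2*S3 + ... + p^(n-1)*Sn, evaluated in Horner form:
    # process the string from the last character (innermost parenthesis) outward.
    h = 0
    for ch in reversed(s):
        h = h * p + (ord(ch) - ord('a') + 1)
    return h % table_size
```

Unlike the plain character sum, this function distinguishes permutations such as "abc" and "bac".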
Several kinds of uniform hash functions are in use. We shall describe a few of
them:

3.4.1 MID SQUARE


The "middle of square" function, fm, is computed by squaring the identifier
and then using an appropriate number of bits from the middle of the square to
obtain the bucket address.
Since the middle bits of the square will usually depend upon all of the
characters in the identifier, it is expected that different identifiers will result
in different hash addresses with high probability, even when some of the
characters are the same.
The number of bits to be used to obtain the bucket address depends on the
table size. If r bits are used, the range of values is 2^r, so the size of the hash
table is chosen to be a power of 2 when this kind of scheme is used.

S.No.   Identifier   Internal Representation (x)   x²
1       X            x                             x²
2       A            1                             1
3       A1           134                           20420
4       A2           135                           20711
5       A3           136                           21204
6       A4           137                           21501
7       A9           144                           23420
8       B            2                             4
9       C            3                             11
10      G            7                             61
11      DMAX         4150130                       21526443617100

Table 3.1

Internal representation of x and x² in octal notation. X is input right-justified,
zero-filled (six bits or 2 octal digits per character).
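A Python sketch of the mid-square function; exactly which middle bits are taken varies between implementations, so the centring rule below is an assumed choice:

```python
def mid_square_hash(key, r):
    """Bucket address = r bits from the middle of key**2 (table size 2**r)."""
    square = key * key
    # discard roughly half of the surplus bits on the low-order side,
    # then mask off the top so that r middle bits remain
    drop = max((square.bit_length() - r) // 2, 0)
    return (square >> drop) & ((1 << r) - 1)
```

For example, with key 134 (decimal) and a table of 2^4 = 16 buckets, 134² = 17956 and the four middle bits give bucket 1.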

3.4.2 DIVISION
Another simple choice for a hash function is obtained by using the modulo
(mod) operator. The identifier X is divided by some number M and the
remainder is used as the hash address of X.
fD(x) = X mod M.
This gives a bucket address in the range 0 to (M-1), and so the hash table is at
least of size b = M. M should be a prime number such that M does not divide
r^k ± a, where r is the radix of the character set and k and a are very small
numbers.

Fig. 3.2 Identifier A1 right and left justified and zero filled (6 bits per
character)
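A Python sketch of the division method; encoding the identifier with 6-bit character codes (radix 64) mirrors the octal examples above, and the table size 97 is simply an arbitrary prime chosen for illustration:

```python
RADIX = 64                        # 6 bits per character, as in the text's examples

def encode(identifier):
    # interpret the identifier as a number written in radix 64,
    # keeping the low 6 bits of each character code
    value = 0
    for ch in identifier:
        value = value * RADIX + (ord(ch) & 0x3F)
    return value

def division_hash(identifier, m=97):      # m: a prime table size
    return encode(identifier) % m
```

For example, "A1" encodes to 1·64 + 49 = 113, and 113 mod 97 = 16, so "A1" hashes to bucket 16.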

3.4.3 FOLDING
In this method identifier X is partitioned into several parts, all but the last
being of the same length. These parts are then added together to obtain the
hash address for X. There are two ways to carry out this addition.
In the first, all but the last part are shifted so that the least significant bit of
each part lines up with the corresponding bit of the last part. The different
parts are then added together to obtain f(x).

Fig. 3.3 Shift Folding


This method is known as shift folding. The other method of adding the parts is
folding at the boundaries. In this method the identifier is folded at the part
boundaries and digits falling into the same position are added together.

Fig. 3.4 Folding at Boundaries (Pi^r = reverse of Pi)
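A Python sketch of shift folding on the decimal digits of a key; the partition length of 3 digits is an assumed choice:

```python
def shift_fold(key, part_len=3):
    digits = str(key)
    # split the digit string into parts of part_len digits, left to right,
    # and add the parts together (the last part may be shorter)
    return sum(int(digits[i:i + part_len])
               for i in range(0, len(digits), part_len))
```

For example, the key 12320324111220 splits into the parts 123, 203, 241, 112 and 20, whose sum 699 is the hash address. Folding at the boundaries would additionally reverse the digits of alternate parts before adding.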

3.4.4 DIGIT ANALYSIS


This method is particularly useful in the case of static files where all the
identifiers in the table are known in advance. Each identifier X is interpreted
as a number using some radix r. The same radix is used for all the identifiers
in the table.
Using this radix, the digits of each identifier are examined. Digits having the
most skewed distributions are deleted. Enough digits are deleted so that the
number of digits left is small enough to give an address in the range of the
hash table.

3.5 TERMS ASSOCIATED WITH HASH TABLES

3.5.1 IDENTIFIER DENSITY


The ratio n/T is called the identifier density, where
n = number of identifiers
T = total number of possible identifiers.
The number of identifiers, n, in use is usually several orders of magnitude less
than the total number of possible identifiers, T.
The number of buckets b, in the hash table are also much less than T.

3.5.2 LOADING FACTOR


The loading factor α is equal to n/(s·b), where
s = number of slots in a bucket
b = total number of buckets
The number of buckets b is also very less than the total number of possible
identifiers, T.

3.5.3 SYNONYMS
The hash function f almost always maps several different identifiers into the
same bucket. Two identifiers I1 and I2 are said to be synonyms with respect to
f if f(I1) = f(I2). Distinct synonyms are entered into the same bucket so long as
all the s slots in that bucket have not been used.

3.5.4 COLLISION
A collision is said to occur when two non-identical identifiers are hashed into
the same bucket. When the bucket size is 1, collision and overflow occur
simultaneously.

3.6 BUCKET OVERFLOW

So far we have assumed that, when a record is inserted, the bucket to which it
is mapped has space to store the record. If the bucket does not have enough
space, a bucket overflow is said to have occurred. Bucket overflow can occur
for several reasons:
INSUFFICIENT BUCKETS
The number of buckets, which we denote by nb, must be chosen such that
nb > nr/fr, where nr denotes the total number of records that will be stored and
fr denotes the number of records that will fit in a bucket.
SKEW
Some buckets are assigned more records than others, so a bucket may overflow
even when other buckets still have space. This situation is called bucket skew.

3.7 HANDLING OF BUCKET OVERFLOWS

When situation of overflow occurs it should be resolved and the records must
be placed somewhere else in the table.
We shall describe here some of the methods used for overflow resolution.
OVERFLOW CHAINING
Despite allocation of a few more buckets than required, bucket overflow can
still occur. We handle bucket overflow using overflow buckets. If a record has
to be inserted into a bucket b, and b is already full, an overflow bucket is
provided for b, and the record is inserted into the overflow bucket. If the
overflow bucket is also full, another overflow bucket is provided, and so on.
All the overflow buckets of a given bucket are chained together in a linked
list. Overflow handling using such a linked list is called overflow chaining.
The hash table can then be considered as an array of pointers to the records,
that is, an array of linked lists.

Fig. 3.5 A Chained Hash Table
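A minimal chained hash table in Python, offered as an illustrative sketch; Python lists stand in for the linked lists of Fig 3.5, and the built-in hash() plays the role of the hash function:

```python
class ChainedHashTable:
    def __init__(self, buckets=7):
        self.slots = [[] for _ in range(buckets)]    # one chain per bucket

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for pair in chain:
            if pair[0] == key:          # key already present: update in place
                pair[1] = value
                return
        chain.append([key, value])      # a collision just lengthens the chain

    def search(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]            # as easy as deleting from a linked list
                return True
        return False
```

Note that more records than buckets can be stored: the chains simply grow, which is the overflow advantage described above.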

ADVANTAGES OF CHAINING

 Space Saving: Since the hash table is a contiguous array, enough space
must be set aside at compilation time to avoid overflow. On the other
hand, if the hash table contains only the pointers to the records, then the
size of the hash table may be reduced.
 Collision Resolution: Chaining allows simple and efficient collision
handling. To handle collision only a link field is to be added.
 Overflow: It is no longer necessary that the size of the hash table exceeds
the number of records.
If there are more records than entries in the table, it simply means that some
of the linked lists contain more than one record.
DELETION
Deletion proceeds in exactly the same way as deletion from a simple linked
list. So in a chained hash table deletion becomes a quick and easy task.
OPEN HASHING
The form of hash structure that we have just described is sometimes referred
to as closed hashing. Under an alternate approach, called open hashing, the
set of buckets is fixed and there are no overflow chains, instead if a bucket is
full, records are inserted in some other bucket in the initial set of buckets B.
Two policies are used when assigning a new bucket to a record.
LINEAR PROBING
One policy is to use the next bucket (in cyclic order) that has space. This
policy is called "linear probing".
It starts with a hash address (the location where the collision occurred) and
does a sequential search through a table for the desired key or an empty
location. Hence, this method searches in straight line, and it is therefore called
linear probing.
The table should be considered circular, so that when the last location is
reached, the search proceeds to the first location of the table.
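A minimal sketch of linear probing in C, treating the table as circular as described above. The table size, the sentinel value -1 for an empty slot, and the modulo hash function are assumptions for illustration:

```c
#define SIZE  7
#define EMPTY (-1)

int ptable[SIZE];                      /* EMPTY marks a free slot */

void lp_init(void)
{
    int i;
    for (i = 0; i < SIZE; i++)
        ptable[i] = EMPTY;
}

/* Probe sequentially, and circularly, from the home address.
   Returns the slot used, or -1 if the table is full. */
int lp_insert(int key)
{
    int h = key % SIZE, i;
    for (i = 0; i < SIZE; i++) {
        int pos = (h + i) % SIZE;      /* wrap to the first location */
        if (ptable[pos] == EMPTY) {
            ptable[pos] = key;
            return pos;
        }
    }
    return -1;
}

/* Return the slot holding key, or -1 if it is absent. */
int lp_search(int key)
{
    int h = key % SIZE, i;
    for (i = 0; i < SIZE; i++) {
        int pos = (h + i) % SIZE;
        if (ptable[pos] == key)
            return pos;
        if (ptable[pos] == EMPTY)      /* an empty slot ends the probe */
            return -1;
    }
    return -1;
}
```

For example, with SIZE 7 the keys 9 and 16 both hash to address 2; 9 takes slot 2 and 16 is probed forward into slot 3.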
REHASHING
Another policy is to compute a second hash function for the new record. If
the resulting position is also filled, then some further method is needed to get
another position, and so on.
CLUSTERING
The major drawback of linear probing is that as the table becomes about half
full, there is a tendency towards clustering, that is records start to appear in
long strings of adjacent positions with gaps between the strings.
QUADRATIC PROBING
If there is a collision at hash address h, this method probes the table at
locations h+1, h+4, h+9, ..., that is, at locations h+i² for i = 1, 2, ...; the
increment function is i². Quadratic probing reduces clustering, but it will not
probe all locations in the table.
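The quadratic probe sequence above can be sketched as a one-line C function; the prime table size 11 is an assumed example, not from the text:

```c
#define QSIZE 11   /* assumed prime table size for illustration */

/* i-th quadratic probe location for home address h: h, h+1, h+4, h+9, ... */
int quad_probe(int h, int i)
{
    return (h + i * i) % QSIZE;   /* wrap around the table */
}
```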
KEY DEPENDENT INCREMENTS
In key dependent increments we assume a function which can be part of the
key itself.
Example: We could truncate the key to a single character and use its code as
the increment.

Fig. 3.6

3.8 HASH INDICES

Hashing can also be used for index structures, besides file organization. A
hash index organizes search keys, with their associated pointers, into a
hash file structure.
We apply a hash function on a search key to identify a memory block (bucket)
and store that key, and its associated pointers, in that bucket. Figure 14.5 shows a
secondary hash index on the EMPLOYEE file, for the search key emp_no. The
hash function computes the modulo 7 of the sum of the digits of the emp_no.
The hash index has seven buckets, each of size 2. In this example, emp_no is a
primary key.
Since the file itself is organized using hashing, there is no need for a separate
hash index on it. Hence, a hash index is never needed as a primary index
structure; it is used only as a secondary structure. The EMPLOYEE file has
emp_no (primary key) and emp_name fields.
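The index hash function described above (modulo 7 of the sum of the digits of emp_no) can be written directly in C; the function name is a hypothetical helper:

```c
/* Bucket number for the hash index: modulo 7 of the sum of the digits
   of emp_no, as described in the text. */
int bucket_of(int emp_no)
{
    int sum = 0;
    while (emp_no > 0) {
        sum += emp_no % 10;   /* add the last digit */
        emp_no /= 10;
    }
    return sum % 7;
}
```

For example, emp_no 125 has digit sum 8 and therefore falls in bucket 1.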

3.9 CHECK YOUR PROGRESS

1. What is a hash table?


a. A structure that maps values to keys
b. A structure that maps keys to values
c. A structure used for storage
d. A structure used to implement stack and queue
2. Which of the following is not a technique to avoid a collision?
a. Make the hash function appear random
b. Use the chaining method
c. Use uniform hashing
d. Increasing hash table size
3. Fill in the blanks:
a. A hashing function transforms an identifier into a —————— in
the hash table.
b. ————— is useful in the case of static files.
c. A —————— is said to occur, when two non-identical identifiers
are hashed into the same bucket.
d. Bucket overflow can occur due to ————— to —————.
e. Search keys are organized by ————— with their associated
pointers, into a hash file structure.

3.10 SUMMARY

1. Sequential file organization requires an index structure to locate data.


2. File organization based on hashing, by contrast allows us to find the
address of a data item directly by computing a function on the search key
value.
3. Static hashing suffers with collision and overflow.
4. To resolve overflow condition, we use several methods like open hashing
and closed hashing.

3.11 KEYWORDS

Hash Table: The memory available to maintain the symbol table is assumed to
be sequential. This memory is referred to as the hash table.
Collision: A collision is said to occur when two non-identical identifiers
are hashed into the same bucket.
Bucket Skew: Situation where some buckets are assigned more records than
others, so a bucket may overflow even when other buckets still have space.

3.12 SELF ASSESSMENT TEST

1. What is hashing? How is it preferred over sequential accessing?


2. What is a hashing function and what are its desired characteristics?
3. Differentiate between closed and open hashing. Discuss the relative
merits of each technique in database applications.
4. Define the following terms associated with hashing.
a. Identifier Density
b. Loading Factor
c. Synonyms
d. Collision
e. Overflow
5. State and explain various hashing functions with examples.
6. What are the causes of bucket overflow in a hash file organization?
7. Define various techniques used for overflow resolution.
8. What is overflow chaining? What are the demerits of this technique?
9. State and define overflow resolution techniques under open hashing.

3.13 ANSWERS TO CHECK YOUR PROGRESS

1. b
2. d
3. Answers to fill in the blanks
a. bucket address
b. Digit Analysis
c. Collision
d. insufficient buckets, bucket skew
e. hash index

3.14 REFERENCES / SUGGESTED READINGS


1. Birkhäuser Boston, An Introduction to Data Structures and Algorithms,
Springer-New York
2. Seymour Lipschutz, “Data Structure”, Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, “Fundamentals of Data Structure in
C”, Orient Longman.
4. Trembley, J.P. And Sorenson P.G., “An Introduction to Data Structures
with Applications”, McGraw- Hill International Student Edition, New
York.
5. Yedidyan Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum,
“Data Structures using C”, Prentice Hall of India Pvt. Ltd., New Delhi.

6. Mark Allen Weiss, “Data structures and Algorithm Analysis in C”,
Addison-Wesley (An Imprint of Pearson Education), Mexico City,
Prentice-Hall of India Pvt. Ltd., New Delhi.

Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 4

GRAPHS
REVISED /UPDATED SLM BY KAPILA DEVI

STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Graph
4.4 Representation of Graphs
4.4.1 Adjacency Matrix
4.4.2 Adjacency List
4.4.3 Adjacency Multilist
4.5 Traversal of a Graph
4.5.1 Depth First Search
4.5.2 Breadth First Search
4.6 Weighted Graphs
4.7 Spanning Trees
4.8 Coding for Graph Traversal
4.9 Check your Progress
4.10 Summary
4.11 Keywords
4.12 Self-Assessment Test
4.13 Answers to Check your Progress
4.14 References/Suggested Reading

4.1 LEARNING OBJECTIVES

After studying this unit, you should be able to:


1. Define graph and its various types
2. Describe various ways to represent graphs
3. Describe Warshall's algorithm
4. Describe the minimal algorithm
5. Describe various ways used for traversal of graphs
6. Define weighted graphs
7. Describe Dijkstra's shortest path algorithm
8. Define minimal spanning trees

4.2 INTRODUCTION
In the last few units we studied about trees. In this unit we will study another very
important non-linear data structure i.e. graphs. We will also study different
representations and implementation of graphs. Graph is a superset of trees. When
certain restrictions and limitations are imposed on a Graph then it becomes a tree.
Hence, every tree is a graph but every graph is not necessarily a tree.

4.3 GRAPH

A graph G consists of a non-empty set V called the set of nodes (points, vertices) of
the graph, a set E which is the set of edges of the graph, and a mapping from the set
of edges E to pairs of elements of V.
Any two nodes which are connected by an edge in a graph are called "adjacent
nodes".
In a graph G(V,E) an edge which is directed from one node to another is called a
"directed edge", while an edge which has no specific direction is called an
"undirected edge". A graph in which every edge is directed is called a "directed
graph" or a "digraph". A graph in which every edge is undirected is called an
"undirected graph".
If some of the edges are directed and some are undirected in a graph then the graph
is called a "mixed graph".
Let (V,E) be a graph and x∈E be a directed edge associated with the ordered pair of
nodes (u,v); then the edge x is said to be "initiating" or "originating" in the node u
and "terminating" or "ending" in the node v. The nodes u and v are also called the
"initial" and "terminal" nodes of the edge x. An edge x∈E which joins the nodes u and
v, whether it be directed or undirected, is said to be "incident" to the nodes u and v.
An edge of a graph which joins a node to itself is called a "loop". In some directed
as well as undirected graphs we may have certain pairs of nodes joined by more
than one edge. Such edges are called "parallel". Any graph which contains some
parallel edges is called a "multigraph". If there is not more than one edge between any
pair of nodes, then such a graph is called a "simple graph". A graph in which
weights are assigned to every edge is called a "weighted graph". In a graph, a
node which is not adjacent to any other node is called an "isolated node". A graph
containing only isolated nodes is called a "null graph". In a directed graph, for any
node v the number of edges which have v as initial node is called the "outdegree" of
the node v. The number of edges which have v as their terminal node is called the
"indegree" of v, and the sum of the outdegree and indegree of a node v is called its
"total degree".
In the case of undirected graph the total degree of v is equal to the number of edges
incident on v. The total degree of a loop is 2 and that of an isolated node is 0.
Any sequence of edges of a digraph, such that the terminal node of each edge is the
initial node of the edge (if any) appearing next in the sequence, defines a path of the
graph. A path is said to traverse through the nodes appearing in the sequence,
originating at the initial node of the first edge and ending at the terminal node of the
last edge in the sequence. The number of edges in the sequence of a path is called
the "length" of the path.
A path of a digraph in which the edges are distinct is called a simple path (edge
simple). A path in which all the nodes through which traversing is done are distinct
is called an "elementary path (node simple)".
A path which originates and ends in the same node is called a "cycle (circuit)". A
cycle is called "elementary" if it does not traverse through any node more than
once.

Fig. 4.1
A graph G consists of two sets V and E. V is a finite non-empty set of vertices. E
is a set of pairs of vertices; these pairs are called edges. V(G) and E(G) will
represent the sets of vertices and edges of graph G. In an undirected graph the pair
of vertices representing any edge is unordered. Thus, the pairs (V1, V2) and (V2,
V1) represent the same edge.

In a directed graph each edge is represented by a directed pair <V1, V2>; V1 is the
tail and V2 the head of the edge. Thus <V2, V1> and <V1, V2> represent two different
edges.

Fig. 4.2 Three Sample Graph


The graph G1 and G2 are undirected. G3 is a directed graph.
 V(G1) = {1, 2, 3, 4}; E(G1) = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
 V(G2) = {1, 2, 3, 4, 5, 6, 7}; E(G2) = {(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)}
 V(G3) = {1, 2, 3}; E(G3) = {<1, 2>, <2, 1>, <2, 3>}
The graph G2 is also a tree while the graphs G1 and G3 are not. A graph may not
have multiple occurrences of the same edge; when this restriction is removed from a
graph, the resulting data object is referred to as a multigraph, which is not a graph.

Fig. 4.3 Multigraph


The number of distinct unordered pairs (Vi, Vj) with Vi ≠ Vj in a graph with n
vertices is n(n-1)/2. This is the maximum number of edges in any n-vertex
undirected graph. An n-vertex undirected graph with exactly n(n-1)/2 edges is said
to be complete. In the figure, G1 is the complete graph with 4 vertices while G2 and
G3 are not complete graphs.
In the case of a directed graph on n vertices the maximum number of edges is n(n-1).

A subgraph of G is a graph G' such that V(G') ⊆ V(G) and E(G') ⊆ E(G). A path from
vertex Vp to vertex Vq in graph G is a sequence of vertices Vp, Vi1, Vi2,
Vi3, ..., Vin, Vq such that (Vp, Vi1), (Vi1, Vi2), ..., (Vin, Vq) are edges of E(G). If G is
directed then the path consists of the edges <Vp, Vi1>, <Vi1, Vi2>, ..., <Vin, Vq> in
E(G). The length of a path is the number of edges on it. A simple path is a path in
which all vertices except possibly the first and the last are distinct. e.g., paths 1, 2,
4, 3 and 1, 3, 4, 2 are both of length 3 in G1. The first is a simple path while the
other is not.
A cycle is a simple path in which the first and last vertices are the same. In an
undirected graph, G, two vertices V1 and V2 are said to be connected if there is a
path in G from V1 to V2. If the graph is undirected then there must also be a path
from V2 to V1.
An undirected graph is said to be connected if for every pair of distinct vertices Vi,
Vj in V(G) there is a path from Vi to Vj in G. A connected component or simply a
component of an undirected graph is a maximal connected subgraph.
e.g.,

Fig. 4.4 A Graph With Two Connected Components


The graph G given above has two components H1 and H2. A tree is a connected
acyclic graph. A directed graph G is said to be strongly connected if for every pair
of distinct vertices Vi, Vj in V(G) there is a directed path from Vi to Vj and also
from Vj to Vi. A strongly connected component is a maximal subgraph that is
strongly connected.
e.g.,

Fig. 4.5 Strongly Connected Component


The degree of a vertex is the number of edges incident to that vertex. In case G is a
directed graph, the in-degree of vertex V is defined to be the number of edges for which
V is the head. The out-degree is defined to be the number of edges for which V is the
tail. Directed graphs are also known as digraphs.

4.4 REPRESENTATION OF GRAPHS

Three most commonly used representations are:


1. Adjacency matrices
2. Adjacency lists
3. Adjacency multilist

4.4.1 ADJACENCY MATRIX


Let G = (V, E) be a graph with n vertices, n ≥ 1. The adjacency matrix of G is a 2-
dimensional n×n array, say A, with the property that A(i,j) = 1 if the edge (Vi, Vj)
for a directed graph G is in E(G). The adjacency matrix for an undirected graph is
symmetric, as the edge (Vi, Vj) is in E(G) if and only if the edge (Vj, Vi) is also in E(G).
The adjacency matrix for a directed graph need not be symmetric.
e.g.:

Fig. 4.6 Adjacency Matrix Representation of Graph
A simple digraph which doesn't have any cycle is called "acyclic". Naturally such
graphs cannot have any loop.
Example:
Let V = {v1, v2, v3}
V×V = {(v1,v2), (v1,v3), (v2,v1), (v2,v3), (v3,v1), (v3,v2), (v1,v1), (v2,v2), (v3,v3)}
Consider a directed graph on these three vertices whose edges give the matrix

        v1  v2  v3
    v1   0   1   0
A = v2   0   0   0      (direction read from row to column)
    v3   1   1   1

This matrix is also called the "Adjacency Matrix". For an undirected graph the
matrix is symmetric, for example:

    0 1 1
    1 0 1
    1 1 1
For a given digraph G (V,E) an Adjacency Matrix depends upon the ordering of the
elements of V. For different orderings of the elements of V we get different
adjacency matrices of the same graph G.
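As a sketch, an adjacency matrix can be built from a list of directed edges in C. The vertex bound MAXV, the edge-list format and the helper name are assumptions for illustration; vertices are numbered 1..n as in the text's examples:

```c
#define MAXV 10   /* assumed maximum number of vertices */

/* Build an adjacency matrix from a list of directed edges. */
void build_adjacency(int a[MAXV][MAXV], int n,
                     int tail[], int head[], int nedges)
{
    int i, j, e;
    for (i = 1; i <= n; i++)          /* clear the matrix first */
        for (j = 1; j <= n; j++)
            a[i][j] = 0;
    for (e = 0; e < nedges; e++)
        a[tail[e]][head[e]] = 1;      /* direction: row -> column */
}
```

Run on the edges of G3, <1,2>, <2,1> and <2,3>, the resulting matrix has 1 at (1,2), (2,1) and (2,3) and 0 everywhere else.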

Path Matrix
We can extend the idea of matrix representation to multigraphs and weighted graphs.
For simple undirected graphs such an extension simply gives a symmetric adjacency
matrix. In the case of a multigraph or a weighted graph we write a_ij = w_ij, where w_ij
denotes the weight of the edge (Vi, Vj). If (Vi, Vj) ∉ E then we write w_ij = 0.
For a null graph, the adjacency matrix is a null matrix. If there are loops at each
node but no other edges in the graph, then the adjacency matrix is the identity or
unit matrix.
Let us denote the elements of A² by a_ij^(2). Consider

         V1  V2  V3  V4
    V1    0   1   0   1
A = V2    1   0   0   0
    V3    1   1   0   1
    V4    0   1   0   0

a_ij^(2) = Σ (k=1 to n) a_ik · a_kj

For any fixed k, a_ik · a_kj = 1 if both a_ik and a_kj equal 1, i.e., if (Vi, Vk) and
(Vk, Vj) are edges of the graph.
The matrices A², A³ and A⁴ for the graph represented by the above matrix are:

          V1  V2  V3  V4             V1  V2  V3  V4
     V1    1   1   0   0        V1    1   1   0   1
A² = V2    0   1   0   1   A³ = V2    0   1   0   1
     V3    1   2   0   1        V3    2   2   0   1
     V4    0   0   0   0        V4    1   1   0   0

          V1  V2  V3  V4
     V1    1   2   0   1
A⁴ = V2    1   1   0   1
     V3    2   3   0   2
     V4    1   1   0   0
(The superscripts (1), (2), (3), ... can be considered as the length of the path.)
The matrix representing the paths of length less than or equal to n which exist from
Vi to Vj is

B_n = A + A² + ... + Aⁿ

If we write

d_ij = ∨ (k=1 to n) (a_ik ∧ b_kj)   for i, j = 1, 2, ..., n

where ∨ = union (OR) and ∧ = AND. For example:

d_11 = (a_11 ∧ b_11) ∨ (a_12 ∧ b_21) ∨ ... ∨ (a_1n ∧ b_n1)

If for any k the kth element in the row of A and the kth element in the column of
B are both 1, then d_ij = 1; otherwise d_ij = 0. Let us write A ∧ A = A^(2) and
A ∧ A^(r-1) = A^(r) for any r = 2, 3, .... The entry in the ith row and jth column of
A^(r) is 1 if there is at least one path of length r from Vi to Vj. The path matrix P
is given by:

P = A ∨ A^(2) ∨ A^(3) ∨ ... ∨ A^(n) = ∨ (k=1 to n) A^(k)
Warshall Algorithm
Given the adjacency matrix A, this algorithm produces the path matrix P.
1. [Initialization]
   P ← A
2. [Perform a pass]
   Repeat thru step 4 for k = 1(1)n
3. [Process rows]
   Repeat thru step 4 for i = 1(1)n
4. [Across columns]
   Repeat for j = 1(1)n
   P_ij ← P_ij ∨ (P_ik ∧ P_kj)
5. [Exit]
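As a sketch, Warshall's algorithm maps directly onto three nested loops in C. The fixed matrix size N and 0-based indexing are illustrative assumptions:

```c
#define N 4   /* assumed number of vertices for illustration */

/* Warshall's algorithm: compute the path matrix p from adjacency matrix a.
   p[i][j] ends up 1 iff there is a path from vertex i to vertex j. */
void warshall(int a[N][N], int p[N][N])
{
    int i, j, k;
    for (i = 0; i < N; i++)            /* step 1: P <- A */
        for (j = 0; j < N; j++)
            p[i][j] = a[i][j];
    for (k = 0; k < N; k++)            /* steps 2-4: one pass per k */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                p[i][j] = p[i][j] || (p[i][k] && p[k][j]);
}
```

Run on the 4-vertex matrix A used earlier in this section, every vertex can reach V1, V2 and V4 but none can reach V3, since no edge enters V3.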
Minimal Algorithm
Given the adjacency matrix B in which the zero elements are replaced by infinity
(or by some very large number), the following algorithm produces a matrix C that
shows the minimum length of the paths between the nodes. MIN is a function that
selects the algebraic minimum of its two arguments.
[1] C ← B
[2] Repeat thru step 4 for k = 1(1)n
[3] Repeat thru step 4 for i = 1(1)n
[4] Repeat for j = 1(1)n
    C_ij ← MIN(C_ij, C_ik + C_kj)
[5] Exit
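The same triple loop, with MIN in place of ∨ and ∧, gives the minimal algorithm in C. The value 9999 stands in for infinity and the matrix size is an assumption for illustration:

```c
#define NV  4      /* assumed number of vertices */
#define INF 9999   /* stands in for infinity */

int minval(int x, int y) { return x < y ? x : y; }   /* the MIN of the text */

/* Minimal algorithm: c[i][j] becomes the minimum path length from i to j. */
void minimal(int b[NV][NV], int c[NV][NV])
{
    int i, j, k;
    for (i = 0; i < NV; i++)           /* step [1]: C <- B */
        for (j = 0; j < NV; j++)
            c[i][j] = b[i][j];
    for (k = 0; k < NV; k++)           /* steps [2]-[4] */
        for (i = 0; i < NV; i++)
            for (j = 0; j < NV; j++)
                c[i][j] = minval(c[i][j], c[i][k] + c[k][j]);
}
```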
Example (Warshall)
Take the 3-vertex graph of the earlier example, with adjacency matrix

        0 1 0
P = A = 0 0 0
        1 1 1

I Iteration
 k=1, i=1, j=1,2,3 for P^(1)
P11 = P11 ∨ (P11 ∧ P11) = 0 ∨ (0 ∧ 0) = 0
P12 = P12 ∨ (P11 ∧ P12) = 1 ∨ (0 ∧ 1) = 1
P13 = P13 ∨ (P11 ∧ P13) = 0 ∨ (0 ∧ 0) = 0
 k=1, i=2, j=1,2,3
P21 = P21 ∨ (P21 ∧ P11) = 0
P22 = P22 ∨ (P21 ∧ P12) = 0
P23 = 0
 k=1, i=3, j=1,2,3
P31 = P31 ∨ (P31 ∧ P11) = 1
P32 = P32 ∨ (P31 ∧ P12) = 1
P33 = P33 ∨ (P31 ∧ P13) = 1
Hence,
        0 1 0
P^(1) = 0 0 0
        1 1 1
II Iteration
For k=2, i=1, j=1,2,3 for P^(2):
P11 = P11 ∨ (P12 ∧ P21) = 0
P12 = P12 ∨ (P12 ∧ P22) = 1
P13 = P13 ∨ (P12 ∧ P23) = 0
. .
. .
Similarly, we can calculate the matrix for k=3, and so on.

4.4.2 ADJACENCY LIST


In this representation the n rows of the adjacency matrix are represented as n
linked lists. There is one list for each vertex in G. The nodes in list i represent the
vertices that are adjacent from vertex i.
Each node has two fields, vertex and link. The vertex field contains the indices of
the vertices adjacent to vertex i.
Each list has a head node. The head nodes are sequential, providing easy random
access to the adjacency list of any particular vertex.
e.g., Graph Adjacency List

Fig. 4.7
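The node structure described above, a vertex field and a link field, can be sketched in C. The helper names `add_edge` and `degree` are hypothetical, not from the text:

```c
#include <stdlib.h>

/* A node of an adjacency list: the 'vertex' and 'link' fields of the text. */
typedef struct lnode {
    int vertex;
    struct lnode *link;
} lnode;

/* Add vertex w to the front of vertex v's list. */
void add_edge(lnode *adj[], int v, int w)
{
    lnode *p = (lnode *) malloc(sizeof(lnode));
    p->vertex = w;
    p->link = adj[v];
    adj[v] = p;
}

/* Count the vertices adjacent from v (the out-degree in a digraph). */
int degree(lnode *adj[], int v)
{
    int d = 0;
    lnode *p;
    for (p = adj[v]; p != NULL; p = p->link)
        d++;
    return d;
}
```

Because insertion is at the front, the most recently added neighbour appears first in the list.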

4.4.3 ADJACENCY MULTILIST


In the adjacency list representation of an undirected graph each edge (Vi, Vj) is
represented by two entries, one on the list for Vi and the other on the list for Vj. In
some situations it is necessary to be able to determine the second entry for a
particular edge and mark that edge as already having been examined. This can be
accomplished easily if the adjacency lists are actually maintained as multilists (i.e.,
lists in which nodes may be shared among several lists).
For each edge there will be exactly one node, but this node will be in two lists, i.e.,
the adjacency list for each of the two nodes it is incident to. The node structure will
be:

where M is a one-bit mark field that may be used to indicate whether or not the
edge has been examined.
e.g.,

The lists are:


Vertex 1: N1 → N2 → N3
Vertex 2: N1 → N4 → N5
Vertex 3: N2 → N4 → N6
Vertex 4: N3 → N5 → N6
Fig. 4.8 Adjacency Multilist for the given Graph

4.5 TRAVERSAL OF A GRAPH

 Depth first search


 Breadth first search

4.5.1 DEPTH FIRST SEARCH


The DFS algorithm is analogous to preorder traversal of a tree. Given an undirected
graph G = (V,E) and a vertex V in V(G), we are interested in visiting all vertices in
G that are reachable from V (i.e., all vertices connected to V).
DFS of an undirected graph proceeds as follows:
The start vertex V is visited. Next an unvisited vertex W adjacent to V is selected
and a depth first search from W is initiated. (Adjacent means connected by an edge
to the search node.)
Algorithm DFS(V)
1. [Given a graph G=(V,E) with n vertices and an array VISITED[n] initially set
   to zero, this algorithm visits all vertices reachable from V. G and VISITED
   are global.]
   VISITED[V] ← 1
2. [Check each vertex W adjacent to V and call recursively]
   for each vertex W adjacent to V do
       if VISITED[W] = 0 then call DFS(W)
   end
3. [Exit]

4.5.2 BREADTH FIRST SEARCH


Starting at vertex V and marking it as visited, BFS differs from DFS in that all
unvisited vertices adjacent to V are visited next. Then the unvisited vertices
adjacent to these vertices are visited, and so on.
Algorithm BFS(V)
1. [A BFS of G is carried out beginning at vertex V. All vertices visited are
   marked as VISITED(i)=1. The graph G and the array VISITED are global and
   VISITED is initialized to 0. Q is a queue.]
   VISITED(V) ← 1
   initialize Q to be empty
2. [Perform iterations]
   loop
       for all vertices W adjacent to V do
           if VISITED(W) = 0
           then [ADD(W, Q); VISITED(W) ← 1]
       end
       if Q is empty then return
       DELETE(V, Q)
   forever
3. [Exit]
Graph G

Fig. 4.9
Traversing by DFS: V1, V2, V4, V8, V5, V3, V6, V7
Traversing by BFS: V1, V2, V3, V4, V5, V6, V7, V8
Analysis of Algorithms
1. DFS: If G is represented by its adjacency matrix, then the time to determine all
   vertices adjacent to V is O(n). Since at most n vertices are visited, the total
   time is O(n²).
2. BFS: Each vertex visited gets into the queue exactly once, so the
   loop...forever is iterated at most n times. If an adjacency matrix is used, then
   the for loop takes O(n) time for each vertex visited.
   The total time is therefore O(n²).

4.6 WEIGHTED GRAPHS

A graph in which weights are assigned to every edge is called a weighted graph.
Shortest Path in Weighted Graphs
Let G = (V,E,w) be a weighted graph where w is the weight (in terms of time or cost)
assigned to each edge. The length of a path in G is defined as the sum of the lengths
of the edges in the path.
Dijkstra's Shortest Path Algorithm
Let T ⊆ V with A ∉ T (A = starting node). Let P be the set of vertices V-T. For a
vertex t in T, let L(t) denote the length of a shortest path among all paths from A to t
that do not include any other vertex in T. We call L(t) the index of t with respect to P.
Let x be a vertex in T. Let P' be P ∪ {x} and T' be T - {x}. Let L'(t) denote the
index of a vertex t in T' with respect to P'. We claim that
L'(t) = Min[L(t), L(x) + w(x,t)] -----------(i)
Algorithm
1. Initially let P = {A} and T = V - {A}; for every vertex t in T, let L(t) = w(A,t).
2. Select the vertex in T that has the smallest index with respect to P; let it be
   denoted as vertex x.
3. If x is the vertex we wish to reach from A, stop. If not, let P' = P ∪ {x} and
   T' = T - {x}. For every vertex t in T', compute its index with respect to P'
   according to (i).
4. Repeat steps 2 and 3 using P' as P and T' as T.
5. Stop.
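The steps above can be sketched in C, computing the index of every vertex rather than stopping at one target. A weight matrix with 9999 standing in for ∞, a fixed vertex count and 0-based vertex numbering are assumptions for illustration:

```c
#define DN   6     /* assumed number of vertices */
#define DINF 9999  /* stands in for infinity */

/* Dijkstra's algorithm: shortest distances from vertex src into dist[]. */
void dijkstra(int w[DN][DN], int src, int dist[DN])
{
    int inT[DN], i, x, pass;
    for (i = 0; i < DN; i++) {
        inT[i] = 1;                    /* step 1: T = V - {src} */
        dist[i] = w[src][i];           /* L(t) = w(src, t)      */
    }
    inT[src] = 0;
    dist[src] = 0;
    for (pass = 1; pass < DN; pass++) {
        x = -1;                        /* step 2: smallest index in T */
        for (i = 0; i < DN; i++)
            if (inT[i] && (x == -1 || dist[i] < dist[x]))
                x = i;
        inT[x] = 0;                    /* step 3: P' = P U {x} */
        for (i = 0; i < DN; i++)       /* recompute indices per (i) */
            if (inT[i] && dist[x] + w[x][i] < dist[i])
                dist[i] = dist[x] + w[x][i];
    }
}
```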
Example
Find the shortest path from A to Z for the given graph:
1.

Fig. 4.10
Solution
Initially P={A} and
T={B,C,E,D,Z}
The lengths of different vertices (with respect to P) are:
L(B)=1 , L(C)=4, L(D)=L(E)= L(Z)= 

85
 so L(B)= 1 is the shortest value
Now let P={A,B} and T={C,D,E,Z}
so, L(C)=3, L(D)=8, L(E)=6, L(Z)=
 So L(C) =3, is the shortest value.
Now P={A,B, C} and T'={D,E,Z}
L'(D)= Min{8,}=8. L'(E)= Min{6,3+1}=4
L'(Z) = Min {,3+}= 
Successive observations are shown in following figures:
2.

3.

4.

5.

6.

Fig. 4.11
Thus the shortest path from A to Z is (A, B, C, E, D, Z) of length 9.

4.7 SPANNING TREES

It is a tree obtained from a graph which covers all its vertices.


Minimal Spanning Tree
It is a tree from the set of spanning trees which has minimum weight.
There are two methods to find minimum spanning tree.
 Prim's Method
 Kruskal's method
1. Prim’s Method
Draw n isolated vertices and label them V1, V2, ..., Vn.
Tabulate the given weights of the edges of G in an n×n table.
Set the weights of the non-existing edges to be very large (∞).

       V1    V2    V3    V4    V5    V6
V1     -     10    16    11    10    17
V2     10    -     9.5   ∞     ∞     19.5
V3     16    9.5   -     7     ∞     12
V4     11    ∞     7     -     8     7
V5     10    ∞     ∞     8     -     9
V6     17    19.5  12    7     9     -

Fig. 4.12
Start from vertex V1 and connect it to its nearest neighbour, say Vk. Now
consider V1 and Vk as one subgraph and connect this subgraph to its closest
neighbour (i.e., to a vertex other than V1 and Vk that has the smallest entry
among all entries in rows 1 and k).
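Prim's method above can be sketched in C. The `best` array plays the role of scanning the rows of the weight table for the smallest entry; the matrix size, the sentinel 9999 for ∞, growth from vertex 0 and the symmetric weight matrix are assumptions for illustration:

```c
#define PN   6     /* assumed number of vertices */
#define PINF 9999  /* stands in for the very large weight */

/* Grow the subgraph from vertex 0, each time connecting it to its
   closest outside neighbour.  Returns the total weight of the tree. */
int prim(int w[PN][PN])
{
    int inTree[PN] = {0}, best[PN], i, k, pass, total = 0;
    inTree[0] = 1;
    for (i = 0; i < PN; i++)
        best[i] = w[0][i];             /* cheapest edge into the subgraph */
    for (pass = 1; pass < PN; pass++) {
        k = -1;
        for (i = 0; i < PN; i++)       /* nearest outside vertex */
            if (!inTree[i] && (k == -1 || best[i] < best[k]))
                k = i;
        total += best[k];
        inTree[k] = 1;
        for (i = 0; i < PN; i++)       /* new row k may offer cheaper edges */
            if (!inTree[i] && w[k][i] < best[i])
                best[i] = w[k][i];
    }
    return total;
}
```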
2. Kruskal's Method
In this approach the minimum cost spanning tree T is built edge by edge. Edges are
considered for inclusion in T in ascending order of cost. An edge is included in T if it
doesn't form a cycle with the edges already in T. Since G is connected and has n > 0
vertices, exactly (n-1) edges will be selected for inclusion in T.

Edge Cost Action Resulting Tree


.... ...... ......... .......................

(2,3) 5 Accept

(2,4) 6 Accept

(3,4) 10 Reject "
(2,6) 11 Accept

(4,6) 14 Reject "


(1,2) 16 Accept

(4,5) 18 Accept

Fig 4.13
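Kruskal's method can be sketched in C using a simple union-find structure to detect cycles; the union-find representation is an implementation choice, not from the text. The usage below feeds in the pre-sorted edges and costs of the accept/reject table above, whose accepted edges sum to 5+6+11+16+18 = 56:

```c
#define KN 7   /* assumed vertex bound; vertices 1..6 are used */

int parent[KN];   /* union-find forest: parent[v] == v marks a root */

int find_root(int v)
{
    while (parent[v] != v)
        v = parent[v];
    return v;
}

/* Consider edges in ascending order of cost; accept an edge only if its
   ends lie in different components (it would otherwise form a cycle).
   Returns the total cost of the accepted edges. */
int kruskal(int u[], int v[], int cost[], int nedges)
{
    int i, total = 0;
    for (i = 0; i < KN; i++)
        parent[i] = i;                 /* each vertex is its own tree */
    for (i = 0; i < nedges; i++) {     /* edges assumed pre-sorted by cost */
        int ru = find_root(u[i]), rv = find_root(v[i]);
        if (ru != rv) {                /* accept: joins two components */
            parent[ru] = rv;
            total += cost[i];
        }                              /* else reject: would form a cycle */
    }
    return total;
}
```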

4.8 CODING GRAPH TRAVERSAL

Breadth First Search


#include <stdio.h>
#include <stdlib.h>
#define MAX   20
#define WHITE 0
#define GREY  1
#define BLACK 2
typedef struct node_type
{
    int vertex;
    struct node_type *link;
} node;
typedef struct queue_type
{
    int info;
    struct queue_type *next;
} queue;
int color[MAX];
void create_graph(node *adj[], int num)
{
    int i;
    for (i = 1; i <= num; i++)
        adj[i] = NULL;
}
/* Read the edges and build the adjacency lists */
void input_graph(node *adj[], int num)
{
    int e, i, u, v;
    node *ptr;
    printf("Enter no. of edges: ");
    scanf("%d", &e);
    printf("Enter %d edges (u v):\n", e);
    for (i = 0; i < e; i++)
    {
        scanf("%d %d", &u, &v);
        ptr = (node *) malloc(sizeof(node));
        ptr->vertex = v;
        ptr->link = adj[u];
        adj[u] = ptr;
    }
}
void enqueue(queue **front, queue **rear, int v)
{
    queue *p = (queue *) malloc(sizeof(queue));
    p->info = v;
    p->next = NULL;
    if (*rear == NULL)
        *front = p;
    else
        (*rear)->next = p;
    *rear = p;
}
int dequeue(queue **front, queue **rear)
{
    queue *p = *front;
    int v = p->info;
    *front = p->next;
    if (*front == NULL)
        *rear = NULL;
    free(p);
    return v;
}
int empty(queue *front)
{
    return front == NULL;
}
void bfs(node *adj[], int n, int s)
{
    node *ptr;
    queue *front = NULL, *rear = NULL;
    int i, u, v;
    for (i = 1; i <= n; i++)
        color[i] = WHITE;
    color[s] = GREY;             /* grey: discovered, in the queue */
    enqueue(&front, &rear, s);
    while (!empty(front))
    {
        u = dequeue(&front, &rear);
        printf("%d ", u);
        for (ptr = adj[u]; ptr != NULL; ptr = ptr->link)
        {
            v = ptr->vertex;
            if (color[v] == WHITE)
            {
                color[v] = GREY;
                enqueue(&front, &rear, v);
            }
        }
        color[u] = BLACK;        /* black: fully processed */
    }
}
int main(void)
{
    node *adj[MAX];
    int n, s;
    printf("Enter no. of nodes: ");
    scanf("%d", &n);
    create_graph(adj, n);
    input_graph(adj, n);
    printf("Enter source vertex: ");
    scanf("%d", &s);
    printf("BFS from vertex %d: ", s);
    bfs(adj, n, s);
    return 0;
}
Depth First Search
/* Uses node, MAX, color[], create_graph and input_graph from the
   BFS listing above. */
void dfs_visit(node *adj[], int u)
{
    int v;
    node *ptr;
    color[u] = GREY;
    printf("%d ", u);            /* preorder: visit on entry */
    for (ptr = adj[u]; ptr != NULL; ptr = ptr->link)
    {
        v = ptr->vertex;
        if (color[v] == WHITE)
            dfs_visit(adj, v);
    }
    color[u] = BLACK;
}
int main(void)
{
    node *adj[MAX];
    int n, i, t = 0;
    printf("Enter no. of nodes in graph: ");
    scanf("%d", &n);
    create_graph(adj, n);
    input_graph(adj, n);
    for (i = 1; i <= n; i++)
        color[i] = WHITE;
    for (i = 1; i <= n; i++)     /* one DFS tree per unvisited vertex */
    {
        if (color[i] == WHITE)
        {
            t++;
            printf("\nDFS tree %d: ", t);
            dfs_visit(adj, i);
        }
    }
    return 0;
}

4.9 CHECK YOUR PROGRESS

1. Fill in the blanks:


a. A connected planar graph having 6 vertices, 7 edges contains
_____________ regions.
b. In a simple graph, the number of edges is equal to ___________ the
sum of the degrees of the vertices.
c. If a simple graph G, contains n vertices and m edges, the number of edges
in the Graph G'(Complement of G) is ___________
2. What is the number of edges present in a complete graph having n vertices?
a. (n*(n+1))/2
b. (n*(n-1))/2
c. n
d. Information given is insufficient
3. What is the maximum number of edges in a bipartite graph having 10 vertices?
a. 24
b. 21
c. 25
d. 16
4. Which of the following is true?
a. A graph may contain no edges and many vertices
b. A graph may contain many edges and no vertices
c. A graph may contain no edges and no vertices
d. A graph may contain no vertices and many edges

4.10 SUMMARY

1. A graph consists of two sets V(G) and E(G), where V(G) is a non-empty set
of vertices and E(G) is a set of edges connecting those vertices.
2. Graph is a superset of tree. Every tree is a graph but every graph is not
necessarily a tree.
3. A graph in which every edge is directed is called directed graph or digraph. A
graph in which every edge is undirected is called an undirected graph.
4. A graph can be represented in various ways
a. By Adjacency Matrix
b. Path Matrix
c. Adjacency List
d. Adjacency Multilist.
5. There are two methods to traverse a graph.
a. Depth first search
b. Breadth first search

6. Spanning tree is a tree obtained from a graph which covers all its vertices.
7. A graph in which weights are assigned to every edge is called a Weighted Graph.

4.11 KEYWORDS

Digraph: A graph in which every edge is directed


Undirected graph: A graph in which every edge is undirected.
Multigraph: Any graph which contains some parallel edges.
Length of path: The number of edges in the sequence of a path.
Null graph: A graph containing only isolated nodes.
Weighted graph: A graph in which weights are assigned to every edge.
Loop: An edge of a graph which joins a node to itself.
Spanning tree: A tree obtained from a graph which covers all its vertices.
Minimum spanning tree: A tree from the set of spanning tree which has minimum
weight.

4.12 SELF ASSESSMENT TEST

1. What is a graph? Compare graphs with trees.


2. When does a graph become a tree?
3. Define these in terms of a graph
a. Vertex
b. Edges
c. Indegree
d. Outdegree
4. Define these graphs:
a. Undirected graphs
b. Directed graphs
c. Weighted graphs
d. Mixed graphs
e. Multi graphs

5. What are the different representations of a graph? Explain adjacency matrix
representation of a graph.
6. Write the adjacency matrix representation of the following graphs

7. Explain adjacency list representation of the graphs. Write Adjacency list


representation of the graphs of previous question.
8. Write a program that converts Adjacency matrix representation into its
adjacency list representation.
9. Explain adjacency multilist representation of a graph.
10. What is a path matrix?
11. Explain Warshall's minimal algorithm for finding the path matrix of a graph
given its adjacency matrix.
12. What do you mean by traversal of any graph?
13. Write depth first search algorithm for the traversal of any graph. Write a "C"
program for the same. Explain your algorithm's time complexity with the help
of an example.
14. Explain breadth first search algorithm for the traversal of any graph with
suitable examples. Define time complexity of the algorithm. Write a "C"
program for the same.
15. Another way to represent a graph is by its incidence matrix inc. There is one
row for each vertex and one column for each edge. Then INC(i,j)=1 if edge j is
incident to vertex i. Write a program that converts adjacency matrix
representation of any graph to its incidence matrix representation.
16. Define spanning tree and minimal spanning tree. Write Kruskal's algorithm for
finding minimal spanning tree of any graph. Find the minimal spanning tree of
the following graphs by Kruskal's algorithm.

17. Write Prim's algorithm for finding minimal spanning tree of any graph. Find the
minimal spanning trees of the graph of previous questions by Prim's algorithm.
18. By considering the complete graph with n vertices, show that the number of
spanning trees is at least 2^(n-1) - 1.
19. Prove that when DFS and BFS are applied to a connected graph the edges of the
graph form a tree.
20. What do you understand by shortest path from one node to another in a
weighted graph. Write Dijkstra's algorithm to find the shortest path in a
weighted graph. Find the shortest path from 3 to T using Dijkstra's algorithm in
the following graphs:
i)

ii)

4.13 ANSWERS TO CHECK YOUR PROGRESS

1. Answer to fill in the blanks


a. 3
b. half
c. (n*n-n-2*m)/2
2. b
3. c
4. a

4.14 REFERENCES/SUGGESTED READING

1. Birkhäuser Boston, An Introduction to Data Structures and Algorithms,


Springer-New York
2. Seymour Lipschutz, “Data Structure”, Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, “Fundamentals of Data Structure in C”,
Orient Longman.
4. Trembley, J.P. And Sorenson P.G., “An Introduction to Data Structures with
Applications”, McGraw- Hill International Student Edition, New York.
5. Yedidyan Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum, “Data
Structures using C”, Prentice Hall of India Pvt. Ltd., New Delhi.
6. Mark Allen Weiss, “Data structures and Algorithm Analysis in C”, Addison-
Wesley (An Imprint of Pearson Education), Mexico City, Prentice -Hall of
India Pvt. Ltd., New Delhi.

97
Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 5

SORTING METHODS
REVISED /UPDATED SLM BY KAPILA DEVI

STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Sorting
5.3.1. Bubble Sort
5.3.2. Selection Sort
5.3.3. Insertion Sort
5.3.4. Merge Sort
5.3.5. Quick Sort
5.3.6. Heap Sort
5.4 Check your Progress
5.5 Summary
5.6 Keywords
5.7 Self-Assessment Test
5.8 Answers to Check Your Progress
5.9 References/Suggested Reading

98
5.1 LEARNING OBJECTIVES
After studying this unit, you should be able to:
1. Describe searching techniques
2. Describe sequential, binary and index sequential search
3. Define sorting
4. Describe various types of sorting
5. Describe 'O' notation
6. Describe bubble sort and quick sort techniques

5.2 INTRODUCTION
In the previous units, we have discussed many data structures and their storage
representations. Algorithms for a number of operations such as insertion and
deletion, which are commonly performed on these structures were described in
detail. This unit discusses two additional operations that are frequently performed
on data structures, namely sorting and searching. We will see that efficient
algorithms for these operations can be realized when data are properly structured.

5.3 SORTING

Sorting refers to the operation of arranging data in some given order, such as
increasing or decreasing, with numerical data or alphabetically, with character data.
Let P be a list of n elements P1, P2, P3, ..., Pn in memory. Sorting P refers to the
operation of rearranging the contents of P so that they are increasing in order
(numerically or lexicographically), that is,
P1 < P2 < P3 < ... < Pn
Since P has n elements, there are n! ways that the contents can appear in P. These
ways correspond precisely to the n! permutations of 1, 2, ...n. Accordingly, each
sorting algorithm must take care of these n! possibilities.
99
CLASSIFICATION OF SORTING
Sorting can be classified in two types:
1. Internal Sorting
2. External Sorting
Internal Sorting: If the records that have to be sorted are in the internal memory.
External Sorting: If the records that have to be sorted are in secondary memory.
EFFICIENCY CONSIDERATIONS
The most important considerations are:
1. The amount of machine time necessary for running the program.
2. The amount of space necessary for the program.
In most of the computer applications one is optimized at the expense of another.
The actual time units to sort a file of size n varies from machine to machine, from
one program to another, and from one set of data to another.
So, we find out the corresponding change in the amount of time required to sort a
file induced by a change in the file size n. The time efficiency is not measured by
the number of time units required but rather by the number of critical operations
performed.
Critical operations are those that take most of the execution time. For example, key
comparisons (that is, the comparisons of the key of two records in the file to
determine which is greater), movement of records or pointer to record, or
interchange of two records.

5.3.1 BUBBLE SORT


Another well-known sorting method is the Bubble Sort. In this approach, there are
at the most n-1 passes required.
In bubble sort, the file is passed through sequentially several times. In each pass
each element is compared with its successor. i.e. (x[i] with x [i+1]) and
interchanged if they are not in the proper order.
Example:
 Let x be an array of integers of size n.
 Let the array be sorted such that x[i] <= x[j] for 0 <= i < j < n.
 x [ ] = {25, 57, 48, 37, 12, 92, 86, 33}
This method will cause records with small keys to move or "bubble up". After the
first pass, the record with the largest key will be in the nth position.
Example
The following comparisons are made on the first pass:
x[0] with x[1] (25 with 57) No interchange
x[1] with x[2] (57 with 48) Interchange
x[2] with x[3] (57 with 37) Interchange
x[3] with x[4] (57 with 12) Interchange
x[4] with x[5] (57 with 92) No Interchange
x[5] with x[6] (92 with 86) Interchange
x[6] with x[7] (92 with 33) Interchange
Thus after the first pass, the file is in the order
25, 48, 37, 12, 57, 86, 33, 92
After the first pass, the largest element (92) is in its proper position within the
array. In general, x[n-i] will be in its proper position after iteration i. The complete
set of iterations is as follows:
iteration 0 25 57 48 37 12 92 86 33
(original file)
Iteration 1 25 48 37 12 57 86 33 92
Iteration 2 25 37 12 48 57 33 86 92
Iteration 3 25 12 37 48 33 57 86 92
Iteration 4 12 25 37 33 48 57 86 92
Iteration 5 12 25 33 37 48 57 86 92
Iteration 6 12 25 33 37 48 57 86 92
Iteration 7 12 25 33 37 48 57 86 92
Since all the elements in positions greater than or equal to n-i are already in proper
position after iteration i, they need not be considered in succeeding iterations. Thus,
on the first pass n-1 comparisons are made, on the second pass n-2 comparisons are
made, and on the (n-1)th pass only one comparison is made (between x[0] and x[1]).

This enhancement speeds up the process.
Although n-1 passes are required to sort a file, however, in the preceding example
the file was sorted after five iterations making the last two iterations unnecessary.
To eliminate the unnecessary passes we keep a record of whether or not any
interchange is made in a given pass, so that further passes can be eliminated.

ALGORITHM OF BUBBLE SORT

void bubble(int x[], int n)
{
    int hold, j, pass;
    int switched = TRUE;
    for (pass = 0; pass < n-1 && switched == TRUE; pass++)
    {
        /* outer loop controls the number of passes */
        switched = FALSE;   /* initially no interchange has been made on this pass */
        for (j = 0; j < n-pass-1; j++)
        {
            /* inner loop governs each individual pass */
            if (x[j] > x[j+1])
            {
                /* elements out of order; an interchange is made */
                switched = TRUE;
                hold = x[j];
                x[j] = x[j+1];
                x[j+1] = hold;
            }   /* end if */
        }   /* end for */
    }   /* end for */
}   /* end bubble */
The algorithm is straightforward. Before each pass, the interchange marker
switched is set to FALSE. It is set to TRUE whenever an interchange is made. If, at
the end of a pass, switched is still FALSE, the sort is complete.

Fig. 5.1 Trace of a bubble sort

EFFICIENCY OF BUBBLE SORT

Considering no improvements are made:


 There are n-1 passes and n-1 comparisons on each pass.
 Thus the total number of comparisons is (n-1)(n-1) = n² - 2n + 1, which is O(n²).
Considering the improvements are made:
 If there are k iterations, the total number of comparisons is (n-1) + (n-2) + (n-3)
+ ... + (n-k), which equals (2kn - k² - k)/2.
 The average number of iterations, k, is O(n), so that the entire formula is still
O(n²), although the constant multiplier is smaller than before.

5.3.2 SELECTION SORT


One of the easiest ways to sort a table is by selection. Beginning with the first
record in the table, a search is performed to locate the element with the smallest
key. When found, it is placed as the first record in the table and the rest of the
records are shifted by one place. A search
for the second smallest key is then carried out. This is accomplished by examining
the keys of the records from the second element onwards. The element which has
the second smallest key is placed as the second element of the table, rest of the
records are shifted by one place. The process of searching for the record with the
next smallest key and placing it in its proper position continues until all the records
have been sorted in ascending order.
A selection sort is one in which successive elements are selected in order and
placed into their proper sorted positions. The elements of the input may have to be
preprocessed to make the ordered selection possible. The basic idea of the
selection sort is to repeatedly select the smallest key in the remaining unsorted
array.
A general algorithm for the selection sort is given below:
1. Repeat through step 5 a total of n-1 times.
2. Record the position of the table already sorted.
3. Repeat step 4 for the elements in the unsorted position of the table.
4. Record location of smallest element in unsorted table.
5. Place smallest element in the first place of table and shift the array by one place.
Example:
Consider the following unsorted array to be sorted using selection sort.
15 6 13 22 3 52 2 Smallest
Unsorted array
Step 1: 2 is chosen as it is the smallest element and is put in sorted part at 1st
position. Now the array becomes as:
2 15 6 13 22 3 52
Sorted Unsorted array
Step 2: The second pass identifies 3 as the smallest element in the remaining unsorted
array. Adding 3 also in the sorted part at the 2nd position, we have the following array:
2 3 15 6 13 22 52
Sorted Unsorted
Step 3: In third pass 6 is chosen as smallest and put at position 3 and the array
becomes:
2 3 6 15 13 22 52
Sorted Unsorted
Step 4:
2 3 6 13 15 22 52
Sorted Unsorted
Step 5: 2 3 6 13 15 22 52
Sorted Unsorted
Step 6: 2 3 6 13 15 22 52
Sorted Unsorted
Step 7: 2 3 6 13 15 22 52
Sorted
Therefore, after the seventh pass the array is sorted. The above-mentioned
technique uses a separate area to store the sorted part; therefore, more space is
required to work on it.
ANALYSIS OF SELECTION SORT
The ordering in a selection sort is not important. If an entry is in its correct final
position, then it will never be moved. Every time any pair of entries is swapped,
then at least one of them moves into its final position, and therefore at most n -1
swaps are done altogether in sorting a list of n entries.
The comparison is done for minimum value, with the length of sublist ranging from
n down to 2. Thus, altogether there are:
(n-1) + (n-2) + ... + 1 = n(n-1)/2
comparisons of keys, which approximates to
n²/2 + O(n)
The order of selection sort thus comes out to be O(n²).

5.3.3 INSERTION SORT


An insertion sort is one that sorts a set of records by inserting records into an
existing sorted file. Suppose an array A with n elements A[1], A[2], ..., A[N] is in
memory. The insertion sort algorithm scans A from A[1] to A[N], inserting each
element A[K] into its proper position in the previously sorted subarray A[1],
A[2], ..., A[K-1].
Example
Sort the following list using the insertion sort method: 4, 1, 3, 2, 5
i) 4 Place 4 in 1st position

ii) 1 4 1 < 4, therefore insert prior to 4

iii) 1 3 4 3 > 1, insert between 1 & 4

iv) 1 2 3 4 2 > 1, insert between 1 & 3

v) 1 2 3 4 5 5>4, insert after 4


Thus, to find the correct position, search the list till an item just greater than the
target is found; shift all the items from this point one down the list, insert the target
in the vacated slot.
Algorithm to Implement Insertion Sort
insertsort(x, n)
int x[], n;
{
    int i, k, y;
    for (k = 1; k < n; k++)
    {
        y = x[k];
        for (i = k-1; i >= 0 && y < x[i]; i--)
            x[i+1] = x[i];
        x[i+1] = y;
    }
}

ANALYSIS OF INSERTION SORT

If the initial file is sorted, only one comparison is made on each pass, so the sort is
O(n). If the file is initially sorted in reverse order, the sort is O(n²), since the total
number of comparisons is:
(n-1) + (n-2) + ... + 3 + 2 + 1 = (n-1) * n/2
which is O(n²).
The closer the file is to sorted order, the more efficient the simple insertion sort
becomes. The space requirement for the sort consists of only one temporary
variable, y. The speed of the sort can be improved somewhat by using a binary
search to find the proper position for x[k] in the sorted file
x[0], x[1], ..., x[k-1]
This reduces the total number of comparisons from O(n²) to O(n log₂ n).
However, even if the correct position i for x[k] is found in O(log₂ n) steps, each of
the elements x[i+1], ..., x[k-1] must be moved by one position. This latter operation,
performed n times, requires O(n²) replacements.

5.3.4 MERGE SORT


The operation of sorting is closely related to the process of merging. This sorting
method uses merging of two ordered lists which can be combined to produce a
single sorted list. This process can be accomplished easily by successively selecting
the record with the smallest key occurring in either of the lists and placing this
record in a new table, thereby creating an ordered list.
Merge sort is one of the divide and conquer class of algorithm. The basic idea is to:
 Divide the list into a number of sublists.
 Sort each of these sublists.
 Merge them to get a single sorted list.
Two-way merge sort divides the list into two, sorts the sublist and then merges them
to get the sorted list, also called concatenate sort.
Multiple merging can also be accomplished by performing a simple merge
repeatedly. For example if we have 16 lists to merge, we can first merge them in
pairs. The result of this first step yields eight tables which are again merged in pairs
to give four tables. This process is repeated until a single table is obtained. In this
example four separate passes are required to yield a single list. In general, k
separate passes are required to merge 2^k separate lists into a single list.
Example

A schematic of the above method is as follows:


ALGORITHM FOR MERGING TWO SORTED FILES TO GET THIRD
SORTED FILE
Let (X[l], ..., X[m]) and (X[m+1], ..., X[n]) be two sorted files. Let the merged file
be (Z[l], ..., Z[n]).
Procedure:
merge (x, z, l, m, n)
int x[], z[], l, m, n;
{
    /* (x[l], ..., x[m]) and (x[m+1], ..., x[n]) are two sorted lists with keys
       such that x[l] <= ... <= x[m] and x[m+1] <= ... <= x[n] */
    int i, j, k, t;
    i = l;
    k = l;       /* i, j and k are positions in these files */
    j = m+1;
    while ((i <= m) && (j <= n))
    {
        if (x[i] <= x[j])
        {
            z[k] = x[i];
            i++;
        }
        else
        {
            z[k] = x[j];
            j++;
        }
        k = k+1;
    }
    if (i > m)
    {
        for (t = j; t <= n; t++)
            z[k + t - j] = x[t];
    }
    else
    {
        for (t = i; t <= m; t++)
            z[k + t - i] = x[t];
    }
}
The above algorithm for merge sort has one important property: after pass K,
the array A will be partitioned into sorted subarrays of exactly L = 2^K elements
(except possibly the last subarray).
By dividing n (the size of array A) by 2*L, we get the quotient Q, which is the
number of pairs of sorted subarrays of size L, i.e.
Q = INT(N/(2*L))
S = 2*L*Q will be the total number of elements in the Q pairs of subarrays. R = N-S
denotes the number of remaining elements.
ANALYSIS OF MERGE SORT
Analysis of Algorithm 'MERGE'
The while loop is iterated at most n - l + 1 times. The if statement moves at most one
record per iteration (considering sorting of records); the total time is therefore
O(n - l + 1). If records are of length m, then this time is O(m(n - l + 1)).
Analysis of 'MSORT'
On the ith pass the files being merged are of size 2^(i-1). Consequently, a total of
⌈log₂ N⌉ passes are made over the data. Since two files can be merged in linear
time (algorithm 'MERGE'), each pass of merge sort takes O(N) time. As there are
⌈log₂ N⌉ passes, the total computing time is O(N log₂ N).

5.3.5 QUICK SORT


Quick sort is also known as partition exchange sort. An element a is chosen from
a specific position within the array. The array x is then partitioned so that a is
placed at position j and the following conditions hold:
1. Each of the elements in positions 0 through j-1 is less than or equal to a.
2. Each of the elements in positions j+1 through n-1 is greater than or equal to a.
The purpose of the Quick Sort is to move a data item in the correct direction just
enough for it to reach its final place in the array. The method, therefore reduces
unnecessary swaps, and moves an item a great distance in one move. A pivotal item
near the middle of the array is chosen, and then items on either side are moved so
that the data items on one side of the pivot are smaller than the pivot, whereas those
on the other side are larger, the middle (pivot) item is in its correct position. The
procedure is then applied recursively to the parts of the array, on either side of the
pivot, until the whole array is sorted.
Example:
If an initial array is given as:
25 57 48 37 12 92 86 33
and the first element (25) is placed in its proper position, the resulting array is:

12 25 57 48 37 92 86 33
At this point, 25 is in its proper position in the array (x[1]), each element below
that position (12) is less than or equal to 25, and each element above that position
(57, 48, 37, 92, 86 and 33) is greater than or equal to 25.
Since 25 is in its final position the original problem has been decomposed into the
problem of sorting the two subarrays.
(12) and (57 48 37 92 86 33)
First of these subarrays has one element so there is no need to sort it. Repeating the
process on the subarray x[2] through x[7] yields:
12 25 (48 37 33) 57 (92 86)
and further repetitions yield
12 25 (37 33) 48 57 (92 86)
12 25 (33) 37 48 57 (92 86)
12 25 33 37 48 57 (92 86)
12 25 33 37 48 57 (86) 92
12 25 33 37 48 57 86 92
Algorithm to Implement Quick Sort
qsort (int x[], int m, int n)
{
    /* m and n contain the lower and upper bounds of the array */
    /* the array has to be sorted in non-decreasing order */
    int i, j, k, t;
    if (m < n)
    {
        i = m;
        j = n+1;
        k = x[m];   /* key (pivot) */
        while (1)
        {
            do
            {
                i = i+1;
            } while (x[i] < k);
            do
            {
                j = j-1;
            } while (x[j] > k);
            if (i < j)
            {
                t = x[i];
                x[i] = x[j];
                x[j] = t;
            }
            else
                break;
        }
        t = x[m];
        x[m] = x[j];
        x[j] = t;
        qsort(x, m, j-1);
        qsort(x, j+1, n);
    }
}
We shall illustrate the mechanics of this method by applying it to an array of
numbers. Suppose the array A initially appears as:
(15, 20, 5, 8, 95, 12, 80, 17, 9, 55)
Figure 5.2 shows a quick sort applied to this array.

A(1) A(2) A(3) A(4) A(5) A(6) A(7) A(8) A(9) A(10)
15 20 5 8 95 12 80 17 9 55
9 20 5 8 95 12 80 17 () 55
9 () 5 8 95 12 80 17 20 55
9 12 5 8 95 () 80 17 20 55
9 12 5 8 () 95 80 17 20 55
9 12 5 8 15 95 80 17 20 55
Fig. 5.2 Quick sort of an array
The following steps are involved:
1. Remove the first data item, 15, mark its position and scan the array from right to
left, comparing data item values with 15. When you find the first smaller value,
remove it from its current position and put it in position A(1).
2. Scan line 2 from left to right beginning with position A(2), comparing data item
values with 15. When you find the first value greater than 15, extract it and store
it in the position marked by parentheses in line 2.
3. Begin the right to left scan of line 3 with position A(8) looking for a value
smaller than 15. When you find it, extract it and store it in the position marked
by the parentheses in line 3.
4. Begin scanning line 4 from left to right at position A(3), find a value greater than
15, remove it, mark its position, and store it inside the parentheses in line 4. This
is shown in line 5.
5. Now, when you scan line 5 from right to left beginning at position A(7), you find
no value smaller than 15. Moreover, you come to a parenthesis position, position
A(5). This is the location to put the first data item, 15, as shown in line 6. At this
stage 15 is in its correct place relative to the final sorted array.
ANALYSIS OF QUICK SORT
Assumption: i) The file is of size n where n is a power of 2.
Say n = 2^m, so that m = log₂ n.
Average Case Behaviour
Proper position for the pivot always turns out to be exact middle of the subarray.
There will be approximately n comparisons on the first pass after which file will
split into two subfiles each of size n/2 approximately.
For each of these two files there will be approximately n/2 comparisons. So, after
halving the subfiles m times, there are n files of size 1. Thus the total number of
comparisons for the entire sort is approximately:
n + 2*(n/2) + 4*(n/4) + 8*(n/8) + ... + n*(n/n)
or
n + n + n + ... + n (m terms)
There are m terms because the file is divided m times. Thus the total number of
comparisons is:
O(n * m) or O(n log n) [as m = log₂ n]

Worst Case Behaviour


The worst case occurs when the first pivot fails to split the list. This happens when
the original file is already sorted. If, for example, x[0] is in its correct position, the
original file is split into subfiles of sizes 0 and n-1.
If this process continues, a total of n-1 subfiles are sorted, the first of size n, the
second of size n-1, and so on. The total number of comparisons to sort the entire
file is
n + (n-1) + (n-2) + ... + 2
which is O(n²).
Thus, quick sort works best with completely unsorted files and worst for files
that are already sorted.

5.3.6 HEAP SORT


The algorithm which we now formulate is a combination of algorithms by Floyd and
Williams. In general, a heap which represents a table of n records satisfies the
property
K[j] <= K[i] for 2 <= j <= n and i = Trunc(j/2). The binary tree is allocated
sequentially such that the indices of the left and right sons (if they exist) of record i
are 2i and 2i+1 respectively.
A complete binary tree is said to satisfy the "heap condition" if the key of each
node is greater than or equal to the keys of its children. Thus, the root node will
have the largest key value. Trees can be represented as arrays, by first numbering
the nodes (starting from the root) from left to right.
The key values of nodes are then assigned to array positions whose index is given by
the number of the node.

Fig. 5.3 Heap 1

The relationships of the node can also be determined from the array representation.
If a node is at position j, its children will be at positions 2j and 2j+1. Its parent will
be at position j/2. A heap is a complete binary tree in which each node satisfies the
heap condition, represented as an array. The operation on a heap works in two
steps:
1. The required node is inserted / deleted / or replaced.
2. First step may cause violation of the heap condition so the heap is traversed and
modified to rectify any such violations.

INSERTION IN HEAP

Consider an insertion of node R in the heap 1.


1. Initially R is added as the right child of J and given the number 13.
2. But R > J, so the heap condition is violated.
3. Move R up to position 6 and move J to position 13.
4. R > P, therefore the heap condition is still violated.
5. Swap R and P.
6. The heap condition is now satisfied by all nodes.

Fig. 5.4 Heap 2

Fig. 5.5 Heap 3


A general algorithm for creating a heap is given below:
1. Repeat through step 7 while there still is another record to be placed in the heap.
2. Obtain child to be placed at leaf level.
3. Obtain child to be parent for this child.
4. Repeat through step 6 while the child has a parent and the key of the child is
greater than that of its parent.
5. Move parent down to position of child.
6. Obtain position of new parent for the child.
7. Copy child record into its proper place.
A more formal procedure Create_Heap(K, N) is given below. K is a given table
containing the keys of the N records of a table; this algorithm creates a heap as
previously described. The index variable Q controls the number of insertions which
are to be performed. The integer variable J denotes the index of the parent of key
K[I]. KEY contains the key of the record being inserted into an existing heap.
1. [Build heap]
Repeat through step 7 for Q = 2, 3, ..., N
2. [Initialize construction phase]
I ← Q
KEY ← K[Q]
3. [Obtain parent of new record]
J ← Trunc(I/2)
4. [Place new record in existing heap]
Repeat through step 6 while I > 1 and KEY > K[J]
5. [Interchange record]
K[I] ← K[J]
6. [Obtain next parent]
I ← J
J ← Trunc(I/2)
if J < 1
then J ← 1
7. [Copy new record into its proper place]
K[I] ← KEY
8. [Finished]
Return
DELETION FROM HEAP
When we delete any node from the tree, the child with greater value will be
promoted to replace that node.
Consider the deletion of M from heap 2.
The larger of M's children is promoted to position 5.
An efficient sorting method is based on the heap construction and node removal
from the heap in order. This algorithm is guaranteed to sort N elements in N log N
steps.

117
CONSTRUCTION OF HEAP

Two methods of heap construction and then removal in order from the heap to sort
the list are as follows:
Top-Down Heap Construction
 Insert items into an initially empty heap, keeping the heap condition intact in all
steps.
Bottom-Up Heap Construction
 Build a heap with the items in the order presented.
 From the rightmost node, modify to satisfy the heap condition.
e.g.: To build a heap of the following using both methods of construction.
Let the input string be "PROFESSIONAL".

Top-Down Construction

118
Fig. 5.6
Bottom-up Construction

119
Fig. 5.7

SORTING USING HEAP

The sorted elements will be placed in x[ ], an array of size 12.


a)

i) Remove 'S' and store in x[12].

b)

ii) Remove 'S' and store in x[11].


c)

iii) Remove 'R' and store in x[10].


d)

iv) Remove 'P' and store in x[9].
e)

v) Remove 'O' and store in x[8].


f)

vi) Remove 'O' and store in x[7].


g)
vii) Remove 'N' and store in x[6].
h)

viii) Remove 'L' and store in x[5].


i)

ix) Similarly, the remaining 3 nodes are removed and the heap modified to get
the sorted list AEFILNOOPRSS.

Fig. 5.8

123
5.4 CHECK YOUR PROGRESS

1. Fill in the blanks:


a. The average number of comparisons done by sequential search are ————.
b. The maximum number of key comparisons in binary search is ——————.
c. ———— operations take most of the execution time.
d. ———— is also known as partition exchange sort.
e. —————— reduces the drawbacks of sequential search.
2. For a linear search, the average number of comparisons for a file with n records
is:
a. n/2
b. n
c. log₂ n
d. None of the above
3. Pick up the odd one from the following:
a. range sort
b. radix sort
c. hash heap sort
d. none of the above

5.5 SUMMARY

1. Searching and sorting are two important operations associated with any data
structure.
2. Efficient and reliable data processing depends upon sorted data.
3. The internal and external sorting methods have their relative efficiencies in
different applications.
4. There are three methods for performing searching:
a. Sequential Searching
b. Binary Searching
c. Index Sequential Searching
124
5. Binary search reduces the drawbacks of sequential search.
6. In Bubble sort in each pass each element is compared with its successor. i.e.,
x[i] with x[i+1] and interchanged if they are not in the proper order.
7. The total number of comparisons in bubble sort is O(n²).
8. In quick sort an element x is chosen from a specific position within the array
such that each element in positions below that of x is less than or equal to x,
and each of the elements in positions greater than the position of x is
greater than or equal to x.

5.6 KEYWORDS

Sequential Search: The simplest technique for searching an unordered table for a
particular record in a sequential manner until the desired record is found.
Binary Search: The searching technique applied for searching the entries in a
sorted table.
Sorting: The operation of arranging data in some given order, such as increasing or
decreasing with numerical data or alphabetically, with character data.
Bubble Sort: A sorting technique where the file to be sorted is passed through
sequentially several times. In each pass each element is compared with its successor
and interchanged if they are not in the proper order.
Quick Sort: A partition exchange sort which works by placing a pivot element in
its final position so that all elements on one side of it are smaller and all elements
on the other side are larger, then sorting the two partitions recursively.

5.7 SELF ASSESSMENT TEST

4. What is searching? What are the different techniques for searching?


5. Explain binary searching. How is binary searching better than sequential
searching? Explain with example.
6. What is index sequential searching. Write a 'C' program for index sequential
searching.
7. Write an algorithm for binary search and explain its time complexity.
8. Why is there a need for sorting?
9. Why is bubble sort called by this name?
10. Write a 'C' program for bubble sort. Discuss its time complexity.
11. Explain Quick Sort.
12. Write a 'C' program for quick sort. Explain its time complexity.
13. When is the bubble sort better than the quick sort.
14. Show that the algorithm for quick sort takes O(n²) time when the input file is
already in sorted order.

5.8 ANSWERS TO CHECK YOUR PROGRESS


1. Answers to fill in the blanks
a. n
b. log₂ n
c. critical
d. quick sort
e. Binary Search
2. b
3. b

5.9 REFERENCES / SUGGESTED READINGS


1. Birkhanser-Boston, An Introduction to Data Structures and Algorithms,
Springer-New York
2. Seymour Lipschutz, “Data Structure ”, Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, “Fundamentals of Data Structure in C”,
Orient Longman.
4. Trembley, J.P. And Sorenson P.G., “An Introduction to Data Structures with
Applications”, McGraw- Hill International Student Edition, New York.

126
5. Yedidyan Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum, “Data
Structures using C”, Prentice Hall of India Pvt. Ltd., New Delhi.
6. Mark Allen Weiss, “Data structures and Algorithm Analysis in C”,
Addison-Wesley (An Imprint of Pearson Education), Mexico City,
Prentice-Hall of India Pvt. Ltd., New Delhi.

127
Subject: Advance Data Structures
Course Code: BCA-PC(L)-323

Lesson No. 6

SEARCHING TECHNIQUES
REVISED /UPDATED SLM BY KAPILA DEVI

STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Sequential Search
6.4 Binary Search
6.5 Indexed sequential Search
6.6 Check your Progress
6.7 Summary
6.8 Keywords
6.9 Self-Assessment Test
6.10 Answers to Check Your Progress
6.11 References/Suggested Reading

128
6.1 LEARNING OBJECTIVES
After studying this unit, you should be able to:
1. Describe searching techniques
2. Describe sequential, binary and index sequential search
3. Define sorting
4. Describe various types of sorting
5. Describe 'O' notation
6. Describe bubble sort and quick sort techniques

6.2 INTRODUCTION
In the previous units, we have discussed many data structures and their storage
representations. Algorithms for a number of operations such as insertion and
deletion, which are commonly performed on these structures were described in
detail. This unit discusses additional operations that are frequently performed on
data structures, namely searching. We will see that efficient algorithms for these
operations can be realized when data are properly structured.
Information retrieval in the required format is the common activity in all computer
applications. This involves searching, sorting and merging. Searching methods are
designed to take advantage of the file organization and optimize the search for a
particular record or establish its absence.

6.3 SEQUENTIAL SEARCH


The simplest technique for searching an unordered table for a particular record is to
scan each entry in the table in a sequential manner until the desired record is found.
This is the most natural way of searching. In this method we simply go through a
list or a file till the required record is found or end of list or file is encountered.
Ordering of list is not important.
Algorithm

/* Algorithm to search a list of values to find the required value */
Input : Array of size N. Target value T.
Output : Position of T in the list
Begin
1. Set Found to false
   Set I to 0
2. While (I < N) and (Found is false)
       If List[I] = T
           Found = true
       Else
           I = I + 1
3. If Found is false
       T is not present in the list
End
In the sentinel version of this algorithm, the first step initializes the key value of a
sentinel record to T. In the second step, a sequential search is performed on the n+1
records. If the index of the record found denotes record R[n+1], then the search has
failed; otherwise the search is successful and the index is that of the desired record.
EFFICIENCY OF SEQUENTIAL SEARCH
Let us assume that no insertion or deletion is made, i.e. we are searching through a
table of constant size n. If the record we are searching for is the first one in the
table, only one comparison is performed. If the record is the last one in the table, n
comparisons are necessary. Hence, the average number of comparisons done by
sequential search is:
(1 + 2 + ... + n)/n = (n + 1)/2
Sequential search is easy to write and is efficient for short lists. It does not
require sorted data, but it is disastrous for long lists.
Features of Sequential Search Algorithm
1. It is used for unsorted and unordered small list of elements.

2. It has a time complexity of O(n), which means the time is linearly dependent
on the number of elements; this is not bad, but not very good either.
3. It has a very simple implementation.

6.4 BINARY SEARCH


Another relatively simple method of accessing a table is the binary search method.
The entries in the table are stored in alphabetically or numerically increasing order.
The drawback of sequential search can be eliminated if it becomes possible to
eliminate large portions of the list from consideration in subsequent iterations.
Binary search requires sorted data to operate on. In this method, search begins by
examining the records in the middle of file rather than the one at one of the ends as
in sequential search.
Let us assume that the file being searched is sorted on increasing value of the key.
Based on the result of comparing the search key K with the middle key, Km, the
following conclusions can be drawn:
 If K < Km then if the record being searched for is in the file, it must be in the
lower numbered half of the file.
 If K = Km then the middle record is the one being searched for.
 If K > Km then if the record being searched for is in the file, it must be in the
higher numbered half of the file.
A search for a particular item which contains key value resembles the search for a
name in a telephone directory. The approximate middle entry of the table is located,
and its key value is examined. If its value is too high, then the key value of the
middle entry of the first half of the table is examined and the procedure is repeated
on the first half until the required item is found. If the value is too low, then the key
of middle entry of the second half of the table is tried and the procedure is repeated
on the second half.
Algorithm
Procedure of binary Search
Begin
    low = 0;
    hi = n - 1;
    while (low <= hi)
    {
        mid = (low + hi) / 2;
        if (key == k(mid))
            return (mid);
        if (key < k(mid))
            hi = mid - 1;
        else
            low = mid + 1;
    } /* end while */
End

EFFICIENCY OF BINARY SEARCH

Each comparison in the binary search reduces the number of possible candidates by
a factor of 2. Thus, the maximum number of key comparisons is approximately
log_2 n.

Features of Binary Search


1. It is well suited to searching large sorted arrays.
2. It has a time complexity of O(log n) which is a very good time complexity.
3. It has a simple implementation.

6.5 INDEXED SEQUENTIAL SEARCH


This method increases the amount of space required for searching. In this method an
auxiliary table, called an index is created in addition to the stored file itself. Each
element in the index consists of a key KINDEX and a pointer to the record in the
file that corresponds to KINDEX.
The elements in the index, as well as the elements in the file, must be sorted on the
key. If the index is one third the size of the file, every third record of the file is
represented in the index.

Fig. 6.1: Indexed Sequential File

6.6 TIME COMPLEXITY OF ALGORITHM

For any well-defined problem, there can be any number of solutions. This is true in
general: if you discuss a problem with several people, each may suggest a different
solution, and you must decide which solution is best under the circumstances.
Similarly, for any problem which must be solved using a program, there can be
many solutions. Let's take a simple example to understand this. Below we have two
different algorithms to find the square of a number (for the moment, forget that the
square of any number n is n*n):
One solution to this problem is to run a loop n times, adding n to an accumulator
on each iteration.
/*
we have to calculate the square of n
*/
square = 0
for i = 1 to n
    do square = square + n
// when the loop ends, square holds n * n
return square
Or, we can simply use a mathematical operator * to find the square.
/*
we have to calculate the square of n
*/
return n*n
In the above two simple algorithms, you saw how a single problem can have many
solutions. While the first solution required a loop which executes n times, the
second solution used the mathematical operator * to return the result in one step.
Which one is the better approach? Of course, the second one.
What is Time Complexity?
Time complexity of an algorithm signifies the total time required by the program to
run till its completion.
The time complexity of algorithms is most commonly expressed using the big O
notation, an asymptotic notation for representing time complexity. We will study
it in detail later in this section.
Time complexity is most commonly estimated by counting the number of
elementary steps an algorithm performs to finish execution. In the example above,
the loop in the first solution runs n times, so its time complexity is at least n, and
the time taken grows as n grows. For the second solution, the time complexity is
constant: it never depends on the value of n and always gives the result in one step.
Since an algorithm's performance may vary with different types of input data, we
usually quote the worst-case time complexity of an algorithm, because that is the
maximum time taken for any input of a given size.
Calculating Time Complexity

Now let us turn to the next topic related to time complexity: how to calculate it.
This can be confusing at times, but we will try to explain it in the simplest way.
Now the most common metric for calculating time complexity is Big O notation.
This removes all constant factors so that the running time can be estimated in
relation to N, as N approaches infinity. In general, you can think of it like this:
statement;
Above we have a single statement. Its Time Complexity will be Constant. The
running time of the statement will not change in relation to N.
for (i = 0; i < N; i++)
{
statement;
}
The time complexity for the above algorithm will be Linear. The running time of
the loop is directly proportional to N. When N doubles, so does the running time.
for (i=0; i < N; i++)
{
for (j=0; j < N;j++)
{
statement;
}
}
This time, the time complexity for the above code will be Quadratic. The running
time of the two loops is proportional to the square of N. When N doubles, the
running time increases by N * N.
while (low <= high)
{
    mid = (low + high) / 2;
    if (target < list[mid])
        high = mid - 1;
    else if (target > list[mid])
        low = mid + 1;
    else
        break;
}
This is an algorithm that breaks a set of numbers into halves to search for a
particular value (it is the binary search loop discussed earlier in this unit). This
algorithm has a Logarithmic time complexity. The running time of the algorithm is
proportional to the number of times N can be divided by 2 (here N is high - low),
because the algorithm halves the working area with each iteration.
void quicksort (int list[], int left, int right)
{
    if (left < right)
    {
        int pivot = partition (list, left, right);
        quicksort (list, left, pivot - 1);
        quicksort (list, pivot + 1, right);
    }
}
Taking the previous algorithm forward, above we have the core logic of Quick Sort
(studied in an earlier unit). Quick Sort divides the list into parts at every level,
and each level of division does work proportional to N (where N is the size of the
list), over roughly log N levels. Hence the time complexity is N*log(N): the
running time combines linear work per level with a logarithmic number of levels,
so the algorithm is a combination of linear and logarithmic.
NOTE: In general, doing something with every item in one dimension is linear,
doing something with every item in two dimensions is quadratic, and dividing the
working area in half is logarithmic.
Types of Notations for Time Complexity
Now we will discuss and understand the various notations used for Time
Complexity.
1. Big Oh denotes "fewer than or the same as" <expression> iterations.
2. Big Omega denotes "more than or the same as" <expression> iterations.
3. Big Theta denotes "the same as" <expression> iterations.
4. Little Oh denotes "fewer than" <expression> iterations.
5. Little Omega denotes "more than" <expression> iterations.
Understanding Notations of Time Complexity with Example
O(expression) is the set of functions that grow slower than or at the same rate as
expression. It indicates the maximum time required by an algorithm for all input
values. It represents the worst case of an algorithm's time complexity.
Omega(expression) is the set of functions that grow faster than or at the same rate
as expression. It indicates the minimum time required by an algorithm for all input
values. It represents the best case of an algorithm's time complexity.
Theta(expression) consists of all the functions that lie in both O(expression) and
Omega(expression). It indicates the average bound of an algorithm. It represents the
average case of an algorithm's time complexity.
Suppose you've calculated that an algorithm takes f(n) operations, where
f(n) = 3*n^2 + 2*n + 4. // n^2 means square of n
Since this polynomial grows at the same rate as n^2, you could say that the
function f lies in the set Theta(n^2). (It also lies in the sets O(n^2) and
Omega(n^2) for the same reason.)
The simplest explanation is that Theta denotes the same rate of growth as the
expression. Hence, as f(n) grows like n^2, its time complexity is best represented
as Theta(n^2).
‘O’ Notation in detail
Given two functions f(n) and g(n), we say that f(n) is of the order of g(n), or that
f(n) is O(g(n)), if there exist positive integers a and b such that

f(n) <= a * g(n) for all n >= b.

For example, if f(n) = n^2 + 100n and g(n) = n^2, then f(n) is O(g(n)), since
n^2 + 100n is less than or equal to 2n^2 for all n greater than or equal to 100.
In this case a = 2 and b = 100.

The same f(n) is also O(n^3), since n^2 + 100n is less than or equal to 2n^3 for
all n greater than or equal to 8. If f(n) is O(g(n)) and g(n) is O(h(n)), then f(n)
is O(h(n)). For example, n^2 + 100n is O(n^2) and n^2 is O(n^3), so n^2 + 100n is
O(n^3) (for a = 1, b = 1). This is called the transitive property.

If the function is c * n^k, then its order is O(n^k) for any constants c and k,
since c * n^k is less than or equal to c * n^k for any n >= 1 (i.e. a = c, b = 1).
If f(n) is n^k, then it is also O(n^(k+j)) for any j >= 0 (for a = 1, b = 1).

If f(n) and g(n) are both O(h(n)), then the function f(n) + g(n) is also O(h(n)).
If f(n) is any polynomial whose leading power is k
[i.e. f(n) = c_1 * n^k + c_2 * n^(k-1) + ... + c_k * n + c_(k+1)], then

f(n) is O(n^k).

ALGORITHM EFFICIENCY IN LOGARITHMIC FUNCTION

Let log_m n and log_k n be two functions. Let x_m be log_m n and x_k be log_k n.
Then

m^(x_m) = n and k^(x_k) = n

since if m^x = n, then log_m n = x. Therefore

m^(x_m) = k^(x_k).

Taking log_m of both sides,

x_m = log_m (k^(x_k)).

Now it can easily be shown that log_z (x^y) equals y * log_z x for any x, y and z,
so the last equation can be written as

log_m n = x_k * log_m k      [since x_m = log_m n]

or as

log_m n = (log_m k) * log_k n      [since x_k = log_k n].

Thus log_m n and log_k n are constant multiples of each other. Since if
f(n) = c * g(n) then f(n) is O(g(n)), it follows that log_m n is O(log_k n) and
log_k n is O(log_m n) for any m and k.
The following facts establish an order hierarchy of functions:
 C is O(1) for any constant C.
 C is O(log n), but log n is not O(1).
 C * log_k n is O(log n) for any constants C, k.
 C * log_k n is O(n), but n is not O(log n).
 C * n^k is O(n^k) for any constants C, k.
 C * n^k is O(n^(k+j)), but n^(k+j) is not O(n^k).
 C * n * log_k n is O(n log n) for any constants C, k.
 C * n * log_k n is O(n^2), but n^2 is not O(n log n).
 C * n^j * log_k n is O(n^j log n) for any constants C, j, k.
 C * n^j * log_k n is O(n^(j+1)), but n^(j+1) is not O(n^j log n).
 C * n^j * (log_k n)^l is O(n^j (log n)^l) for any constants C, j, k, l.
 C * n^j * (log_k n)^l is O(n^(j+1)), but n^(j+1) is not O(n^j (log n)^l).

6.6 CHECK YOUR PROGRESS

1. Fill in the blanks:


a. The average number of comparisons done by sequential search are ————.
b. The maximum number of key comparisons in binary search is ——————.
c. ———— operations take most of the execution time.
d. ———— is also known as partition exchange sort.
e. —————— reduces the drawbacks of sequential search.
2. For a linear search, the average number of comparison for a file with n records
are:
a. n/2
b. n
c. log_2 n
d. None of the above
3. Pick up the odd one from the following:
a. range sort
b. radix sort
c. hash heap sort
d. none of the above
6.7 SUMMARY

1. Searching and sorting are two important operations associated with any data
structure.
2. Efficient and reliable data processing depends upon sorted data.
3. The internal and external sorting methods have their relative efficiencies in
different applications.
4. There are three methods for performing searching:
a. Sequential Searching
b. Binary Searching
c. Index Sequential Searching
5. Binary search reduces the drawbacks of sequential search.
6. In Bubble sort in each pass each element is compared with its successor. i.e.,
x[i] with x[i+1] and interchanged if they are not in the proper order.
7. Total number of comparisons in bubble sort are O(n^2).
8. In quick sort an element x is chosen from a specific position within the array
such that each element in position 0 through that element is less than or equal
to x and each of the elements in position greater than the position of x is
greater than or equal to x.

6.8 KEYWORDS

Sequential Search: The simplest searching technique, which scans an unordered
table entry by entry in a sequential manner until the desired record is found.
Binary Search: The searching technique applied for searching the entries in a
sorted table.
Time Complexity: Time complexity of an algorithm signifies the total time
required by the program to run till its completion.

6.9 SELF ASSESSMENT TEST

1. What is searching? What are the different techniques for searching?


2. Explain binary searching. How is binary searching better than sequential
searching? Explain with example.
3. What is indexed sequential searching? Write a 'C' program for indexed
sequential searching.
4. Write an algorithm for binary search and explain its time complexity.
5. Why is there a need for sorting?
6. Why is bubble sort called by this name?
7. Write a 'C' program for bubble sort. Discuss its time complexity.
8. Explain Quick Sort.
9. Write a 'C' program for quick sort. Explain its time complexity.
10. When is the bubble sort better than the quick sort.
11. Show that the algorithm for quick sort takes O(n^2) time when the input file is
already in sorted order.

6.10 ANSWERS TO CHECK YOUR PROGRESS

1. Answers to fill in the blanks


a. n
b. log_2 n
c. critical
d. quick sort
e. Binary Search
2. b
3. b

6.11 REFERENCES / SUGGESTED READINGS
1. Birkhanser-Boston, An Introduction to Data Structures and Algorithms,
Springer-New York
2. Seymour Lipschutz, “Data Structure”, Tata McGraw Hill.
3. Horowitz, Sahni & Anderson-Freed, “Fundamentals of Data Structure in C”,
Orient Longman.
4. Trembley, J.P. And Sorenson P.G., “An Introduction to Data Structures with
Applications”, McGraw- Hill International Student Edition, New York.
5. Yedidyan Langsam, Moshe J. Augenstein and Aaron M. Tenenbaum, “Data
Structures using C”, Prentice Hall of India Pvt. Ltd., New Delhi.
6. Mark Allen Weiss, “Data structures and Algorithm Analysis in C”,
Addison-Wesley (An Imprint of Pearson Education), Mexico City,
Prentice-Hall of India Pvt. Ltd., New Delhi.
