Data Structures Notes
Course Objective:
To understand the significance of data structures and apply them in building efficient
algorithms
Course Outcomes:
After completion of this course, the student will be able to
1. Understand the concepts of time and space complexities.
2. Understand the concept of Abstract Data Type.
3. Choose appropriate data structures to represent data items in real-world problems.
4. Analyze the time and space complexities of algorithms.
5. Design programs using a variety of data structures such as stacks, queues, hash tables,
binary trees, search trees, heaps, graphs, and B-trees, and implement various searching and
sorting techniques.
UNIT I
Basic Concepts
Data objects and Structures,
Algorithm Specification-Introduction,
Recursive algorithms,
Data abstraction,
Performance analysis-Time complexity and Space complexity,
Asymptotic Notation-Big O, Omega and Theta notations,
Complexity Analysis Examples,
Introduction to Linear and Non-Linear data structures.
1 Basic Concepts
To develop a program for an algorithm, we should first select an appropriate data structure for that
algorithm. Data structures are classified as follows:
A data structure is said to be linear if its elements form a sequence or a linear list. The linear data
structures like an array, stacks, queues and linked lists organize data in linear order. A data structure
is said to be nonlinear if its elements form a hierarchical classification where, data items appear at
various levels.
Trees and Graphs are widely used non-linear data structures. Tree and graph structures represent
hierarchical relationships between individual data elements. Graphs are nothing but trees with
certain restrictions removed.
Primitive Data Structures are the basic data structures that directly operate upon the machine
instructions. They have different representations on different computers. Integers, floating point
numbers, character constants, string constants and pointers come under this category.
Non-primitive data structures are more complicated data structures derived from primitive
data structures. They emphasize grouping the same or different data items with a relationship between
each data item. Arrays, lists and files come under this category. Figure 1.1 shows the classification of
data structures.
The collection of data you work with in a program has some kind of structure or organization. No
matter how complex your data structures are, they can be broken down into two fundamental types:
• Contiguous
• Non-Contiguous.
In contiguous structures, items of data are kept together in memory (either RAM or in a file). An
array is an example of a contiguous structure, since each element in the array is located next to one
or two other elements. In contrast, items in a non-contiguous structure are scattered in memory,
but are linked to each other in some way. A linked list is an example of a non-contiguous data
structure. Here, the nodes of the list are linked together using pointers stored in each node. Figure
1.2 below illustrates the difference between contiguous and non-contiguous structures.
Contiguous structures:
Contiguous structures can be broken down further into two kinds: those that contain data items of
all the same size, and those where the size may differ. Figure 1.3 shows an example of each kind. The
first kind is called the array. Figure 1.3(a) shows an example of an array of numbers. In an array, each
element is of the same type, and thus has the same size.
The second kind of contiguous structure is called the structure (struct). Figure 1.3(b) shows a simple structure
consisting of a person's name and age. In a struct, elements may be of different data types and thus
may have different sizes.
For example, a person's age can be represented with a simple integer that occupies two bytes of
memory. But his or her name, represented as a string of characters, may require many bytes and
may even be of varying length.
Coupled with the atomic types (that is, the single data-item built-in types such as integer, float and
pointers), arrays and structs provide all the "mortar" you need to build more exotic forms of data
structure, including the non-contiguous forms.
Non-contiguous structures:
Non-contiguous structures are implemented as a collection of data-items, called nodes, where each
node can point to one or more other nodes in the collection. The simplest kind of non-contiguous
structure is linked list.
A linked list represents a linear, one-dimensional type of non-contiguous structure, where there is only
the notion of backwards and forwards. A tree, such as shown in figure 1.4(b), is an example of a
two-dimensional non-contiguous structure. Here, there is the notion of up and down and of left and
right.
In a tree each node has only one link that leads into the node and links can only go down the tree.
The most general type of non-contiguous structure, called a graph has no such restrictions. Figure
1.4(c) is an example of a graph.
Hybrid structures:
If the two basic types of structures are mixed, the result is a hybrid form: one part is contiguous and
another part is non-contiguous. For example, figure 1.5 shows how to implement a doubly-linked list
using three parallel arrays, possibly stored apart from each other in memory.
The array D contains the data for the list, whereas the arrays P and N hold the previous and next
"pointers". The pointers are actually nothing more than indexes into the D array. For instance, D[i]
holds the data for node i and P[i] holds the index of the node previous to i, which may or may not
reside at position i-1. Likewise, N[i] holds the index of the next node in the list.
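As a small illustration of this hybrid scheme, the following sketch (the node positions and values are invented for the example) stores a three-node doubly-linked list in the parallel arrays D, P and N and traverses it through the N indexes:

#include <iostream>
using namespace std;
const int MAX = 10;
int D[MAX];   // data for node i
int P[MAX];   // index of the previous node (-1 = none)
int N[MAX];   // index of the next node (-1 = none)
int main() {
    // build the list 10 <-> 20 <-> 30, with the nodes deliberately
    // placed at the non-consecutive positions 4, 0 and 7
    D[4] = 10; P[4] = -1; N[4] = 0;    // head at index 4
    D[0] = 20; P[0] = 4;  N[0] = 7;
    D[7] = 30; P[7] = 0;  N[7] = -1;   // tail at index 7
    // traverse forwards using the N "pointers" (array indexes)
    for (int i = 4; i != -1; i = N[i])
        cout << D[i] << " ";
    cout << "\n";                      // prints: 10 20 30
    return 0;
}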
An algorithm is a sequence of unambiguous instructions which can be used to solve a problem. In
addition, every algorithm must satisfy the following criteria.
Input: there are zero or more quantities, which are externally supplied.
Output: at least one quantity is produced.
Definiteness: each instruction must be clear and unambiguous.
Finiteness: if we trace out the instructions of an algorithm, then for all cases the algorithm will
terminate after a finite number of steps.
Effectiveness: every instruction must be sufficiently basic that it can in principle be carried out by a
person using only pencil and paper. It is not enough that each operation be definite; it must also
be feasible.
There are some criteria that we need to take care of while designing an algorithm for any process:
a program that runs faster is a better program, so saving time is an obvious goal. Likewise, a program
that saves space over a competing program is considered desirable.
Recursion is of two types depending on whether a function calls itself from within itself or
whether two functions call one another mutually. The former is called direct recursion and
the latter is called indirect recursion.
Thus, there are two types of recursion:
Direct Recursion
Indirect Recursion
Recursion may be further categorized as:
Linear Recursion
Binary Recursion
Multiple Recursion
Linear Recursion: It is the most common type of recursion, in which a function calls itself
repeatedly until the base condition (termination case) is reached. Once the base case is reached,
the results are returned to the caller function. If a recursive function calls itself only once, it is
called a linear recursion.
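A minimal sketch of linear recursion, using the factorial function (our choice of example, picked because it makes exactly one recursive call per step):

#include <iostream>
using namespace std;
// factorial calls itself once per step until the base case n == 0
long factorial(int n) {
    if (n == 0)                       // base condition (termination case)
        return 1;
    return n * factorial(n - 1);      // single self-call: linear recursion
}
int main() {
    cout << factorial(5) << "\n";     // prints 120
    return 0;
}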
In C++, we use access labels to define the abstract interface to the class. A class may contain
zero or more access labels −
Members defined with a public label are accessible to all parts of the program. The
data-abstraction view of a type is defined by its public members.
Members defined with a private label are not accessible to code that uses the class.
The private sections hide the implementation from code that uses the type.
There are no restrictions on how often an access label may appear. Each access label
specifies the access level of the succeeding member definitions. The specified access level
remains in effect until the next access label is encountered or the closing right brace of the
class body is seen.
Data abstraction provides two important advantages:
Class internals are protected from inadvertent user-level errors, which might corrupt
the state of the object.
The class implementation may evolve over time in response to changing requirements
or bug reports without requiring changes in user-level code.
By defining data members only in the private section of the class, the class author is free to
make changes in the data. If the implementation changes, only the class code needs to be
examined to see what effect the change may have. If data is public, then any function that
directly accesses the data members of the old representation might be broken.
Ex:
#include <iostream>
using namespace std;
class Adder {
public:
   // constructor
   Adder(int i = 0) {
      total = i;
   }
   // interface to the outside world
   void addNum(int number) {
      total += number;
   }
   // interface to the outside world
   int getTotal() {
      return total;
   }
private:
   // hidden data from outside world
   int total;
};
int main() {
   Adder a;
   a.addNum(10);
   a.addNum(20);
   a.addNum(30);
   cout << "Total " << a.getTotal() << endl;
   return 0;
}
For example, when analyzing some algorithm, one might find that the time (or the
number of steps) it takes to complete a problem of size n is given by T(n) = 4n^2 - 2n + 2.
If we ignore constants (which makes sense because those depend on the particular
hardware the program is run on) and slower-growing terms, we could say "T(n) grows at the
order of n^2" and write T(n) = O(n^2).
T(n) = O(f(n)) (pronounced "order of" or "big oh") says that the growth rate of T(n) is less
than or equal to (<=) that of f(n).
Omega Notation (Ω)
The Ω notation provides an asymptotic lower bound. It is useful for finding the best time an
algorithm can take.
Ω(f(n)) = {g(n) : there exist constants c > 0 and n0 such that g(n) >= c.f(n) for all n > n0}.
Theta Notation (Θ)
The Θ notation describes asymptotic tight bounds. It is a theoretical measure of the execution
of an algorithm, usually the time or memory needed, given the problem size n, which is usually
the number of items. Informally, saying f(n) = Θ(g(n)) means that f(n) is within a constant
multiple of g(n). The equation is read "f of n is theta g of n".
f(n) = Θ(g(n)) means there are positive constants c1, c2, and k such that 0 <= c1.g(n)
<= f(n) <= c2.g(n) for all n >= k. The values of c1, c2 and k must be fixed for the function f and
must not depend on n.
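As a worked example of the Theta definition, take T(n) = 4n^2 - 2n + 2 from above; one valid (not the only) choice of constants is c1 = 2, c2 = 5 and k = 1:
2n^2 <= 4n^2 - 2n + 2, since 4n^2 - 2n + 2 - 2n^2 = 2n^2 - 2n + 2 > 0 for all n >= 1
4n^2 - 2n + 2 <= 5n^2, since 5n^2 - (4n^2 - 2n + 2) = n^2 + 2n - 2 >= 0 for all n >= 1
Hence 2n^2 <= T(n) <= 5n^2 for all n >= 1, and therefore T(n) = Θ(n^2).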
UNIT-II
2.1 Sparse matrices-array and linked representations.
2.2 Linear list ADT-array representation and linked list representation,
2.3 Singly Linked Lists-Operations Insertion, Deletion,
2.4 Circular linked lists-Operations for Circular linked lists,
2.5 Doubly Linked Lists Operations-Insertion, Deletion.
2.6 Stack ADT, definition, array and linked list implementations,
2.7 applications-infix to postfix conversion, Postfix expression evaluation,
2.8 recursion implementation,
2.9 Queue ADT, definition, array and linked list, Implementations,
2.10 Circular Queues-Insertion and deletion operations,
2.11 Polynomial.
A matrix is a two-dimensional data object made of m rows and n columns, therefore having
total m x n values. If most of the elements of the matrix have 0 value, then it is called a
sparse matrix.
0 0 3 0 4
0 0 5 7 0
0 0 0 0 0
0 2 6 0 0
Representing a sparse matrix by a 2D array leads to wastage of a lot of memory, as the zeroes in
the matrix are of no use in most cases. So, instead of storing zeroes along with the non-zero
elements, we store only the non-zero elements. This means storing non-zero elements
as triples (Row, Column, Value).
Sparse matrix representation can be done in many ways; the following are two common
representations:
1. Array representation
2. Linked list representation
Method 1: Using Arrays:
A 2D array is used to represent a sparse matrix, with three rows named:
Row: index of the row where the non-zero element is located
Column: index of the column where the non-zero element is located
Value: value of the non-zero element located at index (row, column)
#include <iostream>
using namespace std;
int main()
{
    // Assume 4x5 sparse matrix
    int sparseMatrix[4][5] =
    {
        {0 , 0 , 3 , 0 , 4 },
        {0 , 0 , 5 , 7 , 0 },
        {0 , 0 , 0 , 0 , 0 },
        {0 , 2 , 6 , 0 , 0 }
    };
    // count the non-zero elements and print each (row, column, value) triple
    int size = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++)
            if (sparseMatrix[i][j] != 0)
            {
                cout << i << " " << j << " "
                     << sparseMatrix[i][j] << "\n";
                size++;
            }
    cout << "Number of non-zero elements: " << size << "\n";
    return 0;
}
Method 2: Using Linked Lists:
#include <iostream>
using namespace std;
// Node to represent a non-zero element of the sparse matrix
struct Node
{
    int row;
    int col;
    int data;
    Node *next;
};
// append a new (row_index, col_index, x) node at the end of the list
void create_new_node(Node **start, int row_index, int col_index, int x)
{
    Node *r = new Node();
    r->row = row_index;
    r->col = col_index;
    r->data = x;
    r->next = NULL;
    if (*start == NULL)
        *start = r;
    else
    {
        Node *temp = *start;
        while (temp->next != NULL)
            temp = temp->next;
        temp->next = r;
    }
}
// print the list as (row, column, value) triples
void printList(Node *start)
{
    Node *ptr = start;
    while (ptr != NULL)
    {
        cout << ptr->row << " " << ptr->col << " " << ptr->data << "\n";
        ptr = ptr->next;
    }
}
// Driver Code
int main()
{
    // Assume 4x5 sparse matrix
    int sparseMatrix[4][5] =
    {
        {0 , 0 , 3 , 0 , 4 },
        {0 , 0 , 5 , 7 , 0 },
        {0 , 0 , 0 , 0 , 0 },
        {0 , 2 , 6 , 0 , 0 }
    };
    Node *first = NULL;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++)
            if (sparseMatrix[i][j] != 0)
                create_new_node(&first, i, j, sparseMatrix[i][j]);
    printList(first);
    return 0;
}
Data structures can also be classified as:
Static/Dynamic: memory is allocated at compile time (static) or at run time (dynamic).
Linear/Non-Linear: the elements do or do not maintain a linear relationship between them.
Problem solving with a computer means processing data
To process data, we need to define the data type and the operation to be performed on the
data
The definition of the data type and the definition of the operation to be applied to the data is
part of the idea behind an Abstract Data Type (ADT)
The user of an ADT needs only to know that a set of operations is available for the data
type, but does not need to know how they are applied.
Several simple ADTs, such as integer, real, character, pointer and so on, have been
implemented and are available for use in most languages.
A data type is characterized by:
A set of values
A Data representation, which is common to all these values, and
A set of operations, which can be applied uniformly to all these values.
Languages like Java provide the following primitive data types:
boolean
char, byte, int
float, double
Each primitive data type has:
A set of values
A Data type representation
A set of operations
In computer science, an abstract data type (ADT) is a mathematical model for a certain class
of data structures that have similar behaviour.
An abstract data type is defined indirectly, only by the operations that may be performed on it
and by mathematical constraints on the effects (and possibly cost) of those operations.
An ADT may be implemented by specific data types or data structures, in many ways and in
many programming languages; or described in a formal specification language.
For example, an abstract stack could be defined by three operations:
Push, that inserts some data item onto the structure;
Pop, that extracts an item from it; and
Peek (or Top), that returns the top item without removing it.
The List ADT
A general list is of the form A1, A2, ..., An, where n is the size of the list:
Position of Ai is i
First element A1 is called the "head"
Last element An is called the "tail"
Operations on Lists
MakeEmpty
PrintList
Find
FindKth
Insert
Delete
Next
Previous
List An Example:
The elements of list are 34,12,52,16,12
Find(52)->3
Insert(20,4)-> 34,12,52,20,16,12
Delete(52)-> 34,12,20,16,12
FindKth(3)-> 20
List can be implemented using:
Arrays
Linked List
Cursor [Linked List using Arrays]
Arrays
An array is a static data structure that represents a collection of a fixed number of
homogeneous data items, or
a fixed-size indexed sequence of elements, all of the same type.
The individual elements are typically stored in consecutive memory locations.
The length of the array is determined when the array is created and cannot be changed.
Any component of the array can be inspected or updated by using its index.
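The operations listed earlier can be sketched over an array-based list as follows; the fixed capacity, the helper names and the 1-based positions are our assumptions, chosen to match the example above (34, 12, 52, 16, 12):

#include <iostream>
using namespace std;
const int CAP = 100;                     // assumed fixed capacity
int A[CAP];
int n = 0;                               // current length of the list
void Insert(int x, int pos) {            // insert x at 1-based position pos
    for (int i = n; i > pos - 1; i--)
        A[i] = A[i - 1];                 // shift right to make room
    A[pos - 1] = x;
    n++;
}
void Delete(int x) {                     // delete the first occurrence of x
    int i = 0;
    while (i < n && A[i] != x) i++;
    if (i == n) return;                  // not found
    for (; i < n - 1; i++) A[i] = A[i + 1];   // shift left over the gap
    n--;
}
int Find(int x) {                        // 1-based position of x, 0 if absent
    for (int i = 0; i < n; i++)
        if (A[i] == x) return i + 1;
    return 0;
}
int FindKth(int k) { return A[k - 1]; }
int main() {
    int init[] = {34, 12, 52, 16, 12};
    for (int v : init) A[n++] = v;
    cout << Find(52) << "\n";            // 3
    Insert(20, 4);                       // 34 12 52 20 16 12
    Delete(52);                          // 34 12 20 16 12
    cout << FindKth(3) << "\n";          // 20
    return 0;
}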
https://www.slideshare.net/ayyakathir/unit-i-linear-data-structures-list
10x^2 + 26x: here 10 and 26 are coefficients, and 2 and 1 are their exponents.
Points to keep in mind while working with polynomials:
The sign of each coefficient and exponent is stored within the coefficient and
the exponent itself.
Terms having equal exponents can be combined into a single term.
The storage allocation for each term in the polynomial must be done in
ascending or descending order of their exponents.
There may arise some situations where you need to evaluate many polynomial
expressions and perform basic arithmetic operations like addition and subtraction
with those numbers. For this, you will have to find a way to represent those
polynomials. The simple way is to represent a polynomial of degree 'n' by storing
the coefficients of its n+1 terms in an array. So every array element
will consist of two values:
Coefficient and
Exponent
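A sketch of this array-of-terms representation, storing the 10x^2 + 26x example from above and evaluating it at an arbitrary point x = 3:

#include <iostream>
using namespace std;
// each array element stores one term's coefficient and exponent,
// kept in descending order of exponent
struct Term {
    int coeff;
    int exp;
};
int main() {
    Term p[] = { {10, 2}, {26, 1} };   // 10x^2 + 26x
    int n = sizeof(p) / sizeof(p[0]);
    int x = 3;
    long value = 0;
    for (int i = 0; i < n; i++) {
        long power = 1;
        for (int j = 0; j < p[i].exp; j++)
            power *= x;                // x raised to the term's exponent
        value += p[i].coeff * power;
    }
    cout << value << "\n";             // 10*9 + 26*3 = 168
    return 0;
}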
UNIT III
3.1 Trees
3.1.1 Definition, terminology,
3.1.2 Binary trees-definition,
3.1.3 Properties of Binary Trees, Binary Tree ADT,
3.1.4 representation of Binary Trees-array and linked representations,
3.1.5 Binary Tree traversals,
3.1.6 threaded binary trees,
3.2 Priority Queues –Definition and applications,
3.2.1 Max Priority Queue ADT implementation-Max Heap-Definition,
3.2.2 Insertion into a Max Heap, Deletion from a Max Heap.
3.3 Disjoint set ADT -Equivalence relations,
3.3.1 the dynamic equivalence problem,
3.3.2 Basic data structure,
3.3.3 Smart union algorithms, Path compression,
3.3.4 worst case for union by rank and path compression, and an application -generation of
mazes.
Tree - Terminology
In a linear data structure, data is organized in sequential order; in a non-linear data
structure, data is organized in hierarchical (non-sequential) order. A tree is a very popular non-linear data
structure used in a wide range of applications. A tree data structure can be defined
as follows...
In tree data structure, every individual element is called as Node. Node in a tree
data structure stores the actual data of that particular element and link to next
element in hierarchical structure.
In a tree data structure, if we have N number of nodes then we can have a maximum
of N-1 number of links.
Example
Terminology
In a tree data structure, we use the following terminology...
1. Root
In a tree data structure, the first node is called as Root Node. Every tree must have
a root node. We can say that the root node is the origin of the tree data structure. In
any tree, there must be only one root node. We never have multiple root nodes in a
tree.
2. Edge
In a tree data structure, the connecting link between any two nodes is called
as EDGE. In a tree with 'N' number of nodes there will be a maximum of 'N-1'
number of edges.
3. Parent
In a tree data structure, the node which is a predecessor of any node is called
as PARENT NODE. In simple words, the node which has a branch from it to any
other node is called a parent node. Parent node can also be defined as "The node
which has child / children".
4. Child
In a tree data structure, the node which is descendant of any node is called
as CHILD Node. In simple words, the node which has a link from its parent node is
called as child node. In a tree, any parent node can have any number of child nodes.
In a tree, all the nodes except root are child nodes.
5. Siblings
In a tree data structure, nodes which belong to same Parent are called as SIBLINGS.
In simple words, the nodes with the same parent are called Sibling nodes.
6. Leaf
In a tree data structure, the node which does not have a child is called as LEAF
Node. In simple words, a leaf is a node with no child.
In a tree data structure, the leaf nodes are also called as External Nodes. External
node is also a node with no child. In a tree, leaf node is also called as ' Terminal'
node.
7. Internal Nodes
In a tree data structure, the node which has atleast one child is called as INTERNAL
Node. In simple words, an internal node is a node with atleast one child.
In a tree data structure, nodes other than leaf nodes are called as Internal
Nodes. The root node is also said to be Internal Node if the tree has more than
one node. Internal nodes are also called as 'Non-Terminal' nodes.
8. Degree
In a tree data structure, the total number of children of a node is called
as DEGREE of that Node. In simple words, the Degree of a node is total number of
children it has. The highest degree of a node among all the nodes in a tree is called
as 'Degree of Tree'
9. Level
In a tree data structure, the root node is said to be at Level 0 and the children of root
node are at Level 1 and the children of the nodes which are at Level 1 will be at
Level 2 and so on... In simple words, in a tree each step from top to bottom is called
as a Level and the Level count starts with '0' and incremented by one at each level
(Step).
10. Height
In a tree data structure, the total number of edges from leaf node to a particular
node in the longest path is called as HEIGHT of that Node. In a tree, height of the
root node is said to be height of the tree. In a tree, height of all leaf nodes is
'0'.
11. Depth
In a tree data structure, the total number of edges from root node to a particular
node is called as DEPTH of that Node. In a tree, the total number of edges from root
node to a leaf node in the longest path is said to be Depth of the tree. In simple
words, the highest depth of any leaf node in a tree is said to be depth of that tree. In
a tree, depth of the root node is '0'.
12. Path
In a tree data structure, the sequence of Nodes and Edges from one node to another
node is called as PATH between that two Nodes. Length of a Path is total number
of nodes in that path. In below example the path A - B - E - J has length 4.
Tree Representations
A tree data structure can be represented in two methods. Those methods are as
follows...
1. List Representation
2. Left Child - Right Sibling Representation
1. List Representation
In this representation, we use two types of nodes one for representing the node with
data called 'data node' and another for representing only references called
'reference node'. We start with a 'data node' from the root node in the tree. Then it is
linked to an internal node through a 'reference node' which is further linked to any
other node directly. This process repeats for all the nodes in the tree.
The above example tree can be represented using List representation as follows...
2. Left Child - Right Sibling Representation
In this representation, every node's data field stores the actual value of that node. If
that node has a left child, then the left reference field stores the address of that left child
node; otherwise it stores NULL. If that node has a right sibling, then the right reference
field stores the address of that right sibling node; otherwise it stores NULL.
The above example tree can be represented using Left Child - Right Sibling
representation as follows...
In a normal tree, every node can have any number of children. A binary tree is a
special type of tree data structure in which every node can have a maximum of 2
children. One is known as a left child and the other is known as right child.
A tree in which every node can have a maximum of two children is called
Binary Tree.
A binary tree in which every node has either two or zero children is called
a Strictly Binary Tree.
A strictly binary tree is also called a Full Binary Tree, Proper Binary Tree or 2-
Tree.
Example
A binary tree in which every internal node has exactly two children and all
leaf nodes are at same level is called Complete Binary Tree.
The full binary tree obtained by adding dummy nodes to a binary tree is
called as Extended Binary Tree.
In the above figure, a normal binary tree is converted into a full binary tree by adding
dummy nodes (in pink colour).
A tree is an abstract data type that stores elements hierarchically. With the
exception of the top element, each element in a tree has a parent element and
zero or
more children elements. A tree is usually visualized by placing elements inside
ovals or rectangles, and by drawing the connections between parents and
children
with straight lines. (See Figure 7.2.) We typically call the top element the root
of the tree, but it is drawn as the highest element, with the other elements being
connected below (just the opposite of a botanical tree)
UNIT IV
4.1 Searching
4.1.1 Linear Search,
4.1.2 Binary Search,
4.1.3 Hashing-Introduction,
4.1.4 hash tables,
4.1.5 hash functions,
4.1.6 Overflow Handling,
4.1.7 Comparison of Searching methods.
4.2 Sorting:
4.2.1 Insertion Sort,
4.2.2 Selection Sort,
4.2.3 Radix Sort,
4.2.4 Quick sort,
4.2.5 Heap Sort,
4.2.6 Merge sort,
4.2.7 Comparison of Sorting methods.
The wide variety of mass storage devices makes external sorting much more device dependent than
internal sorting. The algorithms that we will consider work on tapes, which are probably the most
restrictive storage medium. Since access to an element on tape is done by winding the tape to the
correct location, tapes can be efficiently accessed only in sequential order (in either direction). We
will assume that we have at least three tape drives to perform the sorting. We need two drives to do
an efficient sort; the third drive simplifies matters. If only one tape drive is present, then we are in
trouble: any algorithm will require O(n^2) tape accesses.
The basic external sorting algorithm uses the merge routine from merge sort. Suppose we have four
tapes, Ta1, Ta2, Tb1, Tb2, which are two input and two output tapes.
Depending on the point in the algorithm, the a and b tapes are either input tapes or output tapes.
Suppose the data is initially on Ta1. Suppose further that the internal memory can hold (and sort) m
records at a time. A natural first step is to read m records at a time from the input tape, sort the
records internally, and then write the sorted records alternately to Tb1 and Tb2. We will call each set
of sorted records a run. When this is done, we rewind all the tapes. Suppose we have the same input
as our example for Shell sort.
Now Tb1 and Tb2 contain a group of runs. We take the first run from each tape and merge them,
writing the result, which is a run twice as long, onto Ta1. Then we take the next run from each tape,
merge these, and write the result to Ta2. We continue this process, alternating between Ta1 and
Ta2, until either Tb1 or Tb2 is empty. At this point either both are empty or there is one run left. In
the latter case, we copy this run to the appropriate tape. We rewind all four tapes, and repeat the
same steps, this time using the a tapes as input and the b tapes as output. This will give runs of 4m.
We continue the process until we get one run of length n.
This algorithm will require ⌈log(n/m)⌉ passes, plus the initial run-constructing pass. For instance, if we
have 10 million records of 128 bytes each, and four megabytes of internal memory, then the first
pass will create 320 runs. We would then need nine more passes to complete the sort. Our example
requires ⌈log(13/3)⌉ = 3 more passes, which are shown in the following figure.
Multiway Merge
If we have extra tapes, then we can expect to reduce the number of passes required to sort our
input. We do this by extending the basic (two-way) merge to a k-way merge.
Merging two runs is done by winding each input tape to the beginning of each run. Then the smaller
element is found, placed on an output tape, and the appropriate input tape is advanced. If there are
k input tapes, this strategy works the same way, the only difference being that it is slightly more
complicated to find the smallest of the k elements. We can find the smallest of these elements by
using a priority queue. To obtain the next element to write on the output tape, we perform a
delete_min operation. The appropriate input tape is advanced, and if the run on the input tape is not
yet completed, we insert the new element into the priority queue. Using the same example as
before, we distribute the input onto the three tapes
After the initial run construction phase, the number of passes required using k-way merging is
⌈logk(n/m)⌉, because the runs get k times as large in each pass. For the example above, the formula is
verified, since ⌈log3(13/3)⌉ = 2. If we have 10 tapes, then k = 5, and our large example from the
previous section would require ⌈log5 320⌉ = 4 passes.
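A sketch of the k-way merge step described above, using a priority queue for the delete_min; in-memory vectors stand in for the input tapes, and the three short runs are illustrative:

#include <iostream>
#include <queue>
#include <vector>
using namespace std;
int main() {
    // each vector plays the role of one input tape holding a sorted run
    vector<vector<int>> runs = { {1, 5, 9}, {2, 6, 10}, {3, 4, 8} };
    // items are (value, (run index, position)), ordered by smallest value
    typedef pair<int, pair<int, int>> Item;
    priority_queue<Item, vector<Item>, greater<Item>> pq;
    for (int r = 0; r < (int)runs.size(); r++)
        pq.push(make_pair(runs[r][0], make_pair(r, 0)));
    while (!pq.empty()) {
        Item it = pq.top(); pq.pop();          // delete_min
        cout << it.first << " ";               // write to the output tape
        int r = it.second.first, pos = it.second.second + 1;
        if (pos < (int)runs[r].size())         // advance that input tape
            pq.push(make_pair(runs[r][pos], make_pair(r, pos)));
    }
    cout << "\n";   // prints the merged run: 1 2 3 4 5 6 8 9 10
    return 0;
}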
Polyphase Merge
The k-way merging strategy developed in the last section requires the use of 2k tapes. This could be
prohibitive for some applications. It is possible to get by with only k + 1 tapes. As an example, we will
show how to perform two-way merging using only three tapes
Suppose we have three tapes, T1, T2, and T3, and an input file on T1 that will produce 34 runs. One
option is to put 17 runs on each of T2 and T3. We could then merge this result onto T1, obtaining
one tape with 17 runs. The problem is that since all the runs are on one tape, we must now put
some of these runs on T2 to perform another merge. The logical way to do this is to copy the first
eight runs from T1 onto T2 and then perform the merge. This has the effect of adding an extra half
pass for every pass we do.
An alternative method is to split the original 34 runs unevenly. Suppose we put 21 runs on T2 and 13
runs on T3. We would then merge 13 runs onto T1 before T3 was empty. At this point, we could
rewind T1 and T3, and merge T1, with 13 runs, and T2, which has 8 runs, onto T3. We could then
merge 8 runs until T2 was empty, which would leave 5 runs left on T1 and 8 runs on T3. We could
then merge T1 and T3, and so on. The following table below shows the number of runs on each tape
after each pass.
The original distribution of runs makes a great deal of difference. For instance, if 22 runs are placed
on T2, with 12 on T3, then after the first merge, we obtain 12 runs on T1 and 10 runs on T2. After
another merge, there are 10 runs on T1 and 2 runs on T3. At this point the going gets slow, because
we can only merge two sets of runs before T3 is exhausted. Then T1 has 8 runs and T2 has 2 runs.
Again, we can only merge two sets of runs, obtaining T1 with 6 runs and T3 with 2 runs. After three
more passes, T2 has two runs and the other tapes are empty. We must copy one run to another
tape, and then we can finish the merge.
It turns out that the first distribution we gave is optimal. If the number of runs is a Fibonacci number
Fn, then the best way to distribute them is to split them into two Fibonacci numbers Fn-1 and Fn-2.
Otherwise, it is necessary to pad the tape with dummy runs in order to get the number of runs up to
a Fibonacci number. We leave the details of how to place the initial set of runs on the tapes as an
exercise.
We can extend this to a k-way merge, in which case we need kth order Fibonacci numbers for the
distribution, where the kth order Fibonacci number is defined as F(k)(n) = F(k)(n-1) + F(k)(n-2) +
... + F(k)(n-k), with the appropriate initial conditions F(k)(n) = 0 for 0 <= n <= k-2, and F(k)(k-1) = 1.
Replacement Selection
The last item we will consider is construction of the runs. The strategy we have used so far is the
simplest possible: We read as many records as possible and sort them, writing the result to some
tape. This seems like the best approach possible, until one realizes that as soon as the first record is
written to an output tape, the memory it used becomes available for another record. If the next
record on the input tape is larger than the record we have just output, then it can be included in the
run.
Using this observation, we can give an algorithm for producing runs. This technique is commonly
referred to as replacement selection.
Initially, m records are read into memory and placed in a priority queue. We perform a delete_min,
writing the smallest record to the output tape. We read the next record from the input tape. If it is
larger than the record we have just written, we can add it to the priority queue. Otherwise, it cannot
go into the current run. Since the priority queue is smaller by one element, we can store this new
element in the dead space of the priority queue until the run is completed and use the element for
the next run. Storing an element in the dead space is similar to what is done in heapsort. We
continue doing this until the size of the priority queue is zero, at which point the run is over. We
start a new run by building a new priority queue, using all the elements in the dead space. Figure
7.18 shows the run construction for the small example we have been using, with m = 3. Dead
elements are indicated by an asterisk.
In this example, replacement selection produces only three runs, compared with the five runs
obtained by sorting. Because of this, a three-way merge finishes in one pass instead of two. If the
input is randomly distributed, replacement selection can be shown to produce runs of average
length 2m. For our large example, we would expect 160 runs instead of 320 runs, so a five-way
merge would require four passes. In this case, we have not saved a pass, although we might if we get
lucky and have 125 runs or less. Since external sorts take so long, every pass saved can make a
significant difference in the running time.
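The run-construction step can be sketched as follows; the thirteen-record input is assumed to match the Shell sort example referred to in the text, and with m = 3 it produces the three runs mentioned above. For simplicity the dead space is modelled as a separate list rather than as slots inside the heap array:

#include <iostream>
#include <queue>
#include <vector>
using namespace std;
int main() {
    vector<int> input = {81, 94, 11, 96, 12, 35, 17, 95, 28, 58, 41, 75, 15};
    const size_t m = 3;                                  // records that fit in memory
    size_t next = 0;
    priority_queue<int, vector<int>, greater<int>> pq;   // min-heap
    vector<int> dead;                                    // "dead" elements for the next run
    while (next < m && next < input.size()) pq.push(input[next++]);
    int runNo = 1;
    while (!pq.empty()) {
        cout << "Run " << runNo << ": ";
        while (!pq.empty()) {
            int smallest = pq.top(); pq.pop();           // delete_min
            cout << smallest << " ";                     // write to the output tape
            if (next < input.size()) {
                if (input[next] >= smallest)
                    pq.push(input[next]);                // joins the current run
                else
                    dead.push_back(input[next]);         // held for the next run
                next++;
            }
        }
        cout << "\n";
        for (size_t i = 0; i < dead.size(); i++) pq.push(dead[i]);
        dead.clear();                                    // dead elements seed the next run
        runNo++;
    }
    return 0;
}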
UNIT V
5.1 Graphs
5.1.1 Definitions, Terminology,
A graph is a pictorial representation of a set of objects where some pairs of objects are
connected by links. The interconnected objects are represented by points termed as vertices,
and the links that connect the vertices are called edges.
Formally, a graph is a pair of sets (V, E), where V is the set of vertices and E is the set of
edges, connecting the pairs of vertices. Take a look at the following graph −
Mathematical graphs can be represented in data structure. We can represent a graph using an
array of vertices and a two-dimensional array of edges. Before we proceed further, let's
familiarize ourselves with some important terms −
Vertex − Each node of the graph is represented as a vertex. In the following example,
the labeled circle represents vertices. Thus, A to G are vertices. We can represent
them using an array as shown in the following image. Here A can be identified by
index 0. B can be identified using index 1 and so on.
Edge − An edge represents a path between two vertices or a line between two vertices.
In the following example, the lines from A to B, B to C, and so on represent edges.
We can use a two-dimensional array to represent the edges as shown in the following
image. Here AB can be represented as 1 at row 0, column 1; BC as 1 at row 1,
column 2; and so on, keeping other combinations as 0.
Adjacency − Two nodes or vertices are adjacent if they are connected to each other
through an edge. In the following example, B is adjacent to A, C is adjacent to B,
and so on.
Path − Path represents a sequence of edges between the two vertices. In the
following example, ABCD represents a path from A to D.
Graph Traversals
Depth First Search (DFS) algorithm traverses a graph in a depthward motion and uses a
stack to remember to get the next vertex to start a search, when a dead end occurs in any
iteration.
As in the example given above, the DFS algorithm traverses from S to A to D to G to E to B first, then to
F and lastly to C. It employs the following rules.
Rule 1 − Visit the adjacent unvisited vertex. Mark it as visited. Display it. Push it in a stack.
Rule 2 − If no adjacent vertex is found, pop up a vertex from the stack. (It will pop up all the
vertices from the stack, which do not have adjacent vertices.)
Rule 3 − Repeat Rule 1 and Rule 2 until the stack is empty.
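A sketch of these three rules using an explicit stack over an adjacency matrix; the small five-vertex graph here is our own illustration, not the graph in the figure:

#include <iostream>
#include <stack>
#include <vector>
using namespace std;
int main() {
    // adjacency matrix of an illustrative undirected graph A..E
    vector<vector<int>> adj = {
        {0, 1, 1, 0, 0},
        {1, 0, 0, 1, 0},
        {1, 0, 0, 0, 1},
        {0, 1, 0, 0, 0},
        {0, 0, 1, 0, 0}
    };
    int n = (int)adj.size();
    vector<bool> visited(n, false);
    stack<int> s;
    visited[0] = true;                    // Rule 1: visit, mark, display...
    cout << char('A' + 0) << " ";
    s.push(0);                            // ...and push it on the stack
    while (!s.empty()) {                  // Rule 3: repeat until empty
        int v = s.top(), next = -1;
        for (int u = 0; u < n; u++)
            if (adj[v][u] && !visited[u]) { next = u; break; }
        if (next == -1)
            s.pop();                      // Rule 2: no unvisited neighbour
        else {
            visited[next] = true;         // Rule 1 for the neighbour
            cout << char('A' + next) << " ";
            s.push(next);
        }
    }
    cout << "\n";   // prints: A B D C E
    return 0;
}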
We choose B, mark it as visited and put it onto the stack. Here B does not have any unvisited
adjacent node, so we pop B from the stack.
As C does not have any unvisited adjacent node, we keep popping the stack until we find
a node that has an unvisited adjacent node. In this case, there's none and we keep popping
until the stack is empty.
Breadth First Search (BFS) algorithm traverses a graph in a breadthward motion and uses a
queue to remember to get the next vertex to start a search, when a dead end occurs in any
iteration.
As in the example given above, BFS algorithm traverses from A to B to E to F first then to C
and G lastly to D. It employs the following rules.
Rule 1 − Visit the adjacent unvisited vertex. Mark it as visited. Display it. Insert it in
a queue.
Rule 2 − If no adjacent vertex is found, remove the first vertex from the queue.
Rule 3 − Repeat Rule 1 and Rule 2 until the queue is empty.
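The same sketch with a queue in place of the stack gives BFS; the illustrative graph is the one used in the DFS sketch above:

#include <iostream>
#include <queue>
#include <vector>
using namespace std;
int main() {
    vector<vector<int>> adj = {
        {0, 1, 1, 0, 0},
        {1, 0, 0, 1, 0},
        {1, 0, 0, 0, 1},
        {0, 1, 0, 0, 0},
        {0, 0, 1, 0, 0}
    };
    int n = (int)adj.size();
    vector<bool> visited(n, false);
    queue<int> q;
    visited[0] = true;                    // Rule 1: visit, mark, display...
    cout << char('A' + 0) << " ";
    q.push(0);                            // ...and insert it in the queue
    while (!q.empty()) {                  // Rule 3: repeat until empty
        int v = q.front(), next = -1;
        for (int u = 0; u < n; u++)
            if (adj[v][u] && !visited[u]) { next = u; break; }
        if (next == -1)
            q.pop();                      // Rule 2: remove the front vertex
        else {
            visited[next] = true;
            cout << char('A' + next) << " ";
            q.push(next);
        }
    }
    cout << "\n";   // prints: A B C D E
    return 0;
}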
We start from visiting S (the starting node), and mark it as visited.
From A we have D as an unvisited adjacent node. We mark it as visited and enqueue it.
At this stage, we are left with no unmarked (unvisited) nodes. But as per the algorithm we
keep on dequeuing in order to get all unvisited nodes. When the queue gets emptied, the
program is over.
DFS vs BFS:
1. DFS uses a stack; BFS uses a queue.
2. DFS traverses the graph depth-wise; BFS traverses it breadth-wise.
3. DFS visits the first adjacent vertex; BFS visits all the adjacent vertices.
Binary Search Tree
A Binary Search Tree exhibits a special behavior: a node's left child must have a value less
than its parent's value, and the node's right child must have a value greater than its parent's
value.
Tree Node
The code to write a tree node would be similar to what is given below. It has a data part and
references to its left and right child nodes.
struct node {
int data;
struct node *leftChild;
struct node *rightChild;
};
In a tree, all nodes share a common construct.
The basic operations that can be performed on a binary search tree data structure, are the
following −
Insert − Inserts an element in a tree/create a tree.
Search − Searches an element in a tree.
Preorder Traversal − Traverses a tree in a pre-order manner.
Inorder Traversal − Traverses a tree in an in-order manner.
Postorder Traversal − Traverses a tree in a post-order manner.
We shall learn creating (inserting into) a tree structure and searching a data item in a tree in
this chapter. We shall learn about tree traversing methods in the coming chapter.
Insert Operation
The very first insertion creates the tree. Afterwards, whenever an element is to be inserted,
first locate its proper location. Start searching from the root node, then if the data is less than
the key value, search for the empty location in the left subtree and insert the data. Otherwise,
search for the empty location in the right subtree and insert the data.
Algorithm
If root is NULL
   then create root node and return
else
   while insertion position is not found
      If data < current node's data
         go to the left subtree
      else
         go to the right subtree
   endwhile
   insert data
end If
Implementation
The implementation of insert function should look like this −
void insert(int data) {
   struct node *tempNode = (struct node*) malloc(sizeof(struct node));
   struct node *current, *parent;
   tempNode->data = data;
   tempNode->leftChild = NULL;
   tempNode->rightChild = NULL;
   if(root == NULL) {               // the first insertion creates the tree
      root = tempNode;
      return;
   }
   current = root;
   while(1) {
      parent = current;
      if(data < parent->data) {     // go to the left of the tree
         current = current->leftChild;
         if(current == NULL) { parent->leftChild = tempNode; return; }
      } else {                      // go to the right of the tree
         current = current->rightChild;
         if(current == NULL) { parent->rightChild = tempNode; return; }
      }
   }
}
Search Operation
Whenever an element is to be searched, start searching from the root node, then if the data is
less than the key value, search for the element in the left subtree. Otherwise, search for the
element in the right subtree. Follow the same algorithm for each node.
Algorithm
If root.data is equal to search.data
   return root
else
   while data not found
      If data < current node's data
         go to the left subtree
      else
         go to the right subtree
      If data found
         return node
   endwhile
   return data not found
end if
The implementation of this algorithm should look like this.
struct node* search(int data) {
   struct node *current = root;
   printf("Visiting elements: ");
   while(current->data != data) {
      if(current != NULL)
         printf("%d ",current->data);
      //go to the left subtree
      if(current->data > data)
         current = current->leftChild;
      //else go to the right subtree
      else
         current = current->rightChild;
      //not found
      if(current == NULL) {
         return NULL;
      }
   }
   return current;
}
Traversal is a process to visit all the nodes of a tree and may print their values too. Because
all nodes are connected via edges (links), we always start from the root (head) node. That is,
we cannot randomly access a node in a tree. There are three ways which we use to traverse a
tree −
In-order Traversal
Pre-order Traversal
Post-order Traversal
Generally, we traverse a tree to search or locate a given item or key in the tree or to print all
the values it contains.
In-order Traversal
In this traversal method, the left subtree is visited first, then the root and later the right sub-
tree. We should always remember that every node may represent a subtree itself.
If a binary tree is traversed in-order, the output will produce sorted key values in an
ascending order.
We start from A, and following in-order traversal, we move to its left subtree B. B is also
traversed in-order. The process goes on until all the nodes are visited. The output of inorder
traversal of this tree will be −
D→B→E→A→F→C→G
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Visit root node.
Step 3 − Recursively traverse right subtree.
Pre-order Traversal
In this traversal method, the root node is visited first, then the left subtree and finally the
right subtree.
We start from A, and following pre-order traversal, we first visit A itself and then move to
its left subtree B. B is also traversed pre-order. The process goes on until all the nodes are
visited. The output of pre-order traversal of this tree will be −
A→B→D→E→C→F→G
Algorithm
Until all nodes are traversed −
Step 1 − Visit root node.
Step 2 − Recursively traverse left subtree.
Step 3 − Recursively traverse right subtree.
Post-order Traversal
In this traversal method, the root node is visited last, hence the name. First we traverse the
left subtree, then the right subtree and finally the root node.
We start from A, and following Post-order traversal, we first visit the left subtree B. B is
also traversed post-order. The process goes on until all the nodes are visited. The output of
post-order traversal of this tree will be −
D→E→B→F→G→C→A
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Recursively traverse right subtree.
Step 3 − Visit root node.
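The three traversals can be sketched as recursive functions over the node structure shown earlier; here the data field is a char so that the output matches the A–G example tree, and the tree-building helper is our own:

#include <iostream>
using namespace std;
struct node {
    char data;
    struct node *leftChild;
    struct node *rightChild;
};
node* newNode(char d, node* l, node* r) {
    node* t = new node;
    t->data = d; t->leftChild = l; t->rightChild = r;
    return t;
}
void inorder(node* t) {             // left subtree, root, right subtree
    if (t == NULL) return;
    inorder(t->leftChild);
    cout << t->data << " ";
    inorder(t->rightChild);
}
void preorder(node* t) {            // root, left subtree, right subtree
    if (t == NULL) return;
    cout << t->data << " ";
    preorder(t->leftChild);
    preorder(t->rightChild);
}
void postorder(node* t) {           // left subtree, right subtree, root
    if (t == NULL) return;
    postorder(t->leftChild);
    postorder(t->rightChild);
    cout << t->data << " ";
}
int main() {
    // the example tree: A with children B (D, E) and C (F, G)
    node* root = newNode('A',
        newNode('B', newNode('D', NULL, NULL), newNode('E', NULL, NULL)),
        newNode('C', newNode('F', NULL, NULL), newNode('G', NULL, NULL)));
    inorder(root);   cout << "\n";  // D B E A F C G
    preorder(root);  cout << "\n";  // A B D E C F G
    postorder(root); cout << "\n";  // D E B F G C A
    return 0;
}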
A Binary Search Tree (BST) is a tree in which all the nodes follow the below-mentioned
properties −
The value of the key of the left sub-tree is less than the value of its parent (root)
node's key.
The value of the key of the right sub-tree is greater than or equal to the value of its
parent (root) node's key.
Thus, BST divides all its sub-trees into two segments; the left sub-tree and the right sub-tree
and can be defined as −
left_subtree (keys) < node (key) ≤ right_subtree (keys)
Representation
BST is a collection of nodes arranged in a way where they maintain BST properties. Each
node has a key and an associated value. While searching, the desired key is compared to the
keys in BST and if found, the associated value is retrieved.
Following is a pictorial representation of BST −
We observe that the root node key (27) has all less-valued keys on the left sub-tree and the
higher valued keys on the right sub-tree.
Basic Operations
Node
Define a node having some data, references to its left and right child nodes.
struct node {
int data;
struct node *leftChild;
struct node *rightChild;
};
Search Operation
Whenever an element is to be searched, start searching from the root node. Then if the data
is less than the key value, search for the element in the left subtree. Otherwise, search for the
element in the right subtree. Follow the same algorithm for each node.
Algorithm
struct node* search(int data){
   struct node *current = root;
   printf("Visiting elements: ");
   while(current->data != data){
      if(current != NULL)
         printf("%d ",current->data);
      //go to the left subtree
      if(current->data > data)
         current = current->leftChild;
      //else go to the right subtree
      else
         current = current->rightChild;
      //not found
      if(current == NULL){
         return NULL;
      }
   }
   return current;
}
Insert Operation
Whenever an element is to be inserted, first locate its proper location. Start searching from
the root node, then if the data is less than the key value, search for the empty location in the
left subtree and insert the data. Otherwise, search for the empty location in the right subtree
and insert the data.
Algorithm
void insert(int data) {
   struct node *tempNode = (struct node*) malloc(sizeof(struct node));
   struct node *current;
   struct node *parent;
   tempNode->data = data;
   tempNode->leftChild = NULL;
   tempNode->rightChild = NULL;
   if(root == NULL) {      // tree is empty: the new node becomes root
      root = tempNode;
   } else {
      current = root;
      parent = NULL;
      while(1) {
         parent = current;
         //go to left of the tree
         if(data < parent->data) {
            current = current->leftChild;
            //insert to the left
            if(current == NULL) {
               parent->leftChild = tempNode;
               return;
            }
         } //go to right of the tree
         else {
            current = current->rightChild;
            //insert to the right
            if(current == NULL) {
               parent->rightChild = tempNode;
               return;
            }
         }
      }
   }
}
An AVL tree checks the heights of the left and right subtrees of every node and assures
that the difference is not more than 1. In the second tree, the left subtree of C has height 2
and the right subtree has height 0, so the difference is 2. In the third tree, the right subtree
of A has height 2 and the left is missing, so it is 0, and the difference is 2 again. An AVL
tree permits the difference (balance factor) to be only 1.
AVL Rotations
To balance itself, an AVL tree may perform the following four kinds of rotations −
Left rotation
Right rotation
Left-Right rotation
Right-Left rotation
The first two rotations are single rotations and the next two rotations are double
rotations. To have an unbalanced tree, we at least need a tree of height 2. With this
simple tree, let's understand them one by one.
Left Rotation
If a tree becomes unbalanced, when a node is inserted into the right subtree of the
right subtree, then we perform a single left rotation −
In our example, node A has become unbalanced as a node is inserted in the right
subtree of A's right subtree. We perform the left rotation by making A the left-
subtree of B.
Right Rotation
AVL tree may become unbalanced, if a node is inserted in the left subtree of the left
subtree. The tree then needs a right rotation.
As depicted, the unbalanced node becomes the right child of its left child by
performing a right rotation.
Left-Right Rotation
Double rotations are slightly complex version of already explained versions of
rotations. To understand them better, we should take note of each action performed
while rotation. Let's first check how to perform Left-Right rotation. A left-right
rotation is a combination of left rotation followed by right rotation.
State: A node has been inserted into the right subtree of the left subtree. This makes C an
unbalanced node. These scenarios cause the AVL tree to perform a left-right rotation.
Action: We first left-rotate the left subtree and then right-rotate the tree, making B the
new root node of this subtree. C now becomes the right subtree of its own left subtree.
Right-Left Rotation
The second type of double rotation is Right-Left Rotation. It is a combination of right
rotation followed by left rotation.
State: A node has been inserted into the left subtree of the right subtree. This makes A an
unbalanced node with balance factor 2.
Action: We first right-rotate the right subtree and then left-rotate the tree to restore the
balance.
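The two single rotations that these double rotations are built from can be sketched as follows; the node structure and names are our own, and the surrounding rebalancing logic (height bookkeeping, choosing which rotation to apply) is omitted:

#include <iostream>
using namespace std;
struct AvlNode {
    int data;
    AvlNode *left;
    AvlNode *right;
};
// the unbalanced node x becomes the left child of its right child
AvlNode* leftRotate(AvlNode* x) {
    AvlNode* y = x->right;
    x->right = y->left;   // y's left subtree moves under x
    y->left = x;
    return y;             // y is the new root of this subtree
}
// the unbalanced node y becomes the right child of its left child
AvlNode* rightRotate(AvlNode* y) {
    AvlNode* x = y->left;
    y->left = x->right;   // x's right subtree moves under y
    x->right = y;
    return x;             // x is the new root of this subtree
}
int main() {
    // 1 -> 2 -> 3 inserted to the right: left-rotate at the root
    AvlNode c = {3, NULL, NULL};
    AvlNode b = {2, NULL, &c};
    AvlNode a = {1, NULL, &b};
    AvlNode* root = leftRotate(&a);
    cout << root->data << " " << root->left->data << " "
         << root->right->data << "\n";   // prints: 2 1 3
    return 0;
}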
1. Search
2. Insertion
3. Deletion
Step 1 - Read the search element from the user.
Step 2 - Compare the search element with the value of root node in the tree.
Step 3 - If both are matched, then display "Given node is found!!!" and
terminate the function
Step 4 - If both are not matched, then check whether search element is
smaller or larger than that node value.
Step 5 - If search element is smaller, then continue the search process in left
subtree.
Step 6 - If search element is larger, then continue the search process in right
subtree.
Step 7 - Repeat the same until we find the exact element or until the search
element is compared with the leaf node.
Step 8 - If we reach the node having the value equal to the search value,
then display "Element is found" and terminate the function.
Step 9 - If we reach the leaf node and it is also not matched with the
search element, then display "Element is not found" and terminate the
function.
Step 1 - Insert the new element into the tree using Binary Search Tree
insertion logic.
Red-Black Trees are self-balancing binary search trees. There are some
conditions for each node; these are listed below −
Introduction:
A red-black tree is a kind of self-balancing binary search tree where each
node has an extra bit, and that bit is often interpreted as the colour (red or
black). These colours are used to ensure that the tree remains balanced
during insertions and deletions. Although the balance of the tree is not
perfect, it is good enough to reduce the searching time and maintain it
around O(log n) time, where n is the total number of elements in the tree.
This tree was invented in 1972 by Rudolf Bayer.
It must be noted that as each node requires only 1 bit of space to store the
colour information, these types of trees show an identical memory footprint to
the classic (uncoloured) binary search tree.
Rules That Every Red-Black Tree Follows:
1. Every node has a colour either red or black.
2. The root of the tree is always black.
3. There are no two adjacent red nodes (A red node cannot have a
red parent or red child).
4. Every path from a node (including root) to any of its descendants
NULL nodes has the same number of black nodes.
5. All leaf nodes are black nodes.
Why Red-Black Trees?
Most of the BST operations (e.g., search, max, min, insert, delete.. etc) take
O(h) time where h is the height of the BST. The cost of these operations may
become O(n) for a skewed Binary tree. If we make sure that the height of the
tree remains O(log n) after every insertion and deletion, then we can
guarantee an upper bound of O(log n) for all these operations. The height of
a Red-Black tree is always O(log n) where n is the number of nodes in the
tree.
Sr. No.   Algorithm   Time Complexity
1         Search      O(log n)
2         Insert      O(log n)
3         Delete      O(log n)
1. The black height of a red-black tree is the number of black nodes on a path
from the root to a leaf, counting the leaves (NIL nodes) as black nodes. So, a
red-black tree of height h has black height >= h/2.
2. Height of a red-black tree with n nodes is h <= 2 log2(n + 1).
3. All leaves (NIL) are black.
4. The black depth of a node is defined as the number of black nodes
from the root to that node i.e the number of black ancestors.
5. Every red-black tree is a special case of a binary tree.
For the program, you can refer to the AVL tree program.
Solution:
1. Start from the root.
2. Compare the inserting element with root, if less than root, then
recurse for left, else recurse for right.
3. If the element to search is found anywhere, return true, else return
false.
Here we will see some search trees and their differences. There are many different
search trees, and they differ in nature. The basic search tree is the Binary Search
Tree (BST). Some other search trees are the AVL tree, B-tree, Red-Black tree, splay
tree, etc.
These trees can be compared based on their operations. We will see the time
complexity of these trees.
1. Point (3, 6) will divide the space into two parts: draw line X = 3.
2. Point (2, 7) will divide the space to the left of line X = 3 into two
parts horizontally.
3. Point (17, 15) will divide the space to the right of line X = 3 into two
parts horizontally.
4. Point (6, 12) will divide the space below line Y = 15 and to the right
of line X = 3 into two parts.
5. Point (13, 15) will divide the space below line Y = 15 and to the
right of line X = 6 into two parts.
6. Point (10, 19) will divide the space to the right of line X = 3 and
above line Y = 15 into two parts.
The complete program builds a two-dimensional k-d tree (const int k = 2) by inserting the
points listed above, and then searches for two query points, printing whether each is found.
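A self-contained sketch of such a program, assuming the standard k-d tree scheme in which the splitting dimension cycles with the depth (depth mod k); the two query points are chosen for illustration so that the first is present in the tree and the second is not:

#include <iostream>
using namespace std;
const int k = 2;
struct Node {
    int point[k];
    Node *left, *right;
};
Node* newNode(const int arr[]) {
    Node* temp = new Node;
    for (int i = 0; i < k; i++)
        temp->point[i] = arr[i];
    temp->left = temp->right = NULL;
    return temp;
}
// insert by comparing on dimension depth % k at each level
Node* insertRec(Node* root, const int point[], unsigned depth) {
    if (root == NULL)
        return newNode(point);
    unsigned cd = depth % k;   // current splitting dimension
    if (point[cd] < root->point[cd])
        root->left = insertRec(root->left, point, depth + 1);
    else
        root->right = insertRec(root->right, point, depth + 1);
    return root;
}
bool samePoint(const int a[], const int b[]) {
    for (int i = 0; i < k; i++)
        if (a[i] != b[i]) return false;
    return true;
}
// search mirrors insert, comparing on dimension depth % k
bool searchRec(Node* root, const int point[], unsigned depth) {
    if (root == NULL) return false;
    if (samePoint(root->point, point)) return true;
    unsigned cd = depth % k;
    if (point[cd] < root->point[cd])
        return searchRec(root->left, point, depth + 1);
    return searchRec(root->right, point, depth + 1);
}
int main() {
    Node* root = NULL;
    int points[][k] = { {3, 6}, {2, 7}, {17, 15},
                        {6, 12}, {13, 15}, {10, 19} };
    int n = sizeof(points) / sizeof(points[0]);
    for (int i = 0; i < n; i++)
        root = insertRec(root, points[i], 0);
    int p1[] = {10, 19};   // present in the tree
    int p2[] = {12, 19};   // not present
    cout << (searchRec(root, p1, 0) ? "Found" : "Not Found") << "\n";
    cout << (searchRec(root, p2, 0) ? "Found" : "Not Found") << "\n";
    return 0;
}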
Output:
Found
Not Found