The document discusses hash tables and dictionaries as data structures for implementing sets and dictionaries. It covers topics like open hashing using chaining to handle collisions, closed hashing using rehashing functions like linear probing to handle collisions, and considerations for designing good hash functions, like satisfying the simple uniform hashing property. Example applications mentioned include symbol tables, text retrieval systems, and database systems.
It also discusses various data structures for implementing dictionaries, such as hash tables, trees, and skip lists, along with sets and the operations that can be performed on them. Hash tables provide an efficient way of implementing dictionaries, with an average-case time complexity of O(1).
The data structures discussed for implementing dictionaries include hash tables, binary search trees, balanced search trees, splay trees, tries, and multiway search trees.
The two main issues in hash tables are collisions, which occur when different keys hash to the same value, and the choice of a hash function that minimizes collisions and hashes uniformly.
1.1 Dictionaries
A dictionary is a container of elements from a totally ordered universe that supports the basic operations of inserting/deleting elements and searching for a given element. In this chapter, we present hash tables, which provide an efficient implicit realization of a dictionary; efficient explicit implementations include binary search trees and balanced search trees. The abstract data type Set includes dictionaries, priority queues, etc. as subclasses.
1.2 Sets
A set is a collection of well-defined elements; the members of a set are all different. A set ADT can be defined to comprise the following operations:
1. Union (A, B, C)
2. Intersection (A, B, C)
3. Difference (A, B, C)
4. Merge (A, B, C)
5. Find (x)
6. Member (x, A) or Search (x, A)
7. Makenull (A)
8. Equal (A, B)
9. Assign (A, B)
10. Insert (x, A)
11. Delete (x, A)
12. Min (A) (if A is an ordered set)
Set implementation: possible data structures include:
o bit vector
o array
o linked list (unsorted or sorted)
Dictionaries: a dictionary is a dynamic set ADT with the operations:
1. Makenull (D)
2. Insert (x, D)
3. Delete (x, D)
4. Search (x, D)
Dictionaries are useful in implementing symbol tables, text retrieval systems, database systems, page mapping tables, etc.
Implementation:
1. fixed-length arrays
2. linked lists: sorted, unsorted, skip lists
3. hash tables: open, closed
4. trees
o Binary Search Trees (BSTs)
o Balanced BSTs: AVL trees, Red-Black trees
o Splay Trees
o Multiway Search Trees: 2-3 trees, B-trees
o Tries
Let n be the number of elements in a dictionary D. The following is a summary of the worst-case complexity of some basic implementation methods (Search, Insert, Delete, Min):

Unsorted array:        O(n)  O(n)  O(n)  O(n)
Sorted linked list:    O(n)  O(n)  O(n)  O(1)
Unsorted linked list:  O(n)  O(n)  O(n)  O(n)

Among these, the sorted list has the best average case performance. In this chapter, we discuss two data structures for dictionaries, namely hash tables and skip lists.
1.3 Hash Tables
An extremely effective and practical way of implementing dictionaries: O(1) time for search, insert, and delete in the average case, and O(n) time in the worst case. By careful design, we can make the probability that more than constant time is required arbitrarily small. Hash tables are:
o open or external
o closed or internal
1.4 Open hashing
Let U be the universe of keys:
o integers
o character strings
o complex bit patterns
Let B = {0, 1, ..., m − 1}, where m > 0 is a positive integer, be the set of hash values (also called the buckets or bins). A hash function h : U → B associates buckets (hash values) to keys.
Two main issues:
1. Collisions: if x1 and x2 are two different keys, it is possible that h(x1) = h(x2). This is called a collision. Collision resolution is the most important issue in hash table implementations.
2. Hash functions: choosing a hash function that minimizes the number of collisions and also hashes uniformly is another critical issue.
Collision resolution by chaining: put all the elements that hash to the same value in a linked list. See Figure 3.1.
Figure 3.1: Collision resolution by chaining
Example: See Figure 3.2. Consider the keys 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100. Let the hash function be: h(x) = x % 7
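To make chaining concrete, here is a minimal C++ sketch of a chained hash table for integer keys, using h(x) = x % 7 from the example above; the class and method names are illustrative, not from the notes.

    #include <iostream>
    #include <list>
    #include <vector>

    // A minimal chained hash table with m buckets (m = 7 as above).
    class ChainedHashTable {
        std::vector<std::list<int>> buckets;
        int m;
    public:
        explicit ChainedHashTable(int m = 7) : buckets(m), m(m) {}

        int h(int x) const { return ((x % m) + m) % m; }   // handles negative keys too

        void insert(int x) { buckets[h(x)].push_front(x); }   // O(1): insert at head

        bool search(int x) const {                            // O(1 + alpha) on average
            for (int y : buckets[h(x)]) if (y == x) return true;
            return false;
        }

        void remove(int x) { buckets[h(x)].remove(x); }
    };

    int main() {
        ChainedHashTable T;
        for (int k : {0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100}) T.insert(k);
        std::cout << T.search(49) << " " << T.search(50) << "\n";   // prints "1 0"
    }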
The buckets may hold:
o unsorted lists
o sorted lists (these are better)
Insert (x, T): insert x at the head of the list T[h(key(x))].
Search (x, T): search for an element x in the list T[h(key(x))].
Delete (x, T): delete x from the list T[h(key(x))].
The worst-case complexity of all these operations is O(n). In the average case, the running time is O(1 + α), where

α = n/m is the load factor, (3.1)
n = number of elements stored, and (3.2)
m = number of hash values or buckets.

It is assumed that the hash value h(k) can be computed in O(1) time. If n is O(m), the average case complexity of these operations becomes O(1)!
1.5 Closed hashing
All elements are stored in the hash table itself. This avoids pointers; we only compute the sequence of slots to be examined. Collisions are handled by generating a sequence of rehash values. Given a key x, it has a hash value h(x, 0) and a set of rehash values h(x, 1), h(x, 2), ..., h(x, m − 1). We require that for every key x, the probe sequence

< h(x, 0), h(x, 1), h(x, 2), ..., h(x, m − 1) >

be a permutation of < 0, 1, ..., m − 1 >. This ensures that every hash table position is eventually considered as a slot for storing a record with key value x.
Search (x, T): the search continues until we find the element x (successful search) or an empty slot (unsuccessful search).
Delete (x, T): no deletion if the search is unsuccessful. If the search is successful, put the label DELETED on the slot (different from an empty slot).
Insert (x, T): no need to insert if the search is successful. If the search is unsuccessful, insert at the first position marked DELETED (or at the empty slot that ended the search).
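A hedged C++ sketch of these three rules, using linear probing (defined in the next subsection) as the rehash sequence; the explicit EMPTY/OCCUPIED/DELETED slot-state encoding is one natural implementation choice, not mandated by the notes.

    #include <vector>

    // Closed hashing with linear probing and DELETED markers.
    // m = 7 and h(x) = x % 7 match the running example.
    class ClosedHashTable {
        enum State { EMPTY, OCCUPIED, DELETED };
        std::vector<int> key;
        std::vector<State> state;
        int m;
    public:
        explicit ClosedHashTable(int m = 7) : key(m), state(m, EMPTY), m(m) {}

        // Returns the slot holding x, or -1. Stops at an EMPTY slot.
        int search(int x) const {
            for (int i = 0; i < m; i++) {
                int s = (x % m + i) % m;               // h(x, i) = (h(x) + i) mod m
                if (state[s] == EMPTY) return -1;      // unsuccessful search
                if (state[s] == OCCUPIED && key[s] == x) return s;
            }
            return -1;                                 // probed every slot
        }

        bool insert(int x) {
            if (search(x) != -1) return false;         // already present: no insert
            for (int i = 0; i < m; i++) {
                int s = (x % m + i) % m;
                if (state[s] != OCCUPIED) {            // first EMPTY or DELETED slot
                    key[s] = x; state[s] = OCCUPIED; return true;
                }
            }
            return false;                              // table full
        }

        bool remove(int x) {
            int s = search(x);
            if (s == -1) return false;                 // no delete on unsuccessful search
            state[s] = DELETED;                        // mark, do not empty the slot
            return true;
        }
    };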
1.6 Rehashing methods
Denote h(x, 0) by simply h(x).
1. Linear probing: h(x, i) = (h(x) + i) mod m
2. Quadratic probing: h(x, i) = (h(x) + C1·i + C2·i²) mod m, where C1 and C2 are constants.
3. Double hashing: h(x, i) = (h(x) + i·h'(x)) mod m, where h' is a second hash function.
A comparison of rehashing methods:
Linear probing: m distinct probe sequences; primary clustering.
Quadratic probing: m distinct probe sequences; no primary clustering, but secondary clustering.
Double hashing: m² distinct probe sequences; no primary clustering and no secondary clustering.
An example: assume linear probing with the following hashing and rehashing functions:
h(x, 0) = x % 7
h(x, i) = (h(x, 0) + i) % 7
Start with an empty table and suppose the keys 20, 30, 10, 45, 14 have been inserted, in that order (20 → slot 6, 30 → slot 2, 10 → slot 3, 45 collides at 3 and probes to 4, 14 → slot 0). Now perform:

Delete (45, T)   slot 4 is marked DELETED
Insert (16, T)   h(16) = 2 and slot 3 are occupied, so 16 goes into slot 4

The table now holds:
0: 14
1: empty
2: 30
3: 10
4: 16
5: empty
6: 20

Another example: let m be the number of slots, and assume that every even-numbered slot is occupied, every odd-numbered slot is empty, and any hash value between 0 and m − 1 is equally likely to be generated. With linear probing:

Expected number of probes for a successful search = 1
Expected number of probes for an unsuccessful search = (1/2)(1) + (1/2)(2) = 1.5

(An unsuccessful search needs one probe if it starts at an empty slot and two if it starts at an occupied one.)
1.7 Hash Functions
What is a good hash function? It should satisfy the simple uniform hashing property. Let U be the universe of keys and let the hash values be 0, 1, ..., m − 1. Assume that each key is drawn independently from U according to a probability distribution P; i.e., for k ∈ U, P(k) is the probability that k is drawn. Then simple uniform hashing requires that

Σ_{k : h(k) = j} P(k) = 1/m   for each j = 0, 1, ..., m − 1,

that is, each bucket is equally likely to be occupied.
Example of a hash function that satisfies the simple uniform hashing property: suppose the keys are known to be random real numbers k, independently and uniformly distributed in the range [0, 1).
Then h(k) = ⌊k·m⌋ satisfies the simple uniform hashing property. Qualitative information about P is often useful in the design process. For example, consider a compiler's symbol table, in which the keys are arbitrary character strings representing identifiers in a program. It is common for closely related symbols, say pt, pts, ptt, to appear in the same program. A good hash function would minimize the chance that such variants hash to the same slot. A common approach is to derive a hash value in a way that is expected to be independent of any patterns that might exist in the data.
o The division method computes the hash value as the remainder when the key is divided by a prime number. Unless that prime is somehow related to patterns in the distribution P, this method gives good results.
Division method: a key is mapped into one of m slots using the function
h(k) = k mod m
This requires only a single division, hence it is fast. m should not be:
o a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k;
o a power of 10, since then the hash function does not depend on all the decimal digits of k;
o 2^p − 1: if k is a character string interpreted in radix 2^p, two strings that are identical except for a transposition of two adjacent characters will hash to the same value.
Good values for m: primes not too close to exact powers of 2.
Multiplication method: there are two steps:
1. Multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA.
2. Multiply this fractional part by m and take the floor:
h(k) = ⌊m·(kA mod 1)⌋, where 0 < A < 1 is a constant.
An advantage of the method is that the value of m is not critical. We typically choose it to be a power of 2, m = 2^p for some integer p, so that we can easily implement the function on most computers as follows. Suppose the word size is w and assume that k fits into a single word. First multiply k by the w-bit integer ⌊A·2^w⌋. The result is a 2w-bit value

r1·2^w + r0

where r1 is the high-order word of the product and r0 is the low-order word. The desired p-bit hash value consists of the p most significant bits of r0. The method works with practically any value of A, but works better with some values than others; the optimal choice depends on the characteristics of the data being hashed. Knuth recommends A ≈ (√5 − 1)/2 = 0.6180339887... (the reciprocal of the golden ratio).
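A small sketch of this word-level computation, assuming w = 32 and a table of m = 2^p slots with 1 ≤ p ≤ 32; the constant 2654435769 is ⌊A·2^32⌋ for Knuth's recommended A.

    #include <cstdint>

    // Multiplication-method hash, as described above.
    uint32_t mult_hash(uint32_t k, int p) {     // assumes 1 <= p <= 32
        uint64_t r = (uint64_t)k * 2654435769u; // 2w-bit product r1*2^w + r0
        uint32_t r0 = (uint32_t)r;              // low-order word of the product
        return r0 >> (32 - p);                  // p most significant bits of r0
    }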
1.8 Universal Hashing
This involves choosing a hash function randomly, in a way that is independent of the keys that are actually going to be stored: we select the hash function at random from a carefully designed class of functions. Let H be a finite collection of hash functions that map a given universe U of keys into the range {0, 1, 2, ..., m − 1}. H is called universal if for each pair of distinct keys x, y ∈ U, the number of hash functions h ∈ H for which h(x) = h(y) is precisely equal to |H|/m.
With a function chosen at random from H, the chance of a collision between x and y, where x ≠ y, is exactly 1/m.
Example of a universal class of hash functions: let the table size m be prime. Decompose a key x into r + 1 bytes (i.e., characters or fixed-width binary strings), so that

x = (x0, x1, ..., xr),

and assume that the maximum value of a byte is less than m. Let a = (a0, a1, ..., ar) denote a sequence of r + 1 elements chosen randomly from the set {0, 1, ..., m − 1}, and define a hash function h_a by

h_a(x) = (Σ_{i=0}^{r} a_i·x_i) mod m.

With this definition, H = {h_a} can be shown to be universal. Note that it has m^(r+1) members.
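A sketch of this family in C++; the table size m is assumed prime and larger than the maximum byte value, and the coefficients are drawn once with rand() for simplicity (a sketch under those assumptions, not a production implementation).

    #include <cstdlib>
    #include <vector>

    // The universal class {h_a}: key split into bytes x_0..x_r,
    // coefficients a_i drawn at random when the table is created.
    struct UniversalHash {
        int m;                      // table size, assumed prime and > 255
        std::vector<int> a;         // r + 1 random coefficients

        UniversalHash(int m, int r) : m(m), a(r + 1) {
            for (int& ai : a) ai = std::rand() % m;   // random a = (a_0,...,a_r)
        }

        // h_a(x) = (sum of a_i * x_i) mod m over the bytes of x
        int operator()(const unsigned char* x, int len) const {
            long long s = 0;
            for (int i = 0; i < len && i < (int)a.size(); i++)
                s = (s + (long long)a[i] * x[i]) % m;
            return (int)s;
        }
    };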
1.9 Analysis of closed hashing
Load factor α = n/m; assume α < 1. Assume uniform hashing: in this scheme, the probe sequence < h(k, 0), ..., h(k, m − 1) > for each key k is equally likely to be any permutation of < 0, 1, ..., m − 1 >.
Result 1: Unsuccessful search. The expected number of probes in an unsuccessful search is at most 1/(1 − α).
Proof: in an unsuccessful search, every probe but the last accesses an occupied slot that does not contain the desired key, and the last slot probed is empty.
Let the random variable X denote the number of occupied slots probed before hitting an empty slot; X can take the values 0, 1, ..., n. It is easy to see that the expected number of probes in an unsuccessful search is 1 + E[X]. Let

p_i = P{X = i}, for i = 0, 1, 2, ...

For i > n, p_i = 0, since we can find at most n occupied slots. Thus the expected number of probes is

1 + Σ_{i≥0} i·p_i.   (*)

To evaluate (*), define

q_i = P{at least i probes access occupied slots}

and use the identity

Σ_{i≥0} i·p_i = Σ_{i≥1} q_i.
To compute q_i, we proceed as follows:

q_1 = n/m,

since the probability that the first probe accesses an occupied slot is n/m. With uniform hashing, a second probe, if necessary, is to one of the remaining m − 1 unprobed slots, n − 1 of which are occupied. Thus

q_2 = (n/m)·((n − 1)/(m − 1)),

since we make a second probe only if the first probe accesses an occupied slot. In general,

q_i = (n/m)·((n − 1)/(m − 1))···((n − i + 1)/(m − i + 1)) ≤ (n/m)^i = α^i,

since (n − j)/(m − j) ≤ n/m for 0 ≤ j < m.
Thus the expected number of probes in an unsuccessful search is at most

1 + α + α² + α³ + ... = 1/(1 − α).

One probe is always made; with probability approximately α a second probe is needed; with probability approximately α² a third probe is needed, and so on.
Result 2: Insertion. Insertion of an element requires at most 1/(1 − α) probes on average. The proof is obvious, because inserting a key requires an unsuccessful search followed by a placement of the key in the first empty slot found.
Result 3: Successful search. The expected number of probes in a successful search is at most

(1/α)·ln(1/(1 − α)) + 1/α,

assuming that each key in the table is equally likely to be searched for. A search for a key k follows the same probe sequence as was followed when the element with key k was inserted. Thus, if k was the (i + 1)-th key inserted into the hash table, the expected number of probes made in a search for k is at most

1/(1 − i/m) = m/(m − i).

Averaging over all the n keys in the hash table gives the expected number of probes in a successful search:

(1/n)·Σ_{i=0}^{n−1} m/(m − i) = (1/α)·(H_m − H_{m−n}),

where H_i = Σ_{j=1}^{i} 1/j is the i-th Harmonic number. Using the well-known bounds
ln i ≤ H_i ≤ ln i + 1,

we obtain

(1/α)·(H_m − H_{m−n}) ≤ (1/α)·(ln m + 1 − ln(m − n))
                      = (1/α)·ln(m/(m − n)) + 1/α
                      = (1/α)·ln(1/(1 − α)) + 1/α.

Result 4: Deletion. The expected number of probes in a deletion is at most (1/α)·ln(1/(1 − α)) + 1/α. The proof is obvious, since a deletion always follows a successful search.
Figure 3.3: Performance of closed hashing
Figure 3.3 depicts the performance of closed hashing for all four operations discussed above.
1.10 Hash Table Restructuring
When a hash table has a large number of entries (say, n > 2m in an open hash table, or n > 0.9m in a closed hash table), the average time for operations can become quite substantial. In such cases, one idea is simply to create a new hash table with more buckets (say, twice as many, or any appropriately larger number). The currently existing elements then have to be inserted into the new table. This calls for:
o rehashing all the key values
o transferring all the records
This effort is no more than it took to insert the elements into the original table, and subsequent dictionary operations become more efficient, which can more than make up for the overhead of creating the larger table.
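A minimal sketch of the restructuring step for an open (chained) table, assuming non-negative integer keys and the division hash; the doubling policy is one common choice, not the only one.

    #include <list>
    #include <vector>

    // Allocate a table twice as large and re-insert every element.
    void rehash(std::vector<std::list<int>>& table) {
        std::vector<std::list<int>> bigger(2 * table.size());
        for (const auto& bucket : table)
            for (int x : bucket)                          // every key is hashed again
                bigger[x % bigger.size()].push_front(x);  // assumes non-negative keys
        table.swap(bigger);                               // O(n + m) overall
    }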
1.11 Skip Lists
Skip lists use probabilistic balancing rather than strictly enforced balancing. Although skip lists have bad worst-case performance, no input sequence consistently produces the worst-case performance (much like quicksort). It is very unlikely that a skip list will be significantly unbalanced. For example, in a dictionary of more than 250 elements, the chance that a search will take more than 3 times the expected time is less than 10^-6. Skip lists have balance properties similar to those of search trees built by random insertions, yet they do not require insertions to be random.
Figure 3.4: A singly linked list
Consider a singly linked list as in Figure 3.4. We might need to examine every node of the list when searching it. Figure 3.5 shows a sorted list where every other node has an additional pointer to the node two ahead of it in the list. Here we have to examine no more than ⌈n/2⌉ + 1 nodes.
Figure 3.5: Every other node has an additional pointer
Figure 3.6: Every second node has a pointer two ahead of it
In the list of Figure 3.6, every second node has a pointer two ahead of it and every fourth node has a pointer four ahead of it. Here we need to examine no more than ⌈n/4⌉ + 2 nodes. In Figure 3.7, every (2^i)-th node has a pointer 2^i nodes ahead (i = 1, 2, ...); the number of nodes to be examined can then be reduced to ⌈log₂ n⌉ while only doubling the number of pointers.
Figure 3.7: Every (2^i)-th node has a pointer to a node 2^i nodes ahead (i = 1, 2, ...)
Figure 3.8: A skip list
A node that has k forward pointers is called a level k node. If every (2^i)-th node has a pointer 2^i nodes ahead, then:
# of level 1 nodes: 50%
# of level 2 nodes: 25%
# of level 3 nodes: 12.5%
...
Such a data structure can be used for fast searching, but insertions and deletions would be extremely cumbersome, since the levels of nodes would have to change. What would happen if the levels of nodes were chosen randomly, but in the same proportions (Figure 3.8)?
o The level of a node is chosen randomly when the node is inserted.
o A node's i-th pointer, instead of pointing to a node that is 2^(i−1) nodes ahead, points to the next node of level i or higher.
o In this case, insertions and deletions do not change the level of any node.
o Some arrangements of levels would give poor execution times, but it can be shown that such arrangements are rare.
Such a linked representation is called a skip list. Each element is represented by a node, the level of which is chosen randomly when the node is inserted, without regard for the number of elements in the data structure. A level i node has i forward pointers, indexed 1 through i. There is no need to store the level of a node in the node. MaxLevel is the maximum number of levels in a node.
o Level of a list = MaxLevel
o Level of empty list = 1
o Level of header = MaxLevel
Initialization: an element NIL is allocated and given a key greater than any legal key. All levels of all lists are terminated with NIL. A new list is initialized so that the level of the list = MaxLevel and all forward pointers of the list's header point to NIL.
Search: we search for an element by traversing forward pointers that do not overshoot the node containing the element being searched for. When no more progress can be made at the current level of forward pointers, the search moves down to the next level. When we can make no more progress at level 1, we must be immediately in front of the node that contains the desired element (if it is in the list).
Insertion and Deletion: insertion and deletion are through search and splice. update[i] contains a pointer to the rightmost node of level i or higher that is to the left of the location of the insertion or deletion. If an insertion generates a node with a level greater than the previous maximum level of the list, we update the maximum level and initialize the appropriate portions of the update vector. After a deletion, we check whether we have deleted the maximum-level element of the list, and if so, decrease the maximum level of the list. Figure 3.9 provides an example of Insert and Delete. The pseudocode for Search, Insert, and Delete is shown below.

search (list, searchkey)
{
    x = list->header;
    for (i = list->level; i >= 1; i--) {
        while (x->forward[i]->key < searchkey)
            x = x->forward[i];
    }
    x = x->forward[1];
    if (x->key == searchkey) return true;
    else return false;
}

Figure 3.9: A skip list

insert (list, searchkey)
{
    x = list->header;
    for (i = list->level; i >= 1; i--) {
        while (x->forward[i]->key < searchkey)
            x = x->forward[i];
        update[i] = x;
    }
    x = x->forward[1];
    if (x->key == searchkey) return "key already present";
    else {
        newLevel = randomLevel();
        if (newLevel > list->level) {
            for (i = list->level + 1; i <= newLevel; i++)
                update[i] = list->header;
            list->level = newLevel;
        }
        x = makenode(newLevel, searchkey);
        for (i = 1; i <= newLevel; i++) {
            x->forward[i] = update[i]->forward[i];
            update[i]->forward[i] = x;
        }
    }
}

delete (list, searchkey)
{
    x = list->header;
    for (i = list->level; i >= 1; i--) {
        while (x->forward[i]->key < searchkey)
            x = x->forward[i];
        update[i] = x;
    }
    x = x->forward[1];
    if (x->key == searchkey) {
        for (i = 1; i <= list->level; i++) {
            if (update[i]->forward[i] != x) break;
            update[i]->forward[i] = x->forward[i];
        }
        free(x);
        while ((list->level > 1) && (list->header->forward[list->level] == NIL))
            list->level = list->level - 1;
    }
}
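The insert routine above relies on randomLevel(); one standard way to implement it, following the coin-flipping idea (with p and MaxLevel as discussed in the analysis below), is sketched here in C++. The constants are assumptions chosen for illustration.

    #include <cstdlib>

    const int MAXLEVEL = 16;    // roughly log_{1/p} N for the expected table size N
    const double P = 0.5;       // promotion probability p

    // Flip a biased coin until it fails or MAXLEVEL is reached.
    // With p = 1/2 this consumes one random bit per level.
    int randomLevel() {
        int level = 1;
        while (level < MAXLEVEL && (double)std::rand() / RAND_MAX < P)
            level++;
        return level;
    }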
1.12 Analysis of Skip Lists
In a skip list of 16 elements, we might have, say:
9 elements at level 1
3 elements at level 2
3 elements at level 3
1 element at level 6
One important question is: where do we start our search? Analysis shows we should start from level L(n), where

L(n) = log₂ n.

In general, if p is the probability fraction,

L(n) = log_{1/p} n,

where p is the fraction of the nodes with level i pointers that also have level (i + 1) pointers. However, starting at the highest level does not alter the efficiency in a significant way.
Another important question is: what should MaxLevel be? A good choice is MaxLevel = L(N) = log_{1/p} N, where N is an upper bound on the number of elements in a skip list.
The complexity of search, delete, and insert is dominated by the time required to search for the appropriate element. This in turn is proportional to the length of the search path, which is determined by the pattern in which elements with different levels appear as we traverse the list. Insert and delete involve an additional cost proportional to the level of the node being inserted or deleted.
Analysis of Expected Search Cost
We analyze the search path backwards, traveling up and to the left, and we pretend that the level of a node is determined only when it is observed while backtracking the search path.
1. From level 1, we rise to level L(n).
2. We move leftward.
3. We climb from L(n) to MaxLevel.
At a particular point in the climb, we are at the i-th forward pointer of a node x, and we have no knowledge about the levels of the nodes to the left of x or about the level of x itself, other than that the level of x must be at least i. See situation (a) in the figure. Assume x is not the header. If the level of x is exactly i, we are in situation (b); if the level of x is greater than i, we are in situation (c). P{we are in situation (c)} = p, and each time we are in situation (c) we climb up one level. Let C(k) be the expected length of a search path that climbs up k levels in an infinite list. Then:

C(0) = 0
C(k) = (1 − p)·(1 + C(k)) + p·(1 + C(k − 1)),

which solves to C(k) = k/p.
Our assumption that the list is infinite is pessimistic: when we bump into the header in our backward climb, we simply climb up without performing any leftward movements. This gives an upper bound of

(L(n) − 1)/p

on the expected length of a path that climbs from level 1 to level L(n) in a list of n elements (because L(n) − 1 levels are being climbed). The rest of the journey includes:
1. leftward movements, the number of which is bounded by the number of elements of level L(n) or higher; this has expected value 1/p;
2. upward movements from level L(n) to the maximum level in the list; since P{maximum level in the list > k} ≤ n·p^k, the expected number of these is at most 1/(1 − p).
Putting the results together, the total expected cost to climb out of a list of n elements is at most (L(n) − 1)/p + 1/p + 1/(1 − p), which is O(log n). One can also bound the probability that the actual cost of a search exceeds the expected cost by more than a specified ratio; these probabilities drop off exponentially, which is the source of figures like the 10^-6 bound quoted earlier (e.g., for p = 1/2 and n = 256, a search is extremely unlikely to take longer than 3 times the expected cost).
How do we choose p? Decreasing p increases the variability of running times. With p = 1/2, generation of a random level can be done from a stream of random bits. Decreasing p decreases the number of pointers per node.
Overheads are related to L(n) rather than n. p = 1/4 is probably the best choice, unless the variability of running times is an issue, in which case p = 1/2 is preferable.
Assignment questions: Dictionaries
1. Explain the skip list representation of a dictionary with an example.
2. What are the data members of the skipList class? Write the constructor for skipList.
3. Write short notes on: i. Sets ii. Dictionaries iii. Hash Tables
4. Discuss: i. Open Hashing ii. Closed Hashing iii. Rehashing Methods
5. Explain hashing functions: i. Division Method ii. Multiplication Method iii. Universal Hashing
6. Give the analysis of the closed hashing results (Unsuccessful Search, Insertion, Successful Search, Deletion).
7. How is a hash table restructured?
8. Explain skip lists and the analysis of skip lists.
UNIT-II Topics: AVL Trees: Maximum Height of an AVL Tree, Insertions and Deletions. 2-3 Trees : Insertion, Deletion.
2.1 Balanced Trees: AVL Trees
Also called height-balanced binary search trees. Search, insertion, and deletion can be implemented in worst-case O(log n) time.
Definition: an AVL tree is a binary search tree in which
1. the heights of the right subtree and left subtree of the root differ by at most 1;
2. the left subtree and the right subtree are themselves AVL trees.
A node is said to be:
left-high if the left subtree has greater height,
right-high if the right subtree has greater height,
equal if the heights of the LST and RST are the same.
Figure 5.1: Examples of AVL trees
Examples: Several examples of AVL trees are shown in Figure 5.1.
Figure 5.2: An AVL tree with height h
2.2 Maximum Height of an AVL Tree
What is the maximum height of an AVL tree having exactly n nodes? To answer this question, we pose the converse one: what is the minimum number of nodes (sparsest possible AVL tree) that an AVL tree of height h can have?
Let F_h be an AVL tree of height h having the minimum number of nodes, visualized as in Figure 5.2. Let F_l and F_r be the AVL trees that are the left and right subtrees, respectively, of F_h. One of F_l, F_r must have height h − 1 and the other height h − 2; say F_l has height h − 1 and F_r has height h − 2. Note that F_l has to be an AVL tree having the minimum number of nodes among all AVL trees of height h − 1, and similarly F_r has the minimum number of nodes among all AVL trees of height h − 2. Thus we have

|F_h| = |F_{h−1}| + |F_{h−2}| + 1,

where |F_h| denotes the number of nodes in F_h. Such trees are called Fibonacci trees; see Figure 5.3, which shows some Fibonacci trees. Note that |F_0| = 1 and |F_1| = 2. Adding 1 to both sides, we get

|F_h| + 1 = (|F_{h−1}| + 1) + (|F_{h−2}| + 1),

so the numbers |F_h| + 1 are Fibonacci numbers. Using the approximate formula for Fibonacci numbers, we get

h ≈ 1.44·log₂ |F_h|.

The sparsest possible AVL tree with n nodes therefore has height h ≈ 1.44·log₂ n, and the worst-case height of an AVL tree with n nodes is about 1.44·log₂ n.
Figure 5.3: Fibonacci trees
Figure 5.4: Rotations in a binary search tree
2.3 AVL Trees: Insertions and Deletions
While inserting a new node or deleting an existing node, the resulting tree may violate the (stringent) AVL property. To reinstate the AVL property, we use rotations; see Figure 5.4.
Rotation in a BST: left rotation and right rotation can each be realized by a three-way rotation of pointers.
Left rotation:
temp = p->right;
p->right = temp->left;
temp->left = p;
p = temp;
Left rotation and right rotation preserve
* the BST property
* the inorder ordering of keys
Problem scenarios in AVL tree insertions:
o the left subtree of a node has height higher by 2:
  the left child of the node is left-high (A)
  the left child of the node is right-high (B)
o the right subtree of a node has height higher by 2:
  the right child of the node is left-high (C)
  the right child of the node is right-high (D)
The AVL property may be violated at any node, not necessarily the root. Fixing the AVL property involves doing a series of single or double rotations; a double rotation involves a left rotation followed or preceded by a right rotation. In an AVL tree of height h, no more than O(h) rotations are required to fix the AVL property.
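As a concrete sketch, here is the left rotation in C++ with explicit height maintenance; the node layout (a stored height field rather than a balance tag) is an assumption made for illustration.

    // Hypothetical AVL node; the height of an empty subtree is 0.
    struct Node {
        int key, height;
        Node *left, *right;
    };

    int h(Node* t) { return t ? t->height : 0; }
    int max2(int a, int b) { return a > b ? a : b; }

    Node* rotateLeft(Node* p) {       // the three-way pointer rotation from above
        Node* temp = p->right;
        p->right = temp->left;
        temp->left = p;
        p->height = 1 + max2(h(p->left), h(p->right));        // update lower node first
        temp->height = 1 + max2(h(temp->left), h(temp->right));
        return temp;                  // new root of this subtree
    }

Right rotation is the mirror image; a double rotation is simply one call to each, applied to the appropriate child first.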
Figure 5.5: Insertion in AVL trees: Scenario D
Figure 5.6: Insertion in AVL trees: Scenario C
Insertion, problem scenario 1 (Scenario D): Scenario A is symmetrical to this; see Figure 5.5.
Insertion, problem scenario 2 (Scenario C): Scenario B is symmetrical to this; see Figure 5.6.
Deletion, problem scenario 1: depending on the original height of T2, the height of the tree is either unchanged (height of T2 = h) or reduced by one (height of T2 = h − 1); see Figure 5.7. There is a scenario symmetric to this.
Deletion, problem scenario 2: see Figure 5.8. As usual, there is a symmetric scenario.
2.4 2-3 Trees
Definition: a 2-3 tree is a null tree (zero nodes), or a single-node tree (only one node), or a multiple-node tree with the following properties:
1. Each interior node has two or three children.
2. Each path from the root to a leaf has the same length.
Fields of an internal node: p1, k1, p2, k2, p3, where
p1 : pointer to the first child
p2 : pointer to the second child
p3 : pointer to the third child
k1 : smallest key that is a descendant of the second child
k2 : smallest key that is a descendant of the third child
A leaf node holds a key and other fields. Records are placed at the leaves; each leaf contains a record (and key).
Example: see Figure 5.23.
Figure 5.23: An example of a 2-3 tree
Search: the values recorded at the internal nodes can be used to guide the search path. To search for a record with key value x, we first start at the root. Let k1 and k2 be the two values stored there.
1. If x < k1, move to the first child.
2. If x ≥ k1 and the node has only two children, move to the second child.
3. If x ≥ k1 and the node has three children, move to the second child if x < k2 and to the third child if x ≥ k2.
Eventually, we reach a leaf; x is in the tree iff x is at this leaf.
Path lengths: a 2-3 tree with k levels has between 2^(k−1) and 3^(k−1) leaves. Thus a 2-3 tree with n elements (leaves) requires at least 1 + log₃ n levels and at most 1 + log₂ n levels.
2.5 2-3 Trees: Insertion
For an example, see Figure 5.24.
Figure 5.24: Insertion in 2-3 trees: An example
Insert (x):
1. Locate the node v which should be the parent of x.
2. If v has two children, make x another child of v and place it in the proper order; adjust k1 and k2 at node v to reflect the new situation.
3. If v has three children, split v into two nodes, v and v'. Make the two smallest of the four children stay with v and assign the other two as children of v'. Recursively insert v' among the children of w, where w is the parent of v.
The recursive insertion can proceed all the way up to the root, making it necessary to split the root. In this case, create a new root, thus increasing the number of levels by 1.
2.6 2-3 Trees: Deletion
Delete (x):
1. Locate the leaf L containing x and let v be the parent of L.
2. Delete L. This may leave v with only one child.
   If v is the root, delete v and let its lone child become the new root; this reduces the number of levels by 1.
   If v is not the root, let p be the parent of v. If p has a child adjacent (sibling) to v with 3 children, transfer an appropriate child of that sibling to v. If the children of p adjacent to v have only two children, transfer the lone child of v to an adjacent sibling of v and delete v. If p now has only one child, repeat recursively with p in place of v.
The recursion can ripple all the way up to the root, leading to a decrease in the number of levels.
Example: see Figure 5.25.
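A sketch of the node layout and the search procedure described above, in C++. The leaf flag and field names are illustrative assumptions, since the notes leave the exact representation open.

    // Hypothetical 2-3 tree node, following the field list above.
    struct Node23 {
        bool leaf;
        int key;                  // used at leaves only
        int k1, k2;               // guide keys (k2 unused with two children)
        Node23 *p1, *p2, *p3;     // p3 == nullptr when there are two children
    };

    bool search23(Node23* t, int x) {
        if (t == nullptr) return false;
        while (!t->leaf) {
            if (x < t->k1)              t = t->p1;   // rule 1
            else if (t->p3 == nullptr)  t = t->p2;   // rule 2: only two children
            else if (x < t->k2)         t = t->p2;   // rule 3
            else                        t = t->p3;
        }
        return t->key == x;       // x is in the tree iff it is at this leaf
    }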
Assignment questions: Balanced Trees
1. Write short notes on balancing trees.
2. What is an AVL tree? What is the maximum height of an AVL tree? Give application areas of AVL trees.
3. Give the operations performed on an AVL tree, with examples of insertions and deletions.
4. What are 2-3 trees? Explain the insertion and deletion operations, and give application areas of 2-3 trees.
5. Start with an empty 2-3 tree and insert the keys 2, 1, 5, 6, 7, 4, 3, 8, 9, 10, and 11 in this order. Draw the 2-3 tree following each insertion. Remove 5, 7, 9 from the 2-3 tree constructed for the above elements and draw the 2-3 tree after each removal.
3.1 Priority Queues
A priority queue is an important abstract data type in Computer Science. The major operations supported by priority queues are INSERT and DELETEMIN. They find extensive use in:
implementing schedulers in operating systems and distributed systems
representing event lists in discrete event simulation
implementing numerous graph algorithms efficiently
selecting the k-th largest or k-th smallest element in a list (order statistics problems)
sorting applications
Simple ways of implementing a priority queue include: a linked list (sorted or unsorted) and a binary search tree.
Binary Heaps
Heaps (occasionally called partially ordered trees) are a very popular data structure for implementing priority queues. A heap is either a min-heap or a max-heap: a min-heap supports the insert and deletemin operations, while a max-heap supports insert and deletemax. Heaps can be binary or d-ary: binary heaps are special forms of binary trees, while d-ary heaps are a special class of general trees. Binary heaps were first introduced by Williams in 1964. We discuss binary min-heaps; the discussion is identical for binary max-heaps.
DEF. A binary heap is a complete binary tree with elements from a partially ordered set, such that the element at every node is less than (or equal to) the element at its left child and the element at its right child. Figure 6.1 shows an example of a heap.
Figure 6.1: An example of a heap and its array representation
Since a heap is a complete binary tree, the elements can be conveniently stored in an array. If an element is at position i in the array, then its left child is at position 2i and its right child at position 2i + 1. By the same token, a non-root element at position i has its parent at position ⌊i/2⌋. Because of its structure, a heap with height k has between 2^k and 2^(k+1) − 1 elements; therefore a heap with n elements has height ⌊log₂ n⌋. Because of the heap property, the minimum element is always at the root of the heap, so the findmin operation has worst-case O(1) running time.
3.2 Implementation of Insert and Deletemin
Insert: to insert an element x into a heap with n elements, we first create a hole in position n + 1 and see whether the heap property is violated by putting x into the hole. If the heap property is not violated, we have found the correct position for x. Otherwise, we ``push up'' or ``percolate up'' x until the heap property is restored: we slide the element in the hole's parent node into the hole, thus bubbling the hole up toward the root. We continue this process until x can be placed in the hole. See Figure 6.2 for an example.
Figure 6.2: Insertion into a heap
The worst-case complexity of insert is O(h), where h is the height of the heap; thus insertions are O(log n), where n is the number of elements in the heap.
Deletemin: when the minimum is deleted, a hole is created at the root level. Since the heap now has one element less and must remain a complete binary tree, the element in the last position has to be relocated. We first do this by placing the last element in the hole created at the root. This may leave the heap property violated at the root level. We now ``push down'' or ``percolate down'' the hole at the root until the violation of the heap property stops. While pushing down the hole, it is important to slide it down to the lesser of its two children (pushing that child up); this is done so as not to create another violation of the heap property. See Figure 6.3. It is easy to see that the worst-case running time of deletemin is O(log n), where n is the number of elements in the heap.
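A compact C++ sketch of the array-based min-heap with percolate-up insert and percolate-down deletemin, using the 1-based indexing described above. Calling deletemin on an empty heap is left undefined, as in the notes.

    #include <vector>

    class MinHeap {
        std::vector<int> a{0};                      // a[0] is unused; children of i
    public:                                         // are at 2i and 2i + 1
        void insert(int x) {                        // O(log n) worst case
            a.push_back(x);                         // hole at position n + 1
            size_t i = a.size() - 1;
            while (i > 1 && a[i / 2] > x) {         // percolate the hole up
                a[i] = a[i / 2];
                i /= 2;
            }
            a[i] = x;
        }
        int deletemin() {                           // O(log n) worst case
            int min = a[1];
            int last = a.back(); a.pop_back();
            size_t i = 1, n = a.size() - 1;
            while (2 * i <= n) {                    // percolate the hole down
                size_t c = 2 * i;
                if (c + 1 <= n && a[c + 1] < a[c]) c++;   // lesser child
                if (a[c] >= last) break;
                a[i] = a[c];                        // slide lesser child up
                i = c;
            }
            if (n > 0) a[i] = last;
            return min;
        }
    };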
3.3 Creating a Heap
Given a set of n elements, the problem here is to create a heap of these elements. The obvious approach is to insert the n elements, one at a time, into an initially empty heap. Since the worst-case complexity of insert is O(log n), this approach has a worst-case running time of O(n log n).
Another approach, proposed by Floyd in 1964, is to use a procedure called ``pushdown'' repeatedly, starting with the array consisting of the given n elements in input order.
o The function pushdown(first, last) assumes that the elements a[first], ..., a[last] obey the heap property, except possibly the children of a[first]. The function pushes down a[first] until the heap property violation stops.
o The following code will accomplish what is required:
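(A C++ sketch of the pushdown procedure and the construction loop, following the description above; a 1-based array a[1..n] is assumed.)

    void pushdown(int a[], int first, int last) {
        int r = first;                              // r: position being pushed down
        while (2 * r <= last) {
            int c = 2 * r;
            if (c + 1 <= last && a[c + 1] < a[c]) c++;   // lesser child
            if (a[c] >= a[r]) break;                // heap property restored
            int t = a[r]; a[r] = a[c]; a[c] = t;    // swap down and continue
            r = c;
        }
    }

    void buildheap(int a[], int n) {
        for (int i = n / 2; i >= 1; i--)            // leaves already satisfy the property
            pushdown(a, i, n);                      // O(n) total, by the result below
    }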
Figure 6.4 shows an example of this for the array [5 9 4 2 1 6]. The worst-case complexity of this approach can be shown to be O(n), by virtue of the following result.
Figure 6.4: Creation of heap
Result: for the perfect binary tree of height h containing 2^(h+1) − 1 nodes, the sum of the heights of the nodes is 2^(h+1) − 1 − (h + 1).
Proof: the perfect (full) binary tree of height h has 1 node at height h, 2 nodes at height h − 1, 4 nodes at height h − 2, and so on. Therefore the required sum S is given by

S = Σ_{i=0}^{h−1} 2^i·(h − i) = h + 2(h − 1) + 4(h − 2) + ... + 2^(h−1)·1.

Multiplying both sides by 2 yields

2S = 2h + 4(h − 1) + 8(h − 2) + ... + 2^h.

Subtracting the first equation from the second, we obtain

S = 2^(h+1) − 1 − (h + 1).

It is easy to see that the above is an upper bound on the sum of the heights of the nodes of a complete binary tree. Since a complete binary tree of height h has between 2^h and 2^(h+1) nodes, the above sum is therefore O(n), where n is the number of nodes in the heap. Since the worst-case complexity of the heap-building algorithm is of the order of the sum of the heights of the nodes of the heap built, the worst-case complexity of heap building is O(n).
3.4 Binomial Queues
Binomial queues support merge, insert, and deletemin operations in O(log n) worst-case time.
A binomial queue is a forest of heap-ordered trees. o Each of the heap-ordered trees is of a constrained form known as a binomial tree.
Figure 6.5: Examples of Binomial Trees
A binomial tree of height 0 is a one-node tree; a binomial tree B_k of height k is formed by attaching a binomial tree B_{k−1} to the root of another binomial tree B_{k−1}. See Figure 6.5 for examples of binomial trees. A binomial tree B_k consists of a root with children B_0, B_1, ..., B_{k−1}. B_k has exactly 2^k nodes, and the number of nodes at depth d is the binomial coefficient C(k, d).
If we impose heap order on the binomial trees and allow at most one binomial tree of any height, we can uniquely represent a priority queue of any size by a forest of binomial trees. Example: a priority queue of size 13 can be represented by the forest B_3, B_2, B_0; a natural way of writing this is 1101 (13 in binary). See Figure 6.6 for a binomial queue with six elements.
Figure 6.6: A binomial queue H1 with six elements
3.5 Binomial Queue Operations
Find-min: this is implemented by scanning the roots of all the trees. Since there are at most log n different trees, this leads to a worst-case complexity of O(log n). Alternatively, one can keep track of the current minimum and perform find-min in O(1) time, if we remember to update the minimum when it changes during other operations.
Merge: let H1 have 6 elements (trees B_2, B_1) and H2 have 7 elements (trees B_2, B_1, B_0). The B_0 tree carries over; the two B_1 trees merge into a B_2, which leaves us with 1 tree of height 0 and 3 trees of height 2; one B_2 stays in the result and the other two merge into a B_3. Note that of the 3 binomial trees of height 2, we could merge any pair and still get a valid binomial queue. Since merging two binomial trees takes constant time and there are O(log n) binomial trees, merge takes O(log n) in the worst case. See Figures 6.7 and 6.8 for two examples.
Figure 6.7: Examples of Merging
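The constant-time linking step that merge relies on can be sketched as follows; the node shape (first child / sibling / rank) anticipates the implementation discussion below and is an assumption for illustration.

    // Two heap-ordered trees of equal rank combine into one tree of the
    // next rank by making the larger root a child of the smaller.
    struct BNode {
        int key, rank;               // rank = number of children
        BNode *firstChild, *sibling;
    };

    BNode* link(BNode* t1, BNode* t2) {      // t1 and t2 have equal rank k - 1
        if (t2->key < t1->key) { BNode* t = t1; t1 = t2; t2 = t; }
        t2->sibling = t1->firstChild;        // t2 becomes the first child of t1
        t1->firstChild = t2;
        t1->rank++;
        return t1;                           // root of the combined B_k tree
    }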
Insertion: this is a special case of merging, since we merely create a one-node tree and perform a merge. The worst-case complexity is therefore O(log n). More precisely, if the priority queue into which the element is being inserted has the property that the smallest nonexistent binomial tree is B_i, the running time is proportional to i + 1. E.g., insertion into H3 (which has 13 elements) terminates in two steps. Since each tree of a particular degree in a binomial queue is present with probability 1/2, if we define the random variable X as the number of steps in an insert operation, then

E[X] = Σ_{i≥1} i·(1/2)^i = 2.

Figure 6.9: Examples of Inserts
Thus the average number of steps in an insert operation is 2, so we expect an insertion to terminate in two steps on the average. Furthermore, performing n inserts on an initially empty binomial queue takes O(n) worst-case time. See Figures 6.9 and 6.10 for an example.
Deletemin: first find the binomial tree with the smallest root; let this be B_k and let H be the original priority queue. Remove B_k from the forest in H, forming another binomial queue H'. Now remove the root of B_k, creating binomial trees B_0, B_1, ..., B_{k−1}, which collectively form a binomial queue H''. Finally, merge H' and H''.
It is easy to see that the worst-case complexity of deletemin is O(log n).
Implementation of a Binomial Queue: the deletemin operation requires the ability to find all subtrees of the root, so the children of each node should be available (say, as a linked list). Deletemin also requires that the children be ordered by the size of their subtrees. We further need to make sure it is easy to merge trees: two binomial trees can be merged only if they have the same size, hence the size (rank) of a tree must be stored in its root. While merging, one of the trees becomes the last child of the other, so we should keep track of the last child of each node. A good data structure to use is a circular doubly linked list in which each node has the following fields:

data | first child | left sibling | right sibling | rank (number of children)

3.6 Binomial Queues: Amortized Analysis
Amortized analysis of merge: to merge two binomial queues, an operation similar to the addition of binary integers is performed. At any stage, we may have zero, one, two, or three B_k trees, depending on whether or not the two priority queues contain a B_k tree and whether or not a B_k tree is carried over from the previous step:
o if there is zero or one B_k tree, it is placed as a tree in the resulting binomial queue;
o if there are two, they are merged into a B_{k+1} tree and carried over;
o if there are three, one is retained and the other two are merged.
Result 1: a binomial queue of n elements can be built by n successive insertions in O(n) time.
Brute force analysis: define the cost of each insertion to be 1 time unit plus an extra unit for each linking step. The total is then n units plus the total number of linking steps.
o The 1st, 3rd, and each odd-numbered insertion requires no linking steps, since there is no B_0 present.
o A quarter of the insertions require only one linking step: the 2nd, 6th, 10th, ...
o One eighth of the insertions require two linking steps, and so on.
We could do all this and bound the number of linking steps by n, but this analysis will not help when we try to analyze a sequence of operations that includes more than just insertions.
Amortized analysis: consider the result of an insertion.
o If there is no B_0 tree, the insertion costs one time unit; the result is that there is now a B_0 tree, and the forest has one more tree.
o If there is a B_0 tree but no B_1 tree, the insertion costs 2 time units; the new forest has a B_1 tree but no B_0 tree, so the number of trees in the forest is unchanged.
o An insertion that costs 3 time units creates a B_2 tree but destroys a B_0 and a B_1, yielding one less tree in the forest.
o In general, an insertion that costs c units results in a net increase of 2 − c trees, since a B_{c−1} tree is created and all B_i trees, 0 ≤ i < c − 1, are removed.
Thus expensive insertions remove trees and cheap insertions create trees. Let

t_i = the actual cost (1 plus the number of linking steps) of the i-th insertion,
c_i = the number of trees in the forest after the i-th insertion.

We have c_0 = 0 and t_i + (c_i − c_{i−1}) = 2 for every i.
Result 2: the amortized running times of Insert, Deletemin, and Merge are O(1), O(log n), and O(log n), respectively. To prove this result we choose as potential function the number of trees in the queue.
Insertion: with t_i and c_i as above, the amortized cost is

a_i = t_i + (c_i − c_{i−1}) = 2,

so

Σ_i t_i = Σ_i a_i − (c_n − c_0) = 2n − (c_n − c_0).

As long as (c_n − c_0) is positive, we are done; in any case, (c_n − c_0) is bounded by log n if we start with an empty queue.
Merge: assume that the two queues to be merged have n_1 and n_2 nodes with T_1 and T_2 trees, and let n = n_1 + n_2. The actual time to perform the merge is

t_i = O(log n_1 + log n_2) = O(max(log n_1, log n_2)) = O(log n).
The potential change (c_i − c_{i−1}) is at most log n, since there can be at most log n trees after the merge.
Deletemin: the analysis here follows the same argument as for merge.
3.7 Lazy Binomial Queues
These are binomial queues in which merging is done lazily: to merge two binomial queues, we simply concatenate the two lists of binomial trees. In the resulting forest, there may be several trees of the same size. Because of the lazy merge, merge and insert are both worst-case O(1) time.
Deletemin:
o convert the lazy binomial queue into a standard binomial queue;
o do deletemin as in a standard queue.
Fibonacci Heaps
A Fibonacci heap supports all basic heap operations in O(1) amortized time, with the exception of deletemin and delete, which take O(log n) amortized time. Fibonacci heaps generalize binomial queues by adding two new concepts:
a different implementation of decrease-key, and
lazy merging: two heaps are merged only when it is required.
It can be shown that in a Fibonacci heap, any node of rank r ≥ 1 has at least F_{r+1} descendants.
Assignment Questions:
Priority Queues
1. What are binary heaps? Give application areas of binary heaps.
2. Discuss the implementation of creating a heap, insert, and deletemin.
3. What are binomial queues? Explain binomial queue operations.
4. Give the amortized analysis of binomial queues.
5. What are lazy binomial queues?
UNIT-IV Topics: Graphs : Operations on Graphs: Vertex insertion, vertex deletion, find vertex, edge addition, edge deletion, Graph Traversals- Depth First Search and Breadth First Search(Non recursive) . Graph storage Representation- Adjacency matrix, adjacency lists.
4.1 Graphs(directed graphs) In numerous situations in Computer Science, Mathematics, Engineering, and many other disciplines, Graphs have found wide use in representing arbitrary relationships among data objects. There are two broad categories of graphs: Directed graphs (digraphs) and Undirected graphs. In this chapter, we study some important graph-based algorithms in Computer Science and examine data structure related issues. A directed graph or digraph G comprises 1. a set of vertices V 2. a set of directed edges or arcs, E (an arc is an ordered pair of vertices) Example: See Figure 7.1
Figure 7.1: A digraph with 4 vertices and 5 arcs
V = {1, 2, 3, 4}
E = {(1,2), (1,3), (2,4), (3,2), (4,3)}
If there is an arc (v, w), we say w is adjacent to v. A path is a sequence of vertices v1, v2, ..., vn such that the vertex pairs (v1, v2), (v2, v3), ..., (v(n−1), vn) are arcs in the graph. The length of a path is the number of arcs on the path; a single vertex v by itself denotes a path of length zero. A path v1, v2, ..., vn is said to be simple if the vertices are distinct, except possibly v1 and vn. If v1 = vn and n > 1, we call such a path a simple cycle.
Graph operations:
Edge: a bidirectional connection between two vertices.
Arc: a directed connection from one vertex to another.
Create a vertex node with the following fields:
1. pointer to the next vertex
2. data (name of the vertex)
3. indegree
4. outdegree
5. a boolean flag which tells whether the vertex has been visited or not
6. pointer to the arc list,
where an arc is a node having the fields:
1. data (name of the destination vertex)
2. a boolean flag
3. pointer to the next arc.
1. Insert vertex: create a vertex node.
2. Delete vertex: to delete a vertex, check that its indegree and outdegree are zero; if not, we cannot delete the vertex.
3. Insert arc: create an arc node and connect it to the corresponding vertex.
4. Delete arc: delete the arc node from the corresponding source and destination vertices.
5. Find vertex: traverse through the adjacency list and return the vertex if found, else report an unsuccessful find.
Note: an implementation of the graph operations using an adjacency list is given at the end of the unit.
6. Graph traversal:
1. depth-first traversal
2. breadth-first traversal
4.2 Depth First Traversal
This is an important way of traversing all vertices of a digraph, with applications in many problems. Let L[v] be the adjacency list for vertex v. To do a depth-first search of all vertices reachable from v, we have the following recursive scheme:
void dfs (vertex v)
{
    vertex w;
    mark v as visited;
    for (each vertex w in L[v])
        dfs(w);
}
Figure 7.10: Depth first search of a digraph
To do a dfs on the entire digraph, we first unmark all vertices and then do a dfs from each unvisited vertex:

for (each v in V)
    mark v as unvisited;
for (each v in V)
    if (v is unvisited)
        dfs(v);

Example: see Figure 7.10.
DFS arcs (where N(v) denotes the order in which v is visited):
Tree arcs: a -> b with N(a) < N(b), leading to unvisited vertices.
Non-tree arcs:
o back arcs: a -> b with N(a) >= N(b), where b is an ancestor of a
o forward arcs: a -> b with N(a) < N(b), where b is a proper descendant of a (already visited)
o cross arcs: a -> b with N(a) > N(b), where b is neither an ancestor nor a descendant of a
4.3 Breadth First Traversal
Level-by-level traversal; a queue data structure is used.
Figure 7.11: Breadth-first search of the digraph in Figure 7.10
With adjacency lists, the complexity of both DFS and BFS is O(n + e), where e is the number of arcs.
Implementation of breadth-first search:

void bfs (vertex v)
/* visits all vertices reachable from v in breadth-first fashion */
{
    vertex x, y;
    vertexqueue Q;
    mark[v] = visited;
    enqueue(v, Q);
    while (Q not empty) {
        x = front(Q);
        dequeue(Q);
        for (each vertex y adjacent to x)
            if (mark[y] == unvisited) {
                mark[y] = visited;
                enqueue(y, Q);
                insert((x, y), T);   /* (x, y) becomes a tree arc */
            }
    }
}
Example: see Figure 7.11.
Undirected graphs: G = (V, E), where each edge is an unordered pair of vertices.
v and w are adjacent if (v, w), or equivalently (w, v), is an edge. The edge (v, w) is incident upon the vertices v and w. A path is a sequence of vertices v1, v2, ..., vn such that (vi, v(i+1)) is an edge for 1 ≤ i < n. A path is simple if all vertices on it are distinct, except possibly v1 and vn. The path has length n − 1 and connects v1 and vn. A graph is connected if every pair of vertices is connected. A subgraph G' = (V', E') of a graph G = (V, E) is a graph such that
1. V' ⊆ V
2. E' contains some edges (u, v) of E such that both u and v are in V'.
If E' contains all edges (u, v) ∈ E for which u, v ∈ V', the subgraph G' is called an induced subgraph. A connected component of a graph is an induced subgraph which is connected and which is not a proper subgraph of any other connected subgraph of G (i.e., a maximal connected induced subgraph). A simple cycle is a simple path of length at least 3 that connects a vertex to itself. A graph is cyclic if it contains at least one (simple) cycle. A free tree is a connected, acyclic graph.
1. Every free tree with n vertices contains exactly n − 1 edges.
2. If we add any edge to a free tree, we get a cycle.
See Figure 8.1 for several examples.
Depth-first traversal (undirected graphs): see Figure 8.2 for an example. Assume the following adjacency lists:
Vertex  Adjacency list
a       (b, c, d, e)
b       (a, d, e)
c       (a, f, g)
d       (a, b, e)
e       (a, b, d)
f       (c, g)
g       (c, f)
Figure 8.2: Depth-first search of an undirected graph
DFS of an undirected graph involves only two types of arcs:
1. tree arcs
2. back arcs (there is no distinction between back arcs and forward arcs)
Cross arcs do not exist, because given any two vertices joined by an edge, one of them will be an ancestor and the other a descendant in the DFS. Thus every edge that would have been a cross arc in the DFS of a digraph becomes a tree or back arc in the case of an undirected graph.
Breadth-first traversal (undirected graphs): see Figure 8.3 for an example. Assume the adjacency lists:
a (b, c, d, e) b (a, d, e) c (a, f, g) d (a, b, e) e (a, b, d) f (c, g) g (c, f)
Figure 8.3: Breadth-first search of an undirected graph
4.4 Graph storage representation:
1. Adjacency Matrix: The matrix is of order (n x n) where n is the number of vertices. The adjacency matrix for the graph in Figure above is
2. Adjacency List: Figures 7.2 and 7.3 provide two different adjacency list representations for the graph of the figure above.
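A small sketch in standard C++ showing both storage schemes side by side; the arc list used here is an illustrative assumption, not the graph of the figures.

#include <iostream>
#include <vector>
using namespace std;

// Two standard storage schemes for a digraph on n vertices.
int main() {
    int n = 3;
    vector<pair<int,int>> arcs = {{0,1}, {0,2}, {2,1}};

    // 1. Adjacency matrix: n x n, matrix[i][j] = 1 iff arc i -> j exists.
    vector<vector<int>> matrix(n, vector<int>(n, 0));
    for (auto [u, v] : arcs) matrix[u][v] = 1;

    // 2. Adjacency list: one list of successors per vertex.
    vector<vector<int>> list(n);
    for (auto [u, v] : arcs) list[u].push_back(v);

    cout << "arc 0 -> 1 present? " << matrix[0][1] << "\n";
    cout << "successors of 0: ";
    for (int w : list[0]) cout << w << " ";
    cout << "\n";
}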
Figure 7.2: Adjacency list representation of the digraph
Figure 7.3: Adjacency list using cursors
Graph operations implementation using C++: To implement the following operations on a graph:
1. vertex insertion
2. vertex deletion
3. finding a vertex
4. edge addition and deletion
#include<iostream.h>
#include<conio.h>
#include<stdlib.h>
struct arc;                       /* forward declaration */
struct vertex
{
    vertex *pnextvertex;
    int data;
    int indegree;
    int outdegree;
    short processed;
    struct arc *parc;
};
struct arc
{
    struct vertex *destination;
    arc *pnextarc;
};
class graph
{
public:
    int count;
    vertex *first;
    int insertvertex(int data);   /* declarations added for coherence; */
    int deletevertex(int data);   /* the member function bodies are    */
    int insertarc(int s, int d);  /* omitted in the source listing     */
    int deletearc(int s, int d);
    int findvertex(int data);
    void display();
};
main() { int ch,v,z,del,s,d,so,de,fi; graph ob; clrscr(); while(1) { cout<<"\n";
cout<<"MENU"<<"\n1.insertvertex\n2.display\n3.count\n4.deletevertex\n5.insertarc"; cout<<"\n6.deletearc\n7.findvertex\n8.exit"; cout<<"\n"; cout<<"enter ur choice from the menu:"; cin>>ch; switch(ch) { case 1://insert vertex cout<<"\n enter the vertex:"; cin>>v; z=ob.insertvertex(v); if(z==1) cout<<"\ninsertvertex"<<" "<<v<<" "<<"is successful"; else cout<<"\ninsertvertex"<<" "<<v<<" "<<"is unsuccessful due to memory overflow"; break; case 2:ob.display(); break; case 3:cout<<"\nthe number of vertices in the adjacency list are:"<<""<<ob.count; break; case 4:cout<<"\n enter the vertex to be deleted:"; cin>>del; z=ob.deletevertex(del); if(z==1) cout<<"\n the vertex"<<" "<<del<<" "<<"is deleted successfully"; else if(z==1) cout<<"\n the vertex"<<" "<<del<<" "<<"cannot be deleted"; else cout<<"\n the graph is empty"; break; case 5:cout<<"\n enter the source vertex:"; cin>>s; www.jntuworld.com || www.android.jntuworld.com || www.jwjobs.net || www.android.jwjobs.net www.jntuworld.com || www.jwjobs.net
cout<<"\n enter the destination vertex:"; cin>>d; z=ob.insertarc(s,d); if(z==1) cout<<"\n arc inserted successfully between"<<" "<<s<<" "<<d; else if(z==-2||z==-3) cout<<"\ncan't find source"<<" "<<s<<" "<<"or destination"<<" "<<d; else cout<<"\ngraph empty"; break; case 6:cout<<"\nenter the source vertex:"; cin>>so; cout<<"\n enter the destination vertex:"; cin>>de; z=ob.deletearc(so,de); if(z==1) cout<<"\n arc deleted successfully between"<<so<<" "<<de; else if(z==-2||z==-3) cout<<"\n graph is empty or the vertices"<<" "<<so<<" "<<de<<" "<<"not found"; break; case 7:cout<<"\n enter the vertex to be found:"; cin>>fi; z=ob.findvertex(fi); if(z==1) cout<<"\n the vertex"<<" "<<fi<<"is found"; else cout<<"\n the vertex"<<" "<<fi<<"is not found"; break; case 8:exit(0); default:cout<<"\n valid data.....try again"; } } } OUTPUT: MENU 1.insert vertex 2.diplay 3.count 4.deletevertex 5.insertarc 6.deletearc 7.findvertex 8.exit Enter ur choice from the menu : 1 www.jntuworld.com || www.android.jntuworld.com || www.jwjobs.net || www.android.jwjobs.net www.jntuworld.com || www.jwjobs.net
Enter the vertex :1
Insert vertex 1 is successful
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 1
Enter the vertex :2
Insert vertex 2 is successful
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 1
Enter the vertex :3
Insert vertex 3 is successful
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 1
Enter the vertex :4
Insert vertex 4 is successful
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 3
The number of vertices in the adjacency list are: 4
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 5
Enter the source vertex :1
Enter the destination vertex :3
Arc inserted successfully between 1 3
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 5
Enter the source vertex :1
Enter the destination vertex :2
Arc inserted successfully between 1 2
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 5
Enter the source vertex :3
Enter the destination vertex :1
Arc inserted successfully between 3 1
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 2
1->2->3
2
3->1
4
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 3
The number of vertices in the adjacency list are: 4
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 4
Enter the vertex to be deleted :1
The vertex 1 cannot be deleted
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 4
Enter the vertex to be deleted :4
The vertex 4 is deleted successfully
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 4
Enter the vertex to be deleted :2
The vertex 2 cannot be deleted
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 6
Enter the source vertex :1
Enter the destination vertex :2
Arc deleted successfully between 1 2
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 2
1->3
2
3->1
MENU
1.insert vertex
2.display
3.count
4.deletevertex
5.insertarc
6.deletearc
7.findvertex
8.exit
Enter ur choice from the menu : 8
Implementation of DFT (non-recursive depth first traversal):
#include<iostream.h>
#include<conio.h>
#include<stdlib.h>
int cost[10][10],i,j,k,n,stk[10],top,v,visit[10],visited[10];
void main()
{
    int m;
    clrscr();
    cout<<"enter no of vertices";
    cin>>n;
    cout<<"enter no of edges";
    cin>>m;
    cout<<"\nEDGES\n";
    for(k=1;k<=m;k++)
    {
        cin>>i>>j;
        cost[i][j]=1;            /* adjacency matrix of the digraph */
    }
    cout<<"enter initial vertex";
    cin>>v;
    cout<<"order of visited vertices ";
    cout<<v<<" ";
    visited[v]=1;
    k=1;
    while(k<n)
    {
        for(j=n;j>=1;j--)        /* push unvisited successors of v */
            if(cost[v][j]!=0 && visited[j]!=1 && visit[j]!=1)
            {
                visit[j]=1;
                stk[top]=j;
                top++;
            }
        v=stk[--top];            /* take the most recently pushed vertex */
        cout<<v<<" ";
        k++;
        visit[v]=0;
        visited[v]=1;
    }
    getch();
}
1. Write notes on operations on graphs: i. vertex insertion ii. vertex deletion iii. find vertex iv. edge addition v. edge deletion
2. Discuss graph traversals: Depth First Search and Breadth First Search.
3. Give the implementation of DFS and BFS (non-recursive).
4. Explain graph storage representation: i. adjacency matrix ii. adjacency lists.
5.1 Graph algorithms: Minimum-Cost Spanning Trees
Let G = (V, E) be a connected graph in which each edge (u, v) ∈ E has an associated cost C(u, v). A spanning tree for G is a subgraph of G that is a free tree connecting all vertices in V. The cost of a spanning tree is the sum of the costs of its edges. An MST of G is a spanning tree of G having minimum cost. See Figure 8.4 for several examples.
Figure 8.4: Spanning trees in a connected graph
Suppose G = (V, E) is a connected graph with costs defined on all edges e ∈ E. Let U be some proper subset of V. If (u, v) is an edge of lowest cost such that u ∈ U and v ∈ V − U, then there exists an MST that includes (u, v) as an edge. See Figure 8.5.
Figure 8.5: An illustration of MST property
Proof of MST Property: Suppose to the contrary that there is no MST for the graph G which includes the edge (u, v). Let T be any MST of G. By our assumption, T does not contain (u, v). Adding (u, v) to T will introduce a cycle, since T is a free tree. This cycle involves (u, v). Therefore there is a path from v to u that does not pass through this edge. This means that there must be another edge (u', v') in T such that u' ∈ U and v' ∈ V − U. See Figure 8.6.
Figure 8.6: Construction of a minimal spanning tree
Deleting edge (u', v') breaks the cycle and yields a spanning tree T' whose cost is certainly no more than that of T, since C(u, v) ≤ C(u', v'). Thus we have constructed an MST that includes (u, v).
Figure 8.7: An example graph for finding an MST
To illustrate, consider the graph in Figure 8.7 and refer to Figures 8.8 and 8.9. Consider the sets: U = {1, 2, 5}
V - U = {3, 4, 6}. Figure 8.8: A spanning tree in the above graph, with cost 26
Spanning tree with cost = 26. Now, least cost edge from U to V - U is (1,3). By including (1,3) in the above spanning tree, a cycle will form (for example, 1-2-3-1). Let us replace the edge (2,3) by the edge (1,3).
Figure 8.9: Another spanning tree, but with cost 22
This has yielded a spanning tree with cost = 22.
DEF. A set of edges T in a connected graph is called promising if it can be extended to produce a minimal spanning tree for the graph. By definition, T = ∅ is always promising, since a weighted connected graph always has at least one MST. Also, if a promising set of edges T is a spanning tree, then it must be an MST.
Def: An edge is said to leave a given set of nodes if exactly one end of this edge is in the set.
MST Lemma: Let
G = (V, E) be a weighted connected graph,
U ⊂ V a strict subset of the nodes of G,
T ⊆ E a promising set of edges such that no edge in T leaves U,
e a least cost edge that leaves U.
Then the set of edges T' = T ∪ {e} is promising.
Figure 8.10: Illustration of MST Lemma
Proof: Since T is promising, it can be extended to obtain an MST, say S. If e ∈ S, there is nothing to prove.
If e ∉ S, then adding edge e to S creates exactly one cycle (since S is a spanning tree). In this cycle, since e leaves U, there exists at least one other edge, say f, that also leaves U (otherwise the cycle could not close). See Figure 8.10. If we now remove f, the cycle disappears and we obtain a new tree R that spans G. Note that R = (S ∪ {e}) − {f}. Also note that weight of e ≤ weight of f, since e is a least cost edge leaving U. Therefore R is also an MST and it includes edge e. Furthermore, T ⊆ R, and so T can be extended to the MST R. Thus T' is a promising set of edges.
5.2 Prim's Algorithm
This algorithm is directly based on the MST property. Assume that V = {1, 2, ..., n}.
{
    T = ∅;
    U = {1};
    while (U ≠ V) {
        let (u, v) be the lowest cost edge such that u ∈ U and v ∈ V − U;
        T = T ∪ {(u, v)};
        U = U ∪ {v};
    }
}
See Figure 8.11 for an example. This is an O(n²) algorithm.
Proof of Correctness of Prim's Algorithm
Theorem: Prim's algorithm finds a minimum spanning tree.
Proof: Let G = (V, E) be a weighted, connected graph. Let T be the edge set that is grown in Prim's algorithm. The proof is by mathematical induction on the number of edges in T, using the MST Lemma.
Basis: The empty set is promising, since a connected, weighted graph always has at least one MST.
Induction Step: Assume that T is promising just before the algorithm adds a new edge e = (u, v). Let U be the set of nodes grown so far in Prim's algorithm. Then all three conditions in the MST Lemma are satisfied, and therefore T ∪ {e} is also promising. When the algorithm stops, U includes all vertices of the graph and hence T is a spanning tree. Since T is also promising, it is an MST.
Implementation of Prim's Algorithm
Use two arrays, closest and lowcost.
For i ∈ V − U, closest[i] gives the vertex in U that is closest to i.
For i ∈ V − U, lowcost[i] gives the cost of the edge (i, closest[i]).
Figure 8.11: Illustration of Prim's algorithm
Figure 8.12: An example graph for illustrating Prim's algorithm
At each step, we can scan lowcost to find the vertex in V − U that is closest to U. Then we update lowcost and closest, taking into account the new addition to U. Complexity: O(n²).
Example: Consider the graph shown in Figure 8.12.
U = {1}    V − U = {2, 3, 4, 5, 6}

vertex in V − U    closest    lowcost
2                  1          6
3                  1          1
4                  1          5
5                  1          ∞
6                  1          ∞

Select vertex 3 to include in U.

U = {1, 3}    V − U = {2, 4, 5, 6}

vertex in V − U    closest    lowcost
2                  3          5
4                  1          5
5                  3          6
6                  3          4

Now select vertex 6.

U = {1, 3, 6}    V − U = {2, 4, 5}

vertex in V − U    closest    lowcost
2                  3          5
4                  6          2
5                  3          6

Now select vertex 4, and so on.
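A minimal sketch of the O(n²) Prim implementation with the closest/lowcost arrays, in standard C++; the cost matrix is an illustrative assumption, not the graph of Figure 8.12.

#include <iostream>
#include <vector>
using namespace std;

const int INF = 1e9;

int main() {
    int n = 4;
    vector<vector<int>> C = {
        {INF, 1, 4, 3},
        {1, INF, INF, 2},
        {4, INF, INF, 5},
        {3, 2, 5, INF}};
    vector<int> closest(n, 0), lowcost(n);
    vector<bool> inU(n, false);
    for (int i = 0; i < n; i++) lowcost[i] = C[0][i];
    inU[0] = true;                       // U = {0}
    int total = 0;
    for (int k = 1; k < n; k++) {
        int v = -1;                      // vertex in V-U closest to U
        for (int i = 0; i < n; i++)
            if (!inU[i] && (v == -1 || lowcost[i] < lowcost[v])) v = i;
        cout << "edge (" << closest[v] << "," << v << ") cost " << lowcost[v] << "\n";
        total += lowcost[v];
        inU[v] = true;                   // U = U + {v}
        for (int i = 0; i < n; i++)      // update lowcost and closest
            if (!inU[i] && C[v][i] < lowcost[i]) {
                lowcost[i] = C[v][i];
                closest[i] = v;
            }
    }
    cout << "MST cost = " << total << "\n";
}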
5.3 Kruskal's Algorithm
A greedy algorithm. Complexity is O(e log e), where e is the number of edges; it can be made even more efficient by a proper choice of data structures.
Algorithm:
Let G = (V, E) be the given graph, with |V| = n.
{
    Start with a graph T = (V, ∅) consisting of only the vertices of G and no edges;
    /* This can be viewed as n connected components, each vertex being one connected component */
    Arrange E in the order of increasing costs;
    for (i = 1; i <= n - 1; i++) {
        Select the next smallest cost edge;
        if (the edge connects two different connected components)
add the edge to T; } }
At the end of the algorithm, we will be left with a single component that comprises all the vertices, and this component is an MST for G.
Proof of Correctness of Kruskal's Algorithm
Theorem: Kruskal's algorithm finds a minimum spanning tree.
Proof: Let G = (V, E) be a weighted, connected graph. Let T be the edge set that is grown in Kruskal's algorithm. The proof is by mathematical induction on the number of edges in T.
o We show that if T is promising at any stage of the algorithm, then it is still promising when a new edge is added to it in Kruskal's algorithm.
o When the algorithm terminates, T gives a solution to the problem and hence an MST.
Basis: T = ∅ is promising, since a weighted connected graph always has at least one MST.
Induction Step: Let T be promising just before adding a new edge e = (u, v). The edges of T divide the nodes of G into one or more connected components; u and v will be in two different components. Let U be the set of nodes in the component that includes u. Note that
o U is a strict subset of V,
o T is a promising set of edges such that no edge in T leaves U (since an edge of T either has both ends in U or has neither end in U),
o e is a least cost edge that leaves U (since Kruskal's algorithm, being greedy, would have chosen e only after examining edges cheaper than e).
The above three conditions are precisely those of the MST Lemma, and hence we can conclude that T ∪ {e} is also promising. When the algorithm stops, T gives not merely a spanning tree but a minimal spanning tree, since it is promising.
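A compact sketch of Kruskal's algorithm in standard C++, using a simple union-find with path compression in place of the textbook's mfset type; the edge list is an illustrative assumption, not the graph of Figure 8.13.

#include <algorithm>
#include <array>
#include <iostream>
#include <numeric>
#include <vector>
using namespace std;

vector<int> parent;

int find(int x) {                       // FIND with path compression
    return parent[x] == x ? x : parent[x] = find(parent[x]);
}

int main() {
    int n = 4;
    vector<array<int,3>> E = {{1,0,1}, {2,1,3}, {3,0,3}, {4,0,2}, {5,2,3}};  // (cost, u, v)
    sort(E.begin(), E.end());           // arrange E in increasing cost
    parent.resize(n);
    iota(parent.begin(), parent.end(), 0);  // INITIAL: one component per vertex
    int ncomp = n;
    for (auto [c, u, v] : E) {
        int uc = find(u), vc = find(v);
        if (uc != vc) {                 // edge joins two different components
            parent[uc] = vc;            // MERGE
            ncomp--;
            cout << "edge (" << u << "," << v << ") cost " << c << "\n";
        }
        if (ncomp == 1) break;
    }
}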
Figure 8.13: An illustration of Kruskal's algorithm
Program
void kruskal (vertex-set V; edge-set E; edge-set T)
{
    int ncomp;            /* current number of components */
    priority-queue edges; /* partially ordered tree */
    mfset components;     /* merge-find set data structure */
    vertex u, v;
    edge e;
    int nextcomp;         /* name for new component */
    int ucomp, vcomp;     /* component names */
    makenull(T); makenull(edges);
    nextcomp = 0;
    ncomp = n;
    for (v ∈ V)           /* initialize a component to have one vertex of V */
    {
        nextcomp++;
        initial(nextcomp, v, components);
    }
    for (e ∈ E)
        insert(e, edges);          /* initialize priority queue of edges */
    while (ncomp > 1) {
        e = deletemin(edges);
        let e = (u, v);
        ucomp = find(u, components);
        vcomp = find(v, components);
        if (ucomp != vcomp) {
            merge(ucomp, vcomp, components);
            insert(e, T);          /* add e to the spanning tree */
            ncomp = ncomp - 1;
        }
    }
}
Implementation
Choose a partially ordered tree for representing the sorted set of edges.
To represent connected components and interconnect them, we need to implement:
1. MERGE (A, B, C) ... merge components A and B in C and call the result A or B, arbitrarily.
2. FIND (v, C) ... returns the name of the component of C of which vertex v is a member. This operation is used to determine whether the two vertices of an edge are in the same or in different components.
3. INITIAL (A, v, C) ... makes A the name of the component in C containing only one vertex, namely v.
The above data structure is called an MFSET.
Running Time of Kruskal's Algorithm
Creation of the priority queue:
* If there are e edges, it is easy to see that it takes O(e log e) time to insert the edges into a partially ordered tree.
* O(e) algorithms are possible for this problem.
Each deletemin operation takes O(log e) time in the worst case. Thus finding and deleting least-cost edges, over all the while iterations, contributes O(e log e) in the worst case.
The total time for performing all the merge and find operations depends on the method used:
O(e log e) without path compression,
O(e α(e)) with path compression, where α(e) is the inverse of Ackermann's function.
Example: See Figure 8.13. E = {(1,3), (4,6), (2,5), (3,6), (3,4), (1,4), (2,3), (1,2), (3,5), (5,6)}.
5.4 Single Source Shortest Paths Problem: Dijkstra's Algorithm
A greedy algorithm. It works by maintaining a set S of ``special'' vertices whose shortest distance from the source is already known. At each step, a ``non-special'' vertex is absorbed into S. The absorption of an element of V − S into S is done by a greedy strategy. The following provides the steps of the algorithm. Let V = {1, 2, ..., n} and source = 1, and let
C[i, j] = cost of the arc (i, j), with C[i, j] = ∞ if there is no arc from i to j.    (7.1)
{
    S = {1};
    for (i = 2; i <= n; i++)
        D[i] = C[1, i];
    for (i = 1; i <= n - 1; i++) {
        choose a vertex w ∈ V − S such that D[w] is a minimum;
        S = S ∪ {w};
        for (each vertex v ∈ V − S)
            D[v] = min(D[v], D[w] + C[w, v]);
    }
}
The above algorithm gives the costs of the shortest paths from the source vertex to every other vertex. The actual shortest paths can also be constructed by modifying the above algorithm.
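A minimal sketch of the above pseudocode in standard C++ (vertices renumbered 0..4; the cost matrix follows the worked example of Figure 7.5 as reconstructed from the iterations below):

#include <iostream>
#include <vector>
using namespace std;

const int INF = 1e9;

int main() {
    int n = 5;
    vector<vector<int>> C(n, vector<int>(n, INF));
    C[0][1] = 10; C[0][3] = 30; C[0][4] = 100;
    C[1][2] = 50; C[2][4] = 10; C[3][2] = 20; C[3][4] = 60;
    vector<int> D(n);
    vector<bool> inS(n, false);
    inS[0] = true;                           // S = {source}
    for (int i = 0; i < n; i++) D[i] = C[0][i];
    D[0] = 0;
    for (int k = 1; k < n; k++) {
        int w = -1;                          // vertex in V-S with minimum D
        for (int i = 0; i < n; i++)
            if (!inS[i] && (w == -1 || D[i] < D[w])) w = i;
        inS[w] = true;                       // S = S + {w}
        for (int v = 0; v < n; v++)          // relax paths through w
            if (!inS[v] && D[w] + C[w][v] < D[v])
                D[v] = D[w] + C[w][v];
    }
    for (int v = 1; v < n; v++) cout << "D[" << v << "] = " << D[v] << "\n";
}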
Theorem: Dijkstra's algorithm finds the shortest paths from a single source to all other nodes of a weighted digraph with positive weights.
Proof: Let V = {1, 2, ..., n} and let 1 be the source vertex. We use mathematical induction to show that
(a) if a node i (≠ 1) ∈ S, then D[i] gives the length of the shortest path from the source to i;
(b) if a node i ∉ S, then D[i] gives the length of the shortest special path from the source to i.
Basis: Initially S = {1}, and hence (a) is vacuously true. For i ∉ S, the only special path from the source is the direct edge (if present) from source to i, and D is initialized accordingly. Hence (b) is also true.
Induction step for condition (a): The induction hypothesis is that both (a) and (b) hold just before we add a new vertex v to S. For every node already in S before adding v, nothing changes, so condition (a) is still true. We have only to show (a) for v, which is just added to S.
Figure 7.4: The shortest path to v cannot visit x
Before adding v to S, we must check that D[v] gives the length of the shortest path from the source to v. By the induction hypothesis, D[v] certainly gives the length of the shortest special path. We therefore have to verify that the shortest path from the source to v does not pass through any nodes that do not belong to S.
Suppose to the contrary: that is, suppose that when we follow the shortest path from the source to v, we encounter nodes not belonging to S. Let x be the first such node encountered (see Figure 7.4). The initial segment of the path from the source to x is a special path, and by part (b) of the induction hypothesis, the length of this path is D[x]. Since edge weights are non-negative, the total distance to v via x is greater than or equal to D[x]. However, since the algorithm has chosen v ahead of x, D[x] ≥ D[v]. Thus the path via x cannot be shorter than the shortest special path leading to v.
Induction step for condition (b): Let w ≠ v and w ∉ S. When v is added to S, there are two possibilities for the shortest special path from the source to w:
1. It remains as before.
2. It now passes through v (and possibly other nodes in S).
In the first case, there is nothing to prove. In the second case, let y be the last node of S visited before arriving at w. The length of such a path is D[y] + C[y, w]. At first glance, to compute the new value of D[w], it looks as if we should compare the old value of D[w] with D[y] + C[y, w] for every y ∈ S (including v). This comparison was, however, already made for all y ∈ S except v, when y was added to S in the algorithm. Thus the new value of D[w] can be computed simply by comparing the old value with D[v] + C[v, w]. This the algorithm does. When the algorithm stops, all the nodes but one are in S, and it is clear that the vector (D[1], D[2], ..., D[n]) contains the lengths of the shortest paths from the source to the respective vertices.
Example: Consider the digraph in Figure 7.5.
Figure 7.5: A digraph example for Dijkstra's algorithm
Initially: S = {1}, D[2] = 10, D[3] = ∞, D[4] = 30, D[5] = 100.
Iteration 1: Select w = 2, so that S = {1, 2}.
D[3] = min(∞, D[2] + C[2, 3]) = 60    (7.2)
D[4] = min(30, D[2] + C[2, 4]) = 30    (7.3)
D[5] = min(100, D[2] + C[2, 5]) = 100
Iteration 2: Select w = 4, so that S = {1, 2, 4}.
D[3] = min(60, D[4] + C[4, 3]) = 50    (7.4)
D[5] = min(100, D[4] + C[4, 5]) = 90
Iteration 3: Select w = 3, so that S = {1, 2, 4, 3}.
D[5] = min(90, D[3] + C[3, 5]) = 60
Iteration 4: Select w = 5, so that S = {1, 2, 4, 3, 5}. Finally:
D[2] = 10    (7.5)
D[3] = 50    (7.6)
D[4] = 30    (7.7)
D[5] = 60
Complexity of Dijkstra's Algorithm
With an adjacency matrix representation, the running time is O(n²). By using an adjacency list representation and a partially ordered tree data structure for organizing the set V − S, the complexity can be shown to be O(e log n), where e is the number of edges and n is the number of vertices in the digraph.
5.5 All Pairs Shortest Paths Problem: Floyd's Algorithm
This is an O(n³) algorithm, where n is the number of vertices in the digraph. It uses the principle of dynamic programming. Let V = {1, 2, ..., n}. The algorithm uses a matrix A[1..n][1..n] to compute the lengths of the shortest paths. Initially,
A[i, j] = C[i, j] if i ≠ j, and A[i, i] = 0,    (7.12)
where C[i, j] is taken as ∞ if there is no directed arc from i to j.
The algorithm makes n passes over A. Let A0, A1, ..., An be the values of A on the n passes, with A0 being the initial value. Just after the (k − 1)th pass,
A(k−1)[i, j] = length of the shortest path from i to j that passes only through intermediate vertices in {1, ..., k − 1},    (7.13)
and the kth pass computes
Ak[i, j] = min( A(k−1)[i, j], A(k−1)[i, k] + A(k−1)[k, j] ).    (7.14)
See Figure 7.8. The kth pass explores whether the vertex k lies on an optimal path from i to j, for all pairs i, j.
Figure 7.8: Principle of Floyd's algorithm
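A minimal sketch of Floyd's triple loop in standard C++; the cost matrix is an illustrative assumption, not the digraph of Figure 7.9.

#include <iostream>
#include <vector>
using namespace std;

const int INF = 1e9 / 2;   // "infinity" kept small enough to add safely

int main() {
    int n = 3;
    vector<vector<int>> A = {
        {0, 8, 5},
        {3, 0, INF},
        {INF, 2, 0}};
    for (int k = 0; k < n; k++)              // k-th pass: allow k as intermediate
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (A[i][k] + A[k][j] < A[i][j])
                    A[i][j] = A[i][k] + A[k][j];
    for (auto& row : A) {
        for (int x : row) cout << x << " ";
        cout << "\n";
    }
}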
Figure 7.9: A digraph example for Floyd's algorithm
Example: See Figure 7.9. (The matrices C = A0 and the successive passes A1, A2, A3 — equation (7.15) — are not reproduced here.)
With an adjacency matrix representation, Floyd's algorithm has a worst case complexity of O(n³), where n is the number of vertices. If Dijkstra's algorithm is used for the same purpose, then with an adjacency list representation the worst case complexity is O(n e log n). Thus if e is O(n²), the complexity is O(n³ log n), while if e is O(n), the complexity is O(n² log n).
5.6 Warshall's Algorithm
Def: Given the adjacency matrix C of any digraph G = (V, E), the matrix A is called the transitive closure of C if, for all i, j ∈ V,
A[i, j] = 1 if there is a path of length one or more from vertex i to vertex j,
A[i, j] = 0 otherwise.
Warshall's algorithm enables us to compute the transitive closure of the adjacency matrix of any digraph. Warshall's algorithm predates Floyd's algorithm and simply uses the following formula in the kth pass of Floyd's algorithm:
Ak[i, j] = A(k−1)[i, j] ∨ ( A(k−1)[i, k] ∧ A(k−1)[k, j] ),
where the matrix elements are treated as boolean values 0 or 1 and the symbols ∨ and ∧ denote ``logical or'' and ``logical and'' respectively. It is easy to see that Warshall's algorithm has a worst case complexity of O(n³), where n is the number of vertices of the graph.
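Warshall's algorithm is the same triple loop as Floyd's, with OR/AND in place of min/+; a minimal sketch in standard C++ with an assumed adjacency matrix:

#include <iostream>
#include <vector>
using namespace std;

int main() {
    int n = 3;
    vector<vector<int>> A = {{0, 1, 0}, {0, 0, 1}, {0, 0, 0}};  // arcs 0->1->2
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] = A[i][j] | (A[i][k] & A[k][j]);   // logical or / and
    for (auto& row : A) {
        for (int x : row) cout << x << " ";
        cout << "\n";
    }
}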
Assignment Questions: Graph algorithms
1. Explain minimum-cost spanning trees: i. Prim's algorithm, ii. Kruskal's algorithm.
2. Write notes on shortest path algorithms: Dijkstra's algorithm.
3. Explain the all pairs shortest paths problem: i. Floyd's algorithm, ii. Warshall's algorithm.
UNIT-VI Topics: Sorting Methods : Order Statistics: Lower Bound on Complexity for Sorting Methods: Lower Bound on Worst Case Complexity, Lower Bound on Average Case Complexity, Heap Sort, Quick Sort, Radix Sorting, Merge Sort.
6.1 Order statistics: Given a list of n records and an integer k, find the record whose key is the k-th in the sorted order of keys. Note that
k = 1 corresponds to finding the minimum,
k = n corresponds to finding the maximum,
k = ⌈n/2⌉ corresponds to finding the median.
Algorithm 1: One can arrange the n keys into a heap and pick the k-th smallest in k steps, where each step takes time logarithmic in the number of elements of the heap. Since the construction of the initial heap can be accomplished in worst case O(n) time, the worst case complexity of this algorithm is O(n + k log n). Therefore if k ≤ n/log n (or, symmetrically, if k ≥ n − n/log n), the above algorithm has O(n) worst case complexity.
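A minimal sketch of Algorithm 1 in standard C++, using std::priority_queue as the heap (its range constructor heapifies in O(n)); the data and k are illustrative assumptions.

#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main() {
    vector<int> keys = {36, 9, 0, 25, 1, 49, 64, 16, 81, 4};
    int k = 3;
    // Min-heap built in O(n) by the range constructor.
    priority_queue<int, vector<int>, greater<int>> heap(keys.begin(), keys.end());
    for (int step = 1; step < k; step++)
        heap.pop();                      // each pop costs O(log n)
    cout << "k-th smallest for k=" << k << ": " << heap.top() << "\n";
}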
Algorithm 2: A variation of quicksort.
Select (i, j, k)
/* finds the k-th smallest element among A[i] ... A[j] */
{
    pick a pivot element v;
    partition A[i] ... A[j] so as to get A[i] ... A[m - 1] with keys < v
        and A[m] ... A[j] with keys >= v;
    if (k <= m - i)
        Select (i, m - 1, k);
    else
        Select (m, j, k - m + i);
}
Worst case complexity of the above algorithm is O(n²) (as in quicksort).
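A minimal sketch of Select in standard C++, with the pivot taken as the larger of the first two keys as in the pseudocode; keys are assumed distinct, and the data is illustrative.

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

// k-th smallest among A[i..j], k is 1-based; keys assumed distinct.
int select_kth(vector<int>& A, int i, int j, int k) {
    if (i == j) return A[i];
    int pivot = max(A[i], A[i + 1]);     // larger of the first two keys
    int m = i;                           // partition: < pivot | >= pivot
    for (int q = i; q <= j; q++)
        if (A[q] < pivot) swap(A[q], A[m++]);
    if (k <= m - i) return select_kth(A, i, m - 1, k);
    else            return select_kth(A, m, j, k - (m - i));
}

int main() {
    vector<int> A = {36, 9, 0, 25, 1, 49, 64, 16, 81, 4};
    cout << "5th smallest = " << select_kth(A, 0, (int)A.size() - 1, 5) << "\n";
}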
Average Case:
o Note that select calls itself only once at a time, whereas quicksort called itself twice each time.
o On average, select calls itself on a subarray about half as long as the current one. To be conservative, suppose that each call of select is on an array some constant fraction c < 1 of the size of the previous call. Then T(n) ≤ Cn + T(cn), which can be shown to be O(n).
Algorithm 3: This is also a variation of quicksort, and the main idea is to find a good pivot. Assume that all the elements are distinct (the algorithm works otherwise too).
1. Divide the n elements into groups of 5, leaving aside between 0 and 4 elements that cannot be placed in a group.
Sort each group of 5 elements by any algorithm and take the middle element from each group. This yields about n/5 medians.
2. Use the SELECT algorithm to find the median of these n/5 elements, and choose this median as the pivot. Note that this pivot exceeds about half of the middle elements, i.e., roughly n/10 of them, and each of these middle elements exceeds two elements of its own group. Thus the pivot exceeds at least roughly 3n/10 elements.
Also, by a similar argument, the pivot is less than at least roughly 3n/10 elements. If n = 75, then 3⌊n/10⌋ = 3 × 7 = 21 > 75/4; in general, n ≥ 75 implies 3⌊n/10⌋ ≥ n/4. This means that, if n ≥ 75, the pivot is greater than at least n/4 of the elements and less than at least n/4 of the elements. Consequently, when we partition an array with this pivot, the k-th element is isolated to within a range of at most 3n/4 of the elements.
Implementation
keytype select (int i, j, k)
/* returns the key of the k-th smallest element among A[i] ... A[j] */
{
    if ((j - i) < 75)
        find the k-th smallest by some simple algorithm
    else {
        for (m = 0; m <= (j - i - 4)/5; m++)
            find the third element (the median) among
                A[i + 5*m] ... A[i + 5*m + 4] and swap it with A[i + m];
        pivot = select(i, i + (j - i - 4)/5, (j - i - 4)/10);
        m = partition(i, j, pivot);
        if (k <= m - i)
            return (select(i, m - 1, k));
        else
            return (select(m, j, k - (m - i)));
    }
}
The worst case complexity of the above algorithm can be described by
T(n) ≤ C1, if n ≤ 75,
T(n) ≤ C2·n + T(n/5) + T(3n/4), if n > 75.
Let us guess the solution T(n) ≤ Cn for some constant C. For n ≤ 75 this holds by taking C ≥ C1. For n > 75, assume T(m) ≤ Cm for m < n. Then
T(n) ≤ C2·n + Cn/5 + 3Cn/4 = C2·n + 19Cn/20 ≤ Cn,
provided C ≥ 20·C2.
Thus T(n) is O(n). Instead of groups of 5, let us say we choose groups of 7. It is easy to show that the worst case complexity is again O(n). In general, any size of the groups that ensures the sum of the two arguments of T(.) to be less than n will yield O(n) worst case complexity.
6.2 Lower bound on complexity for sorting methods:
Result 1: The worst case complexity of any sorting algorithm that only uses key comparisons is Ω(n log n).
Result 2: The average case complexity of any sorting algorithm that only uses key comparisons is Ω(n log n).
The above results are proved using a decision tree, which is a binary tree in which the nodes represent the status of the algorithm after making some comparisons. Consider a node x in a decision tree, and let y be its left child and z its right child. See Figure 9.4.
Figure 9.4: A decision tree scenario
Basically, y represents a state consisting of the information known at x plus the fact that the key k1 is less than the key k2. For a decision tree for insertion sort on 3 elements, see Figure 9.5.
Figure 9.5: Decision tree for a 3-element insertion sort
6.3 Lower bound on worst case complexity: Given a list of n distinct elements, there are n! possible outcomes that represent correct sorted orders.
o Any decision tree describing a correct sorting algorithm on a list of n elements will have at least n! leaves.
o In fact, if we delete nodes corresponding to unnecessary comparisons, and if we delete leaves that correspond to an inconsistent sequence of comparison results, there will be exactly n! leaves.
The length of a path from the root to a leaf gives the number of comparisons made when the ordering represented by that leaf is the sorted order for a given input list L. The worst case complexity of an algorithm is given by the length of the longest path in the associated decision tree. To obtain a lower bound on the worst case complexity of sorting algorithms, we have to consider all possible decision trees having n! leaves and take the minimum longest path. In any decision tree with n! leaves, it is clear that the longest path will have a length of at least log(n!). Since n! ≤ n^n, we have log(n!) ≤ n log n. More precisely,
n! ≥ (n/2)^(n/2),
so
log(n!) ≥ (n/2) log(n/2) = (n/2)(log n − 1).
Thus any sorting algorithm that only uses comparisons has a worst case complexity Ω(n log n).
6.4 Lower bound on average case complexity: We shall show that in any decision tree with K leaves, the average depth of a leaf is at least log K. In fact, we show the result for any binary tree with K leaves. Suppose the result is not true, and let T be the counterexample with the fewest nodes. T cannot be a single node, because log 1 = 0. Let T have k leaves. T can only be of the following two forms. See Figure 9.6.
Figure 9.6: Two possibilities for a counterexample with fewest nodes
Suppose T is of the form of Tree 1. The tree rooted at n1 has fewer nodes than T but the same number of leaves, and is hence an even smaller counterexample than T. Thus T cannot be of the Tree 1 form. Suppose T is of the form of Tree 2. The trees T1 and T2 rooted at n1 and n2 are smaller than T, so by the minimality of T the average depth of a leaf of T1 is at least log k1 and that of T2 is at least log k2, where k1 and k2 are their numbers of leaves and k1 + k2 = k. The average depth of a leaf of T is then at least
1 + (k1/k) log k1 + (k2/k) log k2,
which is minimized when k1 = k2 = k/2 and is then equal to log k. This contradicts the premise that the average depth of T is < log k. Thus T cannot be of the form of Tree 2. Hence, in any decision tree with n! leaves, the average path length to a leaf is at least log(n!), which is Ω(n log n).
6.5 Heap sort
Worst case as well as average case running time is O(n log n).
Heap: Recall that a heap is a complete binary tree such that the weight of every node is less than the weights of its children.
Figure 9.1: Example of a heap
A heap with n elements can be conveniently represented as the first n elements of an array. Furthermore, the children of a[i] can be found in a[2i] (left child) and a[2i + 1] (right child). See Figure 9.1 for an example.
Generic heap sort algorithm:
{
    Insert all records to form a heap S;
    while (S is not empty) {
        y = min(S);
        print the value of y;
        delete y from S;
    }
}
Heap sort crucially uses a function called pushdown(first, last). This assumes that the elements a[first], a[first + 1], ..., a[last] obey the heap property, except possibly the children of a[first]. The function pushes a[first] down until the heap property is restored.
Example: See Figure 9.2. Here, a[1], ..., a[10] is a heap, except that the children of a[1] violate the heap property.
Figure 9.2: Illustration of some heap operations
Heapsort Algorithm
{
    for (i = n/2; i >= 1; i--)
        pushdown(i, n);          /* initial heap construction */
    for (i = n; i >= 2; i--) {
        swap records a[i] and a[1];
        pushdown(1, i - 1);
    }
}
It can be shown that the initial heap construction takes O(n) time in the worst case. The sorting portion takes worst case O(n log n) time. Heap sort can also be used for computing order statistics, i.e., the k-th lowest in a list of records.
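A runnable sketch of the above in standard C++, keeping the text's 1-indexed array and "parent smaller than children" (min-heap) convention, so the array ends up in descending order; the data is an illustrative assumption.

#include <iostream>
#include <vector>
using namespace std;

// Restores the heap property in a[first..last], where only the
// children of a[first] may violate it.
void pushdown(vector<int>& a, int first, int last) {
    int r = first;
    while (2 * r <= last) {
        int c = 2 * r;                                  // left child
        if (c + 1 <= last && a[c + 1] < a[c]) c++;      // smaller child
        if (a[r] <= a[c]) break;
        swap(a[r], a[c]);
        r = c;
    }
}

int main() {
    vector<int> a = {0, 36, 9, 25, 1, 49, 64, 16, 81, 4};  // a[0] unused
    int n = (int)a.size() - 1;
    for (int i = n / 2; i >= 1; i--) pushdown(a, i, n);    // build heap: O(n)
    for (int i = n; i >= 2; i--) {
        swap(a[1], a[i]);               // move current minimum to the end
        pushdown(a, 1, i - 1);
    }
    for (int i = 1; i <= n; i++) cout << a[i] << " ";      // descending order
    cout << "\n";
}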
6.6 Quick sort
A divide and conquer algorithm, designed by C.A.R. Hoare in 1962. Worst case O(n²) time, but average case O(n log n) time; better average case performance than heap sort.
Algorithm: To sort the records a[i], a[i + 1], ..., a[j] in place:
quicksort (i, j)
{
    if (a[i] ... a[j] contain at least two distinct keys) {
        let v be the larger of the first two distinct keys;
        partition a[i] ... a[j] so that, for some k between i + 1 and j,
            a[i] ... a[k - 1] all have keys < v, and
            a[k] ... a[j] all have keys >= v;
        quicksort(i, k - 1);
        quicksort(k, j);
    }
}
v is called the pivot. It could be any element such that the resulting partition desirably has two roughly equal sized groups.
Algorithm for partitioning:
int partition (int i; int j; key-type pivot)
/* partitions the elements a[i] ... a[j] wrt the pivot and returns the position k */
{
    int l, r;
    l = i;    /* l starts from the left end */
    r = j;    /* r starts from the right end */
    do {
        swap the records a[l] and a[r];
        while (a[l].key < pivot)
            l = l + 1;
        while (a[r].key >= pivot)
            r = r - 1;
    } while (l <= r);
    return (l);
}
For an example, see Figure 9.3. Worst case arises when the input is already sorted: O(n²). Average case: O(n log n).
Figure 9.3: Quicksort applied to a list of 10 elements
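A runnable sketch of quicksort with this pivot rule and the partition loop above, in standard C++; the data is an illustrative assumption.

#include <iostream>
#include <vector>
using namespace std;

// Partition a[i..j] around pivot; returns first position of the >= group.
int partition(vector<int>& a, int i, int j, int pivot) {
    int l = i, r = j;
    do {
        swap(a[l], a[r]);
        while (a[l] < pivot) l++;
        while (a[r] >= pivot) r--;
    } while (l <= r);
    return l;
}

void quicksort(vector<int>& a, int i, int j) {
    int f = i;                               // find the first two distinct keys
    while (f < j && a[f] == a[f + 1]) f++;
    if (f == j) return;                      // no two distinct keys: done
    int pivot = max(a[f], a[f + 1]);         // larger of first two distinct keys
    int k = partition(a, i, j, pivot);
    quicksort(a, i, k - 1);                  // keys < pivot
    quicksort(a, k, j);                      // keys >= pivot
}

int main() {
    vector<int> a = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
    quicksort(a, 0, (int)a.size() - 1);
    for (int x : a) cout << x << " ";
    cout << "\n";
}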
Quick sort average case analysis: Assume that all initial orderings of the keys are equally likely; Assume that the keys are distinct
Note that the presence of equal keys will make the sorting easier, not harder. Also assume that when we call quicksort(i, j), all orders for A[i] ... A[j] are equally likely.
Let T(n) = average time taken by quicksort to sort n elements. T(1) = C1, where C1 is some constant. Recall that the pivot is the larger of the first two distinct elements. When n > 1, quicksort splits the subarray, taking C2·n time, where C2 is another constant. Also, since the pivot is the larger of the first two elements, left groups tend to be larger than right groups. The left group can have i elements, where i = 1, 2, ..., n − 1, since the left group has at least one element and the right group also has at least one element. Let us fix i and try to compute the probability P{left group has i elements}. Now, the left group has i elements iff the pivot is the (i + 1)-st smallest element among the n elements. If the pivot is in position 1, then the element in position 2 is one of the i smaller elements, and vice-versa.
P{position 1 contains the pivot and position 2 one of the i smaller elements} = (1/n) · (i/(n − 1)).
Similarly, P{position 2 contains the pivot and position 1 one of the i smaller elements} = (1/n) · (i/(n − 1)).
Therefore, P{left group has i elements} = 2i/(n(n − 1)). This leads to
T(n) ≤ C2·n + Σ(i=1 to n−1) [2i/(n(n − 1))] (T(i) + T(n − i)).
Using the symmetry of the summand in i and n − i (pair the term for i with the term for n − i, so that Σ i·(T(i) + T(n − i)) = n·Σ T(i)), the above simplifies to
T(n) ≤ C2·n + (2/(n − 1)) Σ(i=1 to n−1) T(i).
This is the recurrence one would get if all sizes between 1 and n − 1 for the left group were equally likely; thus picking the larger of the two elements doesn't really affect the size distribution. We shall guess the solution T(n) ≤ Cn log n for some constant C and prove its correctness using induction.
Basis: n = 2. Cn log n = 2C ≥ T(2) for C chosen large enough, which is correct.
Induction Step: Assume T(i) ≤ Ci log i for all i < n. Then
T(n) ≤ C2·n + (2C/(n − 1)) Σ(i=1 to n−1) i log i ≤ C2·n + Cn log n − Cn/4,
using the bound Σ(i=1 to n−1) i log i ≤ (n²/2) log n − n²/8. Picking C ≥ 4·C2, we have C2·n − Cn/4 ≤ 0. Thus T(n) is O(n log n).
6.7 Merge sort
A divide and conquer algorithm with O(n log n) worst case performance; useful in external sorting applications.
Algorithm:
list-type mergesort (list-type L; int n)
{
    if (n == 1)
        return (L);
    else {
        split L into two halves L1 and L2;
        return (merge(mergesort(L1, n/2), mergesort(L2, n - n/2)));
    }
}
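A runnable sketch of the list-based mergesort in standard C++, with std::vector standing in for list-type; the data is an illustrative assumption.

#include <iostream>
#include <vector>
using namespace std;

// Merge two sorted runs into one sorted run.
vector<int> merge_runs(const vector<int>& L1, const vector<int>& L2) {
    vector<int> out;
    size_t i = 0, j = 0;
    while (i < L1.size() && j < L2.size())
        out.push_back(L1[i] <= L2[j] ? L1[i++] : L2[j++]);
    while (i < L1.size()) out.push_back(L1[i++]);
    while (j < L2.size()) out.push_back(L2[j++]);
    return out;
}

vector<int> mergesort(vector<int> L) {
    if (L.size() <= 1) return L;
    size_t mid = L.size() / 2;
    vector<int> L1(L.begin(), L.begin() + mid);   // split into two halves
    vector<int> L2(L.begin() + mid, L.end());
    return merge_runs(mergesort(L1), mergesort(L2));
}

int main() {
    vector<int> L = {28, 3, 93, 54, 65, 30, 90, 10, 69, 8, 22};
    for (int x : mergesort(L)) cout << x << " ";
    cout << "\n";
}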
Let T(n) be the running time of mergesort on a list of size n. Then
T(n) = 2T(n/2) + O(n), with T(1) = O(1),
and it can be shown that T(n) is O(n log n).
Mergesort as an External Sorting Procedure
Assume that the data to be sorted is stored in a file. The essential idea is to organize the file into progressively larger runs, where a run is a sequence of records r1, ..., rk such that ri ≤ ri+1 for i = 1, ..., k − 1. We say a file r1, r2, ..., rm of records is organized into runs of length k if:
It can be shown that T(n) is O(nlog n). Mergesort as an External Sorting Procedure Assume that the data to be sorted is stored in a file The essential idea is to organize a file into progressively larger runs where: A run is a sequence of records r 1 ,..., r k , such that r i r i + 1 for i = 1,..., k - 1. We say a file, r 1 , r 2 ,..., r m , of records is organized into runs or length k if: i 1 such that ki m, the sequence r k(i - 1) + 1 , r k(i - 1) + 2 ,..., r ki
is a run of length k and furthermore if m is not divisible by k and m = pk + q where q < k, the sequence (r m - q + 1 ,..., r m ) which is called the tail is a run of length q. Example The following is organized into runs of length 3 with the tail having length 2. 7 15 29 8 11 13 16 22 31 5 12 Basic step Consider two files f1 and f2 organized into runs of length k. Assume that 1. If n i is the number of runs (including tails) of f 1 (i = 1, 2), then | n 1 - n 2 | 1 2. At most one of f 1 and f 2 has a tail 3. The one with a tail (if any) has at least as many runs as the other. www.jntuworld.com || www.android.jntuworld.com || www.jwjobs.net || www.android.jwjobs.net www.jntuworld.com || www.jwjobs.net
In mergesort, we read one run from each of f 1 and f 2 , merge the runs into a run of length 2k and append it to one of two files g 1 and g 2 . By alternating between g 1 and g 2 , these files would be organized into runs of length 2k, satisfying (1), (2), and (3). Algorithm First divide all n records into two files f 1 and f 2 as evenly as possible. f 1 and f 2 can be regarded as organized into runs of length 1. Merge the runs of length 1 and distribute them into files g 1 and g 2 which will now be organized into runs of length 2. Now make f 1 and f 2 empty and merge g 1 and g 2 into f 1 and f 2 creating runs of length 4. Repeat ...
Example f 1 28 3 93 54 65 30 90 10 69 8 22 f 2 31 5 96 85 9 39 13 8 77 10 runs of length 1 merge into g 1 and g 2
g 1 28 31 93 96 9 65 13 90 69 77 22 g 2 3 5 54 85 30 39 8 10 8 10 runs of length 2 merge into f 1 and f 2
f 1 3 5 28 31 9 30 39 65 8 10 69 77 f 2 54 85 93 96 8 10 13 90 22 runs of length 4 merge into g 1 and g 2
g 1 3 5 28 31 54 85 93 96 8 10 22 69 77 g 2 8 9 10 13 30 39 65 90 runs of length 8 merge into f 1 and f 2
f 1 3 5 8 9 10 13 28 30 31 39 54 65 85 90 93 96
f 2 8 10 22 69 77 runs of length 16 merge into g 1 and g 2
g1 will be left with the sorted sequence. After i passes, we have two files consisting of runs of length 2^i. If 2^i ≥ n, then one of the two files will be empty and the other will contain a single run of length n (which is the sorted sequence).
Therefore the number of passes is ⌈log n⌉. Each pass requires the reading of two files and the writing of two files, each of total length about n/2. The total number of blocks read or written in a pass is thus about 2n/b, where b is the number of records that fit into a block, and the total number of block reads or writes for the entire sort is O((n/b) log n). The procedure to merge reads and writes only one record at a time, so it is not required to have a complete run in memory at any time. Mergesort need not start with runs of length 1: we could start with a pass that, for some appropriate k, reads groups of k records into main memory, sorts them (say with quicksort), and writes them out as runs of length k.
Multiway Mergesort
Here we merge m files at a time. Let the files f1, ..., fm be organized as runs of length k.
We can read m runs, one from each file, and merge them into one run of length mk. This run is placed on one of m files g1, ..., gm, each getting a run in turn. The merging process involves, at each step, selecting the minimum among the m currently smallest unselected records, one from each file.
By using a priority queue that supports insert and delete in logarithmic time (e.g., a partially ordered tree), each selection can be accomplished in O(log m) time. Number of passes: ⌈log_m n⌉.
Effort in each pass = O(n log2 m); overall complexity = O(n log2 m · log_m n).
Advantages:
We save, by a factor of log2 m, the number of times we read each record.
We can process data O(m) times faster with m disk units.
6.8 Radix sort
Here we use some special information about the keys and design sorting algorithms which beat the O(n log n) lower bound for comparison-based sorting methods. Consider sorting n integers in the range 0 to n² − 1. We do it in two phases.
Phase 1: We use n bins, one for each of the integers 0, 1, ..., n − 1. We place each integer i of the list to be sorted into the bin numbered i mod n. Each bin will then contain a list of integers leaving the same remainder when divided by n.
At the end, we concatenate the bins in order to obtain a list L.
Phase 2: The integers on the list L are redistributed into bins, but using the bin selection function ⌊i/n⌋. Now we append integers to the ends of the lists. Finally, we concatenate the lists to get the final sorted sequence.
Example: n = 10. Initial list: 36, 9, 0, 25, 1, 49, 64, 16, 81, 4.
Phase 1: bin = i mod 10 = rightmost digit of i. Bin 0: 0; bin 1: 1, 81; bin 4: 64, 4; bin 5: 25; bin 6: 36, 16; bin 9: 9, 49. Concatenating gives 0, 1, 81, 64, 4, 25, 36, 16, 9, 49.
Phase 2: bin = ⌊i/10⌋ = leftmost digit of i. Concatenating the bins now gives the sorted list
0, 1, 4, 9, 16, 25, 36, 49, 64, 81.
In general, assume that the key-type consists of k components f1, f2, ..., fk, of types t1, t2, ..., tk. Suppose it is required to sort the records in lexicographic order of their keys. That is, (a1, a2, ..., ak) < (b1, b2, ..., bk) if one of the following holds:
1. a1 < b1
2. a 1 = b 1 ;a 2 < b 2
3. a 1 = b 1 ;a 2 = b 2 ;a 3 < b 3
. . . k. a 1 = b 1 ;...;a k - 1 = b k - 1 ;a k < b k
Radix sort proceeds in the following way. First binsort all records on fk, the least significant field, then concatenate the bins, lowest value first. Then binsort the above concatenated list on f(k−1), and concatenate the bins again, lowest value first. In general, after binsorting on fk, f(k−1), ..., fi, the records will appear in lexicographic order if the key consisted of only the fields fi, ..., fk.
void radixsort()
/* Sorts a list A of n records with keys consisting of fields f1, ..., fk of types
   t1, ..., tk. The function uses k arrays B1, ..., Bk, where Bi is of type
   array[ti] of list-type, i = 1, ..., k, and list-type is a linked list of records. */
{
    for (i = k; i >= 1; i--) {
        for (each value v of type ti)
            make Bi[v] empty;    /* clear bins */
        for (each record r on list A)
            move r from A onto the end of bin Bi[v], where v is the value of the field fi of the key of r;
        for (each value v of type ti, from lowest to highest)
            concatenate Bi[v] onto the end of A;
    }
}
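For the two-phase special case described earlier (integers in 0 .. n²−1), a runnable sketch in standard C++; the data is an illustrative assumption.

#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> A = {36, 9, 0, 25, 1, 49, 64, 16, 81, 4};
    int n = (int)A.size();
    for (int pass = 0; pass < 2; pass++) {
        vector<vector<int>> bins(n);
        for (int x : A)                     // stable binsort on one digit
            bins[pass == 0 ? x % n : x / n].push_back(x);
        A.clear();
        for (auto& b : bins)                // concatenate bins in order
            for (int x : b) A.push_back(x);
    }
    for (int x : A) cout << x << " ";
    cout << "\n";
}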
Elements to be sorted are presented in the form of a linked list, obviating the need for copying a record; we just move records from one list to another. For concatenation to be done quickly, we need pointers to the ends of lists.
Complexity of Radix Sort
Inner loop 1 takes O(si) time, where si = the number of different values of type ti.
Inner loop 2 takes O(n) time; inner loop 3 takes O(si) time. Thus the total time is
Σ(i=1 to k) O(si + n) = O(kn + Σ(i=1 to k) si) = O(n + Σ si), for constant k.
For example, if the keys are integers in the range 0 to n^k − 1, then we can view the integers as radix-n integers, each k digits long. Then ti has range 0 ... n − 1 for 1 ≤ i ≤ k. Thus si = n and the total running time is O(n). Similarly, if the keys are character strings of length k, for constant k, then si = 128 (say) for i = 1, ..., k, and again the total running time is O(n).
2. Give the methods for: i. lower bound on worst case complexity, ii. lower bound on average case complexity.
3. Give the implementation of, and explain with an example: i. heap sort, ii. quick sort, iii. radix sort, iv. merge sort.
UNIT-VII Topics: Pattern matching and Tries : Pattern matching algorithms- the Boyer Moore algorithm, the Knuth-Morris-Pratt algorithm Tries: Definitions and concepts of digital search tree, Binary trie, Patricia , Multi-way trie
7.1 Pattern Matching
Pattern matching is to find a pattern, which is relatively small, in a text, which is supposed to be very large. Patterns and texts can be one-dimensional or two-dimensional. In the one-dimensional case, examples are text editors and DNA analysis. In a text editor we have 26 characters and some special symbols, whereas in the DNA case we have the four characters A, C, G, and T. In a text editor a pattern is often a word, whose length is around 10, and the length of the text is a few hundred up to one million. In DNA analysis, a gene is a few hundred characters long and the human genome is about 3 billion characters long. In the two-dimensional case, the typical application is pattern matching in computer vision: a pattern is about (100, 100) and a text typically (1024, 1024). Either one-dimensional or two-dimensional, the text is very large, and therefore a fast algorithm to find the occurrence(s) of the pattern in it is needed. We start with a naïve algorithm for the one-dimensional case.
Text (size n):       a b a b a a
Pattern (size m):    a b a b b      <- mismatch
Shifted pattern:       a b a b
At first the pattern is set at the left end of the text, and the matching process starts. After a mismatch is found (at the bold characters), the pattern is shifted one place right and a new matching process starts, and so on. The pattern and text are in arrays pat[1..m] and text[1..n] respectively.
Algorithm 1. Naive pattern matching algorithm
j := 1;
while j <= n - m + 1 do begin
    i := 1;
    while (i <= m) and (pat[i] = text[j]) do begin
        i := i + 1;
        j := j + 1
    end;
; 8. if i<=m then j:=j-i+2 /* shift the pattern one place right */ 9. else write (found at , j-i+1) 10. end . The worst case happens when pat and text are all as but b at the end, such as pat = aaaaab and text = aaaaaaaaaaaaaaaaaaaaaaaaaaaaab. The time is obviously O(mn). On average the situation is not as bad, but in the next section we introduce a much better algorithm. We call the operation pat[i]=text[j] a comparison between characters, and a b a b a a a b a b b a b a b 49 measure the complexity of a given algorithm by the number of character comparisons. The rest of computing time is proportional to this measure.
7.2 Knuth-Morris-Pratt algorithm (KMP algorithm)
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.
Consider the example where W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:
m, which denotes the position within S which is the beginning of a prospective match for W;
i, the index in W denoting the character currently under consideration.
In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like:
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step, S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting m = 4 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus, not only do we omit previously matched characters of S, but also previously matched characters of W.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial we return to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456
Once again we immediately hit upon a match "ABCDAB", but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456
This time we are able to complete the match, whose first character is S[15].
Algorithm:
algorithm kmp_search:
input: an array of characters, S (the text to be searched); an array of characters, W (the word sought)
output: an integer (the zero-based position in S at which W is found)
define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)
while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        let m ← m + i - T[i]
        if T[i] > -1 then
            let i ← T[i]
        else
            let i ← 0
(if we reach here, we have searched all of S unsuccessfully)
return the length of S
Efficiency of KMP: Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k). These complexities are the same, no matter how many repetitive patterns are in W or S.
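The table T is assumed to be "computed elsewhere" in the pseudocode above; the following is a minimal sketch of the standard construction in standard C++, using the same convention (T[0] = −1). Names here are illustrative, not the textbook's.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Builds the partial-match table T for word W (T[0] = -1).
vector<int> build_table(const string& W) {
    vector<int> T(W.size());
    T[0] = -1;
    if (W.size() > 1) T[1] = 0;
    size_t pos = 2;                 // next entry of T to compute
    int cnd = 0;                    // length of current candidate prefix
    while (pos < W.size()) {
        if (W[pos - 1] == W[cnd]) { T[pos] = cnd + 1; cnd++; pos++; }
        else if (cnd > 0) cnd = T[cnd];
        else { T[pos] = 0; pos++; }
    }
    return T;
}

int main() {
    string W = "ABCDABD";
    vector<int> T = build_table(W);      // expected: -1 0 0 0 0 1 2
    for (size_t i = 0; i < T.size(); i++)
        cout << "T[" << i << "] = " << T[i] << "\n";
}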
Implementation of KMP:
#include<iostream.h>
#include<conio.h>
#include<string.h>
class str
{
private:
    char t[58],p[67],f[45];
    int i,j,m,n;
public:
    void failure(char[]);
    int kmpmatch(char[],char[]);
};
void str::failure(char x[])
{
    /* builds the failure table f for pattern x; the body is truncated
       in the source, so the standard construction is sketched here as
       an assumed reconstruction */
    n=strlen(x);
    f[0]=0;
    for(j=1;j<n;j++)
    {
        i=f[j-1];
        while(i>0 && x[j]!=x[i]) i=f[i-1];
        if(x[j]==x[i]) i++;
        f[j]=i;
    }
}
int str::kmpmatch(char tx[],char pt[])
{
    /* returns the first position of pt in tx, or -1; likewise an
       assumed reconstruction of the truncated original */
    failure(pt);
    m=strlen(tx); n=strlen(pt);
    int a=0,c=0;
    while(a<m)
    {
        if(tx[a]==pt[c]) { a++; c++; if(c==n) return a-n; }
        else if(c>0) c=f[c-1];
        else a++;
    }
    return -1;
}
void main()
{
    str b;
    clrscr();
    char t[50],p[20];
    cout<<"\nEnter the text(without spaces):";
    cin>>t;
    cout<<"\nEnter the pattern to match:";
    cin>>p;
    int a=b.kmpmatch(t,p);
    if(a!=-1)
        cout<<"\nMatch at position"<<" "<<a;
    else
        cout<<"\nsorry!Pattern not found";
    getch();
}
7.3 Boyer Moore Algorithm
In computer science, the Boyer–Moore string search algorithm is an efficient string searching algorithm that is the standard benchmark for practical string search literature. [1] It was developed by Robert S. Boyer and J Strother Moore in 1977. [2] The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the pattern either is much shorter than the text or persists across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm runs faster as the pattern length increases.
Definitions:
S[i] refers to the character at index i of string S, counting from 1.
S[i..j] refers to the substring of string S starting at index i and ending at j, inclusive.
A prefix of S is a substring S[1..i] for some i in range [1, n], where n is the length of S.
A suffix of S is a substring S[i..n] for some i in range [1, n], where n is the length of S.
The string to be searched for is called the pattern and is referred to with symbol P. The string being searched in is called the text and is referred to with symbol T. The length of P is n. The length of T is m.
An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T. A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
A N P A N M A N
P A N - - - - -
- P A N - - - -
- - P A N - - -
- - - P A N - -
- - - - P A N -
- - - - - P A N
Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.
Description:
The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are m − n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible. The algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted to the right according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found. The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.
The Good Suffix Rule
Description:
- - - - X - - K - - - - -
M A N P A N A M A N A P -
A N A M P N A M - - - - -
- - - - A N A M P N A M -
Demonstration of the good suffix rule with pattern ANAMPNAM.
The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. It is the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus: [3]
Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P is below substring t in T.
If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right.
If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past T.
Preprocessing
The good suffix rule requires two tables: one for use in the general case, and another for use when either the general case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their definitions are as follows: [3]
For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]] and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there is no position satisfying the condition.
Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, let H[i] be zero.
Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given by n - L[i] or n - H[i]; H should only be used if L[i] is zero or a match has been found.
Performance:
The Boyer-Moore algorithm as presented in the original paper has a worst-case running time of O(n + m) only if the pattern does not appear in the text.
Implementation of Boyer-Moore:
#include<iostream.h>
#include<conio.h>
#include<string.h>
class bm
{
public:
    char M[20], P[15];    // M holds the text, P holds the pattern
    int last(char);       // last-occurrence (bad character) function
    int psize();          // length of the pattern
    int boyer(int, int);  // the search itself
    int min(int, int);    // helper: minimum of two integers
};
int bm::last(char ch)
{
    // build the last-occurrence table for a 128-character (ASCII) alphabet
    int l[128], i, k;
    for (i = 0; i < 128; i++)
        l[i] = -1;
    k = psize();
    for (i = 0; i < k; i++)
        l[P[i]] = i;
    return l[ch];
}
int bm::psize()
{
    return strlen(P);
}
int bm::min(int a, int b)
{
    return (a < b) ? a : b;
}
int bm::boyer(int n, int m)    // n = text length, m = pattern length
{
    int i = m - 1;    // index into the text M
    int j = m - 1;    // index into the pattern P
    do
    {
        if (P[j] == M[i])
        {
            if (j == 0)
                return i;    // match found at position i
            else
            {                // characters match: keep comparing leftward
                i--;
                j--;
            }
        }
        else
        {                    // mismatch: apply the bad character shift
            i += m - min(j, 1 + last(M[i]));
            j = m - 1;
        }
    } while (i <= n - 1);
    return -1;               // pattern does not occur in the text
}
// (The source is truncated after "if (j == 0"; the remainder above is a
// reconstruction of the standard bad-character-only Boyer-Moore loop.)
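A short driver in the style of the KMP example above could exercise this class. The following sketch is an assumption, not code from the text: it reads the text and pattern directly into the public arrays (so the text is limited to 19 characters by M[20]):
void main()
{
    bm b;
    clrscr();
    cout << "\nEnter the text (without spaces): ";
    cin >> b.M;
    cout << "\nEnter the pattern to match: ";
    cin >> b.P;
    int pos = b.boyer(strlen(b.M), b.psize());
    if (pos != -1)
        cout << "\nMatch at position " << pos;
    else
        cout << "\nSorry! Pattern not found";
    getch();
}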
7.4 Tries
Digital search trees:
Definition: A digital search tree is a binary tree in which each node contains one element. The element-to-node assignment is determined by the binary representation of the element keys. We number the bits in the binary representation of a key from left to right, beginning at one.
Ex: bit one of 1000 is 1, and bits two, three and four are 0. All keys in the left subtree of a node at level i have bit i equal to zero, whereas those in the right subtree of nodes at this level have bit i equal to 1.
o Assume a fixed number of bits.
o If the tree is not empty, the root contains one dictionary pair (any pair).
o All remaining pairs whose key begins with a 0 are in the left subtree.
o All remaining pairs whose key begins with a 1 are in the right subtree.
o The left and right subtrees are digital subtrees on the remaining bits.
This digital search tree contains the keys 1000, 0010, 1001, 0001, 1100, 0000.
Example: Start with an empty digital search tree and insert a pair whose key is 0110
Search and Insert:
The digital search tree functions to search and insert are quite similar to the corresponding functions for binary search trees. The essential difference is that the subtree to move to is determined by a bit in the search key rather than by the result of the comparison of the search key and the key in the current node (see the sketch after the exercise below).
Try to build the digital search tree for the keys:
A 00001   S 10011   E 00101   R 10010   C 00011   H 01000
I 01001   N 01110   G 00111   X 11000   M 01101   P 10000
When we are dealing with very long keys, the cost of a key comparison is high. We can reduce the number of key comparisons to one by using a related structure called Patricia.
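As an illustration of the point above, here is a minimal sketch of digital search tree search in C++. The node structure and the bit() helper are assumptions made for this example, not code from the text; keys are taken as unsigned integers examined from the most significant of a fixed number of bits:
struct DigitalNode
{
    unsigned key;
    DigitalNode *left, *right;
};

// return bit i (1 = leftmost) of a key that is numBits bits long
int bit(unsigned key, int i, int numBits)
{
    return (key >> (numBits - i)) & 1;
}

// search: branch on successive key bits instead of on key comparisons
DigitalNode* search(DigitalNode *t, unsigned key, int numBits)
{
    int i = 1;
    while (t)
    {
        if (t->key == key)
            return t;    // one full key comparison per node visited
        t = bit(key, i, numBits) ? t->right : t->left;
        i++;
    }
    return 0;            // not found
}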
We shall develop this structure in three steps. First, we introduce a structure called a binary trie. Then we transform binary tries into compressed binary tries. Finally, from compressed binary tries we obtain Patricia.
7.5 Binary Trie
A binary trie is a binary tree that has two kinds of nodes: branch nodes and element nodes. A branch node has the two pointer data members LeftChild and RightChild, but it has no data member. An element node has the single data member data. Branch nodes are used to build a binary tree search structure similar to that of a digital search tree; this search structure leads to the element nodes.
A six-element binary trie:
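A minimal node declaration for such a trie might look as follows. The flat layout and the member names are illustrative assumptions, not declarations taken from the text:
struct BinaryTrieNode
{
    bool isBranch;               // true for a branch node, false for an element node
    BinaryTrieNode *LeftChild;   // used only by branch nodes
    BinaryTrieNode *RightChild;  // used only by branch nodes
    unsigned data;               // used only by element nodes
};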
Compressed Binary Trie:
The binary trie contains branch nodes whose degree is one. By adding another data member, BitNumber, to each branch node, we can eliminate all degree-one branch nodes from the trie. The BitNumber data member of a branch node gives the bit number of the key that is to be used at this node.
The binary trie with degree-one nodes eliminated:
7.6 Patricia (Practical Algorithm To Retrieve Information Coded In Alphanumeric)
Compressed binary tries may be represented using nodes of a single type. The new nodes, called augmented branch nodes, are the original branch nodes augmented by the data
member data. The resulting structure is called Patricia and is obtained from a compressed binary trie in the following way:
(1) Replace each branch node by an augmented branch node.
(2) Eliminate the element nodes.
(3) Store the data previously in the element nodes in the data members of the augmented branch nodes. Since every nonempty compressed binary trie has one less branch node than it has element nodes, it is necessary to add one augmented branch node. This node is called the head node. The remaining structure is the left subtree of the head node. The head node has BitNumber equal to zero; its right-child data member is not used. The data is assigned so that the BitNumber of the augmented branch node that receives it is less than or equal to that in the parent of the element node that originally contained the data.
(4) Replace the original pointers to element nodes by pointers to the respective augmented branch nodes.
typedef struct patricia_tree *patricia;
struct patricia_tree
{
    int bit_number;
    element data;
    patricia left_child, right_child;
};
patricia root;

Patricia search:
patricia search(patricia t, unsigned k)
{
    /* search the Patricia tree t; return the last node y encountered;
       if k == y->data.key, the key is in the tree */
    patricia p, y;
    if (!t)
        return NULL;    /* empty tree */
    y = t->left_child;
    p = t;
    while (y->bit_number > p->bit_number)
    {
        p = y;
        y = bit(k, y->bit_number) ? y->right_child : y->left_child;
    }
    return y;
}
Patricia insert:
void insert(patricia *t, element x)
{
    /* insert x into the Patricia tree *t */
    patricia s, p, y, z;
    int i;
    if (!(*t))
    {   /* empty tree */
        *t = (patricia)malloc(sizeof(struct patricia_tree));
        if (IS_FULL(*t))
        {
            fprintf(stderr, "The memory is full\n");
            exit(1);
        }
        (*t)->bit_number = 0;
        (*t)->data = x;
        (*t)->left_child = *t;
        return;
    }
    y = search(*t, x.key);
    if (x.key == y->data.key)
    {
        fprintf(stderr, "The key is in the tree. Insertion fails.\n");
        exit(1);
    }
    /* find the first bit where x.key and y->data.key differ */
    for (i = 1; bit(x.key, i) == bit(y->data.key, i); i++)
        ;
    /* search the tree using only the first i-1 bits */
    s = (*t)->left_child;
    p = *t;
    while (s->bit_number > p->bit_number && s->bit_number < i)
    {
        p = s;
        s = bit(x.key, s->bit_number) ? s->right_child : s->left_child;
    }
    /* add x as a child of p */
    z = (patricia)malloc(sizeof(struct patricia_tree));
    if (IS_FULL(z))
    {
        fprintf(stderr, "The memory is full\n");
        exit(1);
    }
    z->data = x;
    z->bit_number = i;
    z->left_child = bit(x.key, i) ? s : z;
    z->right_child = bit(x.key, i) ? z : s;
    if (s == p->left_child)
        p->left_child = z;
    else
        p->right_child = z;
}
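The code above relies on a bit extraction function bit(k, i) and an allocation check IS_FULL that are not shown in the text. One plausible way to supply them, assuming keys are unsigned integers of MAX_BITS bits numbered from the most significant bit (MAX_BITS = 5 matches the five-bit keys in the exercise above), is:
#define MAX_BITS 5             /* assumed key width for this sketch */
#define IS_FULL(ptr) (!(ptr))  /* treat a failed malloc as "memory full" */

/* bit i of key k, with bit 1 being the leftmost (most significant) bit */
int bit(unsigned k, int i)
{
    return (k >> (MAX_BITS - i)) & 1;
}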
Assignment Questions: Pattern Matching and Tries
1. Explain the following pattern matching algorithms:
   i. the Boyer-Moore algorithm
   ii. the Knuth-Morris-Pratt algorithm
2. Define tries. Give the concept of a digital search tree. What are the applications of tries?
3. Write short notes on:
   i. the binary trie
   ii. Patricia
   iii. the multi-way trie
4. Describe an efficient algorithm to find the longest palindrome that is a suffix of a string T of length n.
UNIT-VIII
Topics: File Structures: Fundamental File Processing Operations - opening files, closing files, reading and writing file contents, special characters in files. Fundamental File Structure Concepts - field and record organization, managing fixed-length, fixed-field buffers.
8.1 Fundamental file processing operations
physical file: A file as seen by the operating system, and which actually exists on secondary storage.
logical file: A file as seen by a program.
Programs read and write data from logical files. Before a logical file can be used, it must be associated with a physical file; this act of connection is called "opening" the file. Data in a physical file is persistent; data in a logical file is temporary. A logical file is identified (within the program) by a program variable or constant.
C++ supports file access on three levels:
o Unbuffered, unformatted file access using handles.
o Buffered, formatted file access using the FILE structure.
o Buffered, formatted file access using classes.
8.1.1 Opening Files
open: To associate a logical program file with a physical system file.
protection mode: The security status of a file, defining who is allowed to access a file and which access modes are allowed.
access mode: The type of (file) access allowed.
file descriptor: A cardinal number used as the identifier for a logical file by operating systems such as UNIX and PC-DOS.
In C++, a file is opened by using library functions. The name of the physical file must be supplied to an open function. The open function must also be supplied with an access mode, and can also be supplied with a protection mode. The access mode has several aspects:
o Is the file to be accessed by reading, by writing, or by both?
o What should be done with existing contents of the file?
o Should a new file be created if none exists?
o Should any character translation be done?
For handle level access, the logical file is declared as an int. The handle is also known as a file descriptor. The C++ open function is used to open a file for handle level access; the value returned by open is assigned to the file variable. For FILE level access, the logical file is declared as a FILE *.
The C++ fopen function is used to open a file for FILE level access; the value returned by fopen is assigned to the file variable. For class level access, the logical file is declared as an fstream (or as an ifstream or an ofstream). The C++ open method of the class is used to open a file for class level access.
8.1.2 Closing Files
close: To disassociate a logical program file from a physical system file.
Closing a file frees system resources for reuse. Data may not actually be written to the physical file until the logical file is closed. A program should close a file when it is no longer needed.
The C++ close function is used to close a file for handle level access; the logical file to be closed is an argument to the handle close function. The C++ fclose function is used to close a file for FILE level access; the logical file to be closed is an argument to fclose. The C++ close method of the class is used to close a file for class level access.
8.1.3 Reading and writing
read: To transfer data from a file to program variable(s).
write: To transfer data to a file from program variable(s) or constant(s).
end-of-file: A physical location just beyond the last datum in a file.
The read and write operations are performed on the logical file with calls to library functions. For read, one or more variables must be supplied to the read function, to receive the data from the file. For write, one or more values (as variables or constants) must be supplied to the write function, to provide the data for the file. For unformatted transfers, the amount of data to be transferred must also be supplied.
The C++ read function is used to read a file with handle level access, and the C++ write function is used to write one. The C++ fread function is used to read a file with FILE level access, and the C++ fwrite function is used to write one. The C++ read and write methods of the class are used to read and write a file with class level access.
The acronym for end-of-file is EOF. When a file reaches EOF, no more data can be read; data can be written at or past EOF. The C++ eof function is used to detect end-of-file with handle level access; it detects when the file pointer is at end-of-file. The C++ feof function is used to detect end-of-file with FILE level access; it detects when the file pointer is past end-of-file. The C++ eof method of the class is used to detect end-of-file with class level access; the stream class eof function detects when the file pointer is past end-of-file. A sketch contrasting the three access levels follows.
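The following is a minimal sketch of the three access levels, not a definitive implementation: the file name myfile.txt is a placeholder, error checks are omitted for brevity, and the headers shown are the usual ones on UNIX-like systems (DOS compilers use <io.h> in place of <unistd.h>):
#include <fcntl.h>      // open, O_RDONLY (handle level)
#include <unistd.h>     // read, close (handle level; <io.h> under DOS)
#include <stdio.h>      // fopen, fread, fclose (FILE level)
#include <fstream.h>    // ifstream (class level)

void demo()
{
    char buf[100];

    // Handle level: unbuffered, unformatted access via a file descriptor.
    int fd = open("myfile.txt", O_RDONLY);
    read(fd, buf, 100);
    close(fd);

    // FILE level: buffered access through a FILE structure.
    FILE *fp = fopen("myfile.txt", "r");
    fread(buf, 1, 100, fp);
    fclose(fp);

    // Class level: buffered access through a stream object.
    ifstream in;
    in.open("myfile.txt");
    in.read(buf, 100);
    in.close();
}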
Seeking:
seek: To move to a specified location in a file.
byte offset: The distance, measured in bytes, from the beginning of the file.
Seeking moves an attribute of the file called the file pointer. C++ library functions allow seeking. In DOS, Windows, and UNIX, files are organized as streams of bytes, and locations are in terms of byte count. Seeking can be specified from one of three reference points:
o The beginning of the file.
o The end of the file.
o The current file pointer position.
The C++ lseek function is used to seek with handle level access. The C++ fseek function is used to seek with FILE level access. The C++ seekg method of the class is used to seek with class level access for read (get), and the C++ seekp method for write (put). C++ allows, but does not require, separate file pointers and seek functions for reading and writing; most implementations of C++ do not have separate file pointers.
8.1.4 Special characters in files
Specifics of files can vary with the operating system. The C++ language was written originally for the UNIX operating system, and UNIX and DOS (Windows) systems handle separators between lines differently. In UNIX files, lines are separated by a single new line character (ASCII line feed). In DOS (Windows) files, lines are separated by two characters (ASCII carriage return and line feed). When DOS files are opened in text mode, the internal separator ('\n') is translated to the external separator (<CR><LF>) during read and write. When DOS files are opened in binary mode, the internal separator is not translated. In DOS (Windows) files, end-of-file can be marked by a "control-Z" character (ASCII SUB), and in C++ implementations for DOS, a control-Z in a file is interpreted as end-of-file. Other operating systems may handle line separation and end-of-file differently. Implementations of C++ should treat text files so that the internal representations are the same as on UNIX. On UNIX systems, files opened in text mode or binary mode behave the same way.
8.2 Fundamental file structure concepts
persistent: Retained after execution of the program which created it.
When we build file structures, we are making it possible to make data persistent. That is, one program can store data from memory to a file and terminate; later, another program can retrieve the data from the file and process it in memory. In this chapter, we look at file structures which can be used to organize the data within the file, and at the algorithms which can be used to store and retrieve the data sequentially.
8.2.1 Field and Record organization
record: A subdivision of a file, containing data related to a single entity.
field: A subdivision of a record containing a single attribute of the entity which the record describes.
stream of bytes: A file which is regarded as being without structure beyond separation into a sequential set of bytes.
Within a program, data is temporarily stored in variables. Individual values can be aggregated into structures, which can be treated as a single variable with parts. In C++, classes are typically used as an aggregate structure.
C++ Person class (version 0.1):
class Person
{
public:
    char FirstName[11];
    char LastName[11];
    char Address[21];
    char City[21];
    char State[3];
    char ZIP[5];
};
With this class declaration, variables can be declared to be of type Person. The individual fields within a Person can be referred to as the name of the variable and the name of the field, separated by a period (.).
C++ Program:
#include <iostream.h>
class Person
{
public:
    char FirstName[11];
    char LastName[11];
    char Address[31];
    char City[21];
    char State[3];
    char ZIP[5];
};
void Display(Person Someone)
{
    cout << Someone.FirstName << Someone.LastName
         << Someone.Address << Someone.City
         << Someone.State << Someone.ZIP;
}
In memory, each Person will appear as an aggregate, with the individual values being parts of the aggregate:

Person Clerk:
FirstName  LastName    Address             City       State  ZIP
Fred       Flintstone  4444 Granite Place  Rockville  MD     00001

The output of this program will be:
FredFlintstone4444 Granite PlaceRockvilleMD00001LilyMunster1313 Mockingbird LaneHollywoodCA90210
Obviously, this output could be improved. It is marginally readable by people, and it would be difficult to program a computer to read and correctly interpret this output.
Stream file:
In the Windows, DOS, UNIX, and LINUX operating systems, files are not internally structured; they are streams of individual bytes.
F r e d   F l i n t s t o n e   4 4 4 4   G r a n ...
The only file structure recognized by these operating systems is the separation of a text file into lines.
o For Windows and DOS, two characters are used between lines: a carriage return (ASCII 13) and a line feed (ASCII 10).
o For UNIX and LINUX, one character is used between lines: a line feed (ASCII 10).
The code in application programs can, however, impose internal organization on stream files.
Delineation of records in files:
Fixed length records: A record which is predetermined to be the same length as the other records in the file.
Record 1 | Record 2 | Record 3 | Record 4 | Record 5
o The file is divided into records of equal size.
o All records within a file have the same size.
o Different files can have different length records.
o Programs which access the file must know the record length.
o The offset, or position, of the nth record of a file can be calculated (see the seek sketch below).
o There is no external overhead for record separation.
o There may be internal fragmentation (unused space within records).
o There will be no external fragmentation (unused space outside of records) except for deleted records.
o Individual records can always be updated in place.
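Because the offset of record n is simply n times the record length, a program can seek straight to it using the seekg machinery from section 8.1. A minimal sketch, assuming an 80-byte record length as in the example below:
#include <fstream.h>

const int RecordLength = 80;    // assumed fixed record length

// read record number n (counting from 0) directly into buf
int ReadRecord(ifstream &file, int n, char buf[])
{
    long offset = (long)n * RecordLength;    // offset of the nth record
    file.seekg(offset);                      // seek to the record
    file.read(buf, RecordLength);            // transfer one full record
    return file.good();
}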
Example (80 byte records):
 0  66 69 72 73 74 20 6C 69 6E 65  0  0  1  0  0  0   first line......
10   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0   ................
20  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0   h.....@.<.B.x...
30  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0   ..@.<.B...B.....
40  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0   <.A..........:|.
50  73 65 63 6F 6E 64 20 6C 69 6E 65  0  1  0  0  0   second line.....
60   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0   ................
70  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0   h.....@.<.B.x...
80  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0   ..@.<.B...B.....
90  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0   <.A..........:|.
Advantage: the offset of each record can be calculated from its record number. This makes direct access possible. There is no space overhead.
Disadvantage: there will probably be internal fragmentation (unusable space within records).
Delimited variable length records:
variable length record: A record which can differ in length from the other records of the file.
delimited record: A variable length record which is terminated by a special character or sequence of characters.
delimiter: A special character or group of characters stored after a field or record, which indicates the end of the preceding unit.
Record 1 # Record 2 # Record 3 # Record 4 # Record 5 #
o The records within a file are followed by a delimiting byte or series of bytes.
o The delimiter cannot occur within the records.
o Records within a file can have different sizes.
o Different files can have different length records.
o Programs which access the file must know the delimiter.
o The offset, or position, of the nth record of a file cannot be calculated.
o There is external overhead for record separation equal to the size of the delimiter per record.
o There should be no internal fragmentation (unused space within records).
o There may be external fragmentation (unused space outside of records) after file updating.
o Individual records cannot always be updated in place.
Example (delimiter = ASCII 30 (1E) = RS character):
 0  66 69 72 73 74 20 6C 69 6E 65 1E 73 65 63 6F 6E   first line.secon
10  64 20 6C 69 6E 65 1E                               d line.
Example (delimiter = '\n'):
 0  46 69 72 73 74 20 28 31 73 74 29 20 4C 69 6E 65   First (1st) Line
10   D  A 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 6C   ..Second (2nd) l
20  69 6E 65  D  A                                     ine..
Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Disadvantage: there is space overhead for the delimiter after each record.
Advantage: there will probably be no internal fragmentation (unusable space within records). A sketch of reading such records appears below.
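As a minimal sketch, assuming '\n' as the record delimiter and the class level access of section 8.1, delimited records can be read one at a time with getline:
#include <fstream.h>

// read the next delimited record into buf; returns nonzero on success
int ReadDelimited(ifstream &file, char buf[], int maxLen)
{
    file.getline(buf, maxLen, '\n');    // read up to, and consume, the delimiter
    return file.good();
}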
Length prefixed variable length records:
110 | Record 1 | 40 | Record 2 | 100 | Record 3 | 80 | Record 4 | 70 | Record 5
o The records within a file are prefixed by a length byte or bytes.
o Records within a file can have different sizes.
o Different files can have different length records.
o Programs which access the file must know the size and format of the length prefix.
o The offset, or position, of the nth record of a file cannot be calculated.
o There is external overhead for record separation equal to the size of the length prefix per record.
o There should be no internal fragmentation (unused space within records).
o There may be external fragmentation (unused space outside of records) after file updating.
o Individual records cannot always be updated in place.
Example:
 0   A  0 46 69 72 73 74 20 4C 69 6E 65  B  0 53 65   ..First Line..Se
10  63 6F 6E 64 20 4C 69 6E 65 1F  0 54 68 69 72 64   cond Line..Third
20  20 4C 69 6E 65 20 77 69 74 68 20 6D 6F 72 65 20    Line with more
30  63 68 61 72 61 63 74 65 72 73                      characters
Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Disadvantage: there is space overhead for the length prefix.
Advantage: there will probably be no internal fragmentation (unusable space within records). A sketch of writing such records appears below.
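A sketch of writing a length-prefixed record, assuming a two-byte binary length prefix as in the example above (and ignoring, for brevity, the byte order portability issue discussed later in this unit):
#include <fstream.h>
#include <string.h>

// write one record preceded by a two-byte binary length prefix
void WriteLengthPrefixed(ofstream &file, const char *record)
{
    unsigned short len = strlen(record);    // record length, 0..65535
    file.write((char *)&len, sizeof(len));  // the hidden length field
    file.write(record, len);                // the record itself
}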
An auxiliary file can be used to point to the beginning of each record. In this case, the data records can be contiguous. If the records are contiguous, the only access is through the index file.
Example:
Index File:
 0  12  0  0  0 25  0  0  0 47  0  0  0   ....%...G...
Data File:
 0  46 69 72 73 74 20 28 31 73 74 29 20 53 74 72 69   First (1st) Stri
10  6E 67 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 53   ngSecond (2nd) S
20  74 72 69 6E 67 54 68 69 72 64 20 28 33 72 64 29   tringThird (3rd)
30  20 53 74 72 69 6E 67 20 77 68 69 63 68 20 69 73    String which is
40  20 6C 6F 6E 67 65 72                               longer
Advantage: the offset of each record is contained in the index, and can be looked up from its record number. This makes direct access possible.
Disadvantage: there is space overhead for the index file.
Disadvantage: there is time overhead for the index file.
Advantage: there will probably be no internal fragmentation (unusable space within records).
The time overhead for accessing the index file can be minimized by reading the entire index file into memory when the files are opened, as in the sketch below.
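A minimal sketch of that approach, assuming four-byte binary offsets in the index file as in the dump above (the array size is a placeholder):
#include <fstream.h>

long Index[1000];     // in-memory copy of the index (assumed maximum)
int IndexCount = 0;

// load the whole index file into memory once, when the files are opened
void LoadIndex(ifstream &indexFile)
{
    long offset;
    while (indexFile.read((char *)&offset, sizeof(offset)))
        Index[IndexCount++] = offset;
}

// direct access: seek the data file to the start of record n
void SeekToRecord(ifstream &dataFile, int n)
{
    dataFile.seekg(Index[n]);
}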
Fixed field count records:
Records can be recognized if they always contain the same (predetermined) number of fields.
Delineation of fields in a record:
Fixed length fields:
Field 1 | Field 2 | Field 3 | Field 4 | Field 5
o Each record is divided into fields of correspondingly equal size: field n has the same length in every record.
o Different fields within a record have different sizes.
o Different records can have different length fields.
o Programs which access the record must know the field lengths.
o There is no external overhead for field separation.
o There may be internal fragmentation (unused space within fields).
Delimited variable length fields:
Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !
The fields within a record are followed by a delimiting byte or series of bytes.
o Fields within a record can have different sizes.
o Different records can have different length fields.
o Programs which access the record must know the delimiter.
o The delimiter cannot occur within the data.
o If used with delimited records, the field delimiter must be different from the record delimiter.
o There is external overhead for field separation equal to the size of the delimiter per field.
o There should be no internal fragmentation (unused space within fields).
Length prefixed variable length fields:
12 | Field 1 | 4 | Field 2 | 10 | Field 3 | 8 | Field 4 | 7 | Field 5
o The fields within a record are prefixed by a length byte or bytes.
o Fields within a record can have different sizes.
o Different records can have different length fields.
o Programs which access the record must know the size and format of the length prefix.
o There is external overhead for field separation equal to the size of the length prefix per field.
o There should be no internal fragmentation (unused space within fields).
Representing record or field length:
Record or field length can be represented in either binary or character form. The length can be considered as another hidden field within the record, and this length field can be either fixed length or delimited. When character form is used, a space can be used to delimit the length field. A two byte fixed length field could be used to hold lengths of 0 to 65535 bytes in binary form, but only lengths of 0 to 99 bytes in decimal character form. A variable length field delimited by a space could be used to hold effectively any length. In some languages, such as strict Pascal, it is difficult to mix binary values and character values in the same file. The C++ language is flexible enough that the use of either binary or character format is easy.
Tagged fields:
Tags, in the form "Keyword=Value", can be used in fields. Use of tags does not in itself allow separation of fields, which must be done with another method. Use of tags adds significant space overhead to the file, but it also adds flexibility to the file structure: fields can be added without affecting the basic structure of the file. Tags can be useful when records have sparse fields - that is, when a significant number of the possible attributes are absent.
Mixing numbers and characters: use of a file dump
A file dump gives us the ability to look inside a file at the actual bytes that are stored.
Octal dump: od -xc filename
e.g. the number 40, stored as ASCII characters and as a short integer (a sketch writing both forms follows).
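A small sketch of the point above, writing the value 40 once as two ASCII characters and once as a two-byte binary short (the file name is a placeholder). Dumping the resulting file with od -xc shows the bytes 34 30 for the character form, and 28 00 or 00 28 for the binary form, depending on the machine's byte order as discussed next:
#include <fstream.h>

void main()
{
    ofstream out("dump.dat");           // hypothetical output file
    short n = 40;
    out.write("40", 2);                 // character form: bytes 34 30 (hex)
    out.write((char *)&n, sizeof(n));   // binary form: 28 00 or 00 28 (hex)
    out.close();
}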
Byte order:
The byte order of integers (and floating point numbers) is not the same on all computers. This is hardware dependent (CPU), not software dependent. Many computers store numbers as might be expected: 40 decimal = 28 hexadecimal is stored in a four byte integer as 00 00 00 28. PCs reverse the byte order and store numbers with the least significant byte first: 40 decimal is stored in a four byte integer as 28 00 00 00. On most computers, the number 40 would be stored in character form as its ASCII values: 34 30. IBM mainframe computers use EBCDIC instead of ASCII, and would store "40" as F4 F0.
8.3 Managing fixed-length, fixed-field buffers
To support fixed length and fixed field buffers, instead of writing the size of each field or each record we can provide methods that control the fixed length of each field. The class FixedLengthBuffer is a subclass of IOBuffer, and together these classes support both fixed length and fixed field buffers. An object of the FixedLengthBuffer class can record the size of each record. The FixedFieldBuffer class is given below:
class FixedFieldBuffer : public FixedLengthBuffer
{
public:
    FixedFieldBuffer(int maxFields, int RecordSize = 3000);
    FixedFieldBuffer(int maxFields, int *fieldSize);
    int AddField(int fieldSize);    // define the next field
    int Pack(const void *field, int size = -1);
    int Unpack(void *field, int maxBytes = -1);
    int NumberOfFields() const;     // return the number of defined fields
protected:
    int *FieldSize;    // array to hold the field sizes
    int MaxFields;     // maximum number of fields
    int NumFields;     // actual number of defined fields
};
The AddField method is used to specify the size of each field, and the total number of fields can be obtained using the NumberOfFields method. A sketch of typical use follows.
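As a sketch of how such a buffer might be used to pack the Person record of section 8.2.1 field by field. The field sizes are taken from the Person class above (one less than each array size, leaving room for the terminating null), but the exact calling convention is an assumption here, not the text's definitive interface:
FixedFieldBuffer buffer(6);     // a Person record has six fields
buffer.AddField(10);            // FirstName
buffer.AddField(10);            // LastName
buffer.AddField(20);            // Address
buffer.AddField(20);            // City
buffer.AddField(2);             // State
buffer.AddField(4);             // ZIP

Person clerk;                   // assume clerk has been filled in
buffer.Pack(clerk.FirstName);   // pack the fields in order
buffer.Pack(clerk.LastName);
buffer.Pack(clerk.Address);
buffer.Pack(clerk.City);
buffer.Pack(clerk.State);
buffer.Pack(clerk.ZIP);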
Assignment Questions: File Structures
1. Explain the fundamental file processing operations:
   i. opening files
   ii. closing files
   iii. reading and writing file contents
   iv. special characters in files
2. Discuss the fundamental file structure concepts:
   i. field and record organization
   ii. managing fixed-length, fixed-field buffers