Data Structures Unit 2
Data Structures Unit 2
1
Department of Computer Science and Engineering
Data Structures
R18 CSE/IT II year
By
D. Subhashini
Associate Professor
Department of Computer Science and Engineering
Aurora’s Technological and Research Institute
Hyderabad
UNIT II
Dictionaries: linear list representation, skip list
representation, operations - insertion, deletion and
searching.
Hash Table Representation: hash functions, collision
resolution-separate chaining, open
addressing-linear probing, quadratic probing, double
hashing, rehashing, extendible hashing.
Dictionaries
A dictionary is a general-purpose data structure for storing a group of objects. A
dictionary has a set of keys and each key has a single associated value. When presented
with a key, the dictionary will return the associated value.
For example, the results of a classroom test could be represented as a dictionary with
pupil's names as keys and their scores as the values.
A dictionary is also called a hash, a map, a hashmap in different programming languages.
They're all the same thing: a key-value store.
Typically, the keys in a dictionary must be simple types (such as integers or strings) while
the values can be of any type. Dictionaries are often implemented as hash tables.
Keys in a dictionary must be unique; an attempt to create a duplicate key will typically
overwrite the existing value for that key.
Note that there is a difference (which may be important) between a key not existing in a
dictionary, and the key existing but with its corresponding value being null.
Implementations of Dictionaries
Dictionaries can be implemented mainly in the following two ways.
Linear List Representation
There are two method of representing dictionaries using linear list.
1. Sorted Array- An array data structure is used to implement the dictionary.
2. Sorted Chain- A linked list data structure is used to implement the dictionary
Skip List Representation
Retrieve a value
Insert or update a value (typically, if the key does not exist in the dictionary, the key-
value pair is inserted; if the key already exists, its corresponding value is overwritten
with the new one)
Remove a key-value pair
Test for existence of a key
Complexity
Average
Case Worst Case
Node Structure
Each node carries a key and a forward array carrying pointers to nodes of a different
level. A level i node, carries i forward pointers indexed through 0 to i.
Key of next node is greater than the key to be inserted then we store the pointer
to current node i at update[i] and move one level down and continue our
search.
Key of next node is greater than the key to be inserted then we store the pointer
to current node i at update[i] and move one level down and continue our
search.
At the lowest level (0), if the element next to the rightmost element has key equal to
the search key, then we have found key otherwise failure.
Consider this example where we want to search for key 17:
Here at level 3, there is no element (arrow in red) after deleting element 6. So we will
decrement level of skip list by 1.
Hash table is a data structure used for storing and retrieving data very quickly.
Insertion of data in the hash table is based on the key value. Hence every entry
in the hash table is associated with some key.
Using the hash key the required piece of data can be searched in the hash table
by few or more key comparisons. The searching time is then dependent upon the
size of the hash table.
The effective representation of dictionary can be done using hash table. We can
place the dictionary entries in the hash table using hash function.
HASH FUNCTION
Hash function is a function which is used to put the data in the hash table. Hence
one can use the same hash function to retrieve the data from the hash table.
For example: Consider that we want place some employee records in the hash table.
The record of employee is placed with the help of key: employee ID. The employee ID
is a 7 digit number for placing the record in the hash table. To place the record 7 digit
number is converted into 3 digits by taking only last three digits of the key.
If the key is 496700 it can be stored at 0th position. The second key 8421002, the
record of those key is placed at 2nd position in the array.
Hence the hash function will be: H(key) = key%1000
Where key%1000 is a hash function and key obtained by hash function is called hash
key.
Bucket and Home bucket: The hash function H(key) is used to map several
dictionary entries in the hash table. Each position of the hash table is called bucket.
The function H(key) is home bucket for the dictionary with pair whose value is key.
There are various types of hash functions that are used to place the record in the
hash table-
Division Method:
The hash function depends upon the remainder of division. Typically the divisor is
table length.
For ex: If the record 54, 72, 89, 37 is placed in the hash table and if the table size
is 10 then h(key) = record % table size
0
1
54%10=4 2 72
72%10=2 3
89%10=9 4 54
37%10=7 5
6
7 37
8
9 89
Mid Square:
In the mid square method, the key is squared and the middle or mid part of the result
is used as the index. If the key is a string, it has to be preprocessed to produce a
number.
Consider that if we want to place a record 3111 then
31112 = 9678321
for the hash table of size 1000 H(3111) = 783 (the middle 3 digits)
The given record’s key is multiplied by some constant value. The formula for computing
the hash key is:
H(key) = floor(50*(107*0.61803398987))
= floor(3306.4818458045)
= 3306
At 3306 location in the hash table the record 107 will be placed.
Digit Folding:
The key is divided into separate parts and using some simple operation these parts are
combined to produce the hash key.
For ex: consider a record 12365412 then it is divided into separate parts as 123 654 12
and these are added together
H(key) = 123+654+12
= 789
The record will be placed at location 789
Digit Analysis:
The digit analysis is used in a situation when all the identifiers are known in advance.
We first transform the identifiers into numbers using some radix, r. Then examine the
digits of each identifier. Some digits having most skewed distributions are deleted. This
deleting of digits is continued until the number of remaining digits is small enough to
give an address in the range of the hash table. Then these digits are used to calculate
the hash address.
COLLISION
The hash function is a function that works on key and returns a value, using which the
record can be placed in the hash table. Thus this function helps us in placing the record
in the hash table at appropriate position and due to this we can retrieve the record
directly from that location. This function need to be designed very carefully and it
should not return the same hash key address for two different records. This is an
undesirable situation in hashing.
Definition:
The situation in which the hash function returns the same hash key (home bucket) for
more than one record is called collision and two same hash keys returned for different
records is called synonym.
Similarly when there is no room for a new pair in the hash table then such a situation
is called overflow. Sometimes when we handle collision, it may lead to overflow
conditions. Collision and overflow show the poor hash functions.
For example, 0
1 131
Consider a hash function. 2
3 43
H(key) = recordkey%10 having the hash table size of 10 4 44
5
The record keys to be placed are 6 36
7 57
131, 44, 43, 78, 19, 36, 57 and 77 8 78
131%10=1 9 19
44%10=4
43%10=3
78%10=8
19%10=9
36%10=6
57%10=7
77%10=7
Now if we try to place 77 in the hash table then we get the hash key to be 7 and at
index 7 already the record key 57 is placed. This situation is called collision. From the
index 7 as we find that there is no room to place 77 in the hash table. This situation is
called overflow.
CHAINING
For ex: Consider the keys to be placed in their home buckets are 131, 3, 4, 21, 61, 7,
97, 8, 9
Here D = 10
0
1 131 21 61 NULL
3 NULL
NULL
4
7 97 NULL
A chain is maintained for colliding elements. for instance 131 has a home bucket (key)
1. similarly key 21 and 61 demand for home bucket 1. Hence a chain is maintained at
index 1.
This is the easiest method of handling collision. When collision occurs i.e. when two
records demand for the same home bucket in the hash table then collision can be
solved by placing the second record linearly down whenever the empty bucket is found.
When use linear probing (open addressing), the hash table is represented as a one-
dimensional array with indices that range from 0 to the desired table size-1.
Before inserting any elements into this table, we must initialize the table to represent
the situation where all slots are empty. This allows us to detect overflows and collisions
when we inset elements into the table. Then using some suitable hash function the
element can be inserted into the hash table.
For example:
Consider that following keys are to be inserted in the hash table 131, 4, 8, 7, 21, 5,
31, 61, 9, 29
Index 1 will be the home bucket for 131. Continuing in this fashion we will place 4, 8,
7. Now the next key to be inserted is 21. According to the hash function
H(key)=21%10
H(key) = 1
But the index 1 location is already occupied by 131 i.e. collision occurs. To resolve this
collision we will linearly move down and at the next empty location we will prob the
element. Therefore 21 will be placed at the index 2. If the next element is 5 then we
get the home bucket for 5 as index 5 and this bucket is empty so we will put the
element 5 at index 5.
2 NULL 21 21
3 NULL NULL 31
4 4 4 4
5 NULL 5 5
6 NULL NULL 61
7 7 7 7
8 8 8 8
The next record key is 9. According to decision hash function it demands for the home
bucket 9. Hence we will place 9 at index 9. Now the next final record key 29 and it
hashes a key 9. But home bucket 9 is already occupied. And there is no next empty
bucket as the table size is limited to index 9. The overflow occurs. To handle it we
move back to bucket 0 and is the location over there is empty 29 will be placed at 0th
index.
39
19%10 = 9 cluster is formed
18%10 = 8 29
39%10 = 9 8
29%10 = 9
8%10 = 8
QUADRATIC PROBING: 19
Quadratic probing operates by taking the original hash value and adding successive
values of an arbitrary quadratic polynomial to the starting value. This method uses
following formula.
90 % 10 = 0 3
55 % 10 = 5 4
22 % 10 = 2 5 55
11 % 10 = 1 6
7 37
Now if we want to place 17 a collision will occur as 8
17%10 = 7 and bucket 7 has already an element 37. 9
Hence we will apply quadratic probing to insert this record in the hash table.
Hi (key) = (Hash(key) + i2) % m
Consider i = 0 then (17 + 02) % 10 = 7
D. Subhashini, Assoc. prof
Aurora’s Technological and Research Institute
15
Department of Computer Science and Engineering
H1(37) = 37 % 10 = 7
H1(90) = 90 % 10 = 0 37
H1(45) = 45 % 10 = 5
H1(22) = 22 % 10 = 2
H1(49) = 49 % 10 = 9 49
That means we have to insert the element 17 at 4 places from 37. e to take 4
In short we ha v jumps. Therefore the 17 will be placed at index 1. 37
H1(55) = 55 % 10 =5 Collision
Key
H2(55) = 7-(55 % 7) 90
=7–6=1
17
That means we have to take one jump from index 5 to place 55. 22
45
55
37
49
The double hashing requires another hash function whose probing efficiency
is same as some another hash function required when handling random collision.
The double hashing is more complex to implement than quadratic probing.
The quadratic probing is fast technique than double hashing.
REHASHING
Rehashing is a technique in which the table is resized, i.e., the size of table is doubled
by creating a new table. It is preferable if the total size of table is a prime number.
There are situations in which the rehashing is required.
In such situations, we have to transfer entries from old table to the new table by re
computing their positions using hash functions.
Consider we have to insert the elements 37, 90, 55, 22, 17, 49, and 87. the table size
is 10 and will use hash function.,
37 % 10 = 7
90 % 10= 0
55 % 10 = 5
22 % 10 = 2
17 % 10 = 7 Collision solved by linear probing 49 % 10 = 9
Now this table is almost full and if we try to insert more elements collisions will occur
and eventually further insertions will fail. Hence we will rehash by doubling the table
size. The old table size is 10 then we should double this size for new table, that
becomes 20. But 20 is not a prime number, we will prefer to make the table size as
23. And new hash function will be
37 % 23 = 14
90 % 23 = 21
55 % 23 = 9
22 % 23 = 22
17 % 23 = 17
49 % 23 = 3
87 % 23 = 18
Advantages:
1. This technique provides the programmer a flexibility to enlarge the table size if
required.
2. Only the space gets doubled with simple hash function which avoids occurrence
of collisions.
EXTENSIBLE HASHING
Extensible hashing is a technique which handles a large amount of data. The data
to be placed in the hash table is by extracting certain number of bits.
Extensible hashing grow and shrink similar to B-trees.
In extensible hashing referring the size of directory the elements are to be placed
in buckets. The levels are indicated in parenthesis.
0 1
Levels
(0) (1)
001 111
data to be placed in bucket
010
The bucket can hold the data of its global depth. If data in bucket is more than
global depth then, split the bucket and double the directory.
Consider we have to insert 1, 4, 5, 7, 8, 10. Assume each page can hold 2 data
entries (2 is the depth).
Step 1: Insert 2, 4
1 = 001
0
4 = 100
(0)
001
100
1 = 001
0 1
4 = 100
(0) (1)
5 = 101
100 001
101 Based on last bit the data is inserted.
Step 2: Insert 7
7 = 111
But as depth is full we can not insert 7 here. Then double the directory and split the bucket. After
insertion of 7. Now consider last two bits.
00 01 10 11
(1) (2) (2)
00 01 10 11
(2)
(1)
001 111
100
101
1000
Step 4: Insert 1 0
Deletion Operation:
00 01 10 11
Delete 7.
00 01 10 11
(1) (1)
100 001 Note that the level was increased when we insert 7.
1000 101 Now on deletion
of 7, the level should get decremented.
00 00 10 11
(1) (1)
100 001
101
Applications of hashing: