Hashing in Data Structures
Hashing in Data Structures
Hashing in Data Structures
What is Hashing?
• Hashing is the process of mapping large amount of data item to smaller table with the help
of hashing function. It uses hash tables to store the data in an array format. Each value in the
array has been assigned a unique index number. Hash tables use a technique to generate these
unique index numbers for each value stored in an array format. This technique is called the
hash technique. You only need to find the index of the desired item, rather than finding the
data. With indexing, you can quickly scan the entire list and retrieve the item you wish.
Indexing also helps in inserting operations when you need to insert data at a specific location.
No matter how big or small the table is, you can update and retrieve data within seconds.
• Hashing is also known as Hashing Algorithm or Message Digest Function.
• It is a technique to convert a range of key values into a range of indexes of an array.
Suppose we want to design a system for storing users and their passwords. And we want to access to
be performed efficiently:
1. Insert a user name and corresponding password.
2. Search a user and change their password.
3. Delete a user and corresponding password.
Think about it. We can use the following data structures to maintain it: array with pair element,
time using hash table. O(1) means that, no matter how much data, it will execute in constant time.
USER1 368 4
USER2 369 5
USER3 370 6
USER4 371 0
USER5 372 1
INDEX 0 1 2 3 4 5 6
OUTPUT:
• Hashing is used to facilitate the next level searching method when compared with the linear
or binary search.
• Hashing allows to update and retrieve any data entry in a constant time O(1).
• Constant time O(1) means the operation does not depend on the size of the data.
• Hashing is used with a database to enable items to be retrieved more quickly.
• It is used in the encryption and decryption of digital signatures.
1. The hash function converts the item into a small integer or hash value. This integer is used
as an index to store the original data.
2. It stores the data in a hash table. You can use a hash key to locate data quickly.
A fixed process converts a key to a hash key is known as a Hash Function. The hash function in a
data structure maps the arbitrary size of data to fixed-sized data. It returns the following values: a
small integer value (also known as hash value), hash codes, and hash sums. The hashing techniques
in the data structure are very interesting, such as:
hash = hashfunc(key)
index = hash % array_size
The three characteristics of the hash function in the data structure are:
1. Collision free
2. Property to be hidden
3. Puzzle friendly
• The above figure shows the hash table with the size of n = 10. Each position of the hash
table is called as Slot. In the above hash table, there are n slots in the table, names = {0, 1,
2, 3, 4, 5, 6, 7, 8, 9}. Slot 0, slot 1, slot 2 and so on. Hash table contains no items, so every
slot is empty.
• Division method or reminder method takes an item and divides it by the table size and
returns the remainder as its hash value.
• After computing the hash values, we can insert each item into the hash table at the
designated position as shown in the above figure. In the hash table, 6 of the 10 slots are
occupied, it is referred to as the load factor and denoted by, λ = No. of items / table size.
For example , λ = 6/10.
• It is easy to search for an item using hash function where it computes the slot name for the
item and then checks the hash table to see if it is present.
• Constant amount of time O(1) is required to compute the hash value and index of the hash
table at that location.
• Hash tables must support 3 fundamental operations:
Collision in Hashing-
In hashing,
• Hash function is used to compute the hash value for a key.
• Hash value is then used as an index to store the key in the hash table.
• Hash function may return the same hash value for two or more keys.
When the hash value of a key maps to an already occupied bucket of the hash table,
it is called as a Collision.
Two keys could possibly provide the same value since a hash function returns a little number for a
key that is a large integer or string. A collision handling mechanism must be used to deal with the
circumstance when a newly added key maps to a hash table slot that is already occupied.
Even if we have a large table to put the keys on, collisions are still highly frequent. Birthday Paradox
is a crucial finding. With only 23 people, there is a 50% chance that two people will share the same
birthday.
Collision Resolution Techniques are the techniques used for resolving or handling the collision.
Time Complexity-
For Searching-
• In worst case, all the keys might map to the same bucket of the hash table.
• In such a case, all the keys will be present in a single linked list.
• Sequential search will have to be performed on the linked list to perform the search.
• So, time taken for searching in worst case is O(n).
For Deletion-
• In worst case, the key might have to be searched first and then deleted.
• In worst case, time taken for searching is O(n).
• So, time taken for deletion in worst case is O(n).
If Load factor (α) = constant, then time complexity of Insert, Search, Delete = Θ(1)
PRACTICE PROBLEM BASED ON SEPARATE CHAINING
Problem-
Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101
Use separate chaining technique for collision resolution.
Solution-
The given sequence of keys will be inserted in the hash table as-
Step-01:
• Draw an empty hash table.
• For the given hash function, the possible range of hash values is [0, 6].
• So, draw an empty hash table consisting of 7 buckets as-
Step-02:
Step-04:
• The next key to be inserted in the hash table = 76.
• Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
• So, key 76 will be inserted in bucket-6 of the hash table as-
Step-06:
• The next key to be inserted in the hash table = 92.
• Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
• Since bucket-1 is already occupied, so collision occurs.
• Separate chaining handles the collision by creating a linked list to bucket-1.
• So, key 92 will be inserted in bucket-1 of the hash table as-
Step-07:
• The next key to be inserted in the hash table = 73.
• Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
• So, key 73 will be inserted in bucket-3 of the hash table as-
NOTE-
• During insertion, the buckets marked as “deleted” are treated like any other empty bucket.
• During searching, the search is not terminated on encountering the bucket marked as
“deleted”.
• The search terminates only after the required key or an empty bucket is found
1. Linear Probing:
In linear probing, the hash table is searched sequentially that starts from the original location of
the hash. If in case the location that we get is already occupied, then we check for the next
location.
The function used for rehashing is as follows: rehash(key) = (n+1)%table-size.
For example, The typical gap between two probes is 1 as seen in the example below:
Let hash(x) be the slot index computed using a hash function and S be the table size
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
…………………………………………..
…………………………………………..
Let us consider a simple hash function as “key mod 7” and a sequence of keys as 50, 700, 76, 85, 92, 73, 101.
Hash table
• Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It will
map to slot number 0 because 50%5=0. So insert it into slot number 0.
• Step 3: The next key is 70. It will map to slot number 0 because 70%5=0 but 50 is
already at slot number 0 so, search for the next empty slot and insert it.
• Step 4: The next key is 76. It will map to slot number 1 because 76%5=1 but 70 is
already at slot number 1 so, search for the next empty slot and insert it.
• Step 5: The next key is 93 It will map to slot number 3 because 93%5=3, So insert it
into slot number 3.
• Take the above example, if we insert next item 40 in our collection, it would have a hash
value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, it becomes a problem. This
problem is called as Collision or Clash. Collision creates a problem for hashing technique.
• Linear probing is used for resolving the collisions in hash table, data structures for
maintaining a collection of key-value pairs.
• Linear probing was invented by Gene Amdahl, Elaine M. McGraw and Arthur Samuel in
1954 and analyzed by Donald Knuth in 1963.
• It is a component of open addressing scheme for using a hash table to solve the dictionary
problem.
• The simplest method is called Linear Probing. Formula to compute linear probing is:
P = (1 + P) % (MOD) Table_size
For example,
If we insert next item 40 in our collection, it would have a hash value of 0 (40 % 10 = 0). But 70 also
had a hash value of 0, it becomes a problem.
P = H(40)
40 % 10 = 0
Position 0 is occupied by 70. so we look elsewhere for a position to store 40.
2. Quadratic Probing
If you observe carefully, then you will understand that the interval between probes will increase
proportionally to the hash value. Quadratic probing is a method with the help of which we can
solve the problem of clustering that was discussed above. This method is also known as the mid-
square method. In this method, we look for the i2‘th slot in the ith iteration. We always start from
the original hash location. If only the location is occupied then we check the other slots.
let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
…………………………………………..
…………………………………………..
Example: Let us consider table Size = 7, hash function as Hash(x) = x % 7 and collision
resolution strategy to be f(i) = i2 . Insert = 22, 30, and 50.
• Step 1: Create a table of size 7.
Hash table
• Step 3: Inserting 50
• Hash(50) = 50 % 7 = 1
• In our hash table slot 1 is already occupied. So, we will search for slot 1+12,
i.e. 1+1 = 2,
• Again slot 2 is found occupied, so we will search for cell 1+22, i.e.1+4 = 5,
• Now, cell 5 is not occupied so we will place 50 in slot 5.
• Step 2: Insert 43
• 43 % 7 = 1, location 1 is empty so insert 43 into 1 slot.
• Step 4: Insert 72
• 72 % 7 = 2, but location 2 is already being occupied and this is a collision.
• So we need to resolve this collision using double hashing.
hnew = [h1(72) + i * (h2(72)] % 7
= [2 + 1 * (1 + 72 % 5)] % 7
=5%7
= 5,
Now, as 5 is an empty slot,
so we can insert 72 into 5th slot.
• Linear probing has the best cache performance but suffers from clustering. One more
advantage of Linear probing is easy to compute.
• Quadratic probing lies between the two in terms of cache performance and clustering.
• Double hashing has poor cache performance but no clustering. Double hashing
requires more computation time as two hash functions need to be computed.
Note: Cache performance of chaining is not good because when we traverse a Linked List, we are
basically jumping from one node to another, all across the computer’s memory. For this reason, the
CPU cannot cache the nodes which aren’t visited yet, this doesn’t help us. But with Open
Addressing, data isn’t spread, so if the CPU detects that a segment of memory is constantly being
accessed, it gets cached for quick access.
OUTPUT:
Rehashing is a technique in which the table is resized, i.e., the size of table is doubled by creating a
new table. It is preferable is the total size of table is a prime number. There are situations in which
the rehashing is required.
• Hash is used for cache mapping for fast access to the data.
• Hash can be used for password verification.
• Hash is used in cryptography as a message digest.
Conclusion
Hashing is to resolve the challenge of finding an item quickly in a collection. For example, if we
have a list of millions of English words and we wish to find a particular term then we would use
hashing to locate and find it more efficiently. It would be inefficient to check each item on the
millions of lists until we find a match. Hashing reduces search time by restricting the search to a
smaller set of words at the beginning.