Dsa 4

Hashing
Lecturer: Dr. Keyhanipour

Outline
• Dictionary ADT
• Hashing:
• The Concept
• Collision Resolution
• Chooing A Hash Function
• Advanced Collision Resolution
2
Dictionary
• Dictionary ADT – a dynamic set with methods:
• Search(S, k) – an access operation that returns a pointer 𝑥 to an
element where 𝑥. 𝑘𝑒𝑦 = 𝑘
• Insert(S, x) – a manipulation operation that adds the element
pointed to by 𝑥 to 𝑆
• Delete(S, x) – a manipulation operation that removes the element
pointed to by 𝑥 from 𝑆
• Each element has a key part
3
Dictionaries
• Dictionaries store elements so that they can be located quickly
using keys
• Example:
• A dictionary may hold bank accounts
• Each account is an object that is identified by an account number
• Each account stores a wealth of additional information
• An application wishing to operate on an account would have to
provide the account number as a search key.
4
Dictionaries (2)
• Supporting order (methods such as min, max, successor and
predecessor) is not required, thus it is enough that keys are
comparable for equality
5
Dictionaries (3)
• Different data structures to realize dictionaries:
• Arrays, linked lists (inefficient)
• Hash tables
• Binary trees
• Red/Black trees
• AVL trees
• B-trees
6
An Example
• Suppose that a large phone company wants to provide the
caller ID capability:
• given a phone number, return the caller’s name
• phone numbers range from 0 to r = 108 − 1
• want to do this as efficiently as possible
7
The Problem
• A few suboptimal ways to design this dictionary
• Direct addressing: an array indexed by key
• takes 𝑂(1) time,
• needs 𝑂(𝑟) space - huge amount of wasted space
(null) (null) Jens Jensen (null) (null)

0000-0000 0000-0000 9635-8904 0000-0000 0000-0000
• A linked list: takes 𝑂(𝑟) time, 𝑂(𝑟) space
Jens Jensen Ole Olsen

9635-8904 9635-9999
8
Another Solution
• We can do better, with a Hash table
• Like an array, but come up with a function to map the large
range into one which we can manage
• e.g., take the original key, modulo the (relatively small) size of the
array, and use that as an index
• Insert (9635-8904, Jens Jensen) into a hashed array with, say,
five slots
• 96358904 𝑚𝑜𝑑 5 = 4
(null) (null) (null) (null) Jens Jensen

0 1 2 3 4
9
Another Solution (2)
• A lookup uses the same process: hash the query key, then
check the array at that slot
(null) (null) (null) (null) Jens Jensen

0 1 2 3 4
10
Direct-address Tables
• Assumptions:
• The universe 𝑈 of keys is reasonably small:
𝑈 = {0, 1, 2, … , 𝑚 − 1}, for some small 𝑚
• No two elements have the same key
• Implementation:
• Allocate an array of size 𝑚
• Insert the kth element into the kth slot in the array
11
• How to implement a dynamic set by a direct-address table T :
• Each key in the universe 𝑈 = {0, 1, … , 9} corresponds to an index in the table.
• The set 𝐾 = {2, 3, 5, 8} of
actual keys determines the
slots in the table that
contain pointers to
elements. The other slots,
heavily shaded, contain
NIL.
• If the set contains no
element with key k, then:
T[k] = NIL.
12
• An Array is an example of a direct-address table:
• Given a key 𝑘, the corresponding is stored in 𝑇[𝑘]
• Assume no two elements have the same key
• Operations of direct-address tables: all (1) time

• DIRECT-ADDRESS-SEARCH(𝑇, 𝑘): return 𝑇[𝑘]
• DIRECT-ADDRESS-INSERT(𝑇, 𝑥): 𝑇 𝑥. 𝑘𝑒𝑦 = 𝑥
• DIRECT-ADDRESS-DELETE(𝑇, 𝑥): 𝑇 𝑥. 𝑘𝑒𝑦 = 𝑁𝐼𝐿
13
• Advantage:
• O(1) time for all operations
• Disadvantages:
• Wasteful if the number of elements actually inserted is significantly
smaller than the size of the universe (m)
• Only applicable for small values of m , i.e. a limited range of keys
14
Hash Tables
• Performance is almost as the direct-address table, but without
the limitations
• The universe 𝑈 may be very large
• The storage requirement is 𝑂(|𝐾|), where 𝐾 is the set of keys actually
used
• Disadvantage – 𝑂(1) performance is now average case, not the
worst case
15
Hash Tables
• We need a hash function to map keys from the universe 𝑈 into
the hash table
ℎ: 𝑈 → {0, 1, … , 𝑚 − 1}
• For each key 𝑘, the hash function computes a hash value ℎ(𝑘)
• If two keys hash to the same value:
ℎ(𝑘1) = ℎ(𝑘2), we call this a collision
16
Hash Tables
• Using a hash function ℎ
to map keys to hash-table
slots.
• Because keys 𝑘2 and 𝑘5
map to the same slot,
they collide.
17
Collisions
• Can we avoid collisions altogether?
• No. Since |𝑈| > 𝑚, some keys must have the same hash value (based
on the Pigeonhole principle)
• A good hash function will be as ‘random’ as possible
18
Collision Resolution
• Chaining (also called open hash)
• Elements stored in their ‘correct’ slot
• Collisions resolved by creating linked lists
• Open addressing (also called

closed hash)
• All elements stored inside
the table
• Maybe rehashed if their slot
is full
19
Collision Resolution: Chaining (Open Hash)
20
Chaining (Open Hash)
• Collision resolution by chaining. Each hash-table slot 𝑇[𝑗] contains a
linked list of all the keys whose hash value is 𝑗 .
• For example, ℎ 𝑘1 = ℎ(𝑘4 ), ℎ 𝑘5 = ℎ(𝑘2 ) = ℎ(𝑘7 ) and ℎ 𝑘8 = ℎ(𝑘6 ).
• The linked list can be either singly or doubly linked; we show it as
doubly-linked list because deletion is faster that way.
21
Collision resolution – Chaining
• All keys that have the same hash value are placed in a linked list
• Insertion:
• CHAINED-HASH-INSERT(T, x) insert x at the head of list T[h(key[x])]
• Can be done at the beginning of the list in 𝑂(1) time
• Searching:
• CHAINED-HASH-SEARCH(T, k) search for an element with key k in list T[h(k)].
• Running time is proportional to the length of the list of elements in slot h(k).
• Deletion:
• CHAINED-HASH-DELETE(T, x) delete x from the list T[h(key[x])]
• Given pointer x to the element to delete, so no search is needed to find this element
(𝑂(1) time).
22
Analysis of Hashing
• The hash function h maps the universe U of keys into the slots
of hash table T[0...m-1]
ℎ: 𝑈 → {0, 1, … , 𝑚 − 1}
• Assumptions:
• Each key is equally likely to be hashed into any slot (bucket); Simple
Uniform Hashing
• Assume time to compute h(k) is (1)
• Given hash table T with m slots holding n elements, the load
factor is defined as: a=n/m
23
Analysis of Hashing
• Given a key, how long does it take to find an element with that key, or
to determine that there is no element with that key?
• Analysis is in terms of the Load factor: 𝛼 = 𝑛/𝑚.
• n = # elements in the table.
• m = # slots in the table = # (possibly empty) linked lists.
• Load factor is average number of elements per linked list.
• Can have α < 1, α > 1 or α = 1.
• Worst case is when all n keys hash to the same slot ⇒ get a single list of length
n ⇒ worst-case time to search is θ(n), plus time to compute hash function.
• Average case depends on how well the hash function distributes the keys
among the slots.
24
Analysis of Hashing
• To find an element:
• using ℎ , look up its position in table 𝑇
• search for the element in the linked list of the hashed slot
• Unsuccessful search:
• element is not in the linked list
• uniform hashing yields an average list length 𝛼 = 𝑛/𝑚
• expected number of elements to be examined 𝛼
• search time 𝑂(1 + 𝛼) (this includes computing the hash value)
25
Analysis of Hashing
• Successful search:
• assume that a new element is inserted at the end of the linked list
• The expected time for a successful search is also 𝜃(1 + 𝛼).
26
Analysis of Hashing: Summary
• Searching takes:
• constant time on average
• 𝑂(𝑛) worst-case
• Insertion takes 𝑂(1) worst-case time
• Deletion takes 𝑂(1) worst-case time when the lists are doubly-
linked
27
Hash Function Design
28
Hash Function Properties
• Properties of a good hash function:
• Easy to evaluate – ℎ(𝑥) can be computed very quickly
• Uniform distribution over all the table slots
• Different keys are mapped to different slots (as much as possible)
• Good hash functions are very rare
• Often use heuristics, based on the domain of the keys, to create
a hash function that performs well.
29
Hash Functions
• Hash functions assume that the keys are natural numbers.
• How to deal with hashing non-integer keys:
• Find some way of turning the keys into integers
• In our example, remove the hyphen in 9635-8904 to get 96358904!
• For a character string, add up the ASCII values of the characters
• Example: Interpret a character string as an integer expressed in some radix
notation. Suppose the string is “CLRS”. ASCII values: C = 67, L = 76, R = 82,
S = 83. So interpret CLRS as: 67 × 103 + 76 × 102 + 82 × 101 + 83 ×
100 = 75503.
• Then use a standard hash function on the integers.
30
Hash Functions: Division Method
• Use the remainder:
• ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 𝑚
• 𝑘 is the key, 𝑚 the size of the table
• Need to choose 𝑚
• 𝑚 = 𝑏 𝑒 (bad)
• if 𝑚 is a power of 2, ℎ(𝑘) gives the 𝑒 least significant bits of 𝑘
• all keys with the same ending go to the same place
• 𝑚 prime (good)
• helps ensure uniform distribution
• primes not too close to exact powers of 2
31
Hash Functions: Division Method
• Example 1:
• hash table for 𝑛 = 2000 character strings
• we don’t mind examining 3 elements
• 𝑚 = 701, a prime near 2000/3, but not near any power of 2
• → ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 701
• Further examples:
• 𝑚 = 13
• ℎ(3)? • 3 mod 13 = 3
• ℎ(12)? • 12 mod 13 = 12
• ℎ(13)? • 13 mod 13 = 0
• ℎ(47)? • 47 mod 13 = 8
32
Hash Functions: Multiplication Method
• Use:
• ℎ 𝑘 = 𝑚(𝑘 𝐴 𝑚𝑜𝑑 1)
• 𝑘 is the key, 𝑚 the size of the table, and 𝐴 is a constant: 0 < 𝐴 < 1
• The steps involved:
• map 0 … 𝑘𝑚𝑎𝑥 into 0 … 𝑘𝑚𝑎𝑥 A
• take the fractional part (𝑚𝑜𝑑 1)
• map it into 0 … 𝑚 − 1
• Disadvantage: Slower than division method.
• Advantage: Value of m is not critical.
33
• Choice of 𝑚 and 𝐴:
• value of 𝑚 is not critical, typically use 𝑚 = 2𝑝
• optimal choice of 𝐴 depends on the characteristics of the data
5−1
• Knuth says use: 𝐴 = ≈ 0.6180339887 … (conjugate of the golden
2
ratio) – Fibonacci hashing
34
• Example:
• 𝑚 = 8 (𝑖𝑚𝑝𝑙𝑖𝑒𝑠 𝑝 = 3), 𝐴 = 0.62. ℎ 𝑘 = 𝑚(𝑘 𝐴 𝑚𝑜𝑑 1)
• If 𝑘 = 72, then:
• To compute ℎ(𝑘):
• 𝑘𝐴 = 44.64,
• 𝑘𝐴 𝑚𝑜𝑑 1 = 0.64,
• 𝑚(𝑘𝐴 𝑚𝑜𝑑 1) = 5.12,
• ℎ 𝑘 = 𝑚(𝑘 𝐴 𝑚𝑜𝑑 1) = 5
35
Universal Hashing
• For any choice of hash function, there exists a bad set of
identifiers
• A malicious adversary could choose keys to be hashed such
that all go into the same slot (bucket)
• Average retrieval time is Θ(𝑛)
• Solution:
• a random hash function
• choose hash function independently of keys!
• create a set of hash functions 𝐻 , from which ℎ can be randomly
selected
36
Universal Hashing (2)
• A collection 𝐻 of hash functions is universal if for any randomly

chosen 𝑓 from 𝐻 (and two keys 𝑘 and 𝑙):
Pr{𝑓(𝑘) = 𝑓(𝑙)} ≤ 1/𝑚
• Using chaining and universal hashing, the expected time for

each SEARCH operation is 𝑂(1).
37
Collision Resolution: Open Addressing
38
More on Collisions
• A key is mapped to an already occupied table location
• what to do?
• Use a collision handling technique
• We’ve seen Chaining
• Can also use Open Addressing
• Probing
• Double Hashing
39
Open Addressing
• All elements are stored in the hash table (can fill up!), i.e., 𝑛 ≤ 𝑚
• Each table entry contains either an element or null
• When searching for an element, systematically probe table slots
• Modify hash function to take the probe number 𝑖 as the second
parameter:
ℎ: 𝑈 × {0,1, … , 𝑚 − 1} → {0,1, … , 𝑚 − 1}
• Hash function, ℎ, determines the sequence of slots examined for a
given key
40
Open Addressing
• Probe sequence for a given key k given by:
ℎ(𝑘, 0), ℎ(𝑘, 1), . . . , ℎ(𝑘, 𝑚 − 1) − a permutation of 0,1, . . . , 𝑚 − 1
• Each element occupies a single slot in the hash table – no

chaining is done
• To insert an element, we probe the table according to the hash
function until an empty slot is found.
• Thus, in open addressing, the hash table can “fill up” so that no
further insertions can be made; one consequence is that the
load factor 𝛼 can never exceed 1.
41
Open Addressing
• Insertion: HASH_INSERT (T, k)
• The HASH-INSERT procedure takes 1 i0
as input a hash table 𝑇 and a key 𝑘. 2 repeat j  h(k, i)
• It either returns the slot number 3 if (T[j]== NIL)
where it stores key 𝑘 or flags an error 4 then T[j]  k
because the hash table is already full.
5 return j
6 else i  i + 1
7 until (i == m)
8 error “hash table overflow”
42
Open Addressing
• Search: HASH_SEARCH (T, k)
• The procedure HASH-SEARCH 1 i0
takes as input a hash table 𝑇 and
2 repeat j  h(k, i)
a key 𝑘, returning 𝑗 if it finds that
slot 𝑗 contains key 𝑘, or NIL if key 3 if (T[j]== k)
𝑘 is not present in table 𝑇 . 4 return j
5 else i  i + 1
6 until ((T[j]== NIL) or (i == m))
7 return NIL
43
Open Addressing
• Open addressing techniques:
• Linear probing
• Quadratic probing
• Double hashing
44
Linear Probing
• A hash value is computed using any hash function ℎ′ , and then
the number of the current attempt is added to it:
ℎ 𝑘, 𝑖 = ℎ′ 𝑘 + 𝑖 𝑚𝑜𝑑 𝑚
• Slots are examined sequentially, until an empty one is found
45
Linear Probing: An Example
• Example:
• You are given a hash table 𝐻 with 11 slots (𝑚)
• Suppose ℎ′ 𝑘 = 𝑘; then use linear probing and a hash function:
ℎ(𝑘, 𝑖) = (𝑘 + 𝑖) 𝑚𝑜𝑑 𝑚
To hash the following elements: 10, 22, 31, 4, 15, 28, 17, 88, 59.
46
Linear Probing
Solution:
• ℎ(10, 0) = (10 + 0) 𝑚𝑜𝑑 11 = 10
• ℎ(22, 0) = (22 + 0) 𝑚𝑜𝑑 11 = 0
• ℎ(31, 0) = (31 + 0) 𝑚𝑜𝑑 11 = 9
• ℎ(4, 0) = (4 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(15, 0) = (15 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(15, 1) = (15 + 1) 𝑚𝑜𝑑 11 = 5
• ℎ(28, 0) = (28 + 0) 𝑚𝑜𝑑 11 = 6
• ℎ(17, 0) = (17 + 0) 𝑚𝑜𝑑 11 = 6
0 1 2 3 4 5 6 7 8 9 10
22 4 15 28 31 10
47
Linear Probing
Solution (cont.):
• ℎ(17, 1) = (17 + 1) 𝑚𝑜𝑑 11 = 7
• ℎ(88, 0) = (88 + 0) 𝑚𝑜𝑑 11 = 0
• ℎ(88, 1) = (88 + 1) 𝑚𝑜𝑑 11 = 1
• ℎ(59, 0) = (59 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(59, 1) = (59 + 1) 𝑚𝑜𝑑 11 = 5
• ℎ(59, 2) = (59 + 2) 𝑚𝑜𝑑 11 = 6
• ℎ(59, 3) = (59 + 3) 𝑚𝑜𝑑 11 = 7
• ℎ(59, 4) = (59 + 4) 𝑚𝑜𝑑 11 = 8
0 1 2 3 4 5 6 7 8 9 10
22 88 4 15 28 17 59 31 10
48
Linear Probing
• Linear probing is easy to implement, but it suffers from a
problem known as Primary clustering: filling consecutive slots.
• Long runs of occupied slots build up, increasing the average
search time.
49
Quadratic Probing
• In this case, the second attempt is a more complex function
of 𝑖:
ℎ 𝑘, 𝑖 = ℎ′ 𝑘 + 𝑐1 𝑖 + 𝑐2 𝑖 2 𝑚𝑜𝑑 𝑚
• Tries to avoid primary clustering

• However, suffers from secondary clustering, which is to
make full use of the hash table, the values of 𝑐1 , 𝑐2 , and 𝑚
are constrained.
50
Quadratic Probing: An Example
• Example:
• You are given a hash table 𝐻 with 11 slots
• Suppose ℎ′ 𝑘 = 𝑘; then use linear probing and a hash function
ℎ(𝑘, 𝑖) = (𝑘 + 𝑖 + 3𝑖 2 ) 𝑚𝑜𝑑 11
To hash the following elements: 10, 22, 31, 4, 15, 28
51
Quadratic Probing
Solution: ℎ(𝑘, 𝑖) = (𝑘 + 𝑖 + 3𝑖 2 ) 𝑚𝑜𝑑 11
• ℎ(10, 0) = (10 𝑚𝑜𝑑 11) = 10
• ℎ(22, 0) = (22 𝑚𝑜𝑑 11) = 0
• ℎ(31, 0) = (31 𝑚𝑜𝑑 11) = 9
• ℎ 4, 0 = 4 𝑚𝑜𝑑 11 = 4
• ℎ(15, 0) = (15 𝑚𝑜𝑑 11) 𝑚𝑜𝑑 11 = 4
• ℎ(15, 1) = ((15 + 1 + 3) 𝑚𝑜𝑑 11) = 8
• ℎ(28, 0) = (28 𝑚𝑜𝑑 11) = 6
0 1 2 3 4 5 6 7 8 9 10
22 4 28 15 31 10
52
Double Hashing
• Given two hash functions:
ℎ 𝑘, 𝑖 = ℎ1 𝑘 + 𝑖 × ℎ2 𝑘 𝑚𝑜𝑑 𝑚
• One of the best methods for open addressing collision

resolution
• Permutations are almost random
• For the entire hash to be searched, 𝑚 and ℎ2 𝑘 must be
relatively prime
53
Double Hashing
• Possible selections of ℎ2 𝑘 :
1. Select 𝑚 to be a power of 2, and design ℎ2 𝑘 to produce odd
numbers
2. Select 𝑚 to be prime, and 𝑚′ to be 𝑚 − 1
• ℎ1 𝑘 = 𝑘 𝑚𝑜𝑑 𝑚
• ℎ2 𝑘 = 1 + (𝑘 𝑚𝑜𝑑 𝑚′ )
54
Double Hashing: Example
ℎ 𝑘, 𝑖 = ℎ1 𝑘 + 𝑖 × ℎ2 𝑘 𝑚𝑜𝑑 𝑚
• Suppose we have a hash table of size 13 with:
ℎ1 𝑘 = 𝑘 𝑚𝑜𝑑 13 and ℎ2 𝑘 = 1 + (𝑘 𝑚𝑜𝑑 11).
13 11
• Since 14 ՞ ≡ 1 and14 ՞ ≡ 3, we insert the key 14 into
empty slot 9, after examining slots 1 and 5 and finding
them to be occupied.
• ℎ1 14 = 1, ℎ2 14 = 4
• → ℎ 14,0 = 1 + 0 × 4 𝑚𝑜𝑑 13 = 1 (Collision)
• → ℎ 14,1 = 1 + 1 × 4 𝑚𝑜𝑑 13 = 5 (Collision)
• → ℎ 14,2 = 1 + 2 × 4 𝑚𝑜𝑑 13 = 9
55

Dsa 4

Uploaded by

Copyright:

Available Formats

Dsa 4

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Dsa 4

Uploaded by

Copyright:

Available Formats

Hashing

Lecturer: Dr. Keyhanipour

(null) (null) Jens Jensen (null) (null)

• A linked list: takes 𝑂(𝑟) time, 𝑂(𝑟) space

Jens Jensen Ole Olsen

(null) (null) (null) (null) Jens Jensen

(null) (null) (null) (null) Jens Jensen

• Operations of direct-address tables: all (1) time

• Open addressing (also called

• A collection 𝐻 of hash functions is universal if for any randomly

Pr{𝑓(𝑘) = 𝑓(𝑙)} ≤ 1/𝑚

• Using chaining and universal hashing, the expected time for

• Each element occupies a single slot in the hash table – no

• Slots are examined sequentially, until an empty one is found

• Tries to avoid primary clustering

• One of the best methods for open addressing collision

You might also like