Hashing

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 30

Hashing-Introduction

21CSC201J - Data Structures and


Algorithms
by
J.Arthy
Definition
• Hashing refers to the process of generating
a fixed-size output from an input of variable
size using the mathematical formulas
known as hash functions.
• This technique determines an index or
location for the storage of an item in a data
structure.
Terminologies
• Hash Function: Mathematical formula
which, when applied to a key produces an
integer which can be used as an index for
the key in the hash table.
• Hash Code: The hash function crunches
the data and give a unique hash code. This
hash code is typically integer value that can
be used an index.
• Hash Table: The hash code then points you
to a specific location within the hash table.
Hash Key
• Hash key (also known as a hash value or
hash code) is a fixed-size numerical or
alphanumeric representation generated by a
hashing algorithm.
• It is derived from the input data, such as a
text string or a file, through a process
known as hashing.
Indexing into hash table
Properties of a Good Hash
Function
• Low cost
• Determinism
• Uniformity
Different Hash Functions
• Division Method
• Multiplication Method
• Mid Square Method
• Folding Method
Multiplication
• h(k)=⌊m(kAmod1)⌋
• Where ⌊ ⌋ denotes the floor function.
Division
(k)=k mod m
Where k is the key and 𝑚m is a prime
number.
Mid Square
In the mid-square method, the key is squared, and the middle digits of the result
are taken as the hash value.
Steps:
Square the key.
Extract the middle digits of the squared value

Folding
The folding method involves dividing the key into equal parts, summing the parts,
and then taking the modulo with respect to 𝑚m.
Steps:
Divide the key into parts.
Sum the parts.
Take the modulo 𝑚m of the sum.
Simple hashes
• It's possible to have very simple hash
functions if you are certain of your keys
• For example,
› suppose we know that the keys s will be real
numbers uniformly distributed over 0  s < 1
› Then a very fast, very good hash function is
• hash(s) = floor(s·m)
• where m is the size of the table
very simple mapping
• hash(s) = floor(s·m) maps from 0  s < 1 to 0..m-1
› m = 10

s 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

floor(s*m) 0 1 2 3 4 5 6 7 8 9

Note the even distribution. There are collisions, but we will deal with them later.
Perfect hashing
• In some cases it's possible to map a known
set of keys uniquely to a set of index values
• You must know every single key beforehand
and be able to derive a function that works
one-to-one (not necessarily onto)
s 120 331 912 74 665 47 888 219

hash(s) 0 1 2 3 4 5 6 7 8 9
modulo mapping
• a mod m maps from integers to 0..m-1
› one to one? no
› onto? yes

x -4 -3 -2 -1 0 1 2 3 4 5 6 7

x mod 4 0 1 2 3 0 1 2 3 0 1 2 3
Hash function : mod
• If keys are integers, we can use the hash
function:
› Hash(key) = key mod TableSize
• Problem 1: What if TableSize is 11 and all
keys are 2 repeated digits? (eg, 22, 33, …)
› all keys map to the same index
› Need to pick TableSize carefully: often, a prime
number
Hash Function : add chars
• If keys are strings can get an integer by adding up ASCII
values of characters in key
hashValue = 0;
while (*key != ‘\0’)
hashValue += *key++;
character C S E 3 7 3 <0>
ASCII value 67 83 69 32 51 55 51 0

• We are converting a very large number (c0c1c2c3c4) to a


relatively small number (c0+c1+c2+c3+c4)
Issues with hash add char
• Problems with adding up character values for
string keys
› If string keys are short, will not hash evenly
to all of the hash table
› Different character combinations hash to
same value
• “abc”, “bca”, and “cab” all add up to the same
value
Hash function : chars as digits
• Suppose keys can use any of 26 characters plus
blank (27 characters numbered 0 to 26)
› these are digits in a base 27 representation of a number
› can use 32 instead of 27 and shift left by 5 bits for fast
multiplication, ie, consider the number to be a base 32
value
• A key conversion function for short strings
› “abc” = 1*322 + 2*321 + 3 =1091
› “bca” = 2*322 + 3*321 + 1 =2243
› “cab” = 3*322 + 1*321 + 2 =6342
Collisions
• Collisions occur when the hash function maps
twwo different keys to the same location.The
methods to resove collisions are

1. Open Addessing
2. Chaining
Open Addressing
• All the keys are kept inside the hash table, unlike separate
chaining.
• The hash table contains the only key information.
• The methods for open addressing are as follows:

• Linear Probing
• Quadratic Probing
• Double Hashing
Linear Probing
• If slot hash (x) % S is full, then
we try ( hash (x) + 1 ) % S
• If ( hash (x) + 1 ) % S is also
full, then we try ( hash (x) + 2)
%S
• If ( hash (x) + 2 ) % S is also
full, then we try ( hash (x) + 3 )
%S
Linear probing -- clustering

no collision
collision in small cluster
no collision

collision in large cluster


Quadatic Probing
• let hash (x) be the slot index computed using hash function.
• If slot hash(x) % S is full, then we try ( hash (x) + 1*1 ) % S
• If ( hash (x) + 1*1 ) % S is also full, then we try ( hash (x) +
2*2 ) % S
• If ( hash (x) + 2*2 ) % S is also full, then we try ( hash (x) +
3*3 ) % S
• ..................................................
• ..................................................
Doube Hashing
• let hash(x) be the slot index computed using hash function.
• If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
• If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) +
2*hash2(x)) % S
• If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) +
3*hash2(x)) % S
• ..................................................
• ..................................................
Double hashing

no collision
collision, try again
no collision at h1(x)+h2(x)

collision, try again


at h1(z)+h2(z),
at h1(z)+2h2(z),
at h1(z)+3h2(z), ...

[R. Sedgewick]
Rehashing - rebuild the table
• Rehashing is the process of
increasing the size of a
hashmap and redistributing the
elements to new buckets based
on their new hash values. It is
done to improve the
performance of the hashmap
and to prevent collisions caused
by a high load factor.
• Rehashing is needed in a
hashmap to prevent collision
and to maintain the efficiency
of the data structure.
Load Factor
• The load factor is a measure that decides when to increase the
HashMap or Hashtable capacity to maintain the get() and put()
operation of complexity O(1). The default value of the load factor of
HashMap is 0.75 (75% of the map size). In short, we can say that the
load factor decides "when to increase the number of buckets to store
the key-value pair."

• LARGER load factor: Lower space consumption but higher lookups


SMALLER Load factor: Larger space consumption compared to the
required number of elements.
Resolution by Chaining
Applications of Hashing
• Database Indexing
• CD Databases
• Driver Licenses/Insurance cards
• Sparse Matrix
• File Signatures
• Game Boards

You might also like