Consistent Hashing
in your python applications
Europython 2017
@ultrabug
Gentoo Linux developer
CTO at Numberly
History & main use cases
Distributed (web) caching (Akamai)
P2P (Chord & BitTorrent)
Distributed databases (data distribution / sharding)
● Amazon DynamoDB
● Cassandra / ScyllaDB
● Riak
● CockroachDB
MAPPING
referential -> information
Phonebook
name -> phone number
Map logic
Referential selection Logical operation INFORMATION
lookup efficiency
MAP
key -> value
Python dict()
{key: value}
Python dict() is a Hash Table
Hash Table logic
Hash function ( key ) Logical operation LOCATION
implementation
Python dict() implementation
Array (in memory)
hash(key) & (size of array - 1) = array index
hash(‘a’) = 12416037344 & 11 = 0 0 | value: 123
1 |
hash(‘c’) = 12672038114 & 11 = 2 2 | value: ‘coco’
hash(‘b’) = 12544037731 & 11 = 3 3 | value: None
...
11 |
Key factors to consider
Distribution (balancing) Accuracy LOCATION
efficiency
scaling
Python dict efficiency & scaling
hash(key) & (size of array - 1) = array index
hash(‘a’) = 12416037344 & 11 = 0 0 | value: 123
hash() = uneven distribution 1 | MEMORY
hash(‘c’) = 12672038114 & 11 = 2 2 | value: ‘coco’
hash(‘b’) = 12544037731 & 11 = 3 3 | value: None
...
Optimized for fast lookups O(1)
Memory inefficient (probing)
11 | MEMORY
Distributed Hash Tables
(DHT)
Split your key space into buckets
hash(key) operator bucket h o v
hash(key) operator bucket h o v
hash(key) operator bucket h o v
the hash function will impact the size of each bucket
Distribute your buckets to servers
hash(key) operator SERVER 0 bucket 0
hash(key) operator SERVER 1 bucket 1
hash(key) operator SERVER 2 bucket 2
what’s the best operator function to find the server hosting the bucket for my key
?
Naive DHT implementation
md5(key) % (number of buckets) = server
int(md5(b'd').hexdigest(), 16) %3=0 SERVER 0 bucket 0
int(md5(b'e').hexdigest(), 16) %3=1 SERVER 1 bucket 1
int(md5(b'f').hexdigest(), 16) %3=2 SERVER 2 bucket 2
simple & looking good...
Naive DHT implementation
md5(key) % (number of buckets) = server
int(md5(b'd').hexdigest(), 16) % 4 = 1 (was 0) SERVER 0 bucket 0
int(md5(b'e').hexdigest(), 16) % 4 = 2 (was 1) SERVER 1 bucket 1
int(md5(b'f').hexdigest(), 16) % 4 = 3 (was 2) SERVER 2 bucket 2
int(md5(b'g').hexdigest(), 16) %4=1 SERVER 1 bucket 1
...until you add / remove a bucket/server! SERVER 3 bucket 3
n/(n+1)
~ fraction of remapped keys
HELP!
we need consistency
The Hash Ring
Place your servers on the continuum (ring)
hash(server 1)
hash(server 0)
hash(server 2)
Keys’ bucket is on the next server in the ring
hash(key)
SERVER 2
SERVER 0
hash(key)
SERVER 1
1/n
~ fraction of remapped keys
Uneven partitions lead to hotspots
server 0
server 2
server 1
hash functions are not perfect
Which hash function to use ?
Cryptographic hash functions Non cryptographic hash functions
● MD5 ● CityHash (google)
● SHA1 ● Murmur (v3)
● SHA256
standard optimized for key lookups
adoption fast
need conversion to int need of C libs
SHAX - MD5 - CityHash128 - Murmur3 - CityHash64 - CityHash32
speed
Hash Rings vnodes & weights mitigate hotspots
reduces load variance on servers
My preciouuus!
Consistent Hashing implementations in python
ConsistentHashing A simple implement of consistent hashing
consistent_hash The algorithm is the same as libketama
hash_ring Using md5 as hashing function
python-continuum Using md5 as hashing function
uhashring Full featured, ketama compatible
uhashring
Example use case #1
Database instances distribution
DB1
client A
client B DB2
client C
DB3
client D
DB4
Example use case #1
Database instances distribution
Example use case #1
Database instances distribution
Example use case #2
Disk & network I/O distribution
disk
task A 1
disk
task B
2
task C disk
3
task D
disk
4
Example use case #3
Log & tracing consistency
worker
user_id A 1
worker
user_id B
2
user_id C worker
3
user_id D
worker
4
Example use case #4
python-memcached consolidation
cache
‘potato’ 1
‘coconut’ cache
2
‘tomato’ cache
3
‘raspberry’
cache
4
Live demo raffle
List of GIFs
One of the GIF is the winner
Every participant is a node (bucket)
hash(WINNER_GIF_URL) picks the winner node
http://ep17.nbly.co
(silly live demo)
Thanks
github.com/ultrabug/ep2017
github.com/ultrabug/uhashring
@ultrabug