
CityHash: Fast Hash Functions for Strings

Geoff Pike
(joint work with Jyrki Alakuijala)

Google

http://code.google.com/p/cityhash/
Introduction

- Who?
- What?
- When?
- Where?
- Why?
Outline

Introduction

A Biased Review of String Hashing

Murmur or Something New?

Interlude: Testing

CityHash

Conclusion
Recent activity

- SHA-3 winner was announced last month
- Spooky version 2 was released last month
- MurmurHash3 was finalized last year
- CityHash version 1.1 will be released this month
In my backup slides you can find ...

- My notation
- Discussion of cyclic redundancy checks
  - What is a CRC?
  - What does the crc32q instruction do?
Traditional String Hashing

- Hash function loops over the input
- While looping, the internal state is kept in registers
- In each iteration, consume a fixed amount of input
- Sample loop for a traditional byte-at-a-time hash:

for (int i = 0; i < N; i++) {
  state = Combine(state, Bi)
  state = Mix(state)
}
Two more concrete old examples (loop only)

for (int i = 0; i < N; i++)
  state = ρ-5(state) ⊕ Bi

for (int i = 0; i < N; i++)
  state = 33 · state + Bi
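Both loops fit in a few lines of Python. A minimal sketch (the 32-bit register width and zero initial state are assumptions, not stated on the slide; the second loop is the classic "times 33" multiplier):

```python
def rot_xor_hash(data: bytes, bits: int = 32) -> int:
    """First loop: state = rho_-5(state) XOR Bi (left-rotate by 5, then xor a byte)."""
    mask = (1 << bits) - 1
    state = 0
    for b in data:
        state = ((state << 5) | (state >> (bits - 5))) & mask  # left rotate by 5
        state ^= b
    return state

def times33_hash(data: bytes, bits: int = 32) -> int:
    """Second loop: state = 33 * state + Bi, truncated to the register width."""
    mask = (1 << bits) - 1
    state = 0
    for b in data:
        state = (33 * state + b) & mask
    return state
```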
A complete byte-at-a-time example

// Bob Jenkins circa 1996
int state = 0
for (int i = 0; i < N; i++) {
  state = state + Bi
  state = state + σ-10(state)
  state = state ⊕ σ6(state)
}
state = state + σ-3(state)
state = state ⊕ σ11(state)
state = state + σ-15(state)
return state
What’s better about this? What’s worse?
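A direct Python transcription of the slide (assuming 32-bit wraparound arithmetic for the int state; σ-n is a left shift, σn a right shift):

```python
def jenkins_one_at_a_time(data: bytes) -> int:
    """Byte-at-a-time hash from the slide, with 32-bit wraparound."""
    M = 0xFFFFFFFF
    state = 0
    for b in data:
        state = (state + b) & M
        state = (state + ((state << 10) & M)) & M  # state += sigma_-10(state)
        state ^= state >> 6                        # state ^= sigma_6(state)
    # final mixing steps outside the loop
    state = (state + ((state << 3) & M)) & M
    state ^= state >> 11
    state = (state + ((state << 15) & M)) & M
    return state
```

One observation on the slide's question: each byte is immediately stirred through two shifts, so diffusion is better than the older loops; the cost is that it still consumes only one byte per iteration.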
What Came Next—Hardware Trends

- CPUs generally got better
  - Unaligned loads work well: read words, not bytes
  - More registers
  - SIMD instructions
  - CRC instructions
- Parallelism became more important
  - Pipelines
  - Instruction-level parallelism (ILP)
  - Thread-level parallelism
What Came Next—Hash Function Trends

- People got pickier about hash functions
  - Collisions may be more costly
  - Hash functions in libraries should be “decent”
- More acceptance of complexity
- More emphasis on diffusion
Jenkins’ mix
Also around 1996, Bob Jenkins published a hash function with
a 96-bit input and a 96-bit output. Pseudocode with 32-bit
registers:

a = a - b; a = a - c; a = a ⊕ σ13(c)
b = b - c; b = b - a; b = b ⊕ σ-8(a)
c = c - a; c = c - b; c = c ⊕ σ13(b)
a = a - b; a = a - c; a = a ⊕ σ12(c)
b = b - c; b = b - a; b = b ⊕ σ-16(a)
c = c - a; c = c - b; c = c ⊕ σ5(b)
a = a - b; a = a - c; a = a ⊕ σ3(c)
b = b - c; b = b - a; b = b ⊕ σ-10(a)
c = c - a; c = c - b; c = c ⊕ σ15(b)

Thorough, but pretty fast!
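A Python transcription of mix (σn is a right shift, σ-n a left shift; subtraction wraps mod 2^32):

```python
def mix(a: int, b: int, c: int):
    """Jenkins' 96-bit-to-96-bit mix, using 32-bit registers."""
    M = 0xFFFFFFFF
    a = (a - b) & M; a = (a - c) & M; a ^= c >> 13
    b = (b - c) & M; b = (b - a) & M; b ^= (a << 8) & M
    c = (c - a) & M; c = (c - b) & M; c ^= b >> 13
    a = (a - b) & M; a = (a - c) & M; a ^= c >> 12
    b = (b - c) & M; b = (b - a) & M; b ^= (a << 16) & M
    c = (c - a) & M; c = (c - b) & M; c ^= b >> 5
    a = (a - b) & M; a = (a - c) & M; a ^= c >> 3
    b = (b - c) & M; b = (b - a) & M; b ^= (a << 10) & M
    c = (c - a) & M; c = (c - b) & M; c ^= b >> 15
    return a, b, c
```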


Jenkins’ mix-based string hash

Given mix(a, b, c) as defined on the previous slide, pseudocode
for a string hash:

uint32 a = ...
uint32 b = ...
uint32 c = ...
int iters = ⌊N/12⌋
for (int i = 0; i < iters; i++) {
  a = a + W3i
  b = b + W3i+1
  c = c + W3i+2
  mix(a, b, c)
}
etc.
Modernizing Google’s string hashing practices

- Until recently, most string hashing at Google used Jenkins’ techniques
  - Some in the “32-bit” style
  - Some in the “64-bit” style, whose mix is 4/3 times as long
- We saw Austin Appleby’s 64-bit Murmur2 was faster and considered switching
- Launched an education campaign around 2009
  - Explain the options; give recommendations
  - Encourage labelling: “may change” or “won’t”
Quality targets for string hashing

There are roughly four levels of quality one might seek:

- quick and dirty
- suitable for a library
- suitable for fingerprinting
- secure

Is Murmur2 good for a library? For fingerprinting? Both?
Murmur2 preliminaries

First define two subroutines:

ShiftMix(a) = a ⊕ σ47(a)

and

TailBytes(N) = Σ_{i=1}^{N mod 8} 256^((N mod 8)−i) · B_{N−i}
Murmur2

uint64 k = 14313749767032793493
int iters = ⌊N/8⌋
uint64 hash = seed ⊕ (N · k)
for (int i = 0; i < iters; i++)
  hash = (hash ⊕ (ShiftMix(Wi · k) · k)) · k

if (N mod 8 > 0)
  hash = (hash ⊕ TailBytes(N)) · k

return ShiftMix(ShiftMix(hash) · k)
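The pseudocode above maps directly to Python. A sketch following the slide (the word order is an assumption: 8-byte words read little-endian, and TailBytes(N) as defined earlier is exactly the remaining bytes read as a little-endian integer):

```python
K = 14313749767032793493  # the 64-bit multiplier k from the slide
M = (1 << 64) - 1

def shift_mix(a: int) -> int:
    return a ^ (a >> 47)

def murmur2_64(data: bytes, seed: int = 0) -> int:
    n = len(data)
    h = (seed ^ ((n * K) & M)) & M          # hash = seed XOR N*k
    for i in range(n // 8):                 # body: one 64-bit word per iteration
        w = int.from_bytes(data[8 * i:8 * i + 8], "little")
        h = ((h ^ ((shift_mix((w * K) & M) * K) & M)) * K) & M
    tail = data[(n // 8) * 8:]
    if tail:                                # TailBytes(N), little-endian
        h = ((h ^ int.from_bytes(tail, "little")) * K) & M
    return shift_mix((shift_mix(h) * K) & M)
```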
Murmur2 Strong Points

- Simple
- Fast (assuming multiplication is fairly cheap)
- Quality is quite good
Questions about Murmur2 (or any other choice)

- Could its speed be better?
- Could its quality be better?
Murmur2 Analysis

Inner loop is:

for (int i = 0; i < iters; i++)
  hash = (hash ⊕ f(Wi)) · k

where f is “Mul-ShiftMix-Mul”
Murmur2 Speed

- ILP comes mostly from parallel application of f
- Cost of TailBytes(N) can be painful for N < 60 or so
Murmur2 Quality

- f is invertible
- During the loop, diffusion isn’t perfect
Testing

Common tests include:

- Hash a bunch of words or phrases
- Hash other real-world data sets
- Hash all strings with edit distance ≤ d from some string
- Hash other synthetic data sets
  - E.g., 100-word strings where each word is “cat” or “hat”
  - E.g., any of the above with e x t r a s p a c e
- We use our own tests plus SMHasher
- Avalanche
Avalanche (by example)

Suppose we have a function that inputs and outputs 32 bits.

Pick M random input values. Hash each input value with and
without its jth bit flipped. How often do the results differ in their
kth output bit?

Ideally we want “coin flip” behavior, so the relevant count
has mean M/2 and variance M/4.
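This test is easy to script. A minimal sketch (the bit width, trial count, and RNG seed are arbitrary choices, not from the talk):

```python
import random

def avalanche_matrix(f, bits: int, m: int, seed: int = 0):
    """counts[j][k] = how often flipping input bit j flips output bit k of f."""
    rng = random.Random(seed)
    counts = [[0] * bits for _ in range(bits)]
    for _ in range(m):
        x = rng.getrandbits(bits)
        fx = f(x)
        for j in range(bits):
            diff = fx ^ f(x ^ (1 << j))   # output bits that changed
            for k in range(bits):
                counts[j][k] += (diff >> k) & 1
    return counts  # ideal: every entry near m/2

# The identity function has the worst possible diffusion:
# flipping input bit j flips exactly output bit j, every time.
```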
64x64 avalanche diagram: f(x) = x
64x64 avalanche diagram: f(x) = kx
64x64 avalanche diagram: ShiftMix
64x64 avalanche diagram: ShiftMix(x) · k
64x64 avalanche diagram: ShiftMix(kx) · k
64x64 avalanche diagram: f(x) = CRC(kx)
The CityHash Project

Goals:

- Speed (on Google datacenter hardware or similar)
- Quality
  - Excellent diffusion
  - Excellent behavior on all contributed test data
  - Excellent behavior on basic synthetic test data
  - Good internal state diffusion, but not too good; cf. Rogaway’s Bucket Hashing
Portability

For speed without total loss of portability, assume:

- 64-bit registers
- pipelined and superscalar execution
- fairly cheap multiplication
- cheap +, −, ⊕, σ, ρ, β
- cheap register-to-register moves
- a + b may be cheaper than a ⊕ b
- a + cb + 1 may be fairly cheap for c ∈ {0, 1, 2, 4, 8}
Branches are expensive

Is there a better way to handle the “tails” of short strings?

How many dynamic branches are reasonable for hashing a
12-byte input?

How many arithmetic operations?
CityHash64 initial design (2010)

- Focus on short strings
- Perhaps just use Murmur2 on long strings
- Use overlapping unaligned reads
- Write the minimum number of loops: 1
- Focus on speed first; fix quality later
The CityHash64 function: overall structure

if (N <= 32)
  if (N <= 16)
    if (N <= 8)
      ...
    else
      ...
  else
    ...
else if (N <= 64) {
  // Handle 33 <= N <= 64
  ...
} else {
  // Handle N > 64
  int iters = ⌊N/64⌋
  ...
}
The CityHash64 function (2012): preliminaries

Define α(u, v, m):

let a    = u ⊕ v
    a'   = ShiftMix(a · m)
    a''  = a' ⊕ v
    a''' = ShiftMix(a'' · m)
in
    a''' · m

Also, k0, k1, and k2 are primes near 2^64, and K is k2 + 2N.
CityHash64: 1 <= N <= 3

let a = B0
    b = B⌊N/2⌋
    c = BN−1
    y = a + 256b
    z = N + 4c
in
    ShiftMix((y · k2) ⊕ (z · k0))
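As a Python sketch of this branch. The constants k0 and k2 are assumptions taken from the public CityHash source; the slide itself only says they are primes near 2^64:

```python
K0 = 0xc3a5c85c97cb3127  # assumed value of k0 (public CityHash source)
K2 = 0x9ae16a3b2f90404f  # assumed value of k2 (public CityHash source)
M = (1 << 64) - 1

def shift_mix(a: int) -> int:
    return a ^ (a >> 47)

def city_len_1_to_3(s: bytes) -> int:
    """The 1 <= N <= 3 branch, as written on the slide."""
    n = len(s)
    a, b, c = s[0], s[n >> 1], s[n - 1]  # first, middle, and last bytes
    y = a + 256 * b                      # y = a + 256b
    z = n + 4 * c                        # z = N + 4c
    return shift_mix(((y * K2) & M) ^ ((z * K0) & M))
```

Note how all three length cases share one branch-free body: the byte indices 0, N>>1, and N−1 coincide where needed, which is the trick that keeps dynamic branches low.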
CityHash64: 4 <= N <= 8

α(N + 4·W0^32, W−1^32, K)
CityHash64: 9 <= N <= 16

let a = W0 + k2
    b = W−1
    c = ρ37(b) · K + a
    d = (ρ25(a) + b) · K
in
    α(c, d, K)
CityHash64: 17 <= N <= 32

let a = W0 · k1
    b = W1
    c = W−1 · K
    d = W−2 · k2
in
    α(ρ43(a + b) + ρ30(c) + d, a + ρ18(b + k2) + c, K)
CityHash64: 33 <= N <= 64

let a = W0 · k2
    e = W2 · k2
    f = W3 · 9
    h = W−2 · K
    u = ρ43(a + W−1) + 9(ρ30(W1) + c)
    v = a + W−1 + f + 1
    w = h + β((u + v) · K)
    x = ρ42(e + f) + W−3 + β(W−4)
    y = (β((v + w) · K) + W−1) · K
    z = e + f + W−3
    r = β((x + z) · K + y) + W1
    t = ShiftMix((r + z) · K + W−4 + h)
in
    t · K + x
Evaluation for N <= 64

- CityHash64 is about 1.5x faster than Murmur2 for N ≤ 64
- Quality meets targets (bug reports are welcome)
- Simplifying it would be nice
- Key lesson: Don’t loop over bytes
- Key lesson: Understand the basics of machine architecture
- Key lesson: Know when to stop
Next steps

Arguably we should have written CityHash32 next. That’s still
not done.

Instead, we worked on 64-bit hashes for N > 64, and on 128-bit
hashes.
CityHash64 for N > 64

The one loop in CityHash64:

- 56 bytes of state
- 64 bytes consumed per iteration
- 7 rotates, 4 multiplies, 1 xor, about 36 adds (??)
- influenced by mix and Murmur2
128-bit CityHash variants

- CityHash128
  - same loop body, manually unrolled
  - slightly faster for large N
- CityHashCrc128
  - totally different function
  - uses the CRC instruction, but isn’t a CRC
  - faster still for large N
Evaluation for N > 64

- CityHash64 is about 1.3 to 1.6x faster than Murmur2
- For long strings, the fastest CityHash variant is about 2x
  faster than the fastest Murmur variant
- Quality meets targets (bug reports are welcome)
- Jenkins’ Spooky is a strong competitor
My recommendations

For hash tables or fingerprints:

              Nehalem, Westmere,     similar              other
              Sandy Bridge, etc.     CPUs                 CPUs
  small N     CityHash               CityHash             TBD
  large N     CityHash               Spooky or CityHash   TBD

For quick-and-dirty hashing: Start with the above

Future work

- CityHash32
- Big Endian
- SIMD
The End

Backup Slides
Notation

N = the length of the input (bytes)

a ⊕ b = bitwise exclusive-or
a + b = sum (usually mod 2^64)
a · b = product (usually mod 2^64)
σn(a) = right shift a by n bits
σ-n(a) = left shift a by n bits
ρn(a) = right rotate a by n bits
ρ-n(a) = left rotate a by n bits
β(a) = byteswap a
More Notation

Bi = the i-th byte of the input (counts from 0)

Wi^b = the i-th b-bit word of the input
W−1^b = the last b-bit word of the input
W−2^b = the second-to-last b-bit word of the input
Cyclic Redundancy Check (CRC)

The commonest explanation of a CRC is in terms of
polynomials whose coefficients are elements of GF(2). In
GF(2):

0 is the additive identity,
1 is the multiplicative identity, and
1 + 1 = 0 + 0 = 0.
CRC, part 2

Sample polynomial:

p = x^32 + x^27 + 1
CRC, part 3

We can use p to define an equivalence relation: we’ll say q and
r are equivalent iff they differ by a polynomial times p.
CRC, part 4

Theorem: The equivalence relation has 2^Degree(p) equivalence classes.

Lemma: If Degree(p) = Degree(q) > 0,
then Degree(p + q) < Degree(p);
otherwise, Degree(p + q) = max(Degree(p), Degree(q)).

Observation: There are 2^Degree(p) polynomials with degree
less than Degree(p), no two of them equivalent.
CRC, part 5

Observation: Any polynomial with degree ≥ Degree(p) is
equivalent to a lower-degree polynomial.

Example: What is a degree ≤ 31 polynomial equivalent to x^50?

Degree(x^50) − Degree(p) = 18; therefore x^50 − x^18 · p has
degree less than 50.

x^50 − x^18 · p = x^50 − x^18 · (x^32 + x^27 + 1)
               = x^50 − (x^50 + x^45 + x^18)
               = x^45 + x^18
CRC, part 6

Applying the same idea repeatedly will lead us to the lowest
degree polynomial that is equivalent to x^50.

The result:

x^50 ≡ x^30 + x^18 + x^13 + x^8 + x^3
CRC, part 7

More samples:

x^50 ≡ x^30 + x^18 + x^13 + x^8 + x^3
x^50 + 1 ≡ x^30 + x^18 + x^13 + x^8 + x^3 + 1
x^51 ≡ x^31 + x^19 + x^14 + x^9 + x^4
x^51 + x^50 ≡ x^31 + x^30 + x^19 + x^18 + x^14 + x^13 + x^9 + x^8 + x^4 + x^3
x^51 + x^31 ≡ x^19 + x^14 + x^9 + x^4
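These reductions are easy to check mechanically by representing a GF(2) polynomial as a Python int whose bit i is the coefficient of x^i:

```python
def poly_mod(a: int, p: int) -> int:
    """Reduce polynomial a modulo p over GF(2); addition and subtraction are both xor."""
    dp = p.bit_length() - 1  # Degree(p)
    while a and a.bit_length() - 1 >= dp:
        # cancel a's leading term by subtracting x^(deg a - deg p) * p
        a ^= p << (a.bit_length() - 1 - dp)
    return a

P = (1 << 32) | (1 << 27) | 1  # p = x^32 + x^27 + 1
```

For instance, poly_mod(1 << 50, P) reproduces the first sample above.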
CRC in Practice

- There are thousands of CRC implementations
- We’ll focus on those that use _mm_crc32_u64() or crc32q
  - The inputs are a 32-bit number and a 64-bit number
  - The output is a 32-bit number
What is crc32q?

crc32q for inputs u and v returns

C(u ⊕ v) = F(E(D(u ⊕ v))).

D(0) = 0, D(1) = x^95, D(2) = x^94, D(3) = x^95 + x^94, D(4) = x^93, ...

E maps a polynomial to the equivalent polynomial of lowest degree.

F(0) = 0, F(x^31) = 1, F(x^30) = 2, F(x^31 + x^30) = 3, F(x^29) = 4, ...
How is crc32q used?

C operates on 64 bits of input, so:

For a 64-bit input, use C(seed, u0).

For a 128-bit input, use C(C(seed, u0), u1).

For a 192-bit input, use C(C(C(seed, u0), u1), u2).


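This chaining can be modeled in software with a bitwise CRC-32C step. A sketch of the polynomial arithmetic only: 0x82F63B78 is the reflected Castagnoli polynomial, and no claim is made here of bit-exact agreement with the hardware instruction:

```python
POLY = 0x82F63B78  # reflected CRC-32C (Castagnoli) polynomial

def crc32q_model(crc: int, word: int) -> int:
    """One crc32q-style step: fold a 64-bit little-endian word into a 32-bit state."""
    for i in range(8):
        crc ^= (word >> (8 * i)) & 0xFF          # inject the next byte
        for _ in range(8):                       # divide by the polynomial, bit by bit
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc

def crc_of_words(seed: int, words) -> int:
    """C(C(...C(seed, u0)...), u_last): one chained call per 64-bit word."""
    crc = seed
    for w in words:
        crc = crc32q_model(crc, w)
    return crc
```

The chained form makes the data dependency visible: each step needs the previous 32-bit result, which limits ILP and motivates the multi-stream tricks used in fast CRC code.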
C as matrix-vector multiplication

A 32 × 64 matrix times a 64 × 1 vector yields a 32 × 1 result.

The matrix and vectors contain elements of GF(2).