Hash Functions: 6.1 The Hash Function SHA1
Hash Functions: 6.1 The Hash Function SHA1
Hash Functions: 6.1 The Hash Function SHA1
Hash Functions
A hash function usually means a function that compresses, meaning the output is shorter than the
input. Often, such a function takes an input of arbitrary or almost arbitrary length to one whose
length is a fixed number, like 160 bits. Hash functions are used in many parts of cryptography,
and there are many different types of hash functions, with differing security properties. We will
consider them in this chapter.
Figure 6.1: The SHA1 hash function and the underlying SHF1 family.
strings M and M that hash to the same value. This is just by the pigeonhole principle: if 2256
pigeons (the 256-bit messages) roost in 2160 holes (the 160-bit hash values) then some two pigeons
(two distinct strings) roost in the same hole (have the same hash). Indeed countless pigeons must
share the same hole. The difficult is only that nobody has as yet identified (meaning, explicitly
provided) even two such pigeons (strings).
Bellare and Rogaway 3
In trying to define this collision-resistance property of SHA1 we immediately run into foun-
dational problems. We would like to say that it is computationally infeasible to output a pair
of distinct strings M and M that collide under SHA1. But in what sense could it be infeasible?
There is a programindeed a very short an simple one, having just two print statementswhose
output specifies a collision. Its not computationally hard to output a collision; it cant be. The
only difficulty is our human problem of not knowing what this program is.
It seems very hard to make a mathematical definition that captures the idea that human beings
cant find collisions in SHA1. In order to reach a mathematically precise definition we are going to
have to change the very nature of what we conceive to be a hash function. Namely, rather than it
being a single function, it will be a family of functions. This is unfortunate in some ways, because
it distances us from concrete hash functions like SHA1. But no alternative is known.
Figure 6.3: Framework for security notions for collision-resistant hash functions. The three choices
of s {0, 1, 2} give rise to three notions of security.
In measuring resource usage of an adversary we use our usual conventions. Although there is
formally no definition of a secure hash function, we will talk of a hash function being CR2, CR1
or CR0 with the intended meaning that its associated advantage function is small for all adversaries
of practical running time.
Note that the running time of the adversary is not really relevant for CR0, because we can
always imagine that hardwired into its code is a best choice of distinct points x1 , x2 , meaning a
Bellare and Rogaway 5
Figure 6.4: Games defining security notions for three kinds of collision-resistant hash functions
under known-key attack.
Figure 6.5: Types of hash functions, with names in our framework and corresponding names found
in the literature.
The above value equals Advcr0H (A) and is the maximum advantage attainable.
Clearly, a CR2 hash function is also CR1 and a CR1 hash function is also CR0. The following
states the corresponding relations formally. The proof is trivial and is omitted.
6 HASH FUNCTIONS
Proposition 6.2.2 Let H: K D R be a hash function. Then for any adversary A0 there
exists an adversary A1 having the same running time as A0 and
Advcr0 (A ) Advcr1-kk (A ) .
H 0 H 1
Also for any adversary A1 there exists an adversary A2 having the same running time as A1 and
cr1-kk -kk (A ) .
AdvH (A1 ) Advcr2
H 2
cr2-kk
We believe that SHF1 is CR2, meaning that there is no practical algorithm A for which AdvH (A)
is appreciably large. This is, however, purely a belief, based on the current inability to find such
an algorithm. Perhaps, later, such an algorithm will emerge.
It is useful, for any integer n, to get SHF1n : {0, 1}n {0, 1}160 denote the restriction of SHF1
to the domain {0, 1}n . Note that a collision for SHF1nK is also a collision for SHF1K , and it is often
convenient to think of attacking SHF1n for some fixed n rather than SHF1 itself.
Figure 6.6: Birthday attack on a hash function H: K D R. The attack is successful in finding
a collision if it does not return FAIL.
A particular trial finds a collision with probability (about) 1 in |R|, so we expect to find a collision
in about q = |R| trials. This is much better than the |D| trials used by our first attempt. In
particular, a collision for shf1 would be found in time around 2160 rather than 2672 . But this is still
far from practical. Our conclusion is that as long as the range size of the hash function is large
enough, this attack is not a threat.
We now consider another strategy, called a birthday attack, that turns out to be much better
than the above. It is illustrated in Fig. 6.6. It picks at random q points from the domain, and
applies HK to each of them. If it finds two distinct points yielding the same output, it has found
a collision for HK . The question isphow large q need be to find a collision. The answer may seem
surprising at first. Namely, q = O( |R|) trials suffices.
We will justify this later, but first let us note the impact. Consider SHA1n with n 161. As we
indicated, the random-input collision-finding attack 160 trials to find a collision. The
takes about 2
birthday attack on the other hand takes around 2160 = 280 trials. This is MUCH less than 2160 .
Similarly, the birthday attack finds a collision in shf1 in around 280 trials while while random-input
collision-finding takes about 2160 trials.
To see why the birthday attack performs as well as we claimed, we recall the following game.
Suppose we have q balls. View them as numbered, 1, . . . , q. We also have N bins, where N q.
We throw the balls at random into the bins, one by one, beginning with ball 1. At random means
that each ball is equally likely to land in any of the N bins, and the probabilities for all the balls
are independent. A collision is said to occur if some bin ends up containing at least two balls. We
are interested in C(N, q), the probability of a collision. As shown in the Appendix,
q2
C(N, q) (6.1)
2N
for 1 q 2N . Thus C(N, q) 1 for q 2N .
The relation to birthdays arises from the question of how many people need be in a room before
the probability of there being two people with the same birthday is close to one. We imagine each
person has a birthday that is a random one of the 365 days in a year. This means we can think
of a person as a ball being thrown at random into one of 365 bins, where the i-th bin represents
having birthday the i-th day of the year. So we can apply the Proposition from the Appendix
with N = 365and q the number of people in the room. The Proposition says that when the room
contains q 2 365 27 people, the probability that there are two people with the same birthday
is close to one. This number (27) is quite small and may be hard to believe at first hearing, which
is why this is sometimes called the birthday paradox.
To see how this applies to the birthday attack of Fig. 6.6, let us enumerate the points in the
range as R1 , . . . , RN , where N = |R|. Each such point defines a bin. We view xi as a ball, and
imagine that it is thrown into bin yi , where yi = HK (xi ). Thus, a collision of balls (two balls in
8 HASH FUNCTIONS
the same bin) occurs precisely when two values xi , xj have the same output under HK . We are
interested in the probability that this happens as a function of q. (We ignore the probability that
xi = xj , counting a collision only when HK (xi ) = HK (xj ). It can be argued that since D is larger
than R, the probability that xi = xj is small enough to neglect.)
However, we cannot apply the birthday analysis directly, because the latter assumes that each
ball is equally likely to land in each bin. This is not, in general, true for our attack. Let P (Rj )
denote the probability that a ball lands in bin Rj , namely the probability that HK (x) = Rj taken
over a random choice of x from D. Then
1
|HK (Rj )|
P (y) = .
|D|
In order for P (R1 ) = P (R2 ) = = P (RN ) to be true, as required to apply the birthday analysis,
it must be the case that
1 1 1
|HK (R1 )| = |HK (R2 )| = = |HK (RN )| .
A function HK with this property is called regular, and H is called regular if HK is regular for
every K. Our conclusion is that if H is regular, then the probability that the attack
p succeeds is
roughly C(N, q). So the above says that in this case we need about q 2N = 2 |R| trials to
find a collision with probability close to one.
If H is not regular, it turns out the attack succeeds even faster, telling us that we ought to
design hash functions to be as close to regular as possible [2].
In summary, there is a 2l/2 or better time attack to find collisions in any hash function outputting
l bits. This leads designers to choose l large enough that 2l/2 is prohibitive. In the case of SHF1
and shf1, the choice is l = 160 because 280 is indeed a prohibitive number of trials. These functions
cannot thus be considered vulnerable to birthday attack. (Unless they turn out to be extremely
non-regular, for which there is no evidence so far.)
Ensuring, by appropriate choice of output length, that a function is not vulnerable to a birthday
attack does not, of course, guarantee it is collision resistant. Consider the family H: K{0, 1}161
{0, 1}160 defined as follows. For any K and any x, function HK (x) returns the first 160 bits of x.
The output length is 160, so a birthday attack takes 280 time and is not feasible, but it is still easy
to find collisions. Namely, on input K, an adversary can just pick some 160-bit y and output y0, y1.
This tells us that to ensure collision-resistance it is not only important to have a long enough output
but also design the hash function so that there no clever shortcuts to finding a collision, meaning
no attacks that exploit some weakness in the structure of the function to quickly find collisions.
We believe that shf1 is well-designed in this regard. Nobody has yet found an adversary that
finds a collision in shf1 using less than 280 trials. Even if a somewhat better adversary, say one
finding a collision for shf1 in 265 trials, were found, it would not be devastating, since this is still a
very large number of trials, and we would still consider shf1 to be collision-resistant.
If we believe shf1 is collision-resistant, Theorem 6.5.2 tells us that SHF1, as well as SHF1n , can
also be considered collision-resistant, for all n.
We let
Advow A
H (A) = Pr[OWH true] .
We now ask ourselves whether collision-resistance implies one-wayness. It is easy to see, however,
that, in the absence of additional assumptions about the hash function than collision-resistance,
the answer is no. For example, let H be a family of functions every instance of which is the
identity function. Then H is highly collision-resistant (the advantage of an adversary in finding
a collision is zero regardless of its time-complexity since collisions simply dont exist) but is not
one-way.
However, we would expect that genuine hash functions, meaning ones that perform some
non-trivial compression of their data (ie. the size of the range is more than the size of the domain)
are one-way. This turns out to be true, but needs to be carefully quantified. To understand the
issues, it may help to begin by considering the natural argument one would attempt to use to show
that collision-resistance implies one-wayness.
Suppose we have an adversary A that has a significant advantage in attacking the one-wayness
of hash function H. We could try to use A to find a collision via the following strategy. In the
pre-key phase (we consider a type-1 attack) we pick and return a random point x1 from D. In the
post-key phase, having received the key K, we compute y = HK (x1 ) and give K, y to A. The latter
returns some x2 , and, if it was successful, we know that HK (x2 ) = y. So HK (x2 ) = HK (x1 ) and
we have a collision.
Not quite. The catch is that we only have a collision if x2 6= x1 . The probability that this
happens turns out to depend on the quantity:
h i
$ $ 1
PreImH (1) = Pr K K; x D ; y HK (x) : |HK (y)| = 1 .
This is the probability that the size of the pre-image set of y is exactly 1, taken over y generated
as shown. The following Proposition says that a collision-resistant function H is one-way as long
as PreImH (1) is small.
Proposition 6.4.2 Let H: K D R be a hash function. Then for any A there exists a B such
that
Furthermore the running time of B is that of A plus the time to sample a domain point and compute
H once.
10 HASH FUNCTIONS
The result is about the CR1 type of collision-resistance. However Proposition 6.2.2 implies that
the same is true for CR2.
A general and widely-applicable corollary of the above Proposition is that collision-resistance
implies one-wayness as long as the domain of the hash function is significantly larger than its range.
The following quantifies this.
Corollary 6.4.3 Let H: K D R be a hash function. Then for any A there exists a B such
that
-kk (A) 2 Advcr1-kk (B) + |R|
Advow
H H .
|D|
Furthermore the running time of B is that of A plus the time to sample a domain point and compute
H once.
Proof of Corollary 6.4.3: For any key K, the number of points in the range of HK that have
exactly one pre-image certainly cannot exceed |R|. This implies that
|R|
PreImH (1) .
|D|
The corollary follows from Proposition 6.4.2.
Corollary 6.4.3 says that if H is collision-resistant, and performs enough compression that |R|
is much smaller than |D|, then it is also one-way. Why? Let A be a practical adversary that
attacks the one-wayness of H. Then B is also practical, and since H is collision-resistant we know
Advcr1 -kk (B) is low. Equation (6.2) then tells us that as long as |R|/|D| is small, Advow-kk (A) is
H H
low, meaning H is one-way.
As an example, let H be the compression function shf1. In that case R = {0, 1}160 and D =
{0, 1}672 so |R|/|D| = 2512 , which is tiny. We believe shf1 is collision-resistant, and the above thus
says it is also one-way.
There are some natural hash functions, however, for which Corollary 6.4.3 does not apply.
Consider a hash function H every instance of which is two-to-one. The ratio of range size to
domain size is 1/2, so the right hand side of the equation of Corollary 6.4.3 is 1, meaning the
bound is vacuous. However, such a function is a special case of the one considered in the following
Proposition.
Corollary 6.4.4 Suppose 1 r < d and let H: K {0, 1}d {0, 1}r be a hash function which
1
is regular, meaning |HK (y)| = 2dr for every y {0, 1}r and every K K. Then for any A there
exists a B such that
ow-kk -kk (B) .
AdvH (A) 2 Advcr1
H
Furthermore the running time of B is that of A plus the time to sample a domain point and compute
H once.
Proof of Corollary 6.4.4: The assumption d > r implies that PreImH (1) = 0. Now apply
Proposition 6.4.2.
Let Pr [] denote the probability of event in experiment Expcr1-kk (B). For any K K let
H
1
SK = { x D : |HK (HK (x))| = 1 } .
H(K, M )
y pad(M )
Parse y as M1 k M2 k k Mn where |Mi | = b (1 i n)
V IV
for i = 1, . . . , n do
V h(K, Mi k V )
Return V
Adversary Ah (K)
Run AH (K) to get its output (x1 , x2 )
y1 pad(x1 ) ; y2 pad(x2 )
Parse y1 as M1,1 k M1,2 k k M1,n[1] where |M1,i | = b (1 i n[1])
Parse y2 as M2,1 k M2,2 k k M2,n[2] where |M2,i | = b (1 i n[2])
V1,0 IV ; V2,0 IV
for i = 1, . . . , n[1] do V1,i h(K, M1,i k V1,i1 )
for i = 1, . . . , n[2] do V2,i h(K, M2,i k V2,i1 )
if (V1,n[1] 6= V2,n[2] OR x1 = x2 ) return FAIL
if |x1 | =
6 |x2 | then return (M1,n[1] k V1,n[1]1 , M2,n[2] k V2,n[2]1 )
n n[1] // n = n[1] = n[2] since |x1 | = |x2 |
for i = n downto 1 do
if M1,i k V1,i1 6= M2,i k V2,i1 then return (M1,i k V1,i1 , M2,i k V2,i1 )
Figure 6.7: Hash function H defined from compression function h via the MD paradigm, and
adversary Ah for the proof of Theorem 6.5.2.
Let b be an integer parameter called the block length, and v another integer parameter called
the chaining-variable length. Let h: K {0, 1}b+v {0, 1}v be a family of functions that we call
the compression function. We assume it is collision-resistant.
Let B denote the set of all strings whose length is a positive multiple of b bits, and let D be
b
some subset of {0, 1}<2 .
Definition 6.5.1 A function pad: D B is called a MD-compliant padding function if it has the
following properties for all M, M1 , M2 D:
(1) M is a prefix of pad(M )
(2) If |M1 | = |M2 | then |pad(M1 )| = |pad(M2 )|
(3) If M1 6= M2 then the last block of pad(M1 ) is different from the last block of pad(M2 ).
A block, above, consists of b bits. Remember that the output of pad is in B, meaning is a sequence
of b-bit blocks. Condition (3) of the definition is saying that if two messages are different then,
when we apply pad to them, we end up with strings that differ in their final blocks.
An example of a MD-compliant padding function is shapad. However, there are other examples
as well.
Now let IV be a v-bit value called the initial vector. We build a family H: K D {0, 1}v
from h and pad as illustrated in Fig. 6.7. Notice that SHF1 is such a family, built from h = shf1
and pad = shapad. The main fact about this method is the following.
Bellare and Rogaway 13
Theorem 6.5.2 Let h: K{0, 1}b+v {0, 1}v be a family of functions and let H: KD {0, 1}v
be built from h as described above. Suppose we are given an adversary AH that attempts to find
collisions in H. Then we can construct an adversary Ah that attempts to find collisions in h, and
Advcr2-kk (A ) Advcr2-kk (A ) .
H H h h (6.9)
Furthermore, the running time of Ah is that of AH plus the time to perform (|pad(x1 )|+|pad(x2 )|)/b
computations of h where (x1 , x2 ) is the collision output by AH .
This theorem says that if h is collision-resistant then so is H. Why? Let AH be a practical adversary
attacking H. Then Ah is also practical, because its running time is that of AH plus the time to
do some extra computations of h. But since h is collision-resistant we know that Advcr2 -kk (A )
h h
cr2-kk
is low. Equation (6.9) then tells us that AdvH (AH ) is low, meaning H is collision-resistant as
well.
Proof of Theorem 6.5.2: Adversary Ah , taking input a key K K, is depicted in Fig. 6.7. It
runs AH on input K to get a pair (x1 , x2 ) of messages in D. We claim that if x1 , x2 is a collision
for HK then Ah will return a collision for hK .
Adversary Ah computes V1,n[1] = HK (x1 ) and V2,n[2] = HK (x2 ). If x1 , x2 is a collision for HK then
we know that V1,n[1] = V2,n[2] . Let us assume this. Now, let us look at the inputs to the application
of hK that yielded these outputs. If these inputs are different, they form a collision for hK .
The inputs in question are M1,n[1] k V1,n[1]1 and M2,n[2] k V2,n[2]1 . We now consider two cases. The
first case is that x1 , x2 have different lengths. Item (3) of Definition 6.5.1 tells us that M1,n[1] 6=
M2,n[2] . This means that M1,n[1] k V1,n[1]1 6= M2,n[2] k V2,n[2]1 , and thus these two points form a
collision for hK that can be output by Ah .
The second case is that x1 , x2 have the same length. Item (2) of Definition 6.5.1 tells us that y1 , y2
have the same length as well. We know this length is a positive multiple of b since the range of pad
is the set B, so we let n be the number of b-bit blocks that comprise y1 and y2 . Let Vn denote the
value V1,n , which by assumption equals V2,n . We compare the inputs M1,n kV1,n1 and M2,n kV2,n1
that under hK yielded Vn . If they are different, they form a collision for hK and can be returned
by Ah . If, however, they are the same, then we know that V1,n1 = V2,n1 . Denoting this value by
Vn1 , we now consider the inputs M1,n1 k V1,n2 and M2,n1 k V2,n2 that under hK yield Vn1 .
The argument repeats itself: if these inputs are different we have a collision for hK , else we can step
back one more time.
Can we get stuck, continually stepping back and not finding our collision? No, because y1 6= y2 .
Why is the latter true? We know that x1 6= x2 . But item (1) of Definition 6.5.1 says that x1 is a
prefix of y1 and x2 is a prefix of y2 . So y1 6= y2 .
We have argued that on any input K, adversary Ah finds a collision in hK exactly when AH finds a
collision in HK . This justifies Equation (6.9). We now justify the claim about the running time of
Ah . The main component of the running time of Ah is the time to run AH . In addition, it performs
a number of computations of h equal to the number of blocks in y1 plus the number of blocks in
y2 . There is some more overhead, but small enough to neglect.
14 HASH FUNCTIONS
Bibliography
[2] M. Bellare and T. Kohno. Hash function balance and its impact on birthday attacks.
Advances in Cryptology EUROCRYPT 04, Lecture Notes in Computer Science Vol. 3027,
C. Cachin and J. Camenisch ed., Springer-Verlag, 2004.
[3] I. Damg ard. A Design Principle for Hash Functions. Advances in Cryptology CRYPTO
89, Lecture Notes in Computer Science Vol. 435, G. Brassard ed., Springer-Verlag, 1989.
[4] B. den Boer and A. Bosselaers, Collisions for the compression function of MD5. Advances
in Cryptology EUROCRYPT 93, Lecture Notes in Computer Science Vol. 765, T. Helleseth
ed., Springer-Verlag, 1993.
[6] H. Dobbertin, Cryptanalysis of MD5. Rump Session of Eurocrypt 96, May 1996,
http://www.iacr.org/conferences/ec96/rump/index.html.
[8] M. Naor and M. Yung, Universal one-way hash functions and their cryptographic appli-
cations. Proceedings of the 21st Annual Symposium on the Theory of Computing, ACM,
1989.
[9] National Institute of Standards. FIPS 180-2, Secure hash standard, August 2000. http://
csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf.
[10] R. Merkle. One way hash functions and DES. Advances in Cryptology CRYPTO 89,
Lecture Notes in Computer Science Vol. 435, G. Brassard ed., Springer-Verlag, 1989.
[11] R. Rivest, The MD4 message-digest algorithm, Advances in Cryptology CRYPTO 90,
Lecture Notes in Computer Science Vol. 537, A. J. Menezes and S. Vanstone ed., Springer-
Verlag, 1990, pp. 303311. Also IETF RFC 1320 (April 1992).
[12] R. Rivest, The MD5 message-digest algorithm. IETF RFC 1321 (April 1992).