Private Information Retrieval in the Presence of Malicious Failures
Erica Y. Yang, Jie Xu and Keith H. Bennett
Department of Computer Science, University of Durham, DH1 3LE, UK
{Erica.Yang, Jie.Xu, Keith.Bennett}@dur.ac.uk
Abstract
In the application domain of online information
services such as online census information, health records
and real-time stock quotes, there are at least two
fundamental challenges: the protection of users’ privacy
and the assurance of service availability. We present a
fault-tolerant scheme for private information retrieval
(FT-PIR) that protects users’ privacy and ensures service
provision in the presence of malicious server failures. An
error detection algorithm is introduced into this scheme
to detect the corrupted results from servers. The
analytical and experimental results show that the FT-PIR
scheme can tolerate malicious server failures effectively
and prevent any information about users from being leaked to
attackers. This new scheme relies neither on any unproven
cryptographic premise nor on the availability of tamper-proof
hardware. An implementation of the FT-PIR scheme
on a distributed database system suggests just a modest
level of performance overhead.
Keywords: Distributed systems, fault tolerance, malicious
failures, privacy protection, private information retrieval,
secret sharing, security
1: Introduction
1.1: Motivation
Security-critical information systems are becoming
increasingly accessible through the Internet as a part of
Web-based services. Census information systems and
real-time stock information systems are just two examples
among many that provide such services. For those online
information services, there are at least two fundamental
requirements to meet: high-degree security/privacy
protection (e.g. protection of users’ intention) and high
availability of services.
In client-server computing, security mechanisms are
focussed traditionally on protecting servers and little
consideration has been placed on client-side applications
and users of services. This situation severely limits the
practical feasibility of Web-based services,
especially for certain critical applications. This is because
a service provider, or a server, may be a client application
of some further servers (thus forming a recursive
client-server architecture).
Malicious attacks, software bugs, and network
unavailability are common causes for service
unavailability [5]. Users’ privacy may be violated when a
server is under control of some malicious attackers.
Existing approaches attempted to address availability and
security requirements separately, and there is little work
that suggests a balanced and combined solution to them.
1.2: Private Information Retrieval (PIR)
The first PIR scheme was introduced by Chor et al. [6]
in 1995, and it has since been the subject of a
significant amount of work, including [1][3][4][7][8][10].
The PIR schemes provide a partial solution to the problem
of enabling a user to retrieve data from replicated
database servers without exposing the user’s intention to
the servers. By replicating databases on separate nodes
and limiting the communication capability of the replicas,
the PIR scheme [6] can protect users' privacy, provided
that the number of servers in collusion does not exceed a
pre-defined bound.
The existing PIR schemes have some limitations that
constrain their practical feasibility in real-world
applications. First, these schemes are based on a
binary-bit model of a database, where a data item is modelled as
a bit 1 or 0. In practice, it is difficult to map this data
model onto a realistic database system. Secondly, the PIR
schemes use a simple honest-but-curious model for
attacks, in which servers always deliver correct and honest
answers to clients. They do not address any other types of
failures (e.g. loss of answers and spurious answers).
1.3: Major Results
We address the PIR problem in the presence of
malicious failures. We propose a fault-tolerant PIR
(FT-PIR) scheme that utilises the replication paradigm to
provide both highly available services and the protection
of users’ privacy.
The major contributions of this paper are as follows:
1. This paper proposes a system model for secure
information retrieval, and presents a fault-tolerant PIR
scheme that can cope with malicious failures.
2. This paper presents for the first time a design and
implementation of the FT-PIR scheme on a distributed
database system. The preliminary experimental results
show that the FT-PIR scheme adds only a small
performance overhead to that of the PIR scheme.
Previous PIR research was limited to demonstrating
the theoretical feasibility of the scheme, without any
implementation and evaluation available, although there is
some experimental work in the research area of
combining fault tolerance with security. Castro’s
Byzantine fault tolerance [5], Reiter’s Rampart toolkit
[11] and Zhou’s COCA [16] are typical examples that
address both security and fault tolerance issues of
distributed systems. However, the focus has
traditionally been placed on the servers, with little
consideration of the client side. Our work may also be
categorised as a combination of fault tolerance and security,
but with an emphasis on the protection of users’ privacy
(i.e. the client side's privacy). Wang et al. [15] proposed to use
software transformation to protect software in an
untrusted environment. However, the proposed approach
provided only a limited capability of protecting data
privacy and tolerating malicious server failures.
2: Preliminaries
2.1: The System Model
Consider a synchronous distributed system with a set
of processing nodes. The system consists of a set of $K$
replicated servers (called replicas) $s_1, s_2, \dots, s_K$, a client
application, a user that utilises the application and an
adversary. The replicas and the client run on different
nodes. The replicas provide information services to the
client application.
The information stored in the replicas is modelled as a
character string $x = x_1 x_2 \dots x_n$, which is composed of $n$
characters. The character $x_j$, where $j \in [n] \equiv \{1, 2, \dots, n\}$,
is considered as an integer taken from a known range
$[X-1] \equiv \{0, 1, \dots, X-1\}$. Each server stores an identical
copy of $x \in [X-1]^n$. Each character of $x$ can also be
considered as an element taken from the finite field
GF(q), where $q > X-1$ and $q$ is a prime, i.e. $x_j \in \mathrm{GF}(q)$.
For simplicity, we denote the $q$ elements of GF(q) by
$[q-1] \equiv \{0, 1, 2, \dots, q-1\}$.
Suppose that a user wants to retrieve a character $x_i$, where
$i \in [n]$, from the replicas and submits the index $i$ to the client
application. For the sake of security and fault tolerance,
the client uses a randomised and redundant strategy to
generate query functions for the user. Based on $i$ and some
random inputs $r$ (of length $L_{rnd}$), this strategy produces $K$
queries of length $L_q$, $Q_1(i, r), Q_2(i, r), \dots, Q_K(i, r)$, one per
server. Note that the random inputs $r$ are elements
chosen uniformly at random from GF(q).
An adversary may corrupt up to $t$ servers, where $K \ge 2t + 1$.
A corrupted server may then exhibit malicious
behaviour, e.g. it may purposefully give no answers or
give spurious answers. The adversary may also pool the
information collected from corrupted servers and perform
arbitrary computation on it, for an arbitrarily long time, to
violate the user's privacy. The privacy requirement in our model
is that for every $s_1, \dots, s_t \in [K]$, the joint distribution of
$(Q_{s_1}(i, r), Q_{s_2}(i, r), \dots, Q_{s_t}(i, r))$ is independent of $i$, where
$r$ is uniformly distributed over $[q-1]^{L_{rnd}}$. From any $t$ or
fewer queries, the adversary will not be able to gain any
information about $i$.
Upon receiving a query, a server performs an answer
function. There are K answer functions A1, ..., AK in total,
one per server. These answer functions take the
information stored in the replicas and the client's queries
as input and generate answers. The replicas will then send
back answers to the client respectively. Based on any t + 1
or more correct answers, the client can reconstruct the
desired result by performing a reconstruction function.
The client also checks the validity of the result by
executing a verification function.
If the result is considered to be valid, the client will
return the result to the user. Otherwise the client keeps
performing the reconstruction function and the
verification function until it finds a valid result. By the
condition that K ≥ 2t + 1, there will eventually be a valid
result available. This guarantees that the system satisfies
the liveness property. The corruption or loss of no more
than t answers will not prevent the client from reconstructing
the valid result correctly. The verification function is used to
detect the occurrence of corruption. This guarantees that
the system satisfies the safety property.
2.2: Formal Definition
Definition: A (t, K) FT-PIR scheme is a one-round
information retrieval scheme, where $K \ge 2t+1$. For a given
character string $x$ of length $n$, a given $x_i$ (i.e. the user's
intention) and a set of valid results in $[X-1]$, the scheme
consists of
K query functions $Q_1, \dots, Q_K: [n] \times [q-1]^{L_{rnd}} \to [q-1]^{L_q}$
K answer functions $A_1, \dots, A_K: [X-1]^n \times [q-1]^{L_q} \to [q-1]^{L_a}$
A reconstruction function $R: [n] \times [q-1]^{L_{rnd}} \times ([q-1]^{L_a})^{t+1} \to [q-1]$
A verification function V: if a reconstructed result is in
$[X-1]$, then it is a valid result. Otherwise, it is invalid.
The scheme should satisfy the following properties:
a) Correctness & Availability: For every $x \in [X-1]^n$, $i \in [n]$,
and $r \in [q-1]^{L_{rnd}}$, $\exists\, s_1, \dots, s_t, s_{t+1} \in \{1, 2, \dots, K\}$ such that
$R(i, r, A_{s_1}(x, Q_{s_1}(i, r)), \dots, A_{s_{t+1}}(x, Q_{s_{t+1}}(i, r))) = x_i$.
b) Privacy: For every $i \in [n]$, $j \in [n]$, every $s_1, \dots, s_t \in \{1, 2, \dots, K\}$
and $Q \in ([q-1]^{L_q})^t$
$\Pr((Q_{s_1}(i, r), Q_{s_2}(i, r), \dots, Q_{s_t}(i, r)) = Q) =$
$\Pr((Q_{s_1}(j, r), Q_{s_2}(j, r), \dots, Q_{s_t}(j, r)) = Q)$
where the probabilities are taken over uniformly
chosen $r \in [q-1]^{L_{rnd}}$.
c) Safety: The scheme can reconstruct a valid result.
d) Liveness: The scheme eventually terminates.
Property a) states that provided no more than t servers
behave maliciously, the client has at least one set of
correct candidate answers which is sufficient to perform
the reconstruction function and is able to reconstruct the
intended information xi correctly. Property b) states that
from any t queries, it is impossible to rule out any
candidate indices since every index has the same
probability to be the one the user is interested in.
3: Construction of the FT-PIR Scheme
We now discuss details of constructing the FT-PIR
scheme with an associated ability to detect incorrect
results. The FT-PIR scheme is based on low degree
polynomial interpolation [2] by modifying the PIR’s
construction in [6]. The original PIR scheme considered
the problem of protecting the privacy of the user with
respect to coalitions. To deal with the t communicating
servers in a coalition, the work in [6] used t + 1 replicas.
The FT-PIR scheme uses 2t + 1 replicated servers. In
order to tolerate t malicious servers, the scheme sends
2t + 1 queries, one to each of the K replicas. By using a
verification function, the FT-PIR scheme can detect a
corrupted result with high probability.
The FT-PIR scheme is based on a set of basic PIR
operations defined in [6] (for further details, see [6]). We
now give an overview of the basic operations of FT-PIR.
Note that all the operations are taken over the finite field
GF(q). By the definition of a finite field, it is understood
that the results of field operations are still elements of a
finite field [12].
3.1: Basic Operations of the FT-PIR Scheme
For every $j \in [n]$, we associate with it an indicator function
$j(\cdot)$ that maps the set $[n]$ to the set $\{0, 1\}$: $j(l) = 1$ if $j = l$,
and $j(l) = 0$ otherwise, for $l = 1, \dots, n$. The query functions
generated by a client consist of $n$ degree-$t$ polynomials:
$g_l(z) = i(l) + r_{l1} z + r_{l2} z^2 + \dots + r_{lt} z^t$ for $l = 1, \dots, n$
where the free term of the $l$-th polynomial is $i(l)$ and
the coefficients $r_{l1}, r_{l2}, \dots, r_{lt}$ are chosen at random
(not necessarily distinct) from GF(q).
The client application also randomly selects $2t + 1$
distinct non-zero values from GF(q), denoted by $m_1, m_2,
\dots, m_{2t+1}$, and evaluates the polynomials at these values.
The client then sends the ordered tuple $\langle g_1(m_d), g_2(m_d),
\dots, g_n(m_d) \rangle$ to the corresponding replica $S_d$ as a query. Note
that randomness plays an important role in the
construction and evaluation of the polynomials $g_l(z)$.
Even if a user submits the same $i$, a client application can
generate different polynomials and queries because of the
randomness of the coefficients and evaluation points
chosen. An answer function takes $x$ and a query as
parameters and has the following form:
$F_{i,x}(z) = \sum_{j \in [n]} g_j(z) \, x_j$
By the above construction, we have that:
P1. $F_{i,x}(z)$ is a polynomial of degree at most $t$.
P2. The free term of $F_{i,x}(z)$ is $F_{i,x}(0) = x_i$.
For server $S_d$, we then have:
$F_{i,x}(m_d) = \sum_{j \in [n]} g_j(m_d) \, x_j$
where $d = 1, 2, \dots, 2t + 1$ and $F_{i,x}(m_d)$ is the value sent
to the client as the answer.
In the normal case (i.e. fault-free scenarios), the client
obtains the correct value of the polynomial $F_{i,x}(z)$ at
$2t + 1$ distinct points. It is known that from any $t + 1$ distinct
points of the polynomial $F_{i,x}(z)$, the client can use Lagrange
interpolation to obtain $F_{i,x}(0)$ correctly. This guarantees
the correctness of the reconstruction function.
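The construction above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which was written in Java); the function names make_queries, answer and reconstruct, the sample database and the choice q = 257 are assumptions made for illustration only. It builds the n query polynomials, evaluates them at 2t + 1 distinct non-zero points, computes each replica's answer, and recovers the free term by Lagrange interpolation at z = 0.

```python
import random

def make_queries(i, n, t, q, rng=random):
    """Build the 2t+1 evaluation points and queries for 1-based index i."""
    # coeffs[l] = [r_l1, ..., r_lt]: random coefficients of polynomial g_l
    coeffs = [[rng.randrange(q) for _ in range(t)] for _ in range(n)]
    # 2t+1 distinct non-zero evaluation points m_1, ..., m_{2t+1}
    points = rng.sample(range(1, q), 2 * t + 1)
    queries = []
    for m in points:
        query = []
        for l in range(n):
            value = 1 if l + 1 == i else 0  # free term i(l)
            for power, r in enumerate(coeffs[l], start=1):
                value += r * pow(m, power, q)
            query.append(value % q)
        queries.append(query)
    return points, queries

def answer(x, query, q):
    """Replica side: F(m_d) = sum_j g_j(m_d) * x_j over GF(q)."""
    return sum(g * xj for g, xj in zip(query, x)) % q

def reconstruct(points, answers, q):
    """Client side: Lagrange interpolation of F at z = 0 over GF(q)."""
    total = 0
    for d, (m_d, a_d) in enumerate(zip(points, answers)):
        num, den = 1, 1
        for e, m_e in enumerate(points):
            if e != d:
                num = num * m_e % q
                den = den * (m_e - m_d) % q
        # pow(den, -1, q) is the modular inverse (Python 3.8+)
        total += a_d * num * pow(den, -1, q)
    return total % q

x = [72, 105, 33, 7]   # hypothetical database: n = 4 characters, X = 256
q, t, i = 257, 1, 3    # q is prime, q > X - 1 and q > K = 2t + 1
points, queries = make_queries(i, len(x), t, q)
answers = [answer(x, query, q) for query in queries]
# Any t + 1 correct answers suffice to recover x_3.
print(reconstruct(points[:t + 1], answers[:t + 1], q))  # 33
```

Each replica sees only a vector of n field elements, so the per-query communication in this sketch grows linearly with the database length n.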
3.2: A Verification Function
In order to reconstruct the desired result, it is important
that all the answers used for the reconstruction are correct.
The polynomial interpolation-based PIR and FT-PIR
schemes are essentially in the same spirit as Shamir’s
secret sharing scheme [13]. They exploit the properties of
polynomials (i.e. perfect secrecy and interpolation
uniqueness [9]) to provide privacy protection and
fault-tolerant services. It is thus possible to apply the existing
results on secret sharing schemes to both PIR and FT-PIR.
In the following, we present a probabilistic verification
function for error detection, which is adapted from
Tompa and Woll’s cheater detection scheme [14].
The major modification required here is to use a
different way to determine the order of the finite field for
the FT-PIR scheme, i.e. $q$. As described in Section 2, the
character $x_j$ ($j \in [n]$) is considered as an integer taken
from $\{0, 1, \dots, X-1\}$. If the reconstructed result is within
this range, it is considered to be valid. For every $x_j$, there
are $X$ candidate valid results. Since all the calculations
are performed over GF(q), there are $q$ possible results.
We can increase the size of the field GF(q) so that the
probability of an undetected error, i.e. a valid but incorrect
result, is less than a given value $e$, where $e > 0$. By adapting
the results from [14], if we choose $q > \max((X - 1)t/e + t + 1, K)$ for
the FT-PIR scheme, the probability of an undetected error
will be less than an arbitrarily selected $e$ (for the proof of
the result, see [14]). This verification function relies
neither on any unproven cryptographic premise (e.g. the
intractability of factorising large integers) nor on the
availability of tamper-proof hardware.
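As a worked illustration of the bound above, the following Python sketch picks the smallest prime q that satisfies q > max((X - 1)t/e + t + 1, K). The helper names is_prime and field_order are hypothetical, and choosing the smallest admissible prime is an assumption; the text only requires some prime above the bound. Exact rational arithmetic is used so that the bound is not affected by floating-point rounding.

```python
import math
from fractions import Fraction

def is_prime(p):
    """Trial division; adequate for the small field orders in this sketch."""
    if p < 2:
        return False
    for d in range(2, math.isqrt(p) + 1):
        if p % d == 0:
            return False
    return True

def field_order(X, t, K, e):
    """Smallest prime q with q > max((X - 1) * t / e + t + 1, K).

    e is the target probability of an undetected error, given as a Fraction.
    """
    bound = max(Fraction(X - 1) * t / Fraction(e) + t + 1, K)
    q = math.floor(bound) + 1  # smallest integer strictly above the bound
    while not is_prime(q):
        q += 1
    return q

# Toy parameters: X = 4 valid values, t = 1, K = 3, e = 1/2.
# The bound is max(3 * 1 * 2 + 2, 3) = 8, so the smallest admissible prime is 11.
print(field_order(4, 1, 3, Fraction(1, 2)))  # 11
```

For byte-sized characters (X = 256) and a small e, the required field is much larger than the alphabet, which is the price paid for detecting errors without cryptographic assumptions.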
The modified FT-PIR scheme works as follows. Two
types of RESULTs, i.e. valid results and invalid results,
may be reconstructed by the reconstruction function. A
RESULT is valid if and only if RESULT ∈ {0, 1, 2, …, X-1}.
At step 1, the algorithm initialises RESULT to
-1, which is a dummy value and an impossible
candidate result. The system accepts only valid RESULTs
and rejects invalid ones. The groupCounter variable
counts the number of candidate groups checked so far.
1. RESULT = -1.
2. A user feeds an index i to the system.
3. The client uses query functions to generate K queries.
4. Wait until at least t + 1 answers are available.
5. Set groupCounter to 1.
6. If groupCounter > $\binom{K}{t+1}$, i.e. the number of distinct
groups of t + 1 answers that can be chosen from K, stop.
(In this case, there are more than t faulty servers. The
scheme will not be able to reconstruct the result.)
7. Pick a new group of t+1 candidate answers.
8. Based on the answers in the group, execute the
reconstruction function to obtain a RESULT.
9. Perform the verification function: if the RESULT is
valid, go to step 10. Otherwise, increase the
groupCounter by one and go to step 6.
10. Output the RESULT to the user.
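Steps 4 to 10 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the names interpolate_at_zero and retrieve, the example polynomial and the prime q = 65537 are assumptions. Iterating over itertools.combinations plays the role of groupCounter, since it visits each group of t + 1 answers exactly once and stops after at most C(K, t + 1) groups.

```python
from itertools import combinations

def interpolate_at_zero(points, values, q):
    """Lagrange interpolation at z = 0 over GF(q), q prime."""
    total = 0
    for d, (m_d, a_d) in enumerate(zip(points, values)):
        num, den = 1, 1
        for e, m_e in enumerate(points):
            if e != d:
                num = num * m_e % q
                den = den * (m_e - m_d) % q
        total += a_d * num * pow(den, -1, q)
    return total % q

def retrieve(points, answers, t, X, q):
    """Return the first valid RESULT, or None if every group is rejected.

    points[d] is the evaluation point of replica d; answers maps the index
    of each replica that replied to its (possibly corrupted) answer.
    """
    for group in combinations(sorted(answers), t + 1):
        result = interpolate_at_zero(
            [points[d] for d in group], [answers[d] for d in group], q)
        if 0 <= result < X:  # verification function: RESULT in {0, ..., X-1}
            return result
    return None

# Hypothetical run with t = 1, K = 3, X = 256 and F(z) = 33 + 5z over GF(65537).
points = [1, 2, 3]
correct = {0: 38, 1: 43, 2: 48}
corrupted = dict(correct)
corrupted[0] = 1000  # replica 0 returns a spurious answer
print(retrieve(points, corrupted, t=1, X=256, q=65537))  # 33
```

In this run the groups (0, 1) and (0, 2) reconstruct 1957 and 1476, which fall outside {0, ..., 255} and are rejected; the group (1, 2) yields 33. As in the scheme itself, a corrupted group could still produce an in-range value, with a probability controlled by the choice of q.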
The following theorems show that the FT-PIR scheme
constructed satisfies the properties given in Section 2,
assuming K ≥ 2t + 1.
Theorem 1: The FT-PIR scheme satisfies the
correctness property.
Sketch of Proof: By the Lagrange interpolation
theorem [12], $t + 1$ distinct points are necessary and
sufficient to uniquely determine a polynomial of degree at
most $t$, and thereby its free term. Therefore, in the FT-PIR
scheme, $t + 1$ distinct and correct answers (combined with the
corresponding distinct points $m_1, m_2, \dots, m_{t+1}$) uniquely
determine the free term of the polynomial $F_{i,x}(z)$. From P2,
the free term of $F_{i,x}(z)$ is $x_i$. This proves the correctness of
the FT-PIR scheme.
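For reference, the interpolation step used in this proof can be written out explicitly. The index names $a$ and $b$ are ours; the formula is the standard Lagrange interpolation formula evaluated at $z = 0$, assuming the client holds correct answers from replicas $s_1, \dots, s_{t+1}$ with distinct non-zero evaluation points:

```latex
\[
x_i \;=\; F_{i,x}(0) \;=\; \sum_{a=1}^{t+1} F_{i,x}(m_{s_a})
\prod_{\substack{b=1 \\ b \neq a}}^{t+1} \frac{m_{s_b}}{m_{s_b} - m_{s_a}}
\pmod{q}
\]
```

All divisions are taken in GF(q); they are well defined because the evaluation points are pairwise distinct.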
Theorem 2: The FT-PIR scheme satisfies the privacy
property.
Sketch of Proof: From the perfect secrecy property of
polynomials [9], we have that for a degree-$t$ polynomial
with randomly chosen coefficients, the knowledge of $t$ (or
fewer) points yields no information about its free term.
In the FT-PIR scheme, each query polynomial $g_l(z)$ has
degree $t$ and free term $i(l)$, so fewer than $t + 1$ queries
reveal no information about $i$, and hence none about which
character $x_i$ is being retrieved. This proves the privacy
property of the FT-PIR scheme.
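The privacy property can also be checked exhaustively for toy parameters. The Python sketch below (the function name joint_query_distribution and the parameters n = 2, t = 2, q = 5 are assumptions chosen to keep the enumeration small) counts, over all possible random coefficients, the joint view that a coalition of replicas obtains. With t colluding replicas the counts are identical for every index i; with t + 1 replicas they differ.

```python
from collections import Counter
from itertools import product

def joint_query_distribution(i, n, t, q, points):
    """Count the joint queries seen at the given points over all random r."""
    counts = Counter()
    for r in product(range(q), repeat=n * t):
        view = []
        for m in points:
            for l in range(n):
                g = 1 if l + 1 == i else 0  # free term i(l)
                for p in range(1, t + 1):
                    g += r[l * t + p - 1] * pow(m, p, q)
                view.append(g % q)
        counts[tuple(view)] += 1
    return counts

n, t, q = 2, 2, 5
coalition = (1, 2)  # evaluation points of t colluding replicas
d1 = joint_query_distribution(1, n, t, q, coalition)
d2 = joint_query_distribution(2, n, t, q, coalition)
print(d1 == d2)  # True: t queries are distributed identically for i = 1 and i = 2

too_many = (1, 2, 3)  # t + 1 replicas can tell the indices apart
print(joint_query_distribution(1, n, t, q, too_many)
      == joint_query_distribution(2, n, t, q, too_many))  # False
```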
4: An Implementation
We have implemented three information retrieval
schemes with varied degrees of privacy protection and
fault tolerance: a Normal Information Retrieval scheme
(NIR), the PIR scheme and the FT-PIR scheme. The NIR
scheme generates normal database queries and provides
no support for privacy protection and fault tolerance.
Take the implementation of the FT-PIR modules as an
example. There are two daemon programs in the system:
the client daemon and the server daemon. The client-side
daemon is a multi-threaded program. It spawns a new
thread to deal with each query it sends. Each thread
connects to a replicated server and waits for the
corresponding answer. When the number of available
answers exceeds a predefined bound, the client daemon
will start the reconstruction process.
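The client-side threading pattern described above might look as follows in Python. This is only a sketch of the idea, not the Java daemon used in the experiments: the function collect_answers is hypothetical, each replica is stood in for by a plain callable instead of a network connection, and the bound is taken to be a threshold such as t + 1.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_answers(servers, queries, threshold):
    """Send one query per replica in its own thread and gather answers.

    servers is a list of callables standing in for network calls. Answers
    are collected in completion order until `threshold` of them are
    available; replicas whose call raises are skipped.
    """
    answers = {}
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {pool.submit(server, query): d
                   for d, (server, query) in enumerate(zip(servers, queries))}
        for future in as_completed(futures):
            try:
                answers[futures[future]] = future.result()
            except Exception:
                continue  # a replica that fails to answer is ignored
            if len(answers) >= threshold:
                break  # enough answers to start reconstruction
    # Leaving the `with` block still waits for outstanding threads to finish.
    return answers
```

A real client would add per-connection timeouts so that a silent replica cannot delay the return from the thread pool.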
The server daemon is also a multi-threaded program,
and it binds to a service port and waits for clients’ queries.
The server can handle multiple clients’ queries
concurrently and maintains a fixed number of thread
objects throughout the running of the system. This
strategy limits the maximum number of concurrent
connections. In this way, the performance of the daemon
is expected to be stable since the system keeps recycling a
fixed number of objects.
There are two types of messages exchanged between
clients and servers: handshake messages and protocol
messages. At the handshake stage, the client daemon
sends a message in the form <handshake, nameOfDatabase>
to the server daemon. The server daemon responds with a
handshake message <handshake, lengthOfDatabase,
lengthOfRecord>. It is important to note that the records
stored in the databases are in the normal form. In theory,
this implementation can be easily adapted to any
commercial database system.
5: Experiments and Results
This section describes the experiment environment and
parameter settings for the experiments performed for three
different schemes, i.e. NIR, PIR, and FT-PIR. We will
investigate the cost of adding extra features to the NIR
scheme and compare the performance of the FT-PIR
scheme with the PIR scheme.
5.1: Experimental Settings
The machines used are all dual-boot with both
Windows NT SP4 and RedHat Linux 7.0/7.2 installed. They
are connected by a 10Mbit/sec Ethernet. These machines
have the same specification: 400 MHz Pentium IIs
running Windows NT Service Pack 4, 3Com EtherLink
XL 10Mb Ethernet NIC (3C900B-COMBO), 64Mbytes
of memory, and a 2 Gigabyte hard disk.
The software used is: JDK 1.3.1, Microsoft’s Access
97, and Microsoft’s ODBC Driver version 3.5. The client
machines share the above specifications with the server
machines. Each server stores an identical Access
database, which is about 86k and stores 10 records. The
database has three fields: recordID, name, and address. We
manually set up ODBC connections for the databases on
the servers.
5.2: Results and Discussion
We have performed two sets of experiments for
fault-free situations. In the first set of experiments, we use
one client machine to communicate with three replicas. In
the second set, we increase the number of
replicas to five. Two timing measurements have been
collected. On the client side, we measure the time taken to
retrieve one record from the replicated servers, i.e. the
round-trip time. In the NIR and PIR tests, this time is the
longest time taken for a thread to finish. This is because
the program is multi-threaded and the client daemon
waits until all the answers are available. However, in the
FT-PIR scheme, the major contribution to the time for
retrieving a record is the time taken to reconstruct the first
correct result. On the server side, we measure the time
spent on each client’s connection.
Figure 1.a The client performance (3 replicas) [plot: RoundTrip Time (ms) against Runs for NIR, PIR and FT-PIR]
Figure 1.b The server performance (3 replicas) [plot: Processing Time (ms) against Runs for NIR, PIR and FT-PIR]
Figure 2.a The client performance (5 replicas) [plot: RoundTrip Time (ms) against Runs for NIR, PIR and FT-PIR]
For the PIR and FT-PIR schemes, we measure the
extra costs needed to add the privacy protection and/or
fault tolerance features. In particular, the extra cost
introduced by the FT-PIR scheme comes from the extra
time needed to generate more queries and to verify the
reconstructed result.
Figures 1.a and 1.b show the performance of one client
communicating with three replicas using different
schemes. Figures 2.a and 2.b show the performance of the
same settings but with five replicas.
Figure 2.b The server performance (5 replicas) [plot: Processing Time (ms) against Runs for NIR, PIR and FT-PIR]
Based on the performance data in Figures 1 and 2, the
average performance of the client and the servers is
calculated and shown in Table 1. Both the PIR and
FT-PIR schemes at least double the cost of the NIR scheme.
However, in comparison with the PIR scheme, the
performance overhead introduced by the FT-PIR scheme
is small. In FT-PIR, when the number of
servers increases from three to five (i.e., the number of
failures that can be tolerated increases from one to two),
only a 1.5% performance overhead is added to the
client-side processing time. Note that this comparison
considers only fault-free situations. The performance
overhead is expected to increase when the FT-PIR
scheme has to cope with malicious servers.
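The overhead figures quoted above follow directly from the averages reported in Table 1. The short Python sketch below reproduces the arithmetic; the dictionary layout and the helper name overhead_percent are ours, while the numbers are the client-side averages from the table.

```python
# Client-side averages from Table 1, in ms, keyed by number of replicas.
client_ms = {
    "NIR":    {3: 206.9, 5: 206.6},
    "PIR":    {3: 418.6, 5: 417.9},
    "FT-PIR": {3: 425.3, 5: 431.7},
}

def overhead_percent(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# FT-PIR, going from three to five replicas (one to two tolerated failures).
print(round(overhead_percent(client_ms["FT-PIR"][5], client_ms["FT-PIR"][3]), 1))  # 1.5

# FT-PIR relative to PIR with three and with five replicas.
print(round(overhead_percent(client_ms["FT-PIR"][3], client_ms["PIR"][3]), 1))  # 1.6
print(round(overhead_percent(client_ms["FT-PIR"][5], client_ms["PIR"][5]), 1))  # 3.3
```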
Table 1 The average performance (in ms)
          Client(3)   Server(3)   Client(5)   Server(5)
NIR       206.9       101.8       206.6       108.7
PIR       418.6       257.7       417.9       245.6
FT-PIR    425.3       256.0       431.7       254.0
From Table 1, it also follows that the round-trip time
(including the time spent on both the client and server sides)
varies only slightly when the number of replicas is increased
from three to five.
6: Conclusions
Previous PIR research mainly demonstrates the
theoretical feasibility of PIR schemes based on an
unrealistic model of binary-bit strings. All the theoretical
schemes assume the honest-but-curious model without
addressing potential malicious server failures. Our work
shows that it is possible to enhance the PIR scheme with
an ability to detect and tolerate malicious failures. We
also demonstrate the feasibility of using this scheme in
practice by designing and implementing the FT-PIR
scheme in a distributed database system. Initial results
show that the performance overhead introduced by the
FT-PIR scheme is modest compared with the PIR
scheme. We also show that the round-trip time does not
increase much even when the number of failures that
can be tolerated increases.
Acknowledgements
This work was supported by the EPSRC IBHIS project
and the EPSRC/DTI e-Demand project.
References
[1] A. Ambainis, “Upper bound on the communication
complexity of private information retrieval”, in Proc. of
ICALP ’97, 1997.
[2] D. Beaver and J. Feigenbaum, “Hiding Instances in
Multioracle Queries”, in STACS, 1990.
[3] A. Beimel and Y. Ishai, “Information-Theoretic Private
Information Retrieval: A Unified Construction”,
TR01-15, Electronic Colloquium on Computational
Complexity, 2001.
[4] C. Cachin, S. Micali, and M. Stadler, “Computationally
Private Information Retrieval with polylogarithmic
communication”, in EUROCRYPT’99, 1999.
[5] M. Castro, “Practical Byzantine Fault Tolerance”,
Technical Report MIT/LCS/TR-817, MIT Laboratory for
Computer Science, Cambridge, MA, January 2001,
Ph.D. thesis.
[6] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan,
“Private Information Retrieval”, in Proc. of 36th FOCS,
1995, pp. 41-50.
[7] B. Chor, N. Gilboa, and M. Naor, “Private information
retrieval by keywords”, Technical Report TR CS0917,
Department of Computer Science, Technion, Israel,
1997.
[8] Y. Gertner, S. Goldwasser, and T. Malkin, “A random
server model for private information retrieval (or how to
achieve information theoretic PIR avoiding data
replication)”, in Proc. of 2nd RANDOM, 1998.
[9] S. Goldwasser and M. Bellare, “Lecture Notes on
Cryptography”, 1996-2001. Available at:
http://www.cs.ucsd.edu/users/mihir/papers/gb.html.
[10] E. Kushilevitz and R. Ostrovsky, “Replication is not
needed: Single database, computationally-private
information retrieval”, in Proc. of 38th Annu. IEEE Symp.
on FOCS, 1997, pp. 364-373.
[11] M. K. Reiter, “Secure agreement protocols: Reliable and
Atomic Group Multicast in Rampart”, in Proc. 2nd ACM
CCS, 1994.
[12] R. Lidl, H. Niederreiter, Finite Fields, Encyclopaedia of
mathematics and its applications, Addison-Wesley,
Reading, 1983.
[13] A. Shamir, “How to Share a Secret”, CACM, Vol. 22,
No. 11, Nov. 1979, pp. 612-613.
[14] M. Tompa and H. Woll, “How to share a secret with
cheaters”, Journal of Cryptology, vol. 1, 1988, pp. 133-138.
[15] C. Wang, J. Davidson, J. Hill, J. Knight, “Protection of
Software-Based Survivability Mechanisms”, in Proc. of
the 2001 Dependable Systems and Networks (DSN’01),
Goteborg, Sweden, July 2001.
[16] L. Zhou, F. B. Schneider, and R. van Renesse, “COCA:
A Secure Distributed On-line Certification Authority”,
Technical Report 2000-1828, Department of Computer
Science, Cornell University, Ithaca, NY USA, Dec.
2000.