
Private Information Retrieval in the Presence of Malicious Failures

2002

Erica Y. Yang, Jie Xu and Keith H. Bennett
Department of Computer Science, University of Durham, DH1 3LE, UK
{Erica.Yang, Jie.Xu, Keith.Bennett}@dur.ac.uk

Abstract

In the application domain of online information services such as online census information, health records and real-time stock quotes, there are at least two fundamental challenges: the protection of users’ privacy and the assurance of service availability. We present a fault-tolerant scheme for private information retrieval (FT-PIR) that protects users’ privacy and ensures service provision in the presence of malicious server failures. An error detection algorithm is introduced into this scheme to detect corrupted results from servers. The analytical and experimental results show that the FT-PIR scheme can tolerate malicious server failures effectively and prevent any information about users from being leaked to attackers. This new scheme relies neither on any unproven cryptographic premise nor on the availability of tamper-proof hardware. An implementation of the FT-PIR scheme on a distributed database system suggests just a modest level of performance overhead.

Keywords: Distributed systems, fault tolerance, malicious failures, privacy protection, private information retrieval, secret sharing, security

1: Introduction

1.1: Motivation

Security-critical information systems are becoming increasingly accessible through the Internet as a part of Web-based services. Census information systems and real-time stock information systems are just two examples among many that provide such services. For those online information services, there are at least two fundamental requirements to meet: a high degree of security/privacy protection (e.g. protection of users’ intention) and high availability of services.

In client-server computing, security mechanisms have traditionally focussed on protecting servers, and little consideration has been given to client-side applications and users of services. This situation hinders Web-based services from becoming practically feasible, especially for certain critical applications, because a service provider, or a server, may itself be a client application of some further servers (thus forming a recursive client-server architecture). Malicious attacks, software bugs, and network unavailability are common causes of service unavailability [5]. Users’ privacy may be violated when a server is under the control of malicious attackers. Existing approaches have attempted to address the availability and security requirements separately, and there is little work that suggests a balanced and combined solution to them.

1.2: Private Information Retrieval (PIR)

The first PIR scheme was introduced by Chor et al [6] in 1995, and since then it has become the subject of a significant amount of work including [1][3][4][7][8][10]. The PIR schemes provide a partial solution to the problem of enabling a user to retrieve data from replicated database servers without exposing the user’s intention to the servers. By replicating databases on separate nodes and limiting the communication capability of replicas, the PIR scheme [6] can protect users’ privacy, provided that the number of servers in collusion does not exceed a pre-defined bound.

The existing PIR schemes have some limitations and constraints on their practical feasibility in real-world applications. First, these schemes are based on a binary-bit model of a database, where a data item is modelled as a bit 1 or 0. In practice, it is difficult to transform this data model into a realistic database system. Secondly, the PIR schemes use a simple honest-but-curious model for attacks, in which servers always deliver correct and honest answers to clients. They do not address any other types of failures (e.g. loss of answers and spurious answers).

1.3: Major Results

We address the PIR problem in the presence of malicious failures. We propose a fault-tolerant PIR (FT-PIR) scheme that utilises the replication paradigm to provide both highly available services and the protection of users’ privacy. The major contributions of this paper are as follows:

1. This paper proposes a system model for secure information retrieval, and presents a fault-tolerant PIR scheme that can cope with malicious failures.
2. This paper presents for the first time a design and implementation of the FT-PIR scheme on a distributed database system. The preliminary experimental results show that the performance overhead imposed by the FT-PIR scheme is small in comparison with that introduced by the PIR scheme.

Previous PIR research was limited to demonstrating the theoretical feasibility of the scheme, without any implementation or evaluation available, although there is some experimental work in the research area of combining fault tolerance with security. Castro’s Byzantine fault tolerance [5], Reiter’s Rampart toolkit [11] and Zhou’s COCA [16] are typical examples that address both security and fault tolerance issues of distributed systems. However, the focus has traditionally been on the servers, with little consideration of the client side. Our work may also be categorised as a combination of fault tolerance and security, but with an emphasis on the protection of users’ privacy (i.e. the client side’s privacy). Wang et al [15] proposed to use software transformation to protect software in an untrusted environment. However, the proposed approach provided only a limited capability of protecting data privacy and tolerating malicious server failures.

2: Preliminaries

2.1: The System Model

Consider a synchronous distributed system with a set of processing nodes.
The system consists of a set of K replicated servers (called replicas) $s_1, s_2, \dots, s_K$, a client application, a user that utilises the application, and an adversary. The replicas and the client run on different nodes. The replicas provide information services to the client application.

The information stored in the replicas is modelled as a character string $x = x_1 x_2 \dots x_n$, which is composed of n characters. The character $x_j$, where j ∈ [n] ≡ {1, 2, …, n}, is considered as an integer taken from a known range [X-1] ≡ {0, 1, …, X-1}. Each server stores an identical copy of $x \in [X-1]^n$. Each character of x can also be considered as an element of the finite field GF(q), where q > X-1 and q is a prime, i.e. $x_j \in GF(q)$. For simplicity, we denote the q elements of GF(q) by [q-1] ≡ {0, 1, 2, …, q-1}.

Suppose that a user wants to retrieve a character $x_i$, where i ∈ [n], from the replicas and submits the index i to the client application. For the sake of security and fault tolerance, the client uses a randomised and redundant strategy to generate query functions for the user. Based on i and some random inputs r (of length $L_{rnd}$), this strategy produces K queries of length $L_q$, namely $Q_1(i, r), Q_2(i, r), \dots, Q_K(i, r)$, one per server. Note that the random inputs r are elements taken at random from GF(q).

An adversary may corrupt up to t servers, where K ≥ 2t + 1. A corrupted server could then exhibit malicious behaviours, e.g. it may purposefully give no answers or give spurious answers. The adversary may also pool the information collected from corrupted servers and perform arbitrary computation on it, for an arbitrarily long time, to violate the user's privacy. The privacy requirement in our model is that for every $s_1, \dots, s_t \in [K]$, the joint distribution of $(Q_{s_1}(i, r), Q_{s_2}(i, r), \dots, Q_{s_t}(i, r))$ is independent of i, where r is uniformly distributed over $[q-1]^{L_{rnd}}$. From any t or fewer queries, the adversary will not be able to gain any information about i.

Upon receiving a query, a server performs an answer function. There are K answer functions $A_1, \dots, A_K$ in total, one per server. These answer functions take the information stored in the replicas and the client's queries as input and generate answers. Each replica then sends its answer back to the client. Based on any t + 1 or more correct answers, the client can reconstruct the desired result by performing a reconstruction function. The client also checks the validity of the result by executing a verification function. If the result is considered to be valid, the client returns the result to the user. Otherwise the client keeps performing the reconstruction function and the verification function until it finds a valid result. By the condition that K ≥ 2t + 1, there will eventually be a valid result available. This guarantees that the system satisfies the liveness property. The corruption or loss of no more than t answers will not prevent the client from reconstructing the valid result correctly. The verification function is used to detect the occurrence of corruption. This guarantees that the system satisfies the safety property.

2.2: Formal Definition

Definition: A (t, K) FT-PIR scheme is a one-round information retrieval scheme, where K ≥ 2t + 1. For a given character string x of length n, a given $x_i$ (i.e. the user's intention), and a set of valid results [X-1], the scheme consists of:

K query functions $Q_1, \dots, Q_K: [n] \times [q-1]^{L_{rnd}} \to [q-1]^{L_q}$
K answer functions $A_1, \dots, A_K: [X-1]^n \times [q-1]^{L_q} \to [q-1]^{L_a}$
A reconstruction function $R: [n] \times [q-1]^{L_{rnd}} \times ([q-1]^{L_a})^{t+1} \to [q-1]$
A verification function V: if a reconstructed result ∈ [X-1], then it is a valid result. Otherwise, it is invalid.

The scheme should satisfy the following properties:

a) Correctness & Availability: For every $x \in [X-1]^n$, i ∈ [n], and $r \in [q-1]^{L_{rnd}}$, there exist $s_1, \dots, s_t, s_{t+1} \in \{1, 2, \dots, K\}$ such that

$$R\big(i, r, A_{s_1}(x, Q_{s_1}(i, r)), \dots, A_{s_{t+1}}(x, Q_{s_{t+1}}(i, r))\big) = x_i.$$

b) Privacy: For every i ∈ [n], j ∈ [n], every $s_1, \dots, s_t \in \{1, 2, \dots, K\}$ and every tuple of queries $Q \in ([q-1]^{L_q})^t$,

$$\Pr\big[(Q_{s_1}(i, r), Q_{s_2}(i, r), \dots, Q_{s_t}(i, r)) = Q\big] = \Pr\big[(Q_{s_1}(j, r), Q_{s_2}(j, r), \dots, Q_{s_t}(j, r)) = Q\big]$$

where the probabilities are taken over uniformly chosen $r \in [q-1]^{L_{rnd}}$.

c) Safety: The scheme can reconstruct a valid result.

d) Liveness: The scheme eventually terminates.

Property a) states that, provided no more than t servers behave maliciously, the client has at least one set of correct candidate answers that is sufficient to perform the reconstruction function, and is able to reconstruct the intended information $x_i$ correctly. Property b) states that from any t queries, it is impossible to rule out any candidate index, since every index has the same probability of being the one the user is interested in.

3: Construction of the FT-PIR Scheme

We now discuss details of constructing the FT-PIR scheme with an associated ability to detect incorrect results. The FT-PIR scheme is based on low-degree polynomial interpolation [2] and modifies the PIR construction in [6]. The original PIR scheme considered the problem of protecting the privacy of the user with respect to coalitions. To deal with the t communicating servers in a coalition, the work in [6] used t + 1 replicas. The FT-PIR scheme uses K = 2t + 1 replicated servers: in order to tolerate t malicious servers, the scheme sends 2t + 1 queries, one to each replica. By using a verification function, the FT-PIR scheme can detect a corrupted result with high probability.

The FT-PIR scheme is based on a set of basic PIR operations defined in [6] (for further details, see [6]). We now give an overview of the basic operations of FT-PIR. Note that all the operations are taken over the finite field GF(q). By the definition of a finite field, the results of field operations are again elements of the same field [12].

3.1: Basic Operations of the FT-PIR Scheme

For every j ∈ [n], we associate with it a function that maps the set [n] to the set {0, 1}.
If j = l, then j(l) is 1; otherwise, j(l) is 0, for l = 1, …, n.

The query functions generated by a client consist of n degree-t polynomials:

$$g_l(z) = i(l) + r_{l1} z + r_{l2} z^2 + \dots + r_{lt} z^t, \qquad l = 1, \dots, n,$$

where the free term of the l-th polynomial is i(l) and the coefficients $r_{l1}, r_{l2}, \dots, r_{lt}$ are chosen at random (not necessarily distinct) from GF(q). The client application also randomly selects 2t + 1 distinct non-zero values from GF(q), denoted by $m_1, m_2, \dots, m_{2t+1}$, and evaluates the polynomials at these values. The client then sends the ordered tuple $\langle g_1(m_d), g_2(m_d), \dots, g_n(m_d) \rangle$ to the corresponding replica $S_d$ as a query. Note that randomness plays an important role in the construction and evaluation of the polynomials $g_l(z)$. Even if a user submits the same i repeatedly, the client application generates different polynomials and queries each time because of the randomness of the coefficients and evaluation points chosen.

An answer function takes x and a query as parameters and has the following form:

$$F_{i,x}(z) = \sum_{j \in [n]} g_j(z)\, x_j$$

By the above construction, we have that:

P1. $F_{i,x}(z)$ is a polynomial of degree at most t.
P2. The free term of $F_{i,x}(z)$ is $F_{i,x}(0) = x_i$.

Both properties follow by collecting powers of z: $F_{i,x}(z) = \sum_{j \in [n]} i(j)\, x_j + \sum_{k=1}^{t} \big(\sum_{j \in [n]} r_{jk}\, x_j\big) z^k$, and the first sum equals $x_i$ because i(j) is 1 only for j = i.

For server $S_d$, we then have:

$$F_{i,x}(m_d) = \sum_{j \in [n]} g_j(m_d)\, x_j$$

where d = 1, 2, …, 2t + 1 and $F_{i,x}(m_d)$ is the value sent to the client as the answer.

In the normal case (e.g. fault-free scenarios), the client obtains the correct value of the polynomial $F_{i,x}(z)$ at 2t + 1 distinct points. It is known that from any t + 1 distinct points of the polynomial $F_{i,x}(z)$, the client can use Lagrange interpolation to obtain $F_{i,x}(0)$ correctly. This guarantees the correctness of the reconstruction function.

3.2: A Verification Function

In order to reconstruct the desired result, it is important that all the answers used for the reconstruction are correct. The polynomial interpolation-based PIR and FT-PIR schemes essentially share the same spirit as Shamir’s secret sharing scheme [13].
Both schemes exploit the polynomial properties (i.e. perfect secrecy and interpolation uniqueness [9]) to provide privacy protection and fault-tolerant services. It is thus possible to apply the existing results of secret sharing schemes to both PIR and FT-PIR. In the following, we present a probabilistic verification function for error detection, which is adapted from Tompa and Woll’s cheater detection scheme [14]. The major modification required here is to use a different way to determine the order q of the finite field for the FT-PIR scheme.

As described in Section 2, the character $x_j$ (j ∈ [n]) is considered as an integer taken from {0, 1, …, X-1}. If the reconstructed result is within this range, it is considered to be valid. For every $x_j$, there are X candidate valid results. Since all the calculations are performed over GF(q), there are q possible results. We can increase the size of the field GF(q) so that the probability of an undetected error (a reconstructed result that is valid but incorrect) is less than a given value e, where e > 0. By adapting the results from [14], if we choose

$$q > \max\big((X-1)\,t/e + t + 1,\; K\big)$$

for the FT-PIR scheme, the probability of an undetected error will be less than an arbitrarily selected e (for the proof of the result, see [14]). This verification function relies neither on any unproven cryptographic premise (e.g. the intractability of factorising large integers) nor on the availability of tamper-proof hardware.

The modified FT-PIR scheme works as follows. Two types of RESULTs, i.e. valid results and invalid results, may be reconstructed by the reconstruction function. A RESULT is valid if and only if RESULT ∈ {0, 1, 2, …, X-1}. At step 1, the algorithm initialises RESULT to -1, which is a dummy value and viewed as an impossible candidate result. The system accepts only valid RESULTs and rejects invalid ones. The groupCounter variable counts the number of candidate groups checked so far.

1. RESULT = -1.
2. A user feeds an index i to the system.
3. The client uses query functions to generate K queries.
4. Wait until at least t + 1 answers are available.
5. Set groupCounter to 1.
6. If groupCounter > $\binom{K}{t+1}$, stop. (In this case, there are more than t faulty servers. The scheme will not be able to reconstruct the result.)
7. Pick a new group of t + 1 candidate answers.
8. Based on the answers in the group, execute the reconstruction function to obtain a RESULT.
9. Perform the verification function: if the RESULT is valid, go to step 10. Otherwise, increase groupCounter by one and go to step 6.
10. Output the RESULT to the user.

The following theorems show that the FT-PIR scheme constructed satisfies the properties given in Section 2, assuming K ≥ 2t + 1.

Theorem 1: The FT-PIR scheme satisfies the correctness property.

Sketch of Proof: By the Lagrange interpolation theorem [12], t + 1 distinct points are necessary and sufficient to uniquely determine a polynomial of degree at most t, and thereby its free term. Therefore, in the FT-PIR scheme, t + 1 distinct and correct answers (combined with the corresponding distinct evaluation points among $m_1, m_2, \dots, m_{2t+1}$) uniquely determine the free term of the polynomial $F_{i,x}(z)$. From P2, the free term of $F_{i,x}(z)$ is $x_i$. This proves the correctness of the FT-PIR scheme.

Theorem 2: The FT-PIR scheme satisfies the privacy property.

Sketch of Proof: From the perfect secrecy property of polynomials [9], for a polynomial of degree t with randomly chosen coefficients, the knowledge of t (or fewer) points at non-zero arguments yields no information about its free term. For the FT-PIR scheme, fewer than t + 1 points therefore give no information about the free term $F_{i,x}(0)$ of the polynomial, i.e. $x_i$. The same argument applies to each query polynomial $g_l(z)$, whose free term i(l) encodes the index i. This proves the privacy property of the FT-PIR scheme.

4: An Implementation

We have implemented three information retrieval schemes with varied degrees of privacy protection and fault tolerance: a Normal Information Retrieval scheme (NIR), the PIR scheme and the FT-PIR scheme.
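As a rough illustration of what the FT-PIR modules compute, the construction of Section 3 can be sketched in Python. This is not the Java implementation described in this section; the function names (choose_field_order, make_queries, answer, interpolate_at_zero, retrieve), the 0-based index i, and the use of None for a lost answer are assumptions made for the example only. It requires Python 3.8 or later for the modular inverse via pow.

```python
import random
from itertools import combinations


def is_prime(n):
    """Trial-division primality test; adequate for small illustrative fields."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def choose_field_order(X, t, K, e):
    """Smallest prime q with q > max((X-1)*t/e + t + 1, K) (Section 3.2)."""
    q = int(max((X - 1) * t / e + t + 1, K)) + 1
    while not is_prime(q):
        q += 1
    return q


def make_queries(i, n, t, K, q, rng):
    """Build n random degree-t polynomials whose free term is 1 only at index i
    (0-based here), and evaluate them at K distinct non-zero points of GF(q)."""
    polys = [[1 if l == i else 0] + [rng.randrange(q) for _ in range(t)]
             for l in range(n)]
    points = rng.sample(range(1, q), K)
    queries = [[sum(c * pow(m, k, q) for k, c in enumerate(g)) % q for g in polys]
               for m in points]
    return points, queries


def answer(x, query, q):
    """Server side: F(m_d) = sum_j g_j(m_d) * x_j over GF(q)."""
    return sum(g * xj for g, xj in zip(query, x)) % q


def interpolate_at_zero(pts, q):
    """Lagrange interpolation at z = 0 over GF(q); pts holds (m_d, F(m_d)) pairs."""
    total = 0
    for d, (md, yd) in enumerate(pts):
        num = den = 1
        for k, (mk, _) in enumerate(pts):
            if k != d:
                num = num * mk % q
                den = den * (mk - md) % q
        total = (total + yd * num * pow(den, -1, q)) % q
    return total


def retrieve(points, answers, t, X, q):
    """Steps 4-10: try groups of t+1 available answers until a valid result appears.
    Returns None if every group yields an invalid result."""
    available = [(m, a) for m, a in zip(points, answers) if a is not None]
    for group in combinations(available, t + 1):  # at most C(K, t+1) groups
        result = interpolate_at_zero(group, q)
        if result < X:  # verification function: valid iff result in {0, ..., X-1}
            return result
    return None


if __name__ == "__main__":
    rng = random.Random(2002)
    x = [ord(c) for c in "durham"]      # database of n = 6 characters, X = 256
    X, t, K, e = 256, 1, 3, 0.01
    q = choose_field_order(X, t, K, e)
    points, queries = make_queries(2, len(x), t, K, q, rng)
    answers = [answer(x, query, q) for query in queries]
    answers[0] = None                   # one replica fails to answer
    print(chr(retrieve(points, answers, t, X, q)))
```

Note that the verification step is probabilistic: a group containing a corrupted answer is rejected only when the interpolated value falls outside {0, …, X-1}, which is why the field order q is chosen large relative to X.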
The NIR scheme generates normal database queries and provides no support for privacy protection and fault tolerance.

Take the implementation of the FT-PIR modules as an example. There are two daemon programs in the system: the client daemon and the server daemon. The client-side daemon is a multi-threaded program. It spawns a new thread to deal with each query it sends. Each thread connects to a replicated server and waits for the corresponding answer. When the number of available answers exceeds a predefined bound, the client daemon starts the reconstruction process.

The server daemon is also a multi-threaded program; it binds to a service port and waits for clients’ queries. The server can handle multiple clients’ queries concurrently and maintains a fixed number of thread objects throughout the running of the system. This strategy limits the maximum number of concurrent connections. In this way, the performance of the daemon is expected to be stable, since the system keeps recycling a fixed number of objects.

There are two types of messages exchanged between clients and servers: handshake messages and protocol messages. At the handshake stage, the client daemon sends a message of the form <handshake, nameOfDatabase> to the server daemon. The server daemon responds with a handshake message <handshake, lengthOfDatabase, lengthOfRecord>. It is important to notice that the records stored in the databases are in the normal form. In theory, this implementation can be easily adapted to any commercial database system.

5: Experiments and Results

This section describes the experiment environment and parameter settings for the experiments performed for three different schemes, i.e. NIR, PIR, and FT-PIR. We investigate the cost of adding extra features to the NIR scheme and compare the performance of the FT-PIR scheme with the PIR scheme.

5.1: Experimental Settings

The machines used are all dual-boot with both Windows NT SP4 and RedHat Linux 7.0/7.2 installed. A 10Mbit/sec Ethernet connected them. These machines have the same specification: 400 MHz Pentium IIs running Windows NT Service Pack 4, 3Com EtherLink XL 10Mb Ethernet NIC (3C900B-COMBO), 64Mbytes of memory, and a 2 Gigabyte hard disk.

The software used is: JDK 1.3.1, Microsoft’s Access 97, and Microsoft’s ODBC Driver version 3.5.

The client machines share the above specifications with the server machines. Each server stores an identical Access database, which is about 86k and stores 10 records. The database has three fields: recordID, name, and address. We manually set up ODBC connections for the databases on the servers.

5.2: Results and Discussion

We have performed two sets of experiments for the fault-free situations. In the first set of experiments, we use one client machine to communicate with three replicas. In the subsequent experiments, we increase the number of replicas to five.

Two timing measurements have been collected. On the client side, we measure the time taken to retrieve one record from the replicated servers, i.e. the round-trip time. In the NIR and PIR tests, this time is the longest time taken for a thread to finish, because the program is multi-threaded and the client daemon waits until all the answers are available. However, in the FT-PIR scheme, the major contribution to the time for retrieving a record is the time taken to reconstruct the first correct result. On the server side, we measure the time spent on each client’s connection.

For the PIR and FT-PIR schemes, we measure the extra costs needed to add the privacy protection and/or fault tolerance features. In particular, the extra cost introduced by the FT-PIR scheme comes from the extra time to generate more queries and verify the reconstructed result. Figures 1.a and 1.b show the performance of one client communicating with three replicas using different schemes. Figures 2.a and 2.b show the performance of the same settings but with five replicas.

[Figure 1.a The client performance (3 replicas). RoundTrip Time (ms) against Runs for NIR, PIR and FT-PIR.]

[Figure 1.b The server performance (3 replicas). Processing Time (ms) against Runs for NIR, PIR and FT-PIR.]

[Figure 2.a The client performance (5 replicas). RoundTrip Time (ms) against Runs for NIR, PIR and FT-PIR.]

[Figure 2.b The server performance (5 replicas). Processing Time (ms) against Runs for NIR, PIR and FT-PIR.]

Based on the performance data in Figures 1 and 2, the average performance for the client and servers is calculated and shown in Table 1. Both the PIR scheme and the FT-PIR scheme roughly double the cost of the NIR scheme. However, in comparison with the PIR scheme, the performance overhead introduced by the FT-PIR scheme is small: on the client side it is about 1.6% with three replicas and 3.3% with five. In FT-PIR, when the number of servers increases from three to five (i.e., the number of failures that could be tolerated increases from one to two), only 1.5% performance overhead is introduced to the client-side processing time. Note that this comparison considers only fault-free situations. The performance overhead is expected to increase when the FT-PIR scheme has to cope with malicious servers.

Table 1 The average performance (in ms)

Scheme    Client(3)  Server(3)  Client(5)  Server(5)
NIR       206.9      101.8      206.6      108.7
PIR       418.6      257.7      417.9      245.6
FT-PIR    425.3      256.0      431.7      254.0

From Table 1, it also follows that the round-trip time (including time spent on both the client and the server side) varies only slightly when the number of replicas is increased from three to five.

6: Conclusions

Previous PIR research mainly demonstrates the theoretical feasibility of PIR schemes based on an unrealistic model of binary-bit strings. All the theoretical schemes assume the honest-but-curious model without addressing potential malicious server failures. Our work shows that it is possible to enhance the PIR scheme with an ability to detect and tolerate malicious failures. We also demonstrate the feasibility of using this scheme in practice by designing and implementing the FT-PIR scheme in a distributed database system. Initial results show that the performance overhead introduced by the FT-PIR scheme is modest compared with the PIR scheme. We also show that the round-trip time does not increase much even when the number of failures that can be tolerated increases.

Acknowledgements

This work was supported by the EPSRC IBHIS project and the EPSRC/DTI e-Demand project.

References

[1] A. Ambainis, “Upper bound on the communication complexity of private information retrieval”, in Proc. of ICALP ’97, 1997.
[2] D. Beaver and J. Feigenbaum, “Hiding Instances in Multioracle Queries”, in STACS, 1990.
[3] A. Beimel and Y. Ishai, “Information-Theoretic Private Information Retrieval: A Unified Construction”, TR0115, Electronic Colloquium on Computational Complexity, 2001.
[4] C. Cachin, S. Micali, and M. Stadler, “Computationally Private Information Retrieval with polylogarithmic communication”, in EUROCRYPT’99, 1999.
[5] M. Castro, “Practical Byzantine Fault Tolerance”, Technical Report MIT/LCS/TR-817, MIT Laboratory for Computer Science, Cambridge, MA, January 2001, Ph.D. thesis.
[6] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private Information Retrieval”, in Proc. of 36th FOCS, 1995, pp. 41-50.
[7] B. Chor, N. Gilboa, and M. Naor, “Private information retrieval by keywords”, Technical Report TR CS0917, Department of Computer Science, Technion, Israel, 1997.
[8] Y. Gertner, S. Goldwasser, and T. Malkin, “A random server model for private information retrieval (or how to achieve information theoretic PIR avoiding data replication)”, in Proc. of 2nd RANDOM, 1998.
[9] S. Goldwasser and M. Bellare, “Lecture Notes on Cryptography”, 1996-2001. Available at: http://www.cs.ucsd.edu/users/mihir/papers/gb.html.
[10] E. Kushilevitz and R. Ostrovsky, “Replication is not needed: Single database, computationally-private information retrieval”, in Proc. of 38th Annu. IEEE Symp. on FOCS, 1997, pp. 364-373.
[11] M. K. Reiter, “Secure agreement protocols: Reliable and Atomic Group Multicast in Rampart”, in Proc. of 2nd ACM CCS, 1994.
[12] R. Lidl and H. Niederreiter, Finite Fields, Encyclopaedia of mathematics and its applications, Addison-Wesley, Reading, 1983.
[13] A. Shamir, “How to Share a Secret”, CACM, Vol. 22, No. 11, Nov. 1979, pp. 612-613.
[14] M. Tompa and H. Woll, “How to share a secret with cheaters”, J. of Cryptology, vol. 1, 1988, pp. 133-138.
[15] C. Wang, J. Davidson, J. Hill, and J. Knight, “Protection of Software-Based Survivability Mechanisms”, in Proc. of the 2001 Dependable Systems and Networks (DSN’01), Goteborg, Sweden, July 2001.
[16] L. Zhou, F. B. Schneider, and R. van Renesse, “COCA: A Secure Distributed On-line Certification Authority”, Technical Report 2000-1828, Department of Computer Science, Cornell University, Ithaca, NY USA, Dec. 2000.